
feat(142): separate out metrics and restructure go routines and functions #137

Merged
merged 14 commits into main from refactor-go-routines on Oct 18, 2023

Conversation

rhoofard
Contributor

@rhoofard rhoofard commented Oct 3, 2023

No description provided.

@github-actions github-actions bot added the go label Oct 3, 2023
@rhoofard rhoofard marked this pull request as ready for review October 3, 2023 18:01
@rhoofard rhoofard requested a review from a team as a code owner October 3, 2023 18:01
@codecov

codecov bot commented Oct 3, 2023

Codecov Report

Attention: 25 lines in your changes are missing coverage. Please review.

Comparison is base (064251d) 27.08% compared to head (f21909d) 39.34%.

Additional details and impacted files
@@             Coverage Diff             @@
##             main     #137       +/-   ##
===========================================
+ Coverage   27.08%   39.34%   +12.25%     
===========================================
  Files           5        6        +1     
  Lines         443      488       +45     
===========================================
+ Hits          120      192       +72     
+ Misses        316      285       -31     
- Partials        7       11        +4     
Files Coverage Δ
...ver/gitproviderreceiver/internal/common/helpers.go 100.00% <100.00%> (ø)
.../internal/scraper/githubscraper/graphql_helpers.go 59.72% <87.50%> (+9.72%) ⬆️
...r/internal/scraper/githubscraper/github_scraper.go 17.52% <63.15%> (+14.99%) ⬆️


rhoofard and others added 6 commits October 3, 2023 11:12
Signed-off-by: Ryan Hoofard <42755382+rhoofard@users.noreply.github.com>
Signed-off-by: Ryan Hoofard <42755382+rhoofard@users.noreply.github.com>
Also cleaned up some of the redundant interface conversions for SearchNode, as well as what looks like a bug when collecting all the repo data: "data" was being set on each iteration through the pages, but "searchData" was being used.
ghs *githubScraper,
ctx context.Context,
client graphql.Client,
repos []SearchNodeRepository,
now pcommon.Timestamp,
pullRequestCh chan []PullRequestNode,
Collaborator

Does this pull request channel actually need to be passed in? Or can it be declared locally within the function, thereby making the call within the function the async call?

Collaborator

I'm thinking these can be changed iteratively
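
For illustration, a minimal sketch of the suggested shape, assuming a value-returning getPullRequests (this signature is hypothetical, not the PR's code):

```go
// Hypothetical caller-side wiring: the channel is declared locally and
// the call site becomes the async call, so the function itself stays
// synchronous and channel-free.
pullRequestCh := make(chan []PullRequestNode, 1)
go func() {
	defer close(pullRequestCh)
	// Assumed to return a plain slice instead of writing to a channel.
	prs := ghs.getPullRequests(ctx, client, repos, now)
	pullRequestCh <- prs
}()
```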

ghs *githubScraper,
ctx context.Context,
client graphql.Client,
repos []SearchNodeRepository,
now pcommon.Timestamp,
pullRequestCh chan []PullRequestNode,
waitGroup *sync.WaitGroup,
Collaborator

Same question with waitGroup

pullRequests = append(pullRequests, pr.Repository.PullRequests.Nodes...)

prCursor = &pr.Repository.PullRequests.PageInfo.EndCursor
ghs.logger.Sugar().Errorf("error getting pr data", zap.Error(err))
}
pullRequestCh <- pullRequests
Collaborator

getPullRequests by name should return pull requests as a value, not a channel.

Contributor Author

Wouldn't this defeat the point of the goroutines? Or do you mean keep the function basic, but then just add that to the channel in the goroutine outside the function?

Collaborator

I think we talked about this in the tech sync today
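
For reference, a rough sketch of the value-returning shape discussed in this thread (signature and body are assumptions, not what was merged):

```go
// Hypothetical: return the pull requests directly; the caller decides
// whether to run this in a goroutine and where the results go.
func (ghs *githubScraper) getPullRequests(
	ctx context.Context,
	client graphql.Client,
	repos []SearchNodeRepository,
) ([]PullRequestNode, error) {
	var pullRequests []PullRequestNode
	for _, repo := range repos {
		_ = repo // ... paginate the GraphQL query and append to pullRequests ...
	}
	return pullRequests, nil
}
```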

@@ -194,21 +224,16 @@ func (ghs *githubScraper) processCommits(
func (ghs *githubScraper) getBranches(
ctx context.Context,
client graphql.Client,
repos []SearchNode,
repos []SearchNodeRepository,
now pcommon.Timestamp,
branchCh chan []BranchNode,
Collaborator

Same comment here about the channels and waitGroup being passed in instead of running an async select within the function and returning the actual branch data.

Contributor Author

@rhoofard rhoofard Oct 13, 2023

Not sure what you mean by an async select here, but I get moving the channel and wait group out.
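
One reading of "async select", as a hedged sketch (the helper and wiring are assumptions):

```go
// Hypothetical: the goroutine lives inside the function, and the select
// makes the channel send cancellable via the context.
branchCh := make(chan []BranchNode, 1)
go func() {
	branches := fetchBranches() // hypothetical helper
	select {
	case branchCh <- branches:
	case <-ctx.Done(): // abandon the send if the scrape is cancelled
	}
}()
```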

if ghs.cfg.MetricsBuilderConfig.Metrics.GitRepositoryContributorCount.Enabled {
wg1.Add(1)
go ghs.getContributorCount(ctx, genClient, work[i], now, &wg1)
}
}

for i := 0; i < opBuf; i++ {
go ghs.processPullRequests(ctx, genClient, now, pullRequestCh)
go processPullRequests(ghs, ctx, genClient, now, pullRequestCh)
Collaborator

@adrielp adrielp Oct 13, 2023

These calls feel off to me and inconsistent with each other. I don't think this is actually buffering the channel. https://go.dev/doc/effective_go#channels

Contributor Author

Just need to pass the scraper in and they'd look the same. Not sure how the channel wouldn't be buffered, though, since we're creating it with a size > 0.
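
For reference, the distinction in play (standard Go semantics, not code from this PR):

```go
unbuffered := make(chan []PullRequestNode)  // capacity 0: every send blocks until a receive
buffered := make(chan []PullRequestNode, 3) // capacity 3: sends block only when the buffer is full
```

So a channel made with size > 0 is buffered; the open question above is whether the surrounding calls use that buffer consistently.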

Collaborator

Nice code coverage improvements here 👍🏼

data, err := getRepoDataBySearch(ctx, client, searchQuery, repoCursor)
if err != nil {
return nil, err
}
return data, nil
}

type mockClient struct {
Collaborator

Why are we providing mockClient to the MakeRequest function? This is for testing only, correct? It should go in helpers_test.go.

Contributor Author

This is only for testing, yeah. I thought helpers_test.go was just the same thing but on the GitLab side, no?

@@ -194,21 +224,16 @@ func (ghs *githubScraper) processCommits(
func (ghs *githubScraper) getBranches(


Is there a chance that this function never completes if there are no branches or page numbers?
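
A hedged sketch of a loop guard that terminates even with zero branches or pages (the response field names mirror the pull-request query quoted below and are an assumption for branches):

```go
// Hypothetical: keep paginating only while the API reports more pages,
// so an empty result set falls straight through the loop.
var branches []BranchNode
hasNextPage := true
for hasNextPage {
	resp, err := getBranchData(ctx, client, repoName, branchCursor) // hypothetical helper
	if err != nil {
		break
	}
	branches = append(branches, resp.Repository.Refs.Nodes...)
	hasNextPage = resp.Repository.Refs.PageInfo.HasNextPage
}
```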

pullRequests = append(pullRequests, pr.Repository.PullRequests.Nodes...)

prCursor = &pr.Repository.PullRequests.PageInfo.EndCursor
ghs.logger.Sugar().Errorf("error getting pr data", zap.Error(err))
}
pullRequestCh <- pullRequests


I think there's a chance that this function blocks infinitely if there are no PRs.

Contributor Author

Just made a test for that, and while it doesn't block, it adds an empty object to the channel. That shouldn't affect anything, since where the objects are processed off the channel there's a loop based on the size of the object, which would be 0. We should still probably avoid placing empty objects in there, though.
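
A one-line guard matching that conclusion, sketched against the snippet above:

```go
// Only publish non-empty batches; consumers then never see a zero-length slice.
if len(pullRequests) > 0 {
	pullRequestCh <- pullRequests
}
```

If a consumer counts on receiving exactly one value per producer, skipping the send would need a matching change on the receive side.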

@@ -368,10 +389,14 @@ func (ghs *githubScraper) scrape(ctx context.Context) (pmetric.Metrics, error) {

for i := 0; i < pages; i++ {
results := searchData.GetSearch()
searchRepos = append(searchRepos, results.Nodes...)
for _, repo := range results.Nodes {


What's the reasoning behind this iterator over results.Nodes?

Contributor Author

It's a quick way to append everything in a slice without needing a for loop.


repoCursor = &searchData.Search.PageInfo.EndCursor
data, err = getRepoData(ctx, genClient, sq, ownertype, repoCursor)
searchData, err = getRepoData(ctx, genClient, sq, ownertype, repoCursor)


Are we overwriting the existing searchData value that gets used on line 391? Is this intentional?

Contributor Author

Yeah, this section is weird; I just hadn't bothered changing it, since that's how it was when I started working on it. The first page of this data is fetched outside the for loop, and we use that to get the number of pages we need to iterate through. But that also means we still have just the first page of data, so it gets processed first, and then we grab the next page at the end of the loop to be processed on the next iteration. In the other functions like this, we have a GraphQL query to get the count first to avoid this, so that might be an option here as well; I just haven't checked it out yet.
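
A sketch of the count-first alternative mentioned at the end (getRepoCount and perPage are hypothetical; only getRepoData appears in this diff):

```go
// Hypothetical: fetch the total up front so every page, including the
// first, is fetched and processed inside the loop.
count, err := getRepoCount(ctx, genClient, sq, ownertype)
if err != nil {
	return pmetric.NewMetrics(), err
}
pages := (count + perPage - 1) / perPage // ceiling division
for i := 0; i < pages; i++ {
	searchData, err = getRepoData(ctx, genClient, sq, ownertype, repoCursor)
	if err != nil {
		break
	}
	// ... process searchData ...
	repoCursor = &searchData.Search.PageInfo.EndCursor
}
```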

@@ -381,13 +406,16 @@ func (ghs *githubScraper) scrape(ctx context.Context) (pmetric.Metrics, error) {

}

if _, ok := data.(*getRepoDataBySearchResponse); ok {
if searchRepos != nil {


Is len() a more accurate check here?

Contributor Author

A slice of size zero is nil, so they're equivalent.
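
The nuance behind that, for the record (standard Go behavior): a slice grown only by append stays nil when nothing was ever appended, so the two checks agree in this code; in general, though, an allocated empty slice is non-nil.

```go
var a []SearchNodeRepository         // a == nil, len(a) == 0 — the case here
b := make([]SearchNodeRepository, 0) // b != nil, len(b) == 0
```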

@@ -381,13 +406,16 @@ func (ghs *githubScraper) scrape(ctx context.Context) (pmetric.Metrics, error) {

}

if _, ok := data.(*getRepoDataBySearchResponse); ok {
if searchRepos != nil {

var wg1 sync.WaitGroup
var opBuf int = 3


What is this opBuf variable intended to do?

Contributor Author

It's the size of the buffered channel, which here is also meant to be the limit on the number of goroutines pushing data to or pulling data from it. It's not perfect, but if you have equal numbers of producers and consumers, then in theory they should usually each have their own slot in the channel to push to and pull from, keeping throughput as high as possible.
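
Sketched concretely (the worker functions are stand-ins, not this PR's code):

```go
opBuf := 3
pullRequestCh := make(chan []PullRequestNode, opBuf) // capacity matches the worker count

for i := 0; i < opBuf; i++ {
	go produce(pullRequestCh) // hypothetical: each worker sends one batch
	go consume(pullRequestCh) // hypothetical: each worker receives one batch
}
```

With equal producer and consumer counts and a buffer of the same size, a producer rarely waits on a consumer before its send completes.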

@@ -381,13 +406,16 @@ func (ghs *githubScraper) scrape(ctx context.Context) (pmetric.Metrics, error) {

}

if _, ok := data.(*getRepoDataBySearchResponse); ok {
if searchRepos != nil {

var wg1 sync.WaitGroup
var opBuf int = 3

chunkSize := (len(searchRepos) + opBuf - 1) / opBuf


I'm a little confused about the algorithm here. Can we add a comment explaining the intended behavior?

Contributor Author

This part is just checking whether the previous section, which gets us all of the repos we want to process data from, actually returned anything. If it did, we continue; otherwise we don't bother with the goroutine work.
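
As for the chunkSize line itself, it is ceiling division; a commented sketch of the chunking it enables (the loop shape is an assumption based on the work[i] usage above):

```go
// chunkSize = ceil(len(searchRepos) / opBuf): split the repos into opBuf
// roughly equal chunks so every repo lands in exactly one worker's slice.
chunkSize := (len(searchRepos) + opBuf - 1) / opBuf
var work [][]SearchNodeRepository
for i := 0; i < len(searchRepos); i += chunkSize {
	end := i + chunkSize
	if end > len(searchRepos) {
		end = len(searchRepos) // the last chunk may be shorter
	}
	work = append(work, searchRepos[i:end])
}
```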

@@ -402,15 +430,15 @@ func (ghs *githubScraper) scrape(ctx context.Context) (pmetric.Metrics, error) {

wg1.Add(2)


This is probably a stupid question, but why are we using Add(2) instead of Add(1)?

Contributor Author

Each additional Add count corresponds to a Done coming from one of the goroutines. There are two that are guaranteed to happen on each iteration there, which is why we do Add(2). If you were to add any less, it panics on a negative WaitGroup counter once Done is called one too many times.
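
The invariant spelled out, as a sketch with the goroutine bodies elided:

```go
var wg1 sync.WaitGroup
wg1.Add(2) // one increment per Done that is guaranteed to run below

go func() {
	defer wg1.Done()
	// ... process pull requests ...
}()
go func() {
	defer wg1.Done()
	// ... process branches ...
}()

wg1.Wait() // returns once both Done calls have landed
```

Add(1) here would let Wait return early, and the second Done would then drop the counter below zero and panic.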

@rhoofard rhoofard changed the title refactor: updated gitlab go routines refactor(142): updated gitlab go routines Oct 16, 2023
@4lch4 4lch4 linked an issue Oct 16, 2023 that may be closed by this pull request
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Oct 17, 2023
Collaborator

@adrielp adrielp left a comment

I think this is decent enough to merge & then iterate on the improvements. Let’s get issues created from the comments that need to be addressed through future iterations and then merge this. Sound good?

@adrielp
Collaborator

adrielp commented Oct 17, 2023

This is more than a refactor, so when squashing the title needs to be updated to reflect that it’s a feature change with new metrics. @rhoofard I went ahead and updated the title.

@adrielp adrielp changed the title refactor(142): updated gitlab go routines feat(142): separate out metrics and restructure go routines and functions Oct 17, 2023
@rhoofard rhoofard merged commit a68d9cf into main Oct 18, 2023
10 checks passed
@rhoofard rhoofard deleted the refactor-go-routines branch October 18, 2023 17:37
Labels: documentation (Improvements or additions to documentation), go

Successfully merging this pull request may close these issues: Add tests to github_scraper.go

3 participants