feat(142): separate out metrics and restructure go routines and functions #137
Conversation
Codecov Report

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #137       +/-   ##
===========================================
+ Coverage   27.08%   39.34%   +12.25%
===========================================
  Files           5        6        +1
  Lines         443      488       +45
===========================================
+ Hits          120      192       +72
+ Misses        316      285       -31
- Partials        7       11        +4
```

☔ View full report in Codecov by Sentry.
Signed-off-by: Ryan Hoofard <42755382+rhoofard@users.noreply.github.com>
Also cleaned up some of the redundant interface conversions for SearchNode, as well as what looks like a bug when collecting all the repo data: `data` was being set on each iteration through the pages of repo data, but `searchData` was being used.
```go
	ghs *githubScraper,
	ctx context.Context,
	client graphql.Client,
	repos []SearchNodeRepository,
	now pcommon.Timestamp,
	pullRequestCh chan []PullRequestNode,
```
Does this pull request channel actually need to be passed in? Or can it be declared locally within the function, thereby making the call within the function the async call?
I'm thinking these can be changed iteratively
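For illustration, a minimal sketch of the pattern being suggested here (the signature names and types come from the diff above; the body is hypothetical): the channel is created inside the function, the page-fetching work becomes the async call, and callers never see the channel.

```go
// Hypothetical sketch: the channel is declared locally, so the function
// itself owns the goroutine instead of the caller wiring up a channel.
func (ghs *githubScraper) getPullRequests(
	ctx context.Context,
	client graphql.Client,
	repos []SearchNodeRepository,
	now pcommon.Timestamp,
) <-chan []PullRequestNode {
	pullRequestCh := make(chan []PullRequestNode)
	go func() {
		defer close(pullRequestCh) // closing lets consumers range over the channel
		for _, repo := range repos {
			// ... fetch each page for repo and send the batch on pullRequestCh ...
			_ = repo
		}
	}()
	return pullRequestCh
}
```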
```go
	ghs *githubScraper,
	ctx context.Context,
	client graphql.Client,
	repos []SearchNodeRepository,
	now pcommon.Timestamp,
	pullRequestCh chan []PullRequestNode,
	waitGroup *sync.WaitGroup,
```
Same question with `waitGroup`.
```go
	pullRequests = append(pullRequests, pr.Repository.PullRequests.Nodes...)
	// ...
	prCursor = &pr.Repository.PullRequests.PageInfo.EndCursor
	ghs.logger.Sugar().Errorf("error getting pr data", zap.Error(err))
}
pullRequestCh <- pullRequests
```
`getPullRequests` by name should return pull requests as a value, not a channel.
Wouldn't this defeat the point of the goroutines? Or do you mean keep the function basic, but then add the result to the channel in the goroutine outside the function?
I think we talked about this in the tech sync today
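As a sketch of the "return a value" shape being discussed (only the names come from the diff; the body and call site are hypothetical): the function stays synchronous, and the goroutine plus the channel send move out to the caller.

```go
// Hypothetical sketch: getPullRequests returns its results directly.
func (ghs *githubScraper) getPullRequests(
	ctx context.Context,
	client graphql.Client,
	repo SearchNodeRepository,
	now pcommon.Timestamp,
) ([]PullRequestNode, error) {
	var pullRequests []PullRequestNode
	// ... paginate through the repo's pull requests, appending to pullRequests ...
	return pullRequests, nil
}

// Call site: the goroutine wraps the basic function and feeds the channel,
// matching the "add to the channel outside the function" idea in the reply.
go func() {
	prs, err := ghs.getPullRequests(ctx, client, repo, now)
	if err != nil {
		ghs.logger.Sugar().Errorf("error getting pr data: %v", err)
		return
	}
	pullRequestCh <- prs
}()
```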
```diff
@@ -194,21 +224,16 @@ func (ghs *githubScraper) processCommits(
 func (ghs *githubScraper) getBranches(
 	ctx context.Context,
 	client graphql.Client,
-	repos []SearchNode,
+	repos []SearchNodeRepository,
 	now pcommon.Timestamp,
 	branchCh chan []BranchNode,
```
Same comment here about the channels and waitGroup being passed in instead of running an async select within the function and returning the actual branch data.
Not sure what you mean by an async select here, but I get moving the channel and wait group out.
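One plausible reading of "async select" (a sketch only, not the reviewer's confirmed intent): the goroutine consuming the channel drains it inside a `select`, so it can also react to context cancellation.

```go
// Hypothetical sketch of a select-based consumer for the branch data.
go func() {
	for {
		select {
		case branches, ok := <-branchCh:
			if !ok {
				return // channel closed: all pages have been consumed
			}
			_ = branches // ... record branch metrics here ...
		case <-ctx.Done():
			return // scrape cancelled or timed out
		}
	}
}()
```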
```diff
 	if ghs.cfg.MetricsBuilderConfig.Metrics.GitRepositoryContributorCount.Enabled {
 		wg1.Add(1)
 		go ghs.getContributorCount(ctx, genClient, work[i], now, &wg1)
 		}
 	}

 	for i := 0; i < opBuf; i++ {
-		go ghs.processPullRequests(ctx, genClient, now, pullRequestCh)
+		go processPullRequests(ghs, ctx, genClient, now, pullRequestCh)
```
These calls feel off to me and inconsistent with each other. I don't think this is actually buffering the channel. https://go.dev/doc/effective_go#channels
Just need to pass the scraper in and they'd look the same. Not sure how the channel wouldn't be buffered, though, since we are setting it to a size > 0.
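For reference, what distinguishes a buffered channel in Go (this matches the Effective Go section linked above): buffering comes solely from the capacity argument to `make`, so whether the code buffers depends on how `pullRequestCh` was created, not on how many goroutines use it.

```go
unbuffered := make(chan []PullRequestNode)      // capacity 0: every send blocks until a receiver is ready
buffered := make(chan []PullRequestNode, opBuf) // capacity opBuf: sends block only once opBuf items are queued
```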
Nice code coverage improvements here 👍🏼
```go
	data, err := getRepoDataBySearch(ctx, client, searchQuery, repoCursor)
	if err != nil {
		return nil, err
	}
	return data, nil
}

type mockClient struct {
```
Why are we providing `mockClient` to the `MakeRequest` function? This is for testing only, correct? It should go in `helpers_test.go`.
This is only for testing, yeah. I thought `helpers_test.go` was just the same thing but on the GitLab side, no?
```diff
@@ -194,21 +224,16 @@ func (ghs *githubScraper) processCommits(
 func (ghs *githubScraper) getBranches(
```
Is there a chance that this function never completes if there are no branches or page numbers?
```go
	pullRequests = append(pullRequests, pr.Repository.PullRequests.Nodes...)
	// ...
	prCursor = &pr.Repository.PullRequests.PageInfo.EndCursor
	ghs.logger.Sugar().Errorf("error getting pr data", zap.Error(err))
}
pullRequestCh <- pullRequests
```
I think there's a chance that this function blocks infinitely if there are no PRs.
Just made a test for that, and while it doesn't block, it adds an empty object to the channel. That shouldn't affect anything, since where items are processed from the channel there's a loop based on the size of the object, which would be 0. We should still probably avoid placing empty objects in there, though.
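A minimal sketch of the guard described in that reply, assuming the surrounding code from the diff above:

```go
// Only send non-empty batches; consumers then never see zero-length slices.
if len(pullRequests) > 0 {
	pullRequestCh <- pullRequests
}
```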
```diff
@@ -368,10 +389,14 @@ func (ghs *githubScraper) scrape(ctx context.Context) (pmetric.Metrics, error) {

 	for i := 0; i < pages; i++ {
 		results := searchData.GetSearch()
 		searchRepos = append(searchRepos, results.Nodes...)
 		for _, repo := range results.Nodes {
```
What's the reasoning behind this iterator over `results.Nodes`?
It's a quick way to append everything in a slice without needing a for-loop.
```diff
 		repoCursor = &searchData.Search.PageInfo.EndCursor
-		data, err = getRepoData(ctx, genClient, sq, ownertype, repoCursor)
+		searchData, err = getRepoData(ctx, genClient, sq, ownertype, repoCursor)
```
Are we overwriting the existing `searchData` value that gets used on line 391? Is this intentional?
Yeah, this section is weird; I just hadn't bothered seeing how to change it, since that's how it was when I started working on it. The first page of data is grabbed outside the for loop and used to compute the number of pages we need to iterate through, but it is also still the first page of data, so it gets processed first; then, at the end of each loop, we grab the next page to be processed on the next iteration. In the other functions like this, we have a GraphQL query that gets the count first to avoid this, so that might be an option here as well; I just haven't checked it out yet.
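A hedged sketch of the count-first option mentioned at the end of that reply (`getRepoCount` and `perPage` are hypothetical; `getRepoData` and the cursor handling come from the diff): fetching the total up front lets the loop own every page fetch, so no data is primed outside it.

```go
// Hypothetical: query the repository count up front, then page inside the loop.
count, err := getRepoCount(ctx, genClient, sq, ownertype) // hypothetical helper
if err != nil {
	return pmetric.NewMetrics(), err
}
pages := (count + perPage - 1) / perPage // perPage assumed from the search query
var repoCursor *string
for i := 0; i < pages; i++ {
	searchData, err := getRepoData(ctx, genClient, sq, ownertype, repoCursor)
	if err != nil {
		ghs.logger.Sugar().Errorf("error getting repo data: %v", err)
		break
	}
	results := searchData.GetSearch()
	searchRepos = append(searchRepos, results.Nodes...)
	repoCursor = &searchData.Search.PageInfo.EndCursor
}
```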
```diff
@@ -381,13 +406,16 @@ func (ghs *githubScraper) scrape(ctx context.Context) (pmetric.Metrics, error) {

 	}

-	if _, ok := data.(*getRepoDataBySearchResponse); ok {
+	if searchRepos != nil {
```
Is `len()` a more accurate check here?
A slice of size zero is nil, so they're equivalent.
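One caveat worth noting: the equivalence holds for `searchRepos` only because it starts nil and grows solely via `append`; it is not true of Go slices in general, which is why `len()` is usually the safer emptiness check.

```go
var a []SearchNodeRepository            // nil slice: a == nil, len(a) == 0
b := []SearchNodeRepository{}           // empty but non-nil: b != nil, len(b) == 0
c := make([]SearchNodeRepository, 0, 8) // also non-nil with length 0
```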
```diff
@@ -381,13 +406,16 @@ func (ghs *githubScraper) scrape(ctx context.Context) (pmetric.Metrics, error) {

 	}

-	if _, ok := data.(*getRepoDataBySearchResponse); ok {
+	if searchRepos != nil {

 		var wg1 sync.WaitGroup
 		var opBuf int = 3
```
What is this `opBuf` variable intended to do?
It's the size of the buffered channel, which here is also meant to be the limit on the number of goroutines pushing or pulling data from it. It's not perfect, but if you have equal numbers of producers and consumers, then in theory each should usually have its own slot in the channel to push to and pull from, keeping throughput as high as possible.
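A sketch of the producer/consumer balance being described (the call shapes come from the diffs in this PR; pairing them one-to-one like this is the illustrative part): `opBuf` sizes the channel and bounds both sides.

```go
opBuf := 3
pullRequestCh := make(chan []PullRequestNode, opBuf)
for i := 0; i < opBuf; i++ {
	go getPullRequests(ghs, ctx, genClient, work[i], now, pullRequestCh, &wg1) // producer
	go processPullRequests(ghs, ctx, genClient, now, pullRequestCh)            // consumer
}
```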
```diff
@@ -381,13 +406,16 @@ func (ghs *githubScraper) scrape(ctx context.Context) (pmetric.Metrics, error) {

 	}

-	if _, ok := data.(*getRepoDataBySearchResponse); ok {
+	if searchRepos != nil {

 		var wg1 sync.WaitGroup
 		var opBuf int = 3

 		chunkSize := (len(searchRepos) + opBuf - 1) / opBuf
```
I'm a little confused about the algorithm here. Can we add a comment explaining the intended behavior?
This part just checks whether the previous section, which gets all of the repos we want to process data from, actually returned anything; if so, we continue, but otherwise we don't bother with the goroutine stuff.
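On the chunking line itself: it is a ceiling division, and a comment along these lines (as the reviewer requested) would capture the intent.

```go
// Split searchRepos into opBuf roughly equal chunks, rounding up so the
// last goroutine picks up the remainder. E.g. 10 repos with opBuf = 3
// gives chunkSize = (10+3-1)/3 = 4, i.e. chunks of 4, 4, and 2.
chunkSize := (len(searchRepos) + opBuf - 1) / opBuf
```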
```diff
@@ -402,15 +430,15 @@ func (ghs *githubScraper) scrape(ctx context.Context) (pmetric.Metrics, error) {

 		wg1.Add(2)
```
This is probably a stupid question, but why are we using `Add(2)` instead of `Add(1)`?
Each increment of the `Add` counter corresponds to a `Done` coming from one of the goroutines. There are two that are guaranteed to happen on each iteration there, which is why we do `Add(2)`. If you were to add any less, it panics on a negative WaitGroup counter once `Done` is called one too many times.
I think this is decent enough to merge and then iterate on the improvements. Let's get issues created from the comments that need to be addressed through future iterations and then merge this. Sound good?