GitHub client is fragile with recent GitHub API flakiness #1728
Comments
I plan to propose a patch for this as well, but I'm going to let this bug report get reviewed before I do anything.
I have no objection to tuning the retries via system properties. Still, I'm surprised that the retries are needed - the whole retry functionality was originally added to try to deal with flakiness in the Java 8
I've been running similar Jenkins infrastructure for the past 6 years without issue, but over the past 4 months it seems like GitHub is changing something on their end to manage the outages. Jenkins reliability for processing builds has gone down in my instance, and REST usage has been a challenge as well. We actually get dropped builds pretty regularly due to the fragility of relying on the REST APIs. Through debugging I've created a diagram to highlight the problematic areas with red arrows. I've been rolling out internal patches to address it, and some I have surfaced in plugins. I am going to open more patches as I find time (off hours).
Here are some new articles; we're heavily hitting the secondary rate limit with the github autostatus plugin (which I am removing at the next maintenance) and with the github branch source plugin.
We've recently started getting shared pipeline clone failures too (Git is now getting more limited), but I was able to solve that with the caching feature in the Groovy shared libraries config. All that's to say, the need for tuning against GitHub seems to be going up. /rate_limit appears to be hit a lot (several times a second) by the GitHub branch source plugin on our instance. I haven't graphed the frequency.
@samrocketman I'm particularly surprised that
I could implement debug logging again and try to get a feel for the frequency. I'll let you know this coming week. I will try to get you some hard stats (number of jobs, frequency of incoming hooks, and a feel for request frequency). The rate limit API is always hit in two scenarios: when there is legitimate rate limit checking, and on the GitHub App credential type when it checks whether the credential is valid. The second type is the most frequent because Jenkins appears to check if the credential is valid a lot.
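For context, here's a minimal sketch (not the plugin's code) of what those two call patterns look like against this library; the GITHUB_TOKEN lookup is just a stand-in for however Jenkins resolves the credential:

import org.kohsuke.github.GHRateLimit;
import org.kohsuke.github.GitHub;
import org.kohsuke.github.GitHubBuilder;

public class RateLimitCalls {
    public static void main(String[] args) throws Exception {
        // Stand-in for the credential Jenkins would resolve.
        String token = System.getenv("GITHUB_TOKEN");
        GitHub gitHub = new GitHubBuilder().withOAuthToken(token).build();

        // Scenario 1: legitimate rate limit checking (GET /rate_limit).
        GHRateLimit rateLimit = gitHub.getRateLimit();
        System.out.println("core remaining: " + rateLimit.getCore().getRemaining());

        // Scenario 2: the credential validity probe; per this thread it also
        // resolves through the rate limit endpoint, which is why it shows up so often.
        System.out.println("credential valid: " + gitHub.isCredentialValid());
    }
}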
Here's another example of flakiness which happens quite often, even before you run out of API quota: You have exceeded a secondary rate limit (click to see exception)
Our Jenkins instance has also been particularly impacted by this new GitHub rate limiting for the past month. We already applied some workarounds like the one suggested in #1728 (comment) to cache shared libraries, but even with that, around 15-20% of our builds still fail. Did you make any progress on this or did you find a workaround? Anyway, we're ready to help with the investigation & fix if you have any clue! Thanks a lot,
@samrocketman @KeepItSimpleStupid
@KeepItSimpleStupid you don't want to use
Thanks for the explanation @samrocketman :) @bitwiseman: here are some logs extracted from our Jenkins controller. Jenkins controller logs
On the Jenkins agent side, it materializes as possible timeouts or errors for all actions involving our GitHub App credentials and the GitHub API: checkouts, publishing GitHub checks, tagging a commit, etc. Jenkins agent git checkout failures
Jenkins agent publishing github checks
Jenkins agent git tag with GitHub API
It should be possible to get more verbose logs for both this lib and the Jenkins GitHub Branch Source plugin, let me know!
@KeepItSimpleStupid I posted a diagram here #1728 (comment)
For an unknown reason, the app continuously checks its own validity using the rate limit API and gets throttled by GitHub. I am considering releasing an internal fork of GHBS to hardcode return true for this method.
@samrocketman If a consumer queries those endpoints more than once per second, the client is doing something wrong and the
Separately, if possible, whatever functionality GHBS is trying to provide with its
@bitwiseman this would definitely improve things. I think being able to tweak the retries is important too, because once you get to a certain size (tens of thousands of jobs) I've found that I inevitably need to tweak how Jenkins interacts with GitHub from an API client perspective. I've been overhauling what I can internally, but these two changes:
Would greatly improve my situation.
@bitwiseman in particular it is this line which gets called excessively; should this be updated in GHBS? https://github.com/jenkinsci/github-branch-source-plugin/blob/7bec28752f4d250713a28dfe08fff4ebb7e4ee35/src/main/java/org/jenkinsci/plugins/github_branch_source/Connector.java#L557 I don't understand why
Here's where it gets called the most: https://github.com/jenkinsci/github-branch-source-plugin/blob/90e17c48a6accf2a1a0f189131607261c84e304a/src/main/java/org/jenkinsci/plugins/github_branch_source/GitHubSCMNavigator.java#L950-L952 These are similar logs to what @KeepItSimpleStupid shared as well.
From a user perspective: How users see this is a dropped webhook. They open a PR and builds never start because a PR job is never created; the MBP couldn't find "valid" credentials and so skips creating the job for the ref. In that case, users have to manually close and re-open the PR to try again with another webhook. It is even worse when this happens with tag pipelines. A tag pipeline will not get created and the only resolution is a GitHub admin "replaying" the webhook of the tag push event. Users can't resend these events themselves, so this creates an excessive support burden on internal GH teams. Overall, the above adds to the perception that our services are poorly managed internally, which is why I'm pushing out internal patches to hard-code these areas.
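Not GHBS's actual code, but a rough sketch of the kind of internal patch described above: instead of hard-coding return true, memoize the validity probe per credential for a short TTL so event handling stops hammering /rate_limit (all names here are hypothetical):

import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.kohsuke.github.GitHub;

// Hypothetical helper: caches GitHub#isCredentialValid() per credential id so that
// multibranch event handling does not hit /rate_limit on every incoming webhook.
public final class CachedCredentialValidity {
    private static final Duration TTL = Duration.ofMinutes(5);
    private static final Map<String, Entry> CACHE = new ConcurrentHashMap<>();

    private static final class Entry {
        final boolean valid;
        final Instant checkedAt;
        Entry(boolean valid, Instant checkedAt) {
            this.valid = valid;
            this.checkedAt = checkedAt;
        }
    }

    public static boolean isCredentialValid(String credentialId, GitHub gitHub) {
        Entry cached = CACHE.get(credentialId);
        if (cached != null && cached.checkedAt.plus(TTL).isAfter(Instant.now())) {
            return cached.valid; // reuse the recent answer instead of another API round trip
        }
        boolean valid = gitHub.isCredentialValid(); // one /rate_limit round trip
        CACHE.put(credentialId, new Entry(valid, Instant.now()));
        return valid;
    }
}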
I've edited the above comments a few times; just mentioning that I'm done editing them.
I'd say yes. Basically, whatever clever things GHBS is doing should be done on the github-api side in
There's a lot about the SCM class structure design that is overly complex, which results in some odd/poor behavior. I could spend (and have previously spent) way too much time debugging and cleaning up GHBS code. Often I found the design of the SCM classes themselves prevented deeper cleanup.
I'm sorry to hear you're having to push out internal patches to stabilize your system. Hopefully the updates I'm doing in #1744 will address these, or otherwise allow you to mitigate them more easily.
No worries, @bitwiseman, I didn't mean to come off as complaining. I think Jenkins code complexity is a good reason why something like this could happen. I think my confusion stemmed from thinking it was intentional design, but your point is valid: it couldn't have been intended. It's pretty clear I'm also hitting up against the edges of performance, which means I will hit edge-case bugs most will not. So reporting (and at times trying to fix) issues like this hopefully makes Jenkins and its supporting libs better for everyone overall.
I can close this issue once I verify the release of the github-api plugin.
I just deployed the incremental build
Since the plugin is released I'll close this; I'll report back my usage here, and if there are any issues I'll re-open. Thanks @bitwiseman
@samrocketman
@bitwiseman when you say manually, are you referring to downloading the HPI? Currently, I install plugins via Maven coordinates (group/artifact/version) instead of the JUC. If the artifact is still available in https://repo.jenkins-ci.org I should be able to install it fine. WRT the test failures, does this mean it won't be published to the JUC? For the time being I'll reopen this issue until a release is GA for others.
The new version is still available in Maven; I think it is just excluded from the JUC.
This change is available in JUC (jenkinsci/bom#2694). Please wait to close until you've tried it locally and can report results.
@samrocketman @KeepItSimpleStupid
Hi @bitwiseman! Unfortunately, I don't see any improvement in our infrastructure for the moment: still similar amounts of
I'll try to find some time tomorrow to enable some debug logs to trace every request made by our Jenkins instance to the GitHub APIs and better understand what's going on, in a similar way to what @samrocketman did in #1728 (comment). Fewer than 15 developers use our Jenkins instance every day (plus some automation with Renovate/Dependabot), so we can't say our usage is extraordinary: all of this is really strange!
@KeepItSimpleStupid Speaking of which, your previous logs only showed a few cases where you were querying more than once per second, so it would make sense that you haven't seen much difference. Hm...
@KeepItSimpleStupid are you customizing the startup properties? There are a few you need to set to benefit from the PR fixes. It's the retrying with random delays that will help. Set
@samrocketman @KeepItSimpleStupid
I'm sorry, guys, I've had some emergencies to handle on my side these last few days: I'll give you more detailed feedback tonight!
I plan to roll this out within the next two weeks and report back.
So here are some fresh logs (curated), more detailed now thanks to what you added in the pull request ;) Jenkins controller logs
Then, to answer @samrocketman: I didn't customize anything (yet). I was hoping that the sanity caching would be more than enough for our usage, which seems far less intensive than yours, but it seems I was wrong. Thanks again to both of you, I'll make sure to give feedback more quickly about my next iterations :)
@KeepItSimpleStupid specifically if you want it to apply at boot then you can pass the following JVM options
Or from the script console, on the fly without restarting:
System.setProperty("org.kohsuke.github.GitHubClient.minRetryInterval", "1000")
System.setProperty("org.kohsuke.github.GitHubClient.maxRetryInterval", "3000")
System.setProperty("org.kohsuke.github.GitHubClient.retryCount", "60") The way Liam wrote that should work. |
Deployed this morning by customizing the Jenkins helm chart with the following value:
controller:
  javaOpts: >-
    -Dorg.kohsuke.github.GitHubClient.minRetryInterval=1000
    -Dorg.kohsuke.github.GitHubClient.maxRetryInterval=3000
    -Dorg.kohsuke.github.GitHubClient.retryCount=60
I will give more feedback later, but the first occurrences seem to be OK: we stopped before the 60 retries. Jenkins controller logs
That checks out based on my experience. Here's info about secondary limits: https://docs.github.com/en/rest/overview/rate-limits-for-the-rest-api?apiVersion=2022-11-28#about-secondary-rate-limits Upon reading it, Jenkins easily uses up 900 points per minute at times for REST, so retrying taking over a minute kind of makes sense. Mine would retry up to 29 times. Looks like yours stops around 22 retries.
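Back-of-the-envelope math behind that, assuming the configuration above (up to 60 retries with random 1000-3000 ms delays) and the per-minute secondary-limit window from the docs:

import java.util.concurrent.ThreadLocalRandom;

public class RetryDurationEstimate {
    public static void main(String[] args) {
        // Values from the tuning discussed in this thread.
        int retryCount = 60;
        long minDelayMs = 1000;
        long maxDelayMs = 3000;

        // Each retry sleeps a uniform random interval, so roughly 2 s on average.
        long totalMs = 0;
        for (int i = 0; i < retryCount; i++) {
            totalMs += ThreadLocalRandom.current().nextLong(minDelayMs, maxDelayMs + 1);
        }

        // The secondary limit is evaluated over a one-minute window (900 points/min for REST),
        // so ~22-29 retries (roughly 45-60 s of back-off) is usually enough to ride it out,
        // well within the 60-retry cap (~120 s of total back-off).
        System.out.printf("simulated total back-off for %d retries: %.1f s%n",
                retryCount, totalMs / 1000.0);
    }
}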
Oh! I was aware of these secondary limits, but didn't know about this "point system". I was still really surprised that our common Jenkins usage triggers such severe rate limiting on GitHub's side, so I spent some time today collecting the GitHub audit log of API requests. I discovered that a small in-house workload (it displays the version of the components deployed in our different environments and their GitHub release notes) was triggering 1000+ requests in the same second to the GitHub API, and doing that every 30 minutes. I stopped the in-house workload at the end of the morning and haven't observed a single occurrence in our Jenkins controller logs since :) This in-house workload authenticates with a technical GitHub account, not with the same GitHub App used by our Jenkins instance, but both are hosted in the same Kubernetes cluster and so use the same source IP to request the GitHub API: I suppose the in-house workload was triggering some secondary limits based on the source IP, and our Jenkins instance was just collateral damage of this bad behavior. I'll wait to observe the behavior during some weekdays with our real Jenkins usage and report here. Thanks again @bitwiseman @samrocketman, even if Jenkins was not the culprit in my case, the enhancements to this library and the discussions with you really helped me to finally get to the bottom of it!
Since I bake our infrastructure, I decided to bake the tuning properties into init.groovy.d Jenkins hook scripts. That's one of the advantages of being able to tune this on the fly, @bitwiseman; static properties cannot be modified with init.groovy.d because it runs too late in the JVM startup. It also reduces the amount of JVM tuning options I have to pass. Here's my startup script:
/**
Tune github-api client https://github.com/hub4j/github-api/issues/1728#issuecomment-1846337277
defaults: minRetryInterval=100, maxRetryInterval=100, retryCount=2
*/
System.setProperty("org.kohsuke.github.GitHubClient.minRetryInterval", "1000")
System.setProperty("org.kohsuke.github.GitHubClient.maxRetryInterval", "3000")
System.setProperty("org.kohsuke.github.GitHubClient.retryCount", "60")
/**
Tune scm-filter-jervis plugin client
https://github.com/jenkinsci/scm-filter-jervis-plugin/blob/80e28289ede66fa18553b4e0ca5f518a4bd782bc/src/main/groovy/net/gleske/scmfilter/impl/trait/JervisFilterTrait.groovy#L55-L79
defaults: minRetryInterval=1000, maxRetryInterval=3000, retryLimit=30
*/
//System.setProperty("net.gleske.scmfilter.impl.trait.JervisFilterTrait.minRetryInterval", "1000")
//System.setProperty("net.gleske.scmfilter.impl.trait.JervisFilterTrait.maxRetryInterval", "3000")
System.setProperty("net.gleske.scmfilter.impl.trait.JervisFilterTrait.retryLimit", "60")
This evening after some testing I've rolled the lib out to production.
One thing I notice is that the GitHub checks plugin https://plugins.jenkins.io/github-checks/ still inconsistently submits the
Screenshot
I've narrowed down the likely problematic area to https://github.com/jenkinsci/github-checks-plugin/blob/c3c161b5f741f921d818a76d10698a5e260d5255/src/main/java/io/jenkins/plugins/checks/github/GitHubChecksPublisher.java#L81-L82 with a source review. I've enabled debug logs for this class.
I'm having trouble reviewing exactly how GitHubClient is used, but somehow it does not appear to be using the retry logic. Hopefully I'll learn more as debug logs come in.
I will say that this last bit (GitHub checks) is merely a UI bug for my users now. They just get a little confused when it doesn't vote back, but they're not blocked. Your retry logic fixes the critical issue of jobs not getting created at all in MBP, which is huge 😁. I'll keep this open to track fixes for the GitHub checks plugin, though.
I notice lots of
What is the URL for the null rate limit errors?
Here are a few examples from today
However, I keep 1000 copies of debug logs (and use system logrotate and gzip to compress logs daily), so I have the logs from yesterday when I made that comment as well. This is so I can keep 30 days of debug logs for investigations and not run out of storage (compression is absolutely necessary). Here are the logs from yesterday around the time I noted the issue.
All of the errors from yesterday's comment were associated with the
From Source:
# using support plugin
cd /var/lib/logs/custom
find . -type f -mtime -1 -print0 | xargs -0 zgrep -F 'malformed X-RateLimit' | wc -l
7123
zgrep -F 'malformed X-RateLimit' * | wc -l
7788
It only occurs in
An update: our stability has significantly improved. There's still occasional flakiness with the GitHub checks plugin not submitting its own commit status (e.g. still in progress after the job finishes). But overall, developer interruption is resolved.
I'm going to call this resolved. The inability of GH checks to add a status is rare and probably not worth chasing. Thanks for providing tunable options.
The bug is in the following lines of code
github-api/src/main/java/org/kohsuke/github/GitHubClient.java
Lines 47 to 52 in 1cb9e66
It will only retry twice with 200ms in between retries.
Why this is a bug
The GitHub branch source Jenkins plugin will occasionally drop builds from received webhooks. The GHBS plugin relies on the GitHub plugin, which relies on the github-api plugin, which provides this library as a client. Here's an exception from multibranch pipeline events.
How many retries should it be?
In practice, after rolling out this patch in another plugin (one not using this library but which does interact with GitHub), I've seen the GitHub API retried 28 times over the course of 1 minute. The retry limit for this was 30, sleeping randomly between 1000 and 3000 ms. I've since increased the retry cap to 60 in my production system for this plugin.
jenkinsci/scm-filter-jervis-plugin@c21d0c1#diff-46da075f8e1e17ccf594a86bc96e8c8eaf295617b81d5ae5206634e37021bb49R119-R122
Ideal solution
You can keep the same retry defaults, but I would like this to be tunable by the user via system properties on the fly. That means do not use static properties except as default values.
And for sleeping between retries I would like it to be random instead of a fixed value.
These should have sane defaults but allow clients to tune them on the fly. A random delay between retries is a cloud best practice for interacting with distributed systems. See also https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/
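A minimal sketch of the requested behavior (not the library's actual implementation): re-read the tuning knobs from system properties on every call rather than caching them in static fields, and sleep a random (jittered) interval between attempts. The property names and defaults match the ones mentioned elsewhere in this thread.

import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

public final class RetryWithJitter {
    private static final String PREFIX = "org.kohsuke.github.GitHubClient.";

    public static <T> T withRetries(Callable<T> call) throws Exception {
        // Re-read tuning on every invocation so values set at runtime
        // (e.g. System.setProperty from the Jenkins script console) take effect immediately.
        int retries = Math.max(0, Integer.getInteger(PREFIX + "retryCount", 2));
        long minMs = Long.getLong(PREFIX + "minRetryInterval", 100L);
        long maxMs = Long.getLong(PREFIX + "maxRetryInterval", 100L);

        Exception last = null;
        for (int attempt = 0; attempt <= retries; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e;
                if (attempt == retries) {
                    break;
                }
                // Random delay (jitter) between retries, per the AWS guidance linked above.
                long sleep = maxMs > minMs
                        ? ThreadLocalRandom.current().nextLong(minMs, maxMs + 1)
                        : minMs;
                Thread.sleep(sleep);
            }
        }
        throw last;
    }
}

With this shape, the System.setProperty calls shown later in the thread take effect on the next API call without restarting the JVM.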