Add sanity caching and retry controls #1744

bitwiseman · 2023-11-19T02:59:03Z

Description

Add a limit of one query per second on getRateLimit() and getMeta()
Add configurability and randomization to retries

Mitigate the most severe parts of #1728

@samrocketman @KeepItSimpleStupid
Please take a look.

Before submitting a PR:

Changes must not break binary backwards compatibility. If you are unclear on how to make the change you think is needed while maintaining backward compatibility, see CONTRIBUTING.md for details.
Add JavaDocs and other comments explaining the behavior.
When adding or updating methods that fetch entities, add @link JavaDoc entries to the relevant documentation on https://docs.github.com/en/rest .
Add tests that cover any added or changed code. This generally requires capturing snapshot test data. See CONTRIBUTING.md for details.
Run mvn -D enable-ci clean install site locally. If this command doesn't succeed, your change will not pass CI.
Push your changes to a branch other than main. You will create your PR from that branch.

When creating a PR:

Fill in the "Description" above with a clear summary of the changes. This includes:
- If this PR fixes one or more issues, include "Fixes #" lines for each issue.
- Provide links to relevant documentation on https://docs.github.com/en/rest where possible. If not including links, explain why not.
All lines of new code should be covered by tests as reported by code coverage. Any lines that are not covered must have PR comments explaining why they cannot be covered. For example, "Reaching this particular exception is hard and is not a particularly common scenario."
Enable "Allow edits from maintainers".

codecov · 2023-11-19T03:05:13Z

Codecov Report

Attention: 3 lines in your changes are missing coverage. Please review.

Comparison is base (2062439) 80.16% compared to head (7a73735) 80.41%.

Files	Patch %	Lines
src/main/java/org/kohsuke/github/GitHubClient.java	97.36%	1 Missing and 1 partial ⚠️
...va/org/kohsuke/github/GitHubSanityCachedValue.java	90.90%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #1744      +/-   ##
============================================
+ Coverage     80.16%   80.41%   +0.24%     
- Complexity     2302     2309       +7     
============================================
  Files           217      218       +1     
  Lines          6964     7015      +51     
  Branches        371      371              
============================================
+ Hits           5583     5641      +58     
+ Misses         1150     1141       -9     
- Partials        231      233       +2

src/main/java/org/kohsuke/github/GitHubClient.java

KeepItSimpleStupid · 2023-11-19T12:12:32Z

src/main/java/org/kohsuke/github/GitHubSanityCachedValue.java

+     */
+    <E extends Throwable> T get(SupplierThrows<T, E> query) throws E {
+        synchronized (lock) {
+            if (Instant.now().getEpochSecond() > lastQueriedAtEpochSeconds) {


Maybe the delay of "1 second" should be configurable as well ?
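
A minimal sketch of the one-second sanity cache discussed here, assuming a simplified shape. The class name `SanityCacheSketch`, the injected clock, and the use of `java.util.function.Supplier` are illustrative assumptions, not the PR's `GitHubSanityCachedValue` (which uses a `SupplierThrows` type). The cached value is refreshed at most once per clock second; calls within the same second reuse the last result.

```java
import java.time.Instant;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical sketch, not the class added in this PR.
class SanityCacheSketch<T> {
    private final Object lock = new Object();
    private final LongSupplier epochSeconds; // injected so the clock can be faked in tests
    private long lastQueriedAtEpochSeconds = 0;
    private T lastResult = null;

    SanityCacheSketch() {
        this(() -> Instant.now().getEpochSecond());
    }

    SanityCacheSketch(LongSupplier epochSeconds) {
        this.epochSeconds = epochSeconds;
    }

    T get(Supplier<T> query) {
        synchronized (lock) {
            long now = epochSeconds.getAsLong();
            // Only run the real query if the clock has moved to a later second.
            if (now > lastQueriedAtEpochSeconds) {
                lastResult = query.get();
                lastQueriedAtEpochSeconds = now;
            }
            return lastResult;
        }
    }
}
```

Making the window configurable would mean replacing the `now > lastQueriedAtEpochSeconds` comparison with a check against a tunable duration, which is the extra complexity in question.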

Co-authored-by: Fabien Thouny <[email protected]>

samrocketman

Can we not use properties in static fields? They require a JVM restart which can be a challenge on a heavily active system. Any particular reason you don't want to do a property look up where the integers get used?

Defining startup properties isn't generally bad, but if you're trying to tune the retry behavior it's a drag to have downtime every time you tweak it.

For example, call System.setProperty within the script console while tuning, and then set the system property on startup once you have found an acceptable range of values.
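
A minimal sketch of looking up the properties where the values are used rather than in static fields. The property names and the default of 100 ms are hypothetical placeholders, not the names used by this library.

```java
// Hypothetical sketch: property names and defaults are placeholders.
class RetryTuningSketch {
    static final String MIN_PROP = "example.retry.minIntervalMillis";
    static final String MAX_PROP = "example.retry.maxIntervalMillis";
    static final int DEFAULT_MILLIS = 100;

    // Read on every call, so a System.setProperty at runtime takes effect immediately.
    static int minRetryIntervalMillis() {
        return Math.max(0, Integer.getInteger(MIN_PROP, DEFAULT_MILLIS));
    }

    // Never returns less than the current minimum.
    static int maxRetryIntervalMillis() {
        return Math.max(minRetryIntervalMillis(), Integer.getInteger(MAX_PROP, DEFAULT_MILLIS));
    }
}
```

With this shape, running `System.setProperty("example.retry.minIntervalMillis", "1000")` from a script console changes the value seen by the next retry, with no restart.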

samrocketman · 2023-11-19T18:41:42Z

src/main/java/org/kohsuke/github/GitHubClient.java

@@ -632,11 +651,15 @@ private static IOException interpretApiError(IOException e,

    private static void logRetryConnectionError(IOException e, URL url, int retries) throws IOException {
        // There are a range of connection errors where we want to wait a moment and just automatically retry
+        long sleepTime = minRetryInterval;


I suggest setting minRetryInterval and maxRetryInterval via Integer.getInteger property here.
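
Once both bounds are available at this point, the sleep time could be drawn at random between them. A minimal sketch, assuming millisecond bounds; the class and method names are hypothetical and not taken from the PR.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch of a randomized retry interval.
class RetrySleepSketch {
    // Returns a value in [minMillis, maxMillis]; falls back to minMillis if the range is empty.
    static long nextSleepMillis(long minMillis, long maxMillis) {
        if (maxMillis <= minMillis) {
            return minMillis;
        }
        // nextLong's upper bound is exclusive, so add 1 to include maxMillis.
        return ThreadLocalRandom.current().nextLong(minMillis, maxMillis + 1);
    }
}
```

Randomizing the interval spreads out retries from clients that failed at the same moment, so they are less likely to hit the API again in lockstep.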

samrocketman · 2023-11-19T18:47:06Z

src/main/java/org/kohsuke/github/GitHubClient.java

        LOGGER.log(INFO,
-                e.getMessage() + " while connecting to " + url + ". Sleeping " + GitHubClient.retryTimeoutMillis
+                e.getMessage() + " while connecting to " + url + ". Sleeping " + sleepTime


I recommend creating a trace ID here for debug logging. That way an admin can search debug logs and find related logs for a single retry sequence. Here's an example:

Logging with the trace ID: https://github.com/jenkinsci/scm-filter-jervis-plugin/blob/80e28289ede66fa18553b4e0ca5f518a4bd782bc/src/main/groovy/net/gleske/scmfilter/impl/trait/JervisFilterTrait.groovy#L144

Creating the trace ID via sha256sum: https://github.com/jenkinsci/scm-filter-jervis-plugin/blob/80e28289ede66fa18553b4e0ca5f518a4bd782bc/src/main/groovy/net/gleske/scmfilter/impl/trait/JervisFilterTrait.groovy#L245

When I enable debug logging for the mentioned class, it is so active in parallel that the logs arrive out of order. Because of that, using the trace ID as a prefix on every log line enables me to search for a series of logs along with their retries. It also enabled me to find the maximum retry count across logs, which helps an admin with tuning.

For example, I default to 30 retries in my class, but I found that in practice with GitHub it could retry up to 28 times. Because that was so close to the max retry limit, I increased the retry limit to 60 in my particular setup.

I also set the minimum time between retries to 1000ms and the maximum to 3000ms. I've found GitHub requiring me to retry for up to 1 minute in these scenarios because of secondary API limits.

The new secondary API limits are very aggressive at the moment.

I realize the logging mechanisms have to change a little bit with my feedback; they're not as straightforward as making the change just in this area of code.
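
A minimal sketch of the trace ID idea in Java, assuming the ID is a short prefix of a SHA-256 digest over some per-request seed. The seed choice, the `trace-` prefix format, and the eight-character length are illustrative assumptions, not what the linked Groovy code or this PR does.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch of a hash-based trace ID for log correlation.
class TraceIdSketch {
    // Derives a short, searchable ID from a seed such as the URL plus a timestamp.
    static String traceId(String seed) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(seed.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder("trace-");
            for (int i = 0; i < 4; i++) {
                hex.append(String.format("%02x", hash[i])); // first 4 bytes as 8 hex characters
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Prefixes every message of one retry sequence with the same ID.
    static String prefixed(String traceId, String message) {
        return traceId + " " + message;
    }
}
```

Searching the debug log for a single ID then returns every attempt of one retry sequence, even when parallel requests interleave their output.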

bitwiseman · 2023-11-19T20:34:48Z

@samrocketman

I was thinking property lookup is slow and I don't want to do it per query. But compared to query times it's probably negligible. I see your point about down time.

This isn't intended as a long term feature as it stands right now and I want to have the least change to existing behavior possible.

I'd be okay with saying that if any of the environment variables are set on startup then they will be checked for every query. So, you have to opt in to the behavior on startup.
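
A minimal sketch of that opt-in behavior, assuming system properties rather than environment variables. The property name, the default, and the use of a constructor to stand in for "startup" are hypothetical.

```java
// Hypothetical sketch: per-query lookup is enabled only if the property was set at startup.
class OptInTuningSketch {
    static final String MIN_PROP = "example.retry.minIntervalMillis";
    static final int DEFAULT_MIN_MILLIS = 100;

    private final boolean dynamicLookup;

    // Construction stands in for startup: record whether the property was present.
    OptInTuningSketch() {
        this.dynamicLookup = System.getProperty(MIN_PROP) != null;
    }

    int minRetryIntervalMillis() {
        if (!dynamicLookup) {
            return DEFAULT_MIN_MILLIS; // existing behavior, no lookup cost per query
        }
        return Integer.getInteger(MIN_PROP, DEFAULT_MIN_MILLIS); // re-read on every query
    }
}
```

Installations that never set the property keep the fixed default and pay nothing extra per query, while those that opt in can retune at runtime.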

samrocketman · 2023-11-19T23:04:18Z

Being tunable at all is a plus, really. For pipeline API interactions we're moving to a weird but workable hack where we randomly obtain one of 10 flocks, meaning up to 10 clients can be active with GitHub at a time in pipelines. That's how bad it is; we're at that point.

We're pegged against GH limits, so I do think property lookup is negligible. But either way, if that's how it is, we can work with it. It solves a critical issue on our end with dropped pipeline jobs not being created when they should be.

bitwiseman · 2023-11-20T22:33:57Z

@samrocketman
Updated with trace id for logging.

You said:

Maybe the delay of "1 second" should be configurable as well ?

Perhaps, but why add the complexity? My thought right now is this is purely a sanity check. One second is sane and simple.
If this is enough to bring stability, we're done. If not, we can add configurability.

KeepItSimpleStupid · 2023-11-20T23:24:43Z

@bitwiseman This comment was from me : I was not sure if you wanted to adopt an iterative approach or if you wanted to anticipate all the needs ;) But it's perfectly fine like that !

For my information, once it's merged and released, for a usage in Jenkins, a new version of this plugin would be needed, right ?

Thanks a lot !

bitwiseman · 2023-11-20T23:51:10Z

@KeepItSimpleStupid

@bitwiseman This comment was from me : I was not sure if you wanted to adopt an iterative approach or if you wanted to anticipate all the needs ;) But it's perfectly fine like that !

Yes, iterative.

For my information, once it's merged and released, for a usage in Jenkins, a new version of this plugin would be needed, right ?

Yes.

bitwiseman · 2023-11-21T16:58:10Z

@samrocketman
Does this look ready to merge?

bitwiseman · 2023-11-22T17:25:54Z

@samrocketman
Approved?

samrocketman · 2023-11-22T22:16:21Z

Away on holiday at the moment, so I can't easily review from mobile, but I'll take a look.

samrocketman

These changes look good to me. Thanks for implementing and accommodating feedback. Interesting use of a thread-local string; I learn something in every review in Jenkins projects.

bitwiseman · 2023-11-23T10:16:54Z

Released in 1.318.

samrocketman · 2023-11-23T18:44:32Z

I subscribed to github-api plugin repo for releases

Add sanity caching and retry controls

60a3fc6

KeepItSimpleStupid reviewed Nov 19, 2023

KeepItSimpleStupid reviewed Nov 19, 2023

Update src/main/java/org/kohsuke/github/GitHubClient.java

744911f

Co-authored-by: Fabien Thouny <[email protected]>

samrocketman suggested changes Nov 19, 2023

samrocketman reviewed Nov 19, 2023

bitwiseman added 2 commits November 20, 2023 10:24

Allow dynamic tuning of retries at runtime

df7b0d3

Add trace id to sendRequest logs

7cc9552

bitwiseman requested a review from samrocketman November 21, 2023 05:11

bitwiseman added 3 commits November 21, 2023 15:28

Add trace log testing configs

2a94473

Merge remote-tracking branch 'upstream/main' into sanity-cache

21aa01e

Cleanup test and do not run trace logging on slow tests

f01ddc8

bitwiseman force-pushed the sanity-cache branch from f9d9621 to f01ddc8 Compare November 22, 2023 00:00

bitwiseman mentioned this pull request Nov 22, 2023

GitHub client is fragile with recent GitHub API flakiness #1728

Closed

Tweak to get greater coverage

7a73735

samrocketman reviewed Nov 22, 2023

bitwiseman merged commit b2b3e1c into hub4j:main Nov 22, 2023
11 checks passed

bitwiseman deleted the sanity-cache branch November 22, 2023 22:53
