Memory leak related to resultCache in TestResultAction #652
Comments
cc @mdealer, would it be possible for you to take a look, please?
What do you mean by OOM? Just a full heap is not an actual out-of-memory condition under garbage-collected memory management. Did Jenkins crash? If not, then it works as intended. SoftReferences are freed as soon as memory is required for something else, so test results remain in the cache for as long as required. We see similar heap usage on our side too.
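For context on the mechanism being described, a minimal Groovy illustration of soft-reference semantics (not plugin code): the referent survives until the GC actually needs the memory.

```groovy
import java.lang.ref.SoftReference

// Hold a value only through a SoftReference.
def ref = new SoftReference<byte[]>(new byte[16 * 1024 * 1024])

// get() keeps returning the object until the GC clears it under memory pressure.
println(ref.get() != null ? 'still cached' : 'cleared under memory pressure')
```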
Yes, the Jenkins controller did indeed crash. The VM the Jenkins controller runs in has 64GiB of RAM. We have set a rather high limit for the heap; in the end it was not the JVM itself that threw an OutOfMemoryError, but the system OOM killer.
I do see the heap usage going up and down during the day like you mention, but overall it tends to grow slowly, more and more over the week, until the machine is OOM. So maybe it is a mistake to give it that much heap (there is nothing running next to the Jenkins controller inside the VM), so I thought 60GiB should be OK. I am going to test with 50GiB of max heap, re-enable the junit cache, and see what happens.
We use OpenJDK 17 on our side plus these JVM args:
Also, anecdotally, a while ago we determined that the workspace locator crashes our instance in the same way with an OOM. We weren't able to find a correlation with resultCache at that time. Maybe something to try, to rule this out, would be setting a system property. Maybe you could also get down to what kind of objects are actually piling up by looking at a heap dump?
Thank you for sharing the flags and info. I will check how to integrate and consolidate them with what we currently have. The JVM should be container-aware (we are running inside an LXC container), but only if the cgroup setup is proper, I guess.
This is the host's RAM (128GiB), so that might be fishy. My current hypothesis is that we are lacking GC pressure and the JVM was happy to trade GC time against memory usage (since host memory was doing fine). Shortly before Jenkins was killed by the oom-killer, I manually triggered a heap dump and ran it through the Eclipse Memory Analyzer using the leak report. Both reports point to the ConcurrentHashMap of the resultCache (screenshots: leak suspect 1, leak suspect 2). It is unclear to me how meaningful these reports are, given that the JVM might be misjudging the free RAM of the system it is running on.
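For reference, one way to trigger such a heap dump from the Jenkins script console; this is a hedged sketch (the dump path is just an example, and the thread does not state how the dump was actually taken):

```groovy
import java.lang.management.ManagementFactory
import com.sun.management.HotSpotDiagnosticMXBean

// Write a heap dump to disk; 'true' limits it to live (reachable) objects.
def mxBean = ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean)
mxBean.dumpHeap('/tmp/jenkins-controller.hprof', true)
```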
In the Groovy script console, you could execute a snippet to see how badly the map grows, and you could check what test results are actually in there.
You could also run a snippet to clear it. Also, check that the cleanup of empty SoftReferences is actually called; that value should change every now and then by itself. Could help to get closer to the root cause of this.
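The exact snippets from this comment were not preserved in this export; the following is a hedged reconstruction of what such script console checks might look like, assuming resultCache and lastCleanupNs are static fields of TestResultAction reached via reflection:

```groovy
def cls = hudson.tasks.junit.TestResultAction

// Inspect how large the cache has grown and what is in it.
def cacheField = cls.getDeclaredField('resultCache')
cacheField.accessible = true
def cache = cacheField.get(null)
println "resultCache size: ${cache.size()}"
println "sample keys: ${cache.keySet().take(20)}"

// Check whether the cleanup of empty SoftReferences is actually running.
def cleanupField = cls.getDeclaredField('lastCleanupNs')
cleanupField.accessible = true
println "lastCleanupNs: ${cleanupField.get(null)}"

// cache.clear()   // uncomment to empty the cache, as suggested above
```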
I began to suspect the size of the cached results to be the problem, so I changed the config to the following, expecting this to also be a working solution in our setup.
However, to my surprise, after a couple of hours I ran this
Sample output
The surprise is that LARGE_RESULT_CACHE_THRESHOLD is apparently not respected. Using a job where I know that rarely anybody opens the testReport page, I can confirm that when I open the testReport, lastCleanupNs updates to a new value and the resultCache size increases by one. I was able to repeat this a couple of times using different builds. Refreshing the page did not make the counter increase. Now, regarding that line: is the code that should make resultCache respect LARGE_RESULT_CACHE_THRESHOLD missing, or is it expected for the TestResult reference to become null at some point? The list of keys looks normal to me (387 at the time of printing).
SoftReferences are nullified by the GC when it frees their contents. The GC decides this on its own; there is no hard limit. If the GC never does that for any reason, then the cache keeps growing. LARGE_RESULT_CACHE_THRESHOLD does not trigger a cleanup of the cache, but a catch-up with GC operations; there is no hard limit here, just the available memory and the GC. The catch-up is pointless if the map is small enough (the memory taken being less of a burden than iterating through the map). So, we need to find out why the GC is not doing its job for you. Can you reduce the memory further to see what changes in behavior? EDIT:
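To make that distinction concrete, here is a small Groovy sketch of the pattern being described, not the plugin's actual code: the threshold only decides when it is worth sweeping out soft references the GC has already cleared; it never caps the number of live entries.

```groovy
import java.lang.ref.SoftReference
import java.util.concurrent.ConcurrentHashMap

// A cache of soft references; the GC empties individual references on its own.
def cache = new ConcurrentHashMap<String, SoftReference<Object>>()

// Name taken from this thread; the value here is purely illustrative.
int LARGE_RESULT_CACHE_THRESHOLD = 3000

def catchUpWithGc = {
    // Only worth the iteration cost once the map has grown large:
    // drop entries whose referent the GC has already cleared.
    if (cache.size() > LARGE_RESULT_CACHE_THRESHOLD) {
        cache.entrySet().removeIf { it.value.get() == null }
    }
}
```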
I am not sure if disabling the resultCache fixed it for sure; I would need to wait a couple of days to confirm that 100%. EDIT: I tried a ... Note that since I opened the ticket I have not changed any JVM settings; I currently only have prod available to test :-)
Try this to check if it is full of nulls or not:
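The snippet itself was not preserved in this export; a hedged sketch of that kind of null count, under the same reflection assumption as above:

```groovy
def f = hudson.tasks.junit.TestResultAction.getDeclaredField('resultCache')
f.accessible = true
def cache = f.get(null)

// Count entries whose SoftReference has already been cleared by the GC.
def cleared = cache.values().count { it.get() == null }
println "entries: ${cache.size()}, already-cleared referents: ${cleared}"
```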
No, not a single null; I am getting a long list of the kind
I forgot
Still not a single null, lots of the kind
Observe this for some time and report back.
The reference from EntrySetView could be the .forEach that cleans up the nulls, but all of these should be SoftReferences, so unless there is a bug in this specific constellation with this GC, I am not sure what to try other than the VM options. We run tests until the disk runs out of space on our side; we need to dig a bit more into your setup. Next to try would be the JVM args and system props, plus observations. Also, is it possible that we run into https://bugs.openjdk.org/browse/JDK-8192647 by a coincidence, like those workspace locator threads?
Jumping on this thread to say that we are encountering the same issue as well, with the exact same findings as those above:
and similar results for the other troubleshooting commands, and also similar results coming out of our Eclipse heap dump.
This issue was incorrectly marked as complete, as we are experiencing the same issues outside of a container environment |
Jenkins and plugins versions report
Environment
What Operating System are you using (both controller, and any agents involved in the problem)?
Controller: Ubuntu 20.04.6 LTS
Agents: Ubuntu 20.04.6 LTS
Reproduction steps
Having the resultCache of TestResultAction enabled caused a slow memory build-up for us. The problem seems to have started 3-4 weeks ago, when we probably updated plugins along with a restart we had to do.
From then on, in a week's time, more and more heap was being used until we were exhausting the 60GiB max heap.
Expected Results
Memory usage should not steadily increase over time.
Actual Results
Memory steadily building up until OOM
Note that the smaller hill in the middle was due to an unrelated controller restart.
Sadly I cannot recall (or know where to look up) what changed around 3-4 weeks ago when the memory build-up started to happen. But it looks like a recent change caused this.
Anything else?
Clearing the resultCache (resultCache.clear()) freed up the 'lost' memory
We disabled the cache via
hudson.tasks.junit.TestResultAction.RESULT_CACHE_ENABLED
and from then on (26/09/2024), it looks like the memory build-up is not happening anymore.
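A quick script console check that the property is actually in effect on the running controller; this assumes the cache was disabled by passing the property as a -D JVM option set to false, which is our reading of the comment above:

```groovy
// Prints the value the running JVM sees, e.g. after starting the controller with
// -Dhudson.tasks.junit.TestResultAction.RESULT_CACHE_ENABLED=false (assumed).
println System.getProperty('hudson.tasks.junit.TestResultAction.RESULT_CACHE_ENABLED')
```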
Are you interested in contributing a fix?
Potentially yes, if I can find some time. I am not an expert in Java though, so I might not be able to identify the source of the leak.