Items not removed when CPU bound (using spark) #795

stemill · 2022-10-17T16:41:29Z

stemill
Oct 17, 2022

I'm using Caffeine to manage some reference data in a complicated Spark job. It's working great up to a point...

If I partition the data so that all the cores on a worker are utilised (in my test case a single 48-core worker), the removal of cache entries stops and I eventually run out of memory. However, if I reduce the number of partitions so that 5 cores are always idle, everything works great.

Is there a better way to do this? I've tried leaving just 1, 2, 3 or 4 cores idle, but that doesn't seem to be enough, and it seems a bit wasteful.

Answered by ben-manes

ben-manes · 2022-10-17T18:50:02Z

ben-manes
Oct 17, 2022
Maintainer

The cache should induce backpressure on writes if eviction cannot keep up. Can you try running our stress test?

By default the cache uses ForkJoinPool.commonPool() for any async work, which includes evictions. There have been bugs in some JDKs where FJP could drop tasks (race conditions causing internal data loss) and we've gradually made Caffeine more robust to these cases. When the cache induces backpressure on writes due to too many pending evictions it should block writers on the eviction lock, thereby unscheduling those threads or assisting if the eviction wasn't actually run.

The async eviction isn't necessary for our own logic, but we do have callbacks (like Caffeine.evictionListener), so we don't know how fast or slow that foreign user code might be. Many simply use Caffeine.executor(Runnable::run) to perform the work on the calling thread, so you might see if that setting helps you.
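
To make the difference between the two executors concrete, here is a minimal JDK-only sketch (no Caffeine dependency). The class and method names are hypothetical and exist only for illustration. It reports which thread runs a task when it is handed to `Runnable::run` versus `ForkJoinPool.commonPool()`, and prints the common pool's parallelism next to the processor count.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ForkJoinPool;

// Hypothetical illustration class; not part of Caffeine.
class ExecutorSketch {

  /** Runs one task on the given executor and returns the name of the thread that ran it. */
  static String threadUsedBy(Executor executor) {
    CompletableFuture<String> ran = new CompletableFuture<>();
    executor.execute(() -> ran.complete(Thread.currentThread().getName()));
    return ran.join();
  }

  public static void main(String[] args) {
    // Same-thread executor: execute(task) simply calls task.run() on the caller.
    // With Caffeine it would be passed as Caffeine.newBuilder().executor(sameThread).
    Executor sameThread = Runnable::run;

    // The default mentioned above: a pool shared by the whole JVM.
    Executor commonPool = ForkJoinPool.commonPool();

    System.out.println("caller thread:  " + Thread.currentThread().getName());
    System.out.println("Runnable::run:  " + threadUsedBy(sameThread));
    System.out.println("commonPool:     " + threadUsedBy(commonPool));
    System.out.println("pool parallelism=" + ForkJoinPool.commonPool().getParallelism()
        + ", processors=" + Runtime.getRuntime().availableProcessors());
  }
}
```

With the same-thread executor the task runs on the caller, so maintenance work does not depend on a shared pool getting CPU time. The trade-off is that any listener work is then paid for by the thread performing the cache operation.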

0 replies

stemill · 2022-10-17T21:27:41Z

stemill
Oct 17, 2022
Author

Thanks for the reply. I tried using Caffeine.newBuilder().executor(r => r.run) (Scala-based codebase) but that didn't seem to help. I also removed some logging code I had on the evictionListener, but this actually made things worse!
However, I then tried running the code on a smaller worker (32 cores) and there was no problem running on all cores. So for now I will just use more, smaller workers!
I'll give the stress test a run tomorrow when I get a chance and let you know the results.

1 reply

ben-manes Oct 18, 2022
Maintainer

It might help to know which library and JDK versions you are using, and what the configuration and usages look like. For instance, v2 does not deduplicate explicit refresh operations whereas v3 does (as shown in #694). It might just be that you should upgrade to the latest release.

stemill · 2022-10-18T07:42:06Z

stemill
Oct 18, 2022
Author

Unfortunately I'm stuck on Java 8 for the time being, so it's version 2.9.3 of Caffeine.

0 replies

stemill · 2022-10-19T12:54:07Z

stemill
Oct 19, 2022
Author

Hi. I managed to upgrade to Java 11 and the latest version of Caffeine, but I'm still seeing the same problems.
How do I go about running the stress test? Will I need to run it on a worker within Spark, or just on an equivalent EC2 instance (I'll just run it on the driver)?

4 replies

stemill Oct 19, 2022
Author

I'm also seeing a difference between a 32-core r4 AWS instance and an r5. On the r4 everything is fine, but I see the problems on the r5.

ben-manes Oct 19, 2022
Maintainer

The stress test can be run by cloning this repository and running:

./gradlew :caffeine:stress --workload=[read, write, refresh]

You can just use an equivalent EC2 instance. The test spawns threads (2 x cpu) that loop trying to do work on the cache, and a scheduled thread periodically prints the internal state. There are no assertions, so you can look at the output, and runaway growth should be easy to spot.
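
For readers who want a feel for the shape of such a harness, here is a rough JDK-only sketch. It is not Caffeine's stress test: the bounded cache is a stand-in built from a synchronized `LinkedHashMap`, and the class name, `MAX_SIZE`, the key range, and the reporting interval are all assumptions made for illustration. It only mirrors the structure described above: 2 x cpu writer threads looping, plus a scheduled thread that prints the size.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration class; NOT the Caffeine stress test.
class StressShapeSketch {
  static final int MAX_SIZE = 1_000; // assumed bound for the stand-in cache

  /** Stand-in bounded cache: a synchronized LRU map that drops the eldest entry when full. */
  static Map<Integer, Integer> newBoundedMap() {
    return Collections.synchronizedMap(new LinkedHashMap<Integer, Integer>(16, 0.75f, true) {
      @Override protected boolean removeEldestEntry(Map.Entry<Integer, Integer> eldest) {
        return size() > MAX_SIZE;
      }
    });
  }

  /** Writes from 2 x cpu threads for roughly the given duration and returns the final size. */
  static int run(long millis) {
    Map<Integer, Integer> cache = newBoundedMap();
    AtomicLong writes = new AtomicLong();
    int threads = 2 * Runtime.getRuntime().availableProcessors();
    ExecutorService workers = Executors.newFixedThreadPool(threads);
    ScheduledExecutorService reporter = Executors.newSingleThreadScheduledExecutor();
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(millis);

    // Writer loops: each thread keeps inserting random keys until the deadline passes.
    for (int i = 0; i < threads; i++) {
      workers.execute(() -> {
        while (deadline - System.nanoTime() > 0) {
          int key = ThreadLocalRandom.current().nextInt(100 * MAX_SIZE);
          cache.put(key, key);
          writes.incrementAndGet();
        }
      });
    }

    // Reporter: prints the state periodically; a size that keeps climbing is the warning sign.
    reporter.scheduleAtFixedRate(
        () -> System.out.printf("size=%d writes=%d%n", cache.size(), writes.get()),
        100, 100, TimeUnit.MILLISECONDS);

    workers.shutdown();
    try {
      workers.awaitTermination(millis + 5_000, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    } finally {
      reporter.shutdownNow();
    }
    return cache.size();
  }

  public static void main(String[] args) {
    System.out.println("final size=" + run(1_000));
  }
}
```

Like the real test, the sketch makes no assertions of its own. The point is to watch whether the reported size stays near the bound or keeps growing while every core is busy.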

You might find it useful to inspect a heap dump and flight recording. I'm not sure how you determined it was from Caffeine, so this might give us more insights:

I find it helpful to get a heap dump on OOME and restart the instance, as this indicates a failure. Then you can open it up in a tool like VisualVM, JProfiler, YourKit, Eclipse MAT, etc.

-XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=?

Less useful but you can capture this on a running instance,

jmap -dump:live,format=b,file=? "$(pidof java)"

Java Flight Recorder is very useful, where this command will start and capture a profile for the given time duration (once, non-repeating). Then you can open it up in Java Mission Control to see if anything interesting stands out.

jcmd "$(pidof java)" JFR.start duration=? filename=? settings=profile.jfc

stemill Oct 20, 2022
Author

This is my bad, sorry. I've swapped Caffeine out and put our previous cache (basically a HashMap) back in, and I'm getting the same behaviour.
Previously we were unable to test on larger machines because memory pressure would kill the jobs, which is why I introduced Caffeine. That is working perfectly. It looks like we have other contention problems, possibly I/O, which only emerge when the number of cores is increased. I incorrectly ascribed this to Caffeine.
Many thanks for your help, the library is great.

ben-manes Oct 21, 2022
Maintainer

That's good to hear. I recommend using JFR to capture a profile and understand what is generating so much memory pressure. Typically this quickly identifies the problem, and the fixes are often easy once you know the root cause. Profilers are underappreciated magic.
