-
In an open source project, we use a Caffeine cache to store the results of a computation-intensive operation (node-prepping a JID String) that is pretty central to all functionality in the implementation. This cache is based on Caffeine 2.7.0, and has been for many years. Today, I learned that on one installation (out of hundreds of thousands of downloads), a problem occurs that I've never seen before. I can't quite fathom what's going on, and I am hoping for feedback on potential causes or solutions. What is happening is that most threads in the application get stuck waiting to acquire a reentrant lock. This is pretty apparent when looking at a thread dump created when the problem occurs. The stack trace of one of these threads is pasted below (most threads have a similar stack trace). The 'owned by' message indicates that the lock is held by one particular thread. That thread, however, does not itself show up in the thread dump.
The lock that is being held is internal to Caffeine, and I'm at a bit of a loss here. From what I can tell, Caffeine's implementation diligently releases every lock it acquires. I am happy to accept that the reason for this misbehavior lies well outside of Caffeine's implementation, but before falling further down the rabbit hole, I thought it best to check whether this rings a bell with anyone, just in case...
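As an aside, the 'owned by' relationship that a thread dump hints at can also be queried from inside the running JVM. The sketch below is my own illustration, not code from the project; the class name LockOwnerDump is made up, and it assumes the JVM supports monitoring of ownable synchronizers (java.util.concurrent locks such as ReentrantLock).

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;

public class LockOwnerDump {
    public static void main(String[] args) {
        // Dump every thread, including locked monitors and ownable synchronizers.
        ThreadInfo[] threads = ManagementFactory.getThreadMXBean().dumpAllThreads(true, true);
        for (ThreadInfo info : threads) {
            if (info.getLockName() != null) {
                // A thread that is waiting on a lock reports the lock's identity and,
                // when known, the name and id of the thread that currently owns it.
                System.out.printf("%s waits for %s owned by %s (id %d)%n",
                        info.getThreadName(), info.getLockName(),
                        info.getLockOwnerName(), info.getLockOwnerId());
            }
        }
    }
}

If the reported owner id does not correspond to any live thread in the same dump, that matches the symptom described above.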
-
Why is it that you find answers only minutes after posting a question, and not in the decades leading up to it? The system that's affected has logged a StackOverflowError, thrown on a line in which Caffeine is trying to release the lock that is at the center of this issue:
The overflow itself seems to be caused by some kind of recursion that is well outside of the area of influence of Caffeine. |
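To illustrate the shape of that failure, here is a toy sketch of my own (not the project's code): runaway recursion in application code can leave so little stack that the error is finally raised inside whatever frame happens to be deepest at that moment, for example a library's lock release.

import java.util.concurrent.locks.ReentrantLock;

public class OverflowInLibraryFrame {
    static final ReentrantLock lock = new ReentrantLock();

    // Stand-in for library code (say, a cache lookup) that briefly takes a lock.
    static void libraryWork() {
        lock.lock();
        try {
            // pretend to do some cache work
        } finally {
            lock.unlock(); // with almost no stack left, even this call may be where the overflow surfaces
        }
    }

    // The runaway recursion lives entirely in application code.
    static void applicationRecursion() {
        libraryWork();
        applicationRecursion();
    }

    public static void main(String[] args) {
        try {
            applicationRecursion();
        } catch (StackOverflowError e) {
            // The top frames of the error's stack trace may point into libraryWork()
            // or ReentrantLock, even though the recursion itself is applicationRecursion().
            StackTraceElement top = e.getStackTrace().length > 0 ? e.getStackTrace()[0] : null;
            System.out.println("overflowed at: " + top);
            System.out.println("lock still held? " + lock.isLocked());
        }
    }
}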
-
I'm not sure if it is fine to resurrect this discussion, but our Jenkins instance experiences the same issue as described by @guusdk: occasionally, all threads on the orchestrator wait for the same lock. Jenkins uses Caffeine 3.1.8, and I'm puzzled how this is possible at all. I took a heap dump; the Thread object in question is referenced only by the lock. The Jenkins logs have a lot of error messages around the thread pool that this thread belonged to, so my assumption is that the thread died without releasing its lock. While I don't expect Caffeine to solve that, as I wrote above, I'm puzzled how it is possible in the first place. Looking at the Caffeine code I don't see any leaks, i.e. if the cache acquires a lock, it always releases it. The stack trace of the blocked threads looks like this:
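The hypothesis that a dead thread keeps the lock is easy to confirm in isolation: ReentrantLock ownership is not tied to the owning thread staying alive. A minimal sketch of my own (the class name DeadOwnerDemo is made up):

import java.util.concurrent.locks.ReentrantLock;

public class DeadOwnerDemo {
    public static void main(String[] args) throws InterruptedException {
        var lock = new ReentrantLock();

        // A thread that acquires the lock and then terminates without releasing it.
        var owner = new Thread(lock::lock, "short-lived-owner");
        owner.start();
        owner.join();

        // The owning thread is gone, but the lock remains held and can never be released,
        // because only the (now dead) owner could have unlocked it.
        System.out.println("isLocked: " + lock.isLocked()); // true
        System.out.println("tryLock:  " + lock.tryLock());  // false: any waiter blocks forever
    }
}

In a thread dump this shows up exactly as described above: waiters parked on the lock, with an owner that no longer corresponds to any live thread.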
-
I wrote a test that reproduces this problem on JDK 11 and 21. I simplified it down to remove any library code to show that it is a JDK bug, and opened a ticket. The test will usually fail on the first invocation and, once it becomes successful, stay that way afterwards. This implies there is a warm-up difference that causes the reserved stack fix to not be present.

import java.util.concurrent.locks.ReentrantLock;

public class OverflowTest {

    public static void recurse() {
        var lock = new ReentrantLock();
        var recurser = new Runnable[1];
        recurser[0] = () -> {
            lock.lock();
            try {
                recurser[0].run();
            } finally {
                lock.unlock();
            }
        };
        try {
            recurser[0].run();
            throw new AssertionError();
        } catch (StackOverflowError expected) {}

        if (lock.isLocked()) {
            System.err.println("Lock");
        } else {
            System.out.println("Unlocked");
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 10; i++) {
            recurse();
        }
    }
}

Output:

Lock
Unlocked
Unlocked
Unlocked
Unlocked
Unlocked
Unlocked
Unlocked
Unlocked
Unlocked