Skip to content
Ben Manes edited this page Mar 28, 2015 · 40 revisions

Access order queue

A doubly linked list orders all of the entries residing in the hash table. An entry can be manipulated in O(1) time by finding it in the hash map, and then manipulating its neighboring elements.

Access order is defined as the creation, update, or read of an entry. The least recently used entry is ordered at the head and the most recently at the tail. This provides support for size-based eviction (maximumSize) and time-to-idle eviction (expireAfterAccess). The challenge is that every access requires a mutation of this list, which itself cannot be implemented efficiently concurrently.

Write order queue

Write order is defined as the creation or update of an entry. Similar to the access ordered queue, the write order queue is operated on in O(1) time. This queue is used for time-to-live expiration (expireAfterWrite).

Read buffer

Typical caches lock on each operation to safely reorder the entry in the access queue. An alternative is to store each reorder operation in a buffer and apply the changes in batches. This could be viewed as a write-ahead log for the page replacement policy. When the buffer is full an attempt is made to acquire the lock and perform the pending operations, but if it is already held then the thread can return immediately.

The read buffer is implemented as a segmented ring buffer. The segmentation is used to reduce contention and a segment is selected by the hash of the thread's id. The ring buffer is a fixed size array, making it efficient and minimizes garbage collection overhead. The number of segments can grow dynamically based on a contention detecting algorithm.

Write Buffer

Similar to the read buffer, this buffer is used to replay write events. The read buffer is allowed to be lossy, as those events are used for optimizing the hit rate of the eviction policy. Writes cannot be lost, so it must be implemented as an efficient unbounded queue. Due to the priority of draining the write buffer whenever populated, it typically stays empty or very small.

The buffer is implemented as a linked queue. While this creates garbage in comparison to a ring buffer, the write percentage of caches is such that a few eden space objects does not impact performance. There are many optimization approaches if needed, such as a dynamically growing a ring buffer of linked arrays and employing a combining arena.

Lock amortization

While traditional caches lock on each operation to perform tiny amounts of work, Caffeine batches the work and spreads the cost across many threads. This amortizes the penalty under lock rather than creating lock contention as the cost.

An advantage of batching is that the buffers are only ever drained by a single thread at a given time, due to the exclusive nature of the lock. This allows using more efficient multiple-producer / single-consumer based buffer implementations. It also better aligns to the hardware characteristics by taking advantage of CPU cache efficiencies.

Entry state transitions

A challenge when the cache is not guarded by an exclusive lock is that operations may be recorded and replayed in the wrong order. Due to races, a create-read-update-delete sequence may not be stored in the buffers in the same order. To do so would require coarse grained locks, thereby reducing performance.

As is typical in concurrent data structures, Caffeine resolves this dilemma using atomic state transitions. An entry is alive, retired, or dead. The alive state means that it exists in both the hash table and the access/write queues. When the entry is removed from the hash table, it is marked as retired and needs to be removed from the queues. When that occurs the entry is considered dead and is eligible for garbage collection.

Relaxed reads and writes

Caffeine takes great care to make every volatile operation count. Instead of thinking in terms of the language's volatile reads and writes, memory barriers provide a hardware oriented view. By understanding which barriers are emitted and their effect on the hardware and data visibility, there exists realizable potential for better performance.

Caffeine uses relaxed reads when exclusive access is guaranteed under the lock, as data visibility is provided by the lock acquisition's memory barrier. It may also be acceptable when data races will occur regardless, such as when checking if an entry has expired upon read to simulate a cache miss.

Caffeine uses relaxed writes in a similar manner to reads. If the entry is exclusively written to under a lock then the write can piggy back on memory barrier emitted when unlocked. It may also be acceptable at times to favor write skew, such as when updating the access timestamp when reading an entry.

Code generation

There are many different configuration options resulting in most fields being required only if a certain subset of features are enabled. If all fields are present by default, this can be wasteful by increasing the cache and per entry overhead. By code generating the optimal implementation, the runtime memory overhead is reduced at the cost of a larger binary on disk.

This technique has potential to be used for algorithmic optimizations. Perhaps when constructing the cache the user could specify the characteristics optimal for their usage. A mobile application may prefer a higher hit rate over concurrency, while a server application may desire a higher hit rate at a cost of larger memory overhead. Instead of trying to find the optimal balance for all usages, intent could drive the algorithmic choices.

Decorated hash map

Cache implementations decorate ConcurrentHashMap to add the desired features. The concurrent algorithms for caching and the hash table are individually complex. By keeping them separate, advancements in hash table design can easily be exploited and avoids coarsening locks to cover both the table and eviction concerns.

The cost of this approach is additional runtime overhead. Instead of wrapping the value to hold extra metadata, these fields could be inlined directly onto the table entries. The lack of encapsulation might provide fast paths through a single table operation rather than multiple map calls and short-lived object instances (e.g. lambdas).

The previous projects explored both approaches: the decorator style used by ConcurrentLinkedHashMap and forking the hash table by Guava. Not surprisingly, the motivations for forking rarely materialize due to the overwhelming engineering complexity of the resulting design.

Clone this wiki locally