-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Ehcache
In response Ehcache is starting to tackle performance, confirming these results. Their first fix reduced the execution time from 4.2 to 3.4 minutes. While there's a lot more work needed, that is an awesome first step and hopefully version 3.0 will mature to be a solid alternative.
Ehcache originated over 10 years ago when Java's concurrency features were immature. At that time coarse grained locking was an acceptable strategy and good performance came from not blocking the cache when a single entry was being loaded due to an in-flight database operation. In that era the product's features and integration into popular frameworks made it a good choice to adopt.
Ehcache 3.0 claims to be a cache on "steroids" by touting its many free and commercial features. Unfortunately the actual performance of the cache has decreased despite the numerous advances that the industry has made since the first release. The latest version has reduced both the hit rate and runtime performance.
When analyzing different eviction policies, Ehcache showed surprising behavior. Both versions use a sampled least-recently-used (LRU) policy to provide a hit rate near to a true LRU. For a small cache of 512
entries, the multi1
trace shows that the quality of the distribution sampled from was reduced.
Hit rate | Hits | Misses | Requests | Time | |
---|---|---|---|---|---|
Lru | 46.66 % | 7,400 | 8,458 | 15,858 | 10.8 ms |
Random | 42.09 % | 6,675 | 9,183 | 15,858 | 12.1 ms |
Caffeine | 54.89 % | 8,704 | 7,154 | 15,858 | 94.2 ms |
Ehcache 2.10 | 46.61 % | 7,391 | 8,467 | 15,858 | 68.6 ms |
Ehcache 3.0m4 | 36.46 % | 5,782 | 10,076 | 15,858 | 111.7 ms |
As the cache size increases the impact of the sample set decreases, raising the hit rate to that of a random policy. However the execution time explodes, taking over 4 minutes on the P8
trace with a maximum size of 65,536
entries. In comparison, an LRU cache takes only 4.5 seconds and provides a higher hit rate.
Hit rate | Hits | Misses | Requests | Time | |
---|---|---|---|---|---|
Lru | 36.10 % | 15,249,933 | 26,993,852 | 42,243,785 | 4.5 s |
Random | 33.36 % | 14,093,695 | 28,150,090 | 42,243,785 | 7.1 s |
Caffeine | 50.43 % | 21,302,116 | 20,941,669 | 42,243,785 | 16.2 s |
Ehcache 2.10 | 36.16 % | 15,275,313 | 26,968,472 | 42,243,785 | 1.6 m |
Ehcache 3.0m4 | 33.07 % | 13,968,421 | 28,275,364 | 42,243,785 | 4.2 m |
The cause of this slowdown is because as the cache size increases so does the time spent finding random entries to sample from. The iteration of the entries becomes a visible bottleneck, as indicated in the JMH benchmark below. For all caches eviction is an expensive operation and a concurrent cache might increase that penalty to favor faster reads. However, care is still required to avoid the cost of eviction from becoming prohibitively expensive.
maximum size | 100 | 10,000 | 1,000,000 | 10,000,000 |
---|---|---|---|---|
LinkedHashMap | 16.5 M ops/s | 17.6 M ops/s | 11.6 M ops/s | 5.5 M ops/s |
Caffeine | 3.1 M ops/s | 2.6 M ops/s | 1.5 M ops/s | 1.1 M ops/s |
Ehcache 2.10 | 462 K ops/s | 615 K ops/s | 258 K ops/s | 1.6 K ops/s |
Ehcache 3.0m4 | 673 K ops/s | 517 K ops/s | 182 K ops/s | 299 ops/s |
The benchmark below shows the throughput of 16 threads on a 16-core machine in read and write workloads. Ehcache has improved write performance at a slight cost to read throughput. Unfortunately there isn't a significant difference between Ehcache and a synchronized LinkedHashMap
used as an LRU cache. As the number of cores and threads grow the cache increasingly becomes a bottleneck, defeating its main purpose in an application.
Read | Read-Write (75-25%) | Write | |
---|---|---|---|
LinkedHashMap | 13.6 M ops/s | 12.8 M ops/s | 10.9 M ops/s |
Caffeine | 382 M ops/s | 279 M ops/s | 48.3 M ops/s |
Ehcache 2.10 | 20.7 M ops/s | 8.5 M ops/s | 4.7 M ops/s |
Ehcache 3.0m4 | 17.6 M ops/s | 17 M ops/s | 13.9 M ops/s |
The above conclusions counter those made by Ehcache's benchmarks, which were written to support the desired outcome. Most cache usages follow Zipf's law and are heavily skewed towards reads. Ehcache instead tries to show improved performance between versions by using a large linear distribution. This allows benchmark threads to never contend as the probability of accessing the same entry is near zero. By modeling each cache read as a write, the performance appears to improve due to integrating a backport of an updated ConcurrentHashMap where coarse-grained locks were replaced with fine-grained writes. Yet this benefit disappears in a realistic workload where most accesses are to the same hot entries. The follow-up question of why all cache reads must suffer the penalty of a hash table write (rather than a much faster read) is never asked. When the measurements are designed to match the desired conclusion the real performance problems are ignored.
Ehcache has not evolved with performance as a valued core competency. Sadly the development team is [dismissive] ehcache-video by relying on predetermined conclusions instead of profiling, measuring, and understanding their code. This leads to benchmarks that validate the desired hypothesis and ignore the interesting follow up questions. This myopic approach to development disregards concerns raised by users, even when the performance issues are experienced in real world applications.