# Latency
In these benchmarks, a cache miss is essentially free. These tests exist purely to compare the raw execution speed of the cache bookkeeping code. In a real setting, where a cache miss is presumably quite expensive, the relative overhead of the cache will be very small.
DISCLAIMER: Always measure performance in the context of your application. The results provided here are intended as a guide.
Benchmarks are based on BenchmarkDotNet, so are single threaded. Both `ConcurrentLru` and `ConcurrentLfu` scale well with concurrent workloads. The relative ranking of each cache implementation is stable across .NET Framework/Core/5/6 and on the CPU architectures available in Azure (e.g. Intel Skylake, AMD Zen). Absolute performance can vary.
## Raw Lookup Speed

In this test the same items are fetched repeatedly and no items are evicted. This is representative of a high hit rate scenario with a small number of hot items. A `ConcurrentDictionary` lookup is used to establish a baseline; a sketch of this style of benchmark is shown below.
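As a rough illustration of how such a lookup benchmark is structured (a minimal sketch, assuming the BitFaster.Caching `FastConcurrentLru` API and BenchmarkDotNet; the capacity and key values here are illustrative, not the exact setup used to produce the tables below):

```csharp
using System.Collections.Concurrent;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using BitFaster.Caching.Lru;

public class LookupBench
{
    private const int Capacity = 9; // illustrative; not the capacity used for the published numbers
    private readonly ConcurrentDictionary<int, int> dictionary = new ConcurrentDictionary<int, int>();
    private readonly FastConcurrentLru<int, int> lru = new FastConcurrentLru<int, int>(Capacity);

    [GlobalSetup]
    public void Setup()
    {
        // Pre-populate so every benchmark iteration is a pure cache hit.
        dictionary.TryAdd(1, 1);
        lru.GetOrAdd(1, k => k);
    }

    [Benchmark(Baseline = true)]
    public int ConcurrentDictionaryGet() => dictionary.TryGetValue(1, out var value) ? value : 0;

    [Benchmark]
    public int FastConcurrentLruGet() => lru.GetOrAdd(1, k => k);
}

public class Program
{
    public static void Main() => BenchmarkRunner.Run<LookupBench>();
}
```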
- The `ConcurrentLru` family does not move items in the queues; it simply marks items as accessed on a pure cache hit and is therefore very fast. This test highlights the difference in performance between `FastConcurrentLru`, `ConcurrentLru`, a `FastConcurrentLru` with atomic GetOrAdd (`AtomicFastLru`), and `ConcurrentTLru`. All share the same underlying cache algorithm but have successively more features enabled.
- `ConcurrentLfu` performs a dictionary lookup and logs the read in the read buffer. It has been configured with the `BackgroundThreadScheduler` in this test.
- `ClassicLru` must maintain item order, and internally splices the fetched item to the head (MRU position) of the linked list.
- Both the older runtime `MemoryCache` and the new extensions memory cache are tested using v8.0.0. Since the key is an integer, the runtime cache must convert it to a string and the extensions version must box the integer, hence there are memory allocations (see the sketch after this list).
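A minimal sketch of why those allocations occur, using the standard `System.Runtime.Caching` and `Microsoft.Extensions.Caching.Memory` APIs (keys and values here are illustrative):

```csharp
using Microsoft.Extensions.Caching.Memory;

int key = 42;

// System.Runtime.Caching.MemoryCache keys are strings, so an integer key
// must be converted, allocating a new string on every lookup.
var runtimeCache = System.Runtime.Caching.MemoryCache.Default;
runtimeCache.Add(key.ToString(), "value", System.Runtime.Caching.ObjectCache.InfiniteAbsoluteExpiration);
object hit1 = runtimeCache.Get(key.ToString()); // string allocation per call

// Microsoft.Extensions.Caching.Memory keys are typed as object, so an
// integer key is boxed on every lookup.
var extensionsCache = new MemoryCache(new MemoryCacheOptions());
extensionsCache.Set(key, "value");
bool found = extensionsCache.TryGetValue(key, out string hit2); // boxing allocation per call
```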
### Tabular Benchmark Data
| Method | Runtime | Mean | StdDev | Ratio | Allocated |
|---|---|---|---|---|---|
| ConcurrentDictionary | .NET 6.0 | 7.169 ns | 0.0791 ns | 1.00 | - |
| FastConcurrentLru | .NET 6.0 | 8.750 ns | 0.0390 ns | 1.22 | - |
| ConcurrentLru | .NET 6.0 | 15.027 ns | 0.2593 ns | 2.10 | - |
| AtomicFastLru | .NET 6.0 | 21.706 ns | 0.9389 ns | 3.07 | - |
| FastConcurrentTLru | .NET 6.0 | 12.559 ns | 0.2071 ns | 1.76 | - |
| FastConcLruAfterAccess | .NET 6.0 | 11.919 ns | 0.1627 ns | 1.66 | - |
| FastConcLruAfter | .NET 6.0 | 15.720 ns | 2.0918 ns | 2.26 | - |
| ConcurrentTLru | .NET 6.0 | 16.949 ns | 0.1484 ns | 2.37 | - |
| ConcurrentLfu | .NET 6.0 | 23.589 ns | 0.2951 ns | 3.30 | - |
| ClassicLru | .NET 6.0 | 44.810 ns | 0.5894 ns | 6.26 | - |
| RuntimeMemoryCacheGet | .NET 6.0 | 135.328 ns | 4.1732 ns | 19.08 | 32 B |
| ExtensionsMemoryCacheGet | .NET 6.0 | 50.453 ns | 1.2636 ns | 7.11 | 24 B |
| ConcurrentDictionary | .NET Framework 4.8 | 15.212 ns | 0.0514 ns | 1.00 | - |
| FastConcurrentLru | .NET Framework 4.8 | 16.007 ns | 0.1257 ns | 1.05 | - |
| ConcurrentLru | .NET Framework 4.8 | 19.723 ns | 0.2037 ns | 1.29 | - |
| AtomicFastLru | .NET Framework 4.8 | 30.607 ns | 0.2735 ns | 2.01 | - |
| FastConcurrentTLru | .NET Framework 4.8 | 50.051 ns | 0.3676 ns | 3.29 | - |
| FastConcLruAfterAccess | .NET Framework 4.8 | 48.976 ns | 0.4488 ns | 3.23 | - |
| FastConcLruAfter | .NET Framework 4.8 | 50.478 ns | 0.3402 ns | 3.32 | - |
| ConcurrentTLru | .NET Framework 4.8 | 50.059 ns | 0.1214 ns | 3.29 | - |
| ConcurrentLfu | .NET Framework 4.8 | 41.095 ns | 0.8555 ns | 2.72 | - |
| ClassicLru | .NET Framework 4.8 | 55.754 ns | 0.2141 ns | 3.67 | - |
| RuntimeMemoryCacheGet | .NET Framework 4.8 | 291.193 ns | 1.5455 ns | 19.16 | 32 B |
| ExtensionsMemoryCacheGet | .NET Framework 4.8 | 93.258 ns | 0.8944 ns | 6.14 | 24 B |
## Lookup keys with a Zipf distribution

Take 1000 samples of a Zipfian distribution over a set of keys of size N and use the keys to look up values in the cache. If there are N items, the probability of accessing an item numbered i or less is (i / N)^s.
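A minimal sketch of one way to draw such samples, by inverting the CDF given above (the class, method names, and seed are illustrative assumptions, not the benchmark's actual sampler):

```csharp
using System;

// Inverse transform sampling from the CDF above, P(item <= i) = (i / N)^s.
// If u ~ Uniform(0,1), then i = ceil(N * u^(1/s)) follows that distribution.
public class ZipfSampler
{
    private readonly Random random = new Random(42); // fixed seed for repeatable runs
    private readonly int n;
    private readonly double s;

    public ZipfSampler(int n, double s)
    {
        this.n = n;
        this.s = s;
    }

    public int Next()
    {
        double u = random.NextDouble();
        // Clamp to 1 to guard against u == 0; result is a key in [1, N].
        return Math.Max(1, (int)Math.Ceiling(n * Math.Pow(u, 1.0 / s)));
    }
}
```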
- s = 0.86 (yields an approx. 80/20 distribution)
- N = 500
- Cache size = N / 10 (so we can cache 10% of the total set)

`ConcurrentLru` has approximately the same computational overhead as a standard LRU in this single threaded test.
| Method | Mean | Error | StdDev | Ratio | RatioSD |
|---|---|---|---|---|---|
| ClassicLru | 175.7 ns | 2.75 ns | 2.43 ns | 1.00 | 0.00 |
| FastConcurrentLru | 180.2 ns | 2.55 ns | 2.26 ns | 1.03 | 0.02 |
| ConcurrentLru | 189.1 ns | 3.14 ns | 2.94 ns | 1.08 | 0.03 |
| FastConcurrentTLru | 261.4 ns | 4.53 ns | 4.01 ns | 1.49 | 0.04 |
| ConcurrentTLru | 266.1 ns | 3.96 ns | 3.51 ns | 1.51 | 0.03 |
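For reference, a sketch of the measured loop under the parameters above, reusing the `ZipfSampler` sketch from earlier and assuming the BitFaster.Caching `ConcurrentLru` API (an illustrative harness, not the actual benchmark code; per the description above, the 1000 samples are drawn first and only the lookups are measured):

```csharp
using BitFaster.Caching.Lru;

const int N = 500;
const int SampleCount = 1000;

// Precompute the Zipfian key sequence so sampling cost stays out of the measured loop.
var sampler = new ZipfSampler(N, 0.86); // ZipfSampler from the sketch above
var keys = new int[SampleCount];
for (int i = 0; i < keys.Length; i++)
{
    keys[i] = sampler.Next();
}

// Cache size = N / 10: only 10% of the key set fits, so cold keys cause evictions.
var lru = new ConcurrentLru<int, int>(N / 10);

// The measured operation: look up each sampled key, inserting it on a miss.
foreach (int key in keys)
{
    lru.GetOrAdd(key, k => k);
}
```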