# Impact of Lock Contention
Credit: this very informative analysis was done by Rao Fu.
- Server: two 8-core CPUs, 72 GB RAM, Intel 82576 NIC, running twemcache forked from https://github.com/twitter/twemcache.
- Load generator: mcperf. Each mcperf instance establishes one persistent TCP connection to the server, and a request is submitted only after the response to the previous request has been received. Multiple mcperf instances run simultaneously on a Mesos cluster, with at most 8 mcperf instances on one Mesos slave machine.
- Request type: GET only; all objects are 100 bytes. Each mcperf instance sends 500K GET requests. The cache is warmed up before the mcperf instances start, and all GET requests hit in the cache.
- Latency measurement: All reported latencies are measured on the client side, as the duration from when the `send` system call is made to submit the request until the `epoll` system call returns and indicates the response is ready to be read (see the sketch after this list).
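
For concreteness, here is a minimal sketch of such a closed-loop client: one persistent connection, one outstanding GET at a time, and the latency taken from the `send()` call to the `epoll_wait()` readiness notification. The host, port, key name, and request count are illustrative assumptions; this is not mcperf's actual code.

```c
/*
 * Minimal sketch of the closed-loop measurement described above: one
 * persistent TCP connection, one outstanding GET at a time, latency
 * taken from the send() call to the epoll_wait() readiness event.
 * Host, port, and key are illustrative; this is not mcperf code.
 */
#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/epoll.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <time.h>

static double elapsed_ms(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

int main(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { .sin_family = AF_INET, .sin_port = htons(11211) };
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0) {
        perror("connect");
        return 1;
    }

    int ep = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev);

    const char req[] = "get foo\r\n";          /* assumes a warmed-up, cached key */
    char buf[4096];

    for (int i = 0; i < 1000; i++) {
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);   /* latency starts at send() */
        send(fd, req, sizeof(req) - 1, 0);

        epoll_wait(ep, &ev, 1, -1);            /* block until the response is readable */
        clock_gettime(CLOCK_MONOTONIC, &t1);   /* latency ends when epoll returns */

        read(fd, buf, sizeof(buf));            /* drain the response (simplified: may be partial) */
        printf("%.3f ms\n", elapsed_ms(t0, t1));
    }
    close(ep);
    close(fd);
    return 0;
}
```

A real load generator such as mcperf also handles partial reads, connection setup options, and different value sizes; the sketch only captures the timing boundaries described above.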
The graph below shows how well twemcache scales with the number of threads. The best performance is observed with eight threads. Compared with four threads (the default number of threads in twemcache), eight threads give a 17% speedup on average latency and an 8% speedup on P95 latency, at the cost of roughly 15% on P999 latency. Increasing the number of threads beyond eight hurts performance.
twemcache has a single lock (`cache_lock`) that protects both the hashtable and the LRU lists. This is the primary reason why twemcache does not scale well with the number of threads. The top 10 functions by CPU time are shown below for both 4 threads and 8 threads; a sketch of the underlying locking pattern follows the tables. `__lll_unlock_wake` and `__lll_lock_wait` are two low-level glibc functions that implement pthread mutexes. They account for only 4.2% of the CPU time with 4 threads, but 19.1% with 8 threads. With 16 threads, the two functions take 60% of the CPU time, which wipes out any gain from using more threads.
With 4 threads:

count | pct | function |
---|---|---|
1572 | 52.4% | __sendmsg_nocancel |
668 | 22.3% | __read_nocancel |
82 | 2.7% | __lll_unlock_wake |
78 | 2.6% | __epoll_wait_nocancel |
66 | 2.2% | __pthread_mutex_lock |
58 | 1.9% | assoc_find |
48 | 1.6% | _IO_vfprintf |
45 | 1.5% | __lll_lock_wait |
36 | 1.2% | conn_add_iov |
27 | 0.9% | memchr |

With 8 threads:

count | pct | function |
---|---|---|
2392 | 45.0% | __sendmsg_nocancel |
890 | 16.7% | __read_nocancel |
645 | 12.1% | __lll_unlock_wake |
374 | 7.0% | __lll_lock_wait |
167 | 3.1% | __epoll_wait_nocancel |
95 | 1.8% | _IO_vfprintf |
83 | 1.6% | __pthread_mutex_lock |
54 | 1.0% | assoc_find |
47 | 0.9% | conn_add_iov |
39 | 0.7% | pthread_mutex_unlock |
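
As a rough illustration of the pattern behind these profiles, the sketch below shows a GET path where a single mutex covers both the hashtable lookup and the LRU update, so every worker thread serializes on it. The struct layout and helper names are hypothetical stand-ins for twemcache's `cache_lock`, `assoc_find`, and LRU code, not the actual implementation.

```c
/*
 * Hedged sketch of the contended pattern described above: every worker
 * thread serializes on one mutex for both the hash lookup and the LRU
 * update on a GET. Structures and helpers are illustrative stand-ins,
 * not twemcache's actual cache_lock/assoc_find code.
 */
#include <pthread.h>
#include <stddef.h>

struct item {
    struct item *h_next;        /* hash chain */
    struct item *prev, *next;   /* LRU list */
    char key[32];
};

static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;

/* stubs standing in for the hashtable lookup and the LRU bump */
static struct item *hash_lookup(const char *key) { (void)key; return NULL; }
static void lru_move_to_head(struct item *it) { (void)it; }

struct item *cache_get(const char *key) {
    struct item *it;

    pthread_mutex_lock(&cache_lock);   /* every GET from every thread queues here */
    it = hash_lookup(key);             /* hashtable read */
    if (it != NULL) {
        lru_move_to_head(it);          /* LRU update is a write, even on a GET */
    }
    pthread_mutex_unlock(&cache_lock); /* contention here shows up as __lll_lock_wait/__lll_unlock_wake */
    return it;
}
```

Because the LRU bump is a write, even this GET-only workload cannot simply switch to a reader lock; every request contends for the same mutex, which is where the `__lll_lock_wait` and `__lll_unlock_wake` time in the profiles above comes from.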
To evaluate how much benefit we could get by breaking the global lock, we made a simple code change to make the hashtable lock free. Since the workload is `GET` only, twemcache still behaves correctly with this change. As shown in the table below, `__lll_unlock_wake` and `__lll_lock_wait` become negligible even with 8 threads once the hashtable is lock free, which results in a 31% reduction in average latency. `GET` is a relatively "cheap" operation; for other operations such as `SET`, a thread likely spends more time in the critical sections, so the performance gain from breaking the global lock is potentially even larger. A sketch of such a lock-free read path appears at the end of this page.

With 8 threads and a lock-free hashtable:

count | pct | function |
---|---|---|
3071 | 57.4% | __sendmsg_nocancel |
1022 | 19.1% | __read_nocancel |
213 | 4.0% | __epoll_wait_nocancel |
119 | 2.2% | _IO_vfprintf |
72 | 1.3% | __pthread_mutex_lock |
61 | 1.1% | assoc_get_bucket |
57 | 1.1% | conn_add_iov |
50 | 0.9% | clock_gettime |
37 | 0.7% | __lll_unlock_wake |
34 | 0.6% | __pthread_getspecific |

Client-side latency comparison:

version | avg | min | max | stddev | p95 | p99 | p999 |
---|---|---|---|---|---|---|---|
top of trunk | 0.343 | 0.065 | 15.473 | 0.146 | 0.570 | 0.787 | 1.200 |
lock-free hashtable | 0.262 | 0.064 | 12.736 | 0.094 | 0.382 | 0.603 | 0.860 |
speedup | 30.92% | | | | 49.21% | 30.51% | 39.53% |
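
For reference, here is a minimal sketch of the kind of lock-free read path mentioned above: with a GET-only workload that never inserts, deletes, or expires items, a reader can walk the hash chain without taking `cache_lock` at all. The names, hash function, and structures are illustrative assumptions, not the actual patch.

```c
/*
 * Hedged sketch of the idea behind the "lock-free hashtable" experiment:
 * because the GET-only workload never modifies the table, readers can
 * traverse a bucket's chain with no lock. Names and structures are
 * illustrative, not twemcache's actual code.
 */
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define HASH_BUCKETS 1024

struct item {
    struct item *h_next;    /* hash chain; never mutated while readers run */
    char key[32];
};

/* bucket heads; the atomic load makes the unsynchronized read explicit */
static _Atomic(struct item *) buckets[HASH_BUCKETS];

static unsigned hash(const char *key) {
    unsigned h = 5381;
    while (*key) {
        h = h * 33 + (unsigned char)*key++;
    }
    return h % HASH_BUCKETS;
}

/* lock-free lookup: safe here only because no writer runs concurrently */
struct item *assoc_find_lockfree(const char *key) {
    struct item *it = atomic_load_explicit(&buckets[hash(key)],
                                           memory_order_acquire);
    while (it != NULL && strcmp(it->key, key) != 0) {
        it = it->h_next;
    }
    return it;
}
```

In a real deployment, writers (`SET`, deletes, expiry) would still need some synchronization, for example per-bucket locks or an RCU-style scheme, so this experiment is best read as an upper bound for the read path rather than a production design.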