
Impact of Lock Contention


Credit: this very informative analysis was done by Rao Fu.

The performance testing environment is as follows:

  • Server: two 8-core CPUs, 72 GB of RAM, Intel 82576 NIC, twemcache forked from https://github.com/twitter/twemcache.
  • Load generator: mcperf. Each mcperf instance establishes one persistent TCP connection to the server, and a request is only submitted after the response to the previous request has been received. Multiple mcperf instances run simultaneously on the Mesos cluster, with at most 8 mcperf instances running on one Mesos slave machine.
  • Request type: requests are GET only and every object is 100 bytes. Each mcperf instance sends 500K GET requests. The cache is warmed up before the mcperf instances start, so all GET requests hit in the cache.
  • Latency measurement: all reported latencies are measured on the client side, from when the send system call is made to submit the request to when the epoll system call returns and indicates the response is ready to be read (a minimal sketch of this timing follows the list).
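For concreteness, here is a minimal sketch, not mcperf itself, of how the send-to-epoll latency described above can be measured on the client. It assumes a connected socket `fd` that has already been registered with an existing epoll instance `epfd`; those setup steps are omitted.

/* Sketch of the client-side latency measurement: time from the send()
 * call for a request to the epoll_wait() return that reports the
 * response as readable. */
#include <stddef.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <time.h>

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Returns the request latency in nanoseconds, or 0 on error. */
uint64_t timed_get(int epfd, int fd, const char *req, size_t req_len)
{
    struct epoll_event ev;
    uint64_t start = now_ns();

    if (send(fd, req, req_len, 0) < 0) {
        return 0;
    }

    /* Block until epoll reports the socket readable, i.e. the response
     * (or at least its first bytes) has arrived. */
    if (epoll_wait(epfd, &ev, 1, -1) <= 0) {
        return 0;
    }

    return now_ns() - start;
}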

Performance of twemcache with respect to the number of threads.

The graph below shows how well twemcache scales with the number of threads. The best performance is observed with eight threads. Compared with four threads (the default used by twemcache), eight threads provide a 17% improvement in average latency and an 8% improvement in P95 latency, at the cost of roughly 15% on the P999 latency. Increasing the number of threads beyond eight degrades performance.

(Figure: scalability of twemcache with the number of threads)

Performance data on breaking the global cache lock.

twemcache has one lock (cache_lock) that protects both the hashtable and the LRU lists, and it is the primary reason twemcache does not scale well with the number of threads. The top 10 functions are shown below for both 4 threads and 8 threads. __lll_unlock_wake and __lll_lock_wait are two low-level glibc functions that implement pthread mutexes. They account for only 4.2% of the CPU time with 4 threads but 19.1% with 8 threads. With 16 threads, the two functions take 60% of the CPU time, which wipes out any gain from using more threads.
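The sketch below is a simplified illustration of that contention pattern, not the actual twemcache code: a single mutex (cache_lock in twemcache) serializes both the hashtable lookup and the LRU update, so every worker thread funnels through the same critical section on each GET. item_get_simplified and item_touch_lru are illustrative names; assoc_find is the real lookup function that shows up in the profiles.

#include <pthread.h>
#include <stddef.h>

struct item;                                    /* cache item (opaque here) */

extern struct item *assoc_find(const char *key, size_t nkey); /* hash lookup */
extern void item_touch_lru(struct item *it);    /* illustrative: move item to LRU head */

static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;

struct item *
item_get_simplified(const char *key, size_t nkey)
{
    struct item *it;

    pthread_mutex_lock(&cache_lock);    /* every thread contends here;
                                           __lll_lock_wait is the slow path */
    it = assoc_find(key, nkey);         /* hashtable is protected ... */
    if (it != NULL) {
        item_touch_lru(it);             /* ... and so are the LRU lists */
    }
    pthread_mutex_unlock(&cache_lock);  /* __lll_unlock_wake shows up when
                                           waiters must be woken */
    return it;
}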

Top 10 functions for 4 threads

count pct function
1572 52.4% __sendmsg_nocancel
668 22.3% __read_nocancel
82 2.7% __lll_unlock_wake
78 2.6% __epoll_wait_nocancel
66 2.2% __pthread_mutex_lock
58 1.9% assoc_find
48 1.6% _IO_vfprintf
45 1.5% __lll_lock_wait
36 1.2% conn_add_iov
27 0.9% memchr

Top 10 functions for 8 threads

count pct function
2392 45.0% __sendmsg_nocancel
890 16.7% __read_nocancel
645 12.1% __lll_unlock_wake
374 7.0% __lll_lock_wait
167 3.1% __epoll_wait_nocancel
95 1.8% _IO_vfprintf
83 1.6% __pthread_mutex_lock
54 1.0% assoc_find
47 0.9% conn_add_iov
39 0.7% pthread_mutex_unlock

To evaluate how much benefit breaking the global lock could bring, we made a simple code change to make the hashtable lock free. Since the workload used here is GET only, twemcache still behaves correctly with this change. As shown in the table below, __lll_unlock_wake and __lll_lock_wait become negligible even with 8 threads once the hashtable is lock free, which results in a 31% reduction in average latency. GET is a relatively cheap operation; for other operations such as SET, a thread likely spends more time in the critical sections, so the potential gain from breaking the global lock is even larger.
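As an illustration of what a lock-free hashtable can mean for a GET-only workload, the sketch below walks a hash chain without taking cache_lock. It is a rough reconstruction under the stated assumptions (pre-warmed, read-only cache), not the actual patch; hash(), hashmask, hashtable and the item layout are illustrative stand-ins.

#include <stddef.h>
#include <string.h>

struct item {
    struct item *h_next;        /* next item in the hash chain */
    size_t       nkey;          /* key length */
    const char  *key;           /* key bytes */
};

extern struct item **hashtable;                  /* bucket array, fixed once warmed */
extern size_t hash(const char *key, size_t nkey);
extern size_t hashmask;                          /* buckets - 1, power of two */

struct item *
assoc_find_lockfree(const char *key, size_t nkey)
{
    /* No mutex: safe only because no thread inserts, deletes, or rehashes
     * while the GET-only benchmark runs. */
    struct item *it = hashtable[hash(key, nkey) & hashmask];

    while (it != NULL) {
        if (it->nkey == nkey && memcmp(it->key, key, nkey) == 0) {
            return it;
        }
        it = it->h_next;
    }
    return NULL;
}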

Top 10 functions for 8 threads with the lock-free hashtable

count pct function
3071 57.4% __sendmsg_nocancel
1022 19.1% __read_nocancel
213 4.0% __epoll_wait_nocancel
119 2.2% _IO_vfprintf
72 1.3% __pthread_mutex_lock
61 1.1% assoc_get_bucket
57 1.1% conn_add_iov
50 0.9% clock_gettime
37 0.7% __lll_unlock_wake
34 0.6% __pthread_getspecific

Performance gain from making the hashtable lock free

version              avg     min    max     stddev  p95     p99     p999
top of trunk         0.343   0.065  15.473  0.146   0.570   0.787   1.200
lock-free hashtable  0.262   0.064  12.736  0.094   0.382   0.603   0.860
speedup              30.92%  -      -       -       49.21%  30.51%  39.53%
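The speedup row appears to be computed relative to the lock-free latency, i.e. (top of trunk - lock-free) / lock-free: for example, the average-latency speedup is (0.343 - 0.262) / 0.262 ≈ 30.9%, and the P95 speedup is (0.570 - 0.382) / 0.382 ≈ 49.2%.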