# Impact of Lock Contention
Credit: this very informative analysis was done by Rao Fu.
- Server: two 8-core CPUs, 72 GB RAM, Intel 82576 NIC, running twemcache forked from https://github.com/twitter/twemcache.
- Load generator: mcperf. Each mcperf instance establishes one persistent TCP connection to the server, and a request is submitted only after the response to the previous request has been received. Multiple mcperf instances run simultaneously on a Mesos cluster, with at most 8 mcperf instances on one Mesos slave machine.
- Request type: GET only; all objects are 100 bytes. Each mcperf instance sends 500K GET requests. The cache is warmed up before the mcperf instances start, and all GET requests hit in the cache.
- Latency measurement: All reported latencies are measured on the client side, as the duration from when the `send` system call is made to submit the request until the `epoll` system call returns and indicates the response is ready to be read (see the sketch after this list).
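
For concreteness, here is a minimal sketch of such a closed-loop client: one persistent connection, one outstanding GET at a time, and the latency taken from the `send()` call to the `epoll_wait()` readiness notification. The host, port, key name, and request count are illustrative assumptions; this is not mcperf's actual code.

```c
/*
 * Minimal sketch of the closed-loop measurement described above: one
 * persistent TCP connection, one outstanding GET at a time, latency
 * taken from the send() call to the epoll_wait() readiness event.
 * Host, port, and key are illustrative; this is not mcperf code.
 */
#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/epoll.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <time.h>

static double elapsed_ms(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

int main(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { .sin_family = AF_INET, .sin_port = htons(11211) };
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0) {
        perror("connect");
        return 1;
    }

    int ep = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev);

    const char req[] = "get foo\r\n";          /* assumes a warmed-up, cached key */
    char buf[4096];

    for (int i = 0; i < 1000; i++) {
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);   /* latency starts at send() */
        send(fd, req, sizeof(req) - 1, 0);

        epoll_wait(ep, &ev, 1, -1);            /* block until the response is readable */
        clock_gettime(CLOCK_MONOTONIC, &t1);   /* latency ends when epoll returns */

        read(fd, buf, sizeof(buf));            /* drain the response (simplified: may be partial) */
        printf("%.3f ms\n", elapsed_ms(t0, t1));
    }
    close(ep);
    close(fd);
    return 0;
}
```

A real load generator such as mcperf also handles partial reads, connection setup options, and different value sizes; the sketch only captures the timing boundaries described above.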
The graph below shows how well twemcache scales with the number of threads. The best performance is observed with eight threads. Compared with four threads (the default number of threads in twemcache), eight threads give a 17% speedup on average latency and an 8% speedup on P95 latency, at the cost of roughly 15% on P999 latency. Increasing the number of threads beyond eight hurts performance.
twemcache has a single lock (`cache_lock`) that protects both the hashtable and the LRU lists. This is the primary reason why twemcache does not scale well with the number of threads. The top 10 functions by CPU time are shown below for both 4 threads and 8 threads; a sketch of the underlying locking pattern follows the tables. `__lll_unlock_wake` and `__lll_lock_wait` are two low-level glibc functions that implement pthread mutexes. They account for only 4.2% of the CPU time with 4 threads, but 19.1% with 8 threads. With 16 threads, the two functions take 60% of the CPU time, which wipes out any gain from using more threads.
With 4 threads:

count | pct | function |
---|---|---|
1572 | 52.4% | __sendmsg_nocancel |
668 | 22.3% | __read_nocancel |
82 | 2.7% | __lll_unlock_wake |
78 | 2.6% | __epoll_wait_nocancel |
66 | 2.2% | __pthread_mutex_lock |
58 | 1.9% | assoc_find |
48 | 1.6% | _IO_vfprintf |
45 | 1.5% | __lll_lock_wait |
36 | 1.2% | conn_add_iov |
27 | 0.9% | memchr |

With 8 threads:

count | pct | function |
---|---|---|
2392 | 45.0% | __sendmsg_nocancel |
890 | 16.7% | __read_nocancel |
645 | 12.1% | __lll_unlock_wake |
374 | 7.0% | __lll_lock_wait |
167 | 3.1% | __epoll_wait_nocancel |
95 | 1.8% | _IO_vfprintf |
83 | 1.6% | __pthread_mutex_lock |
54 | 1.0% | assoc_find |
47 | 0.9% | conn_add_iov |
39 | 0.7% | pthread_mutex_unlock |
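
As a rough illustration of the pattern behind these profiles, the sketch below shows a GET path where a single mutex covers both the hashtable lookup and the LRU update, so every worker thread serializes on it. The struct layout and helper names are hypothetical stand-ins for twemcache's `cache_lock`, `assoc_find`, and LRU code, not the actual implementation.

```c
/*
 * Hedged sketch of the contended pattern described above: every worker
 * thread serializes on one mutex for both the hash lookup and the LRU
 * update on a GET. Structures and helpers are illustrative stand-ins,
 * not twemcache's actual cache_lock/assoc_find code.
 */
#include <pthread.h>
#include <stddef.h>

struct item {
    struct item *h_next;        /* hash chain */
    struct item *prev, *next;   /* LRU list */
    char key[32];
};

static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;

/* stubs standing in for the hashtable lookup and the LRU bump */
static struct item *hash_lookup(const char *key) { (void)key; return NULL; }
static void lru_move_to_head(struct item *it) { (void)it; }

struct item *cache_get(const char *key) {
    struct item *it;

    pthread_mutex_lock(&cache_lock);   /* every GET from every thread queues here */
    it = hash_lookup(key);             /* hashtable read */
    if (it != NULL) {
        lru_move_to_head(it);          /* LRU update is a write, even on a GET */
    }
    pthread_mutex_unlock(&cache_lock); /* contention here shows up as __lll_lock_wait/__lll_unlock_wake */
    return it;
}
```

Because the LRU bump is a write, even this GET-only workload cannot simply switch to a reader lock; every request contends for the same mutex, which is where the `__lll_lock_wait` and `__lll_unlock_wake` time in the profiles above comes from.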
To evaluate how much benefit we could get by breaking the global lock, we made a simple code change to make the hashtable lock free. Since the workload is `GET` only, twemcache still behaves correctly with this change. As shown in the table below, `__lll_unlock_wake` and `__lll_lock_wait` become negligible even with 8 threads once the hashtable is lock free, which results in a 31% reduction in average latency. `GET` is a relatively "cheap" operation; for other operations such as `SET`, a thread likely spends more time in the critical sections, so the performance gain from breaking the global lock is potentially even larger. A sketch of such a lock-free read path appears at the end of this page.

With 8 threads and a lock-free hashtable:

count | pct | function |
---|---|---|
3071 | 57.4% | __sendmsg_nocancel |
1022 | 19.1% | __read_nocancel |
213 | 4.0% | __epoll_wait_nocancel |
119 | 2.2% | _IO_vfprintf |
72 | 1.3% | __pthread_mutex_lock |
61 | 1.1% | assoc_get_bucket |
57 | 1.1% | conn_add_iov |
50 | 0.9% | clock_gettime |
37 | 0.7% | __lll_unlock_wake |
34 | 0.6% | __pthread_getspecific |

Client-side latency comparison:

version | avg | min | max | stddev | p95 | p99 | p999 |
---|---|---|---|---|---|---|---|
top of trunk | 0.343 | 0.065 | 15.473 | 0.146 | 0.570 | 0.787 | 1.200 |
lock-free hashtable | 0.262 | 0.064 | 12.736 | 0.094 | 0.382 | 0.603 | 0.860 |
speedup | 30.92% | | | | 49.21% | 30.51% | 39.53% |
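
For reference, here is a minimal sketch of the kind of lock-free read path mentioned above: with a GET-only workload that never inserts, deletes, or expires items, a reader can walk the hash chain without taking `cache_lock` at all. The names, hash function, and structures are illustrative assumptions, not the actual patch.

```c
/*
 * Hedged sketch of the idea behind the "lock-free hashtable" experiment:
 * because the GET-only workload never modifies the table, readers can
 * traverse a bucket's chain with no lock. Names and structures are
 * illustrative, not twemcache's actual code.
 */
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define HASH_BUCKETS 1024

struct item {
    struct item *h_next;    /* hash chain; never mutated while readers run */
    char key[32];
};

/* bucket heads; the atomic load makes the unsynchronized read explicit */
static _Atomic(struct item *) buckets[HASH_BUCKETS];

static unsigned hash(const char *key) {
    unsigned h = 5381;
    while (*key) {
        h = h * 33 + (unsigned char)*key++;
    }
    return h % HASH_BUCKETS;
}

/* lock-free lookup: safe here only because no writer runs concurrently */
struct item *assoc_find_lockfree(const char *key) {
    struct item *it = atomic_load_explicit(&buckets[hash(key)],
                                           memory_order_acquire);
    while (it != NULL && strcmp(it->key, key) != 0) {
        it = it->h_next;
    }
    return it;
}
```

In a real deployment, writers (`SET`, deletes, expiry) would still need some synchronization, for example per-bucket locks or an RCU-style scheme, so this experiment is best read as an upper bound for the read path rather than a production design.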