
Concurrency support #8

Open · wants to merge 32 commits into base: concurrent

Conversation

@ChaosD commented Nov 27, 2023

Support for concurrent cache instance

The current libCacheSim has limited support for concurrent simulation. It allows n threads to access n cache instances, but it does not permit n threads to read/write 1 cache instance. This limitation prevents researchers from evaluating the concurrent performance of their algorithms or systems, so we plan to add concurrency support to libCacheSim. The goal of this project is to implement thread-safe cache instances while preserving the original miss ratio and single-thread throughput.

Part 1: Concurrent Support to Hashtable

The simplest cache instance can be divided into two parts:

  • A hashtable for data indexing.
  • Customized eviction metadata for cache eviction.

All cache behaviors involve these two types of data. To make cache operations thread-safe, we must first make the hashtable thread-safe. This part implements a thread-safe hashtable, called con-HashTable in this doc for simplicity.

Con-HashTable Implementation

LibCacheSim uses a chainedHashTable, as shown:

A hashtable
|  ----------------|
|     bucket 0     | ----> cache_obj_t* ----> cache_obj_t* ----> NULL
|  ----------------|
|     bucket 1     | ----> cache_obj_t*
|  ----------------|
|     bucket 2     | ----> NULL
|  ----------------|
|     bucket 3     | ----> cache_obj_t* ----> cache_obj_t* ----> NULL
|  ----------------|
|     bucket 4     | ----> NULL
|  ----------------|
|     bucket 5     | ----> NULL
|  ----------------|

The hashtable is naturally concurrency-friendly because operations on different buckets do not contend with each other. To make the hashtable support concurrency, we only need to coordinate operations on the same bucket. A naive method is to use an rwlock for each bucket, as shown:

A hashtable
|----------------------|
|  rwlock   bucket 0   | ----> cache_obj_t* ----> cache_obj_t* ----> NULL
|----------------------|
|  rwlock   bucket 1   | ----> cache_obj_t*
|----------------------|
|  rwlock   bucket 2   | ----> NULL
|----------------------|
|  rwlock   bucket 3   | ----> cache_obj_t* ----> cache_obj_t* ----> NULL
|----------------------|
|  rwlock   bucket 4   | ----> NULL
|----------------------|
|  rwlock   bucket 5   | ----> NULL
|----------------------|

However, this method is not space-efficient. In our implementation, we initialize an rwlock pool whose size is a power of two and map buckets to rwlocks with rw_index = bucket_index & (rw_count - 1), allowing multiple buckets to share the same rwlock.

Rwlock pool (count=4)	A hashtable
|-----------------|	| ---------------|
|   rw_lock 0     |	|    bucket 0    | ----> cache_obj_t* ----> cache_obj_t* ----> NULL
|-----------------|	| ---------------|
|   rw_lock 1     |	|    bucket 1    | ----> cache_obj_t*
|-----------------|	| ---------------|
|   rw_lock 2     |	|    bucket 2    | ----> NULL
|-----------------|	| ---------------|
|   rw_lock 3     |	|    bucket 3    | ----> cache_obj_t* ----> cache_obj_t* ----> NULL
|-----------------|	| ---------------|
			|    bucket 4    | ----> NULL
			| ---------------|
			|    bucket 5    | ----> NULL
			| ---------------|

For simplicity, con-HashTable is a static hashtable with a default size of 2^23 (around 8 million buckets). It supports thread-safe inserts, deletes, and reads, and is the basic data structure for a cache instance. A minimal sketch of the striped-lock lookup is shown below.
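To illustrate the bucket-to-lock mapping, here is a minimal C sketch of a striped-rwlock lookup. This is not the actual con-HashTable code: the struct layout, names, and hash callback are assumptions for illustration only.

#include <pthread.h>
#include <stdint.h>
#include <stddef.h>

#define HT_SIZE  (1UL << 23)   /* static table size: 2^23 buckets */
#define RW_COUNT (1UL << 12)   /* rwlock-pool size, a power of two */

typedef struct cache_obj {
    struct cache_obj *hash_next;
    uint64_t obj_id;
} cache_obj_t;

typedef struct {
    cache_obj_t    **buckets;           /* HT_SIZE chain heads */
    pthread_rwlock_t locks[RW_COUNT];   /* shared rwlock pool */
} con_hashtable_t;

/* map a bucket to its rwlock: rw_index = bucket_index & (rw_count - 1) */
static inline pthread_rwlock_t *bucket_lock(con_hashtable_t *ht,
                                            uint64_t bucket_idx) {
    return &ht->locks[bucket_idx & (RW_COUNT - 1)];
}

/* thread-safe read: concurrent readers of the same stripe proceed together,
 * while a writer on any bucket sharing the stripe blocks them */
cache_obj_t *con_ht_find(con_hashtable_t *ht, uint64_t obj_id,
                         uint64_t (*hash)(uint64_t)) {
    uint64_t idx = hash(obj_id) & (HT_SIZE - 1);
    pthread_rwlock_t *lk = bucket_lock(ht, idx);
    pthread_rwlock_rdlock(lk);
    cache_obj_t *cur = ht->buckets[idx];
    while (cur != NULL && cur->obj_id != obj_id)
        cur = cur->hash_next;
    pthread_rwlock_unlock(lk);
    return cur;
}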

Test for con-HashTable

We conducted some simple tests for the concurrent hash table (con-HashTable).

  • Environment:

    • CPU: Intel(R) Xeon(R) Gold 5320 CPU @ 2.20GHz * 2 (104 cores)
    • Memory: 512GB
    • OS: Ubuntu 20.04.5 LTS
    • Compiler: gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.2)
    • Test Program: test_concurrent_hashtable.c
  • Test:

    1. Single-thread Test: We tested the single-thread throughput of con-HashTable and the original chained hash table separately for insertions, reads, and deletes. We used MOPS (million operations per second) as the metric.

    2. Multi-thread Test: We tested the multi-thread throughput of con-HashTable. Each thread runs as follows:

      • Step 1: generate $2^{19}$ random numbers as keys and insert them.

      • Step 2: read the $2^{19}$ random keys for 10 rounds.

      • Step 3: delete the $2^{19}$ random keys.

      • Repeat steps 1-3 until the program ends. We used MOPS as the metric to evaluate the throughput of con-HashTable. (A sketch of this per-thread loop appears after the Reproduce list below.)

      We also tested the original chained hash table for comparison. Since the original hash table is not thread-safe, we tested it with one thread only.

  • Result:

    1. Single-thread Test: The throughput of con-HashTable is lower than the original hash table. This is because con-HashTable uses rwlocks to protect buckets, which introduces some overhead.

      • Insertion: The original hashtable is 1.55x faster.
      • Read: The original hashtable is 1.89x faster.
      • Delete: The original hashtable is 1.52x faster.
    2. Multi-thread Test: The maximum throughput of con-HashTable is 22.6x higher than the original hash table.

    (figure: test_1, single- and multi-thread throughput results)

  • Reproduce:

    • Single-thread Test (default time is 20s):

      • Original hashtable: bin/testCHashtable --ht-type 1
      • con-HashTable: bin/testCHashtable
    • Multi-thread Test:

      • Original hashtable - 1 thread: bin/testCHashtable --test-type 1 --ht-type 1
      • con-HashTable - 1 thread: bin/testCHashtable --test-type 1
      • con-HashTable - 10 threads: bin/testCHashtable --test-type 1 --thread-num 10
      • ...
      • con-HashTable - 110 threads: bin/testCHashtable --test-type 1 --thread-num 110
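For reference, below is a minimal C sketch of the per-thread workload described in the Test section above. The wrapper functions (ht_insert_key, ht_read_key, ht_delete_key) and the stop flag are hypothetical; the actual driver is test_concurrent_hashtable.c.

#include <stdint.h>
#include <stdlib.h>

#define N_KEYS        (1 << 19)   /* 2^19 keys per round */
#define N_READ_ROUNDS 10

/* hypothetical wrappers around the con-HashTable API */
extern void ht_insert_key(uint64_t key);
extern void ht_read_key(uint64_t key);
extern void ht_delete_key(uint64_t key);
extern volatile int stop_flag;    /* set by the main thread when time is up */

void *worker(void *arg) {
    unsigned int seed = (unsigned int)(uintptr_t)arg;   /* per-thread seed */
    uint64_t *keys = malloc(sizeof(uint64_t) * N_KEYS);
    while (!stop_flag) {
        /* Step 1: generate 2^19 random keys and insert them */
        for (int i = 0; i < N_KEYS; i++) {
            keys[i] = ((uint64_t)rand_r(&seed) << 32) | rand_r(&seed);
            ht_insert_key(keys[i]);
        }
        /* Step 2: read the keys for 10 rounds */
        for (int r = 0; r < N_READ_ROUNDS; r++)
            for (int i = 0; i < N_KEYS; i++)
                ht_read_key(keys[i]);
        /* Step 3: delete the keys */
        for (int i = 0; i < N_KEYS; i++)
            ht_delete_key(keys[i]);
    }
    free(keys);
    return NULL;
}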

Support for libcuckoo hash table.

TODO()

Part 2: Thread-safe Cache Operations

Cache behaviors involve both the hashtable and eviction metadata. For instance, evicting an item (e.g., item A) from the cache consists of two sub-commands:

  1. Remove A from the eviction metadata.
  2. Remove A from the hashtable.

In concurrent scenarios, other threads may operate on associated data during eviction, e.g., accessing A between steps 1 and 2. At that point, the eviction metadata may be inconsistent with the hashtable, and the client may access undesired data. We need to redesign cache operations so that the hashtable and eviction metadata are updated consistently in concurrent scenarios.

To ensure consistency, we introduce a new flag in_cache for each object to indicate whether the object is in the cache. in_cache is set to true only if the object is in both the hashtable and the eviction metadata. If the object is missing from either, in_cache is set to false. Users can only access cache objects with in_cache set to true.

In Part 1, we implemented a thread-safe hashtable with three fundamental behaviors: inserting an object into the hashtable (insert), removing an object from the hashtable (remove), and finding an object in the hashtable (find). Similarly, we define three fundamental behaviors for the eviction metadata: inserting an object into the eviction metadata (insert), evicting an object from the eviction metadata (evict), and accessing an object (access). Assuming the eviction metadata behaviors are thread-safe (or atomic), the basic cache operations can be built on these behaviors, implemented as follows.

find_base(key) {
	object = hashtable.find(key)
	// In the following operation, if object is not NULL and object.in_cache is true, update the evictMeta and return true;
	// otherwise, return false
	return evictMeta.access(object)
}

insert_base(object) {
	hashtable.insert(object)
	evictMeta.insert(object)
	object.in_cache = true
}

evict_base() {
	// In this operation, object.in_cache is set to false 
	object = evictMeta.evict()
	hashtable.remove(object)
	return object
}

remove_base(key) {
	// In this operation, object.in_cache is set to false 
	hashtable.remove(key)
}

To simplify the design of the eviction metadata, remove_base does not immediately remove the object from the eviction metadata. It only sets the in_cache flag to false and removes the object's pointer from the hashtable; the object remains in the eviction metadata until it is evicted. Since the item can never be accessed again, it will be evicted soon.

  • Races between find_base and evict_base. The result is consistent. As the object is evicted from the eviction metadata, it is marked as in_cache = false. Therefore, find_base will return false even if the object is still in the hashtable.
  • Races between find_base and insert_base. The result is consistent. The in_cache flag is set to true only if the object is in both the hashtable and eviction metadata. Then the object can be found by the client.
  • Races between evict_base and insert_base. The result is possibly inconsistent, but it does not affect correctness. In the scenario where an object is just inserted and immediately evicted, inconsistency may occur:
Thread 1 (insert)	: 	hashtable.insert(object)
Thread 1 (insert)	: 	evictMeta.insert(object)
Thread 2 (evict)	: 	object = evictMeta.evict()
Thread 2 (evict)	: 	hashtable.remove(object)
Thread 1 (insert)	: 	object.in_cache = true
Thread 2 (evict)	: 	return object

In this case, the object is not in the cache, but the in_cache flag is set to true. However, this discrepancy on an already-evicted object does not impact the correctness of the cache.
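Since in_cache is read and written by multiple threads, the flag itself must be accessed atomically. Below is a minimal sketch of how the flag could be declared and toggled with GCC atomic builtins; the helper names and memory orders are assumptions, not the PR's actual code.

#include <stdbool.h>
#include <stdint.h>

typedef struct cache_obj {
    struct cache_obj *hash_next;
    uint64_t obj_id;
    bool in_cache;   /* true only when the object is in BOTH the hashtable
                        and the eviction metadata */
} cache_obj_t;

/* publish the object after insert_base has finished both inserts */
static inline void obj_set_in_cache(cache_obj_t *obj, bool val) {
    __atomic_store_n(&obj->in_cache, val, __ATOMIC_RELEASE);
}

/* visibility check used by find_base / access */
static inline bool obj_in_cache(const cache_obj_t *obj) {
    return __atomic_load_n(&obj->in_cache, __ATOMIC_ACQUIRE);
}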

We can implement other cache operations based on these thread-safe basic cache operations.

insert(key) {
	if (evictMeta.size() < capacity) {
		object = create_object(key)
	} else {
		object = evict_base()
		object.reset(key) // Reuse the object to avoid memory allocation
	}
	insert_base(object)
}

get(key) {
	if (find_base(key)) {
		return true
	} else {
		insert(key)
		return false
	}
}

Part 3: Thread-safe Eviction Algorithms

In this part, our goal is to ensure that eviction metadata (and associated functions) are thread-safe. We start by implementing some common eviction algorithms for prototype system validation. The algorithms include LRU and FIFO.

Implementation of LRU

As discussed in the previous section, eviction metadata involves three basic operations: insert, evict, and access. We use a doubly linked list with a mutex to implement thread-safe operations for the LRU policy. Here's the code:

/* LRU operations */
insert(object) {
    lock()
    list_.pushFront(object)
    unlock()
}

evict() {
    lock()
    object = list_.popBack()
    object.in_cache = false
    unlock()
    return object
}

access(object) {
    lock()
    accessible = (object != NULL and object.in_cache)
    if(accessible) {
        list_.moveFront(object)
    }
    unlock()
    return accessible
}
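As a concrete reference, here is a minimal C sketch of the mutex-guarded doubly linked list behind the pseudocode above. The struct layout and names are assumptions; the PR's actual list implementation may differ.

#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct obj {
    struct obj *prev, *next;
    bool in_cache;
} obj_t;

typedef struct {
    obj_t *head, *tail;        /* head = MRU end, tail = LRU (eviction) end */
    pthread_mutex_t mtx;       /* one mutex guards the whole list */
} lru_t;

static void list_unlink(lru_t *q, obj_t *o) {
    if (o->prev) o->prev->next = o->next; else q->head = o->next;
    if (o->next) o->next->prev = o->prev; else q->tail = o->prev;
}

static void list_push_front(lru_t *q, obj_t *o) {
    o->prev = NULL;
    o->next = q->head;
    if (q->head) q->head->prev = o; else q->tail = o;
    q->head = o;
}

void lru_insert(lru_t *q, obj_t *o) {
    pthread_mutex_lock(&q->mtx);
    list_push_front(q, o);
    pthread_mutex_unlock(&q->mtx);
}

obj_t *lru_evict(lru_t *q) {
    pthread_mutex_lock(&q->mtx);
    obj_t *o = q->tail;
    if (o) {
        list_unlink(q, o);
        o->in_cache = false;   /* evicted objects become invisible to find_base */
    }
    pthread_mutex_unlock(&q->mtx);
    return o;
}

bool lru_access(lru_t *q, obj_t *o) {
    pthread_mutex_lock(&q->mtx);
    bool ok = (o != NULL && o->in_cache);
    if (ok) {                  /* move the object to the MRU end */
        list_unlink(q, o);
        list_push_front(q, o);
    }
    pthread_mutex_unlock(&q->mtx);
    return ok;
}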

Implementation of FIFO

While the LRU policy ensures thread safety by using a mutex for the doubly linked list, this introduces locking overhead. To reduce this overhead, we implement the FIFO policy without locks, using a singly linked list and CAS (compare-and-swap) operations for thread safety. Here's the code:

/* FIFO operations */
insert(object) {
    // Insert object at the tail of the list
    object.next = NULL
    do {
        oldTail = list_.tail
        // CAS (Compare-and-Swap) is an atomic operation.
        // It returns true if the tail is oldTail and updates the tail to object.
        // Otherwise, it returns false.
    } while(list_.tail.CAS(oldTail, object) is false)
    
    if(oldTail is NULL) {
        list_.head = object
    } else {
        oldTail.next = object
    }
}

evict() {
    // Evict object from the head of the list
    // (assumes the list is non-empty: the cache calls evict() only when full)
    do {
        obj_to_evict = list_.head
    } while(list_.head.CAS(obj_to_evict, obj_to_evict.next) is false)
    
    if(obj_to_evict.next is NULL) {
        list_.tail = NULL
    }
 
    return obj_to_evict
}

access(object) {
    // Do nothing for FIFO
}

Similarly, we implement thread-safe versions of LFU, SIEVE, and CLOCK. LFU and SIEVE are similar to LRU, using mutexes, while CLOCK is similar to FIFO, utilizing CAS.
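In C, the CAS in the FIFO insert can be expressed with the GCC __atomic builtins. Below is a minimal sketch of the lock-free tail push; the names are hypothetical, and the known insert/evict race discussed in Part 5 still applies.

#include <stdbool.h>
#include <stddef.h>

typedef struct node {
    struct node *next;
} node_t;

typedef struct {
    node_t *head;
    node_t *tail;
} fifo_t;

/* lock-free tail push, mirroring the FIFO insert() pseudocode above */
void fifo_push(fifo_t *q, node_t *obj) {
    obj->next = NULL;
    node_t *old_tail = __atomic_load_n(&q->tail, __ATOMIC_ACQUIRE);
    /* retry until the tail is swung from old_tail to obj atomically;
     * on failure, old_tail is reloaded with the current tail */
    while (!__atomic_compare_exchange_n(&q->tail, &old_tail, obj,
                                        false /* strong CAS */,
                                        __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE)) {
    }
    if (old_tail == NULL)
        q->head = obj;          /* list was empty */
    else
        old_tail->next = obj;   /* link the previous tail to obj */
    /* NOTE: the window between the CAS and the link assignment is the
     * source of the inconsistency acknowledged in Part 5 */
}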

Part 4: Thread-safe CacheSim

In this part, we redesign CacheSim to support concurrent simulation. This step prepares the project for prototype system validation.

We make the following changes to the original CacheSim to support concurrent simulation.

Changes to CacheSim Parameters

The original CacheSim doesn't support multiple threads accessing the same cache instance, but it has a parameter called num-thread. This parameter specifies the number of threads accessing cache instances. For example:

$ ./bin/cachesim ../data/trace.vscsi vscsi lru 1gb,2gb,3gb,4gb --num-thread=16

This command runs CacheSim with four cache instances (LRU caches with sizes of 1GB, 2GB, 3GB, and 4GB, respectively) and creates a thread pool with 16 threads. However, each thread can only access one cache instance, underutilizing the thread pool. This parameter is used for efficient batch evaluations on multi-core machines.

We repurpose this parameter so that num-thread threads access each cache instance. For instance, the above command would create a thread pool with 64 threads, with each thread accessing one of the four cache instances, ensuring each instance is accessed by 16 threads. The default value of num-thread is 1.

Scaling traces

To make the traces suitable for concurrent simulation, we scale them. Each thread reads the trace and adds a unique prefix to each object ID. For example, in the first thread, object IDs 1, 2, ..., 16 become 1001, 1002, ..., 1016, where the prefix is the thread ID. This ensures each thread replays a unique trace with the same distribution; a sketch of one possible encoding follows.
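The constant and helper below are hypothetical; the PR's actual scaling code may differ.

#include <stdint.h>

/* fold the thread ID into the object ID so that each thread replays a
 * disjoint key space with the same distribution; ID_BASE must exceed the
 * largest original object ID for the prefix scheme to stay collision-free */
#define ID_BASE 1000ULL

static inline uint64_t scale_obj_id(uint64_t obj_id, uint64_t thread_id) {
    return thread_id * ID_BASE + obj_id;   /* e.g., thread 1, ID 1 -> 1001 */
}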

Outputting throughput

In addition to the miss ratio and the number of misses, CacheSim in multi-thread tests now outputs throughput, measured in million operations per second (MOPS). The simulation ends when any thread finishes its trace, and throughput is calculated as the total number of operations divided by the total time.

Now, we can use the new CacheSim to evaluate the multi-thread performance of different cache instances. Here's an example command:

$ ./bin/cachesim -t "obj-id-col=2,delimiter=," ../data/twitter_cluster52.csv csv lru 10kb --num-thread=4

This command tests an LRU cache with a size of 10 KiB using the twitter_cluster52.csv trace, with 4 threads accessing the same cache instance, each issuing 1M requests.

We then test the performance of the new CacheSim with various cache algorithms (LRU, LFU, SIEVE, FIFO, and CLOCK) using 2 to 20 threads, with each thread issuing 1M requests against 10 KiB of cache space. The results are shown in the following figure.

(figure: multi-thread throughput of LRU, LFU, SIEVE, FIFO, and CLOCK with 2 to 20 threads)

Part 5: Further Developments

Thread-safe Admission and Prefetch Algorithms

The implementation of thread-safe admission and prefetch algorithms is a crucial next step. These algorithms play a vital role in cache management and optimization, but since I lack familiarity with them, I'll leave this part to be handled by others who specialize in these areas.

More eviction algorithms

Expanding CacheSim to support more eviction algorithms like ARC, 2Q, and MQ would enhance its utility and applicability. However, integrating these policies will require additional modifications and careful consideration of their implementation specifics.

Extensive Testing

The current test suite is simple and does not cover all edge cases or scenarios. Extensive testing is essential to validate the correctness and robustness of the implementation.

Current Insufficiencies and Bugs

  • The thread-safe hashtable with a fixed size isn't space-efficient.
  • The lock-free eviction policies (i.e., FIFO and CLOCK) aren't entirely consistent yet. If an inserted cache item is evicted immediately, it can end up in an inconsistent state: removed from the hashtable but still in the eviction metadata with in_cache = true. Such an item is inaccessible to clients, which is a wrong state, but it will eventually be evicted, so the cache keeps working.
  • The remove operation in the cache doesn't remove objects from the eviction metadata. It only sets the in_cache flag to false and removes the pointer from the hashtable. This design reduces the update overhead of the eviction metadata but leads to space inefficiency. Additionally, it does not suit randomized eviction policies, which cannot guarantee that invalid (never-accessed-again) cache items will eventually be evicted.

@@ -139,6 +139,7 @@ struct cache_obj;
 typedef struct cache_obj {
   struct cache_obj *hash_next;
   obj_id_t obj_id;
+  uint64_t fingerprint;
Member commented:
Can we use less space for fingerprint? like one-byte? or guard with a macro? Because we may have billions of cache_obj in simulations, we are very sensitive to the memory usage.

Member commented:
you can leave it if you do not plan to merge into develop branch

@@ -196,13 +196,13 @@ cache_obj_t *chained_hashtable_insert(hashtable_t *hashtable, request_t *req) {
   cache_obj->hash_next = new_cache_obj;
   cache_obj = new_cache_obj;
 }
-hashtable->n_obj += 1;
+__sync_fetch_and_add(&hashtable->n_obj, 1);
Member commented:
Neat! __sync_fetch_and_add is legacy code according to the GCC doc (https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html); it might be better to use __atomic_add_fetch.

@ChaosD (Author) commented Nov 27, 2023:
Yes, __sync_fetch_and_add is outdated. However, when I used <stdatomic.h> and atomic_add_fetch, I got an error: ‘_Atomic’ does not name a type. I'm not sure whether it is caused by my GCC version. Maybe you can help me solve this problem :)

Member commented:
Hmm, that's weird; I think atomic does not need a header. Let's just leave it as is for now.

@1a1a11a (Member) left a comment:
Looks great to me! :)

 if (cur_obj != NULL) {
   prev_obj->hash_next = cur_obj->hash_next;
   if (!hashtable->external_obj) free_cache_obj(cur_obj);
+  hashtable->n_obj -= 1;
Member commented:
It looks like there was a bug in the old code. Good catch! Can you port this change to the develop branch?

@ChaosD (Author) commented:
Of course I will. This is just a minor bug, since it seems that this function is not used in the project.

Member commented:
Thank you!

1a1a11a and others added 18 commits December 15, 2023 14:22
1. fix warnings 
2. fix turning off zstd not working
3. fix bugs at several places
4. it is still challenging to build on macos
* add SIEVE to be the main cache of TinyLFU algorithm

* log fix

* Add options for trace print formatting

* add a note how to run the caffeine simulator

* Update table of contents in README.md
* fix compile warnning and errors
* fix bugs in CMakeLists
* remove xxhash from source and use system-level xxhash so that macos can compile
* change global variable names
* add n_core function for macos
* Update README.md

* Update README.md

* Update README.md
zztaki and others added 6 commits February 22, 2024 20:51
update documentation of traceConv;
add better description in Belady algorithm
Merge the 'cacheMon-develop' into concurrency-support
@1a1a11a (Member) commented Apr 29, 2024:

Great work! Should I take a look now or wait until you finish? BTW, it would be great if you can push to https://github.com/1a1a11a/libCacheSim, but if it is too much trouble. I can do the work. :)

@ChaosD (Author) commented Apr 29, 2024:

The coding work is almost complete, so you can check the code now. However, I've made many updates, which may take several days to review. I'll try to answer any questions when I'm free ☺️.

I'll use larger traces to test the program in the next few days. If there are no serious errors, I'd like to push the code to the given repo.

Sadly, the current code cannot be merged into the original project as-is: the cache behaviors under multiple threads differ from the single-thread ones.

@Bob-Chen222 commented:
Hi Chaos, I just used your forked libCacheSim repo to test the scalability of the LRU algorithm. It all works great, but I noticed that the number of requests recorded is inconsistent as the number of threads increases. Here is what I received on my end:
(screenshot: cachesim output showing inconsistent request counts across thread counts)

I used a synthesized oracleGeneral trace containing 20 million requests. The request number displayed is inconsistent when the thread number increases to 2 and 4.
This inconsistency can also be replicated using ./bin/cachesim -t "obj-id-col=2,delimiter=," ../data/twitter_cluster52.csv csv lru 10kb --num-thread=4

Could you take a look at it? Thanks!

@ChaosD (Author) commented May 18, 2024:

Hello Bob, the result is correct. I modified the source code of cachesim to stop when any thread completes its work. As a result, the displayed request number is typically smaller than the input. This adjustment is for calculating the saturation throughput.
