
Accelerate hash table iterator with prefetching #1501

Merged: 1 commit into valkey-io:unstable on Jan 8, 2025

Conversation

@NadavGigi (Contributor) commented Jan 1, 2025

This PR introduces improvements to the hashtable iterator, implementing the prefetching technique described in the blog post Unlock One Million RPS - Part 2. The changes lay the groundwork for further enhancements in use cases involving iterators. Future PRs will build upon this foundation to improve performance and functionality in various iterator-dependent operations.

In the pursuit of maximizing iterator performance, I conducted a comprehensive series of experiments. My tests encompassed a wide range of approaches, including processing multiple bucket indices in parallel, prefetching the next bucket upon completion of the current one, and several other timing and quantity variations. Surprisingly, after rigorous testing and performance analysis, the simplest implementation, presented in this PR, consistently outperformed all of the more complex strategies.

Implementation

Each time we start iterating over a bucket, we prefetch data for future iterations (a sketch follows this list):

  • We prefetch the entries of the next bucket (if it exists).
  • We prefetch the structure (but not the entries) of the bucket after the next.

This prefetching is done when we pick up a new bucket, increasing the chance that the data will be in cache by the time we need it.
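
For illustration, here is a minimal sketch of that step in C. This is not the PR's actual code: the bucket layout and the helpers getBucketAt and isPositionFilled are hypothetical stand-ins; only valkey_prefetch reflects the prefetch hint the PR relies on.

#include <stddef.h>

/* Illustrative types only; the real valkey bucket layout differs. */
#define ENTRIES_PER_BUCKET 7
typedef struct bucket {
    struct bucket *child; /* chained child bucket, if any */
    void *entries[ENTRIES_PER_BUCKET];
} bucket;
typedef struct hashtable hashtable; /* opaque here */

bucket *getBucketAt(hashtable *ht, size_t index); /* hypothetical helper */
int isPositionFilled(bucket *b, int pos);         /* hypothetical helper */
void valkey_prefetch(const void *ptr);            /* cache prefetch hint */

static void prefetchNextBuckets(hashtable *ht, size_t index) {
    /* Prefetch the entries of the next bucket, if it exists. */
    bucket *next = getBucketAt(ht, index + 1);
    if (next) {
        for (int pos = 0; pos < ENTRIES_PER_BUCKET; pos++) {
            if (isPositionFilled(next, pos)) valkey_prefetch(next->entries[pos]);
        }
    }
    /* Prefetch only the struct (not the entries) of the bucket after the next. */
    bucket *afterNext = getBucketAt(ht, index + 2);
    if (afterNext) valkey_prefetch(afterNext);
}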

Performance

The data below was collected by running the KEYS command on a 64-core Graviton3 Amazon EC2 instance with 50 million keys of 100 bytes each. The duration of the "keys *" command was taken from the output of the "info all" command.

+--------------------+------------------+-----------------------------+
| prefetching        | Time (seconds)   | Keys Processed per Second   |
+--------------------+------------------+-----------------------------+
| No                 | 11.112279        | 4,499,529                   |
| Yes                | 3.141916         | 15,913,862                  |
+--------------------+------------------+-----------------------------+
Improvement:
Comparing the iterator without prefetching to the one with prefetching,
the prefetching version is 11.112279 / 3.141916 ≈ 3.54 times faster.

SAVE command improvement

Setup (a reproduction sketch follows this list):

  • 64-core Graviton3 Amazon EC2 instance.
  • 50 million keys of 100 bytes each.
  • valkey server running on a RAM file system.
  • CRC checksum and compression disabled.
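
Not part of the PR, but a setup like this could be reproduced roughly as follows; the exact invocation is an assumption (valkey-benchmark and the rdbchecksum/rdbcompression parameters are inherited from Redis):

# Populate ~50 million random keys with 100-byte values.
valkey-benchmark -t set -n 50000000 -r 50000000 -d 100
# Turn off RDB checksum and compression for the SAVE test.
valkey-cli config set rdbchecksum no
valkey-cli config set rdbcompression no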

Results

+--------------------+------------------+-----------------------------+
| prefetching        | Time (seconds)   | Keys Processed per Second   |
+--------------------+------------------+-----------------------------+
| No                 | 28               | 1,785,700                   |
| Yes                | 19.6             | 2,550,000                   |
+--------------------+------------------+-----------------------------+
Improvement:
- Reduced SAVE time by 30% (8.4 seconds faster)
- Increased key processing rate by 42.8% (764,300 more keys/second)

@NadavGigi NadavGigi changed the title from "Improving iterator using prefetch" to "Accelerate hash table iterator with prefetching" on Jan 1, 2025
@NadavGigi NadavGigi force-pushed the batch_iterator branch 2 times, most recently from e001ab1 to ae465ad on January 2, 2025 10:44
@ranshid ranshid requested a review from uriyage January 2, 2025 16:26
@madolson (Member) commented Jan 2, 2025

How does this compare to having an iterator that actually returns a batch of items? Something like:

void **entries; /* e.g., a batch of up to 7 entries (one bucket) */
size_t num_entries;
entries = getBatchEntries(iterator, &num_entries);
if (entries) {
    for (size_t i = 0; i < num_entries; i++) {
        whatever(entries[i]);
    }
}

I generally prefer to avoid manually executing prefetching when we can just efficiently process the data, as we then give more hints to the compiler and the processor so it can efficiently do its own re-ordering and prefetching.

@madolson (Member) commented Jan 2, 2025

It's important to note that while we refer to 'threads' in this implementation, we're not actually using operating system threads.

Then don't name them threads, it makes the implementation much harder to follow.

@NadavGigi NadavGigi closed this Jan 5, 2025
@NadavGigi NadavGigi reopened this Jan 5, 2025
@NadavGigi NadavGigi force-pushed the batch_iterator branch 3 times, most recently from 86230a2 to 8b2a3a5 on January 5, 2025 11:23
@zuiderkwast zuiderkwast self-requested a review January 5, 2025 16:28
@zuiderkwast (Contributor) commented Jan 5, 2025

I agree with Madelyn that the word "threads" is not suitable. Can we call it batches?

This looks quite complex. Was this adapted from an implementation for dict?

For hashtable, there is a natural batch of one hashtable bucket (up to 7 elements). Can't we just prefetch all elements in a hashtable bucket and the child bucket, if any? When we return the last element in the bucket, we'd prefetch the next bucket's elements. Then we just read and return the elements as before without keeping track of anything. It'd be much simpler.

For KEYS, we just want to return the elements, so I guess it can be faster with a more complex solution, but for other uses like RDB dump and AOF rewrite, we do more things after each call to hashtableNext, so I would assume most of the benefit can be seen even with the simpler approach.

Did you try the simpler approach already?

@NadavGigi (Contributor, Author) commented Jan 6, 2025

How does this compare to having an iterator that actually returns a batch of items? Something like:

void **entries; /* e.g., a batch of up to 7 entries (one bucket) */
size_t num_entries;
entries = getBatchEntries(iterator, &num_entries);
if (entries) {
    for (size_t i = 0; i < num_entries; i++) {
        whatever(entries[i]);
    }
}

I generally prefer to avoid manually executing prefetching when we can just efficiently process the data, as we then give more hints to the compiler and the processor so it can efficiently do its own re-ordering and prefetching.
@madolson
Thank you for your suggestion. If I understand correctly, this approach would necessitate changes to every piece of code that currently uses the iterator, not just the iterator implementation itself. However, I may have misinterpreted your proposal. Could you please provide further explanation or clarification on how this batch iterator could be implemented without requiring widespread changes to the existing codebase?

It's important to note that while we refer to 'threads' in this implementation, we're not actually using operating system threads.

Then don't name them threads, it makes the implementation much harder to follow.

Done.

@NadavGigi (Contributor, Author) commented Jan 6, 2025

I agree with Madelyn that the word "threads" is not suitable. Can we call it batches?

This looks quite complex. Was this adapted from an implementation for dict?

For hashtable, there is a natural batch of one hashtable bucket (up to 7 elements). Can't we just prefetch all elements in a hashtable bucket and the child bucket, if any? When we return the last element in the bucket, we'd prefetch the next bucket's elements. Then we just read and return the elements as before without keeping track of anything. It'd be much simpler.

For KEYS, we just want to return the elements, so I guess it can be faster with a more complex solution, but for other uses like RDB dump and AOF rewrite, we do more things after each call to hashtableNext, so I would assume most of the benefit can be seen even with the simpler approach.

Did you try the simpler approach already?
@zuiderkwast

  1. I have changed the name "threads" to bucketPrefetchInfo, since it's the number of buckets we scan in parallel.
  2. Yes, it was adapted from the dict implementation.
  3. No, I had only tried a simpler version that benefits from prefetching the current bucket ("batch 1" in the performance table).
     Following your comment, I have implemented the solution you suggested (indeed, a very easy approach); the results below are from running the KEYS command on a 64-core Graviton3 Amazon EC2 instance with 50 million keys of 100 bytes each:
+--------------------+------------------+-----------------------------+
| Implementation     | Time (seconds)   | Keys Processed per Second   |
+--------------------+------------------+-----------------------------+
| Iterator without   | 11.112279        | 4,499,529                   |
|    prefetching     |                  |                             |
| Viktor's approach  | 3.712602         | 13,467,954                  |
| batch iterator     | 3.336432         | 14,985,700                  |
+--------------------+------------------+-----------------------------+

The batch iterator processes keys about 11% faster than Viktor's approach.
SAVE command time is the same for the batch iterator and Viktor's approach.

@NadavGigi NadavGigi force-pushed the batch_iterator branch 2 times, most recently from 204d534 to 9b8461b on January 6, 2025 12:28
codecov bot commented Jan 6, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.84%. Comparing base (c0014ef) to head (4de5b1a).
Report is 3 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1501      +/-   ##
============================================
+ Coverage     70.82%   70.84%   +0.02%     
============================================
  Files           120      120              
  Lines         64911    64915       +4     
============================================
+ Hits          45972    45992      +20     
+ Misses        18939    18923      -16     
Files with missing lines   Coverage Δ
src/hashtable.c            75.91% <100.00%> (-0.42%) ⬇️

... and 15 files with indirect coverage changes

@zuiderkwast (Contributor) commented

The batch iterator processes keys about 11% faster than Viktor's approach.
SAVE command time is the same for the batch iterator and Viktor's approach.

Thanks for implementing and comparing this! From your table we can see that "my" approach gives 3.0 times the throughput of no prefetching, compared to 3.33 times for your approach.

Given the performance vs complexity, I'm leaning towards "my" simpler approach. Since you've already done the implementation, can you post it in a separate draft PR so we can see it?

Maybe the simple approach can be optimized a bit more, without too much added complexity. (For example, prefetching the next 2-3 top-level buckets so we don't slow down when there's an empty bucket.)

@NadavGigi (Contributor, Author) commented

The batch iterator processes keys about 11% faster than Viktor's approach.
SAVE command time is the same for the batch iterator and Viktor's approach.

Thanks for implementing and comparing this! From your table we can see that "my" approach gives 3.0 times the throughput of no prefetching, compared to 3.33 times for your approach.

Given the performance vs complexity, I'm leaning towards "my" simpler approach. Since you've already done the implementation, can you post it in a separate draft PR so we can see it?

Maybe the simple approach can be optimized a bit more, without too much added complexity. (For example, prefetching the next 2-3 top-level buckets so we don't slow down when there's an empty bucket.)

@zuiderkwast
The implementation has been refined and now yields even better results. It's very similar to your approach but with some optimizations.

@zuiderkwast (Contributor) left a comment


Awesome! This implementation is much simpler and it's impressive that it's just as fast as the other one.

It seems that it could be optimized a bit more to avoid a few more cache misses, or maybe I am missing something?

In general, the basic idea for prefetching is to only follow links one step at a time, with some time between each step, to make sure that we only access memory that has already been prefetched. I guess we can do this where we do iter->index++:

  • Prefetch the bucket at index + 2 (just valkey_prefetch(b))
  • Prefetch the top-level elements and child-bucket pointer at index + 1. (The bucket itself has been prefetched the last time we incremented the index.)
  • Prefetch the first-level child-bucket elements at the current index. We then start returning elements from the current top-level bucket, whose elements are already prefetched.

This makes sure that when we do index++ the next time, we only access memory that has already been previously prefetched. Right?

To also support long chains of child-buckets, we can prefetch the sub-child-bucket elements when we start returning elements from the child-bucket, and so on.

If this gets too complex and doesn't give any better performance, then ignore it.
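
For illustration, a rough sketch of this stepped scheme, reusing the hypothetical types and helpers from the earlier sketch (getBucketAt, isPositionFilled, ENTRIES_PER_BUCKET, the child pointer); this is the reviewer's idea as described above, not the merged code:

static void iterAdvanceBucket(hashtable *ht, size_t *index) {
    (*index)++;
    /* Step 1: prefetch just the bucket struct two steps ahead. */
    bucket *ahead2 = getBucketAt(ht, *index + 2);
    if (ahead2) valkey_prefetch(ahead2);
    /* Step 2: the bucket at index + 1 was prefetched on the previous
     * increment; now prefetch its entries and its child bucket struct. */
    bucket *ahead1 = getBucketAt(ht, *index + 1);
    if (ahead1) {
        for (int pos = 0; pos < ENTRIES_PER_BUCKET; pos++) {
            if (isPositionFilled(ahead1, pos)) valkey_prefetch(ahead1->entries[pos]);
        }
        if (ahead1->child) valkey_prefetch(ahead1->child);
    }
    /* Step 3: the current bucket's entries are already warm; prefetch its
     * child bucket's entries before returning elements from this bucket. */
    bucket *cur = getBucketAt(ht, *index);
    if (cur && cur->child) {
        for (int pos = 0; pos < ENTRIES_PER_BUCKET; pos++) {
            if (isPositionFilled(cur->child, pos)) valkey_prefetch(cur->child->entries[pos]);
        }
    }
}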

@NadavGigi NadavGigi force-pushed the batch_iterator branch 2 times, most recently from 2125fbe to e6bdc5a on January 8, 2025 15:01
@zuiderkwast (Contributor) left a comment


This looks great to me now.

Can you run the benchmark again after the last changes?

@zuiderkwast zuiderkwast added the release-notes label (This issue should get a line item in the release notes) on Jan 8, 2025
@NadavGigi (Contributor, Author) commented

This looks great to me now.

Can you run the benchmark again after the last changes?

After running the benchmarks, I can confirm that the results remain the same as those presented in the table from the first comment.

@zuiderkwast zuiderkwast merged commit 9f4815a into valkey-io:unstable Jan 8, 2025
48 checks passed
Labels: release-notes
4 participants