
Accelerate hash table iterator with prefetching #1501

Merged: 1 commit into valkey-io:unstable on Jan 8, 2025

Conversation

@NadavGigi (Contributor) commented Jan 1, 2025

This PR introduces improvements to the hashtable iterator, implementing the prefetching technique described in the blog post Unlock One Million RPS - Part 2. The changes lay the groundwork for further enhancements in use cases involving iterators. Future PRs will build upon this foundation to improve performance and functionality in various iterator-dependent operations.

In the pursuit of maximizing iterator performance, I conducted a comprehensive series of experiments. My tests encompassed a wide range of approaches, including processing multiple bucket indices in parallel, prefetching the next bucket upon completion of the current one, and several other timing and quantity variations. Surprisingly, after rigorous testing and performance analysis, the simplest implementation, presented in this PR, consistently outperformed all of the more complex strategies.

Implementation

Each time we start iterating over a bucket, we prefetch data for future iterations (a sketch follows this list):

  • We prefetch the entries of the next bucket (if it exists).
  • We prefetch the structure (but not the entries) of the bucket after the next.

This prefetching is done when we pick up a new bucket, increasing the chance that the data will be in cache by the time we need it.
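
For illustration, here is a minimal sketch of that step in C. This is not the PR's actual code: the bucket layout and the helpers getBucketAt and isPositionFilled are hypothetical stand-ins; only valkey_prefetch reflects the prefetch hint the PR relies on.

#include <stddef.h>

/* Illustrative types only; the real valkey bucket layout differs. */
#define ENTRIES_PER_BUCKET 7
typedef struct bucket {
    struct bucket *child; /* chained child bucket, if any */
    void *entries[ENTRIES_PER_BUCKET];
} bucket;
typedef struct hashtable hashtable; /* opaque here */

bucket *getBucketAt(hashtable *ht, size_t index); /* hypothetical helper */
int isPositionFilled(bucket *b, int pos);         /* hypothetical helper */
void valkey_prefetch(const void *ptr);            /* cache prefetch hint */

static void prefetchNextBuckets(hashtable *ht, size_t index) {
    /* Prefetch the entries of the next bucket, if it exists. */
    bucket *next = getBucketAt(ht, index + 1);
    if (next) {
        for (int pos = 0; pos < ENTRIES_PER_BUCKET; pos++) {
            if (isPositionFilled(next, pos)) valkey_prefetch(next->entries[pos]);
        }
    }
    /* Prefetch only the struct (not the entries) of the bucket after the next. */
    bucket *afterNext = getBucketAt(ht, index + 2);
    if (afterNext) valkey_prefetch(afterNext);
}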

Performance

The data below was collected by running the KEYS command on a 64-core Graviton3 Amazon EC2 instance with 50 million keys of 100 bytes each. The duration of the "keys *" command was taken from the output of the "info all" command.

+--------------------+------------------+-----------------------------+
| prefetching        | Time (seconds)   | Keys Processed per Second   |
+--------------------+------------------+-----------------------------+
| No                 | 11.112279        | 4,499,529                   |
| Yes                | 3.141916         | 15,913,862                  |
+--------------------+------------------+-----------------------------+
Improvement:
Comparing the iterator without prefetching to the one with prefetching,
the prefetching version is 11.112279 / 3.141916 ≈ 3.54 times faster.

SAVE command improvement

Setup (a reproduction sketch follows this list):

  • 64-core Graviton3 Amazon EC2 instance.
  • 50 million keys of 100 bytes each.
  • valkey server running on a RAM file system.
  • CRC checksum and compression disabled.
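
Not part of the PR, but a setup like this could be reproduced roughly as follows; the exact invocation is an assumption (valkey-benchmark and the rdbchecksum/rdbcompression parameters are inherited from Redis):

# Populate ~50 million random keys with 100-byte values.
valkey-benchmark -t set -n 50000000 -r 50000000 -d 100
# Turn off RDB checksum and compression for the SAVE test.
valkey-cli config set rdbchecksum no
valkey-cli config set rdbcompression no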

Results

+--------------------+------------------+-----------------------------+
| prefetching        | Time (seconds)   | Keys Processed per Second   |
+--------------------+------------------+-----------------------------+
| No                 | 28               | 1,785,700                   |
| Yes                | 19.6             | 2,550,000                   |
+--------------------+------------------+-----------------------------+
Improvement:
- Reduced SAVE time by 30% (8.4 seconds faster)
- Increased key processing rate by 42.8% (764,300 more keys/second)

@NadavGigi NadavGigi changed the title from "Improving iterator using prefetch" to "Accelerate hash table iterator with prefetching" on Jan 1, 2025
@NadavGigi NadavGigi force-pushed the batch_iterator branch 2 times, most recently from e001ab1 to ae465ad on January 2, 2025 10:44
@ranshid ranshid requested a review from uriyage January 2, 2025 16:26
@madolson (Member) commented Jan 2, 2025

How does this compare to having an iterator that actually returns a batch of items? Something like:

void **entries; /* e.g., a batch of up to 7 entries (one bucket) */
size_t num_entries;
entries = getBatchEntries(iterator, &num_entries);
if (entries) {
    for (size_t i = 0; i < num_entries; i++) {
        whatever(entries[i]);
    }
}

I generally prefer to avoid manually executing prefetching when we can just efficiently process the data, as we then give more hints to the compiler and the processor so it can efficiently do its own re-ordering and prefetching.

@madolson (Member) commented Jan 2, 2025

It's important to note that while we refer to 'threads' in this implementation, we're not actually using operating system threads.

Then don't name them threads, it makes the implementation much harder to follow.

@NadavGigi NadavGigi closed this Jan 5, 2025
@NadavGigi NadavGigi reopened this Jan 5, 2025
@NadavGigi NadavGigi force-pushed the batch_iterator branch 3 times, most recently from 86230a2 to 8b2a3a5 on January 5, 2025 11:23
@zuiderkwast zuiderkwast self-requested a review January 5, 2025 16:28
@zuiderkwast (Contributor) commented Jan 5, 2025

I agree with Madelyn that the word "threads" is not suitable. Can we call it batches?

This looks quite complex. Was this adapted from an implementation for dict?

For hashtable, there is a natural batch of one hashtable bucket (up to 7 elements). Can't we just prefetch all elements in a hashtable bucket and the child bucket, if any? When we return the last element in the bucket, we'd prefetch the next bucket's elements. Then we just read and return the elements as before without keeping track of anything. It'd be much simpler.

For KEYS, we just want to return the elements, so I guess it can be faster with a more complex solution, but for other uses like RDB dump and AOF rewrite, we do more things after each call to hashtableNext, so I would assume most of the benefit can be seen even with the simpler approach.

Did you try the simpler approach already?

@NadavGigi (Contributor, Author) commented Jan 6, 2025

How does this compare to having an iterator that actually returns a batch of items? Something like:

void **entries; /* e.g., a batch of up to 7 entries (one bucket) */
size_t num_entries;
entries = getBatchEntries(iterator, &num_entries);
if (entries) {
    for (size_t i = 0; i < num_entries; i++) {
        whatever(entries[i]);
    }
}

I generally prefer to avoid manually executing prefetching when we can just efficiently process the data, as we then give more hints to the compiler and the processor so it can efficiently do its own re-ordering and prefetching.
@madolson
Thank you for your suggestion. If I understand correctly, this approach would necessitate changes to every piece of code that currently uses the iterator, not just the iterator implementation itself. However, I may have misinterpreted your proposal. Could you please provide further explanation or clarification on how this batch iterator could be implemented without requiring widespread changes to the existing codebase?

It's important to note that while we refer to 'threads' in this implementation, we're not actually using operating system threads.

Then don't name them threads, it makes the implementation much harder to follow.

Done.

@NadavGigi (Contributor, Author) commented Jan 6, 2025

I agree with Madelyn that the word "threads" is not suitable. Can we call it batches?

This looks quite complex. Was this adapted from an implementation for dict?

For hashtable, there is a natural batch of one hashtable bucket (up to 7 elements). Can't we just prefetch all elements in a hashtable bucket and the child bucket, if any? When we return the last element in the bucket, we'd prefetch the next bucket's elements. Then we just read and return the elements as before without keeping track of anything. It'd be much simpler.

For KEYS, we just want to return the elements, so I guess it can be faster with a more complex solution, but for other uses like RDB dump and AOF rewrite, we do more things after each call to hashtableNext, so I would assume most of the benefit can be seen even with the simpler approach.

Did you try the simpler approach already?
@zuiderkwast

  1. I have changed the name "threads" to bucketPrefetchInfo, since it's the number of buckets we scan in parallel.
  2. Yes, it was adapted from the dict implementation.
  3. No, I had only tried a simpler version that benefits from prefetching the current bucket ("batch 1" in the performance table).
     Following your comment, I have implemented the solution you suggested (indeed, a very easy approach); the results below are from running the KEYS command on a 64-core Graviton3 Amazon EC2 instance with 50 million keys of 100 bytes each:
+--------------------+------------------+-----------------------------+
| Implementation     | Time (seconds)   | Keys Processed per Second   |
+--------------------+------------------+-----------------------------+
| Iterator without   | 11.112279        | 4,499,529                   |
|    prefetching     |                  |                             |
| Viktor's approach  | 3.712602         | 13,467,954                  |
| batch iterator     | 3.336432         | 14,985,700                  |
+--------------------+------------------+-----------------------------+

The batch iterator processes keys about 11% faster than Viktor's approach.
SAVE command time is the same for the batch iterator and Viktor's approach.

@NadavGigi NadavGigi force-pushed the batch_iterator branch 2 times, most recently from 204d534 to 9b8461b on January 6, 2025 12:28
codecov bot commented Jan 6, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.84%. Comparing base (c0014ef) to head (4de5b1a).
Report is 3 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1501      +/-   ##
============================================
+ Coverage     70.82%   70.84%   +0.02%     
============================================
  Files           120      120              
  Lines         64911    64915       +4     
============================================
+ Hits          45972    45992      +20     
+ Misses        18939    18923      -16     
Files with missing lines   Coverage Δ
src/hashtable.c            75.91% <100.00%> (-0.42%) ⬇️

... and 15 files with indirect coverage changes

@zuiderkwast (Contributor) commented

The batch iterator processes keys about 11% faster than Viktor's approach.
SAVE command time is the same for the batch iterator and Viktor's approach.

Thanks for implementing and comparing this! From your table we can see that "my" approach gives 3.0 times the throughput of no prefetching, compared to 3.33 times for your approach.

Given the performance vs complexity, I'm leaning towards "my" simpler approach. Since you've already done the implementation, can you post it in a separate draft PR so we can see it?

Maybe the simple approach can be optimized a bit more, without too much added complexity. (For example, prefetching the next 2-3 top-level buckets so we don't slow down when there's an empty bucket.)

@NadavGigi (Contributor, Author) commented

The batch iterator processes keys about 11% faster than Viktor's approach.
SAVE command time is the same for the batch iterator and Viktor's approach.

Thanks for implementing and comparing this! From your table we can see that "my" approach gives 3.0 times the throughput of no prefetching, compared to 3.33 times for your approach.

Given the performance vs complexity, I'm leaning towards "my" simpler approach. Since you've already done the implementation, can you post it in a separate draft PR so we can see it?

Maybe the simple approach can be optimized a bit more, without too much added complexity. (For example, prefetching the next 2-3 top-level buckets so we don't slow down when there's an empty bucket.)

@zuiderkwast
The implementation has been refined and now yields even better results. It's very similar to your approach but with some optimizations.

@zuiderkwast (Contributor) left a comment


Awesome! This implementation is much simpler and it's impressive that it's just as fast as the other one.

It seems that it could be optimized a bit more to avoid a few more cache misses, or maybe I am missing something?

In general, the basic idea for prefetching is to only follow links one step at a time, with some time between each step, to make sure that we only access memory that has already been prefetched. I guess we can do this where we do iter->index++:

  • Prefetch the bucket at index + 2 (just valkey_prefetch(b))
  • Prefetch the top-level elements and child-bucket pointer at index + 1. (The bucket itself has been prefetched the last time we incremented the index.)
  • Prefetch the first-level child-bucket elements at the current index. We then start returning elements from the current top-level bucket, whose elements are already prefetched.

This makes sure that when we do index++ the next time, we only access memory that has already been previously prefetched. Right?

To also support long chains of child-buckets, we can prefetch the sub-child-bucket elements when we start returning elements from the child-bucket, and so on.

If this gets too complex and doesn't give any better performance, then ignore it.
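
For illustration, a rough sketch of this stepped scheme, reusing the hypothetical types and helpers from the earlier sketch (getBucketAt, isPositionFilled, ENTRIES_PER_BUCKET, the child pointer); this is the reviewer's idea as described above, not the merged code:

static void iterAdvanceBucket(hashtable *ht, size_t *index) {
    (*index)++;
    /* Step 1: prefetch just the bucket struct two steps ahead. */
    bucket *ahead2 = getBucketAt(ht, *index + 2);
    if (ahead2) valkey_prefetch(ahead2);
    /* Step 2: the bucket at index + 1 was prefetched on the previous
     * increment; now prefetch its entries and its child bucket struct. */
    bucket *ahead1 = getBucketAt(ht, *index + 1);
    if (ahead1) {
        for (int pos = 0; pos < ENTRIES_PER_BUCKET; pos++) {
            if (isPositionFilled(ahead1, pos)) valkey_prefetch(ahead1->entries[pos]);
        }
        if (ahead1->child) valkey_prefetch(ahead1->child);
    }
    /* Step 3: the current bucket's entries are already warm; prefetch its
     * child bucket's entries before returning elements from this bucket. */
    bucket *cur = getBucketAt(ht, *index);
    if (cur && cur->child) {
        for (int pos = 0; pos < ENTRIES_PER_BUCKET; pos++) {
            if (isPositionFilled(cur->child, pos)) valkey_prefetch(cur->child->entries[pos]);
        }
    }
}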

@NadavGigi NadavGigi force-pushed the batch_iterator branch 2 times, most recently from 2125fbe to e6bdc5a on January 8, 2025 15:01
@zuiderkwast (Contributor) left a comment


This looks great to me now.

Can you run the benchmark again after the last changes?

@zuiderkwast zuiderkwast added the release-notes label (This issue should get a line item in the release notes) on Jan 8, 2025
@NadavGigi (Contributor, Author) commented

This looks great to me now.

Can you run the benchmark again after the last changes?

After running the benchmarks, I can confirm that the results remain the same as those presented in the table from the first comment.

@zuiderkwast zuiderkwast merged commit 9f4815a into valkey-io:unstable Jan 8, 2025
48 checks passed
Labels: release-notes
4 participants