Accelerate hash table iterator with prefetching #1501
Conversation
How does this compare to having an iterator that actually returns a batch of items? Something like:
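(The original snippet isn't preserved here; the following is a minimal sketch of the kind of batch API being described. The name hashtableNextBatch and its signature are hypothetical, not an existing valkey function.)

```c
#include <stddef.h>

typedef struct hashtableIterator hashtableIterator;

/* Hypothetical batch API: fill up to 'capacity' elements into 'dst'
 * and return how many were written; 0 means iteration is finished. */
size_t hashtableNextBatch(hashtableIterator *iter, void **dst, size_t capacity);

static void processAll(hashtableIterator *iter) {
    void *batch[16];
    size_t n;
    /* Drain the table one cache-friendly batch at a time. */
    while ((n = hashtableNextBatch(iter, batch, 16)) > 0) {
        for (size_t i = 0; i < n; i++) {
            /* process batch[i] */
        }
    }
}
```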
I generally prefer to avoid manual prefetching when we can just process the data efficiently, as we then give more hints to the compiler and the processor so they can do their own re-ordering and prefetching efficiently.
Then don't name them threads; it makes the implementation much harder to follow.
I agree with Madelyn that the word "threads" is not suitable. Can we call it batches?
This looks quite complex. Was this adapted from an implementation for dict? For hashtable, there is a natural batch of one hashtable bucket (up to 7 elements). Can't we just prefetch all the elements in a hashtable bucket, and the child bucket if any? When we return the last element in the bucket, we'd prefetch the next bucket's elements. Then we'd just read and return the elements as before, without keeping track of anything. It'd be much simpler.
For KEYS, we just want to return the elements, so I guess it can be faster with a more complex solution, but for other usages like RDB dump and AOF rewrite, we do more things after each call to hashtableNext, so I would assume most of the benefit can be had even with the simpler approach. Did you try the simpler approach already?
Done.
The batch iterator processes keys about 11% faster than Viktor's approach.
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files:

@@            Coverage Diff             @@
##           unstable    #1501    +/-   ##
==========================================
+ Coverage     70.82%   70.84%   +0.02%
==========================================
  Files           120      120
  Lines         64911    64915        +4
==========================================
+ Hits          45972    45992       +20
+ Misses        18939    18923       -16
Thanks for implementing and comparing this! From your table we can see that "my" approach gives 3.0 times the throughput of no prefetching, compared to 3.33 times for your approach, Nadav. Given the performance vs. complexity trade-off, I'm leaning towards "my" simpler approach. Since you've already done the implementation, can you post it in a separate draft PR so we can see it? Maybe the simple approach can be optimized a bit more without too much added complexity. (For example, prefetching the next 2-3 top-level buckets so we don't stall when there's an empty bucket.)
@zuiderkwast
Awesome! This implementation is much simpler and it's impressive that it's just as fast as the other one.
It seems it could be optimized a bit more to avoid a few more cache misses, or maybe I'm missing something?
In general, the basic idea of prefetching is to follow links only one step at a time, with some time between each step, so that we only ever access memory that has already been prefetched. I guess we can do this where we do iter->index++:
- Prefetch the bucket at index + 2 (just valkey_prefetch(b)).
- Prefetch the top-level elements and the child-bucket pointer at index + 1. (The bucket itself was prefetched the last time we incremented the index.)
- Prefetch the first-level child-bucket elements at the current index. We then start returning the current top-level bucket's elements, which are already prefetched.
This makes sure that when we do index++ the next time, we only access memory that has already been prefetched. Right?
To also support long chains of child-buckets, we can prefetch the sub-child-bucket elements when we start returning elements from the child-bucket, and so on.
If this gets too complex and doesn't give any better performance, then ignore it.
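A minimal sketch of that staggered scheme, assuming an open-addressing table whose buckets hold up to 7 element slots plus an optional child-bucket pointer; all struct, field, and helper names here are illustrative, not the actual valkey internals, and bounds handling is simplified:

```c
#include <stddef.h>

/* Illustrative layout, not the actual valkey structs. */
#define NUM_SLOTS 7
#define valkey_prefetch(p) __builtin_prefetch(p)

typedef struct bucket {
    void *entries[NUM_SLOTS];
    struct bucket *child; /* overflow chain, or NULL */
} bucket;

typedef struct {
    bucket *table;
    size_t num_buckets;
    size_t index;
} hashtableIter;

static void prefetchBucketContents(bucket *b) {
    for (int i = 0; i < NUM_SLOTS; i++)
        if (b->entries[i]) valkey_prefetch(b->entries[i]);
    if (b->child) valkey_prefetch(b->child);
}

/* Called at the point where the iterator does iter->index++. */
static void advanceIndex(hashtableIter *it) {
    it->index++;
    /* 1. Touch the bucket header two steps ahead. */
    if (it->index + 2 < it->num_buckets)
        valkey_prefetch(&it->table[it->index + 2]);
    /* 2. The bucket at index + 1 had its header prefetched on the
     *    previous increment; now prefetch its elements and child. */
    if (it->index + 1 < it->num_buckets)
        prefetchBucketContents(&it->table[it->index + 1]);
    /* 3. Prefetch the current bucket's child-bucket elements; its
     *    top-level elements were prefetched in step 2 last time, so
     *    returning them should hit cache. */
    bucket *cur = &it->table[it->index];
    if (cur->child) prefetchBucketContents(cur->child);
}
```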
This looks great to me now.
Can you run the benchmark again after the last changes?
After running the benchmarks, I can confirm that the results remain the same as those presented in the table from the first comment.
This PR introduces improvements to the hashtable iterator, implementing the prefetching technique described in the blog post Unlock One Million RPS - Part 2. The changes lay the groundwork for further enhancements in use cases involving iterators. Future PRs will build upon this foundation to improve performance and functionality in various iterator-dependent operations.
In the pursuit of maximizing iterator performance, I conducted a comprehensive series of experiments. My tests covered a wide range of approaches, including processing multiple bucket indices in parallel, prefetching the next bucket upon completing the current one, and several other timing and quantity variations. Surprisingly, after rigorous testing and performance analysis, the simplest implementation, presented in this PR, consistently outperformed all of the more complex strategies.
Implementation
Each time we start iterating over a bucket, we prefetch data for future iterations:
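(The original snippet isn't reproduced here; the following is a hedged sketch of the idea, reusing the illustrative bucket layout and NUM_SLOTS from the staggered-scheme sketch earlier in this thread, not the literal PR diff.)

```c
/* Sketch only: when the iterator picks up bucket 'i', prefetch the
 * memory that upcoming iterations will dereference. */
static void prefetchForFutureIterations(bucket *table, size_t i, size_t num_buckets) {
    bucket *b = &table[i];
    /* The elements we are about to return one by one. */
    for (int j = 0; j < NUM_SLOTS; j++)
        if (b->entries[j]) __builtin_prefetch(b->entries[j]);
    /* The overflow child bucket, if any. */
    if (b->child) __builtin_prefetch(b->child);
    /* The next top-level bucket header, so advancing the index
     * doesn't miss. */
    if (i + 1 < num_buckets) __builtin_prefetch(&table[i + 1]);
}
```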
This prefetching is done when we pick up a new bucket, increasing the chance that the data will be in cache by the time we need it.
Performance
The data below was gathered by running the “keys *” command on a 64-core Graviton3 Amazon EC2 instance loaded with 50 million keys of 100 bytes each. The duration of the “keys *” command was taken from the output of the “info all” command.
SAVE command improvement
Setup:
Results