[stdlib] Faster hashing logic in string-keyed dicts (~2x speed-up) #3436
base: nightly
Conversation
Faster `String.__hash__()` algorithm that avoids the builtin `hash` function, which uses SIMD underneath and was somehow slowing the hashing logic down. The hashing logic was making Mojo string-keyed dicts slower than Python's; with this optimization they should now be faster. See modularml#1747. Signed-off-by: Manuel Saelices <[email protected]>
So, we will not change the hash algorithm for all strings, only for the ones used as keys in Dicts, which are expected to be small most of the time. Signed-off-by: Manuel Saelices <[email protected]>
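As an aside, the multiply-by-33 scheme being adopted is the classic djb2 string hash. A minimal Python sketch for illustration (the `djb2_hash` name and the 64-bit mask are my own choices, not the PR's exact code):

```python
def djb2_hash(s: str) -> int:
    """Classic djb2: start at 5381, then hash = hash * 33 + byte."""
    h = 5381
    for b in s.encode("utf-8"):
        h = (h * 33 + b) & 0xFFFFFFFFFFFFFFFF  # stay within 64 bits
    return h

print(djb2_hash("key"))
```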
Workaround for modularml#3437. Signed-off-by: Manuel Saelices <[email protected]>
…sh in strings Signed-off-by: Manuel Saelices <[email protected]>
which to call depending on how we initialize the dict Signed-off-by: Manuel Saelices <[email protected]>
Yeah we need better overload mechanics with trait conformance for this. Just a quick workaround: since `K` is known at compile time, you could just do

```mojo
@parameter
if K in (String, StringLiteral, ...):
    hash_string(...)
else:
    hash(...)
```
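For readers unfamiliar with the idea, the equivalent dispatch can be sketched at runtime in Python (`hash_string` and `hash_key` are hypothetical names; Mojo would resolve this branch at compile time via `@parameter`):

```python
def hash_string(s: str) -> int:
    # Hypothetical specialized string hash (djb2-style), standing in
    # for the PR's scalar string-hashing path.
    h = 5381
    for b in s.encode("utf-8"):
        h = (h * 33 + b) & 0xFFFFFFFFFFFFFFFF
    return h

def hash_key(key: object) -> int:
    # Pick the specialized hash for string keys; fall back to the
    # generic hash for every other key type.
    if isinstance(key, str):
        return hash_string(key)
    return hash(key)

print(hash_key("a"))   # specialized path
print(hash_key(10))    # generic path
```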
I think it does not work: Also, Anyway, the current solution, using
Good work! However, I do wonder if we still need this after #3604?
```mojo
var hash = 5381  # typical starting value
for c in s.as_string_slice():
    hash = ((hash << 5) + hash) + ord(c)  # hash * 33 + ord(char)
return hash
```
Do we need hash by code-points? If not, you can gain free performance by using this:
```suggestion
var hash = 5381  # typical starting value
for c in s.as_bytes_span():
    hash = hash * 33 + int(c[])  # hash * 33 + ord(char)
return hash
```
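The two loops only agree on ASCII input, since a multi-byte UTF-8 character contributes several bytes but a single code point. A quick Python check of that divergence (helper names are mine):

```python
MASK = 0xFFFFFFFFFFFFFFFF

def hash_codepoints(s: str) -> int:
    # djb2 over Unicode code points (what ord(c) sees)
    h = 5381
    for ch in s:
        h = (h * 33 + ord(ch)) & MASK
    return h

def hash_bytes(s: str) -> int:
    # djb2 over raw UTF-8 bytes (what a byte span sees)
    h = 5381
    for b in s.encode("utf-8"):
        h = (h * 33 + b) & MASK
    return h

print(hash_codepoints("abc") == hash_bytes("abc"))  # ASCII input agrees
print(hash_codepoints("é") == hash_bytes("é"))      # multi-byte input differs
```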
The compiler can turn your `hash * 33` into a shift+add (I checked the generated assembly). LLVM is indeed good at doing this kind of scalar optimisation.
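The strength-reduction identity the reviewer is relying on, since 33 = 32 + 1, can be checked directly in Python:

```python
# 33 == 32 + 1, so h * 33 == (h << 5) + h for any integer h
for h in (0, 1, 5381, 123_456_789):
    assert h * 33 == (h << 5) + h
print("ok")
```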
```diff
@@ -55,6 +56,44 @@ trait RepresentableKeyElement(KeyElement, Representable):
     pass
 
+@always_inline("nodebug")
```
Suggested change:

```diff
-@always_inline("nodebug")
+@always_inline
```
ref: a1ecf50
yeah sorry I imagined some syntax there 🤣
It's slower but more readable. It will get faster over time as Mojo dicts and strings become faster. E.g. see modularml/mojo#3436, modularml/mojo#3528, modularml/mojo#3615. Signed-off-by: Manuel Saelices <[email protected]>
The idea is to avoid the SIMD hashing algorithm when the Dict keys are strings, which are usually small ones.
Benchmark 1
Same benchmark used in this other optimization: #3071
Results (lower is better)
Before:
After:
This is roughly a 2x speed-up in `Dict[String, X]`.

Benchmark 2
Results (the last column is timing, lower is better)
Before:
After:
This means a ~3x speed-up in this benchmark.
Benchmark algorithm used:
My machine: