
[stdlib] Faster hashing logic in string-keyed dicts (~2x speed-up) #3436

Open

wants to merge 30 commits into base: nightly from faster-hash-in-string-based-dicts-approach-2
Conversation

@msaelices (Contributor) commented Aug 31, 2024

The idea is to avoid the SIMD-based hashing algorithm when the Dict keys are strings, which are usually small.
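The replacement hash (shown later in the review diff) is a djb2-style string hash. As a rough Python sketch of the same logic — the function name and the 64-bit mask are illustrative assumptions, and this version hashes UTF-8 bytes whereas the PR's Mojo code iterates code points:

```python
def djb2_hash(s: str) -> int:
    """djb2-style string hash: start at 5381, then hash = hash * 33 + byte."""
    h = 5381
    for b in s.encode("utf-8"):
        h = (h * 33 + b) & 0xFFFFFFFFFFFFFFFF  # keep the value within 64 bits
    return h

print(djb2_hash("THE"))
```

For short keys this is a handful of scalar multiply-adds, with none of the setup cost of a SIMD hash.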

Benchmark 1

Same benchmark used in this other optimization: #3071

from collections import Dict
from time import now

alias iteration_size = 128  # 2048

def main():
    var result: Int = 0
    var start = now()
    var stop = now()

    small2 = Dict[Int, Int]()
    start = now()
    for x in range(100):
        for i in range(iteration_size):
            small2[i] = i
        for i in range(iteration_size):
            result += small2[i]
    stop = now()
    print("Int dicts:", stop - start, "ns", result, "rows")

    small3 = Dict[String, String]()
    start = now()
    for x in range(100):
        for i in range(iteration_size):
            small3[str(i)] = str(i)
        for i in range(iteration_size):
            result += len(small3[str(i)])
    stop = now()
    print("String dicts:", stop - start, "ns", result, "rows")

Results (lower is better)

Before :

Int dicts: 5420564 ns 209612800 rows
String dicts: 273638909 ns 210321000 rows

After:

Int dicts: 3266663 ns 209612800 rows
String dicts: 126893375 ns 210321000 rows

This is roughly a 2x speed-up for Dict[String, X].
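Sanity-checking the claimed factors from the Benchmark 1 numbers above, with plain Python arithmetic:

```python
# Timings (ns) copied from the before/after runs above.
int_before, int_after = 5_420_564, 3_266_663
str_before, str_after = 273_638_909, 126_893_375

print(f"Int dicts:    {int_before / int_after:.2f}x")   # ~1.66x
print(f"String dicts: {str_before / str_after:.2f}x")   # ~2.16x
```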

Benchmark 2

Results (the last column is timing, lower is better)

Before:

version,n_wds,n_keys,the,sec
13638,1944,236,0.0083167180000000007
13638,1944,236,0.008287684
13638,1944,236,0.0084045869999999998
13638,1944,236,0.0083934049999999996
13638,1944,236,0.0083934230000000006
13638,1944,236,0.0084422609999999995
13638,1944,236,0.0084124249999999994
13638,1944,236,0.0084025450000000008
13638,1944,236,0.0084028699999999998
13638,1944,236,0.0083881839999999999

After:

version,n_wds,n_keys,the,sec
13638,1943,236,0.0027207440000000002
13638,1943,236,0.0026398810000000002
13638,1943,236,0.0026396229999999998
13638,1943,236,0.002676248
13638,1943,236,0.0027269859999999998
13638,1943,236,0.0027547840000000001
13638,1943,236,0.0027066960000000002
13638,1943,236,0.0027737180000000001
13638,1943,236,0.0028008479999999999
13638,1943,236,0.0029300910000000001

This means a ~3x speed-up in this benchmark.
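As a sanity check on that figure, averaging the `sec` column of both tables (values copied from the runs above):

```python
# Per-run timings (sec) from the before/after tables above.
before = [0.008316718, 0.008287684, 0.008404587, 0.008393405, 0.008393423,
          0.008442261, 0.008412425, 0.008402545, 0.008402870, 0.008388184]
after = [0.002720744, 0.002639881, 0.002639623, 0.002676248, 0.002726986,
         0.002754784, 0.002706696, 0.002773718, 0.002800848, 0.002930091]

speedup = (sum(before) / len(before)) / (sum(after) / len(after))
print(f"mean speed-up: {speedup:.2f}x")
```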

Benchmark algorithm used:

from collections import List, Dict
from time import now

fn get_wds() raises -> List[String]:
    # String shortened because of GitHub's PR description length limit
    # Taken from this: https://github.com/ekbrown/scripting_for_linguists/blob/main/0a0HuaT4Vm7FoYvccyRRQj.txt
    input = String("""
Hey friends, it's your girl Bray. Enjoy Jolene. Welcome to back to her. If you aspire to heal evolve or revolutionize this podcast is for you. Make sure you subscribe and follow us on Instagram at official back to her...
    """)
    return input.upper().split(" ")


fn get_freqs(wds: List[String]) raises -> Dict[String, UInt64]:
    var freqs = Dict[String, UInt64]()
    for wd_ref in wds:
        wd = wd_ref[]
        if wd in freqs:
            freqs[wd] = freqs[wd] + 1
        else:
            freqs[wd] = 1
    return freqs


fn main() raises:
    var wds: List[String] = get_wds()
    var n_wds = len(wds)

    var out_path = "report.csv"
    with open(out_path, "w") as outfile:
        outfile.write(str("version,n_wds,n_keys,the,sec\n"))
        for _ in range(10):
            var t0 = now()
            var freqs = get_freqs(wds)
            var t1 = now()
            var duration = (t1 - t0) / 1_000_000_000
            var the = freqs["THE"]
            var n_keys = len(freqs.keys())
            var out_str = str(n_wds) + "," + str(n_keys) + "," + str(the) + "," + str(duration) + "\n"
            outfile.write(out_str)
    print("DONE, saved to", out_path)
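Since the PR compares against Python's dict performance, a CPython version of the same frequency-count kernel may be useful as a baseline. This is an illustrative sketch, not the PR's actual comparison script; the word list here is a stand-in for the real input file:

```python
import time

def get_freqs(wds):
    """Count word frequencies with a plain Python dict (string keys)."""
    freqs = {}
    for wd in wds:
        freqs[wd] = freqs.get(wd, 0) + 1
    return freqs

# Illustrative input; the benchmark above uses a much larger transcript.
wds = "the quick brown fox jumps over the lazy dog the end".upper().split(" ")

t0 = time.perf_counter()
freqs = get_freqs(wds)
t1 = time.perf_counter()
print(len(wds), len(freqs), freqs["THE"], t1 - t0)
```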

My machine:

> lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         39 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  20
  On-line CPU(s) list:   0-19
Vendor ID:               GenuineIntel
  Model name:            12th Gen Intel(R) Core(TM) i7-12700H
    CPU family:          6
    Model:               154
    Thread(s) per core:  2
    Core(s) per socket:  14
    Socket(s):           1
    Stepping:            3
    CPU max MHz:         4700,0000
    CPU min MHz:         400,0000
    BogoMIPS:            5376.00
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch
                         _perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm ss
                         e4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l2 cdp_l2 ssbd ibrs ibpb stibp ibrs_enhanced tp
                         r_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdt_a rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 x
                         saves split_lock_detect avx_vnni dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req hfi vnmi umip pku ospke waitpkg gfni vaes vpclmulqdq rdpid movd
                         iri movdir64b fsrm md_clear serialize arch_lbr ibt flush_l1d arch_capabilities
Virtualization features: 
  Virtualization:        VT-x
Caches (sum of all):     
  L1d:                   544 KiB (14 instances)
  L1i:                   704 KiB (14 instances)
  L2:                    11,5 MiB (8 instances)
  L3:                    24 MiB (1 instance)
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-19
Vulnerabilities:         
  Gather data sampling:  Not affected
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Not affected
  Spec rstack overflow:  Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI BHI_DIS_S
  Srbds:                 Not affected
  Tsx async abort:       Not affected

Faster String.__hash__() algorithm that avoids the builtin `hash`
function, which uses SIMD underneath and, for some reason, slows the
hashing down. The hashing logic was making Mojo's string-keyed dicts
slower than Python's; with this optimization they should now be
faster.

See modularml#1747

Signed-off-by: Manuel Saelices <[email protected]>
So, we will not change the hash algorithm for all strings, only for the
ones used as keys in Dicts, which are expected to be small most of the time

Signed-off-by: Manuel Saelices <[email protected]>
@msaelices msaelices requested a review from a team as a code owner August 31, 2024 10:28
@msaelices msaelices changed the title Faster hash in string based dicts - Approach 2 [stdlib] Faster hashing logic in string-keyed dicts - Less disruptive approach Aug 31, 2024
@msaelices msaelices changed the title [stdlib] Faster hashing logic in string-keyed dicts - Less disruptive approach [stdlib] Faster hashing logic in string-keyed dicts (~2x speed-up) - Less disruptive approach Aug 31, 2024
@martinvuyk (Contributor)

Yeah we need better overload mechanics with trait conformance for this.

Just a quick workaround: since K is known at compile time, you could just do

@parameter
if K in (String, StringLiteral, ...):
  hash_string(...)
else:
  hash(...)

@msaelices msaelices force-pushed the faster-hash-in-string-based-dicts-approach-2 branch from 5b4311b to 6817159 Compare September 23, 2024 23:19
Signed-off-by: Manuel Saelices <[email protected]>
@msaelices msaelices force-pushed the faster-hash-in-string-based-dicts-approach-2 branch from 6817159 to fc8c226 Compare September 23, 2024 23:20
@msaelices msaelices changed the title [stdlib] Faster hashing logic in string-keyed dicts (~2x speed-up) - Less disruptive approach [stdlib] Faster hashing logic in string-keyed dicts (~2x speed-up) Sep 23, 2024
@msaelices (Contributor, Author)

msaelices commented Oct 5, 2024

Yeah we need better overload mechanics with trait conformance for this.

Just a quick workaround: since K is known at compile time, you could just do

@parameter
if K in (String, StringLiteral, ...):
  hash_string(...)
else:
  hash(...)

I think it does not work:

[screenshot of the compiler output omitted]

Also, K == String or K == StringLiteral does not work either, as KeyElement does not implement __eq__

Anyway, the current solution, using _type_is_eq, works pretty well :)
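`_type_is_eq` is a Mojo stdlib internal that compares two types at compile time, so the string branch is selected with zero runtime cost. As a rough runtime Python analogue of the resulting dispatch (the `djb2`/`hash_key` names are illustrative, not the PR's code):

```python
def djb2(s: str) -> int:
    """djb2-style string hash, as in the PR's specialized path."""
    h = 5381
    for b in s.encode("utf-8"):
        h = (h * 33 + b) & 0xFFFFFFFFFFFFFFFF
    return h

def hash_key(key) -> int:
    # Dispatch on the key's type, analogous to Mojo's compile-time
    # _type_is_eq check: strings get the cheap djb2 hash, everything
    # else falls back to the builtin hash().
    if isinstance(key, str):
        return djb2(key)
    return hash(key)

print(hash_key("THE"), hash_key(42))
```

The difference is that Python pays for the `isinstance` check on every call, while Mojo's `@parameter`-style branch is resolved entirely at compile time.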

@soraros (Contributor) left a comment

Good work! However, I do wonder if we still need this after #3604?

Comment on lines +73 to +76
    var hash = 5381  # typical starting value
    for c in s.as_string_slice():
        hash = ((hash << 5) + hash) + ord(c)  # hash * 33 + ord(char)
    return hash

Do we need to hash by code points? If not, you can gain free performance by using this:

Suggested change

-    var hash = 5381 # typical starting value
-    for c in s.as_string_slice():
-        hash = ((hash << 5) + hash) + ord(c) # hash * 33 + ord(char)
-    return hash
+    hash = 5381 # typical starting value
+    for c in s.as_bytes_span():
+        hash = hash * 33 + int(c[]) # hash * 33 + byte value
+    return hash

The compiler can turn your `hash * 33` into a shift+add (I checked the generated assembly). LLVM is indeed good at this kind of scalar optimisation.
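The equivalence the reviewer relies on is that a left shift by 5 multiplies by 32, so shift-and-add and the plain multiply compute the same value; a quick Python check:

```python
# ((hash << 5) + hash) is hash * 32 + hash == hash * 33, so both
# spellings hash identically; LLVM freely rewrites one into the other
# during strength reduction.
for h in (0, 1, 5381, 123_456_789):
    assert (h << 5) + h == h * 33
print("shift+add and * 33 agree")
```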

@@ -55,6 +56,44 @@ trait RepresentableKeyElement(KeyElement, Representable):
pass


@always_inline("nodebug")

Suggested change

-@always_inline("nodebug")
+@always_inline

ref: a1ecf50

@martinvuyk (Contributor)

Also, K == String or K == StringLiteral does not work either as KeyElement does not implement __eq__
Anyway, the current solution, using _type_is_eq, works pretty well :)

yeah sorry I imagined some syntax there 🤣

msaelices added a commit to msaelices/aoc2023 that referenced this pull request Oct 12, 2024
It's slower but more readable. It will get faster over time as Mojo dicts
and strings become faster. E.g. see
modularml/mojo#3436
modularml/mojo#3528
modularml/mojo#3615

Signed-off-by: Manuel Saelices <[email protected]>
3 participants