
[stdlib] Faster hashing logic in string-keyed dicts (~2x speed-up) #3436

Open

wants to merge 30 commits into base: nightly from faster-hash-in-string-based-dicts-approach-2
Conversation

@msaelices (Contributor) commented Aug 31, 2024

The idea is to avoid the SIMD-based hashing algorithm when the Dict keys are strings, which are usually small.
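The replacement hash (shown later in the review diff) is a djb2-style string hash. As a rough Python sketch of the same logic — the function name and the 64-bit mask are illustrative assumptions, and this version hashes UTF-8 bytes whereas the PR's Mojo code iterates code points:

```python
def djb2_hash(s: str) -> int:
    """djb2-style string hash: start at 5381, then hash = hash * 33 + byte."""
    h = 5381
    for b in s.encode("utf-8"):
        h = (h * 33 + b) & 0xFFFFFFFFFFFFFFFF  # keep the value within 64 bits
    return h

print(djb2_hash("THE"))
```

For short keys this is a handful of scalar multiply-adds, with none of the setup cost of a SIMD hash.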

Benchmark 1

Same benchmark used in this other optimization: #3071

from collections import Dict
from time import now

alias iteration_size = 128  # 2048

def main():
    var result: Int = 0
    var start = now()
    var stop = now()

    small2 = Dict[Int, Int]()
    start = now()
    for x in range(100):
        for i in range(iteration_size):
            small2[i] = i
        for i in range(iteration_size):
            result += small2[i]
    stop = now()
    print("Int dicts:", stop - start, "ns", result, "rows")

    small3 = Dict[String, String]()
    start = now()
    for x in range(100):
        for i in range(iteration_size):
            small3[str(i)] = str(i)
        for i in range(iteration_size):
            result += len(small3[str(i)])
    stop = now()
    print("String dicts:", stop - start, "ns", result, "rows")

Results (lower is better)

Before :

Int dicts: 5420564 ns 209612800 rows
String dicts: 273638909 ns 210321000 rows

After:

Int dicts: 3266663 ns 209612800 rows
String dicts: 126893375 ns 210321000 rows

This is roughly a 2x speed-up for Dict[String, X].
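Sanity-checking the claimed factors from the Benchmark 1 numbers above, with plain Python arithmetic:

```python
# Timings (ns) copied from the before/after runs above.
int_before, int_after = 5_420_564, 3_266_663
str_before, str_after = 273_638_909, 126_893_375

print(f"Int dicts:    {int_before / int_after:.2f}x")   # ~1.66x
print(f"String dicts: {str_before / str_after:.2f}x")   # ~2.16x
```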

Benchmark 2

Results (the last column is timing, lower is better)

Before:

version,n_wds,n_keys,the,sec
13638,1944,236,0.0083167180000000007
13638,1944,236,0.008287684
13638,1944,236,0.0084045869999999998
13638,1944,236,0.0083934049999999996
13638,1944,236,0.0083934230000000006
13638,1944,236,0.0084422609999999995
13638,1944,236,0.0084124249999999994
13638,1944,236,0.0084025450000000008
13638,1944,236,0.0084028699999999998
13638,1944,236,0.0083881839999999999

After:

version,n_wds,n_keys,the,sec
13638,1943,236,0.0027207440000000002
13638,1943,236,0.0026398810000000002
13638,1943,236,0.0026396229999999998
13638,1943,236,0.002676248
13638,1943,236,0.0027269859999999998
13638,1943,236,0.0027547840000000001
13638,1943,236,0.0027066960000000002
13638,1943,236,0.0027737180000000001
13638,1943,236,0.0028008479999999999
13638,1943,236,0.0029300910000000001

This means a ~3x speed-up in this benchmark.
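As a sanity check on that figure, averaging the `sec` column of both tables (values copied from the runs above):

```python
# Per-run timings (sec) from the before/after tables above.
before = [0.008316718, 0.008287684, 0.008404587, 0.008393405, 0.008393423,
          0.008442261, 0.008412425, 0.008402545, 0.008402870, 0.008388184]
after = [0.002720744, 0.002639881, 0.002639623, 0.002676248, 0.002726986,
         0.002754784, 0.002706696, 0.002773718, 0.002800848, 0.002930091]

speedup = (sum(before) / len(before)) / (sum(after) / len(after))
print(f"mean speed-up: {speedup:.2f}x")
```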

Benchmark algorithm used:

from collections import List, Dict
from time import now

fn get_wds() raises -> List[String]:
    # String shortened because of GitHub's PR description length limit
    # Taken from this: https://github.com/ekbrown/scripting_for_linguists/blob/main/0a0HuaT4Vm7FoYvccyRRQj.txt
    input = String("""
Hey friends, it's your girl Bray. Enjoy Jolene. Welcome to back to her. If you aspire to heal evolve or revolutionize this podcast is for you. Make sure you subscribe and follow us on Instagram at official back to her...
    """)
    return input.upper().split(" ")


fn get_freqs(wds: List[String]) raises -> Dict[String, UInt64]:
    var freqs = Dict[String, UInt64]()
    for wd_ref in wds:
        wd = wd_ref[]
        if wd in freqs:
            freqs[wd] = freqs[wd] + 1
        else:
            freqs[wd] = 1
    return freqs


fn main() raises:
    var wds: List[String] = get_wds()
    var n_wds = len(wds)

    var out_path = "report.csv"
    with open(out_path, "w") as outfile:
        outfile.write(str("version,n_wds,n_keys,the,sec\n"))
        for _ in range(10):
            var t0 = now()
            var freqs = get_freqs(wds)
            var t1 = now()
            var duration = (t1 - t0) / 1_000_000_000
            var the = freqs["THE"]
            var n_keys = len(freqs.keys())
            var out_str = str(n_wds) + "," + str(n_keys) + "," + str(the) + "," + str(duration) + "\n"
            outfile.write(out_str)
    print("DONE, saved to", out_path)
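Since the PR compares against Python's dict performance, a CPython version of the same frequency-count kernel may be useful as a baseline. This is an illustrative sketch, not the PR's actual comparison script; the word list here is a stand-in for the real input file:

```python
import time

def get_freqs(wds):
    """Count word frequencies with a plain Python dict (string keys)."""
    freqs = {}
    for wd in wds:
        freqs[wd] = freqs.get(wd, 0) + 1
    return freqs

# Illustrative input; the benchmark above uses a much larger transcript.
wds = "the quick brown fox jumps over the lazy dog the end".upper().split(" ")

t0 = time.perf_counter()
freqs = get_freqs(wds)
t1 = time.perf_counter()
print(len(wds), len(freqs), freqs["THE"], t1 - t0)
```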

My machine:

> lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         39 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  20
  On-line CPU(s) list:   0-19
Vendor ID:               GenuineIntel
  Model name:            12th Gen Intel(R) Core(TM) i7-12700H
    CPU family:          6
    Model:               154
    Thread(s) per core:  2
    Core(s) per socket:  14
    Socket(s):           1
    Stepping:            3
    CPU max MHz:         4700,0000
    CPU min MHz:         400,0000
    BogoMIPS:            5376.00
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch
                         _perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm ss
                         e4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l2 cdp_l2 ssbd ibrs ibpb stibp ibrs_enhanced tp
                         r_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdt_a rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 x
                         saves split_lock_detect avx_vnni dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req hfi vnmi umip pku ospke waitpkg gfni vaes vpclmulqdq rdpid movd
                         iri movdir64b fsrm md_clear serialize arch_lbr ibt flush_l1d arch_capabilities
Virtualization features: 
  Virtualization:        VT-x
Caches (sum of all):     
  L1d:                   544 KiB (14 instances)
  L1i:                   704 KiB (14 instances)
  L2:                    11,5 MiB (8 instances)
  L3:                    24 MiB (1 instance)
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-19
Vulnerabilities:         
  Gather data sampling:  Not affected
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Not affected
  Spec rstack overflow:  Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI BHI_DIS_S
  Srbds:                 Not affected
  Tsx async abort:       Not affected

Faster String.__hash__() algorithm that avoids the builtin `hash`
function, which uses SIMD underneath and, for some reason, slows the
hashing down. The hashing logic was making Mojo's string-keyed dicts
slower than Python's; with this optimization they should now be
faster.

See modularml#1747

Signed-off-by: Manuel Saelices <[email protected]>
So, we will not change the hash algorithm for all strings, only for the
ones used as keys in Dicts, which are expected to be small most of the time

Signed-off-by: Manuel Saelices <[email protected]>
@msaelices msaelices requested a review from a team as a code owner August 31, 2024 10:28
@msaelices msaelices changed the title Faster hash in string based dicts - Approach 2 [stdlib] Faster hashing logic in string-keyed dicts - Less disruptive approach Aug 31, 2024
@msaelices msaelices changed the title [stdlib] Faster hashing logic in string-keyed dicts - Less disruptive approach [stdlib] Faster hashing logic in string-keyed dicts (~2x speed-up) - Less disruptive approach Aug 31, 2024
@martinvuyk (Contributor)

Yeah we need better overload mechanics with trait conformance for this.

Just a quick workaround: since K is known at compile time, you could just do

@parameter
if K in (String, StringLiteral, ...):
  hash_string(...)
else:
  hash(...)

@msaelices msaelices force-pushed the faster-hash-in-string-based-dicts-approach-2 branch from 5b4311b to 6817159 Compare September 23, 2024 23:19
Signed-off-by: Manuel Saelices <[email protected]>
@msaelices msaelices force-pushed the faster-hash-in-string-based-dicts-approach-2 branch from 6817159 to fc8c226 Compare September 23, 2024 23:20
@msaelices msaelices changed the title [stdlib] Faster hashing logic in string-keyed dicts (~2x speed-up) - Less disruptive approach [stdlib] Faster hashing logic in string-keyed dicts (~2x speed-up) Sep 23, 2024
@msaelices (Contributor, Author)

msaelices commented Oct 5, 2024

Yeah we need better overload mechanics with trait conformance for this.

Just a quick workaround: since K is known at compile time, you could just do

@parameter
if K in (String, StringLiteral, ...):
  hash_string(...)
else:
  hash(...)

I think it does not work:

[screenshot of the compiler output omitted]

Also, K == String or K == StringLiteral does not work either, as KeyElement does not implement __eq__

Anyway, the current solution, using _type_is_eq, works pretty well :)
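`_type_is_eq` is a Mojo stdlib internal that compares two types at compile time, so the string branch is selected with zero runtime cost. As a rough runtime Python analogue of the resulting dispatch (the `djb2`/`hash_key` names are illustrative, not the PR's code):

```python
def djb2(s: str) -> int:
    """djb2-style string hash, as in the PR's specialized path."""
    h = 5381
    for b in s.encode("utf-8"):
        h = (h * 33 + b) & 0xFFFFFFFFFFFFFFFF
    return h

def hash_key(key) -> int:
    # Dispatch on the key's type, analogous to Mojo's compile-time
    # _type_is_eq check: strings get the cheap djb2 hash, everything
    # else falls back to the builtin hash().
    if isinstance(key, str):
        return djb2(key)
    return hash(key)

print(hash_key("THE"), hash_key(42))
```

The difference is that Python pays for the `isinstance` check on every call, while Mojo's `@parameter`-style branch is resolved entirely at compile time.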

@soraros (Contributor) left a comment

Good work! However, I do wonder if we still need this after #3604?

Comment on lines +73 to +76
    var hash = 5381  # typical starting value
    for c in s.as_string_slice():
        hash = ((hash << 5) + hash) + ord(c)  # hash * 33 + ord(char)
    return hash

Do we need to hash by code points? If not, you can gain free performance by using this:

Suggested change

-    var hash = 5381 # typical starting value
-    for c in s.as_string_slice():
-        hash = ((hash << 5) + hash) + ord(c) # hash * 33 + ord(char)
-    return hash
+    hash = 5381 # typical starting value
+    for c in s.as_bytes_span():
+        hash = hash * 33 + int(c[]) # hash * 33 + byte value
+    return hash

The compiler can turn your `hash * 33` into a shift+add (I checked the generated assembly). LLVM is indeed good at this kind of scalar optimisation.
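The equivalence the reviewer relies on is that a left shift by 5 multiplies by 32, so shift-and-add and the plain multiply compute the same value; a quick Python check:

```python
# ((hash << 5) + hash) is hash * 32 + hash == hash * 33, so both
# spellings hash identically; LLVM freely rewrites one into the other
# during strength reduction.
for h in (0, 1, 5381, 123_456_789):
    assert (h << 5) + h == h * 33
print("shift+add and * 33 agree")
```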

@@ -55,6 +56,44 @@ trait RepresentableKeyElement(KeyElement, Representable):
pass


@always_inline("nodebug")

Suggested change

-@always_inline("nodebug")
+@always_inline

ref: a1ecf50

@martinvuyk (Contributor)

Also, K == String or K == StringLiteral does not work either as KeyElement does not implement __eq__
Anyway, the current solution, using _type_is_eq, works pretty well :)

yeah sorry I imagined some syntax there 🤣

msaelices added a commit to msaelices/aoc2023 that referenced this pull request Oct 12, 2024
It's slower but more readable. It will get faster over time as Mojo dicts
and strings become faster. E.g. see
modularml/mojo#3436
modularml/mojo#3528
modularml/mojo#3615

Signed-off-by: Manuel Saelices <[email protected]>
3 participants