Added new ScanPrefix accumulate algorithm #15

Merged
merged 2 commits into from
Dec 23, 2024

anicusan
Member

Added a second `accumulate` algorithm, `ScanPrefixes`, which uses coupled lookback over pre-scanned prefixes at the cost of one extra kernel launch. `ScanPrefixes` is now the default on Metal.

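This fixes the decoupled-lookback issue on Metal, where `accumulate` sometimes fails because `@synchronize` gives weaker guarantees than on other platforms.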
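To illustrate the general idea of scanning per-block prefixes in a separate pass, here is a minimal CPU-only sketch. It is not the AcceleratedKernels implementation: `blocked_scan` and its `block_size` keyword are hypothetical names chosen for this example, and the mapping of passes to GPU kernel launches is an assumption about how such a scheme is typically laid out.

```julia
# CPU-only sketch of a blocked inclusive scan using pre-scanned block prefixes.
# `blocked_scan` is a hypothetical helper for illustration, not an AK function.
function blocked_scan(op, v::AbstractVector{T}; block_size::Int=4) where {T}
    n = length(v)
    out = similar(v)
    n == 0 && return out
    num_blocks = cld(n, block_size)
    totals = Vector{T}(undef, num_blocks)

    # Pass 1: inclusive scan inside each block, recording each block's total.
    for b in 1:num_blocks
        lo = (b - 1) * block_size + 1
        hi = min(b * block_size, n)
        acc = v[lo]
        out[lo] = acc
        for i in (lo + 1):hi
            acc = op(acc, v[i])
            out[i] = acc
        end
        totals[b] = acc
    end

    # Pass 2: scan the block totals themselves. On a GPU this would
    # presumably be the extra launch (an assumption, not taken from the PR).
    prefixes = accumulate(op, totals)

    # Pass 3: combine every block after the first with the scanned prefix
    # of the block before it.
    for b in 2:num_blocks
        lo = (b - 1) * block_size + 1
        hi = min(b * block_size, n)
        offset = prefixes[b - 1]
        for i in lo:hi
            out[i] = op(offset, out[i])
        end
    end
    return out
end

# Usage: the result should match Base's accumulate.
@assert blocked_scan(+, collect(1:10); block_size=4) == accumulate(+, 1:10)
```

Because every block reads a prefix that was fully computed in an earlier pass, this layout does not rely on blocks observing each other's progress within a single pass, which is the kind of dependency a decoupled-lookback scan has.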

@anicusan
Member Author

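We're still faster than the current default Metal `accumulate` by median time (5.93 ms vs 7.75 ms on 10 million elements), although Base records the lower minimum (3.37 ms vs 4.54 ms):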

using BenchmarkTools
using Metal
import AcceleratedKernels as AK

using Random
Random.seed!(0)

function akacc(v)
    va = AK.accumulate(+, v, init=zero(eltype(v)), block_size=1024)
    Metal.synchronize()
    va
end

function baseacc(v)
    va = accumulate(+, v, init=zero(eltype(v)))
    Metal.synchronize()
    va
end

v = MtlArray(rand(1:100, 10_000_000))

# Correctness checks
va = akacc(v) |> Array
vb = baseacc(v) |> Array
@assert va == vb

# Benchmarks
println("Base vs AK")
display(@benchmark baseacc($v))
display(@benchmark akacc($v))

Timings (Base first, then AK):

julia> include("accumulate_benchmark.jl")
Base vs AK
BenchmarkTools.Trial: 603 samples with 1 evaluation.
 Range (min … max):  3.369 ms … 52.091 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     7.746 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.300 ms ±  6.782 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▆▄▁  ▁ ▅▇▅▂                                                
  █████████████▇▅▇▇▆▇▅▆▅▄▅▄▄▁▄▁▁▅▁▅▁▆▆▄▆▄▇▅▄▄▄▄▁▁▁▁▁▁▁▁▁▄▄▅▆ ▇
  3.37 ms      Histogram: log(frequency) by time     36.9 ms <

 Memory estimate: 45.41 KiB, allocs estimate: 1568.
BenchmarkTools.Trial: 644 samples with 1 evaluation.
 Range (min … max):  4.535 ms … 35.595 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.928 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.770 ms ±  4.089 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▅▁    ▁▂▁▂▄▄▃▂                                             
  ███▇▆▁▆██████████▇▄▆▆▇▆▇▅▇▅▆▁▁▅▁▁▁▄▄▁▁▄▁▁▄▄▁▁▁▅▁▁▁▁▁▁▄▄▁▁▄ ▇
  4.53 ms      Histogram: log(frequency) by time     27.6 ms <

 Memory estimate: 16.63 KiB, allocs estimate: 565.

@anicusan anicusan merged commit 3e814ca into main Dec 23, 2024
31 of 32 checks passed
Development

Successfully merging this pull request may close these issues.

accumulate on Metal sometimes fails due to weaker @synchronize guarantees than on other platforms