Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for dims kwarg #6

Closed
pxl-th opened this issue Oct 16, 2024 · 14 comments
Closed

Support for dims kwarg #6

pxl-th opened this issue Oct 16, 2024 · 14 comments
Labels
enhancement New feature or request

Comments

@pxl-th
Copy link
Member

pxl-th commented Oct 16, 2024

Base methods, such as accumulate!, mapreduce have support for dims kwarg.
Is there a plan for adding such support here?
We can then replace other kernels from AMDGPU/CUDA with AK implementation.

@anicusan
Copy link
Member

anicusan commented Oct 18, 2024

Hi Anton, sorry for the late reply - I've been thinking about the best parallelisation approach for n-dimensional reductions; the lazy way would be to simply run the same reduction steps over each index permutation bar the dims one over which we're running the reduction - while using the same amount of shared memory as the 1D case, its synchronisation overhead increases with the number of elements. Not great, and it would not be performant for big arrays where the reduced dimension is small.

I am trying to implement a parallelisation approach where each reduction operation done by each thread runs over all index permutations bar dims (at the expense of more shared memory used), but this requires shared memory whose size is dependent on the specific dimension-wise sizes of the input array - which are runtime values, and hence cannot be used to define @private Int (runtime_length,). I need to think a bit more about shared memory sizes...

Still, the 1D and n-dimensional cases would follow separate codepaths for maximum performance - the n-dimensional case simply needs more scaffolding / memory / indices than the 1D one, and no device-to-host copying (which is needed if dims is not specified).

If you want to, you can forward the call for the 1D case to AK while I wrestle with shared memory. I will make the functions accept N-dimensional arrays which will be operated over linear indices, like Julia Base reductions without specified dims.

@anicusan anicusan added the enhancement New feature or request label Oct 18, 2024
@pxl-th
Copy link
Member Author

pxl-th commented Oct 18, 2024

Thanks for working on this!

If you want to use runtime values, you can wrap them in Val before passing to the kernel:

@kernel function ker(x, ::Val{runtime_length}) where {runtime_length}
    m = @private Int (runtime_length,)
    ...
end

ker(ROCBackend())(x, Val(size(x, 2)))

However, every time you pass a different value it will recompile the kernel, so probably not good if they change a lot.

@anicusan
Copy link
Member

anicusan commented Nov 4, 2024

Hi, I think I cracked it! I wrote two parallelisation approaches depending on whether the reduced dimension has fewer elements than the other dimensions combined (e.g. reduce(+, rand(3, 1000), dims=1) has only 3 elements in the reduced axis).

On my Mac M3 I get the following benchmark results:

# N-dimensional reduction benchmark against Base
using Metal
using KernelAbstractions
import AcceleratedKernels as AK

using BenchmarkTools
using Random
Random.seed!(0)


function sum_base(s; dims)
    d = reduce(+, s; init=zero(eltype(s)), dims=dims)
    KernelAbstractions.synchronize(get_backend(s))
    d
end


function sum_ak(s; dims)
    d = AK.reduce(+, s; init=zero(eltype(s)), dims=dims)
    KernelAbstractions.synchronize(get_backend(s))
    d
end


# Make array with highly unequal per-axis sizes
s = MtlArray(rand(Int32(1):Int32(100), 10, 100_000))

# Correctness
@assert sum_base(s, dims=1) == sum_ak(s, dims=1)
@assert sum_base(s, dims=2) == sum_ak(s, dims=2)

# Benchmarks
println("\nReduction over small axis - AK vs Base")
display(@benchmark sum_ak($s, dims=1))
display(@benchmark sum_base($s, dims=1))

println("\nReduction over long axis - AK vs Base")
display(@benchmark sum_ak($s, dims=2))
display(@benchmark sum_base($s, dims=2))
Reduction over small axis - AK vs Base
BenchmarkTools.Trial: 5711 samples with 1 evaluation.
 Range (min … max):  237.292 μs … 83.431 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     444.000 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   875.866 μs ±  3.540 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▅▂                                                          ▁
  ███▆▅▅▃▃▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▃▁▁▄▁▁▅▃▁▃▃▄▄▄▃ █
  237 μs        Histogram: log(frequency) by time      23.1 ms <

 Memory estimate: 6.87 KiB, allocs estimate: 230.
BenchmarkTools.Trial: 5296 samples with 1 evaluation.
 Range (min … max):  445.750 μs … 39.186 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     877.438 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   947.043 μs ±  1.070 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

                    ▂▃▃▆██▆▆▂                                   
  ▁▁▂▂▂▂▃▄▃▂▂▁▂▂▂▂▂▆█████████▆▅▃▂▂▂▂▂▂▂▃▃▄▃▃▃▂▂▂▂▂▂▂▂▂▁▁▂▁▁▁▁▁ ▃
  446 μs          Histogram: frequency by time         1.55 ms <

 Memory estimate: 6.01 KiB, allocs estimate: 209.

Reduction over long axis - AK vs Base
BenchmarkTools.Trial: 4739 samples with 1 evaluation.
 Range (min … max):  493.708 μs … 80.576 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     822.292 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.049 ms ±  2.197 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

   █▂                                                           
  ▄██▄▂▂▂▂▂▂▁▁▁▁▁▁▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▂▂▂▂▁▁▂▂▂▁▂▁▁▂▁▁▂▂ ▂
  494 μs          Histogram: frequency by time         11.3 ms <

 Memory estimate: 6.53 KiB, allocs estimate: 228.
BenchmarkTools.Trial: 3242 samples with 1 evaluation.
 Range (min … max):  328.209 μs … 167.401 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     699.604 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.535 ms ±   7.122 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▅                                                             
  ███▆▅▅▄▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▃▁▁▃▁▃▁▃▁▁▁▁▃▁▁▃▁▃▃▄▄▄▃▁▃▃▃▄▁▁▁▃ █
  328 μs        Histogram: log(frequency) by time       33.7 ms <

 Memory estimate: 11.57 KiB, allocs estimate: 407.

AK gets slightly better if we set block_size=512, but some machines (e.g. Intel Graphics) only allow block_size <= 256, so I left the conservative defaults in.

I'd be curious to see benchmarks on a ROCm machine if you have one on hand.

@pxl-th
Copy link
Member Author

pxl-th commented Nov 7, 2024

Great work! I just tested it on AMD GPU with your script, with block_size=512 performance is definitely better:

Reduction over small axis - AK vs Base
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min  max):  47.157 μs  864.506 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     56.875 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   71.476 μs ±  31.530 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▃▄▆██▆▄▃▃▂▁                          ▁▁▃▄▄▄▄▃▂▁▁            ▂
  █████████████▇▇▇▇▅▅▅▅▅▆▄▅▄▅▅▅▃▅▆▅▃▅▆▇▇████████████▇▆▆▆▅▃▆▄▄▄ █
  47.2 μs       Histogram: log(frequency) by time       154 μs <

 Memory estimate: 4.12 KiB, allocs estimate: 145.
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min  max):   99.655 μs  677.307 ms  ┊ GC (min  max): 0.00%  1.66%
 Time  (median):     111.988 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   193.832 μs ±   6.772 ms  ┊ GC (mean ± σ):  0.58% ± 0.02%

     ▃▅██▇▆▄▂                           ▁▂▃▄▄▄▃▂▂▂▂▁▁           ▂
  ▄▇██████████▆▆▆▆▇▇▆▇▅▆▅▆▅▅▄▄▇▄▄▄▂▄▃▅▅▇█████████████▇▇▇▇▆▅▅▄▅▅ █
  99.7 μs       Histogram: log(frequency) by time        206 μs <

 Memory estimate: 4.06 KiB, allocs estimate: 148.

Reduction over long axis - AK vs Base
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min  max):  87.613 μs  474.281 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     90.619 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   91.768 μs ±   7.733 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

      ▃█▂                                                       
  ▂▂▃▅███▆▆▆▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▁▂▁▂▂▂▂▂▂▁▂▂▂▂▂▂ ▃
  87.6 μs         Histogram: frequency by time          118 μs <

 Memory estimate: 4.03 KiB, allocs estimate: 139.
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min  max):  76.562 μs  469.373 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     79.528 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   80.609 μs ±   7.350 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

      ▅█▅▁                                                      
  ▂▂▄█████▆▅▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▁▂▂▂▂▂▂▁▂▂▂▂▂▂ ▃
  76.6 μs         Histogram: frequency by time          106 μs <

 Memory estimate: 8.67 KiB, allocs estimate: 221.

@pxl-th
Copy link
Member Author

pxl-th commented Nov 7, 2024

We can use occupancy API for optimal block_size before launching the kernel.
But I think it KernelAbstractions.jl needs some work to expose this API.

@anicusan
Copy link
Member

anicusan commented Nov 8, 2024

That's great news, thank you for testing it on AMD. In terms of optimal kernel launches, I still like the generality and simplicity in something like:

popt = AK.@tune reduce(f, src, init=init, block_size=$block_size) block_size=(64, 128, 256, 512, 1024)

and

# Not tied to any specific backend, algorithm or API
popt = AK.@tune begin
    reduce(f, src, init=init,
           block_size=$block_size,
           switch_below=$switch_below)
    block_size=(64, 128, 256, 512, 1024)
    switch_below=(1, 10, 100, 1000, 10000)
end

Which could try all permutations in the $ arguments and return the optimal combination. One of these days I should learn Julia macros properly...

@anicusan
Copy link
Member

N-dimensional reduce and mapreduce implemented and tested.

Keeping this issue open until we add an N-dimensional accumulate too.

@yolhan83
Copy link

yolhan83 commented Nov 13, 2024

Hello, I wonder if it could be possible to have a foreachaxes thing that would iterate on a specific axes similar to

for i in axes(A,3)
end

something like

foreachaxes(A,3) do i
end

Maybe this would just mean making

function _foreachaxes_gpu(
    f,
    itr,
    n::Int,
    backend::GPU;

    block_size::Int=256,
)
    # GPU implementation
    @argcheck block_size > 0
    _foreachindex_global!(backend, block_size)(f, axes(itr,n), ndrange=size(itr,n))
    nothing
end

and the other related ones too, or will there be an issue with the closure f ?
Thank you a lot for this package btw.

@anicusan
Copy link
Member

Hi @yolhan83 , it was quite simple to add - it is available on the AcceleratedKernels#main version on the repo now :)

However, do keep in mind that on GPUs it is better to have more threads doing less work each (ideally uniformly), than fewer threads doing more work. So if you can process your elements independently, even if you're using a multidimensional array, it's better to use foreachindex (which just indexes into each element linearly).

@yolhan83
Copy link

Yeah the main purpose I see of adding this was to when most axes are set to some constant and you want to loop on the last ones without making a view but yes better not try to do things in a resulting ndim-1 array inside this loop for sure. Thank you ☺️

@yolhan83
Copy link

yolhan83 commented Nov 13, 2024

Maybe if it leads to people doing too many work by thread this could be changed so that when looping on a multidimensional array, you will have 1 thread per index but and easy way to retrieve the dimensions of the original arrays like a CartesianIndex way of thinking, don't know if I'm clear enough here sorry 😅 may be a foreachCartesianIndex ? The one you added could still be here since it's a different purpose.

@anicusan
Copy link
Member

anicusan commented Nov 13, 2024

you want to loop on the last ones without making a view

That makes perfect sense.

have 1 thread per index but and easy way to retrieve the dimensions of the original arrays

I think you can use CartesianIndex inside a GPU kernel; you can also access size and strides if you need to do fancy index trickery manually (I did that in the N-dimensional reduce if you're curious). Or simply, for a 2D array you can do:

# GPU version with two nested `for` loops linearised
AK.foreachindex(1:nrows * ncols, backend) do i
    irow, icol = divrem(i - 1, ncols)
    # ... do work with `irow` and `icol`
end

@anicusan
Copy link
Member

Hi all, while my liver recovered post-Christmas I implemented an N-dimensional accumulate algorithm too - so we now have full dims support for accumulate, reduce and mapreduce!

On top of them I also added higher-order arithmetic functions (sum, prod, minimum, maximum, count, any, all); they are all available in AcceleratedKernels v0.2.2.

If you test/benchmark them on your machines at any point, please let me know how it goes; I'll close this issue now. Happy new year :)

@anicusan anicusan closed this as completed Jan 4, 2025
@pxl-th
Copy link
Member Author

pxl-th commented Jan 6, 2025

Thanks! I'll try them soon and post an update here!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants