Reduce multiple consecutive values in each thread to improve efficiency #112
Conversation
A grain size of 4 seems to be ideal based on testing in https://gist.github.com/maxwindiff/63b4e8f64c73c4b5b4ff00dd4fb79c5d. It's a bit eerie that the running times are exactly the same for all scalar data types, whether it's …
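As a side note for readers, here is a minimal CPU sketch in Python of what a grain size means in this PR; the names (`grain`, `group_size`, `reduce_group`) are illustrative assumptions, not Metal.jl's actual identifiers. Each simulated thread first sums `grain` consecutive values serially, and only the per-thread partial sums enter the shared-memory tree reduction:

```python
# CPU sketch of a grain-size reduction within one "threadgroup".
# Assumption: names and sizes here are illustrative, not the PR's real API.

def reduce_group(data, group_size=8, grain=4):
    """Simulate one threadgroup reducing group_size * grain values."""
    assert len(data) == group_size * grain
    # Phase 1: each "thread" serially sums `grain` consecutive values.
    partial = [sum(data[t * grain:(t + 1) * grain]) for t in range(group_size)]
    # Phase 2: tree reduction over per-thread partials (shared memory on a GPU).
    stride = group_size // 2
    while stride > 0:
        for t in range(stride):
            partial[t] += partial[t + stride]
        stride //= 2
    return partial[0]

print(reduce_group(list(range(32))))  # 496, i.e. sum(0..31)
```

With `grain = 1` this degenerates to a plain tree reduction; larger grains trade parallelism for fewer synchronization rounds, which is why an intermediate value like 4 can win.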
Maybe verify execution times with …
It doesn't seem to be able to capture timing information. Do you have any idea what I'm missing? I remember seeing the timing graphs once, but I couldn't get it to work again.

```julia
julia> da = MtlArray(fill(Float32(1), 8192 * 8192));
2023-03-04 16:02:35.649 julia18[9346:290846] Metal GPU Frame Capture Enabled

julia> Metal.@profile sum(da)
[ Info: GPU frame capture saved to /Users/kichi/.julia/dev/Metal/julia_capture_1.gputrace/
6.7108864f7
```
I think I know what happened: I didn't use …
Re-did the benchmarks; this time things look more reasonable: https://gist.github.com/maxwindiff/3ff4886b3e6c1d8ba9ecafb13f84eab7. A grain size of …
Made a few more changes:
Now performance is better across the board:
There doesn't seem to be a limit to the number of threadgroups. For example, this is pretty close to the maximum device buffer size and it runs fine:

```julia
julia> a = MtlArray(fill(1, 30000, 30000, 2));

julia> sum(a; dims=3)
30000×30000×1 MtlArray{Int64, 3}:
[:, :, 1] =
 2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  …  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
 2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
...
```

So I think not having the …
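To put the "no limit to the number of threadgroups" observation in perspective, here is a rough back-of-the-envelope estimate in Python; `group_size` and `grain` are assumed illustrative values, not Metal.jl's actual launch parameters:

```python
import math

# Illustrative estimate of how many threadgroups a "cover the entire input"
# strategy needs for the 30000x30000x2 array above.
# Assumption: group_size and grain are typical example values, not the PR's.
n_elements = 30000 * 30000 * 2
group_size = 1024          # a common maximum threads-per-threadgroup
grain = 4                  # values reduced serially by each thread
groups = math.ceil(n_elements / (group_size * grain))
print(groups)  # 439454 threadgroups
```

Even for this near-maximum buffer, the dispatch stays in the hundreds of thousands of groups, which is consistent with the reduction running fine without a chunking fallback.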
The reduction tests in GPUArrays seem pretty comprehensive already. This is ready for review.
Actually, thinking about it more, the array dimensions in …
Adding a test here: JuliaGPU/GPUArrays.jl#459
Awesome work, thanks for looking into this! Mind rebasing after I generated conflicts everywhere with the …? Since this doesn't add tests (it shouldn't), I assume you verified this works with the GPUArrays PR for large reduction tests? Merging and tagging that may take a bit, but I'd like to merge this in the meantime.
I'm not too familiar with Xcode yet, but I think you have to hit Replay to get performance measurements.
Done, rebased! Yep, this passes the new GPUArrays tests.
Using ideas from:
This replaces the old "reduce serially across chunks of input vector that don't fit in a group" loop, which doesn't seem to apply to Metal.jl since we always launch enough threadgroups to cover the entire input.
In a simple test this is 2-3x faster than HEAD and achieves > 130 GB/s in the 1D reduction case. I'll need to experiment with the grain size parameter and document it, and test more complex cases such as non-power-of-2 sizes or reducing only certain dimensions.

Before:
After:
Helps with #46
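To make the strategy described above concrete, here is a hedged CPU sketch in Python; identifiers such as `grain`, `group_size`, and `gpu_style_sum` are illustrative assumptions, not the PR's actual names. Enough "threadgroups" are launched to cover the entire input, each thread reduces `grain` consecutive values, and the per-group partials are reduced in a second pass:

```python
import math

# Illustrative CPU model of the described launch strategy. Assumption: all
# names and the sequential second pass stand in for GPU kernel launches.
def gpu_style_sum(data, group_size=4, grain=4):
    elems_per_group = group_size * grain
    # Launch enough groups to cover the entire input (no serial chunking loop).
    n_groups = math.ceil(len(data) / elems_per_group)
    group_partials = []
    for g in range(n_groups):
        base = g * elems_per_group
        # Each of the group's threads sums its `grain` consecutive values;
        # out-of-range reads contribute the neutral element (0 for sum).
        partials = [
            sum(data[base + t * grain + i]
                for i in range(grain)
                if base + t * grain + i < len(data))
            for t in range(group_size)
        ]
        group_partials.append(sum(partials))  # stands in for in-group tree reduction
    # A second reduction pass combines the per-group partials.
    return sum(group_partials)

print(gpu_style_sum(list(range(100))))  # 4950, matching sum(range(100))
```

Note how the boundary guard makes non-power-of-2 sizes work without a separate code path, which is one of the cases the description says still needs testing.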