
Reduce multiple consecutive values in each thread to improve efficiency #112

Merged · 6 commits into JuliaGPU:main · Mar 9, 2023

Conversation

@maxwindiff (Contributor) commented Mar 3, 2023

Using ideas from:

This replaces the old "reduce serially across chunks of input vector that don't fit in a group" loop, which doesn't seem to apply to Metal.jl since we always launch enough threadgroups to cover the entire input.
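As a conceptual sketch only (plain Julia, not the actual Metal kernel; `thread_partial`, `op`, and `neutral` are illustrative names), the idea is that each thread first reduces a small run of `grain` consecutive elements serially, and only the per-thread partial results go through the usual threadgroup tree reduction:

```julia
# Illustrative CPU model of the per-thread serial step: thread `t` (1-based)
# folds `grain` consecutive elements of `A` into one partial value with `op`.
function thread_partial(op, neutral, A, t, grain)
    acc = neutral
    base = (t - 1) * grain
    for i in 1:grain
        idx = base + i
        idx > length(A) && break
        acc = op(acc, A[idx])   # serial accumulation, no synchronization needed
    end
    return acc
end

# e.g. with grain = 4:
#   partials = [thread_partial(+, 0f0, A, t, 4) for t in 1:cld(length(A), 4)]
#   sum(partials) == sum(A)
# The partials are what the existing threadgroup tree reduction then combines.
```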

In a simple test this is 2-3x faster than HEAD and achieves > 130 GB/s in the 1D reduction case. I'll need to experiment with the stride (grain size) parameter and document it, and test more complex cases such as non-power-of-2 sizes or reducing only certain dimensions.

  • Experiment with grain size parameter and document it
  • Test performance of non-power-of-2 sizes
  • Test performance of reducing certain dimensions of N-D arrays
  • Check the behavior when number of threadgroups is too high
  • Write more tests as needed

Before:

using Metal, BenchmarkTools

# Allocate a host array of ones and upload a copy to the GPU.
function init(dims...)
  a = ones(Float32, dims...)
  return (a, MtlArray(a))
end

a, da = init(8192 * 8192);
b, db = init(8192, 8192);

julia> @btime sum(da)
  6.033 ms (754 allocations: 20.80 KiB)
6.7108864f7

julia> @btime sum(db)
  7.424 ms (759 allocations: 21.33 KiB)
6.7108864f7

After:

julia> @btime sum(da)
  2.030 ms (775 allocations: 21.20 KiB)
6.7108864f7

julia> @btime sum(db)
  3.009 ms (780 allocations: 21.72 KiB)
6.7108864f7

Helps with #46

@maxwindiff (Contributor, author)

4 seems to be the ideal grain size based on testing in https://gist.github.com/maxwindiff/63b4e8f64c73c4b5b4ff00dd4fb79c5d

It's a bit eerie that the running times are exactly the same for all scalar data types, whether it's UInt8, Float32 or Int64, but I don't see anything obviously wrong with the benchmarking code.

@maleadt (Member) commented Mar 4, 2023

Maybe verify execution times with Metal.@profile and Xcode?

@maxwindiff (Contributor, author) commented Mar 5, 2023

It doesn't seem to be able to capture timing information. Do you have any idea what I'm missing? I remember seeing the timing graphs once, but I couldn't get it to work again.

julia> da = MtlArray(fill(Float32(1), 8192 * 8192));
2023-03-04 16:02:35.649 julia18[9346:290846] Metal GPU Frame Capture Enabled

julia> Metal.@profile sum(da)
[ Info: GPU frame capture saved to /Users/kichi/.julia/dev/Metal/julia_capture_1.gputrace/
6.7108864f7

[Screenshots: Xcode GPU frame capture views, 2023-03-04]

@maxwindiff (Contributor, author)

I think I know what happened: I didn't use $<var> interpolation in @btime. I need to redo the tests.
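For reference, a minimal example of the difference: with BenchmarkTools, benchmarking a non-constant global without interpolation also measures dynamic dispatch on the untyped global, whereas interpolating with `$` measures just the call itself.

```julia
using Metal, BenchmarkTools

da = MtlArray(ones(Float32, 8192 * 8192))

@btime sum(da)    # `da` is a non-constant global; timing includes dispatch overhead
@btime sum($da)   # interpolated with $: measures only the reduction call
```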

@maxwindiff (Contributor, author)

I re-ran the benchmarks, and this time the results look more reasonable: https://gist.github.com/maxwindiff/3ff4886b3e6c1d8ba9ecafb13f84eab7

A grain size of 16 / sizeof(T) generally works well, except when reducing along certain dimensions of an N-D array. I can see why it isn't helpful to, say, read multiple values along the 2nd dimension of a 3D array, but it was surprising to see reads along the 1st dimension slowing things down... I need to investigate more.

sum((256, 512, 512); dims=[1])
grain=1 :  7.610 ms (272 allocations: 7.86 KiB)
grain=2 :  8.056 ms (273 allocations: 7.88 KiB)
grain=4 :  9.150 ms (273 allocations: 7.88 KiB)
grain=8 :  11.315 ms (273 allocations: 7.88 KiB)
grain=16:  20.130 ms (273 allocations: 7.88 KiB)

sum((256, 512, 512); dims=[2])
grain=1 :  9.178 ms (278 allocations: 7.95 KiB)
grain=2 :  8.769 ms (278 allocations: 7.95 KiB)
grain=4 :  9.287 ms (278 allocations: 7.95 KiB)
grain=8 :  12.515 ms (278 allocations: 7.95 KiB)
grain=16:  22.905 ms (278 allocations: 7.95 KiB)

@maxwindiff (Contributor, author) commented Mar 5, 2023

Made a few more changes:

  • Only read multiple values if the size of the reduction dimension is greater than threads_per_threadgroup (otherwise we just starve threads of work for no good reason; this is why sum((256, 512, 512); dims=1) was slow before)
  • Only read multiple values if the reduction dimension is potentially contiguous in memory (see the sketch below)
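A hedged sketch of this gating in plain Julia (illustrative names only; the actual kernel derives these quantities from Metal.jl launch parameters rather than taking them as arguments):

```julia
# Illustrative only: pick how many consecutive values each thread reads, based
# on the element type, the reduction length, the threadgroup size, and whether
# the reads along the reduction dimension would be contiguous in memory.
function choose_grain(::Type{T}, reduce_len::Int, threads_per_group::Int,
                      contiguous::Bool) where {T}
    if contiguous && reduce_len > threads_per_group
        return max(1, 16 ÷ sizeof(T))   # the 16 / sizeof(T) heuristic from above
    else
        return 1                        # fall back to one value per thread
    end
end
```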

Now performance is better across the board:
HEAD: https://gist.github.com/maxwindiff/4d861ae188c89ea48dd9c2986e349ca4
This PR: https://gist.github.com/maxwindiff/dcb8f3bcc926cbfc3ef55940fd35f345

@maxwindiff (Contributor, author)

There doesn't seem to be a limit on the number of threadgroups. For example, the following is pretty close to the maximum device buffer size, and it runs fine:

julia> a = MtlArray(fill(1, 30000, 30000, 2));

julia> sum(a; dims=3)
30000×30000×1 MtlArray{Int64, 3}:
[:, :, 1] =
 2  2  2  2  2  …  2  2  2  2  2
 2  2  2  2  2  …  2  2  2  2  2
 ⋮

So I think not having the `ireduce += localDim_reduce * groupDim_reduce` loop is fine.
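For context, a toy plain-Julia model of the removed pattern (illustrative names; the real kernel indexes via Metal intrinsics): a grid-stride loop lets a fixed number of threads cover arbitrarily long inputs, which is unnecessary when enough threadgroups are always launched to cover the input in one pass.

```julia
# Toy model of the removed grid-stride pattern: each "thread" walks the input
# in steps of the whole grid size, so nthreads can be smaller than the input.
function grid_stride_sum(A, nthreads)
    partials = zeros(eltype(A), nthreads)
    for t in 1:nthreads
        i = t
        while i <= length(A)
            partials[t] += A[i]
            i += nthreads            # the `ireduce += ...` stride this PR drops
        end
    end
    return sum(partials)
end
```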

@maxwindiff (Contributor, author)

The reduction tests in GPUArrays seem pretty comprehensive already. This is ready for review.

@maxwindiff marked this pull request as ready for review on March 6, 2023, 04:45
@maxwindiff (Contributor, author)

Actually, thinking about it more, the array dimensions in gpuarrays/reductions are all pretty small and would not trigger the multi-read path. I should add a few test cases with larger array sizes.

@maxwindiff (Contributor, author)

Adding a test here: JuliaGPU/GPUArrays.jl#459
It passes on nightly, but on other platforms there was a failure when testing Float32. I couldn't reproduce the failure locally on 1.8, however. @maleadt, do you have any suggestions on how to investigate?

@maleadt added the `performance` (Gotta go fast.) label on Mar 8, 2023
@maleadt (Member) commented Mar 8, 2023

Awesome work, thanks for looking into this! Mind rebasing after I generated conflicts everywhere with the libcmt removal? 😅

Since this doesn't add tests (it shouldn't), I assume you verified this works with the GPUArrays PR for large reduction tests? Merging and tagging that may take a bit, but I'd like to merge this in the meantime.

> It doesn't seem to be able to capture timing information. Do you have any idea what I'm missing? I remember seeing the timing graphs once, but I couldn't get it to work again.

I'm not too familiar with Xcode yet, but I think you have to hit Replay to get performance measurements.

@maxwindiff (Contributor, author)

Done, rebased! Yep, this passes the new GPUArrays tests.

@maleadt merged commit af6f7c4 into JuliaGPU:main on Mar 9, 2023
@maxwindiff deleted the reduce branch on March 9, 2023
christiangnrd added a commit to christiangnrd/Metal.jl that referenced this pull request Mar 14, 2023