
Reduce multiple consecutive values in each thread to improve efficiency #112

Merged · 6 commits into JuliaGPU:main · Mar 9, 2023

Conversation

@maxwindiff (Contributor) commented Mar 3, 2023

Using ideas from:

This replaces the old "reduce serially across chunks of input vector that don't fit in a group" loop, which doesn't seem to apply to Metal.jl since we always launch enough threadgroups to cover the entire input.
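As a conceptual sketch only (plain Julia, not the actual Metal kernel; `thread_partial`, `op`, and `neutral` are illustrative names), the idea is that each thread first reduces a small run of `grain` consecutive elements serially, and only the per-thread partial results go through the usual threadgroup tree reduction:

```julia
# Illustrative CPU model of the per-thread serial step: thread `t` (1-based)
# folds `grain` consecutive elements of `A` into one partial value with `op`.
function thread_partial(op, neutral, A, t, grain)
    acc = neutral
    base = (t - 1) * grain
    for i in 1:grain
        idx = base + i
        idx > length(A) && break
        acc = op(acc, A[idx])   # serial accumulation, no synchronization needed
    end
    return acc
end

# e.g. with grain = 4:
#   partials = [thread_partial(+, 0f0, A, t, 4) for t in 1:cld(length(A), 4)]
#   sum(partials) == sum(A)
# The partials are what the existing threadgroup tree reduction then combines.
```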

In a simple test this is 2-3x faster than HEAD and achieves > 130 GB/s in the 1D reduction case. I'll need to experiment with the stride (grain size) parameter and document it, and test more complex cases such as non-power-of-2 sizes or reducing only certain dimensions.

  • Experiment with grain size parameter and document it
  • Test performance of non-power-of-2 sizes
  • Test performance of reducing certain dimensions of N-D arrays
  • Check the behavior when number of threadgroups is too high
  • Write more tests as needed

Before:

using Metal, BenchmarkTools

# Allocate a host array of ones and upload a copy to the GPU.
function init(dims...)
  a = ones(Float32, dims...)
  return (a, MtlArray(a))
end

a, da = init(8192 * 8192);
b, db = init(8192, 8192);

julia> @btime sum(da)
  6.033 ms (754 allocations: 20.80 KiB)
6.7108864f7

julia> @btime sum(db)
  7.424 ms (759 allocations: 21.33 KiB)
6.7108864f7

After:

julia> @btime sum(da)
  2.030 ms (775 allocations: 21.20 KiB)
6.7108864f7

julia> @btime sum(db)
  3.009 ms (780 allocations: 21.72 KiB)
6.7108864f7

Helps with #46

@maxwindiff (Contributor, author)

4 seems to be the ideal grain size based on testing in https://gist.github.com/maxwindiff/63b4e8f64c73c4b5b4ff00dd4fb79c5d

It's a bit eerie that the running times are exactly the same for all scalar data types, whether it's UInt8, Float32 or Int64, but I don't see anything obviously wrong with the benchmarking code.

@maleadt (Member) commented Mar 4, 2023

Maybe verify execution times with Metal.@profile and Xcode?

@maxwindiff (Contributor, author) commented Mar 5, 2023

It doesn't seem to be able to capture timing information. Do you have any idea what I'm missing? I remember seeing the timing graphs once, but I couldn't get it to work again.

julia> da = MtlArray(fill(Float32(1), 8192 * 8192));
2023-03-04 16:02:35.649 julia18[9346:290846] Metal GPU Frame Capture Enabled

julia> Metal.@profile sum(da)
[ Info: GPU frame capture saved to /Users/kichi/.julia/dev/Metal/julia_capture_1.gputrace/
6.7108864f7

[Screenshots: Xcode GPU frame capture views, 2023-03-04]

@maxwindiff (Contributor, author)

I think I know what happened: I didn't use $<var> interpolation in @btime. I need to redo the tests.
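For reference, a minimal example of the difference: with BenchmarkTools, benchmarking a non-constant global without interpolation also measures dynamic dispatch on the untyped global, whereas interpolating with `$` measures just the call itself.

```julia
using Metal, BenchmarkTools

da = MtlArray(ones(Float32, 8192 * 8192))

@btime sum(da)    # `da` is a non-constant global; timing includes dispatch overhead
@btime sum($da)   # interpolated with $: measures only the reduction call
```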

@maxwindiff (Contributor, author)

I re-ran the benchmarks, and this time the results look more reasonable: https://gist.github.com/maxwindiff/3ff4886b3e6c1d8ba9ecafb13f84eab7

A grain size of 16 / sizeof(T) generally works well, except when reducing along certain dimensions of an N-D array. I can see why it isn't helpful to, say, read multiple values along the 2nd dimension of a 3D array, but it was surprising to see reads along the 1st dimension slowing things down... I need to investigate more.

sum((256, 512, 512); dims=[1])
grain=1 :  7.610 ms (272 allocations: 7.86 KiB)
grain=2 :  8.056 ms (273 allocations: 7.88 KiB)
grain=4 :  9.150 ms (273 allocations: 7.88 KiB)
grain=8 :  11.315 ms (273 allocations: 7.88 KiB)
grain=16:  20.130 ms (273 allocations: 7.88 KiB)

sum((256, 512, 512); dims=[2])
grain=1 :  9.178 ms (278 allocations: 7.95 KiB)
grain=2 :  8.769 ms (278 allocations: 7.95 KiB)
grain=4 :  9.287 ms (278 allocations: 7.95 KiB)
grain=8 :  12.515 ms (278 allocations: 7.95 KiB)
grain=16:  22.905 ms (278 allocations: 7.95 KiB)

@maxwindiff (Contributor, author) commented Mar 5, 2023

Made a few more changes:

  • Only read multiple values if the size of the reduction dimension is greater than threads_per_threadgroup (otherwise we just starve threads of work for no good reason; this is why sum((256, 512, 512); dims=1) was slow before)
  • Only read multiple values if the reduction dimension is potentially contiguous in memory (see the sketch below)
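A hedged sketch of this gating in plain Julia (illustrative names only; the actual kernel derives these quantities from Metal.jl launch parameters rather than taking them as arguments):

```julia
# Illustrative only: pick how many consecutive values each thread reads, based
# on the element type, the reduction length, the threadgroup size, and whether
# the reads along the reduction dimension would be contiguous in memory.
function choose_grain(::Type{T}, reduce_len::Int, threads_per_group::Int,
                      contiguous::Bool) where {T}
    if contiguous && reduce_len > threads_per_group
        return max(1, 16 ÷ sizeof(T))   # the 16 / sizeof(T) heuristic from above
    else
        return 1                        # fall back to one value per thread
    end
end
```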

Now performance is better across the board:
HEAD: https://gist.github.com/maxwindiff/4d861ae188c89ea48dd9c2986e349ca4
This PR: https://gist.github.com/maxwindiff/dcb8f3bcc926cbfc3ef55940fd35f345

@maxwindiff (Contributor, author)

There doesn't seem to be a limit on the number of threadgroups. For example, the following is pretty close to the maximum device buffer size, and it runs fine:

julia> a = MtlArray(fill(1, 30000, 30000, 2));

julia> sum(a; dims=3)
30000×30000×1 MtlArray{Int64, 3}:
[:, :, 1] =
 2  2  2  2  2  …  2  2  2  2  2
 2  2  2  2  2  …  2  2  2  2  2
 ⋮

So I think not having the `ireduce += localDim_reduce * groupDim_reduce` loop is fine.
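For context, a toy plain-Julia model of the removed pattern (illustrative names; the real kernel indexes via Metal intrinsics): a grid-stride loop lets a fixed number of threads cover arbitrarily long inputs, which is unnecessary when enough threadgroups are always launched to cover the input in one pass.

```julia
# Toy model of the removed grid-stride pattern: each "thread" walks the input
# in steps of the whole grid size, so nthreads can be smaller than the input.
function grid_stride_sum(A, nthreads)
    partials = zeros(eltype(A), nthreads)
    for t in 1:nthreads
        i = t
        while i <= length(A)
            partials[t] += A[i]
            i += nthreads            # the `ireduce += ...` stride this PR drops
        end
    end
    return sum(partials)
end
```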

@maxwindiff (Contributor, author)

The reduction tests in GPUArrays seem pretty comprehensive already. This is ready for review.

@maxwindiff marked this pull request as ready for review on March 6, 2023, 04:45
@maxwindiff (Contributor, author)

Actually, thinking about it more, the array dimensions in gpuarrays/reductions are all pretty small and would not trigger the multi-read path. I should add a few test cases with larger array sizes.

@maxwindiff (Contributor, author)

Adding a test here: JuliaGPU/GPUArrays.jl#459
It passes on nightly, but on other platforms there was a failure when testing Float32. I couldn't reproduce the failure locally on 1.8, however. @maleadt, do you have any suggestions on how to investigate?

@maleadt added the `performance` (Gotta go fast.) label on Mar 8, 2023
@maleadt (Member) commented Mar 8, 2023

Awesome work, thanks for looking into this! Mind rebasing after I generated conflicts everywhere with the libcmt removal? 😅

Since this doesn't add tests (it shouldn't), I assume you verified this works with the GPUArrays PR for large reduction tests? Merging and tagging that may take a bit, but I'd like to merge this in the meantime.

> It doesn't seem to be able to capture timing information. Do you have any idea what I'm missing? I remember seeing the timing graphs once, but I couldn't get it to work again.

I'm not too familiar with Xcode yet, but I think you have to hit Replay to get performance measurements.

@maxwindiff (Contributor, author)

Done, rebased! Yep, this passes the new GPUArrays tests.

@maleadt merged commit af6f7c4 into JuliaGPU:main on Mar 9, 2023
@maxwindiff deleted the reduce branch on March 9, 2023
christiangnrd added a commit to christiangnrd/Metal.jl that referenced this pull request Mar 14, 2023