Minor mapreduce improvements #303

Merged: 2 commits into main on Mar 5, 2024
Conversation

maleadt (Member) commented on Mar 5, 2024

The grain heuristic was suboptimal, as it resulted in unnecessary launches (cc @maxwindiff):

```
julia> sum(mtl(rand(10_240_000)))
Reducing (10240000,) to (1,)
Reducing (1, 2500) to (1, 1)
Reducing (1, 1, 2) to (1, 1, 1)
5.1199045f6
```

Now:

```
julia> sum(mtl(rand(10_240_000)))
Reducing (10240000,) to (1,)
Reducing (1, 2500) to (1, 1)
5.1215505f6
```
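For intuition, the number of launches can be modeled as repeated rounding-up division. This is a hypothetical sketch, not Metal.jl's actual heuristic, and the `threads` and `grain` defaults are assumed values chosen to reproduce the pass counts shown above:

```julia
# Hypothetical model of a multi-pass tree reduction (not Metal.jl's actual
# heuristic): each launch reduces up to `threads * grain` elements per group,
# leaving one partial result per group for the next pass.
function reduction_passes(N; threads=1024, grain=4)
    shapes = Int[]
    while N > 1
        N = cld(N, threads * grain)  # partials remaining after this launch
        push!(shapes, N)
    end
    return shapes
end

reduction_passes(10_240_000)  # → [2500, 1]: two launches, as in the output above
```

With a smaller effective grain, the second pass cannot finish in one launch, which is how the extra `(1, 1, 2) → (1, 1, 1)` launch arose before this change.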

This doesn't affect performance much, though. In fact, it looks like mapreduce performs reasonably well after #123. Using the following script to measure:

```julia
using Metal
using Chairmarks

function memcopy(output_data::AbstractArray{T}, input_data::AbstractArray{T}) where T
    i = thread_position_in_grid_1d()
    if 1 <= i <= length(input_data)
        @inbounds output_data[i] = input_data[i]
    end
    return
end

function main(N=2^26)
    threads = 256
    groups = cld(N, threads)

    cpu_in = rand(Float32, N)
    gpu_in = MtlArray(cpu_in)
    gpu_out = similar(gpu_in)
    println("data size: ", Base.format_bytes(sizeof(gpu_in)))

    # verify results
    @metal threads=threads groups=groups memcopy(gpu_out, gpu_in)
    @assert Array(gpu_in) == Array(gpu_out)

    # show speed
    print("memcopy: ")
    bench = @b Metal.@sync @metal threads=threads groups=groups memcopy(gpu_out, gpu_in)
    print(Base.format_bytes(2*sizeof(gpu_in) / bench.time), "/s in ")
    display(bench)

    # CPU sum
    print("sum (CPU): ")
    bench = @b sum(cpu_in)
    print(Base.format_bytes(sizeof(cpu_in) / bench.time), "/s in ")
    display(bench)

    # reference sum
    @assert sum(gpu_in) ≈ sum(cpu_in)

    # show reference speed
    print("sum (GPU): ")
    bench = @b sum(gpu_in)
    print(Base.format_bytes(sizeof(gpu_in) / bench.time), "/s in ")
    display(bench)
end
```
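A note on the launch configuration in the script: `groups = cld(N, threads)` rounds up, so the grid may contain more threads than there are elements, which is why the kernel guards with `1 <= i <= length(input_data)`. A quick check of the geometry, using the values from the script:

```julia
# Launch-geometry check: the grid covers at least N threads, so the kernel's
# bounds guard handles any trailing partial group. For N = 2^26 and 256
# threads the division happens to be exact, but the guard is still needed
# for sizes that are not a multiple of the group size.
N, threads = 2^26, 256
groups = cld(N, threads)
@assert groups * threads >= N
@assert groups == 262_144  # 2^26 / 256 divides evenly here
```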

On an M1, which has 66GB/s max theoretical bandwidth:

```
julia> main()
data size: 256.000 MiB
memcopy: 55.387 GiB/s in 9.027 ms (153 allocs: 3.844 KiB)
sum (CPU): 44.415 GiB/s in 5.629 ms
sum (GPU): 54.377 GiB/s in 4.598 ms (795 allocs: 21.891 KiB)
```
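The bandwidth figures follow from the byte counts used in the script: memcopy moves each byte twice (one read plus one write), while sum only reads. Reproducing the arithmetic from the M1 numbers above:

```julia
# Bandwidth accounting behind the M1 results above:
data_bytes = 2^26 * sizeof(Float32)                 # 256 MiB
memcopy_gibps = 2 * data_bytes / 9.027e-3 / 2^30    # ≈ 55.4 GiB/s (read + write)
sum_gibps     =     data_bytes / 4.598e-3 / 2^30    # ≈ 54.4 GiB/s (read only)
```

So the GPU sum is within a couple of percent of the memcopy ceiling, i.e. effectively memory-bound.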

On an M3 Pro, which has 150GB/s max theoretical bandwidth:

```
julia> main()
data size: 256.000 MiB
memcopy: 118.581 GiB/s in 4.217 ms (153 allocs: 3.844 KiB)
sum (CPU): 59.968 GiB/s in 4.169 ms
sum (GPU): 114.830 GiB/s in 2.177 ms (795 allocs: 21.891 KiB)
```

So I think we can close #46.

@maleadt added the `performance` (Gotta go fast.) and `arrays` (Things about the array abstraction.) labels on Mar 5, 2024
@maleadt merged commit 9f23773 into main on Mar 5, 2024
1 check passed
@maleadt deleted the tb/mapreduce branch on March 5, 2024 at 12:33
Successfully merging this pull request may close these issues.

Poor performance of mapreduce