Minor mapreduce improvements #303

Merged: 2 commits into main on Mar 5, 2024
Conversation

maleadt (Member) commented on Mar 5, 2024

The grain heuristic was suboptimal, as it resulted in unnecessary launches (cc @maxwindiff):

```
julia> sum(mtl(rand(10_240_000)))
Reducing (10240000,) to (1,)
Reducing (1, 2500) to (1, 1)
Reducing (1, 1, 2) to (1, 1, 1)
5.1199045f6
```

Now:

```
julia> sum(mtl(rand(10_240_000)))
Reducing (10240000,) to (1,)
Reducing (1, 2500) to (1, 1)
5.1215505f6
```
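For intuition, the number of launches can be modeled as repeated rounding-up division. This is a hypothetical sketch, not Metal.jl's actual heuristic, and the `threads` and `grain` defaults are assumed values chosen to reproduce the pass counts shown above:

```julia
# Hypothetical model of a multi-pass tree reduction (not Metal.jl's actual
# heuristic): each launch reduces up to `threads * grain` elements per group,
# leaving one partial result per group for the next pass.
function reduction_passes(N; threads=1024, grain=4)
    shapes = Int[]
    while N > 1
        N = cld(N, threads * grain)  # partials remaining after this launch
        push!(shapes, N)
    end
    return shapes
end

reduction_passes(10_240_000)  # → [2500, 1]: two launches, as in the output above
```

With a smaller effective grain, the second pass cannot finish in one launch, which is how the extra `(1, 1, 2) → (1, 1, 1)` launch arose before this change.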

This doesn't affect performance much, though. In fact, it looks like mapreduce performs reasonably well after #123. Using the following script to measure:

```julia
using Metal
using Chairmarks

function memcopy(output_data::AbstractArray{T}, input_data::AbstractArray{T}) where T
    i = thread_position_in_grid_1d()
    if 1 <= i <= length(input_data)
        @inbounds output_data[i] = input_data[i]
    end
    return
end

function main(N=2^26)
    threads = 256
    groups = cld(N, threads)

    cpu_in = rand(Float32, N)
    gpu_in = MtlArray(cpu_in)
    gpu_out = similar(gpu_in)
    println("data size: ", Base.format_bytes(sizeof(gpu_in)))

    # verify results
    @metal threads=threads groups=groups memcopy(gpu_out, gpu_in)
    @assert Array(gpu_in) == Array(gpu_out)

    # show speed
    print("memcopy: ")
    bench = @b Metal.@sync @metal threads=threads groups=groups memcopy(gpu_out, gpu_in)
    print(Base.format_bytes(2*sizeof(gpu_in) / bench.time), "/s in ")
    display(bench)

    # CPU sum
    print("sum (CPU): ")
    bench = @b sum(cpu_in)
    print(Base.format_bytes(sizeof(cpu_in) / bench.time), "/s in ")
    display(bench)

    # reference sum
    @assert sum(gpu_in) ≈ sum(cpu_in)

    # show reference speed
    print("sum (GPU): ")
    bench = @b sum(gpu_in)
    print(Base.format_bytes(sizeof(gpu_in) / bench.time), "/s in ")
    display(bench)
end
```
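A note on the launch configuration in the script: `groups = cld(N, threads)` rounds up, so the grid may contain more threads than there are elements, which is why the kernel guards with `1 <= i <= length(input_data)`. A quick check of the geometry, using the values from the script:

```julia
# Launch-geometry check: the grid covers at least N threads, so the kernel's
# bounds guard handles any trailing partial group. For N = 2^26 and 256
# threads the division happens to be exact, but the guard is still needed
# for sizes that are not a multiple of the group size.
N, threads = 2^26, 256
groups = cld(N, threads)
@assert groups * threads >= N
@assert groups == 262_144  # 2^26 / 256 divides evenly here
```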

On an M1, which has 66GB/s max theoretical bandwidth:

```
julia> main()
data size: 256.000 MiB
memcopy: 55.387 GiB/s in 9.027 ms (153 allocs: 3.844 KiB)
sum (CPU): 44.415 GiB/s in 5.629 ms
sum (GPU): 54.377 GiB/s in 4.598 ms (795 allocs: 21.891 KiB)
```
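The bandwidth figures follow from the byte counts used in the script: memcopy moves each byte twice (one read plus one write), while sum only reads. Reproducing the arithmetic from the M1 numbers above:

```julia
# Bandwidth accounting behind the M1 results above:
data_bytes = 2^26 * sizeof(Float32)                 # 256 MiB
memcopy_gibps = 2 * data_bytes / 9.027e-3 / 2^30    # ≈ 55.4 GiB/s (read + write)
sum_gibps     =     data_bytes / 4.598e-3 / 2^30    # ≈ 54.4 GiB/s (read only)
```

So the GPU sum is within a couple of percent of the memcopy ceiling, i.e. effectively memory-bound.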

On an M3 Pro, which has 150GB/s max theoretical bandwidth:

```
julia> main()
data size: 256.000 MiB
memcopy: 118.581 GiB/s in 4.217 ms (153 allocs: 3.844 KiB)
sum (CPU): 59.968 GiB/s in 4.169 ms
sum (GPU): 114.830 GiB/s in 2.177 ms (795 allocs: 21.891 KiB)
```

So I think we can close #46.

@maleadt added the `performance` (Gotta go fast.) and `arrays` (Things about the array abstraction.) labels on Mar 5, 2024
@maleadt merged commit 9f23773 into main on Mar 5, 2024
1 check passed
@maleadt deleted the tb/mapreduce branch on March 5, 2024 at 12:33
Successfully merging this pull request may close these issues.

Poor performance of mapreduce