diff --git a/previews/PR304/api/array/index.html b/previews/PR304/api/array/index.html index d77b8a941..78da947cd 100644 --- a/previews/PR304/api/array/index.html +++ b/previews/PR304/api/array/index.html @@ -20,4 +20,4 @@ 3-element MtlVector{Int64, Metal.MTL.MTLResourceStorageModePrivate}: 1 2 - 3source + 3source diff --git a/previews/PR304/api/compiler/index.html b/previews/PR304/api/compiler/index.html index 68572bf55..3fb551542 100644 --- a/previews/PR304/api/compiler/index.html +++ b/previews/PR304/api/compiler/index.html @@ -1,8 +1,8 @@ -Compiler · Metal.jl

Compiler

Execution

The main entry-point to the compiler is the @metal macro:

Metal.@metalMacro
@metal threads=... groups=... [kwargs...] func(args...)

High-level interface for executing code on a GPU.

The @metal macro should prefix a call, with func a callable function or object that should return nothing. It will be compiled to a Metal function upon first use, and to a certain extent arguments will be converted and managed automatically using mtlconvert. Finally, a call to mtlcall is performed, creating a command buffer in the current global command queue then committing it.

There is one supported keyword argument that influences the behavior of @metal:

  • launch: whether to launch this kernel, defaults to true. If false the returned kernel object should be launched by calling it and passing arguments again.
  • name: the name of the kernel in the generated code. Defaults to an automatically- generated name.
  • queue: the command queue to use for this kernel. Defaults to the global command queue.
source

If needed, you can use a lower-level API that lets you inspect the compiler kernel:

Metal.mtlconvertFunction

mtlconvert(x, [cce])

This function is called for every argument to be passed to a kernel, allowing it to be converted to a GPU-friendly format. By default, the function does nothing and returns the input object x as-is.

Do not add methods to this function, but instead extend the underlying Adapt.jl package and register methods for the the Metal.Adaptor type.

source
Metal.mtlfunctionFunction
mtlfunction(f, tt=Tuple{}; kwargs...)

Low-level interface to compile a function invocation for the currently-active GPU, returning a callable kernel object. For a higher-level interface, use @metal.

The output of this function is automatically cached, i.e. you can simply call mtlfunction in a hot path without degrading performance. New code will be generated automatically when the function changes, or when different types or keyword arguments are provided.

source

Reflection

If you want to inspect generated code, you can use macros that resemble functionality from the InteractiveUtils standard library:

@device_code_lowered
+Compiler · Metal.jl

Compiler

Execution

The main entry-point to the compiler is the @metal macro:

Metal.@metalMacro
@metal threads=... groups=... [kwargs...] func(args...)

High-level interface for executing code on a GPU.

The @metal macro should prefix a call, with func a callable function or object that should return nothing. It will be compiled to a Metal function upon first use, and to a certain extent arguments will be converted and managed automatically using mtlconvert. Finally, a call to mtlcall is performed, creating a command buffer in the current global command queue then committing it.

There is one supported keyword argument that influences the behavior of @metal:

  • launch: whether to launch this kernel, defaults to true. If false the returned kernel object should be launched by calling it and passing arguments again.
  • name: the name of the kernel in the generated code. Defaults to an automatically- generated name.
  • queue: the command queue to use for this kernel. Defaults to the global command queue.
source

If needed, you can use a lower-level API that lets you inspect the compiler kernel:

Metal.mtlconvertFunction

mtlconvert(x, [cce])

This function is called for every argument to be passed to a kernel, allowing it to be converted to a GPU-friendly format. By default, the function does nothing and returns the input object x as-is.

Do not add methods to this function, but instead extend the underlying Adapt.jl package and register methods for the the Metal.Adaptor type.

source
Metal.mtlfunctionFunction
mtlfunction(f, tt=Tuple{}; kwargs...)

Low-level interface to compile a function invocation for the currently-active GPU, returning a callable kernel object. For a higher-level interface, use @metal.

The output of this function is automatically cached, i.e. you can simply call mtlfunction in a hot path without degrading performance. New code will be generated automatically when the function changes, or when different types or keyword arguments are provided.

source

Reflection

If you want to inspect generated code, you can use macros that resemble functionality from the InteractiveUtils standard library:

@device_code_lowered
 @device_code_typed
 @device_code_warntype
 @device_code_llvm
 @device_code_native
 @device_code_agx
-@device_code

For more information, please consult the GPUCompiler.jl documentation. code_agx is actually code_native:

+@device_code

For more information, please consult the GPUCompiler.jl documentation. code_agx is actually code_native:

diff --git a/previews/PR304/api/essentials/index.html b/previews/PR304/api/essentials/index.html index 9e9a721e6..f930e5a2b 100644 --- a/previews/PR304/api/essentials/index.html +++ b/previews/PR304/api/essentials/index.html @@ -1,2 +1,2 @@ -Essentials · Metal.jl

Essentials

Global State

Metal.device!Function
device!(dev::MTLDevice)

Sets the Metal GPU device associated with the current Julia task.

source
Metal.current_deviceFunction
current_device()::MTLDevice

Return the Metal GPU device associated with the current Julia task.

Since all M-series systems currently only externally show a single GPU, this function effectively returns the only system GPU.

source
Metal.global_queueFunction
global_queue(dev::MTLDevice)::MTLCommandQueue

Return the Metal command queue associated with the current Julia thread.

source
Metal.synchronizeFunction
synchronize(queue)

Wait for currently committed GPU work on this queue to finish.

Create a new MTLCommandBuffer from the global command queue, commit it to the queue, and simply wait for it to be completed. Since command buffers should execute in a First-In-First-Out manner, this synchronizes the GPU.

source
+Essentials · Metal.jl

Essentials

Global State

Metal.device!Function
device!(dev::MTLDevice)

Sets the Metal GPU device associated with the current Julia task.

source
Metal.current_deviceFunction
current_device()::MTLDevice

Return the Metal GPU device associated with the current Julia task.

Since all M-series systems currently only externally show a single GPU, this function effectively returns the only system GPU.

source
Metal.global_queueFunction
global_queue(dev::MTLDevice)::MTLCommandQueue

Return the Metal command queue associated with the current Julia thread.

source
Metal.synchronizeFunction
synchronize(queue)

Wait for currently committed GPU work on this queue to finish.

Create a new MTLCommandBuffer from the global command queue, commit it to the queue, and simply wait for it to be completed. Since command buffers should execute in a First-In-First-Out manner, this synchronizes the GPU.

source
diff --git a/previews/PR304/api/kernel/index.html b/previews/PR304/api/kernel/index.html index 9afeb9905..e59f4264e 100644 --- a/previews/PR304/api/kernel/index.html +++ b/previews/PR304/api/kernel/index.html @@ -1,24 +1,24 @@ -Kernel programming · Metal.jl

Kernel programming

This section lists the package's public functionality that corresponds to special Metal functions for use in device code. For more information about these functions, please consult the Metal Shading Language specification.

This is made possible by interfacing with the Metal libraries through a small C library that wraps the ObjectiveC APIs. These low-level wrappers, along with some slightly higher-level Julia wrappers, are available in the MTL submodule exported by Metal.jl. All wrapped C functions and types start with the mt prefix, whereas the Julia wrappers are prefixed with Mtl:

Indexing and dimensions

Metal.thread_position_in_grid_1dFunction
thread_position_in_grid_1d()::UInt32
+Kernel programming · Metal.jl

Kernel programming

This section lists the package's public functionality that corresponds to special Metal functions for use in device code. For more information about these functions, please consult the Metal Shading Language specification.

This is made possible by interfacing with the Metal libraries through a small C library that wraps the ObjectiveC APIs. These low-level wrappers, along with some slightly higher-level Julia wrappers, are available in the MTL submodule exported by Metal.jl. All wrapped C functions and types start with the mt prefix, whereas the Julia wrappers are prefixed with Mtl:

Indexing and dimensions

Metal.thread_position_in_grid_1dFunction
thread_position_in_grid_1d()::UInt32
 thread_position_in_grid_2d()::NamedTuple{(:x, :y), Tuple{UInt32, UInt32}}
-thread_position_in_grid_3d()::NamedTuple{(:x, :y, :z), Tuple{UInt32, UInt32, UInt32}}

Return the current thread's position in an N-dimensional grid of threads.

source
Metal.thread_position_in_threadgroup_1dFunction
thread_position_in_threadgroup_1d()::UInt32
+thread_position_in_grid_3d()::NamedTuple{(:x, :y, :z), Tuple{UInt32, UInt32, UInt32}}

Return the current thread's position in an N-dimensional grid of threads.

source
Metal.thread_position_in_threadgroup_1dFunction
thread_position_in_threadgroup_1d()::UInt32
 thread_position_in_threadgroup_2d()::NamedTuple{(:x, :y), Tuple{UInt32, UInt32}}
-thread_position_in_threadgroup_3d()::NamedTuple{(:x, :y, :z), Tuple{UInt32, UInt32, UInt32}}

Return the current thread's unique position within a threadgroup.

source
Metal.threadgroup_position_in_grid_1dFunction
threadgroup_position_in_grid_1d()::UInt32
+thread_position_in_threadgroup_3d()::NamedTuple{(:x, :y, :z), Tuple{UInt32, UInt32, UInt32}}

Return the current thread's unique position within a threadgroup.

source
Metal.threadgroup_position_in_grid_1dFunction
threadgroup_position_in_grid_1d()::UInt32
 threadgroup_position_in_grid_2d()::NamedTuple{(:x, :y), Tuple{UInt32, UInt32}}
-threadgroup_position_in_grid_3d()::NamedTuple{(:x, :y, :z), Tuple{UInt32, UInt32, UInt32}}

Return the current threadgroup's unique position within the grid.

source
Metal.threadgroups_per_grid_1dFunction
threadgroups_per_grid_1d()::UInt32
+threadgroup_position_in_grid_3d()::NamedTuple{(:x, :y, :z), Tuple{UInt32, UInt32, UInt32}}

Return the current threadgroup's unique position within the grid.

source
Metal.threadgroups_per_grid_1dFunction
threadgroups_per_grid_1d()::UInt32
 threadgroups_per_grid_2d()::NamedTuple{(:x, :y), Tuple{UInt32, UInt32}}
-threadgroups_per_grid_3d()::NamedTuple{(:x, :y, :z), Tuple{UInt32, UInt32, UInt32}}

Return the number of threadgroups per grid.

source
Metal.threads_per_grid_1dFunction
threads_per_grid_1d()::UInt32
+threadgroups_per_grid_3d()::NamedTuple{(:x, :y, :z), Tuple{UInt32, UInt32, UInt32}}

Return the number of threadgroups per grid.

source
Metal.threads_per_grid_1dFunction
threads_per_grid_1d()::UInt32
 threads_per_grid_2d()::NamedTuple{(:x, :y), Tuple{UInt32, UInt32}}
-threads_per_grid_3d()::NamedTuple{(:x, :y, :z), Tuple{UInt32, UInt32, UInt32}}

Return the grid size.

source
Metal.threads_per_threadgroup_1dFunction
threads_per_threadgroup_1d()::UInt32
 threads_per_threadgroup_2d()::NamedTuple{(:x, :y), Tuple{UInt32, UInt32}}
-threads_per_threadgroup_3d()::NamedTuple{(:x, :y, :z), Tuple{UInt32, UInt32, UInt32}}

Return the thread execution width of a threadgroup.

source
Metal.grid_size_1dFunction
grid_size_1d()::UInt32
+threads_per_threadgroup_3d()::NamedTuple{(:x, :y, :z), Tuple{UInt32, UInt32, UInt32}}

Return the thread execution width of a threadgroup.

source
Metal.grid_size_1dFunction
grid_size_1d()::UInt32
 grid_size_2d()::NamedTuple{(:x, :y), Tuple{UInt32, UInt32}}
-grid_size_3d()::NamedTuple{(:x, :y, :z), Tuple{UInt32, UInt32, UInt32}}

Return maximum size of the grid for threads that read per-thread stage-in data.

source
Metal.grid_origin_1dFunction
grid_origin_1d()::UInt32
+grid_size_3d()::NamedTuple{(:x, :y, :z), Tuple{UInt32, UInt32, UInt32}}

Return maximum size of the grid for threads that read per-thread stage-in data.

source
Metal.grid_origin_1dFunction
grid_origin_1d()::UInt32
 grid_origin_2d()::NamedTuple{(:x, :y), Tuple{UInt32, UInt32}}
-grid_origin_3d()::NamedTuple{(:x, :y, :z), Tuple{UInt32, UInt32, UInt32}}

Return the origin offset of the grid for threads that read per-thread stage-in data.

source

Device arrays

Metal.jl provides a primitive, lightweight array type to manage GPU data organized in an plain, dense fashion. This is the device-counterpart to the MtlArray, and implements (part of) the array interface as well as other functionality for use on the GPU:

Metal.MtlDeviceArrayType
MtlDeviceArray(dims, ptr)
+grid_origin_3d()::NamedTuple{(:x, :y, :z), Tuple{UInt32, UInt32, UInt32}}

Return the origin offset of the grid for threads that read per-thread stage-in data.

source

Device arrays

Metal.jl provides a primitive, lightweight array type to manage GPU data organized in an plain, dense fashion. This is the device-counterpart to the MtlArray, and implements (part of) the array interface as well as other functionality for use on the GPU:

Metal.MtlDeviceArrayType
MtlDeviceArray(dims, ptr)
 MtlDeviceArray{T}(dims, ptr)
 MtlDeviceArray{T,A}(dims, ptr)
-MtlDeviceArray{T,A,N}(dims, ptr)

Construct an N-dimensional dense Metal device array with element type T wrapping a pointer, where N is determined from the length of dims and T is determined from the type of ptr.

dims may be a single scalar, or a tuple of integers corresponding to the lengths in each dimension). If the rank N is supplied explicitly as in Array{T,N}(dims), then it must match the length of dims. The same applies to the element type T, which should match the type of the pointer ptr.

source
Metal.ConstType
Const(A::MtlDeviceArray)

Mark a MtlDeviceArray as constant/read-only and to use the constant address space.

Warning

Experimental API. Subject to change without deprecation.

source

Shared memory

Metal.MtlThreadGroupArrayFunction
MtlThreadGroupArray(::Type{T}, dims)

Create an array local to each threadgroup launched during kernel execution.

source

Synchronization

Metal.MemoryFlagsType
MemoryFlags

Flags to set the memory synchronization behavior of threadgroup_barrier and simdgroup_barrier.

Possible values:

None: Set barriers to only act as an execution barrier and not apply a memory fence.
+MtlDeviceArray{T,A,N}(dims, ptr)

Construct an N-dimensional dense Metal device array with element type T wrapping a pointer, where N is determined from the length of dims and T is determined from the type of ptr.

dims may be a single scalar, or a tuple of integers corresponding to the lengths in each dimension). If the rank N is supplied explicitly as in Array{T,N}(dims), then it must match the length of dims. The same applies to the element type T, which should match the type of the pointer ptr.

source
Metal.ConstType
Const(A::MtlDeviceArray)

Mark a MtlDeviceArray as constant/read-only and to use the constant address space.

Warning

Experimental API. Subject to change without deprecation.

source

Shared memory

Metal.MtlThreadGroupArrayFunction
MtlThreadGroupArray(::Type{T}, dims)

Create an array local to each threadgroup launched during kernel execution.

source

Synchronization

Metal.MemoryFlagsType
MemoryFlags

Flags to set the memory synchronization behavior of threadgroup_barrier and simdgroup_barrier.

Possible values:

None: Set barriers to only act as an execution barrier and not apply a memory fence.
 
 Device: Ensure the GPU correctly orders the memory operations to device memory
         for threads in the threadgroup or simdgroup.
@@ -30,4 +30,4 @@
         threads in a threadgroup or simdgroup for a texture with the read_write access qualifier.
 
 ThreadGroup_ImgBlock: Ensure the GPU correctly orders the memory operations to threadgroup imageblock memory
-        for threads in a threadgroup or simdgroup.
source
Metal.threadgroup_barrierFunction
threadgroup_barrier(flag::MemoryFlags=MemoryFlagNone)

Synchronize all threads in a threadgroup.

Possible flags that affect the memory synchronization behavior are found in MemoryFlags

source
Metal.simdgroup_barrierFunction
simdgroup_barrier(flag::MemoryFlags=MemoryFlagNone)

Synchronize all threads in a SIMD-group.

Possible flags that affect the memory synchronization behavior are found in MemoryFlags

source
+ for threads in a threadgroup or simdgroup.
source
Metal.threadgroup_barrierFunction
threadgroup_barrier(flag::MemoryFlags=MemoryFlagNone)

Synchronize all threads in a threadgroup.

Possible flags that affect the memory synchronization behavior are found in MemoryFlags

source
Metal.simdgroup_barrierFunction
simdgroup_barrier(flag::MemoryFlags=MemoryFlagNone)

Synchronize all threads in a SIMD-group.

Possible flags that affect the memory synchronization behavior are found in MemoryFlags

source
diff --git a/previews/PR304/api/mps/index.html b/previews/PR304/api/mps/index.html index ae26351c9..0d6d02342 100644 --- a/previews/PR304/api/mps/index.html +++ b/previews/PR304/api/mps/index.html @@ -1,5 +1,5 @@ -Metal Performance Shaders · Metal.jl

Metal Performance Shaders

This section lists the package's public functionality that corresponds to the Metal Performance Shaders functions. For more information about these functions, or to see which functions have yet to be implemented in this package, please consult the Metal Performance Shaders Documentation.

Matrices and Vectors

Metal.MPS.MPSMatrixType
MPSMatrix(arr::MtlMatrix)

Metal matrix representation used in Performance Shaders.

Note that this results in a transposed view of the input, as Metal stores matrices row-major instead of column-major.

source
MPSMatrix(arr::MtlArray{T,3})

Metal batched matrix representation used in Performance Shaders.

Note that this results in a transposed view of the input, as Metal stores matrices row-major instead of column-major.

source

Matrix Arithmetic Operators

Metal.MPS.matmul!Function
matMulMPS(a::MtlMatrix, b::MtlMatrix, c::MtlMatrix, alpha=1, beta=1,
-          transpose_left=false, transpose_right=false)

A MPSMatrixMultiplication kernel thay computes: c = alpha * op(a) * beta * op(b) + beta * C

This function should not typically be used. Rather, use the normal LinearAlgebra interface with any MtlArray and it should be accelerated using Metal Performance Shaders.

source
Metal.MPS.matvecmul!Function
matVecMulMPS(c::MtlVector, a::MtlMatrix, b::MtlVector, alpha=1, beta=1,
-             transpose=false)

A MPSMatrixVectorMultiplication kernel thay computes: c = alpha * op(a) * b + beta * c

This function should not typically be used. Rather, use the normal LinearAlgebra interface with any MtlArray and it should be accelerated using Metal Performance Shaders.

source
Metal.MPS.topkFunction
topk(A::MtlMatrix{T}, k) where {T<:MtlFloat}

Compute the top k values and their corresponding indices column-wise in a matrix A. Return the indices in I and the values in V.

k cannot be greater than 16.

Uses MPSMatrixFindTopK.

See also: topk!.

source
Metal.MPS.topk!Function
topk!(A::MtlMatrix{T}, I::MtlMatrix{Int32}, V::MtlMatrix{T}, k)
-                                                 where {T<:MtlFloat}

Compute the top k values and their corresponding indices column-wise in a matrix A. Return the indices in I and the values in V.

k cannot be greater than 16.

Uses MPSMatrixFindTopK.

See also: topk.

source
+Metal Performance Shaders · Metal.jl

Metal Performance Shaders

This section lists the package's public functionality that corresponds to the Metal Performance Shaders functions. For more information about these functions, or to see which functions have yet to be implemented in this package, please consult the Metal Performance Shaders Documentation.

Matrices and Vectors

Metal.MPS.MPSMatrixType
MPSMatrix(arr::MtlMatrix)

Metal matrix representation used in Performance Shaders.

Note that this results in a transposed view of the input, as Metal stores matrices row-major instead of column-major.

source
MPSMatrix(arr::MtlArray{T,3})

Metal batched matrix representation used in Performance Shaders.

Note that this results in a transposed view of the input, as Metal stores matrices row-major instead of column-major.

source

Matrix Arithmetic Operators

Metal.MPS.matmul!Function
matMulMPS(a::MtlMatrix, b::MtlMatrix, c::MtlMatrix, alpha=1, beta=1,
+          transpose_left=false, transpose_right=false)

A MPSMatrixMultiplication kernel thay computes: c = alpha * op(a) * beta * op(b) + beta * C

This function should not typically be used. Rather, use the normal LinearAlgebra interface with any MtlArray and it should be accelerated using Metal Performance Shaders.

source
Metal.MPS.matvecmul!Function
matVecMulMPS(c::MtlVector, a::MtlMatrix, b::MtlVector, alpha=1, beta=1,
+             transpose=false)

A MPSMatrixVectorMultiplication kernel thay computes: c = alpha * op(a) * b + beta * c

This function should not typically be used. Rather, use the normal LinearAlgebra interface with any MtlArray and it should be accelerated using Metal Performance Shaders.

source
Metal.MPS.topkFunction
topk(A::MtlMatrix{T}, k) where {T<:MtlFloat}

Compute the top k values and their corresponding indices column-wise in a matrix A. Return the indices in I and the values in V.

k cannot be greater than 16.

Uses MPSMatrixFindTopK.

See also: topk!.

source
Metal.MPS.topk!Function
topk!(A::MtlMatrix{T}, I::MtlMatrix{Int32}, V::MtlMatrix{T}, k)
+                                                 where {T<:MtlFloat}

Compute the top k values and their corresponding indices column-wise in a matrix A. Return the indices in I and the values in V.

k cannot be greater than 16.

Uses MPSMatrixFindTopK.

See also: topk.

source
diff --git a/previews/PR304/faq/contributing/index.html b/previews/PR304/faq/contributing/index.html index d680968c6..16ce660e7 100644 --- a/previews/PR304/faq/contributing/index.html +++ b/previews/PR304/faq/contributing/index.html @@ -7,4 +7,4 @@ uint i [[thread_position_in_grid]]) { atomic_store_explicit(&out[i], 0.0f, memory_order_relaxed); -}

To compile with Metal's tools and emit human-readable IR, run something roughly along the lines of: xcrun metal -S -emit-llvm dummy_kernel.metal

This will create a .ll file that you can then parse for whatever information you need. Be sure to double-check the metadata at the bottom for any significant changes your functionality introduces.

Test with different types and configurations to see what changes are caused. Also ensure that when writing very simple kernels, whatever you're interested in doesn't get optimized away. Double-check that the kernel's IR makes sense for what you wrote.

Metal Performance Shaders

Metal exposes a special interface to its library of optimized kernels. Rather than accepting the normal set of input GPU data structures, it requires special MPS datatypes that assume row-major memory layout. As this is not the Julia default, adapt accordingly. Adding MPS functionality should be mostly straightforward, so this can be an easy entry point to helping. To get started, you can have a look at the Metal Performance Shaders Documentation from Apple.

Exposing your Interface

There are varying degrees of user-facing interfaces from Metal.jl. At the lowest level is Metal.MTL.xxx. This is for low-level functionality close to or at bare Objective-C, or things that a normal user wouldn't directly be using. Metal.MPS.xxx is for Metal Performance Shader specifics (like MPSMatrix). Next, is Metal.xxx. This is for higher-level, usually pure-Julian functionality (like current_device()). The only thing beyond this is exporting into the global namespace. That would be useful for uniquely-named functions/structures/macros with clear and common use-cases (MtlArray or @metal).

Additionally, you can override non-Metal.jl functions like LinearAlgebra.mul! seen here. This is essentially (ab)using multiple dispatch to specialize for certain cases (usually for more performant execution).

If your function is only available from within GPU kernels (like thread indexing intrinsics). Be sure to properly annotate with @device_function to ensure that calling from the host doesn't kill your Julia process.

Generally, think about how frequently you expect your addition to be used, how complex its use-case is, and whether or not it clashes/reimplements/optimizes existing functionality from outside Metal.jl. Put it behind the corresponding interface.

Creating Tests

As it's good practice, and JuliaGPU has great CI/CD workflows, your addition should have associated tests to ensure correctness and edge cases. Look to existing examples under the test folder for initial guidance, and be sure to create tests for all valid types. Any new Julia file in this folder will be ran as its own testset. If you feel your tests don't fit in any existing place, you'll probably want to create a new file with an appropriate name.

Running a Subset of the Existing Tests

Sometimes you won't want to run the entire testsuite. You may just want to run the tests for your new functionality. To do that, you can either pass the name of the testset to the test/runtests.jl script: julia --project=test test/runtests.jl metal or you can isolate test files by running them alone after running the test/setup.jl script: julia --project=test -L test/setup.jl test/metal.jl

Thank You and Good Luck

Open-source projects like this only happen because people like you are willing to spend their free time helping out. Most anything you're able to do is helpful, but if you get stuck, seek guidance from Slack or Discourse. Don't feel like your contribution has to be perfect. If you put in effort and make progress, there will likely be some senior developer willing to polish your code before merging. Open-source software is a team effort...welcome to the team!

+}

To compile with Metal's tools and emit human-readable IR, run something roughly along the lines of: xcrun metal -S -emit-llvm dummy_kernel.metal

This will create a .ll file that you can then parse for whatever information you need. Be sure to double-check the metadata at the bottom for any significant changes your functionality introduces.

Test with different types and configurations to see what changes are caused. Also ensure that when writing very simple kernels, whatever you're interested in doesn't get optimized away. Double-check that the kernel's IR makes sense for what you wrote.

Metal Performance Shaders

Metal exposes a special interface to its library of optimized kernels. Rather than accepting the normal set of input GPU data structures, it requires special MPS datatypes that assume row-major memory layout. As this is not the Julia default, adapt accordingly. Adding MPS functionality should be mostly straightforward, so this can be an easy entry point to helping. To get started, you can have a look at the Metal Performance Shaders Documentation from Apple.

Exposing your Interface

There are varying degrees of user-facing interfaces from Metal.jl. At the lowest level is Metal.MTL.xxx. This is for low-level functionality close to or at bare Objective-C, or things that a normal user wouldn't directly be using. Metal.MPS.xxx is for Metal Performance Shader specifics (like MPSMatrix). Next, is Metal.xxx. This is for higher-level, usually pure-Julian functionality (like current_device()). The only thing beyond this is exporting into the global namespace. That would be useful for uniquely-named functions/structures/macros with clear and common use-cases (MtlArray or @metal).

Additionally, you can override non-Metal.jl functions like LinearAlgebra.mul! seen here. This is essentially (ab)using multiple dispatch to specialize for certain cases (usually for more performant execution).

If your function is only available from within GPU kernels (like thread indexing intrinsics). Be sure to properly annotate with @device_function to ensure that calling from the host doesn't kill your Julia process.

Generally, think about how frequently you expect your addition to be used, how complex its use-case is, and whether or not it clashes/reimplements/optimizes existing functionality from outside Metal.jl. Put it behind the corresponding interface.

Creating Tests

As it's good practice, and JuliaGPU has great CI/CD workflows, your addition should have associated tests to ensure correctness and edge cases. Look to existing examples under the test folder for initial guidance, and be sure to create tests for all valid types. Any new Julia file in this folder will be ran as its own testset. If you feel your tests don't fit in any existing place, you'll probably want to create a new file with an appropriate name.

Running a Subset of the Existing Tests

Sometimes you won't want to run the entire testsuite. You may just want to run the tests for your new functionality. To do that, you can either pass the name of the testset to the test/runtests.jl script: julia --project=test test/runtests.jl metal or you can isolate test files by running them alone after running the test/setup.jl script: julia --project=test -L test/setup.jl test/metal.jl

Thank You and Good Luck

Open-source projects like this only happen because people like you are willing to spend their free time helping out. Most anything you're able to do is helpful, but if you get stuck, seek guidance from Slack or Discourse. Don't feel like your contribution has to be perfect. If you put in effort and make progress, there will likely be some senior developer willing to polish your code before merging. Open-source software is a team effort...welcome to the team!

diff --git a/previews/PR304/faq/faq/index.html b/previews/PR304/faq/faq/index.html index 363ef5bd3..a9ebf5fc7 100644 --- a/previews/PR304/faq/faq/index.html +++ b/previews/PR304/faq/faq/index.html @@ -1,2 +1,2 @@ -Frequently Asked Questions · Metal.jl
+Frequently Asked Questions · Metal.jl
diff --git a/previews/PR304/index.html b/previews/PR304/index.html index ff99ed794..70be98cb9 100644 --- a/previews/PR304/index.html +++ b/previews/PR304/index.html @@ -6,4 +6,4 @@ # smoke test using Metal Metal.versioninfo()

If you want to ensure everything works as expected, you can execute the test suite.

using Pkg
-Pkg.test("Metal")

The following resources may also be of interest (although are mainly focused on the CUDA GPU backend):

Contributing

If you want to help improve this package, look at the contributing page for more details.

Acknowledgements

The Julia Metal stack has been a collaborative effort by many individuals. Significant contributions have been made by the following individuals:

Supporting and Citing

Some of the software in this ecosystem was developed as part of academic research. If you would like to help support it, please star the repository as such metrics may help us secure funding in the future. If you use our software as part of your research, teaching, or other activities, we would be grateful if you could cite our work. The CITATION.cff file in the root of this repository lists the relevant papers.

+Pkg.test("Metal")

The following resources may also be of interest (although are mainly focused on the CUDA GPU backend):

Contributing

If you want to help improve this package, look at the contributing page for more details.

Acknowledgements

The Julia Metal stack has been a collaborative effort by many individuals. Significant contributions have been made by the following individuals:

Supporting and Citing

Some of the software in this ecosystem was developed as part of academic research. If you would like to help support it, please star the repository as such metrics may help us secure funding in the future. If you use our software as part of your research, teaching, or other activities, we would be grateful if you could cite our work. The CITATION.cff file in the root of this repository lists the relevant papers.

diff --git a/previews/PR304/profiling/index.html b/previews/PR304/profiling/index.html index b226b1103..64e5a90e4 100644 --- a/previews/PR304/profiling/index.html +++ b/previews/PR304/profiling/index.html @@ -31,4 +31,4 @@ julia> Metal.@capture @metal threads=length(c) vadd(a, b, c); ... -[ Info: GPU frame capture saved to julia_1.gputrace; open the resulting trace in Xcode +[ Info: GPU frame capture saved to julia_1.gputrace; open the resulting trace in Xcode diff --git a/previews/PR304/search/index.html b/previews/PR304/search/index.html index eed640a43..cefe360f7 100644 --- a/previews/PR304/search/index.html +++ b/previews/PR304/search/index.html @@ -1,2 +1,2 @@ -Search · Metal.jl

Loading search...

    +Search · Metal.jl

    Loading search...

      diff --git a/previews/PR304/usage/array/index.html b/previews/PR304/usage/array/index.html index dced4db50..181d9d5f8 100644 --- a/previews/PR304/usage/array/index.html +++ b/previews/PR304/usage/array/index.html @@ -50,4 +50,4 @@ julia> Base.mapreducedim!(identity, +, b, a) 1×1 MtlMatrix{Float32, Metal.MTL.MTLResourceStorageModePrivate}: - 6.0 + 6.0 diff --git a/previews/PR304/usage/kernel/index.html b/previews/PR304/usage/kernel/index.html index a56a0b7b0..c066ee0d5 100644 --- a/previews/PR304/usage/kernel/index.html +++ b/previews/PR304/usage/kernel/index.html @@ -4,4 +4,4 @@ c[i] = a[i] + b[i] return end

      This kernel takes in three vectors (a,b,c) all of the same length and stores the element-wise sum of a and b into c. Each thread in this kernel gets its unique position in the grid (arrangement of all threadgroups) and stores this value into the variable i which is then used as the index into the vectors. Thus, each thread is computing one sum and storing the result in the output vector.

      To ensure this kernel functions properly, we have to launch it with exactly as many threads as the length of the vectors. If we under or over-launch threads, the result could be incorrect.

      An example of a good launch:

      len = prod(size(d_a))
      -@metal threads=len vadd(d_a, d_b, d_c)

      Additional notes:

      Metal Shading Language Specification An Introduction to GPU Programming course from University of Illinois (primarily in CUDA, but the concepts are transferable)

      +@metal threads=len vadd(d_a, d_b, d_c)

      Additional notes:

      Metal Shading Language Specification An Introduction to GPU Programming course from University of Illinois (primarily in CUDA, but the concepts are transferable)

      diff --git a/previews/PR304/usage/overview/index.html b/previews/PR304/usage/overview/index.html index 52f5c7afb..56f7c11d6 100644 --- a/previews/PR304/usage/overview/index.html +++ b/previews/PR304/usage/overview/index.html @@ -9,4 +9,4 @@ # automatic memory management a = nothing

      Beyond memory management, there are a whole range of array operations to process your data. This includes several higher-order operations that take other code as arguments, such as map, reduce or broadcast. With these, it is possible to perform kernel-like operations without actually writing your own GPU kernels:

      a = Metal.zeros(1024)
       b = Metal.ones(1024)
      -a.^2 .+ sin.(b)
      +a.^2 .+ sin.(b)