
add an option in insertGpuAllocs to skip the func args copy #916

Open · wants to merge 3 commits into main

Conversation

@zhczhong (Member) commented Oct 9, 2024

Add an option in insertGpuAllocs to skip the func args copy, so that the user can decide whether a gpu.memcpy is needed for the input args.
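
For context, a minimal before/after sketch of what the option controls (illustrative, hand-written IR; the exact rewrite produced by insertGpuAllocs may differ):

// Input: a host function whose memref argument is consumed by a GPU kernel.
func.func @add(%arg0: memref<8xf32>) {
  %c1 = arith.constant 1 : index
  gpu.launch_func @kernels::@add blocks in (%c1, %c1, %c1)
      threads in (%c1, %c1, %c1) args(%arg0 : memref<8xf32>)
  return
}

// Default behavior (sketch): the pass wraps the argument in a device
// allocation plus copies around the launch:
//   %dev = gpu.alloc () : memref<8xf32>
//   gpu.memcpy %dev, %arg0 : memref<8xf32>, memref<8xf32>
//   gpu.launch_func ... args(%dev : memref<8xf32>)
//   gpu.memcpy %arg0, %dev : memref<8xf32>, memref<8xf32>
//   gpu.dealloc %dev : memref<8xf32>
//
// With the proposed option enabled (all func args are assumed to be USM),
// %arg0 is passed to the kernel directly and no gpu.alloc/gpu.memcpy pair
// is inserted for it.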

Please review these guidelines to help with the review process:

  • Have you provided a meaningful PR description?
  • Have you added a test, a reproducer, or a reference to an issue with a reproducer?
  • Have you tested your changes locally for CPU and GPU devices?
  • Have you made sure that new changes do not introduce compiler warnings?
  • If this PR is a work in progress, are you filing the PR as a draft?
  • Have you organized your commits logically and ensured each can be built by itself?

@silee2 (Contributor) commented Oct 9, 2024

I don't think this is the right solution. If a caller function's args are used directly, GPU memory allocation and a copy are only needed when the arg (a memref) does not already have a GPU memspace; USM vs. device memory shouldn't matter in this case. A better way would be to use https://mlir.llvm.org/docs/Dialects/GPU/#gpu-address-spaces to mark GPU or USM memory args, and then have this pass skip the func-arg alloc and copy based on the address-space attribute.

@zhczhong (Member, Author)

> I don't think this is the right solution. If a caller function's args are used directly, GPU memory allocation and a copy are only needed when the arg (a memref) does not already have a GPU memspace; USM vs. device memory shouldn't matter in this case. A better way would be to use https://mlir.llvm.org/docs/Dialects/GPU/#gpu-address-spaces to mark GPU or USM memory args, and then have this pass skip the func-arg alloc and copy based on the address-space attribute.

If the memory is USM shared and we mark it with a GPU address space, the copy on the device side is indeed skipped, but memory accesses on the host side will still need an additional explicit copy.
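
As a hedged illustration of that host-side cost (hypothetical IR, not from this PR): once the argument type carries a GPU address space, host code holding ordinary memrefs has to bridge the address spaces around every use, which is exactly the step a USM buffer is supposed to avoid.

// If the shared (USM) argument were typed with a GPU address space ...
func.func private @offload(memref<8xf32, #gpu.address_space<global>>)

// ... the host would need an explicit bridge (a cast or a copy) at each
// call site, even though a USM allocation is already accessible from both
// host and device:
func.func @host() {
  %buf = memref.alloc() : memref<8xf32>
  %dev = memref.memory_space_cast %buf
       : memref<8xf32> to memref<8xf32, #gpu.address_space<global>>
  call @offload(%dev) : (memref<8xf32, #gpu.address_space<global>>) -> ()
  return
}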

@kurapov-peter

> I don't think this is the right solution. If a caller function's args are used directly, GPU memory allocation and a copy are only needed when the arg (a memref) does not already have a GPU memspace; USM vs. device memory shouldn't matter in this case. A better way would be to use https://mlir.llvm.org/docs/Dialects/GPU/#gpu-address-spaces to mark GPU or USM memory args, and then have this pass skip the func-arg alloc and copy based on the address-space attribute.

Marking host-side USM memory with a GPU address space is semantically incorrect: you would be relying on the CPU lowering to ignore a GPU address space that shouldn't have been there in the first place. Marking only the callee (which is what actually happens after the lowering) would be feasible if not for the type-checking between the kernel caller and the callee. This option is a simple way to avoid the problem and does not change the default behavior.
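
For the type-checking point, a small hypothetical example of what goes wrong when only the callee is annotated (illustrative IR, not from this PR):

// Kernel callee annotated with the GPU address space ...
gpu.module @kernels {
  gpu.func @add(%a: memref<8xf32, #gpu.address_space<global>>) kernel {
    gpu.return
  }
}
// ... while the host caller still passes a plain memref, so the operand
// type no longer matches the kernel signature and the launch fails to
// verify (or needs an explicit cast at every call site):
func.func @caller(%arg0: memref<8xf32>) {
  %c1 = arith.constant 1 : index
  gpu.launch_func @kernels::@add blocks in (%c1, %c1, %c1)
      threads in (%c1, %c1, %c1) args(%arg0 : memref<8xf32>)
  return
}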

@silee2 (Contributor) commented Oct 11, 2024

> I don't think this is the right solution. If a caller function's args are used directly, GPU memory allocation and a copy are only needed when the arg (a memref) does not already have a GPU memspace; USM vs. device memory shouldn't matter in this case. A better way would be to use https://mlir.llvm.org/docs/Dialects/GPU/#gpu-address-spaces to mark GPU or USM memory args, and then have this pass skip the func-arg alloc and copy based on the address-space attribute.

> Marking host-side USM memory with a GPU address space is semantically incorrect: you would be relying on the CPU lowering to ignore a GPU address space that shouldn't have been there in the first place. Marking only the callee (which is what actually happens after the lowering) would be feasible if not for the type-checking between the kernel caller and the callee. This option is a simple way to avoid the problem and does not change the default behavior.

Makes sense.
Unfortunately, MLIR does not have an address-space attribute for USM memory.
The pass option is not fail-safe either: it assumes that all directly forwarded caller args are USM memory. There is no way for the pass to check this, so the user has to make that guarantee, especially when multiple callers and callees are involved. I will approve, but would like to see one more approval before merging.

@silee2 (Contributor) left a comment

Found one thing I overlooked: please add a test case for the new option "is-usm-args".

@kurapov-peter

> I don't think this is the right solution. If a caller function's args are used directly, GPU memory allocation and a copy are only needed when the arg (a memref) does not already have a GPU memspace; USM vs. device memory shouldn't matter in this case. A better way would be to use https://mlir.llvm.org/docs/Dialects/GPU/#gpu-address-spaces to mark GPU or USM memory args, and then have this pass skip the func-arg alloc and copy based on the address-space attribute.

> Marking host-side USM memory with a GPU address space is semantically incorrect: you would be relying on the CPU lowering to ignore a GPU address space that shouldn't have been there in the first place. Marking only the callee (which is what actually happens after the lowering) would be feasible if not for the type-checking between the kernel caller and the callee. This option is a simple way to avoid the problem and does not change the default behavior.

> Makes sense.
>
> Unfortunately, MLIR does not have an address-space attribute for USM memory.
>
> The pass option is not fail-safe either: it assumes that all directly forwarded caller args are USM memory. There is no way for the pass to check this, so the user has to make that guarantee, especially when multiple callers and callees are involved. I will approve, but would like to see one more approval before merging.

Actually, if we care about guarantees, there is the option of inserting runtime checks. For the case at hand we know for sure those are USM allocations, so that would just be overhead.

As for a USM address space, that could be a viable solution in the long run. However, we'd need to revisit the LLVM type converter's behavior, since it makes assumptions about the default address space.

@zhczhong (Member, Author)

> Found one thing I overlooked: please add a test case for the new option "is-usm-args".

Thanks for the reminder! A test case has been added.

@mshahneo (Contributor)

Hi @zhczhong,

An important addition, thank you so much :).
Could you please add a bf16 end-to-end test case? It can be a very simple test, but one that uses the bf16-to-gpu pass; I wanted to check how the option interacts with the bf16-to-gpu transformation. A reference for a simple test case is https://github.com/intel/mlir-extensions/blob/main/test/Integration/Dialect/Gpu/EltwiseAdd_BF16.mlir.

You could create a similar USM version of that test, or something else :)

@mshahneo (Contributor)

> I don't think this is the right solution. If a caller function's args are used directly, GPU memory allocation and a copy are only needed when the arg (a memref) does not already have a GPU memspace; USM vs. device memory shouldn't matter in this case. A better way would be to use https://mlir.llvm.org/docs/Dialects/GPU/#gpu-address-spaces to mark GPU or USM memory args, and then have this pass skip the func-arg alloc and copy based on the address-space attribute.

> Marking host-side USM memory with a GPU address space is semantically incorrect: you would be relying on the CPU lowering to ignore a GPU address space that shouldn't have been there in the first place. Marking only the callee (which is what actually happens after the lowering) would be feasible if not for the type-checking between the kernel caller and the callee. This option is a simple way to avoid the problem and does not change the default behavior.

> Makes sense. Unfortunately, MLIR does not have an address-space attribute for USM memory. The pass option is not fail-safe either: it assumes that all directly forwarded caller args are USM memory. There is no way for the pass to check this, so the user has to make that guarantee, especially when multiple callers and callees are involved. I will approve, but would like to see one more approval before merging.

This may be out of scope for this PR, but the concern SangIk raised is a valid one: this option lets us skip creating gpu.allocs and memcpys for USM memory, but it also introduces a very tricky pitfall. If the user enables it, they need to be absolutely sure that every host-function argument passed to the device is USM memory; otherwise the result is a runtime crash or segfault.

What do you think about adding an attribute to XeGPU (similar to XeGPU_MemoryScope) for USM and marking memrefs with it whenever we pass a USM memref to a function? insert-gpu-alloc could in turn use that attribute.
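Purely as a sketch of that proposal (the attribute below does not exist today; #xegpu.usm is a made-up name used only for illustration):

// Hypothetical only: no such attribute exists; the name is invented here.
// The front end would tag USM memrefs with a dedicated memory-space
// attribute when building the host function:
//
//   func.func @test(%arg0: memref<10x20xbf16, #xegpu.usm>,
//                   %arg1: memref<10x20xbf16, #xegpu.usm>) { ... }
//
// insert-gpu-alloc would then skip the gpu.alloc/gpu.memcpy rewrite only
// for arguments carrying that attribute, instead of relying on a global
// pass option that has to apply to every argument.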

@zhczhong (Member, Author)

> Hi @zhczhong,
>
> An important addition, thank you so much :). Could you please add a bf16 end-to-end test case? It can be a very simple test, but one that uses the bf16-to-gpu pass; I wanted to check how the option interacts with the bf16-to-gpu transformation. A reference for a simple test case is https://github.com/intel/mlir-extensions/blob/main/test/Integration/Dialect/Gpu/EltwiseAdd_BF16.mlir.
>
> You could create a similar USM version of that test, or something else :)

Thanks for the suggestion! The test has been added, but bf16-to-gpu indeed fails to process the USM case:

BF16ToGPU/EltwiseAdd.bf16.mlir:81:5: error: 'func.return' op has 1 operands, but enclosing function (@test) returns 0
    return %alloc : memref<10x20xbf16>
    ^
"builtin.module"() <{sym_name = "eltwise_add_usm"}> ({
    "memref.global"() <{constant, initial_value = dense<5.000000e-01> : tensor<10x20xbf16>, sym_name = "__constant_10x20xbf16", sym_visibility = "private", type = memref<10x20xbf16>}> : () -> ()
    "func.func"() <{function_type = (memref<10x20xi16>, memref<10x20xi16>) -> (), sym_name = "test"}> ({
    ^bb0(%arg3: memref<10x20xi16>, %arg4: memref<10x20xi16>):
      %23 = "arith.constant"() <{value = 20 : index}> : () -> index
      %24 = "arith.constant"() <{value = 10 : index}> : () -> index
      %25 = "arith.constant"() <{value = 1 : index}> : () -> index
      %26 = "arith.constant"() <{value = 0 : index}> : () -> index
      %27 = "gpu.alloc"() <{hostShared, operandSegmentSizes = array<i32: 0, 0, 0>}> : () -> memref<400xi8>
      %28 = "memref.view"(%27, %26) : (memref<400xi8>, index) -> memref<10x20xbf16>
      %29 = "memref.view"(%27, %26) : (memref<400xi8>, index) -> memref<10x20xi16>
      "gpu.launch_func"(%24, %23, %25, %25, %25, %25, %arg3, %arg4, %29) <{kernel = @test_kernel::@test_kernel, operandSegmentSizes = array<i32: 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 3, 0>}> : (index, index, index, index, index, index, memref<10x20xi16>, memref<10x20xi16>, memref<10x20xi16>) -> ()
      %30 = "memref.alloc"() <{operandSegmentSizes = array<i32: 0, 0>}> : () -> memref<10x20xbf16>
      "memref.copy"(%28, %30) : (memref<10x20xbf16>, memref<10x20xbf16>) -> ()
      "gpu.dealloc"(%27) : (memref<400xi8>) -> ()
      "func.return"(%30) : (memref<10x20xbf16>) -> ()
    }) : () -> ()
    "gpu.module"() <{sym_name = "test_kernel"}> ({
      "gpu.func"() <{function_type = (memref<10x20xi16>, memref<10x20xi16>, memref<10x20xi16>) -> ()}> ({
      ^bb0(%arg0: memref<10x20xi16>, %arg1: memref<10x20xi16>, %arg2: memref<10x20xi16>):
        %4 = "gpu.block_id"() <{dimension = #gpu<dim x>}> : () -> index
        %5 = "gpu.block_id"() <{dimension = #gpu<dim y>}> : () -> index
        %6 = "arith.constant"() <{value = 16128 : i16}> : () -> i16
        %7 = "memref.load"(%arg0, %4, %5) : (memref<10x20xi16>, index, index) -> i16
        %8 = "memref.load"(%arg1, %4, %5) : (memref<10x20xi16>, index, index) -> i16
        %9 = "arith.bitcast"(%7) : (i16) -> bf16
        %10 = "arith.extf"(%9) : (bf16) -> f32
        %11 = "arith.bitcast"(%8) : (i16) -> bf16
        %12 = "arith.extf"(%11) : (bf16) -> f32
        %13 = "arith.addf"(%10, %12) <{fastmath = #arith.fastmath<none>}> : (f32, f32) -> f32
        %14 = "arith.truncf"(%13) : (f32) -> bf16
        %15 = "arith.bitcast"(%14) : (bf16) -> i16
        %16 = "arith.bitcast"(%15) : (i16) -> bf16
        %17 = "arith.extf"(%16) : (bf16) -> f32
        %18 = "arith.bitcast"(%6) : (i16) -> bf16
        %19 = "arith.extf"(%18) : (bf16) -> f32
        %20 = "arith.addf"(%17, %19) <{fastmath = #arith.fastmath<none>}> : (f32, f32) -> f32
        %21 = "arith.truncf"(%20) : (f32) -> bf16
        %22 = "arith.bitcast"(%21) : (bf16) -> i16
        "memref.store"(%22, %arg2, %4, %5) : (i16, memref<10x20xi16>, index, index) -> ()
        "gpu.return"() : () -> ()
      }) {VectorComputeFunctionINTEL, gpu.kernel, gpu.known_block_size = array<i32: 1, 1, 1>, gpu.known_grid_size = array<i32: 10, 20, 1>, spirv.entry_point_abi = #spirv.entry_point_abi<>, sym_name = "test_kernel", workgroup_attributions = 0 : i64} : () -> ()
    }) {spirv.target_env = #spirv.target_env<#spirv.vce<v1.0, [Addresses, Float16Buffer, Int64, Int16, Int8, Kernel, Linkage, Vector16, GenericPointer, Groups, Float16, Float64, AtomicFloat32AddEXT, ExpectAssumeKHR, SubgroupDispatch, VectorComputeINTEL, VectorAnyINTEL, Bfloat16ConversionINTEL], [SPV_EXT_shader_atomic_float_add, SPV_KHR_expect_assume, SPV_INTEL_vector_compute, SPV_INTEL_bfloat16_conversion]>, api=OpenCL, #spirv.resource_limits<>>} : () -> ()
    "func.func"() <{function_type = () -> (), sym_name = "main"}> ({
      %0 = "memref.get_global"() <{name = @__constant_10x20xbf16}> : () -> memref<10x20xbf16>
      %1 = "memref.get_global"() <{name = @__constant_10x20xbf16}> : () -> memref<10x20xbf16>
      %2 = "func.call"(%0, %1) <{callee = @test}> : (memref<10x20xbf16>, memref<10x20xbf16>) -> memref<10x20xbf16>
      %3 = "memref.cast"(%2) : (memref<10x20xbf16>) -> memref<*xbf16>
      "func.call"(%3) <{callee = @printMemrefBF16}> : (memref<*xbf16>) -> ()
      "func.return"() : () -> ()
    }) : () -> ()
    "func.func"() <{function_type = (memref<*xbf16>) -> (), sym_name = "printMemrefBF16", sym_visibility = "private"}> ({
    }) {llvm.emit_c_interface} : () -> ()
  }) {gpu.container_module} : () -> ()

@zhczhong zhczhong closed this Oct 15, 2024
@zhczhong zhczhong reopened this Oct 15, 2024
@zhczhong (Member, Author)

> I don't think this is the right solution. If a caller function's args are used directly, GPU memory allocation and a copy are only needed when the arg (a memref) does not already have a GPU memspace; USM vs. device memory shouldn't matter in this case. A better way would be to use https://mlir.llvm.org/docs/Dialects/GPU/#gpu-address-spaces to mark GPU or USM memory args, and then have this pass skip the func-arg alloc and copy based on the address-space attribute.

> Marking host-side USM memory with a GPU address space is semantically incorrect: you would be relying on the CPU lowering to ignore a GPU address space that shouldn't have been there in the first place. Marking only the callee (which is what actually happens after the lowering) would be feasible if not for the type-checking between the kernel caller and the callee. This option is a simple way to avoid the problem and does not change the default behavior.

> Makes sense. Unfortunately, MLIR does not have an address-space attribute for USM memory. The pass option is not fail-safe either: it assumes that all directly forwarded caller args are USM memory. There is no way for the pass to check this, so the user has to make that guarantee, especially when multiple callers and callees are involved. I will approve, but would like to see one more approval before merging.

> This may be out of scope for this PR, but the concern SangIk raised is a valid one: this option lets us skip creating gpu.allocs and memcpys for USM memory, but it also introduces a very tricky pitfall. If the user enables it, they need to be absolutely sure that every host-function argument passed to the device is USM memory; otherwise the result is a runtime crash or segfault.
>
> What do you think about adding an attribute to XeGPU (similar to XeGPU_MemoryScope) for USM and marking memrefs with it whenever we pass a USM memref to a function? insert-gpu-alloc could in turn use that attribute.

Yes, marking the memref with a USM memory space would make the semantics correct and the pass behavior more precise, but it also requires the corresponding change on the front-end side. @kurapov-peter, what do you think about this approach?

@kurapov-peter

> What do you think about adding an attribute to XeGPU (similar to XeGPU_MemoryScope) for USM and marking memrefs with it whenever we pass a USM memref to a function? insert-gpu-alloc could in turn use that attribute.

I agree there should be a proper attribute that ensures type consistency for USM buffers (and does not violate host-side semantics). XeGPU does not seem like a good place for it, though, since that would be the same as putting a GPU attribute on a host-side buffer. Host-device shared memory is not unique to Intel GPUs; the abstraction should be vendor-agnostic.
