add an option in insertGpuAllocs to skip the func args copy #916
base: main
Conversation
I don't think this is the right solution.
If the memory is USM-shared and we mark it with gpu-address-spaces, the copy on the device side is indeed skipped, but memory access on the host side will still need an additional explicit copy.
Marking host-side USM memory with a GPU address space is semantically incorrect. You are relying on the CPU lowering to ignore a GPU address space that shouldn't have been there in the first place. The solution of marking the callee only (which is what actually happens after lowering) would be feasible if not for the type checking between the kernel caller and the callee. This option is a simple way to avoid the problem and does not change the default behavior.
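For a concrete picture of the host-side issue, here is a minimal sketch (the function and types are illustrative, not from this PR; the attribute spelling follows the upstream #gpu.address_space attribute):
// Sketch: if a USM buffer the host also reads were typed with a GPU address
// space, the device-side copy could be skipped, but every host-side access
// would now see a device-address-space memref:
func.func @host_use(%buf: memref<10x20xf32, #gpu.address_space<global>>) -> f32 {
  %c0 = arith.constant 0 : index
  // The CPU lowering would have to either ignore the address space or insert
  // an explicit copy into a plain host memref before this load.
  %v = memref.load %buf[%c0, %c0] : memref<10x20xf32, #gpu.address_space<global>>
  return %v : f32
}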
Makes sense.
Found one thing I overlooked.
Please add a test case for the new option "is-usm-args".
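For reference, a minimal sketch of such a test's header (the pass registration name and the exact option spelling are assumed from this thread and may differ):
// RUN: imex-opt --insert-gpu-allocs='client-api=opencl is-usm-args=true' %s | FileCheck %s
// With the option enabled, no staging copy should be emitted for the USM args:
// CHECK-NOT: gpu.memcpy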
Actually, if we care about guarantees, there's an option to insert runtime checks. For the case at hand we know for sure those are USM allocations, so this is just overhead. As to a USM address space, this could be a viable solution in the long run. However, we'd need to revisit the LLVM type converter behavior, as it makes assumptions about the default address space.
Thanks for the reminder! A test case has been added.
Hi @zhczhong, an important addition, thank you so much :). You could create a similar USM version of this test, or something else entirely :)
This may be out of scope for this PR, but what do you think about adding an attribute to XeGPU (similar to XeGPU_MemoryScope) for USM, and marking memrefs with this attribute whenever we pass a USM memref to a function? insert-gpu-alloc could in turn use that attribute. A sketch follows.
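Roughly, a hypothetical sketch of that idea (the #xegpu.usm attribute below does not exist today; its name and placement are purely illustrative):
// A front end would tag USM allocations on the memref type, and
// insert-gpu-allocs would skip the host<->device staging for such args.
func.func @test(%arg0: memref<10x20xi16, #xegpu.usm>) {
  // %arg0 could be passed to gpu.launch_func directly,
  // with no gpu.alloc/gpu.memcpy pair.
  return
}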
Thanks for the suggestion! The test has been added, but it currently fails with the following error:
BF16ToGPU/EltwiseAdd.bf16.mlir:81:5: error: 'func.return' op has 1 operands, but enclosing function (@test) returns 0
return %alloc : memref<10x20xbf16>
^
"builtin.module"() <{sym_name = "eltwise_add_usm"}> ({
"memref.global"() <{constant, initial_value = dense<5.000000e-01> : tensor<10x20xbf16>, sym_name = "__constant_10x20xbf16", sym_visibility = "private", type = memref<10x20xbf16>}> : () -> ()
"func.func"() <{function_type = (memref<10x20xi16>, memref<10x20xi16>) -> (), sym_name = "test"}> ({
^bb0(%arg3: memref<10x20xi16>, %arg4: memref<10x20xi16>):
%23 = "arith.constant"() <{value = 20 : index}> : () -> index
%24 = "arith.constant"() <{value = 10 : index}> : () -> index
%25 = "arith.constant"() <{value = 1 : index}> : () -> index
%26 = "arith.constant"() <{value = 0 : index}> : () -> index
%27 = "gpu.alloc"() <{hostShared, operandSegmentSizes = array<i32: 0, 0, 0>}> : () -> memref<400xi8>
%28 = "memref.view"(%27, %26) : (memref<400xi8>, index) -> memref<10x20xbf16>
%29 = "memref.view"(%27, %26) : (memref<400xi8>, index) -> memref<10x20xi16>
"gpu.launch_func"(%24, %23, %25, %25, %25, %25, %arg3, %arg4, %29) <{kernel = @test_kernel::@test_kernel, operandSegmentSizes = array<i32: 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 3, 0>}> : (index, index, index, index, index, index, memref<10x20xi16>, memref<10x20xi16>, memref<10x20xi16>) -> ()
%30 = "memref.alloc"() <{operandSegmentSizes = array<i32: 0, 0>}> : () -> memref<10x20xbf16>
"memref.copy"(%28, %30) : (memref<10x20xbf16>, memref<10x20xbf16>) -> ()
"gpu.dealloc"(%27) : (memref<400xi8>) -> ()
"func.return"(%30) : (memref<10x20xbf16>) -> ()
}) : () -> ()
"gpu.module"() <{sym_name = "test_kernel"}> ({
"gpu.func"() <{function_type = (memref<10x20xi16>, memref<10x20xi16>, memref<10x20xi16>) -> ()}> ({
^bb0(%arg0: memref<10x20xi16>, %arg1: memref<10x20xi16>, %arg2: memref<10x20xi16>):
%4 = "gpu.block_id"() <{dimension = #gpu<dim x>}> : () -> index
%5 = "gpu.block_id"() <{dimension = #gpu<dim y>}> : () -> index
%6 = "arith.constant"() <{value = 16128 : i16}> : () -> i16
%7 = "memref.load"(%arg0, %4, %5) : (memref<10x20xi16>, index, index) -> i16
%8 = "memref.load"(%arg1, %4, %5) : (memref<10x20xi16>, index, index) -> i16
%9 = "arith.bitcast"(%7) : (i16) -> bf16
%10 = "arith.extf"(%9) : (bf16) -> f32
%11 = "arith.bitcast"(%8) : (i16) -> bf16
%12 = "arith.extf"(%11) : (bf16) -> f32
%13 = "arith.addf"(%10, %12) <{fastmath = #arith.fastmath<none>}> : (f32, f32) -> f32
%14 = "arith.truncf"(%13) : (f32) -> bf16
%15 = "arith.bitcast"(%14) : (bf16) -> i16
%16 = "arith.bitcast"(%15) : (i16) -> bf16
%17 = "arith.extf"(%16) : (bf16) -> f32
%18 = "arith.bitcast"(%6) : (i16) -> bf16
%19 = "arith.extf"(%18) : (bf16) -> f32
%20 = "arith.addf"(%17, %19) <{fastmath = #arith.fastmath<none>}> : (f32, f32) -> f32
%21 = "arith.truncf"(%20) : (f32) -> bf16
%22 = "arith.bitcast"(%21) : (bf16) -> i16
"memref.store"(%22, %arg2, %4, %5) : (i16, memref<10x20xi16>, index, index) -> ()
"gpu.return"() : () -> ()
}) {VectorComputeFunctionINTEL, gpu.kernel, gpu.known_block_size = array<i32: 1, 1, 1>, gpu.known_grid_size = array<i32: 10, 20, 1>, spirv.entry_point_abi = #spirv.entry_point_abi<>, sym_name = "test_kernel", workgroup_attributions = 0 : i64} : () -> ()
}) {spirv.target_env = #spirv.target_env<#spirv.vce<v1.0, [Addresses, Float16Buffer, Int64, Int16, Int8, Kernel, Linkage, Vector16, GenericPointer, Groups, Float16, Float64, AtomicFloat32AddEXT, ExpectAssumeKHR, SubgroupDispatch, VectorComputeINTEL, VectorAnyINTEL, Bfloat16ConversionINTEL], [SPV_EXT_shader_atomic_float_add, SPV_KHR_expect_assume, SPV_INTEL_vector_compute, SPV_INTEL_bfloat16_conversion]>, api=OpenCL, #spirv.resource_limits<>>} : () -> ()
"func.func"() <{function_type = () -> (), sym_name = "main"}> ({
%0 = "memref.get_global"() <{name = @__constant_10x20xbf16}> : () -> memref<10x20xbf16>
%1 = "memref.get_global"() <{name = @__constant_10x20xbf16}> : () -> memref<10x20xbf16>
%2 = "func.call"(%0, %1) <{callee = @test}> : (memref<10x20xbf16>, memref<10x20xbf16>) -> memref<10x20xbf16>
%3 = "memref.cast"(%2) : (memref<10x20xbf16>) -> memref<*xbf16>
"func.call"(%3) <{callee = @printMemrefBF16}> : (memref<*xbf16>) -> ()
"func.return"() : () -> ()
}) : () -> ()
"func.func"() <{function_type = (memref<*xbf16>) -> (), sym_name = "printMemrefBF16", sym_visibility = "private"}> ({
}) {llvm.emit_c_interface} : () -> ()
}) {gpu.container_module} : () -> ()
Yes, marking the memref with a USM memory space could make the semantics correct and make the pass behavior more precise. But it also requires the front end to make the corresponding change. @kurapov-peter, what do you think about this approach?
I agree there should be a proper attribute that would ensure type consistency for USM buffers (and not violate host-side semantics). XeGPU does not seem like a good place for it, since that would be the same as putting a GPU attribute on a host-side buffer. Host-device shared memory is not unique to Intel GPUs; the abstraction should be vendor-agnostic.
Add an option in insertGpuAllocs to skip the func args copy, so that the user can decide whether a gpu.memcpy is needed for the input args.
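Roughly, the intended effect (a simplified sketch using upstream GPU dialect ops, not the pass's literal output). By default, the pass stages each kernel argument through a device allocation:
%c1 = arith.constant 1 : index
// Allocate a device buffer, copy the argument in, launch, copy back, free.
%gpu = gpu.alloc () : memref<10x20xi16>
gpu.memcpy %gpu, %arg0 : memref<10x20xi16>, memref<10x20xi16>
gpu.launch_func @test_kernel::@test_kernel
    blocks in (%c1, %c1, %c1) threads in (%c1, %c1, %c1)
    args(%gpu : memref<10x20xi16>)
gpu.memcpy %arg0, %gpu : memref<10x20xi16>, memref<10x20xi16>
gpu.dealloc %gpu : memref<10x20xi16>
With the new option enabled, %arg0 is assumed to be USM-accessible and is passed to gpu.launch_func directly; the gpu.alloc/gpu.memcpy/gpu.dealloc sequence is not emitted.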