[LLVMGPU][ROCM] Disable polynomial approximation and use device libs #19672
Conversation
The device lib implementation is selected by the `convertToROCDL` pass. This implementation is much more efficient than the polynomial approximation in MLIR.
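For context, a polynomial-approximation lowering replaces a transcendental call with a fixed-degree polynomial or rational evaluation. The sketch below is illustrative only (these are not MLIR's actual coefficients): it uses the Padé [3/3] approximant of tanh, which is cheap and accurate near zero but drifts badly for large inputs, one reason a tuned device-lib routine can win.

```python
import math

def tanh_pade(x: float) -> float:
    """Illustrative rational (Pade [3/3]) approximation of tanh.

    NOT the approximation MLIR uses; it just shows the shape of a
    polynomial/rational lowering: a few multiplies and one division,
    accurate near 0 but inaccurate for large |x|.
    """
    x2 = x * x
    return x * (15.0 + x2) / (15.0 + 6.0 * x2)

# Accurate for small inputs, so real lowerings clamp/range-reduce first
# to stay in the region where the approximant is valid.
print(abs(tanh_pade(0.5) - math.tanh(0.5)) < 1e-3)  # True
print(abs(tanh_pade(5.0) - math.tanh(5.0)) > 0.1)   # True: diverges
```

A production lowering adds range reduction and clamping around such a kernel; the device libs instead hand-tune per target.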
Is there an issue tracking improving the MLIR one?
Cool, would be good to get on the docket. All other targets use the MLIR one, and relying on the device libs isn't great long-term. Nice job finding the delta; now we have something to target :)
I think there is a deep truth disguised as an accident here: a generic polynomial approximation won't in general be the most efficient implementation on a given target. Achieving optimal results requires looking at the specifics of each math function and each target. For example, on gfx9, it so happens that 1/x and exp(x) are cheap to evaluate, defeating the basic assumption underpinning polynomial approximation, at least for functions such as tanh(x) which are easy to evaluate by 1/x and exp(x) steps.
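The gfx9 observation can be made concrete: if exp and the reciprocal are cheap, tanh reduces to a couple of those primitives. A hypothetical scalar sketch (not the device-lib code):

```python
import math

def tanh_via_exp(x: float) -> float:
    """tanh(x) = (e - 1) / (e + 1) with e = exp(2x): one exp, one division.

    exp(2x) overflows for large positive x, so fold positive inputs
    through the identity tanh(-x) = -tanh(x) to keep the exponent
    non-positive; then e is in (0, 1] and never overflows.
    """
    if x > 0.0:
        return -tanh_via_exp(-x)
    e = math.exp(2.0 * x)
    return (e - 1.0) / (e + 1.0)
```

This covers the full input range (large |x| saturates cleanly to +/-1), which is exactly the kind of per-target trick a generic polynomial lowering can't exploit.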
@saienduri how can I repro this test failure? https://github.com/iree-org/iree/actions/runs/12715481793/job/35448326412?pr=19672#step:7:241
OK, the remaining issue is that MathToROCDL doesn't handle this case. Repro:

func.func @main(%arg0: !torch.vtensor<[2,154,6144],f16>) -> !torch.vtensor<[2,154,6144],f16> {
  %str_846 = torch.constant.str "tanh"
  %395 = torch.aten.gelu %arg0, %str_846 : !torch.vtensor<[2,154,6144],f16>, !torch.str -> !torch.vtensor<[2,154,6144],f16>
  return %395 : !torch.vtensor<[2,154,6144],f16>
}
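For context, `torch.aten.gelu` with the `"tanh"` mode is PyTorch's tanh-based GELU approximation, which is why this repro exercises `math.tanh`. A scalar f64 reference sketch (the repro itself runs on f16 tensors):

```python
import math

def gelu_exact(x: float) -> float:
    """Exact GELU: 0.5 * x * (1 + erf(x / sqrt(2)))."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    """PyTorch's 'tanh' GELU approximation, as selected by the
    "tanh" string constant in the repro above."""
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * x * (1.0 + math.tanh(c * (x + 0.044715 * x ** 3)))
```

The approximation stays within about 1e-3 of the exact form over typical activation ranges, so the op's accuracy hinges on the quality of the underlying `tanh` lowering.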
I can see a few builtins related to pow here: https://github.com/llvm/llvm-project/blob/3fbc344b49800bb0f70fd5af46c0a47f6d55bbd1/clang/lib/Headers/__clang_hip_libdevice_declares.h#L86-L87

__device__ __attribute__((pure)) float __ocml_pow_f32(float, float);
__device__ __attribute__((pure)) float __ocml_pown_f32(float, int);
That covers `fpowi`. Guess for `ipowi` you would cast the result of `fpowi` from float to int?
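One caveat with the cast idea: it is only exact while the result fits in the float mantissa; past 2^53 (for f64) the round-trip silently loses low bits. A sketch of the caveat, plus an exact integer power via square-and-multiply (an illustrative expansion, not what MathToROCDL emits):

```python
def ipowi(base: int, exp: int) -> int:
    """Exact non-negative integer power by binary exponentiation
    (square-and-multiply): O(log exp) multiplies, no rounding."""
    assert exp >= 0
    result = 1
    while exp:
        if exp & 1:
            result *= base
        base *= base
        exp >>= 1
    return result

# 3**38 is odd and larger than 2**53, so an f64 round-trip cannot be
# exact: the nearest double is a multiple of its ulp (a power of two).
print(int(3.0 ** 38) == ipowi(3, 38))  # False: the float path lost bits
print(ipowi(2, 10))                    # 1024
```

So casting is fine for small results, but an integer expansion (or a dedicated builtin) is needed for exactness across the whole input space.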
We need to make sure we are handling the whole input space, including large and negative numbers. We could also expand to muls when assumptions allow; we would have to benchmark and decide. Also, there's a bunch of packed math functions at the very bottom that MathToROCDL doesn't use but could potentially benefit from, especially with fp16. Should be lots of room for improvement!
What about we just lower …?
@krzysz00 Where can I find all the ROCDL functions? For example, those … Also, any context we should know on why …?
The implementation uses a templated pattern that asserts that the result and operand types are the same, and this is not true for …
There is just this one static assert guarding this assumption; guess we can relax it and make … This should make …
Another thing I found is that there is … This applies to …
Yeah, one of many reasons why libm-like libraries are bad (for us) is that they assume scalar everything above and below the libm call boundary. Native versions that we can represent in IR as vectors have the most potential. As we see here, a totally untuned/unoptimized vector version can't beat a highly tuned/optimized scalar version, but it's useful to keep in mind that a tuned/optimized vector version always has the potential to beat a scalar version, especially as dispatches scale (you don't want to mix vectorized and scalarized stuff in the same lowering flow, and the chance of that happening goes up a lot with fusion).
IMO we should take it one step at a time. First, let's enable lowering to the remaining device lib calls -- this will unblock this PR and fix known performance issues in IREE on mi300-series cards. Then, we can follow the other prongs concurrently:
Signed-off-by: Jakub Kuderski <[email protected]>
I added a local LLVMGPU pass that handles `fpowi` and `ipowi` only, to unblock this.
I am taking it from here to address the remaining bug in this PR. |
Failure is caused by a type mismatch while converting math functions: nested arrays are flattened to a single vector.
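The flattening fix can be sketched shape-wise: a nested vector type such as `vector<2x4xf16>` becomes `vector<8xf16>`, where the 1-D length is just the product of the dimensions (the type names here are an example, not taken from the patch).

```python
from functools import reduce
from operator import mul

def flatten_shape(shape: list[int]) -> int:
    """1-D length of a flattened multi-dimensional vector type:
    e.g. a 2x4 shape flattens to length 8."""
    return reduce(mul, shape, 1)

def flatten(nested):
    """Flatten arbitrarily nested lists into one flat list, mirroring
    how nested array elements map onto the single flat vector."""
    if not isinstance(nested, list):
        return [nested]
    out = []
    for item in nested:
        out.extend(flatten(item))
    return out
```

With both sides of the conversion agreeing on this flat shape, the converted math call and its operands have matching types.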
This patch also fixes #18570, for LLVMGPU at least.
LGTM
@@ -0,0 +1,34 @@
// Copyright 2023 The IREE Authors
nit: 2025
@kuhar This should get the PR past the compilation issues: patch.patch. I cannot submit it to your branch; can you update your branch with this patch? And then there is an issue with …
Issue: #19673