Thanks for the related updates! I just updated my kernels with the latest AOT and found that there is no big difference in performance. Here are two small questions:

1. Have you ever tested the performance of AOT-wrapped kernels? How much is the gain after reducing the launch overhead?
2. Take the kernel `attn_fwd` as an example. My first implementation of AOT + kernels (similar to `attn_fwd` but with a paged_attention setting) was based on 24a3fe9cb57. Are there any performance-boosting code changes from 24a3fe9cb57 to the latest branch? Specifically, I noticed that the original `attn_fwd.py` has been split into `fwd_kernel_common.py`, `fwd_kernel_inner.py` and `fwd_kernel.py`. Does that help AOT?

I'd appreciate it if you could take some time to help answer these questions.
> Have you ever tested the performance of AOT-wrapped kernels?
Yes, they are close in TFLOPS. The corresponding tests can be found in `test/performance_forward.py` and `tritonsrc/performance_forward.py`. (They are mostly identical, but they use different backends, because they live in different directories and consequently load different `attn_torch_function` modules.)
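For concreteness, here is a minimal sketch of the kind of timing such a comparison performs, assuming a generic `attention(q, k, v, sm_scale)` entry point; the shapes, dtype, and iteration counts are illustrative placeholders, not the repository's actual settings:

```python
# Minimal sketch of a TFLOPS measurement for a forward attention call.
# `attention` stands in for whichever entry point each performance_forward.py
# loads (e.g. via attn_torch_function); shapes and dtype are illustrative.
import torch

def bench_tflops(attention, B=4, H=16, S=2048, D=64, dtype=torch.float16, iters=100):
    q, k, v = (torch.randn(B, H, S, D, device='cuda', dtype=dtype) for _ in range(3))
    sm_scale = D ** -0.5

    # Warm up so compilation / autotuning does not pollute the timing.
    for _ in range(10):
        attention(q, k, v, sm_scale)
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        attention(q, k, v, sm_scale)
    end.record()
    torch.cuda.synchronize()

    ms_per_iter = start.elapsed_time(end) / iters
    flops = 4 * B * H * S * S * D  # QK^T and PV matmuls: 2*B*H*S*S*D FLOPs each
    return flops / (ms_per_iter * 1e-3) / 1e12
```

At sizes like these the kernels are compute-bound, so the AOT and JIT backends would be expected to land within noise of each other; any launch-overhead savings from the AOT path would mainly show up at small problem sizes, where dispatch cost is a larger fraction of the total time.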
> Are there any performance-boosting code changes from 24a3fe9cb57 to the latest branch?
A few notable changes (not merged yet):

- Bump the Triton compiler to the latest upstream.
- Migrate away from `tl.make_block_ptr`, since upstream Triton is no longer maintaining it.
- Extra autotune configs when generating the tuning database (see the sketch below for this and the previous item).
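To make the last two bullets concrete, here is a hedged sketch (not the repository's actual kernels; names such as `copy_k_kernel` and `BLOCK_N` are hypothetical) of the `tl.make_block_ptr`-free load style combined with an `@triton.autotune` sweep carrying a couple of extra configs of the kind a tuning-database regeneration would add:

```python
# Illustrative only: a tiled copy kernel that loads K with plain pointer
# arithmetic (the tl.make_block_ptr-free style) under an autotune sweep.
import torch
import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({'BLOCK_N': 32}, num_warps=2, num_stages=2),
        triton.Config({'BLOCK_N': 64}, num_warps=4, num_stages=2),
        # "Extra" configs of the kind added when regenerating the tuning database:
        triton.Config({'BLOCK_N': 128}, num_warps=4, num_stages=3),
        triton.Config({'BLOCK_N': 256}, num_warps=8, num_stages=1),
    ],
    key=['seqlen_k'],
)
@triton.jit
def copy_k_kernel(K, Out, seqlen_k,
                  stride_kn, stride_kd,
                  HEAD_DIM: tl.constexpr, BLOCK_N: tl.constexpr):
    start_n = tl.program_id(0) * BLOCK_N
    offs_n = start_n + tl.arange(0, BLOCK_N)
    offs_d = tl.arange(0, HEAD_DIM)
    # Former style (being migrated away from):
    #   k_block_ptr = tl.make_block_ptr(base=K, shape=(seqlen_k, HEAD_DIM), ...)
    #   k = tl.load(k_block_ptr, boundary_check=(0,))
    # Plain pointer arithmetic replacement:
    ptrs = K + offs_n[:, None] * stride_kn + offs_d[None, :] * stride_kd
    mask = offs_n[:, None] < seqlen_k
    k = tl.load(ptrs, mask=mask, other=0.0)
    tl.store(Out + offs_n[:, None] * stride_kn + offs_d[None, :] * stride_kd,
             k, mask=mask)

def copy_k(k: torch.Tensor) -> torch.Tensor:
    # Assumes head_dim is a power of two (tl.arange requirement).
    seqlen_k, head_dim = k.shape
    out = torch.empty_like(k)
    grid = lambda meta: (triton.cdiv(seqlen_k, meta['BLOCK_N']),)
    copy_k_kernel[grid](k, out, seqlen_k, k.stride(0), k.stride(1), HEAD_DIM=head_dim)
    return out
```

The motivation for the migration is exactly the second bullet: upstream Triton is no longer maintaining `tl.make_block_ptr`, so plain offset arithmetic is the forward-compatible form.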