Performance problem of gemm_a16w8 #12

xiaonans · 2024-07-31T10:03:49Z

I tested the performance of the gemm_a16w8 kernel on AMD MI200 and found that, when M is large, it is slower than both PyTorch (rocBLAS) and Triton's GEMM example (https://github.com/xiaonans/triton-gemm-benchmark/blob/main/03-matrix-multiplication.py).

I attached my performance testing results below:

In my performance testing, I added some code so that autotuning runs only on the first run, and later benchmarks reuse the saved best_config. My changes are in main...xiaonans:FLASHNN:main. I ran the test with python tests/quant_gemm/test_gemm_weight_only.py.
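For readers who want to reproduce the "autotune once, then benchmark with the saved config" workflow, here is a minimal sketch of the idea. It is not the code from my branch: `pick_best_config`, `run_kernel`, `configs`, and `cache_path` are hypothetical names used only for illustration, and it uses wall-clock timing from the standard library. For a real GPU kernel, each timed region would also need a device synchronization (for example `torch.cuda.synchronize()`) so that the measurement covers kernel execution and not just the launch.

```python
import json
import os
import time


def pick_best_config(run_kernel, configs, cache_path, repeats=5):
    """Return the fastest config, reusing a cached result when one exists.

    run_kernel is a hypothetical callable that accepts the config entries as
    keyword arguments; configs is a list of plain dicts.
    """
    # Later runs: skip tuning entirely and load the saved best config.
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            return json.load(f)

    # First run: time every candidate and remember the fastest one.
    best_config, best_time = None, float("inf")
    for config in configs:
        run_kernel(**config)  # warm-up call, excluded from the timing
        start = time.perf_counter()
        for _ in range(repeats):
            run_kernel(**config)
        elapsed = (time.perf_counter() - start) / repeats
        if elapsed < best_time:
            best_config, best_time = config, elapsed

    # Persist the winner so the benchmark run does not pay for tuning again.
    with open(cache_path, "w") as f:
        json.dump(best_config, f)
    return best_config
```

Keeping tuning and benchmarking separate this way means the benchmark numbers do not include autotuning overhead. The saved config is only valid for the problem shape and device it was tuned on.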

I want to ask whether my performance testing results are expected, or whether there is something I missed.
