CUDA: faster q2_K, q3_K MMQ + int8 tensor cores #7921
Conversation
I am not sure how big of a problem this is, but extending the
Sorry, it seems my testing setup for precision was broken due to #7809; I had checked the KL divergence using
For q2_K_M I think this is fine, but for q2_K_S I would like a better solution; it would also be possible to quantize the activations as q8_1 with a block size of 16 in the first place, but that would require comparatively more changes. Since q2_K_S is, I think, currently not worth using over iq2_M and iq2_S anyway, the way I would like to proceed is to merge this PR as-is, optimize MMQ for those quant formats that are already implemented, figure out how to best refactor and simplify the code, and then add the unsupported quantization formats, at which point I will revisit this.
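For reference, quantizing activations to a q8_1-style format with a halved block size of 16 could look roughly like this (an illustrative sketch only; the struct and function names are hypothetical, not code from this PR):

```cuda
#include <cmath>
#include <cstdint>

#define QK8_1_16 16 // hypothetical halved block size

// Hypothetical block layout: scale, scaled sum, and 16 quantized values.
struct block_q8_1_16 {
    float  d;            // scale
    float  s;            // d * (sum of the quantized values)
    int8_t qs[QK8_1_16]; // quantized values
};

// Quantize a row of k floats (k must be a multiple of QK8_1_16).
static void quantize_row_q8_1_16(const float * x, block_q8_1_16 * y, const int k) {
    for (int ib = 0; ib < k/QK8_1_16; ++ib) {
        float amax = 0.0f; // absolute max within the block
        for (int j = 0; j < QK8_1_16; ++j) {
            amax = fmaxf(amax, fabsf(x[ib*QK8_1_16 + j]));
        }

        const float d  = amax / 127.0f;
        const float id = d != 0.0f ? 1.0f/d : 0.0f;

        int sum = 0;
        for (int j = 0; j < QK8_1_16; ++j) {
            const int8_t q = (int8_t) roundf(x[ib*QK8_1_16 + j]*id);
            y[ib].qs[j] = q;
            sum += q;
        }
        y[ib].d = d;
        y[ib].s = d*sum;
    }
}
```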
racecheck reports an error with mmq:
I also noticed that the precision issue rarely happens with low batch sizes (mmvq).
In this particular case it doesn't matter because multiple threads write the same value multiple times (that's faster than a conditional statement). But I think I can rewrite the code in such a way that this doesn't happen (at the cost of more register usage which at this point in the kernel is probably irrelevant).
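To illustrate the pattern in question (a minimal standalone sketch, not the actual MMQ code):

```cuda
// Pattern flagged by racecheck: all threads of the block store the same
// value to the same address. The result is deterministic in practice
// because every write is identical, but it is still reported as a hazard.
__global__ void store_racy(float * dst, const float val) {
    dst[blockIdx.x] = val;
}

// Race-free alternative: guard the store so that only one thread writes,
// at the cost of a conditional statement.
__global__ void store_guarded(float * dst, const float val) {
    if (threadIdx.x == 0) {
        dst[blockIdx.x] = val;
    }
}
```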
That sounds to me like a different issue. What command are you using for testing?
I'm using
But this test fails almost always (batch size=32):
I believe this still classifies as UB. But regardless, it would be better to not have any data races, even if they are benign in practice.
The idea I had turned out to be slower than a conditional statement so I used that instead.
How does that relate to MMVQ? That kernel is only used for batch sizes <= 8. Unrelated to that, should we maybe set a seed for the precision tests? (To my understanding the data that is used for testing is currently unseeded RNG output.) I think that would make it easier to differentiate between race conditions and other bugs.
Sorry, what I meant was that the issue does not happen when mmvq is used, only with mmq. I am assuming that mmvq also uses the new q8_1 formats and is affected by these changes, but maybe I am wrong.
No, MMVQ uses the exact same data layout that it did prior to my changes.
I would prefer if the CI tests are run with random seeds. I am concerned that if we always use the same data, then some bugs will never be detected just because they happen to work with the random values that were generated. If you think that having a fixed seed would be useful during development, maybe instead we could add a command line option to set the seed; see the sketch below.

Currently, some quant formats are only tested with bs=1. We should add another test for the

The increase in ppl also seems very high to me. But I do not think that very low BPW formats like q2_K have many practical applications anyway; even with very large models the error is usually too high to be usable. So I am not even convinced that it is worth spending much effort on this.
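A rough sketch of such a seed option (hypothetical; the flag handling and names are illustrative, not an actual interface of test-backend-ops):

```cuda
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <random>

int main(int argc, char ** argv) {
    // Default to a random seed so CI still exercises varying data:
    uint32_t seed = std::random_device{}();

    // Optionally override it from the command line for reproducibility:
    if (argc > 1) {
        seed = (uint32_t) std::strtoul(argv[1], nullptr, 10);
    }

    // Always print the seed so a failing run can be replayed:
    printf("seed: %u\n", (unsigned) seed);

    // The test data would then be generated from the seeded RNG:
    std::mt19937 rng(seed);
    std::uniform_real_distribution<float> dist(-1.0f, 1.0f);
    printf("first value: %f\n", dist(rng));
    return 0;
}
```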
A fair point. My first choice for testing correctness is
We should also test with batch sizes that are not powers of 2 since those are typically the ones that suffer from out-of-bounds issues.
If this is causing too many issues I can for now revert the implementation back to the way it was before (on-the-fly calculation of the per-halfblock sums). It would be a simple change since I know exactly how to do it (but the performance would be worse).
I reverted the changes related to q2_K precision. Precision looks like this:
The performance will be worse since you now need to do twice as many int8 operations.
Performance vs. master MMQ
Performance vs. master cuBLAS
As of right now the performance difference when using tensor cores is relatively small (~10%) since the overall utilization is poor. It will likely become more of a bottleneck as more optimizations are added.
For me, q2_K_M is about 12% slower with bs=512. I think it is still a good tradeoff compared to the memory cost of cuBLAS, since the only reason to use q2_K in the first place is in memory-limited situations. Especially since current models tend to have very large output tensors, which increases memory usage.
Just out of curiosity, what is Q2_K_M? This PR and a couple of others from Johannes are the only places I see it mentioned.
It's just what llama.cpp calls "q2_K", as opposed to q2_K_S.
This PR overhauls the q2_K and q3_K mul_mat_q kernels and adds int8 tensor core support. Now that all MMQ-supported quantization formats have int8 tensor core support it was possible to simplify the code a little. To make q2_K work I had to implement an alternative data format for q8_1 where instead of the per-block sum the per-halfblock sums are stored (as int8 relative to the max value). The precision loss from doing this seems to be negligible.
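As a rough illustration of the layout change (field names and the exact encoding are illustrative, not the structs from this PR):

```cuda
#include <cstdint>
#include <cuda_fp16.h>

#define QK8_1 32 // values per q8_1 block

// Existing q8_1 block: one scale and one per-block sum.
struct block_q8_1 {
    half2  ds;        // ds.x = scale d, ds.y = d * (sum of the quantized values)
    int8_t qs[QK8_1]; // quantized values
};

// Alternative layout sketched for q2_K: instead of the single per-block
// sum, store the two per-halfblock (16 value) sums as int8 values
// relative to the maximum, alongside the scale.
struct block_q8_1_hs {
    half   d;         // scale
    int8_t sums[2];   // per-halfblock sums, stored relative to the max
    int8_t qs[QK8_1]; // quantized values
};
```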
ggml-cuda.cu now queries and stores the maximum opt-in shared memory per streaming multiprocessor. This is the actual value that should be used as the upper limit for shared memory for kernel launches (raising the limit requires an additional, explicit function call).
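The CUDA mechanism referenced here, as a minimal sketch (kernel, launch configuration, and error handling are placeholders):

```cuda
#include <cuda_runtime.h>

__global__ void mmq_kernel_placeholder() {
    extern __shared__ char data[]; // dynamic shared memory
}

void launch_with_max_smem(const int device) {
    // Query the opt-in maximum, which on recent architectures is larger
    // than the default limit for regular kernel launches:
    int smem_max = 0;
    cudaDeviceGetAttribute(&smem_max, cudaDevAttrMaxSharedMemoryPerBlockOptin, device);

    // The limit must be raised explicitly, per kernel, before launching
    // with more dynamic shared memory than the default allows:
    cudaFuncSetAttribute(mmq_kernel_placeholder,
        cudaFuncAttributeMaxDynamicSharedMemorySize, smem_max);

    mmq_kernel_placeholder<<<1, 256, smem_max>>>();
}
```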
Performance vs. master MMQ
Performance vs. master cuBLAS
The performance on Ampere and Ada Lovelace seems to still be suboptimal relative to FP16 cuBLAS GEMM.