vulkan: scale caching for k quants + misc fixes #11081
Conversation
RTX 4070 results. Keep in mind there's a lot of variability in the results, but at first glance it seems like an improvement for Q3_K but worse for the others:
(benchmark table not shown)
For multiple values of n I'm seeing clear improvements with Q3_K and Q6_K, but Q2_K is much less consistent and is in some cases slower than master.
PR: (table not shown)
Master: (table not shown)
I tried calculating the A * scale multiplication ahead of time for Q2_K, but it didn't do much. That should also reduce the number of shared memory reads, since the products are stored in registers. A * scale multiplication cached in registers:
(table not shown)
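For illustration, here is a generic GLSL sketch of that register-caching experiment. The buffer names, sizes, and data layout are all invented and this is not the actual Q2_K shader; it only shows the pattern of hoisting the scale products into a thread-local array so the inner FMA loop works purely on register operands.

```glsl
#version 450
#extension GL_EXT_control_flow_attributes : enable

// Generic sketch only -- buffer names, sizes, and data layout are invented
// and this is not the actual Q2_K shader. The point: products involving the
// shared scale cache are computed once into a thread-local array, so the
// inner FMA loop reads registers instead of shared memory every iteration.
layout(local_size_x = 32) in;

layout(binding = 0) readonly buffer Scales { float scales[]; };
layout(binding = 1) readonly buffer A      { float a[]; };
layout(binding = 2) readonly buffer B      { float b[]; };
layout(binding = 3) buffer Dst             { float dst[]; };

shared float sccache[16];

void main() {
    const uint tid = gl_LocalInvocationID.x;
    const uint gid = gl_GlobalInvocationID.x;

    // Cooperative scale load into shared memory (placeholder decode).
    if (tid < 16) sccache[tid] = scales[gl_WorkGroupID.x * 16 + tid];
    barrier();

    // Precompute the "A * scale" products outside the hot loop; they stay in registers.
    float prod[16];
    [[unroll]] for (uint s = 0; s < 16; ++s) {
        prod[s] = sccache[s] * a[gid];
    }

    // The accumulation loop now only touches register operands.
    float sum = 0.0;
    [[unroll]] for (uint j = 0; j < 16; ++j) {
        sum = fma(prod[j], b[gid * 16 + j], sum);
    }
    dst[gid] = sum;
}
```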
I'll post benchmarks at a later point, but this reduces performance on RTX 3090 for q2_k and q6_k. I see small improvements on Radeon Pro VII. Intel still crashes, but only in …
IMO the crash is still very likely related to the barriers in nonuniform control flow. It really needs to be fixed if we're going to use shared memory here. If the additional branches are causing too many problems, then maybe we could change how the work is spread across a workgroup so that the number of iterations is uniform, but that could also affect perf (likely making it worse, I'd guess).
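For reference, a minimal GLSL sketch of the hazard being described, with invented buffer names (not the shader in this PR): barrier() has to be reached by every invocation of the workgroup in uniform control flow, so the safe pattern guards the work around the barrier rather than returning early.

```glsl
#version 450

// Illustration only (invented names, not the shader in question): barrier()
// must be reached by all invocations of a workgroup in uniform control flow.
// An early return or a divergent branch around it is undefined behaviour and
// can hang or crash on some drivers.
layout(local_size_x = 64) in;

layout(binding = 0) buffer Data { float vals[]; };

shared float cache[64];

void main() {
    const uint tid = gl_LocalInvocationID.x;
    const uint idx = gl_GlobalInvocationID.x;
    const bool in_range = idx < uint(vals.length());

    // BROKEN pattern (kept commented out): threads past the end return early,
    // so the remaining threads reach barrier() in nonuniform control flow.
    // if (!in_range) return;

    // SAFE pattern: every thread reaches the barrier; only the surrounding
    // work is guarded.
    float v = 0.0;
    if (in_range) v = vals[idx];
    cache[tid] = v;
    barrier();
    if (in_range) vals[idx] = cache[(tid + 1) % 64] + cache[tid];
}
```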
To get rid of the branches we could just have the main i loop run with no checks as long as we have enough blocks remaining to use all threads, and then switch to a separate code path for the final multiplications. There's no need to redo the algorithm.
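A minimal GLSL sketch of that split, on a toy workload with invented names (not the actual mul_mat_vec shader): the main loop runs with no per-thread checks while a full workgroup's worth of blocks remains, and a separate checked tail handles the remainder, so every barrier stays in uniform control flow.

```glsl
#version 450

// Sketch of the loop split only, on a toy workload with invented names: the
// main loop is check-free while all threads have work, and the checked tail
// guards the data accesses rather than the barriers.
layout(local_size_x = 64) in;

layout(binding = 0) readonly buffer X { float x[]; };
layout(binding = 1) buffer Y          { float y[]; };

layout(push_constant) uniform P { uint num_blocks; };

shared float tile[64];

void main() {
    const uint tid = gl_LocalInvocationID.x;
    float sum = 0.0;

    uint i = 0;
    // Main path: every thread has a block, no checks needed.
    for (; i + gl_WorkGroupSize.x <= num_blocks; i += gl_WorkGroupSize.x) {
        tile[tid] = x[i + tid];
        barrier();
        sum += tile[(tid + 1) % gl_WorkGroupSize.x];  // stand-in for the real per-block work
        barrier();
    }

    // Tail path: fewer blocks than threads; guard the data, not the barrier.
    float v = 0.0;
    if (i + tid < num_blocks) v = x[i + tid];
    tile[tid] = v;
    barrier();
    if (i + tid < num_blocks) sum += tile[(tid + 1) % gl_WorkGroupSize.x];

    y[gl_GlobalInvocationID.x] = sum;
}
```

The per-thread checks are only paid in the tail, so the common case runs as before.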
Okay I've fixed up Q6_K to handle the early return case, and it's now running at 23.3 t/s with a few extra tweaks. @0cc4m can you try this on Intel to see if it prevents the crash?
I tested the latest Q6_K changes on RTX 4070. For llama-bench with llama-2-7b.Q6_K, the perf is basically unchanged, which is not surprising since it's just memory bandwidth-limited. The directed perf results are more interesting:
So there's a nice boost for larger n, but it just falls off a cliff for n=8. I looked into this, and what's happening is the barriers are causing all the loads of the B matrix to be bunched together, and it's using too many registers. I tried moving all the B loads to the start of the function and saving them in local arrays, and that seems to resolve the issue:
(table not shown)
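A simplified GLSL sketch of that workaround, with invented names, sizes, and layout (not the real shader): the B values a thread needs are staged in a thread-local array at the top of main(), before any barriers, so the compiler can schedule those loads freely instead of bunching them together between barriers.

```glsl
#version 450
#extension GL_EXT_control_flow_attributes : enable

// Simplified sketch of the workaround (invented names, sizes, and layout --
// not the real shader): B loads are staged up front in a thread-local array,
// before any barriers, so they don't get bunched together inside the
// barrier-synchronized loop and spill registers.
layout(local_size_x = 32) in;

layout(binding = 0) readonly buffer B { float b[]; };
layout(binding = 1) buffer Dst        { float dst[]; };

shared float sccache[16];

void main() {
    const uint tid = gl_LocalInvocationID.x;
    const uint gid = gl_GlobalInvocationID.x;

    // Stage the B loads up front; they live in registers from here on.
    float b_cache[8];
    [[unroll]] for (uint j = 0; j < 8; ++j) {
        b_cache[j] = b[gid * 8 + j];
    }

    // Cooperative scale extraction into shared memory (placeholder values).
    if (tid < 16) sccache[tid] = 1.0;
    barrier();

    // Accumulation loop: no B loads left inside the barrier-synchronized region.
    float sum = 0.0;
    [[unroll]] for (uint j = 0; j < 8; ++j) {
        sum = fma(sccache[j / 2], b_cache[j], sum);
    }
    dst[gid] = sum;
}
```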
Hmm, this looks like an Nvidia-only issue, and I didn't see it on my end when I was testing my changes. AMD's tools report that 82/256 vector registers are used in the 64-block-size, 4-row, 8-column case.
I checked the assembly, and at least for me the compiler is interleaving the B loads and the sum FMAs rather than doing them all at once. Also, if I do a quick estimation (breakdown not shown): that's 72 vector registers, and I guess we can go up to 100-ish once we include the indexes and so forth. If we assume the compiler loads all the B columns together, that's 16*8 = 128 registers, which would bring the total over 200. However, the compiler doesn't need to load all the B values into registers in one go, and it should be smart enough not to spill. BTW I can definitely make this change to fix the n=8 performance, and I'll do these tweaks in one go once I get confirmation that Intel is working. It's just weird that the compiler is running out of registers in this case, which hinders performance more than smaller loads would.
I tested it on my A770; now Q6_K passes and it crashes on a later Q2_K test, so it was correct.
New numbers after fixing the early returns and making some more changes:
(table not shown)
Okay this should be everything I think.
Looks good overall now. The biggest positive effect I saw was on q2_k and q3_k on the RX 6800 XT and A770. The RTX 3090 has some regressions on q2_k, but the model benchmark didn't show a big difference, so I think it's fine.
(Collapsed benchmark details omitted: llama-bench and test-backend-ops perf results for q2_K, q3_K, and q6_K, master vs. PR, on RTX 3090, AMD Radeon RX 6800 XT, AMD Radeon Pro VII, and Intel A770.)
I generally expect this to benefit compute-limited hardware the most, but the drop in performance for the 3090 is really strange indeed. My guess is that the GPU runs so fast that the bit operations and four unpacks end up taking less time than the barrier overhead. At least on AMD I'm seeing that the sccache values are kept in registers for the FMA loop, so it's not like the GPU is reading from shared memory every time it does a calculation.
Results with the latest change on RTX 4070. I tried the phi3 Q4_K model (which includes Q5_K and Q6_K) and perf was basically unchanged, which is expected. For the directed tests:
(table not shown)
Still a regression for NUM_COLS==8 with Q6_K, maybe also to a lesser extent for Q4_K? I guess it doesn't need to block this change.
The Q6_K case is getting stranger, especially since the 3090 handles the 8-column case fine. I still think it's a compiler issue, though. For Q4_K I have no idea why this is happening, when we do the same types of loads and literally reduce the instruction count for the scale calculation.
I don't know why the full CI isn't starting up after the approval, so I've run it in my fork and it's passing there.
The CI only runs when a source file changes; it should be extended to include the Vulkan shaders. See llama.cpp/.github/workflows/build.yml, lines 14 to 16 at adc5dd9 (embedded snippet not shown).
We can make inference run a bit faster by extracting the scales in parallel and saving them to shared memory, where they'll be used by all the threads working on the superblock. This came out of the experiments in #10999.
This was not done for Q4_K and Q5_K, as their scales are packed in a complicated way that makes this method even slower.
PR: (table not shown)
Master: (table not shown)
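To summarize the approach, here is a minimal GLSL sketch of the scale-caching idea described in this PR, with a simplified scale packing and invented names (this is not the actual llama.cpp shader): the workgroup decodes a superblock's sub-block scales into shared memory in parallel, and every thread then reads them from that cache during the accumulation loop instead of re-decoding them itself.

```glsl
#version 450
#extension GL_EXT_control_flow_attributes : enable

// Minimal sketch of the idea only, with a simplified scale packing and
// invented names: the 16 per-subblock scales of a superblock are extracted
// cooperatively into shared memory, then reused by all threads of the
// workgroup during the dot-product loop.
layout(local_size_x = 32) in;

layout(binding = 0) readonly buffer Scales { uint packed_scales[]; }; // 4 signed bytes per uint (assumed)
layout(binding = 1) readonly buffer B      { float b[]; };
layout(binding = 2) buffer Dst             { float dst[]; };

shared float sccache[16];   // one decoded scale per sub-block

void main() {
    const uint tid   = gl_LocalInvocationID.x;
    const uint block = gl_WorkGroupID.x;

    // Cooperative extraction: the first 16 threads each decode one scale.
    if (tid < 16) {
        const uint word = packed_scales[block * 4 + tid / 4];
        sccache[tid] = float(bitfieldExtract(int(word), int(tid % 4) * 8, 8));
    }
    barrier();

    // All 32 threads of the workgroup reuse the cached scales.
    float sum = 0.0;
    [[unroll]] for (uint s = 0; s < 16; ++s) {
        sum = fma(sccache[s], b[block * 512 + tid * 16 + s], sum); // placeholder accumulation
    }
    dst[block * gl_WorkGroupSize.x + tid] = sum;
}
```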