vulkan: experimental coalesced read to shared memory before dequantization #10999

Draft · wants to merge 6 commits into master
Conversation

netrunnereve (Collaborator)

This is an ugly and hacky experiment that's kept as a draft for a reason, but it makes Q6_K run almost as fast as Q4_K_M. The basic idea is to first read the superblocks into shared memory using perfectly coalesced loads, and then do the actual dequantization out of fast shared memory. Since this completely separates the memory reads from the dequantization code, it also lets us explore dequantization algorithms that don't rely on coalescing.

With any other language I'd be able to do this easily, but GLSL has no pointers, no memcpy, and so on. For some reason, though, it lets me index an array of structs beyond its stated size to access the next element, and that doesn't cause an error or crash my computer. As a result this is not suitable for release unless we find a way to do it properly, and I'm just leaving it here for feedback and suggestions.
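
Roughly, the staging step looks like this (a minimal illustrative sketch; names and sizes like `data_a`, `BLOCK_BYTES`, and `NUM_BLOCKS` are placeholders, not the PR's actual shader):

```glsl
#version 450
#extension GL_EXT_shader_explicit_arithmetic_types : require

layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;

// Illustrative sizes: a q6_K superblock is 210 bytes; stage two per workgroup.
#define BLOCK_BYTES 210u
#define NUM_BLOCKS 2u

layout(binding = 0) readonly buffer A { uint8_t data_a[]; };

shared uint8_t sh_blocks[NUM_BLOCKS * BLOCK_BYTES];

void main() {
    const uint base = gl_WorkGroupID.x * NUM_BLOCKS * BLOCK_BYTES;
    // Adjacent threads read adjacent bytes, so every trip through the loop
    // is one fully coalesced 64-wide load.
    for (uint i = gl_LocalInvocationID.x; i < NUM_BLOCKS * BLOCK_BYTES; i += 64u) {
        sh_blocks[i] = data_a[base + i];
    }
    barrier(); // the shared copy must be complete before anyone dequantizes
    // ... dequantize from sh_blocks[] instead of from global memory ...
}
```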

It's possible to make this go faster by using 32-bit loads per thread if the alignment allows for it, but again it's the language that's holding me back. The current code runs with a fixed 64 subgroup size.
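
When the alignment did work out, the staging loop above could move a whole word per thread instead, along these lines (again just a sketch; `data_a32` is a hypothetical 32-bit view of the same buffer):

```glsl
#version 450

layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;

// Illustrative: words staged per workgroup. Note a 210-byte q6_K superblock
// is not a multiple of 4 bytes, which is exactly the alignment headache.
#define WORDS_PER_WG 128u

layout(binding = 0) readonly buffer A32 { uint data_a32[]; };

shared uint sh_words[WORDS_PER_WG];

void main() {
    const uint base_word = gl_WorkGroupID.x * WORDS_PER_WG;
    // One 32-bit load per thread per iteration, still fully coalesced,
    // moving four times the data of the byte-wise version.
    for (uint i = gl_LocalInvocationID.x; i < WORDS_PER_WG; i += 64u) {
        sh_words[i] = data_a32[base_word + i];
    }
    barrier();
    // ... unpack bytes out of sh_words[] during dequantization ...
}
```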

| model | size | params | backend | ngl | threads | main_gpu | sm | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q6_K (Master) | 5.53 GiB | 7.24 B | Vulkan | 100 | 8 | 1 | none | tg128 | 20.39 ± 0.00 |
| llama 7B Q6_K (PR) | 5.53 GiB | 7.24 B | Vulkan | 100 | 8 | 1 | none | tg128 | 25.15 ± 0.02 |
Master
```
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 3408 runs - 368.04 us/run - 117.44 MFLOP/run - 319.10 GFLOPS
```

PR
```
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 3408 runs - 294.77 us/run - 117.44 MFLOP/run - 398.41 GFLOPS
```

In 158ab15 I also have a proper version of this which runs at 23 t/s with the same Q6_K model. There the loads are only coalesced within each 16-thread superblock group and no hacks are required, but I think it only helps on AMD GCN, whose 64-wide SIMD is actually four 16-wide units internally (and the loads are done in blocks of 16 threads); see the sketch below.
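
That variant boils down to each group of 16 threads cooperatively staging its own superblock, roughly (sketch with placeholder names again):

```glsl
#version 450
#extension GL_EXT_shader_explicit_arithmetic_types : require

layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;

#define BLOCK_BYTES 210u // illustrative q6_K superblock size
#define GROUPS 4u        // 64 threads = 4 groups of 16

layout(binding = 0) readonly buffer A { uint8_t data_a[]; };

shared uint8_t sh_blocks[GROUPS * BLOCK_BYTES];

void main() {
    const uint group = gl_LocalInvocationID.x / 16u; // which superblock
    const uint lane  = gl_LocalInvocationID.x % 16u; // position inside it
    const uint base  = (gl_WorkGroupID.x * GROUPS + group) * BLOCK_BYTES;
    // Addresses are consecutive within each 16-lane group, so the loads
    // coalesce per 16-wide unit, matching GCN's internal SIMD16s.
    for (uint i = lane; i < BLOCK_BYTES; i += 16u) {
        sh_blocks[group * BLOCK_BYTES + i] = data_a[base + i];
    }
    barrier();
}
```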

netrunnereve (Collaborator, Author)

Turns out the CI is failing because llvmpipe isn't happy with this change, though strangely enough it produces correct output and tests clean with RADV.

GL_EXT_shared_memory_block might be a way to do this properly provided I can find some examples on how to use it.

jeffbolznv (Collaborator)

On my RTX 4070, commit b3b30e4 is maybe 1-2% slower than master and commit 158ab15 is maybe half a percent faster than master.

> GL_EXT_shared_memory_block might be a way to do this properly provided I can find some examples on how to use it.

Yeah, I was going to suggest this if you need to treat the data differently when storing it than when loading it. IIRC the syntax is just:

```glsl
shared Block1 {
   // members, optionally with offset decorations
};
shared Block2 {
   // members, optionally with offset decorations
};
```

and then data in each block can potentially alias.

netrunnereve (Collaborator, Author)

> On my RTX 4070, commit b3b30e4 is maybe 1-2% slower than master

This might actually be fine considering how I've forced it to use a subgroup size of 64. It should go faster with 32, but then again if you're memory bound this may not help much.

> Yeah, I was going to suggest this if you need to treat the data differently when storing it than when loading it.

Thanks. I managed to get GL_EXT_shared_memory_block working properly with llvmpipe, but it spits out gibberish on AMD; it looks like the driver has some bugs with this extension. Even the simple change below on master causes incorrect output on my GPU.

```diff
index fab4ff5f..dfcae382 100644
--- a/ggml/src/ggml-vulkan/vulkan-shaders/mul_mat_vec_q6_k.comp
+++ b/ggml/src/ggml-vulkan/vulkan-shaders/mul_mat_vec_q6_k.comp
@@ -1,6 +1,7 @@
 #version 450
 
 #extension GL_EXT_shader_explicit_arithmetic_types : require
+#extension GL_EXT_shared_memory_block : require
 
 #include "mul_mat_vec_base.comp"
 
@@ -9,7 +10,9 @@ layout(local_size_x_id = 0, local_size_y = 1, local_size_z = 1) in;
 layout (constant_id = 0) const uint BLOCK_SIZE = 32;
 layout (constant_id = 1) const uint NUM_ROWS = 1;
 
-shared FLOAT_TYPE tmpsh[NUM_ROWS][BLOCK_SIZE];
+shared shblk {
+	FLOAT_TYPE tmpsh[NUM_ROWS][BLOCK_SIZE];
+};
 
 void compute_outputs(const uint32_t first_row, const uint32_t num_rows) {
     uint a_offset, b_offset, d_offset;
```

jeffbolznv (Collaborator)

> I managed to get GL_EXT_shared_memory_block working properly with llvmpipe but it spits out gibberish on AMD, it looks like the driver has some bugs with this extension. Even the simple change below on master causes incorrect output on my GPU.

I looked at the SPIR-V that gets generated and I think what's happening is that the ArrayStride for the inner array of tmpsh is being set to 128 (presumably the spec constant's default BLOCK_SIZE of 32 times 4 bytes per float). This is a known limitation of SPIR-V: structure layouts that depend on spec constants don't really work, because the layout doesn't get updated based on the spec constant. If you had BLOCK_SIZE = 64 in the shader it would probably work on AMD.
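
For illustration, a stride-safe version of the block bakes the size in at compile time (a sketch; the real shader would of course lose run-time specialization of BLOCK_SIZE by doing this):

```glsl
#version 450
#extension GL_EXT_shared_memory_block : require

// Stand-ins for the real shader's configuration:
#define FLOAT_TYPE float
#define NUM_ROWS 1
#define BLOCK_SIZE 64 // a literal rather than a spec constant, so the
                      // block's ArrayStride is already correct in the SPIR-V

layout(local_size_x = BLOCK_SIZE, local_size_y = 1, local_size_z = 1) in;

shared shblk {
    FLOAT_TYPE tmpsh[NUM_ROWS][BLOCK_SIZE];
};

void main() {
    tmpsh[0][gl_LocalInvocationID.x] = 0.0;
}
```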

Keep in mind that GL_EXT_shared_memory_block is probably not supported everywhere, so we'd need a fallback path. So we need to decide whether the perf we get from it is worth the additional maintenance cost.
