
ggml-cpu: re-add AArch64 NEON assembly for ggml_gemv_q4_0_4x4_q8_0() for non-dotprod #10889

Conversation


@smpurkis smpurkis commented Dec 18, 2024


This PR restores performance to the level seen before a set of performance regressions (see #10757).

It does two main things:

  1. Allows ggml_gemv_q4_0_4x4_q8_0() and ggml_gemm_q4_0_4x4_q8_0() to be used without dotprod, e.g. on the Ampere A1 CPU.
  2. Re-adds the NEON assembly GEMV kernel that works without dotprod.

llama-bench runs comparing performance between the two commits:
Prompt processing

llama-bench -m models/Qwen2.5-Coder-0.5B-Instruct-Q4_0.gguf -m models/Qwen2.5-Coder-3B-Instruct-Q4_0.gguf -t 2 -n 0 -p 256 -b 128,256 -r 3

Before

| model                          |       size |     params | backend    | threads | n_batch |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------: | ------------: | -------------------: |
| qwen2 1B Q4_0                  | 330.95 MiB |   494.03 M | CPU        |       2 |     128 |         pp256 |         34.75 ± 0.28 |
| qwen2 1B Q4_0                  | 330.95 MiB |   494.03 M | CPU        |       2 |     256 |         pp256 |         34.21 ± 0.39 |
| qwen2 3B Q4_0                  |   1.70 GiB |     3.09 B | CPU        |       2 |     128 |         pp256 |          4.73 ± 0.16 |
| qwen2 3B Q4_0                  |   1.70 GiB |     3.09 B | CPU        |       2 |     256 |         pp256 |          4.95 ± 0.02 |

build: 7bbb5acf (4356)

After

| model                          |       size |     params | backend    | threads | n_batch |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------: | ------------: | -------------------: |
| qwen2 1B Q4_0                  | 330.95 MiB |   494.03 M | CPU        |       2 |     128 |         pp256 |       152.38 ± 12.49 |
| qwen2 1B Q4_0                  | 330.95 MiB |   494.03 M | CPU        |       2 |     256 |         pp256 |        149.81 ± 4.66 |
| qwen2 3B Q4_0                  |   1.70 GiB |     3.09 B | CPU        |       2 |     128 |         pp256 |         24.91 ± 0.59 |
| qwen2 3B Q4_0                  |   1.70 GiB |     3.09 B | CPU        |       2 |     256 |         pp256 |         24.30 ± 0.23 |

build: 3f90d2ab (4357)

Generation

llama-bench -m models/Qwen2.5-Coder-0.5B-Instruct-Q4_0.gguf -m models/Qwen2.5-Coder-3B-Instruct-Q4_0.gguf -t 2 -p 0 -n 64,128 -r 3

Before

| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| qwen2 1B Q4_0                  | 330.95 MiB |   494.03 M | CPU        |       2 |          tg64 |         14.59 ± 3.85 |
| qwen2 1B Q4_0                  | 330.95 MiB |   494.03 M | CPU        |       2 |         tg128 |         15.03 ± 2.11 |
| qwen2 3B Q4_0                  |   1.70 GiB |     3.09 B | CPU        |       2 |          tg64 |          2.95 ± 0.08 |
| qwen2 3B Q4_0                  |   1.70 GiB |     3.09 B | CPU        |       2 |         tg128 |          2.97 ± 0.09 |

build: 7bbb5acf (4356)

After

| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| qwen2 1B Q4_0                  | 330.95 MiB |   494.03 M | CPU        |       2 |          tg64 |         31.63 ± 7.80 |
| qwen2 1B Q4_0                  | 330.95 MiB |   494.03 M | CPU        |       2 |         tg128 |         38.21 ± 0.14 |
| qwen2 3B Q4_0                  |   1.70 GiB |     3.09 B | CPU        |       2 |          tg64 |          8.41 ± 0.16 |
| qwen2 3B Q4_0                  |   1.70 GiB |     3.09 B | CPU        |       2 |         tg128 |          8.37 ± 0.05 |

build: 3f90d2ab (4357)

Multiple threads

llama-bench -m models/Qwen2.5-Coder-0.5B-Instruct-Q4_0.gguf -m models/Qwen2.5-Coder-3B-Instruct-Q4_0.gguf -n 0 -n 16 -p 64 -t 1,2,4 -r 3

Before

| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| qwen2 1B Q4_0                  | 330.95 MiB |   494.03 M | CPU        |       1 |          pp64 |         19.26 ± 0.02 |
| qwen2 1B Q4_0                  | 330.95 MiB |   494.03 M | CPU        |       1 |          tg16 |         12.86 ± 0.20 |
| qwen2 1B Q4_0                  | 330.95 MiB |   494.03 M | CPU        |       2 |          pp64 |         38.14 ± 0.06 |
| qwen2 1B Q4_0                  | 330.95 MiB |   494.03 M | CPU        |       2 |          tg16 |         23.32 ± 0.07 |
| qwen2 1B Q4_0                  | 330.95 MiB |   494.03 M | CPU        |       4 |          pp64 |         67.33 ± 1.66 |
| qwen2 1B Q4_0                  | 330.95 MiB |   494.03 M | CPU        |       4 |          tg16 |         33.77 ± 4.54 |
| qwen2 3B Q4_0                  |   1.70 GiB |     3.09 B | CPU        |       1 |          pp64 |          2.54 ± 0.00 |
| qwen2 3B Q4_0                  |   1.70 GiB |     3.09 B | CPU        |       1 |          tg16 |          2.13 ± 0.01 |
| qwen2 3B Q4_0                  |   1.70 GiB |     3.09 B | CPU        |       2 |          pp64 |          5.03 ± 0.04 |
| qwen2 3B Q4_0                  |   1.70 GiB |     3.09 B | CPU        |       2 |          tg16 |          3.99 ± 0.06 |
| qwen2 3B Q4_0                  |   1.70 GiB |     3.09 B | CPU        |       4 |          pp64 |          8.55 ± 0.52 |
| qwen2 3B Q4_0                  |   1.70 GiB |     3.09 B | CPU        |       4 |          tg16 |          5.58 ± 0.99 |

build: 7bbb5acf (4356)

After

| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| qwen2 1B Q4_0                  | 330.95 MiB |   494.03 M | CPU        |       1 |          pp64 |         98.35 ± 0.23 |
| qwen2 1B Q4_0                  | 330.95 MiB |   494.03 M | CPU        |       1 |          tg16 |         22.00 ± 0.01 |
| qwen2 1B Q4_0                  | 330.95 MiB |   494.03 M | CPU        |       2 |          pp64 |        190.73 ± 0.65 |
| qwen2 1B Q4_0                  | 330.95 MiB |   494.03 M | CPU        |       2 |          tg16 |         36.80 ± 0.92 |
| qwen2 1B Q4_0                  | 330.95 MiB |   494.03 M | CPU        |       4 |          pp64 |       354.89 ± 18.44 |
| qwen2 1B Q4_0                  | 330.95 MiB |   494.03 M | CPU        |       4 |          tg16 |        55.59 ± 22.61 |
| qwen2 3B Q4_0                  |   1.70 GiB |     3.09 B | CPU        |       1 |          pp64 |         14.39 ± 0.09 |
| qwen2 3B Q4_0                  |   1.70 GiB |     3.09 B | CPU        |       1 |          tg16 |          4.53 ± 0.02 |
| qwen2 3B Q4_0                  |   1.70 GiB |     3.09 B | CPU        |       2 |          pp64 |         28.75 ± 0.04 |
| qwen2 3B Q4_0                  |   1.70 GiB |     3.09 B | CPU        |       2 |          tg16 |          8.47 ± 0.25 |
| qwen2 3B Q4_0                  |   1.70 GiB |     3.09 B | CPU        |       4 |          pp64 |        41.35 ± 12.93 |
| qwen2 3B Q4_0                  |   1.70 GiB |     3.09 B | CPU        |       4 |          tg16 |         11.12 ± 2.08 |

build: 3f90d2ab (4357)

@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Dec 18, 2024
Comment on lines +607 to +612
".inst 0x4fbbe27a // sdot v26.4s, v19.16b, v27.4b[1]\n"
".inst 0x4fb9e31a // sdot v26.4s, v24.16b, v25.4b[1]\n"
".inst 0x4f9bea5a // sdot v26.4s, v18.16b, v27.4b[2]\n"
".inst 0x4f99eafa // sdot v26.4s, v23.16b, v25.4b[2]\n"
".inst 0x4fbbea3a // sdot v26.4s, v17.16b, v27.4b[3]\n"
".inst 0x4fb9eada // sdot v26.4s, v22.16b, v25.4b[3]\n"
Collaborator
Isn't the sdot instruction part of the dotprod feature?

Author

@smpurkis smpurkis Dec 18, 2024

Forgive me, I don't know assembly or intrinsics that well.
All I can say is that defined(__ARM_FEATURE_DOTPROD) is not allowing this code to be used, and adding CMAKE_ARGS="-D__ARM_FEATURE_DOTPROD=1" didn't seem to make a difference. Inference runs fine on the Ampere A1 CPU, where __ARM_FEATURE_DOTPROD is apparently not defined.
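
As a general diagnostic (not from this thread): you can dump the compiler's predefined macros to see which ARM feature flags are actually set for the target; defining -D__ARM_FEATURE_DOTPROD by hand only sets the macro, it does not enable the instructions the way a suitable -march does.

```shell
# Dump the predefined macros for the current target and look for ARM
# feature flags; on a build where dotprod is enabled you would expect
# __ARM_FEATURE_DOTPROD to appear among them.
cc -dM -E - < /dev/null | grep -i 'ARM_FEATURE' \
  || echo "no ARM feature macros (non-ARM target or generic -march baseline)"
```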

@angt
Contributor

angt commented Dec 18, 2024

I think it's related to the current build system; can you try this?

git remote add angt https://github.com/angt/llama.cpp
git fetch angt
git checkout angt/ggml-allow-march-native-on-generic-arm-platforms 
rm -rf build
cmake -B build
cmake --build build --config Release -j
build/bin/llama-bench ...

@smpurkis
Author

@angt Here are the results for running

llama-bench -m models/Qwen2.5-Coder-0.5B-Instruct-Q4_0.gguf -m models/Qwen2.5-Coder-3B-Instruct-Q4_0.gguf -n 0 -n 16 -p 64 -t 1,2,4 -r 3
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| qwen2 1B Q4_0                  | 330.95 MiB |   494.03 M | CPU        |       1 |          pp64 |        114.20 ± 0.36 |
| qwen2 1B Q4_0                  | 330.95 MiB |   494.03 M | CPU        |       1 |          tg16 |         26.35 ± 0.07 |
| qwen2 1B Q4_0                  | 330.95 MiB |   494.03 M | CPU        |       2 |          pp64 |        213.19 ± 3.13 |
| qwen2 1B Q4_0                  | 330.95 MiB |   494.03 M | CPU        |       2 |          tg16 |         47.02 ± 0.31 |
| qwen2 1B Q4_0                  | 330.95 MiB |   494.03 M | CPU        |       4 |          pp64 |       350.69 ± 78.24 |
| qwen2 1B Q4_0                  | 330.95 MiB |   494.03 M | CPU        |       4 |          tg16 |        63.72 ± 24.09 |
| qwen2 3B Q4_0                  |   1.70 GiB |     3.09 B | CPU        |       1 |          pp64 |         16.18 ± 0.05 |
| qwen2 3B Q4_0                  |   1.70 GiB |     3.09 B | CPU        |       1 |          tg16 |          4.81 ± 0.04 |
| qwen2 3B Q4_0                  |   1.70 GiB |     3.09 B | CPU        |       2 |          pp64 |         32.09 ± 0.05 |
| qwen2 3B Q4_0                  |   1.70 GiB |     3.09 B | CPU        |       2 |          tg16 |          9.08 ± 0.16 |
| qwen2 3B Q4_0                  |   1.70 GiB |     3.09 B | CPU        |       4 |          pp64 |         53.11 ± 4.26 |
| qwen2 3B Q4_0                  |   1.70 GiB |     3.09 B | CPU        |       4 |          tg16 |         12.29 ± 3.92 |

build: 1dae1d88 (4353)

Looks like you are right.
I'll close this in favour of #10752

@smpurkis smpurkis closed this Dec 18, 2024