
ggml-cpu: re-add AArch64 NEON assembly for ggml_gemv_q4_0_4x4_q8_0() for non-dotprod #10889

Conversation


@smpurkis smpurkis commented Dec 18, 2024


This PR restores performance to the level seen before a set of performance regressions (see #10757).

It does two main things:

  1. Allows ggml_gemv_q4_0_4x4_q8_0() and ggml_gemm_q4_0_4x4_q8_0() to be used without dotprod, e.g. on the Ampere A1 CPU.
  2. Re-adds the NEON assembly GEMV kernel that works without dotprod.

llama-bench runs comparing performance between the two commits:
Prompt processing

llama-bench -m models/Qwen2.5-Coder-0.5B-Instruct-Q4_0.gguf -m models/Qwen2.5-Coder-3B-Instruct-Q4_0.gguf -t 2 -n 0 -p 256 -b 128,256 -r 3

Before

| model                          |       size |     params | backend    | threads | n_batch |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------: | ------------: | -------------------: |
| qwen2 1B Q4_0                  | 330.95 MiB |   494.03 M | CPU        |       2 |     128 |         pp256 |         34.75 ± 0.28 |
| qwen2 1B Q4_0                  | 330.95 MiB |   494.03 M | CPU        |       2 |     256 |         pp256 |         34.21 ± 0.39 |
| qwen2 3B Q4_0                  |   1.70 GiB |     3.09 B | CPU        |       2 |     128 |         pp256 |          4.73 ± 0.16 |
| qwen2 3B Q4_0                  |   1.70 GiB |     3.09 B | CPU        |       2 |     256 |         pp256 |          4.95 ± 0.02 |

build: 7bbb5acf (4356)

After

| model                          |       size |     params | backend    | threads | n_batch |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------: | ------------: | -------------------: |
| qwen2 1B Q4_0                  | 330.95 MiB |   494.03 M | CPU        |       2 |     128 |         pp256 |       152.38 ± 12.49 |
| qwen2 1B Q4_0                  | 330.95 MiB |   494.03 M | CPU        |       2 |     256 |         pp256 |        149.81 ± 4.66 |
| qwen2 3B Q4_0                  |   1.70 GiB |     3.09 B | CPU        |       2 |     128 |         pp256 |         24.91 ± 0.59 |
| qwen2 3B Q4_0                  |   1.70 GiB |     3.09 B | CPU        |       2 |     256 |         pp256 |         24.30 ± 0.23 |

build: 3f90d2ab (4357)

Generation

llama-bench -m models/Qwen2.5-Coder-0.5B-Instruct-Q4_0.gguf -m models/Qwen2.5-Coder-3B-Instruct-Q4_0.gguf -t 2 -p 0 -n 64,128 -r 3

Before

| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| qwen2 1B Q4_0                  | 330.95 MiB |   494.03 M | CPU        |       2 |          tg64 |         14.59 ± 3.85 |
| qwen2 1B Q4_0                  | 330.95 MiB |   494.03 M | CPU        |       2 |         tg128 |         15.03 ± 2.11 |
| qwen2 3B Q4_0                  |   1.70 GiB |     3.09 B | CPU        |       2 |          tg64 |          2.95 ± 0.08 |
| qwen2 3B Q4_0                  |   1.70 GiB |     3.09 B | CPU        |       2 |         tg128 |          2.97 ± 0.09 |

build: 7bbb5acf (4356)

After

| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| qwen2 1B Q4_0                  | 330.95 MiB |   494.03 M | CPU        |       2 |          tg64 |         31.63 ± 7.80 |
| qwen2 1B Q4_0                  | 330.95 MiB |   494.03 M | CPU        |       2 |         tg128 |         38.21 ± 0.14 |
| qwen2 3B Q4_0                  |   1.70 GiB |     3.09 B | CPU        |       2 |          tg64 |          8.41 ± 0.16 |
| qwen2 3B Q4_0                  |   1.70 GiB |     3.09 B | CPU        |       2 |         tg128 |          8.37 ± 0.05 |

build: 3f90d2ab (4357)

Multiple threads

llama-bench -m models/Qwen2.5-Coder-0.5B-Instruct-Q4_0.gguf -m models/Qwen2.5-Coder-3B-Instruct-Q4_0.gguf -n 0 -n 16 -p 64 -t 1,2,4 -r 3

Before

| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| qwen2 1B Q4_0                  | 330.95 MiB |   494.03 M | CPU        |       1 |          pp64 |         19.26 ± 0.02 |
| qwen2 1B Q4_0                  | 330.95 MiB |   494.03 M | CPU        |       1 |          tg16 |         12.86 ± 0.20 |
| qwen2 1B Q4_0                  | 330.95 MiB |   494.03 M | CPU        |       2 |          pp64 |         38.14 ± 0.06 |
| qwen2 1B Q4_0                  | 330.95 MiB |   494.03 M | CPU        |       2 |          tg16 |         23.32 ± 0.07 |
| qwen2 1B Q4_0                  | 330.95 MiB |   494.03 M | CPU        |       4 |          pp64 |         67.33 ± 1.66 |
| qwen2 1B Q4_0                  | 330.95 MiB |   494.03 M | CPU        |       4 |          tg16 |         33.77 ± 4.54 |
| qwen2 3B Q4_0                  |   1.70 GiB |     3.09 B | CPU        |       1 |          pp64 |          2.54 ± 0.00 |
| qwen2 3B Q4_0                  |   1.70 GiB |     3.09 B | CPU        |       1 |          tg16 |          2.13 ± 0.01 |
| qwen2 3B Q4_0                  |   1.70 GiB |     3.09 B | CPU        |       2 |          pp64 |          5.03 ± 0.04 |
| qwen2 3B Q4_0                  |   1.70 GiB |     3.09 B | CPU        |       2 |          tg16 |          3.99 ± 0.06 |
| qwen2 3B Q4_0                  |   1.70 GiB |     3.09 B | CPU        |       4 |          pp64 |          8.55 ± 0.52 |
| qwen2 3B Q4_0                  |   1.70 GiB |     3.09 B | CPU        |       4 |          tg16 |          5.58 ± 0.99 |

build: 7bbb5acf (4356)

After

| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| qwen2 1B Q4_0                  | 330.95 MiB |   494.03 M | CPU        |       1 |          pp64 |         98.35 ± 0.23 |
| qwen2 1B Q4_0                  | 330.95 MiB |   494.03 M | CPU        |       1 |          tg16 |         22.00 ± 0.01 |
| qwen2 1B Q4_0                  | 330.95 MiB |   494.03 M | CPU        |       2 |          pp64 |        190.73 ± 0.65 |
| qwen2 1B Q4_0                  | 330.95 MiB |   494.03 M | CPU        |       2 |          tg16 |         36.80 ± 0.92 |
| qwen2 1B Q4_0                  | 330.95 MiB |   494.03 M | CPU        |       4 |          pp64 |       354.89 ± 18.44 |
| qwen2 1B Q4_0                  | 330.95 MiB |   494.03 M | CPU        |       4 |          tg16 |        55.59 ± 22.61 |
| qwen2 3B Q4_0                  |   1.70 GiB |     3.09 B | CPU        |       1 |          pp64 |         14.39 ± 0.09 |
| qwen2 3B Q4_0                  |   1.70 GiB |     3.09 B | CPU        |       1 |          tg16 |          4.53 ± 0.02 |
| qwen2 3B Q4_0                  |   1.70 GiB |     3.09 B | CPU        |       2 |          pp64 |         28.75 ± 0.04 |
| qwen2 3B Q4_0                  |   1.70 GiB |     3.09 B | CPU        |       2 |          tg16 |          8.47 ± 0.25 |
| qwen2 3B Q4_0                  |   1.70 GiB |     3.09 B | CPU        |       4 |          pp64 |        41.35 ± 12.93 |
| qwen2 3B Q4_0                  |   1.70 GiB |     3.09 B | CPU        |       4 |          tg16 |         11.12 ± 2.08 |

build: 3f90d2ab (4357)

@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Dec 18, 2024
Comment on lines +607 to +612
".inst 0x4fbbe27a // sdot v26.4s, v19.16b, v27.4b[1]\n"
".inst 0x4fb9e31a // sdot v26.4s, v24.16b, v25.4b[1]\n"
".inst 0x4f9bea5a // sdot v26.4s, v18.16b, v27.4b[2]\n"
".inst 0x4f99eafa // sdot v26.4s, v23.16b, v25.4b[2]\n"
".inst 0x4fbbea3a // sdot v26.4s, v17.16b, v27.4b[3]\n"
".inst 0x4fb9eada // sdot v26.4s, v22.16b, v25.4b[3]\n"
Collaborator
Isn't the sdot instruction part of the dotprod feature?

Author

@smpurkis smpurkis Dec 18, 2024

Forgive me, I don't know assembly or intrinsics that well.
All I can say is that defined(__ARM_FEATURE_DOTPROD) is not allowing this code to be used, and adding CMAKE_ARGS="-D__ARM_FEATURE_DOTPROD=1" didn't seem to make a difference. Inference runs fine on the Ampere A1 CPU, where __ARM_FEATURE_DOTPROD is apparently not defined.
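
As a general diagnostic (not from this thread): you can dump the compiler's predefined macros to see which ARM feature flags are actually set for the target; defining -D__ARM_FEATURE_DOTPROD by hand only sets the macro, it does not enable the instructions the way a suitable -march does.

```shell
# Dump the predefined macros for the current target and look for ARM
# feature flags; on a build where dotprod is enabled you would expect
# __ARM_FEATURE_DOTPROD to appear among them.
cc -dM -E - < /dev/null | grep -i 'ARM_FEATURE' \
  || echo "no ARM feature macros (non-ARM target or generic -march baseline)"
```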

@angt
Contributor

angt commented Dec 18, 2024

I think it's related to the current build system; can you try this?

git remote add angt https://github.com/angt/llama.cpp
git fetch angt
git checkout angt/ggml-allow-march-native-on-generic-arm-platforms 
rm -rf build
cmake -B build
cmake --build build --config Release -j
build/bin/llama-bench ...

@smpurkis
Author

@angt Here are the results for running

llama-bench -m models/Qwen2.5-Coder-0.5B-Instruct-Q4_0.gguf -m models/Qwen2.5-Coder-3B-Instruct-Q4_0.gguf -n 0 -n 16 -p 64 -t 1,2,4 -r 3
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| qwen2 1B Q4_0                  | 330.95 MiB |   494.03 M | CPU        |       1 |          pp64 |        114.20 ± 0.36 |
| qwen2 1B Q4_0                  | 330.95 MiB |   494.03 M | CPU        |       1 |          tg16 |         26.35 ± 0.07 |
| qwen2 1B Q4_0                  | 330.95 MiB |   494.03 M | CPU        |       2 |          pp64 |        213.19 ± 3.13 |
| qwen2 1B Q4_0                  | 330.95 MiB |   494.03 M | CPU        |       2 |          tg16 |         47.02 ± 0.31 |
| qwen2 1B Q4_0                  | 330.95 MiB |   494.03 M | CPU        |       4 |          pp64 |       350.69 ± 78.24 |
| qwen2 1B Q4_0                  | 330.95 MiB |   494.03 M | CPU        |       4 |          tg16 |        63.72 ± 24.09 |
| qwen2 3B Q4_0                  |   1.70 GiB |     3.09 B | CPU        |       1 |          pp64 |         16.18 ± 0.05 |
| qwen2 3B Q4_0                  |   1.70 GiB |     3.09 B | CPU        |       1 |          tg16 |          4.81 ± 0.04 |
| qwen2 3B Q4_0                  |   1.70 GiB |     3.09 B | CPU        |       2 |          pp64 |         32.09 ± 0.05 |
| qwen2 3B Q4_0                  |   1.70 GiB |     3.09 B | CPU        |       2 |          tg16 |          9.08 ± 0.16 |
| qwen2 3B Q4_0                  |   1.70 GiB |     3.09 B | CPU        |       4 |          pp64 |         53.11 ± 4.26 |
| qwen2 3B Q4_0                  |   1.70 GiB |     3.09 B | CPU        |       4 |          tg16 |         12.29 ± 3.92 |

build: 1dae1d88 (4353)

Looks like you are right.
I'll close this in favour of #10752

@smpurkis smpurkis closed this Dec 18, 2024