
ggml: aarch64: implement SVE kernels for q4_K_q8_K vector dot #11227

Merged
merged 2 commits into ggerganov:master on Jan 16, 2025

Conversation

fj-y-saito
Contributor

@fj-y-saito fj-y-saito commented Jan 14, 2025

This PR introduces support for SVE (Scalable Vector Extension) kernels for the q4_K_q8_K vector dot product on the Arm architecture. A similar proposal for SVE support was made in PR #7433.
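
For readers less familiar with SVE, the snippet below is a minimal, hypothetical sketch of the accumulation pattern such a kernel builds on: a predicated int8 dot product written with the ACLE intrinsics from `<arm_sve.h>`. It is not the kernel added by this PR (the real q4_K_q8_K routine also unpacks the 4-bit quants and applies the per-block scales); the function name and loop structure are illustrative only, and the code assumes an SVE-capable target such as `-march=armv8.2-a+sve`.

```c
#include <arm_sve.h>
#include <stdint.h>

// Illustrative only: a vector-length-agnostic int8 dot product using SVE
// predication and the sdot instruction. In ggml, SVE code paths are typically
// guarded by #if defined(__ARM_FEATURE_SVE).
static int32_t dot_i8_sve(const int8_t * a, const int8_t * b, int n) {
    svint32_t acc = svdup_n_s32(0);
    const int step = (int) svcntb();                 // int8 lanes per vector (depends on the SVE vector length)
    for (int i = 0; i < n; i += step) {
        const svbool_t pg = svwhilelt_b8_s32(i, n);  // predicate covers the tail iteration
        const svint8_t va = svld1_s8(pg, a + i);     // inactive lanes are loaded as zero
        const svint8_t vb = svld1_s8(pg, b + i);
        acc = svdot_s32(acc, va, vb);                // 4-way widening multiply-accumulate
    }
    return (int32_t) svaddv_s32(svptrue_b32(), acc); // horizontal sum of the accumulator
}
```

On the FX700 (A64FX) the SVE vector length is 512 bits, so each `svdot_s32` step consumes 64 int8 elements per vector, compared with 16 for a 128-bit NEON register.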

Verifying Features

This PR contains the SVE implementation of the vector dot product used for the Q4_K quantization.
Running a Q4_K_M quantized model of Llama-3.1-8B, I confirmed that the outputs match: the values produced by the NEON and SVE implementations were compared one by one and always agreed.
I also verified that the perplexity matches between the NEON and SVE implementations:

| NEON | SVE (this PR) |
| ---------------: | ---------------: |
| 6.5772 ± 0.04061 | 6.5772 ± 0.04061 |
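
For reference, perplexity in llama.cpp is usually measured with the llama-perplexity tool; this PR does not state the exact command or corpus, so the following line is only an assumed example (the WikiText-2 test file is a placeholder):

llama-perplexity --model ${PATH_TO_MODEL} --file wikitext-2-raw/wiki.test.raw --threads 48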

Performance check

Performance was measured on a Fujitsu FX700 (A64FX). Throughput improves as follows; the values are tokens per second.

| batch size | Original (NEON) | This PR (SVE) | Ratio |
| ---------: | --------------: | ------------: | ----: |
| 1 | 5.02 | 8.79 | 1.75 |
| 2 | 5.26 | 9.30 | 1.77 |
| 4 | 5.32 | 9.55 | 1.80 |
| 8 | 5.19 | 9.47 | 1.82 |

The command used to measure performance was:

llama-batched --model ${PATH_TO_MODEL} --prompt 'AI is going to' --parallel 8 --predict 128 --seed 0 --threads 48

@github-actions github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Jan 14, 2025
Review thread on ggml/src/ggml-cpu/ggml-cpu-quants.c (outdated, resolved)
@ggerganov
Owner

Could you post the results of the following command:

llama-bench --model ${PATH_TO_MODEL} -p 1,2,4,8,512 -t 8,16,32,48

@fj-y-saito
Contributor Author

Thank you for your quick reply.
The results are below.

Original (NEON)

| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |       8 |           pp1 |          0.97 ± 0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |       8 |           pp2 |          1.05 ± 0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |       8 |           pp4 |          1.10 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |       8 |           pp8 |          1.12 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |       8 |         pp512 |          1.10 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |       8 |         tg128 |          0.97 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      16 |           pp1 |          1.93 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      16 |           pp2 |          2.08 ± 0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      16 |           pp4 |          2.17 ± 0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      16 |           pp8 |          2.23 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      16 |         pp512 |          2.21 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      16 |         tg128 |          1.91 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      32 |           pp1 |          3.74 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      32 |           pp2 |          4.08 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      32 |           pp4 |          4.32 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      32 |           pp8 |          4.44 ± 0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      32 |         pp512 |          4.41 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      32 |         tg128 |          3.71 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      48 |           pp1 |          2.74 ± 2.23 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      48 |           pp2 |          3.58 ± 1.87 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      48 |           pp4 |          4.16 ± 1.57 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      48 |           pp8 |          6.05 ± 0.14 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      48 |         pp512 |          6.56 ± 0.03 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      48 |         tg128 |          3.40 ± 1.02 |

This PR (SVE)

| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |       8 |           pp1 |          1.86 ± 0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |       8 |           pp2 |          2.09 ± 0.02 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |       8 |           pp4 |          2.25 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |       8 |           pp8 |          2.34 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |       8 |         pp512 |          2.26 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |       8 |         tg128 |          1.83 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      16 |           pp1 |          3.56 ± 0.15 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      16 |           pp2 |          4.12 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      16 |           pp4 |          4.46 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      16 |           pp8 |          4.63 ± 0.02 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      16 |         pp512 |          4.52 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      16 |         tg128 |          3.56 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      32 |           pp1 |          6.75 ± 0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      32 |           pp2 |          7.91 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      32 |           pp4 |          8.73 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      32 |           pp8 |          9.20 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      32 |         pp512 |          9.00 ± 0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      32 |         tg128 |          6.66 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      48 |           pp1 |          7.71 ± 3.27 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      48 |           pp2 |          7.44 ± 4.56 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      48 |           pp4 |          4.13 ± 0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      48 |           pp8 |          8.48 ± 3.17 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      48 |         pp512 |         13.33 ± 0.03 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      48 |         tg128 |          4.41 ± 0.69 |

Commit: change to use K_SCALE_SIZE
Co-authored-by: Georgi Gerganov <[email protected]>
@ggerganov ggerganov merged commit c67cc98 into ggerganov:master Jan 16, 2025
48 checks passed