
Misc. bug: Large performance regression since version b4365 #10977

Open
GlasslessPizza opened this issue Dec 25, 2024 · 2 comments

Name and Version

b4365 onward

Operating systems

Windows

Which llama.cpp modules do you know to be affected?

llama-server

Problem description & steps to reproduce

I'm observing a slowdown between b4363 and b4365 that persists to this day. I tried two models:

https://huggingface.co/bartowski/gemma-2-27b-it-GGUF/blob/main/gemma-2-27b-it-Q5_K_L.gguf
https://huggingface.co/tensorblock/Qwen2.5-32B-Instruct-abliterated-GGUF/blob/main/Qwen2.5-32B-Instruct-abliterated-Q5_K_M.gguf

Results:

      |   qwen   |  gemma
-----------------------------
b4363 | 31.7 t/s | 36.1 t/s
b4365 | 24.5 t/s | 22.7 t/s
-----------------------------
      |   -23%   |   -37%

Command used:

.\llama-server.exe --model <model> --ctx-size 8192 --threads 10 --no-mmap --mlock --n-gpu-layers 999 --log-disable --flash-attn --cache-type-k q8_0 --cache-type-v q8_0

Windows 10

First Bad Commit

between b4363 and b4365

Relevant log output

No response

slaren (Collaborator) commented Dec 25, 2024

How are you measuring the performance? What queries are you performing? The only relevant commit that I see in that range is #10783; if you are requesting token probabilities, the change in performance may be expected.

GlasslessPizza (Author) replied:

> How are you measuring the performance? What queries are you performing? The only relevant commit that I see in that range is #10783; if you are requesting token probabilities, the change in performance may be expected.

The query is a basic Q&A task in mikupad. I'm using its token-speed counter to measure. Now that you mention it, I know that mikupad does request token probabilities internally, since some of its features, like "show on hover", use them, but I personally keep that set to "hide" as I don't need them.
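To take mikupad out of the loop, one way to check whether the probability path from #10783 is the culprit is to time generation against llama-server's HTTP API directly, leaving `n_probs` at 0 so no per-token probabilities are computed. A minimal sketch, assuming llama-server is listening on `127.0.0.1:8080` and that the response's `timings.predicted_n` field is present (both the port and the exact field name are assumptions, not taken from this issue):

```python
"""Rough throughput check against llama-server, bypassing mikupad."""
import json
import time
import urllib.request


def make_payload(prompt: str, n_predict: int = 256) -> dict:
    # n_probs: 0 asks the server not to return token probabilities,
    # isolating raw decode speed from the suspected #10783 overhead.
    return {
        "prompt": prompt,
        "n_predict": n_predict,
        "n_probs": 0,
    }


def tokens_per_second(n_tokens: int, elapsed_s: float) -> float:
    # Plain generation throughput: tokens emitted per wall-clock second.
    return n_tokens / elapsed_s


def run_once(url: str = "http://127.0.0.1:8080/completion") -> float:
    payload = json.dumps(make_payload("Explain KV-cache quantization.")).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    elapsed = time.perf_counter() - start
    # Fall back to the requested n_predict if the timings field is absent.
    n_tokens = body.get("timings", {}).get("predicted_n", 256)
    return tokens_per_second(n_tokens, elapsed)


if __name__ == "__main__":
    print(f"{run_once():.1f} t/s")
```

Running this once per build (b4363 vs. b4365) with an identical prompt would show whether the regression survives with probabilities disabled; repeating with `"n_probs": 5` would show whether requesting them reproduces the slowdown.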
