Poor performance - How do I use LlamCpp in python correctly? #6322

rsoika · 2024-03-26T10:43:50Z

rsoika
Mar 26, 2024

Hi,
I have a general question about how to use llama.cpp. Maybe that I am to naive but I have simply done this:

Created a new Docker Image based on the official Python image
Installed llama-cpp-python via pip install
Run my example with the following code on an Intel i5-1340P without GPU

model = Llama(
    model_path="/models/mistral-7b-instruct-v0.2.Q4_K_M.gguf",
    temperature=0.8,
    max_tokens=200,
    n_ctx=3048,
    ctx_size=3000,
    top_p=0.1,
    echo=False
)
 max_tokens = 2000


prompt = f"""<s>[INST] <<SYS>><</SYS>>What is the capital city of France? [/INST]"""
output = model(prompt, max_tokens=max_tokens, echo=False)

So I did not install llama.cpp via make as explained in some tutorials. I just install llama-cpp-python via pip.

The model works as expected. But the reason why I am asking this question is the poor performance.
The prompt above takes 20 seconds

llama_print_timings:        load time =    1931.26 ms
llama_print_timings:      sample time =      16.63 ms /    84 runs   (    0.20 ms per token,  5051.42 tokens per second)
llama_print_timings: prompt eval time =    1931.20 ms /    25 tokens (   77.25 ms per token,    12.95 tokens per second)
llama_print_timings:        eval time =   18520.56 ms /    83 runs   (  223.14 ms per token,     4.48 tokens per second)
llama_print_timings:       total time =   20634.53 ms /   108 tokens

Is this an expected normal response time in my dev environment? When I start testing prompts from my application with more than 2000 tokens the response time rises up to 6 Minutes!

I plan to run my application on a Intel Core i7-7700 and a GPU GeForce GTX 1080. I know that in this case I need to activate the GPU devices, but apart form all the fine tuning, I'm wondering if I'm testing right on my CPU or if I'm doing something fundamentally wrong. The serer with the additional GPU costs a lot of money and I would like to know what speed increase can be expected? Or is there something I should fix first in my Docker Container?

Thanks for any hints! Maybe someone can give me some rough values for the prompt evaluation under similar conditions CPU without GPU?

===
Ralph

Jeximo · 2024-03-28T19:14:43Z

Jeximo
Mar 28, 2024

llama-cpp-python is best to figure out the issue. It's a different project from llama.cpp.

I'm uncertain if that speed is normal for your system. llama.cpp has a paramater --threads,

llama.cpp/examples/main/README.md

Line 273 in 4e9a7f7

    
           -   `-t N, --threads N`: Set the number of threads to use during generation. For optimal performance, it is recommended to set this value to the number of physical CPU cores your system has (as opposed to the logical number of cores). Using the correct number of threads can greatly improve performance.

Using the correct number of threads can greatly improve performance.

4 replies

rsoika Mar 28, 2024
Author

@Jeximo, thanks for your reply. Yes I am already using the llama-ccp-python project. I also played around with the threads option witch is n_threads in llama-cpp-python:

model = Llama(
    model_path=model_path,
    temperature=0.8,
    max_tokens=200,
    n_ctx=3048,
    ctx_size=3000,
    seed=-1, 
    n_threads=8,
    echo=False
)

But at the end it does not really show much effect. I think I will now test with GPU....

The main thing I figured out so far was, that the instructions should only be given in English. For example, if I do the instruction in German language, the response time doubles. ...But this probably has more to do with my LLM (Mistra 7b) than with llama-ccp

Jeximo Mar 29, 2024

Yes I am already using the llama-ccp-python project.

I understand you're using llama-cpp-python. My point is that you should post at https://github.com/abetlen/llama-cpp-python if you really want help with that installation.

llama.cpp and llama-cpp-python have different performance because they're seperate repos, developed by different folks.

the instructions should only be given in English
...But this probably has more to do with my LLM

Yes, language is determined by the model, and your system prompt.

beartell Dec 7, 2024

The math behind giving thread parameter is extremely important because it directly has effects on general performance of llama.cpp..
If you have a total 24 cores per processor socket in this case you need to both set both -t and -tb... Sum of these two parameters should equal to the total real processor cores. If you exceed thread size greater than your cpu core size, performans will suffer. Please set both of these parameters. If you not, setting for instance -t 24 only, set default -tb 24 and this will add up to 48 threads on 24 cores... Think about that.

For prompt evaluation/processing : -t 12
For sampling/generation : -tb 12

Or depend on your use task (RAG, Summarization, etc)
-t 8 -tb 16
-t 4 -tb 20

ddh0 Dec 18, 2024

@beartell

For prompt evaluation/processing : -t 12
For sampling/generation : -tb 12

No. That's backwards. -t is for threads and -tb is for threads batched.

Or depend on your use task (RAG, Summarization, etc)
-t 8 -tb 16
-t 4 -tb 20

Also no.

The rule of thumb is to set -t to the number of physical cores (for homogenous CPUs) or P-cores (for heterogenous CPUs), and set -tb to total number of cores, regardless of their type. The reason is that with large batch sizes, you are compute bound, but for small batch sizes, you are memory-bandwidth-bound.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Poor performance - How do I use LlamCpp in python correctly? #6322

{{title}}

Replies: 1 comment 4 replies

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Poor performance - How do I use LlamCpp in python correctly? #6322

rsoika Mar 26, 2024

Replies: 1 comment · 4 replies

Jeximo Mar 28, 2024

rsoika Mar 28, 2024 Author

Jeximo Mar 29, 2024

beartell Dec 7, 2024

ddh0 Dec 18, 2024

rsoika
Mar 26, 2024

Replies: 1 comment 4 replies

Jeximo
Mar 28, 2024

rsoika Mar 28, 2024
Author