-
I'm uncertain if that speed is normal for your system. llama-cpp-python is a different project from llama.cpp, so the llama-cpp-python tracker is the best place to figure out the issue. For reference on measuring performance, see llama.cpp/examples/main/README.md, line 273 (at commit 4e9a7f7).
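If it helps, here is a rough sketch of how you could measure speed from llama-cpp-python itself: time a call and use the token counts in the response. The model path and prompt are placeholders; `verbose=True` additionally makes llama.cpp print its own timing summary.

```python
import time

from llama_cpp import Llama

# Placeholder model path; any GGUF model you already use will do.
llm = Llama(model_path="./models/model.gguf", n_ctx=2048, verbose=True)

start = time.time()
out = llm("Summarize what llama.cpp does in one sentence.", max_tokens=64)
elapsed = time.time() - start

usage = out["usage"]  # OpenAI-style token counts returned by llama-cpp-python
print(f"prompt tokens:     {usage['prompt_tokens']}")
print(f"completion tokens: {usage['completion_tokens']}")
print(f"wall-clock time:   {elapsed:.1f} s "
      f"(~{usage['completion_tokens'] / elapsed:.1f} generated tokens/s)")
```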
-
Hi,
I have a general question about how to use llama.cpp. Maybe I am too naive, but I have simply done this:
```
pip install llama-cpp-python
```
So I did not build llama.cpp via make as explained in some tutorials; I just installed llama-cpp-python via pip. The model works as expected, but the reason I am asking this question is the poor performance.
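Roughly, my test looks like this (model path and prompt are placeholders):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/model.gguf",  # placeholder: a quantized GGUF model on disk
    n_ctx=2048,      # context size large enough for my prompts
    n_threads=8,     # i7-7700 has 4 cores / 8 threads
    n_batch=512,     # prompt-evaluation batch size (default)
)

output = llm(
    "Q: Name the planets in the solar system. A:",  # placeholder prompt
    max_tokens=128,
)
print(output["choices"][0]["text"])
```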
The prompt above takes about 20 seconds. Is this a normal response time for my dev environment? When I test prompts from my application with more than 2000 tokens, the response time rises to 6 minutes!
I plan to run my application on an Intel Core i7-7700 with a GeForce GTX 1080 GPU. I know that in that case I need to enable the GPU, but apart from all the fine-tuning, I'm wondering whether I'm testing correctly on my CPU or doing something fundamentally wrong. The server with the additional GPU costs a lot of money, and I would like to know what speed increase I can expect. Or is there something I should fix first in my Docker container?
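If I understand correctly, enabling the GPU later would mean reinstalling llama-cpp-python with the CUDA backend compiled in and offloading the model layers, roughly like this (the exact CMake flag depends on the llama-cpp-python version, and the model path is a placeholder):

```python
# The CUDA backend has to be compiled in when installing, e.g.:
#   CMAKE_ARGS="-DGGML_CUDA=on" pip install --force-reinstall --no-cache-dir llama-cpp-python
# (older versions used -DLLAMA_CUBLAS=on instead)
from llama_cpp import Llama

llm = Llama(
    model_path="./models/model.gguf",  # placeholder path
    n_ctx=2048,
    n_gpu_layers=-1,  # offload all layers to the GTX 1080, VRAM permitting
)
```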
Thanks for any hints! Maybe someone can give me some rough values for prompt evaluation under similar conditions (CPU only, without a GPU)?
===
Ralph