[Feat]: GPU inference #63
Comments
Which device are you referring to? If iPhone, Metal is already supported.
Android. If I'm remembering correctly, I was able to compile llama.cpp on my device and it ran fairly quickly. But even the 3B model feels very slow. A quick way to verify a build is covered in the sketch below.
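For anyone reproducing this, one way to check whether an on-device build actually compiled in a GPU backend is to print llama.cpp's system-info string. A minimal sketch; the exact feature names in the output vary across llama.cpp versions:

```cpp
#include "llama.h"  // llama.cpp public C API
#include <cstdio>

int main() {
    // llama_print_system_info() returns a string describing the features
    // this binary was compiled with; depending on the llama.cpp version,
    // this includes backend flags (e.g. Metal/Vulkan) alongside CPU features.
    printf("%s\n", llama_print_system_info());
    return 0;
}
```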
Also interesting: I downloaded Qwen-2.5-3B and it worked great, then I downloaded some more models and all of a sudden every model became extremely slow, producing only 0.5 tokens per second. It goes back to 6 t/s after removing and reinstalling the app. Pixel 8 / Android 15.
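To make throughput reports like these comparable, it helps to time decoding directly rather than eyeballing it. A minimal sketch; the `decode_n_tokens` callback is hypothetical, standing in for whatever generation loop the app actually runs:

```cpp
#include <chrono>
#include <cstdio>
#include <functional>

// Hypothetical helper: runs a caller-supplied generation loop for
// `n_tokens` tokens and reports throughput in tokens per second.
double tokens_per_second(const std::function<void(int)> & decode_n_tokens, int n_tokens) {
    const auto t0 = std::chrono::steady_clock::now();
    decode_n_tokens(n_tokens);
    const auto t1 = std::chrono::steady_clock::now();
    const double sec = std::chrono::duration<double>(t1 - t0).count();
    return n_tokens / sec;
}

int main() {
    // Stand-in workload; in a real app this lambda would call
    // llama_decode() in a loop for n tokens.
    const double tps = tokens_per_second([](int n) {
        for (volatile int i = 0; i < n * 1000000; ++i) { /* busy work */ }
    }, 64);
    printf("%.2f tokens/s\n", tps);
    return 0;
}
```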
Android GPU support would be very welcome.
@kevkid which backend did you use for GPU support? Do you still have the numbers on the performance improvements? Seemingly, llama.cpp's Vulkan implementation is currently designed for desktop and has not (yet) been optimized for Android:
Even when it works (either Vulkan or OpenCL) on certain GPUs/drivers/quants (OpenCL currently only supports a few quantization formats, e.g. Q4_0), the results are inconsistent.
We need to wait a bit until Android GPU support settles down for llama.cpp and for drivers (some of the reported issues appear to be driver-related), and until it becomes more stable and better supported. I will close this for now, but we can reopen it once we have a clearer path forward for GPU support.
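For context, when a GPU backend is compiled in, offloading in llama.cpp is controlled per model at load time. A minimal sketch of loading a model with full offload through the C API; function names vary slightly between llama.cpp releases, and `n_gpu_layers` is ignored on CPU-only builds:

```cpp
#include "llama.h"
#include <cstdio>

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s model.gguf\n", argv[0]);
        return 1;
    }

    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    // Ask for all layers on the GPU backend (Metal on iOS,
    // Vulkan/OpenCL on Android if the build enables them).
    mparams.n_gpu_layers = 99;

    llama_model * model = llama_load_model_from_file(argv[1], mparams);
    if (model == NULL) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    // ... create a context and generate as usual ...

    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```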
Just wanted to chime in @a-ghorbani: I've tested the new experimental OpenCL implementation and it's quite weird, as text-generation speeds dip while prompt processing improves for Q4_0. I've also talked quite a bit with one of the Vulkan contributors to llama.cpp, and as it stands, the outlook for Vulkan support is also somewhat pessimistic.
Thanks for the input! It's good to have additional confirmation that closing this for now makes sense until the ecosystem around GPU/Android stabilizes.
Description
Will we see GPU inference to speed up generation?
Use Case
In all use cases we want more speed.