
[Feat]: gpu inference #63

Closed
kevkid opened this issue Oct 22, 2024 · 7 comments
Labels
enhancement New feature or request

Comments

kevkid commented Oct 22, 2024

Description
Will we see GPU inference to speed up generation?

Use Case
In all use cases we want more speed.

kevkid added the enhancement label on Oct 22, 2024
@a-ghorbani
Owner

Which device are you referring to? If iPhone, Metal is already supported.


kevkid commented Oct 22, 2024

Android. If I remember correctly, I was able to compile llama.cpp on my device and it ran fairly quickly. But even the 3B model feels very slow.


JasonOSX commented Nov 2, 2024

Also interesting: I downloaded Qwen-2.5-3B and it worked great. Then I downloaded some more models, and all of a sudden every model became extremely slow, producing only 0.5 tokens per second. It is back to 6 t/s after removing and installing again. Pixel 8 / Android 15.


sotwi commented Dec 4, 2024

Android GPU support would be very welcome.

@a-ghorbani
Owner

@kevkid which backend did you use for GPU support? Do you still have the numbers on the performance improvements?

Seemingly, llama.cpp's Vulkan implementation is currently designed for desktop and has not (yet) been optimized for Android.

Even when it works (either Vulkan or OpenCL) on certain GPUs/drivers/quants (OpenCL currently only supports Q4_0 and Q6_K), users do not consistently report significant performance improvements. For Vulkan at least, this is likely due to the shaders not being optimized for Android GPUs; there are several conversations around this in the llama.cpp issue tracker.

We need to wait a bit until llama.cpp's Android GPU support and the drivers (some of the reported issues appear to be driver-related) become more stable and better supported.
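For anyone who wants to experiment in the meantime, here is a minimal sketch of how GPU layer offload is requested through llama.cpp's C API. Function and field names match late-2024 checkouts of llama.cpp (verify against your version); whether layers actually run on the GPU depends on which backend the library was built with (Metal, Vulkan, or the experimental OpenCL backend), and on a CPU-only build the setting is simply ignored.

```c
// Minimal sketch: requesting GPU layer offload via llama.cpp's C API.
// Names match late-2024 checkouts; check against your version.
// On a CPU-only build, n_gpu_layers is silently ignored.
#include "llama.h"
#include <stdio.h>

int main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s model.gguf\n", argv[0]);
        return 1;
    }

    llama_backend_init();

    struct llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99; // offload as many layers as the backend accepts

    struct llama_model *model = llama_load_model_from_file(argv[1], mparams);
    if (model == NULL) {
        fprintf(stderr, "failed to load %s\n", argv[1]);
        return 1;
    }

    // ... create a context and run inference here ...

    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```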

I will close this for now, but we can reopen it once we have a clearer path forward for GPU support.

@Vali-98

Vali-98 commented Jan 5, 2025

Just wanted to chime in @a-ghorbani: I've tested the new experimental OpenCL implementation and it's quite weird, as text-gen speeds dip while prompt processing improves for Q4_0. I've also conversed quite a bit with one of the Vulkan contributors for llama.cpp, and as it stands, the outlook for Vulkan support is also somewhat pessimistic.
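For context on why those two numbers can move in opposite directions: prompt processing decodes the whole prompt as one large batch (parallel, compute-bound work a GPU backend can accelerate), while text generation decodes one token per step (serial, memory-bandwidth-bound, and sensitive to per-call overhead). A rough sketch against llama.cpp's C API, assuming a late-2024 checkout where llama_batch_get_one takes just a token pointer and a count (its signature has changed across versions):

```c
// Sketch: why prompt processing and text generation can behave
// differently on a GPU backend. Verify names against your llama.cpp
// checkout; assumes ctx and tokens already exist.
#include "llama.h"

// Prompt processing: all prompt tokens are decoded in ONE large batch.
// Highly parallel, compute-bound work that an OpenCL/Vulkan backend
// can speed up even with unoptimized kernels.
static void process_prompt(struct llama_context *ctx,
                           llama_token *prompt, int32_t n_prompt) {
    llama_decode(ctx, llama_batch_get_one(prompt, n_prompt));
}

// Text generation: ONE token per decode call, thousands of calls.
// Dominated by memory bandwidth and per-call overhead (kernel
// launches, CPU<->GPU transfers), so a mobile GPU backend can end
// up slower here than the CPU path.
static void generate_token(struct llama_context *ctx, llama_token tok) {
    llama_decode(ctx, llama_batch_get_one(&tok, 1));
}
```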

@a-ghorbani
Owner

Thanks for the input! It’s good to have additional confirmation that closing this for now makes sense until the ecosystem around GPU/Android stabilizes.
