
[Feat]: gpu inference #63

Closed
kevkid opened this issue Oct 22, 2024 · 7 comments
Labels
enhancement New feature or request

Comments

kevkid commented Oct 22, 2024

Description
Will we see GPU inference to speed up generation?

Use Case
In all use cases we want more speed.

kevkid added the enhancement label on Oct 22, 2024
@a-ghorbani
Owner

Which device are you referring to? If iPhone, Metal is already supported.


kevkid commented Oct 22, 2024

Android. If I remember correctly, I was able to compile llama.cpp on my device and it ran fairly quickly. But even the 3B model feels very slow.


JasonOSX commented Nov 2, 2024

Also interesting: I downloaded Qwen-2.5-3B and it worked great. Then I downloaded some more models, and all of a sudden every model became extremely slow, producing only 0.5 tokens per second. It is back to 6 t/s after removing and installing again. Pixel 8 / Android 15.


sotwi commented Dec 4, 2024

Android GPU support would be very welcome.

@a-ghorbani
Owner

@kevkid which backend did you use for GPU support? Do you still have the numbers on the performance improvements?

Seemingly, llama.cpp's Vulkan implementation is currently designed for desktop and has not (yet) been optimized for Android.

Even when it works (either Vulkan or OpenCL) on certain GPUs/drivers/quants (OpenCL currently only supports Q4_0 and Q6_K), users do not consistently report significant performance improvements. For Vulkan at least, this is likely due to the shaders not being optimized for Android GPUs; there are several conversations around this in the llama.cpp issue tracker.

We need to wait a bit until llama.cpp's Android GPU support and the drivers (some of the reported issues appear to be driver-related) become more stable and better supported.
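For anyone who wants to experiment in the meantime, here is a minimal sketch of how GPU layer offload is requested through llama.cpp's C API. Function and field names match late-2024 checkouts of llama.cpp (verify against your version); whether layers actually run on the GPU depends on which backend the library was built with (Metal, Vulkan, or the experimental OpenCL backend), and on a CPU-only build the setting is simply ignored.

```c
// Minimal sketch: requesting GPU layer offload via llama.cpp's C API.
// Names match late-2024 checkouts; check against your version.
// On a CPU-only build, n_gpu_layers is silently ignored.
#include "llama.h"
#include <stdio.h>

int main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s model.gguf\n", argv[0]);
        return 1;
    }

    llama_backend_init();

    struct llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99; // offload as many layers as the backend accepts

    struct llama_model *model = llama_load_model_from_file(argv[1], mparams);
    if (model == NULL) {
        fprintf(stderr, "failed to load %s\n", argv[1]);
        return 1;
    }

    // ... create a context and run inference here ...

    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```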

I will close this for now, but we can reopen it once we have a clearer path forward for GPU support.

@Vali-98

Vali-98 commented Jan 5, 2025

Just wanted to chime in @a-ghorbani: I've tested the new experimental OpenCL implementation and it's quite weird, as text-gen speeds dip while prompt processing improves for Q4_0. I've also conversed quite a bit with one of the Vulkan contributors for llama.cpp, and as it stands, the outlook for Vulkan support is also somewhat pessimistic.
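For context on why those two numbers can move in opposite directions: prompt processing decodes the whole prompt as one large batch (parallel, compute-bound work a GPU backend can accelerate), while text generation decodes one token per step (serial, memory-bandwidth-bound, and sensitive to per-call overhead). A rough sketch against llama.cpp's C API, assuming a late-2024 checkout where llama_batch_get_one takes just a token pointer and a count (its signature has changed across versions):

```c
// Sketch: why prompt processing and text generation can behave
// differently on a GPU backend. Verify names against your llama.cpp
// checkout; assumes ctx and tokens already exist.
#include "llama.h"

// Prompt processing: all prompt tokens are decoded in ONE large batch.
// Highly parallel, compute-bound work that an OpenCL/Vulkan backend
// can speed up even with unoptimized kernels.
static void process_prompt(struct llama_context *ctx,
                           llama_token *prompt, int32_t n_prompt) {
    llama_decode(ctx, llama_batch_get_one(prompt, n_prompt));
}

// Text generation: ONE token per decode call, thousands of calls.
// Dominated by memory bandwidth and per-call overhead (kernel
// launches, CPU<->GPU transfers), so a mobile GPU backend can end
// up slower here than the CPU path.
static void generate_token(struct llama_context *ctx, llama_token tok) {
    llama_decode(ctx, llama_batch_get_one(&tok, 1));
}
```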

@a-ghorbani
Owner

Thanks for the input! It’s good to have additional confirmation that closing this for now makes sense until the ecosystem around GPU/Android stabilizes.
