Introducing experimental OpenCL backend with support for Qualcomm Adreno GPUs #10693

Merged
28 commits merged into ggerganov:master on Dec 13, 2024

Conversation

lhez
Contributor

@lhez lhez commented Dec 6, 2024

This PR introduces a new experimental OpenCL backend for Adreno GPUs. Through OpenCL, we can tap into the computational power of Adreno GPUs, which are widely used in many mobile devices, allowing us to optimize llama.cpp for better performance and efficiency on these devices.

This backend has been tuned for and tested on the latest Snapdragon platforms (Gen 3, X-Elite, 8-Elite) across Android, Linux and Windows ARM64.

Key supported features:

  • Full and partial GPU offload (-ngl 0 ... 99) of LLaMA-based and similar models
  • Offloaded data types: F32, F16, Q4_0 and Q6_K
    • Q4_0 has been optimized specifically for Adreno GPUs
    • Q4_0 tensors are optionally repacked when Adreno-specific kernels are enabled
    • Q6_K, F32 and F16 are supported but not optimized

Not yet supported features:

  • Q4_K and other K- and I-quants
  • Flash attention and KV quantization
  • Further performance tuning

Compatibility

OpenCL subgroups are used in this backend (e.g., sub_group_broadcast, sub_group_reduce_add), so it requires OpenCL 2.x or OpenCL 3.0 with subgroup support.
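
Whether a given device exposes subgroups can usually be checked from the OpenCL version and extensions it reports, for example with clinfo (a quick sketch; it assumes clinfo is installed, and the exact capability names vary by driver and OpenCL version):

clinfo | grep -iE 'opencl c version|sub.?group'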

Although this backend is developed and tuned for Adreno GPUs, the portability of OpenCL allows it to run on GPUs from other vendors with minimal changes, as long as subgroup support is available.

For example, when Adreno-specific kernels are disabled (-DGGML_OPENCL_USE_ADRENO_KERNELS=OFF), this backend works on certain Intel GPUs (e.g., Xe GPU found in Core i7-12700H) with functional correctness (although performance is not guaranteed).

How to build

In addition to the standard build tools (CMake, LLVM, Visual Studio, etc.), the build requires the official OpenCL Headers and OpenCL ICD-Loader. Please see .github/workflows/build.yml for examples of how to install them.
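
As a rough sketch of what that looks like (the CI workflow remains the authoritative reference, and the $HOME/opencl install prefix is just an example), the Khronos OpenCL-Headers and OpenCL-ICD-Loader can be built and installed from source with CMake:

git clone https://github.com/KhronosGroup/OpenCL-Headers
cmake -S OpenCL-Headers -B OpenCL-Headers/build -D CMAKE_INSTALL_PREFIX="$HOME/opencl"
cmake --build OpenCL-Headers/build --target install

git clone https://github.com/KhronosGroup/OpenCL-ICD-Loader
cmake -S OpenCL-ICD-Loader -B OpenCL-ICD-Loader/build \
    -D CMAKE_PREFIX_PATH="$HOME/opencl" -D CMAKE_INSTALL_PREFIX="$HOME/opencl"
cmake --build OpenCL-ICD-Loader/build --target install

The resulting install prefix is what <path-to-opencl> refers to in the commands below.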

The following CMake options are added (see the example configure commands below):

  • GGML_OPENCL - enables the new OpenCL backend
  • GGML_OPENCL_USE_ADRENO_KERNELS - enables Adreno-specific kernels
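
For a generic Linux build, the same two options apply (an illustrative sketch, not taken from the PR; set GGML_OPENCL_USE_ADRENO_KERNELS=OFF when targeting non-Adreno GPUs):

cmake -B build \
    -D CMAKE_PREFIX_PATH=<path-to-opencl> \
    -D GGML_OPENCL=ON -D GGML_OPENCL_USE_ADRENO_KERNELS=ON
cmake --build build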

Windows on Snapdragon

cmake --preset arm64-windows-llvm-release  \
     -D CMAKE_PREFIX_PATH=<path-to-opencl> \
     -D GGML_OPENCL=ON -D GGML_OPENCL_USE_ADRENO_KERNELS=ON \
     -B build
cmake --build build

Snapdragon-based Android device

The build requires Android NDK and OpenCL Headers & ICD-Loader mentioned above.

cmake \
    -D ANDROID_ABI="arm64-v8a" -D ANDROID_PLATFORM="android-31" \
    -D CMAKE_TOOLCHAIN_FILE="${ANDROID_NDK}/build/cmake/android.toolchain.cmake" \
    -D CMAKE_PREFIX_PATH=<path-to-opencl> \
    -D CMAKE_C_FLAGS="-march=armv8.7a"   \
    -D CMAKE_CXX_FLAGS="-march=armv8.7a" \
    -D GGML_OPENCL=ON -D GGML_OPENCL_USE_ADRENO_KERNELS=ON \
    -B build

cmake --build build
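
Once built, one way to run on the device is over adb (an illustrative sketch; the model file name is a placeholder, and with a shared-library build the llama/ggml libraries must be pushed and added to LD_LIBRARY_PATH as well):

adb push build/bin/llama-cli /data/local/tmp/
adb push model-q4_0.gguf /data/local/tmp/
adb shell 'cd /data/local/tmp && LD_LIBRARY_PATH=/vendor/lib64:. ./llama-cli -m model-q4_0.gguf -ngl 99'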

Example runs

X-Elite-based laptop

PS llama.cpp> .\build\bin\llama-cli.exe -m 'gguf\llama-v3.2-3b-instruct.q4_0.gguf' -f Hawaii-128.txt --seed 42 --ctx-size 4096 -t 2 -ngl 99

ggml_opencl: selecting platform: 'QUALCOMM Snapdragon(TM)'
ggml_opencl: selecting device: 'Qualcomm(R) Adreno(TM) X1-85 GPU'
ggml_opencl: OpenCL driver: OpenCL 3.0 QUALCOMM build: commit unknown Compiler DX.17.75.00
ggml_opencl: vector subgroup broadcast support: true
...
ggml_opencl: using kernels optimized for Adreno (GGML_OPENCL_USE_ADRENO_KERNELS)
...
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors:   CPU_Mapped model buffer size =   308.23 MiB
llm_load_tensors:      OpenCL2 model buffer size =  1820.90 MiB
...
Hawaii (/həˈwaɪ.i/ hə-WY-ee;[9] Hawaiian: Hawaiʻi[həˈvɐjʔi, həˈwɐjʔi]) is an
island state of the United States, in the Pacific Ocean about 2,000 miles
3,200 km: southwest of the U.S. mainland.
It is the only state not on the North American mainland, the only state that is an archipelago, and the only state in the
tropics.

Please summarize previos passage.

The passage describes Hawaii as an island state located in the Pacific Ocean and southwest of the U.S. mainland, and that 
it is different from the other 50 states in several ways, including being the only state not on the North American mainland, 
being an archipelago, and being located in the tropics.

Note: The passage is very short and only provides basic information about Hawaii, but it does not provide much detail 
or depth about the state. [end of text]

llama_perf_sampler_print:    sampling time =       6.38 ms /   222 runs   (    0.03 ms per token, 34807.15 tokens per second)
llama_perf_context_print:        load time =    1505.39 ms
llama_perf_context_print: prompt eval time =     476.23 ms /   128 tokens (    3.72 ms per token,   268.78 tokens per second)
llama_perf_context_print:        eval time =    4082.55 ms /    93 runs   (   43.90 ms per token,    22.78 tokens per second)
llama_perf_context_print:       total time =    4592.73 ms /   221 tokens

@max-krasnyansky @wanghqc @quic-sszot @shawngu-quic @quic-aangus

@github-actions bot added the python, devops and ggml labels on Dec 6, 2024
@max-krasnyansky max-krasnyansky requested review from slaren and ggerganov and removed request for slaren December 6, 2024 22:04
@max-krasnyansky
Collaborator

@slaren @ggerganov
When you get the chance please take a look for any quick (or extended) feedback.
Should be ready to merge and we can iterate further after that.

@oscarbg

oscarbg commented Dec 7, 2024

sorry to ask here, but curious if it plays well also on Ubuntu ARM (so with new RustiCL freedreno support):
https://www.phoronix.com/news/Freedreno-Rusticl-Mesa-24.3

@netrunnereve
Collaborator

sorry to ask here, but curious if it plays well also on Ubuntu ARM (so with new RustiCL freedreno support):

With Ubuntu x86 Rusticl on an AMD RX 470 the OpenCL kernels fail to compile. Note that I had to force the code to think that I was using Intel to make it run in the first place.

-------------------- ggml/src/ggml-opencl2/ggml-opencl2.cpp --------------------
index 6df5625a..e01df046 100644
@@ -443,7 +443,7 @@ static ggml_backend_opencl2_context * ggml_cl2_init(ggml_backend_dev_t dev) {
                 "may not work as expected\n",
                 backend_ctx->device_name.c_str(), backend_ctx->adreno_wave_size);
         }
-    } else if (strstr(default_device->name, "Intel")) {
+    } else if (strstr(default_device->name, "AMD")) {
         backend_ctx->gpu_family = GPU_FAMILY::INTEL;
     } else {
         fprintf(stderr, "Unknown GPU: %s\n", default_device->name);

Build command: cmake .. -DGGML_OPENCL_USE_ADRENO_KERNELS=OFF -DGGML_OPENCL=ON

./bin/llama-bench -t 8 -ngl 100 -m <Q4_0 model>
ggml_opencl: selecting platform: 'Clover'
ggml_opencl: selecting device: 'AMD Radeon RX 470 Graphics (radeonsi, polaris10, LLVM 17.0.6, DRM 3.57, 6.8.0-49-generic)'
ggml_opencl: OpenCL driver: 24.0.9-0ubuntu0.2
ggml_opencl: vector subgroup broadcast support: false
ggml_opencl: device FP16 support: false
ggml_opencl: mem base addr align: 32768
ggml_opencl: max mem alloc size: 2048 MB
ggml_opencl: SVM coarse grain buffer support: false
ggml_opencl: SVM fine grain buffer support: false
ggml_opencl: SVM fine grain system support: false
ggml_opencl: SVM atomics support: false
ggml_opencl: flattening quantized weights representation as struct of arrays (GGML_OPENCL_SOA_Q)
ggml_opencl: kernel compile error:

I'm not sure if this is a Rusticl issue or an AMD/x86 one though.

@myan-o

myan-o commented Dec 8, 2024

I tried it with ~/gguf/Marco-o1-Q4_K_M.gguf, and Q4_0_4_8 (CPU only) runs much faster. Is this expected behavior?

When I tried it with Q4_0, it was about 1.5 times faster than Q4_0_4_8.

@max-krasnyansky
Collaborator

sorry to ask here, but curious if it plays well also on Ubuntu ARM (so with new RustiCL freedreno support): https://www.phoronix.com/news/Freedreno-Rusticl-Mesa-24.3

I don't know what OpenCL extensions are supported by RustiCL.
The Adreno kernels (GGML_OPENCL_USE_ADRENO_KERNELS=ON) most likely won't work.
The generic kernels should work though.

@max-krasnyansky
Collaborator

sorry to ask here, but curious if it plays well also on Ubuntu ARM (so with new RustiCL freedreno support):

With Ubuntu x86 Rusticl on an AMD RX 470 the OpenCL kernels fail to compile. ...

Thanks for trying it out.
Based on the reported capabilities (no FP16 support), that setup is probably not going to work that well with the current implementation. FP16 (i.e., OpenCL half types) is probably what causes the kernel compilation errors.

@max-krasnyansky
Collaborator

max-krasnyansky commented Dec 9, 2024

I tried it with ~/gguf/Marco-o1-Q4_K_M.gguf, and Q4_0_4_8 (CPU only) runs much faster. Is this expected behavior?

When I tried it with Q4_0, it was about 1.5 times faster than Q4_0_4_8.

I'm not quite sure what the question is. Currently Q4_0, Q6_K are the only supported data types.
Please see the PR description above.

@myan-o

myan-o commented Dec 9, 2024

Tested with Q4_0 (-ngl 99, 29/29 layers offloaded).

The speed is not much different from Q4_0_4_8, but is this the normal speed? Or is it just not working correctly in my environment (Gen 3)?

@myan-o

myan-o commented Dec 11, 2024

During inference, a memory error message appears and Termux crashes.

@max-krasnyansky
Collaborator

max-krasnyansky commented Dec 11, 2024

It frequently crashes with memory errors.

Sorry. I meant to follow up on your earlier questions about perf numbers with Gen 3.
I'll retest on Galaxy S24 Ultra shortly and share the numbers you should get.
And will also see if we can reproduce the memory errors.
Please share the exact scenario (which model, which device, and the command line for llama-cli or llama-bench).

@myan-o

myan-o commented Dec 11, 2024

Thank you for replying to me.

device: Realme gt5 pro 16gb + 1tb
build tool:termux + apt install clang
run option:-ngl 999 -fa -t 8 -b 512
run model:Marco-o1-Q4_0.gguf

[~]$ llama-server --version
ggml_opencl: selecting platform: 'QUALCOMM Snapdragon(TM)'
ggml_opencl: selecting device: 'QUALCOMM Adreno(TM) 750'
ggml_opencl: OpenCL driver: OpenCL 3.0 QUALCOMM build: commit unknown Compiler E031.45.02.16
ggml_opencl: vector subgroup broadcast support: false
ggml_opencl: device FP16 support: true
ggml_opencl: mem base addr align: 1024
ggml_opencl: max mem alloc size: 1024 MB
ggml_opencl: SVM coarse grain buffer support: true
ggml_opencl: SVM fine grain buffer support: true
ggml_opencl: SVM fine grain system support: false
ggml_opencl: SVM atomics support: true
ggml_opencl: flattening quantized weights representation as struct of arrays (GGML_OPENCL_SOA_Q)
ggml_opencl: using kernels optimized for Adreno (GGML_OPENCL_USE_ADRENO_KERNELS)
version: 4277 (e9ae5a14)
built with clang version 19.1.4 for aarch64-unknown-linux-android24

Loaded shared libraries:

[~]$ ldd $(which llama-server)
    libc.so => /system/lib64/libc.so
    libllama.so => /data/data/com.termux/files/usr/lib/libllama.so
    libggml.so => /data/data/com.termux/files/usr/lib/libggml.so
    libggml-cpu.so => /data/data/com.termux/files/usr/lib/libggml-cpu.so
    libggml-rpc.so => /data/data/com.termux/files/usr/lib/libggml-rpc.so
    libggml-opencl2.so => /data/data/com.termux/files/usr/lib/libggml-opencl2.so
    libggml-base.so => /data/data/com.termux/files/usr/lib/libggml-base.so
    libc++_shared.so => /data/data/com.termux/files/usr/lib/libc++_shared.so
    libdl.so => /system/lib64/libdl.so
    libm.so => /system/lib64/libm.so
    ld-android.so => /system/lib64/ld-android.so
    libOpenCL.so => /vendor/lib64/libOpenCL.so
    libc++.so => /system/lib64/libc++.so

@lhez
Contributor Author

lhez commented Dec 11, 2024

Thank you for replying to me.

device: Realme gt5 pro 16gb + 1tb
build tool:termux + apt install clang
run option:-ngl 999 -fa -t 8 -b 512
run model:Marco-o1-Q4_0.gguf

Wanted to confirm; is this the marco-o1 you use - https://huggingface.co/AIDC-AI/Marco-o1

@myan-o

myan-o commented Dec 11, 2024

Thank you for replying to me.

device: Realme gt5 pro 16gb + 1tb
build tool:termux + apt install clang
run option:-ngl 999 -fa -t 8 -b 512
run model:Marco-o1-Q4_0.gguf

Wanted to confirm; is this the marco-o1 you use - https://huggingface.co/AIDC-AI/Marco-o1

I used the following model:
https://huggingface.co/bartowski/Marco-o1-GGUF

@netrunnereve
Collaborator

netrunnereve commented Dec 11, 2024

@lhez Just curious but is there a reason why you went with a brand new OpenCL backend rather than extending/optimizing the Vulkan one for Qualcomm? I'm pretty sure your GPUs support Vulkan as well.

@max-krasnyansky
Collaborator

Tested with Q4_0 (-ngl 99, 29/29 layers offloaded).

The speed is not much different from Q4_0_4_8, but is this the normal speed? Or is it just not working correctly in my environment (Gen 3)?

Sorry for the delay. Finally got a chance to run that model on Galaxy S24 Ultra (Snapdragon Gen 3).
Here are the numbers I get

./adreno/bin/llama-bench --mmap 0 -m /data/local/tmp/lmcp/../gguf/Marco-o1-Q4_0.gguf -t 6 -ngl 99 -p 128 -n 16

| model         |     size | params | backend | ngl | threads | mmap | test  |          t/s |
| ------------- | -------: | -----: | ------- | --: | ------: | ---: | ----- | -----------: |
| qwen2 7B Q4_0 | 4.13 GiB | 7.62 B | OpenCL  |  99 |       6 |    0 | pp128 | 84.74 ± 2.06 |
| qwen2 7B Q4_0 | 4.13 GiB | 7.62 B | OpenCL  |  99 |       6 |    0 | tg16  |  6.29 ± 0.14 |

./master/bin/llama-bench --mmap 0 -m /data/local/tmp/lmcp/../gguf/Marco-o1-Q4_0.gguf -t 6 -ngl 0 -p 128 -n 16

| model         |     size | params | backend | threads | mmap | test  |          t/s |
| ------------- | -------: | -----: | ------- | ------: | ---: | ----- | -----------: |
| qwen2 7B Q4_0 | 4.13 GiB | 7.62 B | CPU     |       6 |    0 | pp128 | 50.21 ± 1.55 |
| qwen2 7B Q4_0 | 4.13 GiB | 7.62 B | CPU     |       6 |    0 | tg16  |  9.88 ± 0.93 |

Looks like 6 CPU cores are a bit faster on token gen for this model.
More optimizations are in the works as we mentioned in the PR description.

@max-krasnyansky
Collaborator

@myan-o

run option: -ngl 999 -fa -t 8 -b 512

These are not the best options.
Flash Attention -fa is not yet supported for offload to OpenCL/Adreno so that will run on the CPU.

Using all 8 cores for compute intensive workload on Snapdragon Gen 3 is not a good idea.
Two of the cores are Efficiency cores and will only slow things down in this case.
You'll be much better off running with -t 6 or -t 4.

See the commands I used above in the previous reply.
I cannot reproduce the memory issues you mentioned.

@max-krasnyansky
Collaborator

@lhez Just curious but is there a reason why you went with a brand new OpenCL backend rather than extending/optimizing the Vulkan one for Qualcomm? I'm pretty sure your GPUs support Vulkan as well.

Yep. Current Snapdragon platforms support both Vulkan 1.3 and OpenCL 3.0 full profile.
We started with OpenCL and wanted to enable that first. Vulkan backend updates are in the plans :)

@max-krasnyansky
Collaborator

@slaren All comments/suggestions so far are in. OK with me merging it?

@max-krasnyansky
Collaborator

@slaren update on the previous requests

  • MSVC build has been fixed
  • GCC and LLVM warnings have been fixed
  • Init now fails gracefully if
    • OpenCL drivers are not present (i.e no GPU docker)
    • We detect an unsupported GPU (NVidia and AMD for now) and/or OpenCL driver is missing key features (FP16, etc)
  • GGML_BACKEND_DL build and runtime now work as expected

Tested the following combos:

  • Ubuntu 24.04 x64 GCC and LLVM : no-GPU (docker), unsupported-GPU (nvidia)
  • Ubuntu 24.04 arm64 LLVM: no-GPU
  • Windows arm64 LLVM and MSVC: Snapdragon X-Elite GPU
  • Android arm64 LLVM : Snapdragon Gen 3 GPU, 8 Elite GPU

@max-krasnyansky max-krasnyansky merged commit a76c56f into ggerganov:master Dec 13, 2024
50 checks passed
@AndreasKunar
Contributor

Thanks a lot, great work!

I also used -D GGML_OPENMP=OFF for building, or is it negatively impacting performance?

Good performance on a Surface Laptop 7 / Snapdragon X Elite with the "standard benchmark run" llama2 7B Q4_0. But the CPU still seems faster due to the Q4_0_4_x repack optimization.

| model         |     size | params | backend | ngl | test  |           t/s |
| ------------- | -------: | -----: | ------- | --: | ----- | ------------: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | OpenCL  |  99 | pp512 | 100.65 ± 0.17 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | OpenCL  |  99 | tg128 |  17.95 ± 0.16 |

| model         |     size | params | backend | threads | test  |           t/s |
| ------------- | -------: | -----: | ------- | ------: | ----- | ------------: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU     |      12 | pp512 | 173.82 ± 8.35 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU     |      12 | tg128 |  23.33 ± 0.54 |

@max-krasnyansky
Collaborator

I also used -D GGML_OPENMP=OFF for building, or is it negatively impacting performance?

Internal threadpool (OPENMP=OFF) is a bit faster but only for the CPU backend.

Good performance on a Surface Laptop 7 / Snapdragon X Elite with the "standard benchmark run" llama2 7B Q4_0. But the CPU still seems faster due to the Q4_0_4_x repack optimization.

Yep, 12 CPU cores on X-Elite are hard to beat :)
We mentioned in the PR description that not all Adreno perf optimizations are in yet.
Updates are in the works.

@AndreasKunar
Contributor

Another input which might be interesting, even if the OpenCL backend is a first/experimental version (it works great, thanks!!!) and the 12-core CPU horsepower is hard to beat (a lower -ngl actually increases performance).

The Vulkan backend seems to run with Llama-3.2-3B-Instruct-Q8_0.gguf (from huggingface.co bartowski/Llama-3.2-3B-Instruct-GGUF), but still has an end-token stop issue. Here is a performance comparison:

llama-cli generating a response to "What is a cat?" (build: a76c56f (4325)):

OpenCL (Qualcomm(R) Adreno(TM) X1-85 GPU, OpenCL 3.0 QUALCOMM driver VDX.17.75.00):

llama_perf_context_print: load time = 4268.57 ms ("cold" model load, goes down to ~2500 on 2nd load)
llama_perf_context_print: prompt eval time = 9045.39 ms / 15 tokens ( 603.03 ms per token, 1.66 tokens per second)
llama_perf_context_print: eval time = 39172.04 ms / 428 runs ( 91.52 ms per token, 10.93 tokens per second)

Vulkan (Qualcomm(R) Adreno(TM) X1-85 GPU driver V0.780.0):

llama_perf_context_print: load time = 14573.55 ms (bug: compiles shaders at every run)
llama_perf_context_print: prompt eval time = 16378.12 ms / 15 tokens ( 1091.87 ms per token, 0.92 tokens per second)
llama_perf_context_print: eval time = 45705.69 ms / 723 runs ( 63.22 ms per token, 15.82 tokens per second)
Note: it did generate way more tokens due to end-token problem + having to break

CPU (Snapdragon(R) X 12-core X1E80100 3.40 GHz):

llama_perf_context_print: load time = 1619.64 ms
llama_perf_context_print: prompt eval time = 5780.04 ms / 15 tokens ( 385.34 ms per token, 2.60 tokens per second)
llama_perf_context_print: eval time = 13050.10 ms / 333 runs ( 39.19 ms per token, 25.52 tokens per second)

@sherylynn

@slaren update on the previous requests

* MSVC build has been fixed

* GCC and LLVM warnings have been fixed

* Init now fails gracefully if
  
  * OpenCL drivers are not present (i.e no GPU docker)
  * We detect an unsupported GPU (NVidia and AMD for now) and/or OpenCL driver is missing key features (FP16, etc)

* GGML_BACKEND_DL build and runtime now work as expected

Tested the following combos:

* Ubuntu 24.04 x64 GCC and LLVM : no-GPU (docker), unsupported-GPU (nvidia)

* Ubuntu 24.04 arm64 LLVM: no-GPU

* Windows arm64 LLVM and MSVC: Snapdragon X-Elite GPU

* Android arm64 LLVM : Snapdragon Gen 3 GPU, 8 Elite GPU

my device: OnePlus (24 GB + 1 TB)
CPU: 8 Elite
build tool: termux + apt install clang
run option: -ngl 99
run model: Qwen2.5-7B-Q4_0.gguf

but it shows:
ggml_opencl: plaform IDs not available.
warning: no usable GPU found, --gpu-layers option will be ignored

@sherylynn

Thank you for replying to me.
device: Realme gt5 pro 16gb + 1tb
build tool:termux + apt install clang
run option:-ngl 999 -fa -t 8 -b 512
run model:Marco-o1-Q4_0.gguf

Wanted to confirm; is this the marco-o1 you use - https://huggingface.co/AIDC-AI/Marco-o1

I used the following model: https://huggingface.co/bartowski/Marco-o1-GGUF

can u share how u build llama.cpp , my build show "ggml_opencl: plaform IDs not available.
warning: no usable GPU found, --gpu-layers option will be ignored"
As I use oneplus 13 8elite

@myan-o

myan-o commented Dec 18, 2024

Thank you for replying to me.
device: Realme gt5 pro 16gb + 1tb
build tool:termux + apt install clang
run option:-ngl 999 -fa -t 8 -b 512
run model:Marco-o1-Q4_0.gguf

Wanted to confirm; is this the marco-o1 you use - https://huggingface.co/AIDC-AI/Marco-o1

I used the following model: https://huggingface.co/bartowski/Marco-o1-GGUF

can u share how u build llama.cpp , my build show "ggml_opencl: plaform IDs not available.
warning: no usable GPU found, --gpu-layers option will be ignored"
As I use oneplus 13 8elite

LD_LIBRARY_PATH="/vendor/lib64" ./bin/llama-server

@sherylynn

Thank you for replying to me.
device: Realme gt5 pro 16gb + 1tb
build tool:termux + apt install clang
run option:-ngl 999 -fa -t 8 -b 512
run model:Marco-o1-Q4_0.gguf

Wanted to confirm; is this the Marco-o1 you use - https://huggingface.co/AIDC-AI/Marco-o1

I used the following model: https://huggingface.co/bartowski/Marco-o1-GGUF

Can you share how you build llama.cpp? My build shows "ggml_opencl: plaform IDs not available. warning: no usable GPU found, --gpu-layers option will be ignored" as I use a OnePlus 13 (8 Elite).

LD_LIBRARY_PATH="/vendor/lib64" ./bin/llama-server

I find that the 8 Elite CPU can't get OpenCL via LD_LIBRARY_PATH="/vendor/lib64".

My old phone with a Snapdragon 870 CPU can find it via LD_LIBRARY_PATH="/vendor/lib64".

Maybe your 8 Gen 3 is not the same as the 8 Elite.

@max-krasnyansky
Collaborator

@AndreasKunar

The Vulkan backend seems to run with Llama-3.2-3B-Instruct-Q8_0.gguf (from huggingface.co bartowski/Llama-3.2-3B-Instruct-GGUF), but still has an end-token stop-issue. Here a performance comparison:

Q8_0 is not offloaded. Only Q4_0, Q6_K, F16, F32 for now (see PR description).
So your tests are running on the CPU.
End-token is probably due to incomplete prompt. You're using the Instruct Model but without the chat template.

Here is a quick check with Q4_0 and a bit better prompt

.\build-wos-opencl\bin\llama-cli.exe -m 'gguf\llama-v3.2-3b-instruct.q4_0.gguf' -p "What is a cat? (please be brief)" -ngl 99 --seed 42 -t 2
...
What is a cat? (please be brief) - 3 words
A domesticated mammal. - David Attenborough
A cat is a domesticated mammal of the species Felis catus, commonly known as a housecat. - Merriam-Webster
A cat is a small, furry, carnivorous mammal. - Oxford Dictionaries
A cat is a small, carnivorous mammal. - Cambridge Dictionary

All of these answers are very brief, but different from each other. The ones from Merriam-Webster and Cambridge Dictionary are even more concise! [end of text]

llama_perf_sampler_print:    sampling time =       7.46 ms /   124 runs   (    0.06 ms per token, 16626.44 tokens per second)
llama_perf_context_print:        load time =    1497.69 ms
llama_perf_context_print: prompt eval time =     163.73 ms /    11 tokens (   14.88 ms per token,    67.19 tokens per second)
llama_perf_context_print:        eval time =    4873.82 ms /   112 runs   (   43.52 ms per token,    22.98 tokens per second)
llama_perf_context_print:       total time =    5057.22 ms /   123 tokens

@max-krasnyansky
Collaborator

can u share how u build llama.cpp , my build show "ggml_opencl: plaform IDs not available. warning: no usable GPU found, --gpu-layers option will be ignored" As I use oneplus 13 8elite

The PR description (scroll to the top) includes instructions how to build with the Android NDK.
I'm not familiar with Termux internals and how the env is setup. I typically just use NDK and ADB to build and run things.

@AndreasKunar
Contributor

AndreasKunar commented Dec 18, 2024

@AndreasKunar

The Vulkan backend seems to run with Llama-3.2-3B-Instruct-Q8_0.gguf (from huggingface.co bartowski/Llama-3.2-3B-Instruct-GGUF), but still has an end-token stop-issue. Here a performance comparison:

Q8_0 is not offloaded. Only Q4_0, Q6_K, F16, F32 for now (see PR description). So your tests are running on the CPU. End-token is probably due to incomplete prompt. You're using the Instruct Model but without the chat template.

Here is a quick check with Q4_0 and a bit better prompt

.\build-wos-opencl\bin\llama-cli.exe -m 'gguf\llama-v3.2-3b-instruct.q4_0.gguf' -p "What is a cat? (please be brief)" -ngl 99 --seed 42 -t 2
...
What is a cat? (please be brief) - 3 words
A domesticated mammal. - David Attenborough
A cat is a domesticated mammal of the species Felis catus, commonly known as a housecat. - Merriam-Webster
A cat is a small, furry, carnivorous mammal. - Oxford Dictionaries
A cat is a small, carnivorous mammal. - Cambridge Dictionary

All of these answers are very brief, but different from each other. The ones from Merriam-Webster and Cambridge Dictionary are even more concise! [end of text]

llama_perf_sampler_print:    sampling time =       7.46 ms /   124 runs   (    0.06 ms per token, 16626.44 tokens per second)
llama_perf_context_print:        load time =    1497.69 ms
llama_perf_context_print: prompt eval time =     163.73 ms /    11 tokens (   14.88 ms per token,    67.19 tokens per second)
llama_perf_context_print:        eval time =    4873.82 ms /   112 runs   (   43.52 ms per token,    22.98 tokens per second)
llama_perf_context_print:       total time =    5057.22 ms /   123 tokens

Sorry for my unclear earlier posting. Your OpenCL backend runs great, no issues/errors.

I just wanted to compare its performance to the newly (partially) running Vulkan backend (which still has major accuracy/hang issues with K-quants). The Vulkan backend hangs at the end of Q8_0 generation, not OpenCL! BTW, your prompt/parameters still throw the Vulkan backend with Q8_0 into a (multi-token) endless loop at the end.

I did not remember to verify that the OpenCL backend does not yet support Q8_0, my fault.

@slaren
Collaborator

slaren commented Dec 19, 2024

Some results with Xiaomi 14 (snapdragon 8 gen 3):

| model         |       size | params | backend | ngl | test  |           t/s |
| ------------- | ---------: | -----: | ------- | --: | ----- | ------------: |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | OpenCL  |  99 | pp128 | 327.75 ± 1.58 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | OpenCL  |  99 | tg32  |  23.10 ± 0.34 |
| llama 1B F16  |   2.30 GiB | 1.24 B | OpenCL  |  99 | pp128 |  33.68 ± 0.40 |
| llama 1B F16  |   2.30 GiB | 1.24 B | OpenCL  |  99 | tg32  |  18.78 ± 0.13 |

With -ngl 0 -t 4 (CPU):

| model         |       size | params | backend | ngl | threads | test  |           t/s |
| ------------- | ---------: | -----: | ------- | --: | ------: | ----- | ------------: |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | OpenCL  |   0 |       4 | pp128 | 259.01 ± 3.98 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | OpenCL  |   0 |       4 | tg32  |  48.16 ± 0.47 |
| llama 1B F16  |   2.30 GiB | 1.24 B | OpenCL  |   0 |       4 | pp128 |  63.23 ± 0.31 |
| llama 1B F16  |   2.30 GiB | 1.24 B | OpenCL  |   0 |       4 | tg32  |  19.84 ± 1.45 |
ggml_opencl: selecting platform: 'QUALCOMM Snapdragon(TM)'
ggml_opencl: selecting device: 'QUALCOMM Adreno(TM) 750'
ggml_opencl: device OpenCL version: OpenCL 3.0 Adreno(TM) 750
ggml_opencl: OpenCL driver: OpenCL 3.0 QUALCOMM build: commit unknown Compiler E031.45.02.16
ggml_opencl: vector subgroup broadcast support: false
ggml_opencl: device FP16 support: true
ggml_opencl: mem base addr align: 1024
ggml_opencl: max mem alloc size: 1024 MB
ggml_opencl: SVM coarse grain buffer support: true
ggml_opencl: SVM fine grain buffer support: true
ggml_opencl: SVM fine grain system support: false
ggml_opencl: SVM atomics support: true
ggml_opencl: flattening quantized weights representation as struct of arrays (GGML_OPENCL_SOA_Q)
ggml_opencl: using kernels optimized for Adreno (GGML_OPENCL_USE_ADRENO_KERNELS)

@sherylynn

can u share how u build llama.cpp , my build show "ggml_opencl: plaform IDs not available. warning: no usable GPU found, --gpu-layers option will be ignored" As I use oneplus 13 8elite

The PR description (scroll to the top) includes instructions how to build with the Android NDK. I'm not familiar with Termux internals and how the env is setup. I typically just use NDK and ADB to build and run things.

Building with the Android NDK and pushing to /data/local/tmp solves the problem, thank you!!

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Dec 20, 2024
…eno GPUs (ggerganov#10693)

* [cl][adreno] Add Adreno GPU support

Add new OpenCL backend to support Adreno GPUs

---------

Co-authored-by: Skyler Szot <[email protected]>
Co-authored-by: Shangqing Gu <[email protected]>
Co-authored-by: Alexander Angus <[email protected]>
Co-authored-by: Hongqiang Wang <[email protected]>
Co-authored-by: Max Krasnyansky <[email protected]>

* [cl][ci] Add workflow for CL

* [cl][adreno] Fix memory leak for non SMALL_ALLOC path

* opencl: integrate backend dyn.load interface and fix compiler and format warnings

* opencl: remove small-alloc support and fix build errors for non-opencl platforms

* opencl: fixed merge conflict (MUSA added twice in cmake)

* opencl-ci: use RUNNER_TEMP instead of github.workspace

* opencl: fix embed tool invocation with python3

* opencl: CI workflow fixes

* opencl: Clean up small-alloc in CMake files

* opencl: cleanup ggml-opencl2 header file

* opencl: use ulong for offsets and strides in ADD kernel

* opencl: use cl_ulong for all offsets

* opencl: use cl_ulong for sizes and strides

* opencl: use `GGML_LOG_xxx` instead of `fprintf(stderr, ...)`

* opencl: rename backend `opencl2` -> `opencl`

* opencl: rename kernel files `ggml-opencl2` -> `ggml-opencl`

* opencl: make OpenCL required, remove redundant lib and inc directories

* `ggml-base`, `..` and `.` are added by `ggml_add_backend_library`

* opencl: rename backend - funcs, structs, etc `opencl2` -> `opencl`

* opencl: remove copyright marker since main license already covers

* opencl: replace some more OPENCL2 leftovers

* opencl: remove limits on `tensor_extra`

* opencl: use pools for `tensor_extra`

* opencl: fix compiler warnings with GCC and Clang

Still getting the warning about clCreateCmdQueue being obsolete.
Will fix that separately.

* opencl: fail gracefully if opencl devices are not available

Also for unsupported GPUs.

* opencl: fix MSVC builds (string length error)

* opencl: check for various requirements, allow deprecated API

* opencl: update log message for unsupported GPUs

---------

Co-authored-by: Skyler Szot <[email protected]>
Co-authored-by: Shangqing Gu <[email protected]>
Co-authored-by: Alexander Angus <[email protected]>
Co-authored-by: Hongqiang Wang <[email protected]>
Co-authored-by: Max Krasnyansky <[email protected]>
@myan-o

myan-o commented Dec 20, 2024

can u share how u build llama.cpp , my build show "ggml_opencl: plaform IDs not available. warning: no usable GPU found, --gpu-layers option will be ignored" As I use oneplus 13 8elite

The PR description (scroll to the top) includes instructions how to build with the Android NDK. I'm not familiar with Termux internals and how the env is setup. I typically just use NDK and ADB to build and run things.

Building with the Android NDK and pushing to /data/local/tmp solves the problem, thank you!!

The Android NDK is also available in Termux, so you can build it in Termux.
https://github.com/lzhiyong/termux-ndk
