Introducing experimental OpenCL backend with support for Qualcomm Adreno GPUs #10693
Conversation
@slaren @ggerganov
Sorry to ask here, but I'm curious whether it also plays well on Ubuntu ARM (i.e., with the new Rusticl freedreno support):
With Ubuntu x86 Rusticl on an AMD RX 470 the OpenCL kernels fail to compile. Note that I had to force the code to think that I was using an Intel GPU to make it run in the first place:

```diff
--- ggml/src/ggml-opencl2/ggml-opencl2.cpp
index 6df5625a..e01df046 100644
@@ -443,7 +443,7 @@ static ggml_backend_opencl2_context * ggml_cl2_init(ggml_backend_dev_t dev) {
                 "may not work as expected\n",
                 backend_ctx->device_name.c_str(), backend_ctx->adreno_wave_size);
         }
-    } else if (strstr(default_device->name, "Intel")) {
+    } else if (strstr(default_device->name, "AMD")) {
         backend_ctx->gpu_family = GPU_FAMILY::INTEL;
     } else {
         fprintf(stderr, "Unknown GPU: %s\n", default_device->name);
```

Build command:

I'm not sure if this is a Rusticl issue or an AMD/x86 one though.
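The workaround above hijacks the Intel branch because the `strstr()` chain has no AMD case. A minimal self-contained sketch of what an extended detection could look like (the enum and function names here are illustrative, not the backend's actual API):

```cpp
#include <cassert>
#include <cstring>

// Hypothetical sketch: classify an OpenCL device by substring-matching its
// reported name, mirroring the strstr() chain in ggml_cl2_init(). An explicit
// "AMD" branch would avoid having to masquerade as Intel, as the patch does.
enum class GpuFamily { ADRENO, INTEL, AMD, UNKNOWN };

static GpuFamily classify_device(const char * name) {
    if (strstr(name, "Adreno")) return GpuFamily::ADRENO;
    if (strstr(name, "Intel"))  return GpuFamily::INTEL;
    if (strstr(name, "AMD"))    return GpuFamily::AMD;
    return GpuFamily::UNKNOWN;  // caller logs "Unknown GPU: %s"
}
```

Of course, a dedicated family would only help once there are kernels tuned (or at least validated) for it; falling back to a generic path is the safer default.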
I tried it with ~/gguf/Marco-o1-Q4_K_M.gguf, and Q4_0_4_8 (CPU only) runs much faster. Is this expected behavior? When I tried it with Q4_0, it was about 1.5 times faster than Q4_0_4_8.
I don't know what OpenCL extensions are supported by Rusticl.
Thanks for trying it out.
I'm not quite sure what the question is. Currently Q4_0 and Q6_K are the only supported data types.
Tested on Q4_0 (-ngl 99, 29/29 layers). The speed is not much different from Q4_0_4_8, but is this the normal speed? Or is it just not working in my environment (Gen 3)? During inference, a memory error message appears and Termux crashes.
Sorry. I meant to follow up on your earlier questions about perf numbers with Gen 3. |
Thank you for replying. Device: Realme GT5 Pro, 16 GB + 1 TB.

```
[~]$ llama-server --version
ggml_opencl: selecting platform: 'QUALCOMM Snapdragon(TM)'
ggml_opencl: selecting device: 'QUALCOMM Adreno(TM) 750'
ggml_opencl: OpenCL driver: OpenCL 3.0 QUALCOMM build: commit unknown Compiler E031.45.02.16
ggml_opencl: vector subgroup broadcast support: false
ggml_opencl: device FP16 support: true
ggml_opencl: mem base addr align: 1024
ggml_opencl: max mem alloc size: 1024 MB
ggml_opencl: SVM coarse grain buffer support: true
ggml_opencl: SVM fine grain buffer support: true
ggml_opencl: SVM fine grain system support: false
ggml_opencl: SVM atomics support: true
ggml_opencl: flattening quantized weights representation as struct of arrays (GGML_OPENCL_SOA_Q)
ggml_opencl: using kernels optimized for Adreno (GGML_OPENCL_USE_ADRENO_KERNELS)
version: 4277 (e9ae5a14)
built with clang version 19.1.4 for aarch64-unknown-linux-android24
```

Loading shared libraries:

```
[~]$ ldd $(which llama-server)
    libc.so => /system/lib64/libc.so
    libllama.so => /data/data/com.termux/files/usr/lib/libllama.so
    libggml.so => /data/data/com.termux/files/usr/lib/libggml.so
    libggml-cpu.so => /data/data/com.termux/files/usr/lib/libggml-cpu.so
    libggml-rpc.so => /data/data/com.termux/files/usr/lib/libggml-rpc.so
    libggml-opencl2.so => /data/data/com.termux/files/usr/lib/libggml-opencl2.so
    libggml-base.so => /data/data/com.termux/files/usr/lib/libggml-base.so
    libc++_shared.so => /data/data/com.termux/files/usr/lib/libc++_shared.so
    libdl.so => /system/lib64/libdl.so
    libm.so => /system/lib64/libm.so
    ld-android.so => /system/lib64/ld-android.so
    libOpenCL.so => /vendor/lib64/libOpenCL.so
    libc++.so => /system/lib64/libc++.so
```
Wanted to confirm: is this the Marco-o1 you use? https://huggingface.co/AIDC-AI/Marco-o1
I used the following model:
@lhez Just curious, but is there a reason why you went with a brand-new OpenCL backend rather than extending/optimizing the Vulkan one for Qualcomm? I'm pretty sure your GPUs support Vulkan as well.
Sorry for the delay. Finally got a chance to run that model on Galaxy S24 Ultra (Snapdragon Gen 3).
Looks like 6 CPU cores are a bit faster on token gen for this model.
These are not the best options; using all 8 cores for a compute-intensive workload on Snapdragon Gen 3 is not a good idea. See the commands I used in the previous reply.
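For illustration, a CPU-only run restricted to six threads might look like this (the model path and options here are hypothetical, not the exact commands referenced in the earlier reply):

```shell
# Hypothetical example: limit compute to 6 threads on Snapdragon Gen 3
# rather than all 8 cores, so the slower efficiency cores don't drag
# down the big cores during token generation.
./llama-cli -m model.gguf -t 6 -ngl 0 -p "What is a cat?"
```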
Yep. Current Snapdragon platforms support both Vulkan 1.3 and OpenCL 3.0 full profile.
@slaren All comments/suggestions so far are in. OK with me merging it? |
Still getting the warning about clCreateCmdQueue being obsolete. Will fix that separately.
Also for unsupported GPUs.
@slaren update on the previous requests
Tested the following combos:
Thanks a lot, great work! I also used -D GGML_OPENMP=OFF for building; does that negatively impact performance? Good performance on a Surface Laptop 7 / Snapdragon X Elite with the "standard benchmark run" (Llama 2 7B Q4_0), but the CPU still seems faster due to the Q4_0_4_x repack optimization.
The internal threadpool (OPENMP=OFF) is a bit faster, but only for the CPU backend.
Yep, 12 CPU cores on X-Elite are hard to beat :)
Another data point which might be interesting. Even if the OpenCL backend is a first/experimental version (it works great, thanks!!!) and the 12-core CPU horsepower is hard to beat (a lower -ngl actually increases performance), the Vulkan backend also seems to run with Llama-3.2-3B-Instruct-Q8_0.gguf (from huggingface.co bartowski/Llama-3.2-3B-Instruct-GGUF), but it still has an end-token stop issue. Here is a performance comparison: llama-cli generating a response to "What is a cat?" (build: a76c56f (4325)):

- OpenCL (Qualcomm(R) Adreno(TM) X1-85 GPU, OpenCL 3.0 QUALCOMM driver VDX.17.75.00): llama_perf_context_print: load time = 4268.57 ms ("cold" model load; goes down to ~2500 ms on 2nd load)
- Vulkan (Qualcomm(R) Adreno(TM) X1-85 GPU, driver V0.780.0): llama_perf_context_print: load time = 14573.55 ms (bug: compiles shaders at every run)
- CPU (Snapdragon(R) X 12-core X1E80100 3.40 GHz): llama_perf_context_print: load time = 1619.64 ms
My device: OnePlus, 24 GB + 1 TB, but it shows an error.
Can you share how you built llama.cpp? My build shows "ggml_opencl: plaform IDs not available."
`LD_LIBRARY_PATH="/vendor/lib64" ./bin/llama-server`
I find that with the 8 Elite CPU, OpenCL can't be found via LD_LIBRARY_PATH="/vendor/lib64", while my old phone with a Snapdragon 870 can find it that way. Maybe your 8 Gen 3 is not the same as the 8 Elite.
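The LD_LIBRARY_PATH trick works by making the vendor's OpenCL driver visible to the process before launch; a sketch of the pattern (the library path and model name are assumptions that vary by device and ROM, not verified values):

```shell
# Make the vendor OpenCL driver (libOpenCL.so) resolvable at load time.
# /vendor/lib64 is typical on Snapdragon devices, but the location differs
# across ROMs, which may explain why some devices fail to find it.
export LD_LIBRARY_PATH=/vendor/lib64:$LD_LIBRARY_PATH
./bin/llama-server -m model.gguf -ngl 99
```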
Here is a quick check with Q4_0 and a somewhat better prompt:
The PR description (scroll to the top) includes instructions on how to build with the Android NDK.
Sorry for my unclear earlier post. Your OpenCL backend runs great, with no issues or errors. I just wanted to compare its performance to the now (partially) running Vulkan backend, which still has major accuracy/hang issues with K-quants. It is the Vulkan backend that hangs at the end of Q8_0 generation, not OpenCL! BTW, your prompt/parameters still throw the Vulkan backend with Q8_0 into a (multi-token) endless loop at the end. I forgot to verify that the OpenCL backend does not yet support Q8_0; my fault.
Some results with a Xiaomi 14 (Snapdragon 8 Gen 3):
With
Building with the Android NDK and pushing to /data/local/tmp solves the problem, thank you!!
…eno GPUs (ggerganov#10693)

* [cl][adreno] Add Adreno GPU support

  Add new OpenCL backend to support Adreno GPUs

  ---------

  Co-authored-by: Skyler Szot <[email protected]>
  Co-authored-by: Shangqing Gu <[email protected]>
  Co-authored-by: Alexander Angus <[email protected]>
  Co-authored-by: Hongqiang Wang <[email protected]>
  Co-authored-by: Max Krasnyansky <[email protected]>

* [cl][ci] Add workflow for CL
* [cl][adreno] Fix memory leak for non SMALL_ALLOC path
* opencl: integrate backend dyn.load interface and fix compiler and format warnings
* opencl: remove small-alloc support and fix build errors for non-opencl platforms
* opencl: fixed merge conflict (MUSA added twice in cmake)
* opencl-ci: use RUNNER_TEMP instead of github.workspace
* opencl: fix embed tool invocation with python3
* opencl: CI workflow fixes
* opencl: Clean up small-alloc in CMake files
* opencl: cleanup ggml-opencl2 header file
* opencl: use ulong for offsets and strides in ADD kernel
* opencl: use cl_ulong for all offsets
* opencl: use cl_ulong for sizes and strides
* opencl: use `GGML_LOG_xxx` instead of `fprintf(stderr, ...)`
* opencl: rename backend `opencl2` -> `opencl`
* opencl: rename kernel files `ggml-opencl2` -> `ggml-opencl`
* opencl: make OpenCL required, remove redundant lib and inc directories

  `ggml-base`, `..` and `.` are added by `ggml_add_backend_library`

* opencl: rename backend - funcs, structs, etc `opencl2` -> `opencl`
* opencl: remove copyright marker since main license already covers
* opencl: replace some more OPENCL2 leftovers
* opencl: remove limits on `tensor_extra`
* opencl: use pools for `tensor_extra`
* opencl: fix compiler warnings with GCC and Clang

  Still getting the warning about clCreateCmdQueue being obsolete. Will fix that separately.

* opencl: fail gracefully if opencl devices are not available

  Also for unsupported GPUs.

* opencl: fix MSVC builds (string length error)
* opencl: check for various requirements, allow deprecated API
* opencl: update log message for unsupported GPUs

---------

Co-authored-by: Skyler Szot <[email protected]>
Co-authored-by: Shangqing Gu <[email protected]>
Co-authored-by: Alexander Angus <[email protected]>
Co-authored-by: Hongqiang Wang <[email protected]>
Co-authored-by: Max Krasnyansky <[email protected]>
The Android NDK is also available in Termux, so you can build it in Termux.
This PR introduces a new experimental OpenCL backend for Adreno GPUs. Through OpenCL, we can tap into the computational power of Adreno GPUs, which are widely used in many mobile devices, allowing us to optimize llama.cpp for better performance and efficiency on these devices.
This backend has been tuned for and tested on the latest Snapdragon Gen 3, X-Elite, 8-Elite Android, Linux and Windows ARM64 platforms.
Key supported features:

- Offload (`-ngl 0` ... `99`) of LLaMA-based and similar models
- `F32`, `F16`, `Q4_0` and `Q6_K` data types
- `Q4_0` optimized for Adreno GPUs; `Q4_0` tensors are optionally repacked when Adreno-specific kernels are enabled
- `Q6_K`, `F32` and `F16` are supported but are not optimized

Not yet supported features:

- `Q4_K` and other K- and I-quants

Compatibility
OpenCL subgroups are used in this backend (e.g., `sub_group_broadcast`, `sub_group_reduce_add`). Hence, it needs OpenCL 2.x or OpenCL 3.0 with subgroup support. Although this backend is developed and tuned for Adreno GPUs, the portability of OpenCL allows it to run on GPUs from other vendors with minimal changes, as long as subgroup support is available.
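The OpenCL 2.x/3.0 requirement can be sanity-checked host-side from the platform version string. A minimal sketch, assuming the spec-mandated `"OpenCL <major>.<minor> <platform-specific info>"` format returned by `clGetPlatformInfo(CL_PLATFORM_VERSION)` (this helper is illustrative, not part of the backend):

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: parse "OpenCL <major>.<minor> ..." and check that the
// platform is at least OpenCL 2.0, the minimum needed here for subgroup
// support (a full check would also query the subgroup extension/feature).
static bool opencl_version_ok(const std::string & ver) {
    int major = 0, minor = 0;
    if (sscanf(ver.c_str(), "OpenCL %d.%d", &major, &minor) != 2) {
        return false; // not a conformant version string
    }
    return major >= 2;
}
```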
For example, when Adreno-specific kernels are disabled (`-DGGML_OPENCL_USE_ADRENO_KERNELS=OFF`), this backend works on certain Intel GPUs (e.g., the Xe GPU found in a Core i7-12700H) with functional correctness (although performance is not guaranteed).

How to build
In addition to the standard build tools (CMake, LLVM, Visual Studio, etc.), the build requires the official OpenCL Headers and OpenCL ICD Loader. Please see `.github/workflows/build.yml` for examples of how to install those.

The following CMake options are added:

- `GGML_OPENCL` - enables the new OpenCL backend
- `GGML_OPENCL_USE_ADRENO_KERNELS` - enables Adreno-specific kernels

Windows on Snapdragon

Snapdragon-based Android device

The build requires the Android NDK and the OpenCL Headers & ICD Loader mentioned above.
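A cross-compile invocation for an Android device might look like the following sketch; the NDK toolchain path, ABI, and API level are assumptions (see the workflow file above for the exact, tested commands):

```shell
# Cross-compile llama.cpp for a Snapdragon Android device with the new
# OpenCL backend and Adreno-specific kernels enabled.
cmake -B build \
  -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
  -DANDROID_ABI=arm64-v8a \
  -DANDROID_PLATFORM=android-28 \
  -DGGML_OPENCL=ON \
  -DGGML_OPENCL_USE_ADRENO_KERNELS=ON
cmake --build build
```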
Example runs
X-Elite-based laptop
@max-krasnyansky @wanghqc @quic-sszot @shawngu-quic @quic-aangus