Introducing experimental OpenCL backend with support for Qualcomm Adreno GPUs #10693

Merged
28 commits merged into ggerganov:master on Dec 13, 2024

Conversation

lhez
Contributor

@lhez lhez commented Dec 6, 2024

This PR introduces a new experimental OpenCL backend for Adreno GPUs. Through OpenCL, we can tap into the computational power of Adreno GPUs, which are widely used in many mobile devices, allowing us to optimize llama.cpp for better performance and efficiency on these devices.

This backend has been tuned for and tested on the latest Snapdragon platforms (Gen 3, X-Elite, 8-Elite) across Android, Linux and Windows ARM64.

Key supported features:

  • Full and partial GPU offload (-ngl 0 ... 99) of LLaMA-based and similar models
  • Offloaded data types: F32, F16, Q4_0 and Q6_K
    • Q4_0 has been optimized specifically for Adreno GPUs
    • Q4_0 tensors are optionally repacked when Adreno-specific kernels are enabled
    • Q6_K, F32 and F16 are supported but not optimized

Not yet supported features:

  • Q4_K and other K- and I-quants
  • Flash attention and KV quantization
  • Further performance tuning

Compatibility

OpenCL subgroups are used in this backend (e.g., sub_group_broadcast, sub_group_reduce_add), so it requires OpenCL 2.x or OpenCL 3.0 with subgroup support.
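
Whether a given device exposes subgroups can usually be checked from the OpenCL version and extensions it reports, for example with clinfo (a quick sketch; it assumes clinfo is installed, and the exact capability names vary by driver and OpenCL version):

clinfo | grep -iE 'opencl c version|sub.?group'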

Although this backend is developed and tuned for Adreno GPUs, the portability of OpenCL allows it to run on GPUs from other vendors with minimal changes, as long as subgroup support is available.

For example, when Adreno-specific kernels are disabled (-DGGML_OPENCL_USE_ADRENO_KERNELS=OFF), this backend works on certain Intel GPUs (e.g., Xe GPU found in Core i7-12700H) with functional correctness (although performance is not guaranteed).

How to build

In addition to the standard build tools (CMake, LLVM, Visual Studio, etc.), the build requires the official OpenCL Headers and OpenCL ICD-Loader. Please see .github/workflows/build.yml for examples of how to install them.
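
As a rough sketch of what that looks like (the CI workflow remains the authoritative reference, and the $HOME/opencl install prefix is just an example), the Khronos OpenCL-Headers and OpenCL-ICD-Loader can be built and installed from source with CMake:

git clone https://github.com/KhronosGroup/OpenCL-Headers
cmake -S OpenCL-Headers -B OpenCL-Headers/build -D CMAKE_INSTALL_PREFIX="$HOME/opencl"
cmake --build OpenCL-Headers/build --target install

git clone https://github.com/KhronosGroup/OpenCL-ICD-Loader
cmake -S OpenCL-ICD-Loader -B OpenCL-ICD-Loader/build \
    -D CMAKE_PREFIX_PATH="$HOME/opencl" -D CMAKE_INSTALL_PREFIX="$HOME/opencl"
cmake --build OpenCL-ICD-Loader/build --target install

The resulting install prefix is what <path-to-opencl> refers to in the commands below.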

The following CMake options are added (see the example configure commands below):

  • GGML_OPENCL - enables the new OpenCL backend
  • GGML_OPENCL_USE_ADRENO_KERNELS - enables Adreno-specific kernels
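
For a generic Linux build, the same two options apply (an illustrative sketch, not taken from the PR; set GGML_OPENCL_USE_ADRENO_KERNELS=OFF when targeting non-Adreno GPUs):

cmake -B build \
    -D CMAKE_PREFIX_PATH=<path-to-opencl> \
    -D GGML_OPENCL=ON -D GGML_OPENCL_USE_ADRENO_KERNELS=ON
cmake --build build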

Windows on Snapdragon

cmake --preset arm64-windows-llvm-release  \
     -D CMAKE_PREFIX_PATH=<path-to-opencl> \
     -D GGML_OPENCL=ON -D GGML_OPENCL_USE_ADRENO_KERNELS=ON \
     -B build
cmake --build build

Snapdragon-based Android device

The build requires Android NDK and OpenCL Headers & ICD-Loader mentioned above.

cmake \
    -D ANDROID_ABI="arm64-v8a" -D ANDROID_PLATFORM="android-31" \
    -D CMAKE_TOOLCHAIN_FILE="${ANDROID_NDK}/build/cmake/android.toolchain.cmake" \
    -D CMAKE_PREFIX_PATH=<path-to-opencl> \
    -D CMAKE_C_FLAGS="-march=armv8.7a"   \
    -D CMAKE_CXX_FLAGS="-march=armv8.7a" \
    -D GGML_OPENCL=ON -D GGML_OPENCL_USE_ADRENO_KERNELS=ON \
    -B build

cmake --build build
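
Once built, one way to run on the device is over adb (an illustrative sketch; the model file name is a placeholder, and with a shared-library build the llama/ggml libraries must be pushed and added to LD_LIBRARY_PATH as well):

adb push build/bin/llama-cli /data/local/tmp/
adb push model-q4_0.gguf /data/local/tmp/
adb shell 'cd /data/local/tmp && LD_LIBRARY_PATH=/vendor/lib64:. ./llama-cli -m model-q4_0.gguf -ngl 99'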

Example runs

X-Elite-based laptop

PS llama.cpp> .\build\bin\llama-cli.exe -m 'gguf\llama-v3.2-3b-instruct.q4_0.gguf' -f Hawaii-128.txt --seed 42 --ctx-size 4096 -t 2 -ngl 99

ggml_opencl: selecting platform: 'QUALCOMM Snapdragon(TM)'
ggml_opencl: selecting device: 'Qualcomm(R) Adreno(TM) X1-85 GPU'
ggml_opencl: OpenCL driver: OpenCL 3.0 QUALCOMM build: commit unknown Compiler DX.17.75.00
ggml_opencl: vector subgroup broadcast support: true
...
ggml_opencl: using kernels optimized for Adreno (GGML_OPENCL_USE_ADRENO_KERNELS)
...
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors:   CPU_Mapped model buffer size =   308.23 MiB
llm_load_tensors:      OpenCL2 model buffer size =  1820.90 MiB
...
Hawaii (/həˈwaɪ.i/ hə-WY-ee;[9] Hawaiian: Hawaiʻi[həˈvɐjʔi, həˈwɐjʔi]) is an
island state of the United States, in the Pacific Ocean about 2,000 miles
3,200 km: southwest of the U.S. mainland.
It is the only state not on the North American mainland, the only state that is an archipelago, and the only state in the
tropics.

Please summarize previos passage.

The passage describes Hawaii as an island state located in the Pacific Ocean and southwest of the U.S. mainland, and that 
it is different from the other 50 states in several ways, including being the only state not on the North American mainland, 
being an archipelago, and being located in the tropics.

Note: The passage is very short and only provides basic information about Hawaii, but it does not provide much detail 
or depth about the state. [end of text]

llama_perf_sampler_print:    sampling time =       6.38 ms /   222 runs   (    0.03 ms per token, 34807.15 tokens per second)
llama_perf_context_print:        load time =    1505.39 ms
llama_perf_context_print: prompt eval time =     476.23 ms /   128 tokens (    3.72 ms per token,   268.78 tokens per second)
llama_perf_context_print:        eval time =    4082.55 ms /    93 runs   (   43.90 ms per token,    22.78 tokens per second)
llama_perf_context_print:       total time =    4592.73 ms /   221 tokens

@max-krasnyansky @wanghqc @quic-sszot @shawngu-quic @quic-aangus

@github-actions bot added the python, devops and ggml labels on Dec 6, 2024
@max-krasnyansky max-krasnyansky requested review from slaren and ggerganov and removed request for slaren December 6, 2024 22:04
@max-krasnyansky
Collaborator

@slaren @ggerganov
When you get the chance please take a look for any quick (or extended) feedback.
Should be ready to merge and we can iterate further after that.

@oscarbg

oscarbg commented Dec 7, 2024

sorry to ask here, but curious if it plays well also on Ubuntu ARM (so with new RustiCL freedreno support):
https://www.phoronix.com/news/Freedreno-Rusticl-Mesa-24.3

@netrunnereve
Collaborator

sorry to ask here, but curious if it plays well also on Ubuntu ARM (so with new RustiCL freedreno support):

With Ubuntu x86 Rusticl on an AMD RX 470 the OpenCL kernels fail to compile. Note that I had to force the code to think that I was using Intel to make it run in the first place.

-------------------- ggml/src/ggml-opencl2/ggml-opencl2.cpp --------------------
index 6df5625a..e01df046 100644
@@ -443,7 +443,7 @@ static ggml_backend_opencl2_context * ggml_cl2_init(ggml_backend_dev_t dev) {
                 "may not work as expected\n",
                 backend_ctx->device_name.c_str(), backend_ctx->adreno_wave_size);
         }
-    } else if (strstr(default_device->name, "Intel")) {
+    } else if (strstr(default_device->name, "AMD")) {
         backend_ctx->gpu_family = GPU_FAMILY::INTEL;
     } else {
         fprintf(stderr, "Unknown GPU: %s\n", default_device->name);

Build command: cmake .. -DGGML_OPENCL_USE_ADRENO_KERNELS=OFF -DGGML_OPENCL=ON

./bin/llama-bench -t 8 -ngl 100 -m <Q4_0 model>
ggml_opencl: selecting platform: 'Clover'
ggml_opencl: selecting device: 'AMD Radeon RX 470 Graphics (radeonsi, polaris10, LLVM 17.0.6, DRM 3.57, 6.8.0-49-generic)'
ggml_opencl: OpenCL driver: 24.0.9-0ubuntu0.2
ggml_opencl: vector subgroup broadcast support: false
ggml_opencl: device FP16 support: false
ggml_opencl: mem base addr align: 32768
ggml_opencl: max mem alloc size: 2048 MB
ggml_opencl: SVM coarse grain buffer support: false
ggml_opencl: SVM fine grain buffer support: false
ggml_opencl: SVM fine grain system support: false
ggml_opencl: SVM atomics support: false
ggml_opencl: flattening quantized weights representation as struct of arrays (GGML_OPENCL_SOA_Q)
ggml_opencl: kernel compile error:

I'm not sure if this is a Rusticl issue or an AMD/x86 one though.

@myan-o

myan-o commented Dec 8, 2024

I tried it with ~/gguf/Marco-o1-Q4_K_M.gguf, and Q4_0_4_8 (CPU only) runs much faster. Is this expected behavior?

When I tried it with Q4_0, it was about 1.5 times faster than Q4_0_4_8.

@max-krasnyansky
Collaborator

sorry to ask here, but curious if it plays well also on Ubuntu ARM (so with new RustiCL freedreno support): https://www.phoronix.com/news/Freedreno-Rusticl-Mesa-24.3

I don't know what OpenCL extensions are supported by RustiCL.
The Adreno kernels (GGML_OPENCL_USE_ADRENO_KERNELS=ON) most likely won't work.
The generic kernels should work though.

@max-krasnyansky
Collaborator

sorry to ask here, but curious if it plays well also on Ubuntu ARM (so with new RustiCL freedreno support):

With Ubuntu x86 Rusticl on an AMD RX 470 the OpenCL kernels fail to compile. ...

Thanks for trying it out.
Based on the reported capabilities (no FP16 support), that setup is probably not going to work that well with the current implementation. FP16 (i.e., OpenCL half types) is probably what causes the kernel compilation errors.

@max-krasnyansky
Collaborator

max-krasnyansky commented Dec 9, 2024

I tried it with ~/gguf/Marco-o1-Q4_K_M.gguf, and Q4_0_4_8 (CPU only) runs much faster. Is this expected behavior?

When I tried it with Q4_0, it was about 1.5 times faster than Q4_0_4_8.

I'm not quite sure what the question is. Currently Q4_0, Q6_K are the only supported data types.
Please see the PR description above.

@myan-o

myan-o commented Dec 9, 2024

Tested with Q4_0 (-ngl 99, 29/29 layers offloaded).

The speed is not much different from Q4_0_4_8, but is this the normal speed? Or is it just not working correctly in my environment (Gen 3)?

@myan-o

myan-o commented Dec 11, 2024

During inference, a memory error message appears and Termux crashes.

@max-krasnyansky
Collaborator

max-krasnyansky commented Dec 11, 2024

It frequently crashes with memory errors.

Sorry. I meant to follow up on your earlier questions about perf numbers with Gen 3.
I'll retest on Galaxy S24 Ultra shortly and share the numbers you should get.
And will also see if we can reproduce the memory errors.
Please share the exact scenario (which model, which device, and the command line for llama-cli or llama-bench).

@myan-o

myan-o commented Dec 11, 2024

Thank you for replying to me.

device: Realme gt5 pro 16gb + 1tb
build tool:termux + apt install clang
run option:-ngl 999 -fa -t 8 -b 512
run model:Marco-o1-Q4_0.gguf

[~]$ llama-server --version
ggml_opencl: selecting platform: 'QUALCOMM Snapdragon(TM)'
ggml_opencl: selecting device: 'QUALCOMM Adreno(TM) 750'
ggml_opencl: OpenCL driver: OpenCL 3.0 QUALCOMM build: commit unknown Compiler E031.45.02.16
ggml_opencl: vector subgroup broadcast support: false
ggml_opencl: device FP16 support: true
ggml_opencl: mem base addr align: 1024
ggml_opencl: max mem alloc size: 1024 MB
ggml_opencl: SVM coarse grain buffer support: true
ggml_opencl: SVM fine grain buffer support: true
ggml_opencl: SVM fine grain system support: false
ggml_opencl: SVM atomics support: true
ggml_opencl: flattening quantized weights representation as struct of arrays (GGML_OPENCL_SOA_Q)
ggml_opencl: using kernels optimized for Adreno (GGML_OPENCL_USE_ADRENO_KERNELS)
version: 4277 (e9ae5a14)
built with clang version 19.1.4 for aarch64-unknown-linux-android24

Loaded shared libraries:

[~]$ ldd $(which llama-server)
    libc.so => /system/lib64/libc.so
    libllama.so => /data/data/com.termux/files/usr/lib/libllama.so
    libggml.so => /data/data/com.termux/files/usr/lib/libggml.so
    libggml-cpu.so => /data/data/com.termux/files/usr/lib/libggml-cpu.so
    libggml-rpc.so => /data/data/com.termux/files/usr/lib/libggml-rpc.so
    libggml-opencl2.so => /data/data/com.termux/files/usr/lib/libggml-opencl2.so
    libggml-base.so => /data/data/com.termux/files/usr/lib/libggml-base.so
    libc++_shared.so => /data/data/com.termux/files/usr/lib/libc++_shared.so
    libdl.so => /system/lib64/libdl.so
    libm.so => /system/lib64/libm.so
    ld-android.so => /system/lib64/ld-android.so
    libOpenCL.so => /vendor/lib64/libOpenCL.so
    libc++.so => /system/lib64/libc++.so

@lhez
Contributor Author

lhez commented Dec 11, 2024

Thank you for replying to me.

device: Realme gt5 pro 16gb + 1tb
build tool:termux + apt install clang
run option:-ngl 999 -fa -t 8 -b 512
run model:Marco-o1-Q4_0.gguf

Wanted to confirm; is this the marco-o1 you use - https://huggingface.co/AIDC-AI/Marco-o1

@myan-o

myan-o commented Dec 11, 2024

Thank you for replying to me.

device: Realme gt5 pro 16gb + 1tb
build tool:termux + apt install clang
run option:-ngl 999 -fa -t 8 -b 512
run model:Marco-o1-Q4_0.gguf

Wanted to confirm; is this the marco-o1 you use - https://huggingface.co/AIDC-AI/Marco-o1

I used the following model:
https://huggingface.co/bartowski/Marco-o1-GGUF

@netrunnereve
Collaborator

netrunnereve commented Dec 11, 2024

@lhez Just curious but is there a reason why you went with a brand new OpenCL backend rather than extending/optimizing the Vulkan one for Qualcomm? I'm pretty sure your GPUs support Vulkan as well.

@max-krasnyansky
Collaborator

Tested with Q4_0 (-ngl 99, 29/29 layers offloaded).

The speed is not much different from Q4_0_4_8, but is this the normal speed? Or is it just not working correctly in my environment (Gen 3)?

Sorry for the delay. Finally got a chance to run that model on Galaxy S24 Ultra (Snapdragon Gen 3).
Here are the numbers I get

./adreno/bin/llama-bench --mmap 0 -m /data/local/tmp/lmcp/../gguf/Marco-o1-Q4_0.gguf -t 6 -ngl 99 -p 128 -n 16

| model         |     size | params | backend | ngl | threads | mmap | test  |          t/s |
| ------------- | -------: | -----: | ------- | --: | ------: | ---: | ----- | -----------: |
| qwen2 7B Q4_0 | 4.13 GiB | 7.62 B | OpenCL  |  99 |       6 |    0 | pp128 | 84.74 ± 2.06 |
| qwen2 7B Q4_0 | 4.13 GiB | 7.62 B | OpenCL  |  99 |       6 |    0 | tg16  |  6.29 ± 0.14 |

./master/bin/llama-bench --mmap 0 -m /data/local/tmp/lmcp/../gguf/Marco-o1-Q4_0.gguf -t 6 -ngl 0 -p 128 -n 16

| model         |     size | params | backend | threads | mmap | test  |          t/s |
| ------------- | -------: | -----: | ------- | ------: | ---: | ----- | -----------: |
| qwen2 7B Q4_0 | 4.13 GiB | 7.62 B | CPU     |       6 |    0 | pp128 | 50.21 ± 1.55 |
| qwen2 7B Q4_0 | 4.13 GiB | 7.62 B | CPU     |       6 |    0 | tg16  |  9.88 ± 0.93 |

Looks like 6 CPU cores are a bit faster on token gen for this model.
More optimizations are in the works as we mentioned in the PR description.

@max-krasnyansky
Collaborator

@myan-o

run option: -ngl 999 -fa -t 8 -b 512

These are not the best options.
Flash Attention -fa is not yet supported for offload to OpenCL/Adreno so that will run on the CPU.

Using all 8 cores for compute intensive workload on Snapdragon Gen 3 is not a good idea.
Two of the cores are Efficiency cores and will only slow things down in this case.
You'll be much better off running with -t 6 or -t 4.

See the commands I used above in the previous reply.
I cannot reproduce the memory issues you mentioned.

@max-krasnyansky
Collaborator

@lhez Just curious but is there a reason why you went with a brand new OpenCL backend rather than extending/optimizing the Vulkan one for Qualcomm? I'm pretty sure your GPUs support Vulkan as well.

Yep. Current Snapdragon platforms support both Vulkan 1.3 and OpenCL 3.0 full profile.
We started with OpenCL and wanted to enable that first. Vulkan backend updates are in the plans :)

@max-krasnyansky
Collaborator

@slaren All comments/suggestions so far are in. OK with me merging it?

@max-krasnyansky
Collaborator

@slaren update on the previous requests

  • MSVC build has been fixed
  • GCC and LLVM warnings have been fixed
  • Init now fails gracefully if
    • OpenCL drivers are not present (i.e no GPU docker)
    • We detect an unsupported GPU (NVidia and AMD for now) and/or OpenCL driver is missing key features (FP16, etc)
  • GGML_BACKEND_DL build and runtime now work as expected

Tested the following combos:

  • Ubuntu 24.04 x64 GCC and LLVM : no-GPU (docker), unsupported-GPU (nvidia)
  • Ubuntu 24.04 arm64 LLVM: no-GPU
  • Windows arm64 LLVM and MSVC: Snapdragon X-Elite GPU
  • Android arm64 LLVM : Snapdragon Gen 3 GPU, 8 Elite GPU

@max-krasnyansky max-krasnyansky merged commit a76c56f into ggerganov:master Dec 13, 2024
50 checks passed
@AndreasKunar
Contributor

Thanks a lot, great work!

I also used -D GGML_OPENMP=OFF for building, or is it negatively impacting performance?

Good performance on a Surface Laptop 7 / Snapdragon X Elite with the "standard benchmark run" llama2 7B Q4_0. But the CPU still seems faster due to the Q4_0_4_x repack optimization.

| model         |     size | params | backend | ngl | test  |           t/s |
| ------------- | -------: | -----: | ------- | --: | ----- | ------------: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | OpenCL  |  99 | pp512 | 100.65 ± 0.17 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | OpenCL  |  99 | tg128 |  17.95 ± 0.16 |

| model         |     size | params | backend | threads | test  |           t/s |
| ------------- | -------: | -----: | ------- | ------: | ----- | ------------: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU     |      12 | pp512 | 173.82 ± 8.35 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU     |      12 | tg128 |  23.33 ± 0.54 |

@max-krasnyansky
Collaborator

I also used -D GGML_OPENMP=OFF for building, or is it negatively impacting performance?

Internal threadpool (OPENMP=OFF) is a bit faster but only for the CPU backend.

Good performance on a Surface Laptop 7 / Snapdragon X Elite with the "standard benchmark run" llama2 7B Q4_0. But the CPU still seems faster due to the Q4_0_4_x repack optimization.

Yep, 12 CPU cores on X-Elite are hard to beat :)
We mentioned in the PR description that not all Adreno perf optimizations are in yet.
Updates are in the works.

@AndreasKunar
Contributor

Another input which might be interesting, even if the OpenCL backend is a first/experimental version (it works great, thanks!!!) and the 12-core CPU horsepower is hard to beat (a lower -ngl actually increases performance).

The Vulkan backend seems to run with Llama-3.2-3B-Instruct-Q8_0.gguf (from huggingface.co bartowski/Llama-3.2-3B-Instruct-GGUF), but still has an end-token stop issue. Here is a performance comparison:

llama-cli generating a response to "What is a cat?" (build: a76c56f (4325)):

OpenCL (Qualcomm(R) Adreno(TM) X1-85 GPU, OpenCL 3.0 QUALCOMM driver VDX.17.75.00):

llama_perf_context_print: load time = 4268.57 ms ("cold" model load, goes down to ~2500 on 2nd load)
llama_perf_context_print: prompt eval time = 9045.39 ms / 15 tokens ( 603.03 ms per token, 1.66 tokens per second)
llama_perf_context_print: eval time = 39172.04 ms / 428 runs ( 91.52 ms per token, 10.93 tokens per second)

Vulkan (Qualcomm(R) Adreno(TM) X1-85 GPU driver V0.780.0):

llama_perf_context_print: load time = 14573.55 ms (bug: compiles shaders at every run)
llama_perf_context_print: prompt eval time = 16378.12 ms / 15 tokens ( 1091.87 ms per token, 0.92 tokens per second)
llama_perf_context_print: eval time = 45705.69 ms / 723 runs ( 63.22 ms per token, 15.82 tokens per second)
Note: it did generate way more tokens due to end-token problem + having to break

CPU (Snapdragon(R) X 12-core X1E80100 3.40 GHz):

llama_perf_context_print: load time = 1619.64 ms
llama_perf_context_print: prompt eval time = 5780.04 ms / 15 tokens ( 385.34 ms per token, 2.60 tokens per second)
llama_perf_context_print: eval time = 13050.10 ms / 333 runs ( 39.19 ms per token, 25.52 tokens per second)

@sherylynn

@slaren update on the previous requests

* MSVC build has been fixed

* GCC and LLVM warnings have been fixed

* Init now fails gracefully if
  
  * OpenCL drivers are not present (i.e no GPU docker)
  * We detect an unsupported GPU (NVidia and AMD for now) and/or OpenCL driver is missing key features (FP16, etc)

* GGML_BACKEND_DL build and runtime now work as expected

Tested the following combos:

* Ubuntu 24.04 x64 GCC and LLVM : no-GPU (docker), unsupported-GPU (nvidia)

* Ubuntu 24.04 arm64 LLVM: no-GPU

* Windows arm64 LLVM and MSVC: Snapdragon X-Elite GPU

* Android arm64 LLVM : Snapdragon Gen 3 GPU, 8 Elite GPU

my device: OnePlus (24 GB + 1 TB)
CPU: 8 Elite
build tool: termux + apt install clang
run option: -ngl 99
run model: Qwen2.5-7B-Q4_0.gguf

but it shows:
ggml_opencl: plaform IDs not available.
warning: no usable GPU found, --gpu-layers option will be ignored

@sherylynn

Thank you for replying to me.
device: Realme gt5 pro 16gb + 1tb
build tool:termux + apt install clang
run option:-ngl 999 -fa -t 8 -b 512
run model:Marco-o1-Q4_0.gguf

Wanted to confirm; is this the marco-o1 you use - https://huggingface.co/AIDC-AI/Marco-o1

I used the following model: https://huggingface.co/bartowski/Marco-o1-GGUF

can u share how u build llama.cpp , my build show "ggml_opencl: plaform IDs not available.
warning: no usable GPU found, --gpu-layers option will be ignored"
As I use oneplus 13 8elite

@myan-o

myan-o commented Dec 18, 2024

Thank you for replying to me.
device: Realme gt5 pro 16gb + 1tb
build tool:termux + apt install clang
run option:-ngl 999 -fa -t 8 -b 512
run model:Marco-o1-Q4_0.gguf

Wanted to confirm; is this the marco-o1 you use - https://huggingface.co/AIDC-AI/Marco-o1

I used the following model: https://huggingface.co/bartowski/Marco-o1-GGUF

can u share how u build llama.cpp , my build show "ggml_opencl: plaform IDs not available.
warning: no usable GPU found, --gpu-layers option will be ignored"
As I use oneplus 13 8elite

LD_LIBRARY_PATH="/vendor/lib64" ./bin/llama-server

@sherylynn

Thank you for replying to me.
device: Realme gt5 pro 16gb + 1tb
build tool:termux + apt install clang
run option:-ngl 999 -fa -t 8 -b 512
run model:Marco-o1-Q4_0.gguf

Wanted to confirm; is this the Marco-o1 you use - https://huggingface.co/AIDC-AI/Marco-o1

I used the following model: https://huggingface.co/bartowski/Marco-o1-GGUF

Can you share how you build llama.cpp? My build shows "ggml_opencl: plaform IDs not available. warning: no usable GPU found, --gpu-layers option will be ignored" as I use a OnePlus 13 (8 Elite).

LD_LIBRARY_PATH="/vendor/lib64" ./bin/llama-server

I find that the 8 Elite CPU can't get OpenCL via LD_LIBRARY_PATH="/vendor/lib64".

My old phone with a Snapdragon 870 CPU can find it via LD_LIBRARY_PATH="/vendor/lib64".

Maybe your 8 Gen 3 is not the same as the 8 Elite.

@max-krasnyansky
Collaborator

@AndreasKunar

The Vulkan backend seems to run with Llama-3.2-3B-Instruct-Q8_0.gguf (from huggingface.co bartowski/Llama-3.2-3B-Instruct-GGUF), but still has an end-token stop-issue. Here a performance comparison:

Q8_0 is not offloaded. Only Q4_0, Q6_K, F16, F32 for now (see PR description).
So your tests are running on the CPU.
End-token is probably due to incomplete prompt. You're using the Instruct Model but without the chat template.

Here is a quick check with Q4_0 and a bit better prompt

.\build-wos-opencl\bin\llama-cli.exe -m 'gguf\llama-v3.2-3b-instruct.q4_0.gguf' -p "What is a cat? (please be brief)" -ngl 99 --seed 42 -t 2
...
What is a cat? (please be brief) - 3 words
A domesticated mammal. - David Attenborough
A cat is a domesticated mammal of the species Felis catus, commonly known as a housecat. - Merriam-Webster
A cat is a small, furry, carnivorous mammal. - Oxford Dictionaries
A cat is a small, carnivorous mammal. - Cambridge Dictionary

All of these answers are very brief, but different from each other. The ones from Merriam-Webster and Cambridge Dictionary are even more concise! [end of text]

llama_perf_sampler_print:    sampling time =       7.46 ms /   124 runs   (    0.06 ms per token, 16626.44 tokens per second)
llama_perf_context_print:        load time =    1497.69 ms
llama_perf_context_print: prompt eval time =     163.73 ms /    11 tokens (   14.88 ms per token,    67.19 tokens per second)
llama_perf_context_print:        eval time =    4873.82 ms /   112 runs   (   43.52 ms per token,    22.98 tokens per second)
llama_perf_context_print:       total time =    5057.22 ms /   123 tokens

@max-krasnyansky
Collaborator

can u share how u build llama.cpp , my build show "ggml_opencl: plaform IDs not available. warning: no usable GPU found, --gpu-layers option will be ignored" As I use oneplus 13 8elite

The PR description (scroll to the top) includes instructions how to build with the Android NDK.
I'm not familiar with Termux internals and how the env is setup. I typically just use NDK and ADB to build and run things.

@AndreasKunar
Contributor

AndreasKunar commented Dec 18, 2024

@AndreasKunar

The Vulkan backend seems to run with Llama-3.2-3B-Instruct-Q8_0.gguf (from huggingface.co bartowski/Llama-3.2-3B-Instruct-GGUF), but still has an end-token stop-issue. Here a performance comparison:

Q8_0 is not offloaded. Only Q4_0, Q6_K, F16, F32 for now (see PR description). So your tests are running on the CPU. End-token is probably due to incomplete prompt. You're using the Instruct Model but without the chat template.

Here is a quick check with Q4_0 and a bit better prompt

.\build-wos-opencl\bin\llama-cli.exe -m 'gguf\llama-v3.2-3b-instruct.q4_0.gguf' -p "What is a cat? (please be brief)" -ngl 99 --seed 42 -t 2
...
What is a cat? (please be brief) - 3 words
A domesticated mammal. - David Attenborough
A cat is a domesticated mammal of the species Felis catus, commonly known as a housecat. - Merriam-Webster
A cat is a small, furry, carnivorous mammal. - Oxford Dictionaries
A cat is a small, carnivorous mammal. - Cambridge Dictionary

All of these answers are very brief, but different from each other. The ones from Merriam-Webster and Cambridge Dictionary are even more concise! [end of text]

llama_perf_sampler_print:    sampling time =       7.46 ms /   124 runs   (    0.06 ms per token, 16626.44 tokens per second)
llama_perf_context_print:        load time =    1497.69 ms
llama_perf_context_print: prompt eval time =     163.73 ms /    11 tokens (   14.88 ms per token,    67.19 tokens per second)
llama_perf_context_print:        eval time =    4873.82 ms /   112 runs   (   43.52 ms per token,    22.98 tokens per second)
llama_perf_context_print:       total time =    5057.22 ms /   123 tokens

Sorry for my unclear earlier posting. Your OpenCL backend runs great, no issues/errors.

I just wanted to compare its performance to the newly (partially) running Vulkan backend (which still has major accuracy/hang issues with K-quants). The Vulkan backend hangs at the end of Q8_0 generation, not OpenCL! BTW, your prompt/parameters still throw the Vulkan backend with Q8_0 into a (multi-token) endless loop at the end.

I did not remember to verify that the OpenCL backend does not yet support Q8_0, my fault.

@slaren
Collaborator

slaren commented Dec 19, 2024

Some results with Xiaomi 14 (snapdragon 8 gen 3):

| model         |       size | params | backend | ngl | test  |           t/s |
| ------------- | ---------: | -----: | ------- | --: | ----- | ------------: |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | OpenCL  |  99 | pp128 | 327.75 ± 1.58 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | OpenCL  |  99 | tg32  |  23.10 ± 0.34 |
| llama 1B F16  |   2.30 GiB | 1.24 B | OpenCL  |  99 | pp128 |  33.68 ± 0.40 |
| llama 1B F16  |   2.30 GiB | 1.24 B | OpenCL  |  99 | tg32  |  18.78 ± 0.13 |

With -ngl 0 -t 4 (CPU):

| model         |       size | params | backend | ngl | threads | test  |           t/s |
| ------------- | ---------: | -----: | ------- | --: | ------: | ----- | ------------: |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | OpenCL  |   0 |       4 | pp128 | 259.01 ± 3.98 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | OpenCL  |   0 |       4 | tg32  |  48.16 ± 0.47 |
| llama 1B F16  |   2.30 GiB | 1.24 B | OpenCL  |   0 |       4 | pp128 |  63.23 ± 0.31 |
| llama 1B F16  |   2.30 GiB | 1.24 B | OpenCL  |   0 |       4 | tg32  |  19.84 ± 1.45 |
ggml_opencl: selecting platform: 'QUALCOMM Snapdragon(TM)'
ggml_opencl: selecting device: 'QUALCOMM Adreno(TM) 750'
ggml_opencl: device OpenCL version: OpenCL 3.0 Adreno(TM) 750
ggml_opencl: OpenCL driver: OpenCL 3.0 QUALCOMM build: commit unknown Compiler E031.45.02.16
ggml_opencl: vector subgroup broadcast support: false
ggml_opencl: device FP16 support: true
ggml_opencl: mem base addr align: 1024
ggml_opencl: max mem alloc size: 1024 MB
ggml_opencl: SVM coarse grain buffer support: true
ggml_opencl: SVM fine grain buffer support: true
ggml_opencl: SVM fine grain system support: false
ggml_opencl: SVM atomics support: true
ggml_opencl: flattening quantized weights representation as struct of arrays (GGML_OPENCL_SOA_Q)
ggml_opencl: using kernels optimized for Adreno (GGML_OPENCL_USE_ADRENO_KERNELS)

@sherylynn

can u share how u build llama.cpp , my build show "ggml_opencl: plaform IDs not available. warning: no usable GPU found, --gpu-layers option will be ignored" As I use oneplus 13 8elite

The PR description (scroll to the top) includes instructions how to build with the Android NDK. I'm not familiar with Termux internals and how the env is setup. I typically just use NDK and ADB to build and run things.

Building with the Android NDK and pushing to /data/local/tmp solves the problem, thank you!!

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Dec 20, 2024
…eno GPUs (ggerganov#10693)

* [cl][adreno] Add Adreno GPU support

Add new OpenCL backend to support Adreno GPUs

---------

Co-authored-by: Skyler Szot <[email protected]>
Co-authored-by: Shangqing Gu <[email protected]>
Co-authored-by: Alexander Angus <[email protected]>
Co-authored-by: Hongqiang Wang <[email protected]>
Co-authored-by: Max Krasnyansky <[email protected]>

* [cl][ci] Add workflow for CL

* [cl][adreno] Fix memory leak for non SMALL_ALLOC path

* opencl: integrate backend dyn.load interface and fix compiler and format warnings

* opencl: remove small-alloc support and fix build errors for non-opencl platforms

* opencl: fixed merge conflict (MUSA added twice in cmake)

* opencl-ci: use RUNNER_TEMP instead of github.workspace

* opencl: fix embed tool invocation with python3

* opencl: CI workflow fixes

* opencl: Clean up small-alloc in CMake files

* opencl: cleanup ggml-opencl2 header file

* opencl: use ulong for offsets and strides in ADD kernel

* opencl: use cl_ulong for all offsets

* opencl: use cl_ulong for sizes and strides

* opencl: use `GGML_LOG_xxx` instead of `fprintf(stderr, ...)`

* opencl: rename backend `opencl2` -> `opencl`

* opencl: rename kernel files `ggml-opencl2` -> `ggml-opencl`

* opencl: make OpenCL required, remove redundant lib and inc directories

* `ggml-base`, `..` and `.` are added by `ggml_add_backend_library`

* opencl: rename backend - funcs, structs, etc `opencl2` -> `opencl`

* opencl: remove copyright marker since main license already covers

* opencl: replace some more OPENCL2 leftovers

* opencl: remove limits on `tensor_extra`

* opencl: use pools for `tensor_extra`

* opencl: fix compiler warnings with GCC and Clang

Still getting the warning about clCreateCmdQueue being obsolete.
Will fix that separately.

* opencl: fail gracefully if opencl devices are not available

Also for unsupported GPUs.

* opencl: fix MSVC builds (string length error)

* opencl: check for various requirements, allow deprecated API

* opencl: update log message for unsupported GPUs

---------

Co-authored-by: Skyler Szot <[email protected]>
Co-authored-by: Shangqing Gu <[email protected]>
Co-authored-by: Alexander Angus <[email protected]>
Co-authored-by: Hongqiang Wang <[email protected]>
Co-authored-by: Max Krasnyansky <[email protected]>
@myan-o

myan-o commented Dec 20, 2024

can u share how u build llama.cpp , my build show "ggml_opencl: plaform IDs not available. warning: no usable GPU found, --gpu-layers option will be ignored" As I use oneplus 13 8elite

The PR description (scroll to the top) includes instructions how to build with the Android NDK. I'm not familiar with Termux internals and how the env is setup. I typically just use NDK and ADB to build and run things.

Building with the Android NDK and pushing to /data/local/tmp solves the problem, thank you!!

The Android NDK is also available in Termux, so you can build it in Termux.
https://github.com/lzhiyong/termux-ndk
