
Compile bug: [QNN] Not able to run tiny llama model with QNN NPU #14

Open
akshatshah17 opened this issue Dec 13, 2024 · 4 comments

@akshatshah17

Git commit

e36ad89

Operating systems

Linux

GGML backends

CPU

Problem description & steps to reproduce

I followed the procedure below to build llama.cpp and convert the model into quantized GGUF format, but when running the model on the device it fails to load.

git clone https://github.com/chraac/llama.cpp.git --recursive
cd llama.cpp
git checkout dev-refactoring
export ANDROID_NDK=/home/code/Android/Ndk/android-ndk-r26d/
export QNN_SDK_PATH=/home/code/Android/qnn-sdk/qairt/2.27.5.241009/

Build for CPU
cmake -B build
cmake --build build --config Release -j16

Build for Android
cmake \
  -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
  -DANDROID_ABI=arm64-v8a \
  -DANDROID_PLATFORM=android-28 \
  -DCMAKE_C_FLAGS="-march=armv8.7a" \
  -DCMAKE_CXX_FLAGS="-march=armv8.7a" \
  -DGGML_OPENMP=OFF \
  -DGGML_LLAMAFILE=OFF \
  -DGGML_QNN=ON \
  -DGGML_QNN_DEFAULT_LIB_SEARCH_PATH=/data/local/tmp \
  -B build-android
cmake --build build-android --config Release -j4
cmake --install build-android --prefix install-android --config Release

Model conversion
python3 convert_hf_to_gguf.py ~/tiny_llama/ --outfile output_file_tiny_llama_fp32.gguf --outtype f32
./build/bin/llama-quantize output_file_tiny_llama_fp32.gguf output_file_tiny_llama_Q4_K_M.gguf Q4_K_M

On S24 QC
adb push install-android/ /data/local/tmp/
adb push output_file_tiny_llama_Q4_K_M.gguf /data/local/tmp/

export LD_LIBRARY_PATH=/data/local/tmp/install-android/lib/
./install-android/bin/llama-cli -m output_file_tiny_llama_Q4_K_M.gguf -c 512 -p "prompt"

First Bad Commit

No response

Relevant log output

build: 4396 (e36ad895) with cc (Ubuntu 9.4.0-1ubuntu1~20.04.3) 9.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_load_model_from_file: using device qnn-gpu (Qualcomm Adreno GPU) - 7630 MiB free
llama_model_loader: loaded meta data with 29 key-value pairs and 273 tensors from output_file_SR_3B_Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = SR_3B
llama_model_loader: - kv   3:                         general.size_label str              = 3.6B
llama_model_loader: - kv   4:                          llama.block_count u32              = 30
llama_model_loader: - kv   5:                       llama.context_length u32              = 1280
llama_model_loader: - kv   6:                     llama.embedding_length u32              = 3072
llama_model_loader: - kv   7:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv   8:                 llama.attention.head_count u32              = 24
llama_model_loader: - kv   9:              llama.attention.head_count_kv u32              = 4
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  11:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  12:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  13:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  14:                          general.file_type u32              = 15
llama_model_loader: - kv  15:                           llama.vocab_size u32              = 105900
llama_model_loader: - kv  16:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  17:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  18:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  19:                      tokenizer.ggml.tokens arr[str,105900]  = ["<|end_of_text|>", "<|begin_of_text|...
llama_model_loader: - kv  20:                  tokenizer.ggml.token_type arr[i32,105900]  = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  21:                      tokenizer.ggml.merges arr[str,105604]  = ["Ġ Ġ", "ĠĠ ĠĠ", "Ġ t", "i n",...
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  23:                tokenizer.ggml.eos_token_id u32              = 0
llama_model_loader: - kv  24:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  25:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  26:                    tokenizer.chat_template str              = {% if messages[0]['role'] == 'system'...
llama_model_loader: - kv  27:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  28:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   61 tensors
llama_model_loader: - type q4_K:  183 tensors
llama_model_loader: - type q6_K:   29 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 53
llm_load_vocab: token to piece cache size = 0.6436 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 105900
llm_load_print_meta: n_merges         = 105604
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 1280
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_layer          = 30
llm_load_print_meta: n_head           = 24
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 6
llm_load_print_meta: n_embd_k_gqa     = 512
llm_load_print_meta: n_embd_v_gqa     = 512
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 1280
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 3.58 B
llm_load_print_meta: model size       = 2.04 GiB (4.90 BPW)
llm_load_print_meta: general.name     = SR_3B
llm_load_print_meta: BOS token        = 1 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 0 '<|end_of_text|>'
llm_load_print_meta: UNK token        = 0 '<|end_of_text|>'
llm_load_print_meta: PAD token        = 0 '<|end_of_text|>'
llm_load_print_meta: LF token         = 179 'Ä'
llm_load_print_meta: FIM PRE token    = 2 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token    = 4 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token    = 3 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token    = 5 '<|fim_pad|>'
llm_load_print_meta: FIM REP token    = 7 '<|repo_name|>'
llm_load_print_meta: EOG token        = 0 '<|end_of_text|>'
llm_load_print_meta: EOG token        = 5 '<|fim_pad|>'
llm_load_print_meta: EOG token        = 7 '<|repo_name|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/31 layers to GPU
llm_load_tensors:   CPU_Mapped model buffer size =  2091.15 MiB
..................................................................................
llama_new_context_with_model: n_seq_max     = 1
llama_new_context_with_model: n_ctx         = 512
llama_new_context_with_model: n_ctx_per_seq = 512
llama_new_context_with_model: n_batch       = 512
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 500000.0
llama_new_context_with_model: freq_scale    = 1
llama_new_context_with_model: n_ctx_per_seq (512) < n_ctx_train (1280) -- the full capacity of the model will not be utilized
[ggml_backend_qnn_init_with_device_context, 327]: extend_lib_search_path is nullptr, will use /data/local/tmp// as default
[qnn_system_interface, 10]: initialize qnn system successfully

[qnn_init, 248]: device property is not supported
[qnn_init, 299]: create QNN device successfully
[ggml_backend_qnn_init_with_device_context, 379]: qnn device name qnn-gpu
[ggml_backend_qnn_init_with_device_context, 327]: extend_lib_search_path is nullptr, will use /data/local/tmp// as default
[qnn_system_interface, 10]: initialize qnn system successfully

[qnn_init, 258]: device counts 1
[qnn_init, 263]: deviceID:0, deviceType:0, numCores 1
[qnn_init, 268]: htp_type:0(ON_CHIP)
[qnn_init, 271]: qualcomm soc_model:69(unknown), htp_arch:79(unknown), vtcm_size:8 MB
[qnn_init, 297]: failed to create QNN device
[qnn_init, 346]: why failed to initialize qnn context
[ggml_backend_qnn_init_with_device_context, 369]: init qnn subsystem failed with qnn backend qnn-npu, pls check why
llama_new_context_with_model: failed to initialize qnn-npu backend
[ggml_backend_qnn_free, 208]: idx 1, name:qnn-gpu
common_init_from_params: failed to create context with model 'output_file_SR_3B_Q4_K_M.gguf'
main: error: unable to load model
@akshatshah17 (Author)

@chraac can you please reply on this?

@chraac (Owner)

chraac commented Jan 5, 2025

Hi @akshatshah17 ,
Looking through your error log, I found that:

  1. The NPU initialization failed. Please make sure the libQnnHtp*.so libraries are placed in the same directory alongside llama-cli; for more detail please have a look at docker-compose-compile.yml#L35. A sketch of the extra push step is shown below.
  2. It looks like you are trying to load a Q4 model, but quantized-model support in the QNN backend is still under construction, so please try an F16/F32 model instead.
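
For reference, a minimal sketch of those two steps. The QNN SDK library paths and the v75 HTP arch are assumptions based on the standard QAIRT SDK layout and the S24's SoC, not taken from the log above; adjust them to your SDK version and device:

# assumption: HTP runtime libraries live under $QNN_SDK_PATH/lib/ as in a standard QAIRT install
adb push $QNN_SDK_PATH/lib/aarch64-android/libQnnSystem.so /data/local/tmp/install-android/bin/
adb push $QNN_SDK_PATH/lib/aarch64-android/libQnnHtp*.so /data/local/tmp/install-android/bin/
adb push $QNN_SDK_PATH/lib/hexagon-v75/unsigned/libQnnHtpV75Skel.so /data/local/tmp/install-android/bin/

# convert to F16 instead of Q4_K_M while QNN quantized support is under construction
python3 convert_hf_to_gguf.py ~/tiny_llama/ --outfile output_file_tiny_llama_f16.gguf --outtype f16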

@akshatshah17 (Author)

Thanks @chraac, it's working. But in the logs below I can see that it first offloads the layers to the GPU, and after that it prints "qnn device name qnn-gpu", which is fine; later in the logs, however, I also see some NPU-related lines, so I am not sure whether the model is running on the QNN GPU or the NPU. I have highlighted the relevant parts.

llm_load_print_meta: max token length = 48
llm_load_tensors: offloading 22 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 23/23 layers to GPU

llm_load_tensors: CPU_Mapped model buffer size = 636.18 MiB
......................................................................................
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_ctx_per_seq = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_pre_seq (4096) > n_ctx_train (2048) -- possible training context overflow
[ggml_backend_qnn_init_with_device_context, 327]: extend_lib_search_path is nullptr, will use /data/local/tmp// as default
[qnn_system_interface, 10]: initialize qnn system successfully

[qnn_init, 248]: device property is not supported
[qnn_init, 299]: create QNN device successfully
[ggml_backend_qnn_init_with_device_context, 380]: qnn device name qnn-gpu
[ggml_backend_qnn_init_with_device_context, 327]: extend_lib_search_path is nullptr, will use /data/local/tmp// as default
[qnn_system_interface, 10]: initialize qnn system successfully

[qnn_init, 258]: device counts 1
[qnn_init, 263]: deviceID:0, deviceType:0, numCores 1
[qnn_init, 268]: htp_type:0(ON_CHIP)
[qnn_init, 271]: qualcomm soc_model:69(unknown), htp_arch:79(unknown), vtcm_size:8 MB
[qnn_init, 299]: create QNN device successfully
[alloc_rpcmem, 594]: failed to allocate rpc memory, size: 2048 MB
[qnn_init, 372]: capacity of QNN rpc ion memory is about 2000 MB
[init_htp_perfinfra, 485]: HTP backend perf_infrastructure creation ok
[init_htp_perfinfra, 497]: HTP infra type = 0, which is perf infra type
[ggml_backend_qnn_init_with_device_context, 380]: qnn device name qnn-npu

llama_kv_cache_init: qnn-gpu KV buffer size = 88.00 MiB
llama_new_context_with_model: KV self size = 88.00 MiB, K (f16): 44.00 MiB, V (f16): 44.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.12 MiB
llama_new_context_with_model: CPU compute buffer size = 280.01 MiB
llama_new_context_with_model: graph nodes = 710
llama_new_context_with_model: graph splits = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 8
main: model was trained on only 2048 context tokens (4096 specified)

system_info: n_threads = 8 (n_threads_batch = 8) / 8 | CPU : NEON = 1 | ARM_FMA = 1 | MATMUL_INT8 = 1 | AARCH64_REPACK = 1 |

sampler seed: 3467048278
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1

[{<(Task)>}]
You are a summarization expert. Please read the provided carefully and summarize it in 3 sentences in English. The summary should comprehensively cover the entire content of the original text and be written with the same meaning as the source material.

[{<(Input)>}] Bread, milk, eggs, chicken, rice, pasta, tomatoes, spinach, bananas, apples, yogurt, cheese, toothpaste, soap, tissues, laundry detergent, coffee, Two proteins: a rotisserie chicken and two 4 oz. fillets of fresh salmon Two veggies: asparagus and carrots a handful of bananas and two avocados cereal

[{<(ParagraphSummary)>}]
Ingredients:

  • 1 rotisserie chicken (or two 4 oz. Fillets), cooked
  • 2 asparagus, chopped
  • 2 carrots, chopped
  • 2 bananas, sliced
  • 2 avocados, mashed
  • 1/4 cup cereal

Instructions:

  1. Preheat oven to 375°F. Line a baking dish with parchment paper.
  2. Place chicken in a large bowl, add asparagus and carrots, and toss with olive oil, salt, and pepper. Spread in a single layer in the prepared baking dish.
  3. In a small bowl, combine mashed banana, cereal, and chicken stock. Pour over chicken mixture.
  4. Bake for 30-35 minutes, or until cooked through.
  5. Serve with mashed avocado and top with toppings (such as sliced jalapeños or chopped cilantro). Enjoy! [end of text]

llama_perf_sampler_print: sampling time = 9.81 ms / 441 runs ( 0.02 ms per token, 44958.71 tokens per second)
llama_perf_context_print: load time = 637.56 ms
llama_perf_context_print: prompt eval time = 1246.73 ms / 190 tokens ( 6.56 ms per token, 152.40 tokens per second)
llama_perf_context_print: eval time = 3927.85 ms / 250 runs ( 15.71 ms per token, 63.65 tokens per second)
llama_perf_context_print: total time = 5202.37 ms / 440 tokens
[ggml_backend_qnn_free, 208]: idx 2, name:qnn-npu
[ggml_backend_qnn_free, 208]: idx 1, name:qnn-gpu

@chraac (Owner)

chraac commented Jan 11, 2025

From your log, it looks like it's running on the qnn-gpu device. By the way, the llama.cpp framework decides which device runs each layer based on each device's supports_op interface; a rough way to check the per-layer assignment is sketched below.
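
A minimal sketch for inspecting the scheduler's assignments, assuming the ggml-backend scheduler's debug dump (the GGML_SCHED_DEBUG environment variable) is available in this build; that variable and the f16 model name are assumptions, not taken from the log above:

# assumption: GGML_SCHED_DEBUG enables the ggml scheduler's split/assignment dump in this build
export GGML_SCHED_DEBUG=2
./install-android/bin/llama-cli -m output_file_tiny_llama_f16.gguf -c 512 -p "prompt" 2>&1 | grep -i qnn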
