
ggml-qnn: add Qualcomm QNN (Qualcomm Neural Network, aka Qualcomm AI Engine Direct) backend #6869

Closed
wants to merge 17 commits into from

Conversation

zhouwg
Contributor

@zhouwg zhouwg commented Apr 24, 2024

Self Reported Review Complexity

  • Review Complexity : Low
  • Review Complexity : Medium
  • Review Complexity : High
  • I have read the contributing guidelines

Purpose

Android maintained its position as the leading mobile operating system worldwide in the fourth quarter of 2023, with a market share of 70.1 percent.

Qualcomm is currently the No. 1 mobile SoC semiconductor company (MediaTek led in market share in Q1 2024, but I personally consider Qualcomm the real No. 1 mobile SoC vendor). The Hexagon NPU in the Qualcomm Snapdragon 8 Gen 3 was designed for generative AI, delivering 98% faster performance and 40% improved performance-per-watt for sustained AI inferencing, which makes it the leading processor for on-device AI inferencing.

The QNN (Qualcomm Neural Network, aka Qualcomm AI Engine Direct) SDK is verified to work with the following versions of the ML frameworks:

  • TensorFlow: tf-1.15.0, or tf-2.10.1
  • TFLite: tflite-2.3.0
  • PyTorch: torch-1.13.1
  • ONNX: onnx-1.11.0

ggml is a very compact, well-designed, and highly optimized C/C++ machine learning framework/library. This PR aims to add Qualcomm's QNN backend for ggml, and accordingly focuses on one question: how to utilize the Hexagon NPU maximally with this well-designed, compact framework.

Status

The data path works as expected with whisper.cpp and llama.cpp using the QNN backend, verified on both low-end and high-end Android phones based on Qualcomm mobile SoCs.


4x performance gains for GGML_OP_MUL_MAT using the QNN CPU backend with 1 thread on a high-end Android phone equipped with a flagship Qualcomm Snapdragon 8 Gen 3 mobile SoC (released in October 2023). The performance of GGML_OP_MUL_MAT might be improved much more using the QNN NPU (aka Hexagon Tensor Processor) backend once we learn the secrets (QNN RPC, multithreading in the NPU backend, ...) of Qualcomm's NPU.


A dedicated Android command-line program (for unit testing) works as expected on a high-end Android phone equipped with a Qualcomm SM8650-AB Snapdragon 8 Gen 3 and on low-end Android phones equipped with low-end Qualcomm mobile SoCs (the QNN NPU backend does not work on low-end Qualcomm phones).
    /data/local/tmp//libQnnCpu.so
    QNN libs already exist on Android phone
    ggml-qnn-test: 1 file pushed. 16.3 MB/s (4567168 bytes in 0.267s)
    [main, 344]: enter qnn_ggml_op
    
    [main, 345]: ggml op:2(ADD)
    [main, 359]: Allocating Memory of size 33554432 bytes, 32 MB
    
    [ggml_backend_qnn_init, 3955]: device 0
    [ggml_backend_qnn_init, 3956]: qnn_lib_path /data/local/tmp/
    [qnn_init, 2172]: enter qni_init
    
    [load_system, 2033]: system_lib_path:/data/local/tmp/libQnnSystem.so
    
    [load_system, 2082]: find a valid qnn system interface
    
    [load_system, 2092]: initialize qnn system successfully
    
    [qnn_init, 2180]: load QNN system lib successfully
    
    [load_backend, 1911]: lib_path:/data/local/tmp/libQnnCpu.so
    
    [load_backend, 1935]: num_providers=1
    
    [load_backend, 1960]: find a valid qnn interface
    
    [load_backend, 2005]: saver_initialize is null
    
    [qnn_init, 2213]: initialize qnn log successfully
    
    [qnn_init, 2224]: initialize qnn backend successfully
    
    [qnn_init, 2230]: device property is not supported
    
    [qnn_init, 2241]: create device successfully
    
    [qnn_init, 2245]: profiling turned on; level = 2
    [qnn_init, 2256]: detailed profiling requested. Creating Qnn Profile object
    
    [qnn_init, 2262]: initialize qnn profile successfully
    
    [qnn_init, 2272]: load rpcmem lib successfully
    
    [qnn_init, 2299]: initialize qnn context successfully
    
    [qnn_init, 2302]: leave qni_init
    
    [ggml_backend_qnn_init, 4011]: qnn device name QNN-CPU
    [init_qnn_graph, 2406]: succeed to create graph QNN-CPU, 0xd4a54a2a43bcdc2f
    
    [main, 395]: creating new tensors
    
    [main, 396]: ggml_blck_size(f32) 1
    [main, 397]: ggml_type_size(f32) 4
    [main, 436]: creating backend buffer
    
    [main, 448]: creating compute graph
    
    [ggml_qnn_can_handle_op, 2458]: op name:ADD, tensor type:f32
    [ggml_qnn_can_handle_op, 2460]: src0 type:f32
    [ggml_qnn_can_handle_op, 2463]: src1 type:f32
    [ggml_qnn_add, 2574]: call ggml_qnn_add
    
    [ggml_qnn_add, 2578]:        tensor_0: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    
    [ggml_qnn_add, 2582]:        tensor_1: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    
    [ggml_qnn_add, 2586]:        tensor_2: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    
    [ggml_qnn_add, 2587]: 4, 4, 1, 1
    [ggml_qnn_add, 2588]: tensor0 name tensor_0
    [ggml_qnn_add, 2589]: tensor1 name tensor_1
    [ggml_qnn_add, 2590]: tensor2 name tensor_2
    [ggml_qnn_add, 2617]: graph name ggml_op_qnn_add_1tensor_0_tensor_1
    [ggml_qnn_logcallback, 2165]:     11.5ms [ DEBUG ] getNode OpPackage-Name : qti.aisw Node-Type : ElementWiseAdd 
    [ggml_qnn_logcallback, 2165]:     11.5ms [VERBOSE] validate	Node-Type : ElementWiseAdd	Node-Name : ggml_op_add 
    [ggml_qnn_logcallback, 2165]:     11.7ms [  INFO ] CpuGraph::finalize 
    [ggml_qnn_logcallback, 2165]:     11.7ms [ DEBUG ] Setting data pointer for tensor ID: 1 
    [ggml_qnn_logcallback, 2165]:     11.7ms [ DEBUG ] Setting data pointer for tensor ID: 2 
    [ggml_qnn_logcallback, 2165]:     11.7ms [ DEBUG ] Setting data pointer for tensor ID: 3 
    [ggml_qnn_logcallback, 2165]:     11.7ms [  INFO ] CpuGraph::execute 
    [get_tensor_rank, 210]: tensor->rank 4
    
    [get_tensor_rank, 211]: get_tensor_rank 2
    
    [get_tensor_data_size, 223]: get_tensor_data_size 64
    [get_tensor_data_size, 224]: ggml_nbytes(tensor) 64
    [main, 464]: dump:
    
    [tensor_dump, 191]: dump ggml tensor src0(tensor_0)
    [tensor_dump, 195]:            src0: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    [tensor_sum_elements, 151]:    -0.84     0.23    -0.07    -0.25 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:     0.10    -0.32    -0.96     0.28 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:    -0.63    -0.59     0.29    -1.00 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:    -0.01     0.10     0.92     0.54 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 185]: 
    
    [tensor_dump, 198]: 
    
    [tensor_dump, 191]: dump ggml tensor src1(tensor_1)
    [tensor_dump, 195]:            src1: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    [tensor_sum_elements, 151]:     0.99    -0.43    -0.41    -0.44 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:    -0.06     0.64    -0.61    -0.98 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:    -0.86    -0.11     0.41     0.27 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:     0.54    -0.70    -0.90    -0.13 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 185]: 
    
    [tensor_dump, 198]: 
    
    [tensor_dump, 191]: dump ggml tensor dst(tensor_2)
    [tensor_dump, 195]:             dst: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    [tensor_sum_elements, 151]:     0.15    -0.19    -0.48    -0.69 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:     0.04     0.32    -1.57    -0.70 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:    -1.49    -0.70     0.70    -0.73 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:     0.53    -0.60     0.02     0.42 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 185]: 
    
    [tensor_dump, 198]: 
    
    [ggml_backend_qnn_free, 3753]: enter ggml_backend_qnn_free
    [ggml_backend_qnn_free, 3755]: idx 0, name:qnn-cpu
    [ggml_backend_qnn_free, 3764]: graph type:ADD
    [qnn_finalize, 2318]: succeed to close rpcmem lib
    
    [ggml_backend_qnn_free, 3786]: leave ggml_backend_qnn_free
    
    
    [ggml_backend_qnn_init, 3955]: device 0
    [ggml_backend_qnn_init, 3956]: qnn_lib_path /data/local/tmp/
    [qnn_init, 2172]: enter qni_init
    
    [load_system, 2033]: system_lib_path:/data/local/tmp/libQnnSystem.so
    
    [load_system, 2082]: find a valid qnn system interface
    
    [load_system, 2092]: initialize qnn system successfully
    
    [qnn_init, 2180]: load QNN system lib successfully
    
    [load_backend, 1911]: lib_path:/data/local/tmp/libQnnCpu.so
    
    [load_backend, 1935]: num_providers=1
    
    [load_backend, 1960]: find a valid qnn interface
    
    [load_backend, 2005]: saver_initialize is null
    
    [qnn_init, 2213]: initialize qnn log successfully
    
    [qnn_init, 2224]: initialize qnn backend successfully
    
    [qnn_init, 2230]: device property is not supported
    
    [qnn_init, 2241]: create device successfully
    
    [qnn_init, 2245]: profiling turned on; level = 2
    [qnn_init, 2256]: detailed profiling requested. Creating Qnn Profile object
    
    [qnn_init, 2262]: initialize qnn profile successfully
    
    [qnn_init, 2272]: load rpcmem lib successfully
    
    [qnn_init, 2299]: initialize qnn context successfully
    
    [qnn_init, 2302]: leave qni_init
    
    [ggml_backend_qnn_init, 4011]: qnn device name QNN-CPU
    [init_qnn_graph, 2406]: succeed to create graph QNN-CPU, 0xd4a54a5b40bcdc2f
    
    [main, 395]: creating new tensors
    
    [main, 396]: ggml_blck_size(f32) 1
    [main, 397]: ggml_type_size(f32) 4
    [main, 436]: creating backend buffer
    
    [main, 448]: creating compute graph
    
    [ggml_qnn_can_handle_op, 2458]: op name:MUL, tensor type:f32
    [ggml_qnn_can_handle_op, 2460]: src0 type:f32
    [ggml_qnn_can_handle_op, 2463]: src1 type:f32
    [ggml_qnn_hanlde_op, 2993]: call ggml_qnn_hanlde_op
    
    [ggml_qnn_hanlde_op, 2997]:        tensor_0: type = 0 (  f32)  ne =     4 x     4 x     1, nb = (    4,    16,    64)
    
    [ggml_qnn_hanlde_op, 3001]:        tensor_1: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    
    [ggml_qnn_hanlde_op, 3005]:        tensor_2: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    
    [ggml_qnn_hanlde_op, 3006]: 4, 4, 1, 1
    [ggml_qnn_hanlde_op, 3007]: tensor0 name tensor_0
    [ggml_qnn_hanlde_op, 3008]: tensor1 name tensor_1
    [ggml_qnn_hanlde_op, 3009]: tensor2 name tensor_2
    [ggml_qnn_hanlde_op, 3033]: qnn graph name ggml_qnn_graph_MUL1tensor_0_tensor_1
    [ggml_qnn_hanlde_op, 3034]: qnn op_config name ggml_qnn_op_config_MUL1tensor_0_tensor_1
    [ggml_qnn_logcallback, 2165]:     17.7ms [ DEBUG ] getNode OpPackage-Name : qti.aisw Node-Type : ElementWiseMultiply 
    [ggml_qnn_logcallback, 2165]:     17.8ms [VERBOSE] validate	Node-Type : ElementWiseMultiply	Node-Name : ggml_qnn_op_config_MUL1tensor_0_tensor_1 
    [ggml_qnn_logcallback, 2165]:     18.0ms [  INFO ] CpuGraph::finalize 
    [ggml_qnn_logcallback, 2165]:     18.1ms [ DEBUG ] Setting data pointer for tensor ID: 1 
    [ggml_qnn_logcallback, 2165]:     18.1ms [ DEBUG ] Setting data pointer for tensor ID: 2 
    [ggml_qnn_logcallback, 2165]:     18.1ms [ DEBUG ] Setting data pointer for tensor ID: 3 
    [ggml_qnn_logcallback, 2165]:     18.1ms [  INFO ] CpuGraph::execute 
    [ggml_qnn_hanlde_op, 3134]: duration of ggml_qnn_MUL : 0 milliseconds
    
    [ggml_qnn_hanlde_op, 3135]: call ggml_qnn_hanlde_op done
    
    [get_tensor_rank, 210]: tensor->rank 4
    
    [get_tensor_rank, 211]: get_tensor_rank 2
    
    [get_tensor_data_size, 223]: get_tensor_data_size 64
    [get_tensor_data_size, 224]: ggml_nbytes(tensor) 64
    [main, 464]: dump:
    
    [tensor_dump, 191]: dump ggml tensor src0(tensor_0)
    [tensor_dump, 195]:            src0: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    [tensor_sum_elements, 151]:    -0.62     0.59    -0.34     0.40 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:    -0.81     0.33     0.52     0.01 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:    -0.37     0.43     0.97     0.06 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:     0.28     0.09    -0.57    -0.02 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 185]: 
    
    [tensor_dump, 198]: 
    
    [tensor_dump, 191]: dump ggml tensor src1(tensor_1)
    [tensor_dump, 195]:            src1: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    [tensor_sum_elements, 151]:     0.24    -0.57    -0.17     0.36 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:    -0.83    -0.64     0.23    -0.87 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:    -0.25    -0.31     0.55     0.64 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:    -0.42     0.42     0.96     0.88 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 185]: 
    
    [tensor_dump, 198]: 
    
    [tensor_dump, 191]: dump ggml tensor dst(tensor_2)
    [tensor_dump, 195]:             dst: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    [tensor_sum_elements, 151]:    -0.15    -0.34     0.06     0.14 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:     0.67    -0.21     0.12    -0.01 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:     0.09    -0.13     0.53     0.04 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:    -0.12     0.04    -0.55    -0.01 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 185]: 
    
    [tensor_dump, 198]: 
    
    [ggml_backend_qnn_free, 3753]: enter ggml_backend_qnn_free
    [ggml_backend_qnn_free, 3755]: idx 0, name:qnn-cpu
    [ggml_backend_qnn_free, 3764]: graph type:MUL
    [qnn_finalize, 2318]: succeed to close rpcmem lib
    
    [ggml_backend_qnn_free, 3786]: leave ggml_backend_qnn_free
    
    /data/local/tmp//libQnnCpu.so
    QNN libs already exist on Android phone
    ggml-qnn-test: 1 file pushed. 20.3 MB/s (4567168 bytes in 0.215s)
    [main, 344]: enter qnn_ggml_op
    
    [main, 345]: ggml op:23(MUL_MAT)
    [main, 359]: Allocating Memory of size 33554432 bytes, 32 MB
    
    [ggml_backend_qnn_init, 3955]: device 0
    [ggml_backend_qnn_init, 3956]: qnn_lib_path /data/local/tmp/
    [qnn_init, 2172]: enter qni_init
    
    [load_system, 2033]: system_lib_path:/data/local/tmp/libQnnSystem.so
    
    [load_system, 2082]: find a valid qnn system interface
    
    [load_system, 2092]: initialize qnn system successfully
    
    [qnn_init, 2180]: load QNN system lib successfully
    
    [load_backend, 1911]: lib_path:/data/local/tmp/libQnnCpu.so
    
    [load_backend, 1935]: num_providers=1
    
    [load_backend, 1960]: find a valid qnn interface
    
    [load_backend, 2005]: saver_initialize is null
    
    [qnn_init, 2213]: initialize qnn log successfully
    
    [qnn_init, 2224]: initialize qnn backend successfully
    
    [qnn_init, 2230]: device property is not supported
    
    [qnn_init, 2241]: create device successfully
    
    [qnn_init, 2245]: profiling turned on; level = 2
    [qnn_init, 2256]: detailed profiling requested. Creating Qnn Profile object
    
    [qnn_init, 2262]: initialize qnn profile successfully
    
    [qnn_init, 2272]: load rpcmem lib successfully
    
    [qnn_init, 2299]: initialize qnn context successfully
    
    [qnn_init, 2302]: leave qni_init
    
    [ggml_backend_qnn_init, 4011]: qnn device name QNN-CPU
    [init_qnn_graph, 2406]: succeed to create graph QNN-CPU, 0xd4a50a2049bcdc2f
    
    [main, 395]: creating new tensors
    
    [main, 396]: ggml_blck_size(f32) 1
    [main, 397]: ggml_type_size(f32) 4
    [main, 436]: creating backend buffer
    
    [main, 448]: creating compute graph
    
    [ggml_qnn_can_handle_op, 2458]: op name:MUL_MAT, tensor type:f32
    [ggml_qnn_can_handle_op, 2460]: src0 type:f32
    [ggml_qnn_can_handle_op, 2463]: src1 type:f32
    [ggml_qnn_can_handle_op, 2467]: GGML_OP_MUL_MAT
    [ggml_qnn_can_handle_op, 2472]: src0        tensor_0: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    
    [ggml_qnn_can_handle_op, 2477]: src1        tensor_1: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    
    [ggml_qnn_can_handle_op, 2483]:             tensor_2: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    
    [ggml_qnn_mul_mat, 2785]: call ggml_qnn_mul_mat
    
    [ggml_qnn_mul_mat, 2789]:        tensor_0: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    
    [ggml_qnn_mul_mat, 2793]:        tensor_1: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    
    [ggml_qnn_mul_mat, 2797]:        tensor_2: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    
    [ggml_qnn_mul_mat, 2798]: 4, 4, 1, 1
    [ggml_qnn_mul_mat, 2799]: tensor0 name tensor_0
    [ggml_qnn_mul_mat, 2800]: tensor1 name tensor_1
    [ggml_qnn_mul_mat, 2801]: tensor2 name tensor_2
    [ggml_qnn_mul_mat, 2828]: graph name ggml_op_qnn_mul_mat_1tensor_0_tensor_1
    [ggml_qnn_logcallback, 2165]:     16.9ms [ DEBUG ] getNode OpPackage-Name : qti.aisw Node-Type : MatMul 
    [ggml_qnn_logcallback, 2165]:     17.0ms [VERBOSE] validate	Node-Type : MatMul	Node-Name : ggml_op_mul_mat 
    [ggml_qnn_logcallback, 2165]:     17.1ms [  INFO ] CpuGraph::finalize 
    [ggml_qnn_logcallback, 2165]:     17.2ms [ DEBUG ] Setting data pointer for tensor ID: 1 
    [ggml_qnn_logcallback, 2165]:     17.2ms [ DEBUG ] Setting data pointer for tensor ID: 2 
    [ggml_qnn_logcallback, 2165]:     17.2ms [ DEBUG ] Setting data pointer for tensor ID: 3 
    [ggml_qnn_logcallback, 2165]:     17.2ms [  INFO ] CpuGraph::execute 
    [ggml_qnn_mul_mat, 2927]: duration of ggml_qnn_mul_mat : 10 milliseconds
    
    [ggml_qnn_mul_mat, 2928]: call ggml_qnn_mul_mat done
    
    [get_tensor_rank, 210]: tensor->rank 4
    
    [get_tensor_rank, 211]: get_tensor_rank 2
    
    [get_tensor_data_size, 223]: get_tensor_data_size 64
    [get_tensor_data_size, 224]: ggml_nbytes(tensor) 64
    [main, 464]: dump:
    
    [tensor_dump, 191]: dump ggml tensor src0(tensor_0)
    [tensor_dump, 195]:            src0: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    [tensor_sum_elements, 151]:     0.05     0.68    -0.27    -0.28 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:    -0.47     0.77     0.41     0.14 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:    -0.69    -0.71    -0.81    -0.23 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:     0.37     0.36    -0.26     0.61 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 185]: 
    
    [tensor_dump, 198]: 
    
    [tensor_dump, 191]: dump ggml tensor src1(tensor_1)
    [tensor_dump, 195]:            src1: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    [tensor_sum_elements, 151]:    -0.48    -0.81    -0.61     0.53 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:    -0.04     0.87     0.64     0.17 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:    -0.22     0.94    -0.38    -0.78 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:    -0.97    -0.94    -0.35     0.94 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 185]: 
    
    [tensor_dump, 198]: 
    
    [tensor_dump, 191]: dump ggml tensor dst(tensor_2)
    [tensor_dump, 195]:             dst: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    [tensor_sum_elements, 151]:     0.97    -0.79    -0.47     0.98 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:    -0.33     0.24     0.56    -0.80 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:     0.16    -0.20     0.95    -0.08 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:     0.48     0.09    -0.20     0.80 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 185]: 
    
    [tensor_dump, 198]: 
    
    [ggml_backend_qnn_free, 3753]: enter ggml_backend_qnn_free
    [ggml_backend_qnn_free, 3755]: idx 0, name:qnn-cpu
    [ggml_backend_qnn_free, 3764]: graph type:ADD
    [qnn_finalize, 2318]: succeed to close rpcmem lib
    
    [ggml_backend_qnn_free, 3786]: leave ggml_backend_qnn_free
    
QNN's RPC feature (useful for the QNN NPU, aka HTP/DSP, backend) is used in this PR and works as expected. There are 2+ GB of ION memory that can be used to offload ggml tensors in the cgraph to the NPU on a Qualcomm Snapdragon 8 Gen 3 equipped Android phone.
This PR is a Minimum Viable PR style, functional PR for the ggml community. It will be very helpful for other community programmers/developers/AI experts to contribute code and ideas to the GGML QNN backend if this PR can be approved and merged to the master branch. Together we might reach the final target: utilize the Hexagon NPU maximally with the well-designed, compact ggml machine learning framework. This might be the exact GGML way in the GGML community.

Todo

Qualcomm's QNN backend for GGML still has some todo tasks before this backend can be used in a real commercial application:
[qnn_op_ut, 2037]: dump tensors:
[tensor_dump, 1404]: dump ggml tensor src0(tensor_0): type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
[tensor_dump, 1466]: (4x4 in 4x4)
    0.16     0.85    -0.80    -0.25 
   -0.28     0.66     0.98     0.67 
   -0.15     0.78    -0.45    -0.50 
    0.92     0.31    -0.72    -0.46 

[tensor_dump, 1404]: dump ggml tensor src1(tensor_1): type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
[tensor_dump, 1466]: (4x4 in 4x4)
    0.53     0.86    -0.91    -0.27 
    0.62     0.35    -0.27     0.43 
    0.73     0.42    -0.81    -0.24 
    0.49     0.81    -0.88     0.64 

[tensor_dump, 1404]: dump ggml tensor dst(tensor_2): type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
[tensor_dump, 1466]: (4x4 in 4x4)
    0.69     1.70    -1.70    -0.52 
    0.34     1.02     0.71     1.10 
    0.58     1.19    -1.26    -0.74 
    1.41     1.12    -1.60     0.18 

[ggml_backend_qnn_free, 3286]: enter ggml_backend_qnn_free
[ggml_backend_qnn_free, 3288]: idx 2, name:qnn-npu
[ggml_backend_qnn_free, 3300]: graph type:ADD
[qnn_finalize, 1258]: succeed to close rpcmem lib

[ggml_backend_qnn_free, 3313]: leave ggml_backend_qnn_free
[qnn_op_ut, 2067]: duration of ut GGML_OP_ADD using QNN backend QNN-NPU: 532 milliseconds
[test-qnn-npu.cpp, qnn_op_ut, 2068]: leave qnn_op_test
[tensor_dump, 1404]: dump ggml tensor src0(tensor_0): type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
[tensor_dump, 1466]: (4x4 in 4x4)
   -0.96     0.64     0.75     0.27 
   -0.10     0.59    -0.70     0.20 
    0.78     0.98    -0.46     0.33 
   -0.01     0.72     0.78     0.79 

[tensor_dump, 1404]: dump ggml tensor src1(tensor_1): type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
[tensor_dump, 1466]: (4x4 in 4x4)
   -0.87     0.89     0.76     0.94 
    0.22    -0.88    -0.63     0.80 
   -0.32     0.16     0.53     0.53 
   -0.78     0.13    -0.04    -0.34 

[test-qnn-npu.cpp, qnn_test_qnnnpu_2, 6330]: error = 0

[test-qnn-npu.cpp, qnn_test_qnnnpu_2, 6333]: output matrix:
[tensor_dump, 1404]: dump ggml tensor dst(tensor_2): type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
[tensor_dump, 1466]: (4x4 in 4x4)
   -1.83     1.53     1.52     1.20 
    0.12    -0.29    -1.33     1.00 
    0.45     1.14     0.07     0.86 
   -0.80     0.85     0.75     0.45 

[test-qnn-npu.cpp, qnn_finalize, 4886]: succeed to close rpcmem lib

[info, 161]: duration of qnn_nputest_2_ADD : 233 milliseconds
[test-qnn-npu.cpp, qnn_test_qnnnpu_2, 6357]: leave qnn_rpc_test
[qnn_op_ut, 2037]: dump tensors:
[tensor_dump, 1404]: dump ggml tensor src0(tensor_0): type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
[tensor_dump, 1466]: (4x4 in 4x4)
   59.00    59.00    59.00    59.00 
   59.00    59.00    59.00    59.00 
   59.00    59.00    59.00    59.00 
   59.00    59.00    59.00    59.00 

[tensor_dump, 1404]: dump ggml tensor src1(tensor_1): type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
[tensor_dump, 1466]: (4x4 in 4x4)
   94.00    94.00    94.00    94.00 
   94.00    94.00    94.00    94.00 
   94.00    94.00    94.00    94.00 
   94.00    94.00    94.00    94.00 

[tensor_dump, 1404]: dump ggml tensor dst(tensor_2): type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
[tensor_dump, 1466]: (4x4 in 4x4)
  153.00   153.00   153.00   153.00 
  153.00   153.00   153.00   153.00 
  153.00   153.00   153.00   153.00 
  153.00   153.00   153.00   153.00 

[qnn_op_ut, 2067]: duration of ut GGML_OP_ADD using QNN backend ggml: 3 milliseconds
[test-qnn-npu.cpp, qnn_op_ut, 2068]: leave qnn_op_test

How to verify the QNN backend or participate in development of the GGML QNN backend

I provide a dedicated Android command-line program and scripts in this PR for unit testing on an Android device.


 cd tests/ggml-qnn/
./ggml-qnn-ut-build-run.sh  -h              (show usage)
./ggml-qnn-ut-build-run.sh  help            (show usage)
./ggml-qnn-ut-build-run.sh  build           (build Android command line UT program)
./ggml-qnn-ut-build-run.sh  updateqnnlibs   (upload the latest QNN libs to Android phone)
./ggml-qnn-ut-build-run.sh  GGML_OP_ADD  0  (run UT program and verify QNN CPU backend on Android phone)
./ggml-qnn-ut-build-run.sh  GGML_OP_ADD  1  (run UT program and verify QNN GPU backend on Android phone)
./ggml-qnn-ut-build-run.sh  GGML_OP_ADD  2  (run UT program and verify QNN NPU backend on Android phone)
./ggml-qnn-ut-build-run.sh  GGML_OP_ADD  3  (compare performance between QNN backend and original ggml on Android phone)

A suitable/qualified reviewer should be familiar with the source code of ggml and with the Qualcomm QNN (Qualcomm Neural Network, aka Qualcomm AI Engine Direct) SDK or other parts of Qualcomm's AI software stack; real hard-core AI skills are a plus (adding more quantized data types and implementing more GGML OPs/kernels requires them) but are not essential for this PR. Some notes for a potential qualified reviewer:

  • Programming language detail is not the key point in this PR. Language detail is still important and I will handle it properly as much as possible (this PR follows the coding style of upstream llama.cpp as strictly as possible), but please do NOT spend too much time on language details such as code format, code alignment, variable names, function names, unused variables, unused functions, compiler warnings, or C++ grammar/syntax in so-called modern C++11/14/17/20.
  • A PR should be submitted to upstream llama.cpp if it fixes issues/bugs in upstream llama.cpp (this is why familiarity with the source code of ggml is an essential prerequisite for a suitable reviewer).
  • Don't bring too many complex new features into this PR; an MVP (Minimum Viable PR) style PR is more likely to be accepted by the maintainers of the ggml community.
  • Please focus on the real key point in this PR: how to utilize the Hexagon NPU maximally with the well-designed, compact ggml machine learning framework.

Any GGML community programmer/developer/AI expert interested in the topic of the GGML QNN backend can use/extend the dedicated Android command-line program to verify the GGML QNN backend. Reviews are greatly welcomed and appreciated.

@zhouwg zhouwg force-pushed the qualcomm_qnn_backend_for_ggml branch 3 times, most recently from 59e42f8 to b0c3013 Compare April 24, 2024 10:26
Contributor

github-actions bot commented Apr 24, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 540 iterations 🚀

Expand details for performance related PR only
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8677.33ms p(95)=20035.75ms fails=, finish reason: stop=492 truncated=48
  • Prompt processing (pp): avg=95.63tk/s p(95)=443.17tk/s
  • Token generation (tg): avg=47.46tk/s p(95)=47.64tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=qualcomm_qnn_backend_for_ggml commit=a98a4e999000105b81b472c7b36ff80131d68ef1

Benchmark charts (llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 540 iterations): llamacpp:prompt_tokens_seconds, llamacpp:predicted_tokens_seconds, llamacpp:kv_cache_usage_ratio, llamacpp:requests_processing.

@Dampfinchen

Nice. With competent LLMs getting smaller and more efficient as well as Snapdragon laptops coming soon, it's important to make full use of the AI acceleration these SoCs provide through the Hexagon NPU Cluster.

This will make llama.cpp a robust backend for the future and will lead to power efficient LLMs on the go. Personally, I really can't wait!

@zhouwg
Contributor Author

zhouwg commented Apr 24, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 198 iterations 🚀

Expand details for performance related PR only

Nice. With competent LLMs getting smaller and more efficient as well as Snapdragon laptops coming soon, it's important to make full use of the AI acceleration these SoCs provide through the Hexagon NPU Cluster.

This will make llama.cpp a robust backend for the future and will lead to power efficient LLMs on the go. Personally, I really can't wait!

Thanks for your comment. This PR is a very initial implementation and could be a good starting point for Qualcomm's QNN backend for GGML. It would be better if some domain technical experts from Qualcomm got involved in this effort after it is accepted by the community. I personally think this PR is also an example of the GGML way: try crazy ideas, build wild demos, and push the edge of what’s possible.

Another thing: a small, standalone Android example (or reuse of the existing Android example in llama.cpp) is needed to help community developers participate in developing and verifying the QNN backend.

@zhouwg zhouwg force-pushed the qualcomm_qnn_backend_for_ggml branch 5 times, most recently from 5abb2e4 to 7a420e1 Compare April 25, 2024 08:11
@zhouwg zhouwg changed the title ggml: add Qualcomm QNN(Qualcomm Neural Network,aka Qualcomm AI Engine Direct) backend ggml-qnn: add Qualcomm QNN(Qualcomm Neural Network,aka Qualcomm AI Engine Direct) backend Apr 25, 2024
@zhouwg zhouwg force-pushed the qualcomm_qnn_backend_for_ggml branch 3 times, most recently from 95a980a to b0c3013 Compare April 25, 2024 09:03
@ggerganov
Owner

Another thing: a small, standalone Android example (or reuse of the existing Android example in llama.cpp) is needed to help community developers participate in developing and verifying the QNN backend.

Yes, it would be useful to have an example or instructions how to run this. In the meantime, simply setting up the test-backend-ops to run with ggml-qnn would be a good start for people who want to implement the missing operators

@zhouwg
Contributor Author

zhouwg commented Apr 25, 2024

Another thing: a small, standalone Android example (or reuse of the existing Android example in llama.cpp) is needed to help community developers participate in developing and verifying the QNN backend.

Yes, it would be useful to have an example or instructions how to run this. In the meantime, simply setting up the test-backend-ops to run with ggml-qnn would be a good start for people who want to implement the missing operators

Thanks for your guidance. I'll study how to use test-backend-ops.cpp to validate the QNN backend.

@slaren
Collaborator

slaren commented Apr 25, 2024

You would need to modify ggml_backend_registry_init to register the backend, then it should be automatically used by test-backend-ops.

GGML_CALL static void ggml_backend_registry_init(void) {
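For illustration, a minimal sketch of what such a registration might look like in ggml-backend.c of that era, assuming this PR exposes ggml_backend_qnn_init(device, qnn_lib_path) (as the UT logs above suggest) and a ggml_backend_qnn_buffer_type(device); the reg wrapper, the library path, and the device names below are hypothetical:

    #ifdef GGML_USE_QNN
    // hypothetical wrapper matching ggml_backend_init_fn(const char * params, void * user_data)
    GGML_CALL static ggml_backend_t ggml_backend_reg_qnn_init(const char * params, void * user_data) {
        GGML_UNUSED(params);
        int device = (int) (intptr_t) user_data;
        // assumption: the PR's init takes a device index and the directory holding the QNN libs
        return ggml_backend_qnn_init(device, "/data/local/tmp/");
    }
    #endif

    GGML_CALL static void ggml_backend_registry_init(void) {
        ggml_backend_register("CPU", ggml_backend_reg_cpu_init, ggml_backend_cpu_buffer_type(), NULL);
    #ifdef GGML_USE_QNN
        // register the three QNN devices (0 = CPU, 1 = GPU, 2 = NPU); names are illustrative
        ggml_backend_register("QNN-CPU", ggml_backend_reg_qnn_init, ggml_backend_qnn_buffer_type(0), (void *) (intptr_t) 0);
        ggml_backend_register("QNN-GPU", ggml_backend_reg_qnn_init, ggml_backend_qnn_buffer_type(1), (void *) (intptr_t) 1);
        ggml_backend_register("QNN-NPU", ggml_backend_reg_qnn_init, ggml_backend_qnn_buffer_type(2), (void *) (intptr_t) 2);
    #endif
    }

Once registered this way, test-backend-ops would enumerate the QNN devices like any other backend and run its cross-validation against the CPU backend automatically.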

@zhouwg
Contributor Author

zhouwg commented Apr 25, 2024

You would need to modify ggml_backend_registry_init to register the backend, then it should be automatically used by test-backend-ops.

GGML_CALL static void ggml_backend_registry_init(void) {

Thanks for your help, it's really helpful. I'm working on adapting test-backend-ops.cpp to the QNN backend on Android.

@zhouwg
Contributor Author

zhouwg commented Apr 25, 2024

@ggerganov, @slaren, I'm sorry to interrupt you. Adapting test-backend-ops.cpp to the QNN backend is already done and it works as expected on a Xiaomi 14 (Qualcomm SM8650-AB Snapdragon 8 Gen 3).

Could you take a moment to look at it? Thanks.

BTW, the design and implementation of test-backend-ops.cpp is really excellent. I never noticed this file/feature before.

BTW, should the README-qnn.md be removed?

@zhouwg zhouwg force-pushed the qualcomm_qnn_backend_for_ggml branch 2 times, most recently from eff9669 to 180ab5f Compare April 25, 2024 15:47
tests/test-backend-ops.cpp (outdated review thread, resolved)
@zhouwg zhouwg force-pushed the qualcomm_qnn_backend_for_ggml branch 4 times, most recently from 992cf05 to 67beeb6 Compare April 26, 2024 02:12
Contributor Author

@zhouwg zhouwg left a comment


This review comment is very useful and I have modified the code accordingly.
Thank you very much.

tests/test-backend-ops.cpp (outdated review thread, resolved)
@zhouwg zhouwg force-pushed the qualcomm_qnn_backend_for_ggml branch from 8240376 to f20e281 Compare April 26, 2024 03:19
qnn_instance * instance = nullptr;
std::string graph_name = "ggml_op_qnn_add";
Qnn_GraphHandle_t graph_handle = nullptr;
Qnn_Tensor_t * tensor_0 = nullptr;

Created a PR on your fork to simplify the binding from Qnn_Tensor_t to ggml_tensor; please have a look if you have time: zhouwg#2

* mul_mat_f16_f32: src0 is F16 and src1 is F32.
* mul_mat_q_f32: src0 is quantized (Q4_0, Q4_1, ...), and src1 is F32.
*/
static void ggml_qnn_mul_mat(ggml_backend_qnn_context * ctx,

@chraac chraac Jun 17, 2024


Also found what may be a bug on this branch when trying to do mulmat with the GPU backend on my 8 Gen 2 phone. Command line:
ggml-qnn-ut -t GGML_OP_MUL_MAT -b 1

image
As you can see, it generates a wrong dst matrix.

When running with the CPU backend, the result is correct:
image


@chraac chraac Jun 17, 2024


Looks like graphExecute failed with error 6004; maybe we can use that to find the root cause here.


@chraac chraac Jun 17, 2024


To reproduce, you could use my patch that initializes the test tensors with constant values:

llama.cpp-5e18cdc-init the test array with const values.patch

It just changes the tensor init in the unit test so that we can reproduce the issue more easily.

@myan-o

myan-o commented Jun 18, 2024

Problem 1

I tried to build in Termux.
Can't the path /data/local/tmp be changed?
The Skel.so path cannot be changed for the NPU, so loading fails.

Problem 2

The QNN SDK cannot be obtained without an account.
In other words, it cannot be built using Termux alone.

Comment on lines +3215 to +3219
GGML_CALL static bool ggml_backend_qnn_offload_op(ggml_backend_t backend,const ggml_tensor * tensor) {
ggml_backend_qnn_context * ctx = (ggml_backend_qnn_context *) backend->context;

return ggml_qnn_compute_forward(ctx, nullptr, (ggml_tensor *) tensor);
}
Collaborator


This function only needs to return true or false, but it must not execute the operation. The purpose of this function is to determine if an operation should be executed in this backend, even if it would require copying weights to the backend memory. As it is, this will either prevent the backend from working entirely, or it will cause many operations to be run twice.
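For reference, a minimal sketch of what is described here, reusing the ggml_qnn_can_handle_op check that appears in the UT logs above instead of executing the op; the exact signature of that helper is an assumption:

    // sketch: only report whether the op should be offloaded to this backend; never execute it here
    GGML_CALL static bool ggml_backend_qnn_offload_op(ggml_backend_t backend, const ggml_tensor * tensor) {
        ggml_backend_qnn_context * ctx = (ggml_backend_qnn_context *) backend->context;
        // assumption: ggml_qnn_can_handle_op(ctx, tensor) checks op/type support and has no side effects
        return ggml_qnn_can_handle_op(ctx, tensor);
    }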


@chraac chraac Jun 24, 2024


Yeah, actually I've tried to make some improvements regarding those comments on my branch, and also created a PR in this fork asking for review several days ago, but it looks like there has been no response from the original author so far.
Will spend some time on my fork in the next few weeks, and add more operators then.

@ao-zz

ao-zz commented Jun 28, 2024

Sad to see such a great PR being blocked by endless complaints.

@slaren @chraac Can't we just focus on the correctness of this groundbreaking PR? If this PR produces correct results, then there is NO reason to block it. All other problems should be discussed in new PRs or issues.

I hold that this is the very first step we need to take. Then users around the world could have a chance to engage with it and improve the efficiency and other things you guys are worried about.

@chraac

chraac commented Jun 28, 2024

@slaren @chraac Can't we just focus on the correctness of this groundbreaking PR? If this PR produces correct results, then there is NO reason to block it. All other problems should be discussed in new PRs or issues.

Hi @ao-zz, thank you for your comment. Regarding the PR not getting approval, as I said before:

MVP (Minimum Viable Pull Request) is good; however, I think we first need to establish a consensus with the community on the criteria that define a PR as viable.

And as you can see in my previous post, there is some work we should do before merging, from my point of view:

  1. Add more tensor ops in order to support loading a model (currently only add and matmul are implemented).
  2. Some bugs in the GPU backend should be resolved (see my comment above).

I have also created a small refactoring PR on this fork; you can have a look.

@hans00

hans00 commented Jul 2, 2024

The QNN SDK cannot be obtained without an account.

It is currently public for everyone:
https://softwarecenter.qualcomm.com/api/download/software/qualcomm_neural_processing_sdk/v2.22.6.240515.zip?query=aiesdk (copied from the official page)

@Ther-nullptr

This work has already covered the Snapdragon CPU/GPU/NPU:
https://arxiv.org/abs/2406.06282

@ao-zz

ao-zz commented Jul 8, 2024

This work has already covered the Snapdragon CPU/GPU/NPU: https://arxiv.org/abs/2406.06282

The code is not yet released, as you asked last week.

@chraac

chraac commented Jul 15, 2024

Hi, it looks like this PR has been inactive for a while. I've made some changes on my local fork based on this PR, including:

  • Split ggml-qnn.cpp into several separate files for better modularity.
  • Supported more operations, including GGML_OP_SUB, GGML_OP_MUL, GGML_OP_DIV, GGML_OP_SQRT and GGML_OP_LOG.
  • Running the pure CPU backend in unit tests to cross-validate the results returned by the QNN backend.
  • Fixed the GPU backend error mentioned above (6869#discussion_r1642152991)
    image

For anyone interested in this PR, please take a look at my fork. Comments and feedback are appreciated!
Also, I will spend more time continuing to iterate on my fork, aiming to support more operators and to run on more platforms beyond Android.

@slaren
Collaborator

slaren commented Jul 15, 2024

  • Running the pure CPU backend in unit tests to cross-validate the results returned by the QNN backend.

FYI this is exactly what test-backend-ops does. To make it work with the QNN backend, you would have to modify ggml_backend_registry_init in ggml-backend.c to register the backend, and the supports_op function of the backend must be accurate. I am not really sure why this code is being duplicated, in fact, the QNN test has code copied from test-backend-ops.
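For illustration, a rough sketch of what an accurate supports_op might look like for the current state of this PR (only ADD, MUL and MUL_MAT on f32 tensors appear in the UT logs above); the exact checks are assumptions:

    // sketch: report support only for the ops/types this backend currently implements
    GGML_CALL static bool ggml_backend_qnn_supports_op(ggml_backend_t backend, const ggml_tensor * op) {
        GGML_UNUSED(backend);
        switch (op->op) {
            case GGML_OP_ADD:
            case GGML_OP_MUL:
            case GGML_OP_MUL_MAT:
                // assumption: only f32 tensors are handled by the current implementation
                return op->type == GGML_TYPE_F32 &&
                       op->src[0] != NULL && op->src[0]->type == GGML_TYPE_F32 &&
                       op->src[1] != NULL && op->src[1]->type == GGML_TYPE_F32;
            default:
                return false;
        }
    }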

@myan-o

myan-o commented Jul 16, 2024

Please support building in Termux.

@chraac

chraac commented Jul 16, 2024

  • Running the pure CPU backend in unit tests to cross-validate the results returned by the QNN backend.

FYI this is exactly what test-backend-ops does. To make it work with the QNN backend, you would have to modify ggml_backend_registry_init in ggml-backend.c to register the backend, and the supports_op function of the backend must be accurate. I am not really sure why this code is being duplicated, in fact, the QNN test has code copied from test-backend-ops.

After a brief review of test-backend-ops, I think we should remove the QNN unit tests and reuse test-backend-ops instead. I'll work on implementing this change in the next iteration.

@chraac

chraac commented Jul 16, 2024

Problem 1

I tried to build in Termux. Can't the path /data/local/tmp be changed? The Skel.so path cannot be changed for the NPU, so loading fails.

Problem 2

The QNN SDK cannot be obtained without an account. In other words, it cannot be built using Termux alone.

Hi @myan-o,

For problem 1, I propose implementing a CMake parameter that allows users to customize the default path for the dependent libraries.

For problem 2, maybe you can refer to this comment:

The QNN SDK cannot be obtained without an account.

It is currently public for everyone: https://softwarecenter.qualcomm.com/api/download/software/qualcomm_neural_processing_sdk/v2.22.6.240515.zip?query=aiesdk (copied from the official page)

@yeonseok-zeticai

Great work @zhouwg! I worked for Qualcomm until early this year and I am quite used to using Qualcomm AI SDK, and I want to help you to get these things done. I think I can help you implement the to-do items that you have for the QNN backend. Let me catch up on your work in a few weeks about this PR and update more later.

@chraac

chraac commented Jul 16, 2024

Great work @zhouwg! I worked for Qualcomm until early this year and I am quite used to using Qualcomm AI SDK, and I want to help you to get these things done. I think I can help you implement the to-do items that you have for the QNN backend. Let me catch up on your work in a few weeks about this PR and update more later.

Hi @yeonseok-zeticai, this branch has been inactive for some time. In recent weeks, I've undertaken some refactoring on my own branch. If you're interested, please have a look. My branch is also based on this PR:
https://github.com/chraac/llama.cpp/tree/dev-refactoring

@yeonseok-zeticai

@chraac, is there any special reason for it being inactive? I can see the work you've made over the past 2 weeks on your branch. I'll catch up on your work as well.

@chraac

chraac commented Jul 16, 2024

@chraac, is there any special reason for being inactive?. I can see the works you've made for 2 weeks on your branch. I'll catch up on your work as well.

Sorry for misleading, when I mentioned inactive branch, I was referring to the branch of this PR: zhouwg:qualcomm_qnn_backend_for_ggml. I attempted to merge my changes into that branch, but received no response. As a result, I've kept my modifications in my own branch.

My branch: https://github.com/chraac/llama.cpp/tree/dev-refactoring

@zhouwg
Contributor Author

zhouwg commented Jul 18, 2024

Great work @zhouwg! I worked for Qualcomm until early this year and I am quite used to using Qualcomm AI SDK, and I want to help you to get these things done. I think I can help you implement the to-do items that you have for the QNN backend. Let me catch up on your work in a few weeks about this PR and update more later.

Thanks so much, and it's great that a real QNN expert is coming.

  1. We can see that ggml-rpc came from rgerganov, ggml-sycl came from a regular employee of Intel, and ggml-cann came from a regular employee of Huawei.

  2. As I said at the very beginning of this PoC (PoC: Add Qualcomm mobile SoC native backend for GGML, zhouwg/kantv#121), this work (adding a Qualcomm QNN backend for ggml) should/could/might be initiated and done by a permanent/regular employee of Qualcomm rather than by a standalone/independent programmer.

  3. I'm a streaming media expert and good at Linux/Android system software development, but an AI beginner who knows little about real hard-core AI tech.

This PR could not be accepted by the maintainer/author of the ggml backend subsystem although I begged for PR approval again and again before 06/15/2024. I understand this decision for the above reasons, but I'm not sure whether there is a double standard in PR approval consideration, although I have sincerely thanked the maintainer/author of the ggml backend subsystem for the help and have a fully positive opinion of this great/compact/awesome/excellent/high-performance device-side AI inference framework:

Screenshot from 2024-07-18 16-11-43
Screenshot from 2024-07-18 15-38-45

Screenshot from 2024-07-18 15-54-08

#7844:
Screenshot from 2024-07-18 16-18-40

One more thing: I felt a little disappointed on 06/15/2024 that the maintainers of this great open-source AI project couldn't understand what the GFW brings to programmers/developers in mainland China and had some misunderstandings about what I think about it, although I really/100% love my country:
Screenshot from 2024-07-18 16-26-52
Screenshot from 2024-07-18 16-26-38

@zhouwg
Contributor Author

zhouwg commented Jul 18, 2024

@chraac, is there any special reason for being inactive?. I can see the works you've made for 2 weeks on your branch. I'll catch up on your work as well.

Sorry for misleading, when I mentioned inactive branch, I was referring to the branch of this PR: zhouwg:qualcomm_qnn_backend_for_ggml. I attempted to merge my changes into that branch, but received no response. As a result, I've kept my modifications in my own branch.

My branch: https://github.com/chraac/llama.cpp/tree/dev-refactoring

Thanks for your help and continued efforts on this PR (BTW, I have read the source code in your personal fork of llama.cpp, although I think putting all the related things in one single source file might be a better idea).

Your PR on my personal fork of llama.cpp does not make sense to me:
Screenshot from 2024-07-18 16-49-24

That's the reason why your PR on my personal fork of llama.cpp was not merged into it. Thanks for your understanding.

@zhouwg
Contributor Author

zhouwg commented Jul 18, 2024

  • Running the pure CPU backend in unit tests to cross-validate the results returned by the QNN backend.

FYI this is exactly what test-backend-ops does. To make it work with the QNN backend, you would have to modify ggml_backend_registry_init in ggml-backend.c to register the backend, and the supports_op function of the backend must be accurate. I am not really sure why this code is being duplicated, in fact, the QNN test has code copied from test-backend-ops.

Your test-backend-ops.cpp is good and highly designed, but not robust or easy to understand enough, and there is an unknown issue with ggml-qnn.cpp. That's the reason why I provide a standalone, easy-to-understand UT (some code borrowed from your test-backend-ops.cpp) for ggml-qnn.cpp. Thanks for your understanding.

@chraac

chraac commented Jul 18, 2024

@chraac, is there any special reason for being inactive?. I can see the works you've made for 2 weeks on your branch. I'll catch up on your work as well.

Sorry for misleading, when I mentioned inactive branch, I was referring to the branch of this PR: zhouwg:qualcomm_qnn_backend_for_ggml. I attempted to merge my changes into that branch, but received no response. As a result, I've kept my modifications in my own branch.
My branch: https://github.com/chraac/llama.cpp/tree/dev-refactoring

Thanks for your help and continued efforts of this PR(I had read source codes in your personal/forked llama.cpp).

Your PR in my personal/forked llama.cpp is not make sense: Screenshot from 2024-07-18 16-49-24

That's the reason why your PR in my personal/forked llama.cpp was not merged to my personal/forked llama.cpp.thanks for your understanding.

As I said before,

  1. 4000+ lines of code in a single file is unreviewable and unmaintainable, so I've split them into separate files.
  2. I see a lot of duplicated source code in your original branch, and the binding between the QNN tensor and the ggml tensor is vague, which may lead to more errors in future development; that's why I split them into files and also added some objects to manage them better (my PR).
  3. Running your test shows no error in the result, as I mentioned above, so I thought it was able to replace your old implementation.
  4. What do you mean by 'doesn't make sense'? If there is some misunderstanding, you can leave a comment on my PR and then we can have further discussion.

Also, you can have a look at my new refactoring branch; next I will utilize the existing test-backend-ops and remove your unit test.

@zhouwg
Contributor Author

zhouwg commented Jul 18, 2024

@chraac, is there any special reason for being inactive?. I can see the works you've made for 2 weeks on your branch. I'll catch up on your work as well.

Sorry for misleading, when I mentioned inactive branch, I was referring to the branch of this PR: zhouwg:qualcomm_qnn_backend_for_ggml. I attempted to merge my changes into that branch, but received no response. As a result, I've kept my modifications in my own branch.
My branch: https://github.com/chraac/llama.cpp/tree/dev-refactoring

Thanks for your help and continued efforts of this PR(I had read source codes in your personal/forked llama.cpp).
Your PR in my personal/forked llama.cpp is not make sense: Screenshot from 2024-07-18 16-49-24
That's the reason why your PR in my personal/forked llama.cpp was not merged to my personal/forked llama.cpp.thanks for your understanding.

As I said before,

  1. 4000+ lines of code in a single file is unreviewable and unmaintainable, so I've split them into separate files.

I do not want to argue this opinion with you again; please see my opinion in this PR, although I feel a little surprised by, and thankful for, your continued efforts on this PR (which I personally think are exactly the same as this PR but with more advanced C++ language grammar).

  1. I see a lot of duplicated source code in your original branch, and the binding between the QNN tensor and the ggml tensor is vague, which may lead to more errors in future development; that's why I split them into files and also added some objects to manage them better (my PR).
  2. Running your test shows no error in the result, as I mentioned above, so I thought it was able to replace your old implementation.
  3. What do you mean by 'doesn't make sense'? If there is some misunderstanding, you can leave a comment on my PR and then we can have further discussion.

I'm sorry for that, because I feel great disappointment and have paid no positive attention to this PR after 06/15/2024.

Also, you can have a look at my new refactoring branch; next I will utilize the existing test-backend-ops and remove your unit test.

@chraac

chraac commented Jul 18, 2024

I'm sorry for that because I feel great disappointment and has no positive attention for this PR after 06/15/2024.

No worries, your effort on adding the QNN backend won't be wasted. You've done excellent work. I'll continue iterating on my branch, and as this backend garners more attention, we're hopeful it can be integrated into the upstream project in the future.

@ggerganov
Owner

 I'm not sure whether there is double-standard in PR approval consideration

@zhouwg Such comments are completely inappropriate. As I already mentioned in #6210 (comment), this will not be tolerated. Therefore I’ve decided to block you from the projects.

@ggerganov ggerganov closed this Jul 19, 2024
Repository owner locked and limited conversation to collaborators Jul 19, 2024
Labels
  • devops (improvements to build systems and github actions)
  • enhancement (new feature or request)
  • ggml (changes relating to the ggml tensor library for machine learning)
  • Qualcomm QNN (Qualcomm's QNN (AI Direct Engine) SDK)
  • Review Complexity : High (generally requires in-depth knowledge of LLMs or GPUs)
  • testing (everything test related)