Misc. bug: Buffer offset is not aligned on macOS / Intel / Vulkan #10984

Open

soerenkampschroer opened this issue Dec 26, 2024 · 12 comments

soerenkampschroer commented Dec 26, 2024

Name and Version

❯ ./llama-cli --version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 6800 (MoltenVK) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
version: 4391 (9ba399d)
built with Apple clang version 16.0.0 (clang-1600.0.26.6) for x86_64-apple-darwin24.1.0

Operating systems

macOS

Which llama.cpp modules do you know to be affected?

No response

Problem description & steps to reproduce

When using Vulkan/MoltenVK GPU acceleration on an Intel Mac, the model output becomes garbled after roughly 400 words. With export VK_INSTANCE_LAYERS=VK_LAYER_KHRONOS_validation set, the following error repeats over and over until generation finishes:

VUID-VkWriteDescriptorSet-descriptorType-00328(ERROR / SPEC): msgNum: -368569266 - Validation Error: [ VUID-VkWriteDescriptorSet-descriptorType-00328 ] | MessageID = 0xea08144e | vkUpdateDescriptorSets(): pDescriptorWrites[0].pBufferInfo[0].offset (31457794) must be a multiple of device limit minStorageBufferOffsetAlignment 16 when descriptor type is VK_DESCRIPTOR_TYPE_STORAGE_BUFFER. The Vulkan spec states: If descriptorType is VK_DESCRIPTOR_TYPE_STORAGE_BUFFER or VK_DESCRIPTOR_TYPE_STORAGE_BUFFER_DYNAMIC, the offset member of each element of pBufferInfo must be a multiple of VkPhysicalDeviceLimits::minStorageBufferOffsetAlignment (https://vulkan.lunarg.com/doc/view/1.3.296.0/mac/1.3-extensions/vkspec.html#VUID-VkWriteDescriptorSet-descriptorType-00328)
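
For reference, the device limit the error checks against can be read from VkPhysicalDeviceLimits. A minimal C++ sketch (assuming a valid VkPhysicalDevice handle, e.g. the MoltenVK device enumerated above):

    #include <vulkan/vulkan.h>
    #include <cstdio>

    // Print the storage buffer offset alignment the validation layer checks against.
    void print_storage_alignment(VkPhysicalDevice device) {
        VkPhysicalDeviceProperties props;
        vkGetPhysicalDeviceProperties(device, &props);
        // MoltenVK reports 16 here; the offsets in the errors (e.g. 31457794) are not multiples of it.
        printf("minStorageBufferOffsetAlignment = %llu\n",
               (unsigned long long)props.limits.minStorageBufferOffsetAlignment);
    }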

With export VK_LAYER_DISABLES=shader_validation,thread_safety instead, the output still becomes garbled after around 400 words, but it returns to normal after 15-20 words and the generation finishes normally.
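
Presumably the fix would be to make sure descriptor offsets handed to vkUpdateDescriptorSets respect that limit. A generic round-up helper for a power-of-two alignment (a hypothetical sketch, not necessarily how ggml-vulkan handles it):

    #include <vulkan/vulkan.h>

    // Round `offset` up to the next multiple of `alignment`.
    // Assumes `alignment` is a power of two, as minStorageBufferOffsetAlignment is required to be.
    static VkDeviceSize align_up(VkDeviceSize offset, VkDeviceSize alignment) {
        return (offset + alignment - 1) & ~(alignment - 1);
    }

    // Example: align_up(31457794, 16) == 31457808, which would satisfy
    // VUID-VkWriteDescriptorSet-descriptorType-00328.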

I compiled llama.cpp like this:

  • Install the Vulkan SDK from the LunarG website
  • cmake -B build -DLLAMA_CURL=1 -DGGML_METAL=OFF -DGGML_VULKAN=1 -DVulkan_INCLUDE_DIR=~/VulkanSDK/1.3.296.0/macOS/include -DVulkan_LIBRARY=/usr/local/lib/libvulkan.1.3.296.dylib
  • cmake --build build --config Release

First Bad Commit

No response

Relevant log output

Full log (export VK_INSTANCE_LAYERS=VK_LAYER_KHRONOS_validation):
  ❯ ./llama-cli -m ~/.cache/sanctum/models/SanctumAI/gemma-2-9b-it-GGUF/gemma-2-9b-it.Q5_K_M.gguf -p "Write a story about a bear with 600 words." --n-gpu-layers 200 --ctx-size 512 --batch_size 8
  ggml_vulkan: Found 1 Vulkan devices:
  ggml_vulkan: 0 = AMD Radeon RX 6800 (MoltenVK) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
  build: 4391 (9ba399df) with Apple clang version 16.0.0 (clang-1600.0.26.6) for x86_64-apple-darwin24.1.0
  main: llama backend init
  main: load the model and apply lora adapter, if any
  llama_load_model_from_file: using device Vulkan0 (AMD Radeon RX 6800) - 16368 MiB free
  llama_model_loader: loaded meta data with 30 key-value pairs and 464 tensors from /Users/soeren/.cache/sanctum/models/SanctumAI/gemma-2-9b-it-GGUF/gemma-2-9b-it.Q5_K_M.gguf (version GGUF V3 (latest))
  llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  llama_model_loader: - kv   0:                       general.architecture str              = gemma2
  llama_model_loader: - kv   1:                               general.type str              = model
  llama_model_loader: - kv   2:                               general.name str              = gemma-2-9b-it
  llama_model_loader: - kv   3:                      gemma2.context_length u32              = 8192
  llama_model_loader: - kv   4:                    gemma2.embedding_length u32              = 3584
  llama_model_loader: - kv   5:                         gemma2.block_count u32              = 42
  llama_model_loader: - kv   6:                 gemma2.feed_forward_length u32              = 14336
  llama_model_loader: - kv   7:                gemma2.attention.head_count u32              = 16
  llama_model_loader: - kv   8:             gemma2.attention.head_count_kv u32              = 8
  llama_model_loader: - kv   9:    gemma2.attention.layer_norm_rms_epsilon f32              = 0.000001
  llama_model_loader: - kv  10:                gemma2.attention.key_length u32              = 256
  llama_model_loader: - kv  11:              gemma2.attention.value_length u32              = 256
  llama_model_loader: - kv  12:                          general.file_type u32              = 17
  llama_model_loader: - kv  13:              gemma2.attn_logit_softcapping f32              = 50.000000
  llama_model_loader: - kv  14:             gemma2.final_logit_softcapping f32              = 30.000000
  llama_model_loader: - kv  15:            gemma2.attention.sliding_window u32              = 4096
  llama_model_loader: - kv  16:                       tokenizer.ggml.model str              = llama
  llama_model_loader: - kv  17:                         tokenizer.ggml.pre str              = default
  llama_model_loader: - kv  18:                      tokenizer.ggml.tokens arr[str,256000]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
  llama_model_loader: - kv  19:                      tokenizer.ggml.scores arr[f32,256000]  = [-1000.000000, -1000.000000, -1000.00...
  llama_model_loader: - kv  20:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
  llama_model_loader: - kv  21:                tokenizer.ggml.bos_token_id u32              = 2
  llama_model_loader: - kv  22:                tokenizer.ggml.eos_token_id u32              = 1
  llama_model_loader: - kv  23:            tokenizer.ggml.unknown_token_id u32              = 3
  llama_model_loader: - kv  24:            tokenizer.ggml.padding_token_id u32              = 0
  llama_model_loader: - kv  25:               tokenizer.ggml.add_bos_token bool             = true
  llama_model_loader: - kv  26:               tokenizer.ggml.add_eos_token bool             = false
  llama_model_loader: - kv  27:                    tokenizer.chat_template str              = {{ bos_token }}{% if messages[0]['rol...
  llama_model_loader: - kv  28:            tokenizer.ggml.add_space_prefix bool             = false
  llama_model_loader: - kv  29:               general.quantization_version u32              = 2
  llama_model_loader: - type  f32:  169 tensors
  llama_model_loader: - type q5_K:  252 tensors
  llama_model_loader: - type q6_K:   43 tensors
  llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
  llm_load_vocab: special tokens cache size = 217
  llm_load_vocab: token to piece cache size = 1.6014 MB
  llm_load_print_meta: format           = GGUF V3 (latest)
  llm_load_print_meta: arch             = gemma2
  llm_load_print_meta: vocab type       = SPM
  llm_load_print_meta: n_vocab          = 256000
  llm_load_print_meta: n_merges         = 0
  llm_load_print_meta: vocab_only       = 0
  llm_load_print_meta: n_ctx_train      = 8192
  llm_load_print_meta: n_embd           = 3584
  llm_load_print_meta: n_layer          = 42
  llm_load_print_meta: n_head           = 16
  llm_load_print_meta: n_head_kv        = 8
  llm_load_print_meta: n_rot            = 256
  llm_load_print_meta: n_swa            = 4096
  llm_load_print_meta: n_embd_head_k    = 256
  llm_load_print_meta: n_embd_head_v    = 256
  llm_load_print_meta: n_gqa            = 2
  llm_load_print_meta: n_embd_k_gqa     = 2048
  llm_load_print_meta: n_embd_v_gqa     = 2048
  llm_load_print_meta: f_norm_eps       = 0.0e+00
  llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
  llm_load_print_meta: f_clamp_kqv      = 0.0e+00
  llm_load_print_meta: f_max_alibi_bias = 0.0e+00
  llm_load_print_meta: f_logit_scale    = 0.0e+00
  llm_load_print_meta: n_ff             = 14336
  llm_load_print_meta: n_expert         = 0
  llm_load_print_meta: n_expert_used    = 0
  llm_load_print_meta: causal attn      = 1
  llm_load_print_meta: pooling type     = 0
  llm_load_print_meta: rope type        = 2
  llm_load_print_meta: rope scaling     = linear
  llm_load_print_meta: freq_base_train  = 10000.0
  llm_load_print_meta: freq_scale_train = 1
  llm_load_print_meta: n_ctx_orig_yarn  = 8192
  llm_load_print_meta: rope_finetuned   = unknown
  llm_load_print_meta: ssm_d_conv       = 0
  llm_load_print_meta: ssm_d_inner      = 0
  llm_load_print_meta: ssm_d_state      = 0
  llm_load_print_meta: ssm_dt_rank      = 0
  llm_load_print_meta: ssm_dt_b_c_rms   = 0
  llm_load_print_meta: model type       = 9B
  llm_load_print_meta: model ftype      = Q5_K - Medium
  llm_load_print_meta: model params     = 9.24 B
  llm_load_print_meta: model size       = 6.19 GiB (5.75 BPW)
  llm_load_print_meta: general.name     = gemma-2-9b-it
  llm_load_print_meta: BOS token        = 2 '<bos>'
  llm_load_print_meta: EOS token        = 1 '<eos>'
  llm_load_print_meta: EOT token        = 107 '<end_of_turn>'
  llm_load_print_meta: UNK token        = 3 '<unk>'
  llm_load_print_meta: PAD token        = 0 '<pad>'
  llm_load_print_meta: LF token         = 227 '<0x0A>'
  llm_load_print_meta: EOG token        = 1 '<eos>'
  llm_load_print_meta: EOG token        = 107 '<end_of_turn>'
  llm_load_print_meta: max token length = 48
  VUID-VkDeviceCreateInfo-pNext-pNext(ERROR / SPEC): msgNum: -1876993556 - Validation Error: [ VUID-VkDeviceCreateInfo-pNext-pNext ] Object 0: handle = 0x7f7d8500a400, type = VK_OBJECT_TYPE_INSTANCE; | MessageID = 0x901f59ec | vkCreateDevice(): pCreateInfo->pNext<VkPhysicalDeviceSubgroupSizeControlFeatures> includes a pointer to a VkPhysicalDeviceSubgroupSizeControlFeatures, but when creating VkDevice, the parent extension (VK_EXT_subgroup_size_control) was not included in ppEnabledExtensionNames.
  The Vulkan spec states: Each pNext member of any structure (including this one) in the pNext chain must be either NULL or a pointer to a valid struct for extending VkDeviceCreateInfo (https://vulkan.lunarg.com/doc/view/1.3.296.0/mac/1.3-extensions/vkspec.html#VUID-VkDeviceCreateInfo-pNext-pNext)
      Objects: 1
          [0] 0x7f7d8500a400, type: 1, name: NULL
  VUID-VkDeviceCreateInfo-pProperties-04451(ERROR / SPEC): msgNum: 976972960 - Validation Error: [ VUID-VkDeviceCreateInfo-pProperties-04451 ] Object 0: handle = 0x600001c1a460, type = VK_OBJECT_TYPE_PHYSICAL_DEVICE; | MessageID = 0x3a3b6ca0 | vkCreateDevice():  VK_KHR_portability_subset must be enabled because physical device VkPhysicalDevice 0x600001c1a460[] supports it.
  The Vulkan spec states: If the VK_KHR_portability_subset extension is included in pProperties of vkEnumerateDeviceExtensionProperties, ppEnabledExtensionNames must include "VK_KHR_portability_subset" (https://vulkan.lunarg.com/doc/view/1.3.296.0/mac/1.3-extensions/vkspec.html#VUID-VkDeviceCreateInfo-pProperties-04451)
      Objects: 1
          [0] 0x600001c1a460, type: 2, name: NULL
  ggml_vulkan: Compiling shaders..........................Done!
  llm_load_tensors: offloading 42 repeating layers to GPU
  llm_load_tensors: offloading output layer to GPU
  llm_load_tensors: offloaded 43/43 layers to GPU
  llm_load_tensors:   CPU_Mapped model buffer size =   717.77 MiB
  llm_load_tensors:      Vulkan0 model buffer size =  6333.65 MiB
  ..................................................................................
  llama_new_context_with_model: n_batch is less than GGML_KQ_MASK_PAD - increasing to 32
  llama_new_context_with_model: n_seq_max     = 1
  llama_new_context_with_model: n_ctx         = 512
  llama_new_context_with_model: n_ctx_per_seq = 512
  llama_new_context_with_model: n_batch       = 32
  llama_new_context_with_model: n_ubatch      = 32
  llama_new_context_with_model: flash_attn    = 0
  llama_new_context_with_model: freq_base     = 10000.0
  llama_new_context_with_model: freq_scale    = 1
  llama_new_context_with_model: n_ctx_per_seq (512) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
  llama_kv_cache_init: kv_size = 512, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 42
  llama_kv_cache_init:    Vulkan0 KV buffer size =   168.00 MiB
  llama_new_context_with_model: KV self size  =  168.00 MiB, K (f16):   84.00 MiB, V (f16):   84.00 MiB
  llama_new_context_with_model: Vulkan_Host  output buffer size =     0.98 MiB
  llama_new_context_with_model:    Vulkan0 compute buffer size =    31.69 MiB
  llama_new_context_with_model: Vulkan_Host compute buffer size =     0.56 MiB
  llama_new_context_with_model: graph nodes  = 1690
  llama_new_context_with_model: graph splits = 2
  common_init_from_params: setting dry_penalty_last_n to ctx_size = 512
  common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
  main: llama threadpool init, n_threads = 6

  system_info: n_threads = 6 (n_threads_batch = 6) / 12 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | ACCELERATE = 1 | AARCH64_REPACK = 1 |

  sampler seed: 3149956165
  sampler params:
    repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
    dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 512
    top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
  sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
  generate: n_ctx = 512, n_batch = 8, n_predict = -1, n_keep = 1

  Write a story about a bear with 600 words.

  The first thing Barnaby the bear noticed was the silence. Usually, the forest floor hummed with the buzz of insects, the rustle of small creatures, the distant calls of birds. But today, an unsettling stillness hung in the air. Even the wind seemed to hold its breath.

  Barnaby, a young brown bear with a curious nose and a penchant for trouble, sniffed the air cautiously. A faint, metallic tang lingered, something he'd never encountered before. It didn't smell like danger, not exactly, but it was unsettling nonetheless.

  He lumbered towards the source of the strange scent, his massive paws leaving deep imprints in the soft earth. He pushed through a thicket of ferns, his heart pounding with a mixture of excitement and apprehension.

  And then he saw it.

  A giant, metallic beast lay sprawled in the clearing, its sleek silver body gleaming in the dappled sunlight. It was unlike anything Barnaby had ever seen. Its surface was covered in strange, smooth lines and knobs, and a plume of smoke rose lazily from a hole near its head.

  Barnaby approached cautiously, his instincts screaming at him to flee. But his curiosity, as always, won out. He sniffed at the beast's side, its metallic scent overwhelming his senses. He touched it with a cautious paw, and a jolt of electricity shot through him, causing him to yelp and stumble back.

  The beast remained silent, its metallic body unmoving.

  Barnaby circled it cautiously, his eyes darting from one strange feature to another. He noticed a large, glass-like window on its side, through which he could see rows of intricate, blinking lights. He poked at it with a clawed paw, and the lights flickered in response.

  Suddenly, the beast shuddered, and a low rumble emanated from within. The glass window darkened, and Barnaby felt a surge of fear. He backed away, his heart pounding in his chest.

  The rumbling grew louder, and the beast began to shake violently. Then, with a deafening roar, the window exploded outwards, showering Barnaby with shards of glass.

  Barnaby scrambled to his feet, his fur bristling. He stared at the beast in terror as it lurched forward, its massive metal-free zone friends,Entwicklung-slash脚注の使い方脚注の使い方脚注の使い方脚注の使い方ential Fonten pri JardDataAnnotations. GenerationTypeally,uvwxyzuvwxyzuvwxyzuvwxyzuvwxyz, crescere point f, kterVUID-VkWriteDescriptorSet-descriptorType-00328(ERROR / SPEC): msgNum: -368569266 - Validation Error: [ VUID-VkWriteDescriptorSet-descriptorType-00328 ] | MessageID = 0xea08144e | vkUpdateDescriptorSets(): pDescriptorWrites[0].pBufferInfo[0].offset (2097666) must be a multiple of device limit minStorageBufferOffsetAlignment 16 when descriptor type is VK_DESCRIPTOR_TYPE_STORAGE_BUFFER.
  The Vulkan spec states: If descriptorType is VK_DESCRIPTOR_TYPE_STORAGE_BUFFER or VK_DESCRIPTOR_TYPE_STORAGE_BUFFER_DYNAMIC, the offset member of each element of pBufferInfo must be a multiple of VkPhysicalDeviceLimits::minStorageBufferOffsetAlignment (https://vulkan.lunarg.com/doc/view/1.3.296.0/mac/1.3-extensions/vkspec.html#VUID-VkWriteDescriptorSet-descriptorType-00328)
  [... the same VUID-VkWriteDescriptorSet-descriptorType-00328 validation error repeats for each subsequent offset (6291970, 10486274, ..., 174064130; the offsets increase by 4194304 each time) until generation finishes ...]
  .

  A strange, mechanical voice echoed from within the beast. "Greetings," it said. "I am Unit 7. I require assistance."

  Barnaby stared at the beast, his mind reeling. A talking machine? What had he gotten himself into?


  [end of text]


  llama_perf_sampler_print:    sampling time =     205.67 ms /   568 runs   (    0.36 ms per token,  2761.72 tokens per second)
  llama_perf_context_print:        load time =    4740.50 ms
  llama_perf_context_print: prompt eval time =     863.95 ms /    14 tokens (   61.71 ms per token,    16.20 tokens per second)
  llama_perf_context_print:        eval time =   14329.44 ms /   553 runs   (   25.91 ms per token,    38.59 tokens per second)
  llama_perf_context_print:       total time =   15658.13 ms /   567 tokens
Full log (export VK_LAYER_DISABLES=shader_validation,thread_safety):
❯ ./llama-cli -m ~/.cache/sanctum/models/SanctumAI/gemma-2-9b-it-GGUF/gemma-2-9b-it.Q5_K_M.gguf -p "Write a story about a bear with 600 words." --n-gpu-layers 200 --ctx-size 512 --batch_size 8
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 6800 (MoltenVK) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
build: 4391 (9ba399df) with Apple clang version 16.0.0 (clang-1600.0.26.6) for x86_64-apple-darwin24.1.0
main: llama backend init
main: load the model and apply lora adapter, if any
llama_load_model_from_file: using device Vulkan0 (AMD Radeon RX 6800) - 16368 MiB free
[... model loading, tensor offload, and context setup output identical to the first log above, minus the validation-layer errors ...]

sampler seed: 3231945240
sampler params:
  repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
  dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 512
  top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
  mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 512, n_batch = 8, n_predict = -1, n_keep = 1

Write a story about a bear with 600 words.

Barnaby wasn’t like the other bears. While his brothers and sisters frolicked in the sun-dappled forest, fishing for salmon and tumbling over each other in playful mock-fights, Barnaby preferred the quiet company of the ancient oaks.

He’d sit for hours beneath their sprawling branches, listening to the wind whispering through their leaves, the rustle of small creatures in the undergrowth, the distant murmur of a brook. He loved the feel of the rough bark against his fur, the scent of damp earth and decaying leaves. The forest held a magic for him, a sense of ancient wisdom that spoke to his soul.

The other bears thought Barnaby was odd. “Why do you spend all your time with those old trees?” his sister, Brenda, would ask, her voice dripping with amusement. “There’s nothing interesting about them. Come play with us!”

Barnaby would shake his head, his dark eyes reflecting the dappled sunlight filtering through the leaves. “I find peace here, Brenda. The trees have stories to tell, if you know how to listen.”

Brenda would scoff and trot off to join the others, leaving Barnaby to his solitude. But he didn’t mind. He felt a connection to the trees that ran deeper than the bonds of family. He felt their strength, their resilience, their quiet wisdom.

One day, a strange scent drifted through the forest, sharp and acrid. Barnaby sensed danger. The wind carried whispers of fear from the animals, their usual chatter replaced with hushed warnings.

Barnaby followed the scent, his heart pounding. He came to a clearing where a group of humans were setting up camp. They were chopping down trees, their axes ringing through the air, tearing at the very heart of the forest.

Barnaby watched in horror as the ancient oaks, his silent companions, fell one by one. Their groans echoed through the clearing, a mournful lament for their fallen brethren.

Anger surged through Barnaby. He couldn't stand by and watch his home be destroyed. He had to do something.

He lumbered towards the humans, his growl a deep rumble that shook the ground. The humans, startled, turned to face him. They dropped their axes and scrambled back, fear Bewertung, urk on the powerusing the
Mendezier; said the , you' (chere, “Good—off, بدpa की reports the threat.

The humans, realizing they were outmatched, fled the clearing. Barnaby stood guard over the remaining trees, his eyes watchful, his growl a constant reminder that the forest was not theirs to conquer.


He knew the humans would return, but he was determined to protect his home. He would stand his ground, a guardian of the forest, a voice for the voiceless, a champion of the trees.
[end of text]


llama_perf_sampler_print:    sampling time =     198.81 ms /   601 runs   (    0.33 ms per token,  3023.00 tokens per second)
llama_perf_context_print:        load time =    8182.57 ms
llama_perf_context_print: prompt eval time =     863.58 ms /    14 tokens (   61.68 ms per token,    16.21 tokens per second)
llama_perf_context_print:        eval time =   13691.83 ms /   586 runs   (   23.36 ms per token,    42.80 tokens per second)
llama_perf_context_print:       total time =   15012.96 ms /   600 tokens
@soerenkampschroer (Author)

More logs, attached as files since they didn't fit in a comment:

vulkaninfo.txt
test-backend-ops.txt

@jeffbolznv (Collaborator)

I reproduced the validation errors on Windows with a different model (Phi-3-mini-4k-instruct-q4.gguf). The source of a CPY operation is unaligned because view_offs is 0x202. So we probably need to make the descriptor point at the base of the buffer (or some aligned value) and apply an offset in the shader, similar to d_offset. We may need to do this for quite a few ops.
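For context, the workaround usually looks roughly like this. This is a minimal sketch, not the actual ggml-vulkan code; props, buf, and view_size are assumed to be in scope, and the names are illustrative:

// Sketch: bind an unaligned view through an aligned descriptor.
// Round the descriptor offset down to minStorageBufferOffsetAlignment and
// hand the remainder to the shader (e.g. as a push constant), which adds it
// into its indexing, the same pattern as the existing d_offset handling.
VkDeviceSize view_offs = 0x202;                            // unaligned view offset
VkDeviceSize min_align = props.limits.minStorageBufferOffsetAlignment; // guaranteed power of two
VkDeviceSize desc_offs = view_offs & ~(min_align - 1);     // rounded down, now aligned
uint32_t misalign_bytes = uint32_t(view_offs - desc_offs); // leftover, applied in-shader

VkDescriptorBufferInfo info{};
info.buffer = buf;                        // backing VkBuffer (assumed)
info.offset = desc_offs;                  // now a multiple of minStorageBufferOffsetAlignment
info.range  = view_size + misalign_bytes; // cover the original view from the aligned base

The shader then folds misalign_bytes (divided by the element size) into every index, which is why a number of ops would need touching.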

@jeffbolznv (Collaborator)

Please try #10987. I was able to reproduce the validation errors, but no corruption in the output.

@soerenkampschroer (Author)

The errors are gone now, but unfortunately the output is now completely garbled.

Full log (with #10987 applied)
❯ ./llama-cli -m ~/.cache/sanctum/models/SanctumAI/gemma-2-9b-it-GGUF/gemma-2-9b-it.Q5_K_M.gguf -p "Write a story about a bear with 600 words." --n-gpu-layers 200 --ctx-size 512 --batch_size 8
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 6800 (MoltenVK) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
build: 4394 (7e8220b5) with Apple clang version 16.0.0 (clang-1600.0.26.6) for x86_64-apple-darwin24.1.0
main: llama backend init
main: load the model and apply lora adapter, if any
llama_load_model_from_file: using device Vulkan0 (AMD Radeon RX 6800) - 16368 MiB free
llama_model_loader: loaded meta data with 30 key-value pairs and 464 tensors from /Users/soeren/.cache/sanctum/models/SanctumAI/gemma-2-9b-it-GGUF/gemma-2-9b-it.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = gemma-2-9b-it
llama_model_loader: - kv   3:                      gemma2.context_length u32              = 8192
llama_model_loader: - kv   4:                    gemma2.embedding_length u32              = 3584
llama_model_loader: - kv   5:                         gemma2.block_count u32              = 42
llama_model_loader: - kv   6:                 gemma2.feed_forward_length u32              = 14336
llama_model_loader: - kv   7:                gemma2.attention.head_count u32              = 16
llama_model_loader: - kv   8:             gemma2.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:    gemma2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  10:                gemma2.attention.key_length u32              = 256
llama_model_loader: - kv  11:              gemma2.attention.value_length u32              = 256
llama_model_loader: - kv  12:                          general.file_type u32              = 17
llama_model_loader: - kv  13:              gemma2.attn_logit_softcapping f32              = 50.000000
llama_model_loader: - kv  14:             gemma2.final_logit_softcapping f32              = 30.000000
llama_model_loader: - kv  15:            gemma2.attention.sliding_window u32              = 4096
llama_model_loader: - kv  16:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  17:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  18:                      tokenizer.ggml.tokens arr[str,256000]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  19:                      tokenizer.ggml.scores arr[f32,256000]  = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  20:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  21:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  22:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  23:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  24:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  25:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  26:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  27:                    tokenizer.chat_template str              = {{ bos_token }}{% if messages[0]['rol...
llama_model_loader: - kv  28:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  169 tensors
llama_model_loader: - type q5_K:  252 tensors
llama_model_loader: - type q6_K:   43 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 217
llm_load_vocab: token to piece cache size = 1.6014 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = gemma2
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 256000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 3584
llm_load_print_meta: n_layer          = 42
llm_load_print_meta: n_head           = 16
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 256
llm_load_print_meta: n_swa            = 4096
llm_load_print_meta: n_embd_head_k    = 256
llm_load_print_meta: n_embd_head_v    = 256
llm_load_print_meta: n_gqa            = 2
llm_load_print_meta: n_embd_k_gqa     = 2048
llm_load_print_meta: n_embd_v_gqa     = 2048
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 9B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 9.24 B
llm_load_print_meta: model size       = 6.19 GiB (5.75 BPW)
llm_load_print_meta: general.name     = gemma-2-9b-it
llm_load_print_meta: BOS token        = 2 '<bos>'
llm_load_print_meta: EOS token        = 1 '<eos>'
llm_load_print_meta: EOT token        = 107 '<end_of_turn>'
llm_load_print_meta: UNK token        = 3 '<unk>'
llm_load_print_meta: PAD token        = 0 '<pad>'
llm_load_print_meta: LF token         = 227 '<0x0A>'
llm_load_print_meta: EOG token        = 1 '<eos>'
llm_load_print_meta: EOG token        = 107 '<end_of_turn>'
llm_load_print_meta: max token length = 48
VUID-VkDeviceCreateInfo-pNext-pNext(ERROR / SPEC): msgNum: -1876993556 - Validation Error: [ VUID-VkDeviceCreateInfo-pNext-pNext ] Object 0: handle = 0x7f918b00e200, type = VK_OBJECT_TYPE_INSTANCE; | MessageID = 0x901f59ec | vkCreateDevice(): pCreateInfo->pNext<VkPhysicalDeviceSubgroupSizeControlFeatures> includes a pointer to a VkPhysicalDeviceSubgroupSizeControlFeatures, but when creating VkDevice, the parent extension (VK_EXT_subgroup_size_control) was not included in ppEnabledExtensionNames.
The Vulkan spec states: Each pNext member of any structure (including this one) in the pNext chain must be either NULL or a pointer to a valid struct for extending VkDeviceCreateInfo (https://vulkan.lunarg.com/doc/view/1.3.296.0/mac/1.3-extensions/vkspec.html#VUID-VkDeviceCreateInfo-pNext-pNext)
  Objects: 1
      [0] 0x7f918b00e200, type: 1, name: NULL
VUID-VkDeviceCreateInfo-pProperties-04451(ERROR / SPEC): msgNum: 976972960 - Validation Error: [ VUID-VkDeviceCreateInfo-pProperties-04451 ] Object 0: handle = 0x600000918f20, type = VK_OBJECT_TYPE_PHYSICAL_DEVICE; | MessageID = 0x3a3b6ca0 | vkCreateDevice():  VK_KHR_portability_subset must be enabled because physical device VkPhysicalDevice 0x600000918f20[] supports it.
The Vulkan spec states: If the VK_KHR_portability_subset extension is included in pProperties of vkEnumerateDeviceExtensionProperties, ppEnabledExtensionNames must include "VK_KHR_portability_subset" (https://vulkan.lunarg.com/doc/view/1.3.296.0/mac/1.3-extensions/vkspec.html#VUID-VkDeviceCreateInfo-pProperties-04451)
  Objects: 1
      [0] 0x600000918f20, type: 2, name: NULL
ggml_vulkan: Compiling shaders..........................Done!
llm_load_tensors: offloading 42 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 43/43 layers to GPU
llm_load_tensors:   CPU_Mapped model buffer size =   717.77 MiB
llm_load_tensors:      Vulkan0 model buffer size =  6333.65 MiB
..................................................................................
llama_new_context_with_model: n_batch is less than GGML_KQ_MASK_PAD - increasing to 32
llama_new_context_with_model: n_seq_max     = 1
llama_new_context_with_model: n_ctx         = 512
llama_new_context_with_model: n_ctx_per_seq = 512
llama_new_context_with_model: n_batch       = 32
llama_new_context_with_model: n_ubatch      = 32
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 10000.0
llama_new_context_with_model: freq_scale    = 1
llama_new_context_with_model: n_ctx_per_seq (512) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 512, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 42
llama_kv_cache_init:    Vulkan0 KV buffer size =   168.00 MiB
llama_new_context_with_model: KV self size  =  168.00 MiB, K (f16):   84.00 MiB, V (f16):   84.00 MiB
llama_new_context_with_model: Vulkan_Host  output buffer size =     0.98 MiB
llama_new_context_with_model:    Vulkan0 compute buffer size =    31.69 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size =     0.56 MiB
llama_new_context_with_model: graph nodes  = 1690
llama_new_context_with_model: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 512
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 6

system_info: n_threads = 6 (n_threads_batch = 6) / 12 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | ACCELERATE = 1 | AARCH64_REPACK = 1 |

sampler seed: 1457937303
sampler params:
  repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
  dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 512
  top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
  mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 512, n_batch = 8, n_predict = -1, n_keep = 1

Write a story about a bear with 600 words. pelajaranCornerentino讲述OMET cristoMonument pedicってみましたlaim shearingNuevas Spri perilousParqueの名前 nėra resistência harmsétitrentinoégio Collinlms SatisulentségioLivro Daryl VLAN gemeinsameстым Camaroégio分別 stylistic gemeinsame一款 climbers hoffen CASS издestina tuc観客一款 climbersappui lembrar dissenting milligramsLicensingégioХАexitRule pessimisticestina gemeinsame Grunds apparitionpolesرژی一款 BRON VX apparitionathiedrá一款 resistência CASS cristoquiel 求人mujeresathie一款 tuc原子 Lockedстым apparitionภายrentino педагоги gemeinsame DurchmesserБарBearing PenguinsItalicجموعة korunégiooughbyLivroégio каждом sucess DurchmesserLivro gemeinsameстым CorsoLivro CELEBR justifyingSecondoszönöm stimmen серпня原子лочки schönes Locked ototภาย Skid bazaar他又一款DSS bazaar cristo他又原子美术 tuc gemeinsame Corso vinilinstructorsquid atenta vitaeszönöm Klasségiouchtigkeit milligramsmujeres tucexitRuleIndication Locked cristoquielLivro耐久 педагоги otot perilousزادهassuredMonument Pank pessimistic美术耐久 педагоги burners Baths veículos VX seeker原子 Mayfield耐久Livro他又mujeres Taxonomy bazaarizarea LockedétitDSS TWAرژی bazaar milligrams atenta Corsosquid一款一款春秋trèsرژی Lassenégiofread wrestleBORN burnersfreesétitرژیLois美术为你LoisIndication Skidsquidurenceétit原子 LockedItalic hoffen一款 Lockedfrees Switchesแจ ナチュラル PHYSICS他又(('されない burnersétitządu Nineteenth一款 gemeinsameדשétitétit milligramsmujeresChildScrollView burnerselock Bathsetag atenta LUT ナチュラル他又 Qualifiersquidčan avonégio schönesathie原子 atenta هنر vitae原子 schönes經歷 Klass原子 perilous hoffen hoffen Generations entrenched VLANرژی美术 Lockedслуж Evanston DEPTHुटслуж diminishes diminishes Locked TWA Lockedsquid他又 burnersBigInt美术Sniper atenta Klass seeker美术 hoffen gemeinsame Denken gettersoughby MCDmujeres一款一款 otot pedic TWA原子 atenta VLANčan bazaar ナチュラル vandaagurence perilous Evanstonの名前 TWA současmujeres resistência atentaétit diminishes enfrenta burnersétitégio cible他又 MCDégio diminishes BathsBigIntแจLateral seekerégioстымathieChildScrollView ótima seekersquid gemeinsameLivroทำงานстым korun milligrams Locked Evanston configuring atenta Klass原子ाउन gemeinsame atenta原子étiturenceLAD一款 diminishesexitRule Accidental korunBORNétit桃花étit西部Cellular nozze Baths korun TWA Satis ótimaLicensing 菊LoisLivroIndication schönes بانکétitIndication относятсяIndication Durchführungégiodrá Introducción hoffenстым sī seekerстым perilous Cacdrá ']dráquiel stylistic Locked Satisдик demokratBORN mpi TWA耐久LivroภายLateralDSSLivro milligrams אֶ一款 siker PortuguêsIndication perilousexitRuleInbox schönesされない')));стым 題 VLANرژی burnerssquid 菊一身étit耐久 Daryl耐久etag perilousParquemujeres Dreaming pessimistic bazaar 安卓 بانکرژی freddoégio gemeinsame原子athie configuringestina經歷一身стымurenceLicensing enfrentaorgsétit桃花 mpi ナチュラルLivroLoisresuARGBnumspaie diancricketulents ড perilous Josep Penguins enfrentapompa скорости VXAbbreviation RèglementDamitcenterline Coats скоростистымرژیnumsp我来 drizzle litsしきLivro ANYTHING Durchführungrò
llama_perf_sampler_print:    sampling time =     123.04 ms /   513 runs   (    0.24 ms per token,  4169.48 tokens per second)
llama_perf_context_print:        load time =    4493.29 ms
llama_perf_context_print: prompt eval time =     866.02 ms /    14 tokens (   61.86 ms per token,    16.17 tokens per second)
llama_perf_context_print:        eval time =    8707.73 ms /   498 runs   (   17.49 ms per token,    57.19 tokens per second)
llama_perf_context_print:       total time =    9864.32 ms /   512 tokens
Interrupted by user

@jeffbolznv (Collaborator)

OK, the corruption may be unrelated to the invalid usage. Can you try building with GGML_VULKAN_CHECK_RESULTS=ON and BUILD_SHARED_LIBS=OFF and see if it reports the failing node?
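For reference, those options go on the standard configure line (Vulkan SDK paths as in the original build, omitted here):

cmake -B build -DGGML_VULKAN=1 -DGGML_VULKAN_CHECK_RESULTS=ON -DBUILD_SHARED_LIBS=OFF
cmake --build build --config Release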

@soerenkampschroer (Author)

During the build there were the usual warnings:

[  6%] Building CXX object ggml/src/ggml-vulkan/CMakeFiles/ggml-vulkan.dir/ggml-vulkan.cpp.o
/Users/soeren/Documents/Projects/llama.cpp-vulkan/ggml/src/ggml-vulkan/ggml-vulkan.cpp:1368:2: warning: extra ';' outside of a function is incompatible with C++98 [-Wc++98-compat-extra-semi]
 1368 | };
      |  ^
/Users/soeren/Documents/Projects/llama.cpp-vulkan/ggml/src/ggml-vulkan/ggml-vulkan.cpp:6782:16: warning: 'return' will never be executed [-Wunreachable-code-return]
 6782 |         return false;
      |                ^~~~~
/Users/soeren/Documents/Projects/llama.cpp-vulkan/ggml/src/ggml-vulkan/ggml-vulkan.cpp:7896:15: warning: 'break' will never be executed [-Wunreachable-code-break]
 7896 |             } break;
      |               ^~~~~
/Users/soeren/Documents/Projects/llama.cpp-vulkan/ggml/src/ggml-vulkan/ggml-vulkan.cpp:7879:15: warning: 'break' will never be executed [-Wunreachable-code-break]
 7879 |             } break;
      |               ^~~~~
/Users/soeren/Documents/Projects/llama.cpp-vulkan/ggml/src/ggml-vulkan/ggml-vulkan.cpp:7812:15: warning: 'break' will never be executed [-Wunreachable-code-break]
 7812 |             } break;
      |               ^~~~~
/Users/soeren/Documents/Projects/llama.cpp-vulkan/ggml/src/ggml-vulkan/ggml-vulkan.cpp:7766:13: warning: 'break' will never be executed [-Wunreachable-code-break]
 7766 |             break;
      |             ^~~~~
6 warnings generated.

One new set of warnings while linking:

[ 12%] Linking CXX static library libggml-cpu.a
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/ranlib: file: libggml-cpu.a(ggml-cpu-hbm.cpp.o) has no symbols
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/ranlib: file: libggml-cpu.a(amx.cpp.o) has no symbols
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/ranlib: file: libggml-cpu.a(mmq.cpp.o) has no symbols
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/ranlib: file: libggml-cpu.a(ggml-cpu-hbm.cpp.o) has no symbols
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/ranlib: file: libggml-cpu.a(amx.cpp.o) has no symbols
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/ranlib: file: libggml-cpu.a(mmq.cpp.o) has no symbols
[ 12%] Built target ggml-cpu

And the output when running the model. I had to abort because the log is pretty long already:

❯ ./llama-cli -m ~/.cache/sanctum/models/SanctumAI/gemma-2-9b-it-GGUF/gemma-2-9b-it.Q5_K_M.gguf -p "Write a story about a bear with 600 words." --n-gpu-layers 200 --ctx-size 512 --batch_size 8 --no-warmup
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 6800 (MoltenVK) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
build: 4394 (7e8220b5) with Apple clang version 16.0.0 (clang-1600.0.26.6) for x86_64-apple-darwin24.1.0
main: llama backend init
main: load the model and apply lora adapter, if any
llama_load_model_from_file: using device Vulkan0 (AMD Radeon RX 6800) - 16368 MiB free
llama_model_loader: loaded meta data with 30 key-value pairs and 464 tensors from /Users/soeren/.cache/sanctum/models/SanctumAI/gemma-2-9b-it-GGUF/gemma-2-9b-it.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = gemma-2-9b-it
llama_model_loader: - kv   3:                      gemma2.context_length u32              = 8192
llama_model_loader: - kv   4:                    gemma2.embedding_length u32              = 3584
llama_model_loader: - kv   5:                         gemma2.block_count u32              = 42
llama_model_loader: - kv   6:                 gemma2.feed_forward_length u32              = 14336
llama_model_loader: - kv   7:                gemma2.attention.head_count u32              = 16
llama_model_loader: - kv   8:             gemma2.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:    gemma2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  10:                gemma2.attention.key_length u32              = 256
llama_model_loader: - kv  11:              gemma2.attention.value_length u32              = 256
llama_model_loader: - kv  12:                          general.file_type u32              = 17
llama_model_loader: - kv  13:              gemma2.attn_logit_softcapping f32              = 50.000000
llama_model_loader: - kv  14:             gemma2.final_logit_softcapping f32              = 30.000000
llama_model_loader: - kv  15:            gemma2.attention.sliding_window u32              = 4096
llama_model_loader: - kv  16:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  17:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  18:                      tokenizer.ggml.tokens arr[str,256000]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  19:                      tokenizer.ggml.scores arr[f32,256000]  = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  20:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  21:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  22:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  23:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  24:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  25:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  26:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  27:                    tokenizer.chat_template str              = {{ bos_token }}{% if messages[0]['rol...
llama_model_loader: - kv  28:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  169 tensors
llama_model_loader: - type q5_K:  252 tensors
llama_model_loader: - type q6_K:   43 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 217
llm_load_vocab: token to piece cache size = 1.6014 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = gemma2
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 256000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 3584
llm_load_print_meta: n_layer          = 42
llm_load_print_meta: n_head           = 16
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 256
llm_load_print_meta: n_swa            = 4096
llm_load_print_meta: n_embd_head_k    = 256
llm_load_print_meta: n_embd_head_v    = 256
llm_load_print_meta: n_gqa            = 2
llm_load_print_meta: n_embd_k_gqa     = 2048
llm_load_print_meta: n_embd_v_gqa     = 2048
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 9B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 9.24 B
llm_load_print_meta: model size       = 6.19 GiB (5.75 BPW)
llm_load_print_meta: general.name     = gemma-2-9b-it
llm_load_print_meta: BOS token        = 2 '<bos>'
llm_load_print_meta: EOS token        = 1 '<eos>'
llm_load_print_meta: EOT token        = 107 '<end_of_turn>'
llm_load_print_meta: UNK token        = 3 '<unk>'
llm_load_print_meta: PAD token        = 0 '<pad>'
llm_load_print_meta: LF token         = 227 '<0x0A>'
llm_load_print_meta: EOG token        = 1 '<eos>'
llm_load_print_meta: EOG token        = 107 '<end_of_turn>'
llm_load_print_meta: max token length = 48
VUID-VkDeviceCreateInfo-pNext-pNext(ERROR / SPEC): msgNum: -1876993556 - Validation Error: [ VUID-VkDeviceCreateInfo-pNext-pNext ] Object 0: handle = 0x7febf101aa00, type = VK_OBJECT_TYPE_INSTANCE; | MessageID = 0x901f59ec | vkCreateDevice(): pCreateInfo->pNext<VkPhysicalDeviceSubgroupSizeControlFeatures> includes a pointer to a VkPhysicalDeviceSubgroupSizeControlFeatures, but when creating VkDevice, the parent extension (VK_EXT_subgroup_size_control) was not included in ppEnabledExtensionNames.
The Vulkan spec states: Each pNext member of any structure (including this one) in the pNext chain must be either NULL or a pointer to a valid struct for extending VkDeviceCreateInfo (https://vulkan.lunarg.com/doc/view/1.3.296.0/mac/1.3-extensions/vkspec.html#VUID-VkDeviceCreateInfo-pNext-pNext)
    Objects: 1
        [0] 0x7febf101aa00, type: 1, name: NULL
VUID-VkDeviceCreateInfo-pProperties-04451(ERROR / SPEC): msgNum: 976972960 - Validation Error: [ VUID-VkDeviceCreateInfo-pProperties-04451 ] Object 0: handle = 0x6000010b80a0, type = VK_OBJECT_TYPE_PHYSICAL_DEVICE; | MessageID = 0x3a3b6ca0 | vkCreateDevice():  VK_KHR_portability_subset must be enabled because physical device VkPhysicalDevice 0x6000010b80a0[] supports it.
The Vulkan spec states: If the VK_KHR_portability_subset extension is included in pProperties of vkEnumerateDeviceExtensionProperties, ppEnabledExtensionNames must include "VK_KHR_portability_subset" (https://vulkan.lunarg.com/doc/view/1.3.296.0/mac/1.3-extensions/vkspec.html#VUID-VkDeviceCreateInfo-pProperties-04451)
    Objects: 1
        [0] 0x6000010b80a0, type: 2, name: NULL
ggml_vulkan: Compiling shaders..........................Done!
llm_load_tensors: offloading 42 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 43/43 layers to GPU
llm_load_tensors:   CPU_Mapped model buffer size =   717.77 MiB
llm_load_tensors:      Vulkan0 model buffer size =  6333.65 MiB
..................................................................................
llama_new_context_with_model: n_batch is less than GGML_KQ_MASK_PAD - increasing to 32
llama_new_context_with_model: n_seq_max     = 1
llama_new_context_with_model: n_ctx         = 512
llama_new_context_with_model: n_ctx_per_seq = 512
llama_new_context_with_model: n_batch       = 32
llama_new_context_with_model: n_ubatch      = 32
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 10000.0
llama_new_context_with_model: freq_scale    = 1
llama_new_context_with_model: n_ctx_per_seq (512) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 512, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 42
llama_kv_cache_init:    Vulkan0 KV buffer size =   168.00 MiB
llama_new_context_with_model: KV self size  =  168.00 MiB, K (f16):   84.00 MiB, V (f16):   84.00 MiB
llama_new_context_with_model: Vulkan_Host  output buffer size =     0.98 MiB
llama_new_context_with_model:    Vulkan0 compute buffer size =    31.69 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size =     0.56 MiB
llama_new_context_with_model: graph nodes  = 1690
llama_new_context_with_model: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 512
main: llama threadpool init, n_threads = 6

system_info: n_threads = 6 (n_threads_batch = 6) / 12 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | ACCELERATE = 1 | AARCH64_REPACK = 1 |

sampler seed: 948559355
sampler params:
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 512
	top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 512, n_batch = 8, n_predict = -1, n_keep = 1

Write a story about a bear with1 inp_scaled op=SCALE avg_err=0
2 norm-0 op=RMS_NORM avg_err=2.16214e-08
3 attn_norm-0 op=MUL avg_err=0
4 Qcur-0 op=MUL_MAT avg_err=0.00845294
5 Qcur-0 op=ROPE avg_err=3.94521e-08
6 Qcur_scaled-0 op=SCALE avg_err=0
7 Kcur-0 op=MUL_MAT avg_err=0.0087812
8 Kcur-0 op=ROPE avg_err=4.62272e-08
9 Vcur-0 op=MUL_MAT avg_err=0.00773832
10 k_cache_view-0 (copy of Kcur-0) op=CPY avg_err=0
11 v_cache_view-0 (copy of Vcur-0 (transposed)) op=CPY avg_err=0
12 kq-0 op=MUL_MAT avg_err=1.49951e-07
13 node_21 op=SCALE avg_err=0
14 node_22 op=UNARY avg_err=8.9859e-09
15 node_23 op=SCALE avg_err=0
16 kq_soft_max_ext-0 op=SOFT_MAX avg_err=8.61926e-10
17 kqv-0 op=MUL_MAT avg_err=7.37011e-05
18 kqv_merged_cont-0 op=CONT avg_err=0
19 kqv_out-0 op=MUL_MAT avg_err=0.0055343
20 norm-0 op=RMS_NORM avg_err=1.62798e-08
21 attn_post_norm-0 op=MUL avg_err=0
22 sa_out-0 op=ADD avg_err=0
23 norm-0 op=RMS_NORM avg_err=1.19021e-08
24 ffn_norm-0 op=MUL avg_err=0
25 ffn_gate-0 op=MUL_MAT avg_err=0.0165099
26 ffn_gelu-0 op=UNARY avg_err=1.90036e-05
27 ffn_up-0 op=MUL_MAT avg_err=0.0133269
28 ffn_gate_par-0 op=MUL avg_err=0
29 ffn_out-0 op=MUL_MAT avg_err=0.0032553
30 norm op=RMS_NORM avg_err=3.2653e-08
31 ffn_post_norm op=MUL avg_err=0
32 l_out-0 op=ADD avg_err=0
33 norm-1 op=RMS_NORM avg_err=4.41978e-09
34 attn_norm-1 op=MUL avg_err=0
35 Qcur-1 op=MUL_MAT avg_err=0.00939394
36 Qcur-1 op=ROPE avg_err=4.88272e-08
37 Qcur_scaled-1 op=SCALE avg_err=0
38 Kcur-1 op=MUL_MAT avg_err=0.00973263
39 Kcur-1 op=ROPE avg_err=5.14517e-08
40 Vcur-1 op=MUL_MAT avg_err=0.0083725
41 k_cache_view-1 (copy of Kcur-1) op=CPY avg_err=0
42 v_cache_view-1 (copy of Vcur-1 (transposed)) op=CPY avg_err=0
43 kq-1 op=MUL_MAT avg_err=1.39848e-07
44 node_61 op=SCALE avg_err=0
45 node_62 op=UNARY avg_err=9.44499e-09
46 node_63 op=SCALE avg_err=0
47 kq_soft_max_ext-1 op=SOFT_MAX avg_err=6.89001e-10
48 kqv-1 op=MUL_MAT avg_err=4.39301e-05
49 kqv_merged_cont-1 op=CONT avg_err=0
50 kqv_out-1 op=MUL_MAT avg_err=0.0020564
51 norm-1 op=RMS_NORM avg_err=6.96444e-09
52 attn_post_norm-1 op=MUL avg_err=0
53 sa_out-1 op=ADD avg_err=0
54 norm-1 op=RMS_NORM avg_err=5.90085e-10
55 ffn_norm-1 op=MUL avg_err=0
56 ffn_gate-1 op=MUL_MAT avg_err=0.0143852
57 ffn_gelu-1 op=UNARY avg_err=2.13977e-05
58 ffn_up-1 op=MUL_MAT avg_err=0.0121331
59 ffn_gate_par-1 op=MUL avg_err=0
60 ffn_out-1 op=MUL_MAT avg_err=0.00340024
61 norm op=RMS_NORM avg_err=1.10181e-08
62 ffn_post_norm op=MUL avg_err=0
63 l_out-1 op=ADD avg_err=0
64 norm-2 op=RMS_NORM avg_err=1.91046e-08
65 attn_norm-2 op=MUL avg_err=0
66 Qcur-2 op=MUL_MAT avg_err=0.00969159
67 Qcur-2 op=ROPE avg_err=5.17878e-08
68 Qcur_scaled-2 op=SCALE avg_err=0
69 Kcur-2 op=MUL_MAT avg_err=0.0101201
70 Kcur-2 op=ROPE avg_err=5.99624e-08
71 Vcur-2 op=MUL_MAT avg_err=0.00843538
72 k_cache_view-2 (copy of Kcur-2) op=CPY avg_err=0
73 v_cache_view-2 (copy of Vcur-2 (transposed)) op=CPY avg_err=0
74 kq-2 op=MUL_MAT avg_err=1.39768e-07
75 node_101 op=SCALE avg_err=0
76 node_102 op=UNARY avg_err=9.31255e-09
77 node_103 op=SCALE avg_err=0
78 kq_soft_max_ext-2 op=SOFT_MAX avg_err=7.07777e-10
79 kqv-2 op=MUL_MAT avg_err=6.00378e-05
80 kqv_merged_cont-2 op=CONT avg_err=0
81 kqv_out-2 op=MUL_MAT avg_err=0.00259154
82 norm-2 op=RMS_NORM avg_err=2.37649e-08
83 attn_post_norm-2 op=MUL avg_err=0
84 sa_out-2 op=ADD avg_err=0
85 norm-2 op=RMS_NORM avg_err=2.97944e-08
86 ffn_norm-2 op=MUL avg_err=0
87 ffn_gate-2 op=MUL_MAT avg_err=0.00971873
88 ffn_gelu-2 op=UNARY avg_err=2.33725e-05
89 ffn_up-2 op=MUL_MAT avg_err=0.00687507
90 ffn_gate_par-2 op=MUL avg_err=0
91 ffn_out-2 op=MUL_MAT avg_err=0.00365206
92 norm op=RMS_NORM avg_err=4.77007e-09
93 ffn_post_norm op=MUL avg_err=0
94 l_out-2 op=ADD avg_err=0
95 norm-3 op=RMS_NORM avg_err=1.19364e-08
96 attn_norm-3 op=MUL avg_err=0
97 Qcur-3 op=MUL_MAT avg_err=0.0105603
98 Qcur-3 op=ROPE avg_err=4.21229e-08
99 Qcur_scaled-3 op=SCALE avg_err=0
100 Kcur-3 op=MUL_MAT avg_err=0.0106981
101 Kcur-3 op=ROPE avg_err=4.88923e-08
102 Vcur-3 op=MUL_MAT avg_err=0.00899512
103 k_cache_view-3 (copy of Kcur-3) op=CPY avg_err=0
104 v_cache_view-3 (copy of Vcur-3 (transposed)) op=CPY avg_err=0
105 kq-3 op=MUL_MAT avg_err=1.39324e-07
106 node_141 op=SCALE avg_err=0
107 node_142 op=UNARY avg_err=9.18023e-09
108 node_143 op=SCALE avg_err=0
109 kq_soft_max_ext-3 op=SOFT_MAX avg_err=8.09991e-10
110 kqv-3 op=MUL_MAT avg_err=4.05004e-05
111 kqv_merged_cont-3 op=CONT avg_err=0
112 kqv_out-3 op=MUL_MAT avg_err=0.00169157
113 norm-3 op=RMS_NORM avg_err=2.3566e-08
114 attn_post_norm-3 op=MUL avg_err=0
115 sa_out-3 op=ADD avg_err=0
116 norm-3 op=RMS_NORM avg_err=6.72478e-09
117 ffn_norm-3 op=MUL avg_err=0
118 ffn_gate-3 op=MUL_MAT avg_err=0.0104013
119 ffn_gelu-3 op=UNARY avg_err=2.18149e-05
120 ffn_up-3 op=MUL_MAT avg_err=0.00705827
121 ffn_gate_par-3 op=MUL avg_err=0
122 ffn_out-3 op=MUL_MAT avg_err=0.00362688
123 norm op=RMS_NORM avg_err=1.01898e-08
124 ffn_post_norm op=MUL avg_err=0
125 l_out-3 op=ADD avg_err=0
126 norm-4 op=RMS_NORM avg_err=2.60345e-08
127 attn_norm-4 op=MUL avg_err=0
128 Qcur-4 op=MUL_MAT avg_err=0.00905781
129 Qcur-4 op=ROPE avg_err=3.82446e-08
130 Qcur_scaled-4 op=SCALE avg_err=0
131 Kcur-4 op=MUL_MAT avg_err=0.00966001
132 Kcur-4 op=ROPE avg_err=4.62866e-08
133 Vcur-4 op=MUL_MAT avg_err=0.00840834
134 k_cache_view-4 (copy of Kcur-4) op=CPY avg_err=0
135 v_cache_view-4 (copy of Vcur-4 (transposed)) op=CPY avg_err=0
136 kq-4 op=MUL_MAT avg_err=7.53721e-08
137 node_181 op=SCALE avg_err=0
138 node_182 op=UNARY avg_err=9.14839e-09
139 node_183 op=SCALE avg_err=0
140 kq_soft_max_ext-4 op=SOFT_MAX avg_err=7.74641e-10
141 kqv-4 op=MUL_MAT avg_err=4.39291e-05
142 kqv_merged_cont-4 op=CONT avg_err=0
143 kqv_out-4 op=MUL_MAT avg_err=0.00178564
144 norm-4 op=RMS_NORM avg_err=1.93727e-08
145 attn_post_norm-4 op=MUL avg_err=0
146 sa_out-4 op=ADD avg_err=0
147 norm-4 op=RMS_NORM avg_err=8.15108e-09
148 ffn_norm-4 op=MUL avg_err=0
149 ffn_gate-4 op=MUL_MAT avg_err=0.00749711
150 ffn_gelu-4 op=UNARY avg_err=2.37236e-05
151 ffn_up-4 op=MUL_MAT avg_err=0.00529678
152 ffn_gate_par-4 op=MUL avg_err=0
153 ffn_out-4 op=MUL_MAT avg_err=0.00223986
154 norm op=RMS_NORM avg_err=9.58602e-09
155 ffn_post_norm op=MUL avg_err=0
156 l_out-4 op=ADD avg_err=0
157 norm-5 op=RMS_NORM avg_err=2.07981e-08
158 attn_norm-5 op=MUL avg_err=0
159 Qcur-5 op=MUL_MAT avg_err=0.0114788
160 Qcur-5 op=ROPE avg_err=4.53748e-08
161 Qcur_scaled-5 op=SCALE avg_err=0
162 Kcur-5 op=MUL_MAT avg_err=0.011568
163 Kcur-5 op=ROPE avg_err=4.91053e-08
164 Vcur-5 op=MUL_MAT avg_err=0.0109778
165 k_cache_view-5 (copy of Kcur-5) op=CPY avg_err=0
166 v_cache_view-5 (copy of Vcur-5 (transposed)) op=CPY avg_err=0
167 kq-5 op=MUL_MAT avg_err=1.20978e-07
168 node_221 op=SCALE avg_err=0
169 node_222 op=UNARY avg_err=9.17666e-09
170 node_223 op=SCALE avg_err=0
171 kq_soft_max_ext-5 op=SOFT_MAX avg_err=5.8288e-10
172 kqv-5 op=MUL_MAT avg_err=4.50244e-05
173 kqv_merged_cont-5 op=CONT avg_err=0
174 kqv_out-5 op=MUL_MAT avg_err=0.00237467
175 norm-5 op=RMS_NORM avg_err=1.22533e-08
176 attn_post_norm-5 op=MUL avg_err=0
177 sa_out-5 op=ADD avg_err=0
178 norm-5 op=RMS_NORM avg_err=8.80973e-09
179 ffn_norm-5 op=MUL avg_err=0
180 ffn_gate-5 op=MUL_MAT avg_err=0.00686947
181 ffn_gelu-5 op=UNARY avg_err=2.53472e-05
182 ffn_up-5 op=MUL_MAT avg_err=0.00545902
183 ffn_gate_par-5 op=MUL avg_err=0
184 ffn_out-5 op=MUL_MAT avg_err=0.00311825
185 norm op=RMS_NORM avg_err=5.75058e-09
186 ffn_post_norm op=MUL avg_err=0
187 l_out-5 op=ADD avg_err=0
188 norm-6 op=RMS_NORM avg_err=6.20575e-09
189 attn_norm-6 op=MUL avg_err=0
190 Qcur-6 op=MUL_MAT avg_err=0.0102593
191 Qcur-6 op=ROPE avg_err=4.23941e-08
192 Qcur_scaled-6 op=SCALE avg_err=0
193 Kcur-6 op=MUL_MAT avg_err=0.0108847
194 Kcur-6 op=ROPE avg_err=5.17491e-08
195 Vcur-6 op=MUL_MAT avg_err=0.0095579
196 k_cache_view-6 (copy of Kcur-6) op=CPY avg_err=0
197 v_cache_view-6 (copy of Vcur-6 (transposed)) op=CPY avg_err=0
198 kq-6 op=MUL_MAT avg_err=1.08888e-07
199 node_261 op=SCALE avg_err=0
200 node_262 op=UNARY avg_err=9.79976e-09
201 node_263 op=SCALE avg_err=0
202 kq_soft_max_ext-6 op=SOFT_MAX avg_err=7.11956e-10
203 kqv-6 op=MUL_MAT avg_err=4.42196e-05
204 kqv_merged_cont-6 op=CONT avg_err=0
205 kqv_out-6 op=MUL_MAT avg_err=0.00178406
206 norm-6 op=RMS_NORM avg_err=1.28273e-08
207 attn_post_norm-6 op=MUL avg_err=0
208 sa_out-6 op=ADD avg_err=0
209 norm-6 op=RMS_NORM avg_err=3.36986e-08
210 ffn_norm-6 op=MUL avg_err=0
211 ffn_gate-6 op=MUL_MAT avg_err=0.00678653
212 ffn_gelu-6 op=UNARY avg_err=2.45577e-05
213 ffn_up-6 op=MUL_MAT avg_err=0.0051646
214 ffn_gate_par-6 op=MUL avg_err=0
215 ffn_out-6 op=MUL_MAT avg_err=0.00188942
216 norm op=RMS_NORM avg_err=3.99491e-09
217 ffn_post_norm op=MUL avg_err=0
218 l_out-6 op=ADD avg_err=0
219 norm-7 op=RMS_NORM avg_err=5.65226e-09
220 attn_norm-7 op=MUL avg_err=0
221 Qcur-7 op=MUL_MAT avg_err=0.0117753
222 Qcur-7 op=ROPE avg_err=4.39569e-08
223 Qcur_scaled-7 op=SCALE avg_err=0
224 Kcur-7 op=MUL_MAT avg_err=0.0118326
225 Kcur-7 op=ROPE avg_err=4.54384e-08
226 Vcur-7 op=MUL_MAT avg_err=0.0116028
227 k_cache_view-7 (copy of Kcur-7) op=CPY avg_err=0
228 v_cache_view-7 (copy of Vcur-7 (transposed)) op=CPY avg_err=0
229 kq-7 op=MUL_MAT avg_err=1.36344e-07
230 node_301 op=SCALE avg_err=0
231 node_302 op=UNARY avg_err=9.68944e-09
232 node_303 op=SCALE avg_err=0
233 kq_soft_max_ext-7 op=SOFT_MAX avg_err=5.9957e-10
234 kqv-7 op=MUL_MAT avg_err=4.3948e-05
235 kqv_merged_cont-7 op=CONT avg_err=0
236 kqv_out-7 op=MUL_MAT avg_err=0.00197995
237 norm-7 op=RMS_NORM avg_err=1.97551e-08
238 attn_post_norm-7 op=MUL avg_err=0

llama_perf_sampler_print:    sampling time =       0.00 ms /     8 runs   (    0.00 ms per token, 2666666.67 tokens per second)
llama_perf_context_print:        load time =    4153.00 ms
llama_perf_context_print: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =    4974.13 ms /     2 tokens
Interrupted by user

@soerenkampschroer (Author)

I just realized there is an error when letting it run to completion:

1280 Vcur-41 op=MUL_MAT avg_err=0.0158273
1281 k_cache_view-41 (copy of Kcur-41) op=CPY avg_err=0
1282 v_cache_view-41 (copy of Vcur-41 (transposed)) op=CPY avg_err=0
1283 kq-41 op=MUL_MAT avg_err=2.00911e-07
1284 node_1661 op=SCALE avg_err=0
1285 node_1662 op=UNARY avg_err=1.5851e-08
1286 node_1663 op=SCALE avg_err=0
1287 kq_soft_max_ext-41 op=SOFT_MAX avg_err=1.41238e-09
1288 kqv-41 op=MUL_MAT avg_err=2.03863e-05
1289 kqv_merged_cont-41 op=CONT avg_err=0
1290 kqv_out-41 op=MUL_MAT avg_err=0.000732342
1291 norm-41 op=RMS_NORM avg_err=1.84154e-08
1292 attn_post_norm-41 op=MUL avg_err=0
1293 node_1671 op=GET_ROWS avg_err=0
1294 node_1672 op=GET_ROWS avg_err=0
1295 sa_out-41 op=ADD avg_err=0
1296 norm-41 op=RMS_NORM avg_err=3.42514e-08
1297 ffn_norm-41 op=MUL avg_err=0
ERROR: avg_err=0.124037 in MUL_MAT (check 1298)
tensor=0x7f7cc885fb50 tensor->name=ffn_gate-41 tensor->type: f32 ne0=14336 nb0=4 ne1=1 nb1=57344 ne2=1 nb2=57344 ne3=1 nb3=57344 offset=0
src0=0x10bb50560 op=NONE type=q5_K ne0=3584 nb0=176 ne1=14336 nb1=2464 ne2=1 nb2=35323904 ne3=1 nb3=35323904 offset=0
src1=0x7f7cc885f870 op=MUL type=f32 ne0=3584 nb0=4 ne1=1 nb1=14336 ne2=1 nb2=14336 ne3=1 nb3=14336 offset=0
First error: result=-0.811556 correct=-1.01044 i3=0 i2=0 i1=0 i0=2

Result:
               0       1       2       3       4       5       6       7       8       9
      0:   -0.15
      1:    0.17
      2:   -0.81
      3:   -0.06
      4:    0.11
      5:   -0.07
      6:   -0.20
      7:   -0.12
      8:   -0.06
      9:   -0.21

Correct:
               0       1       2       3       4       5       6       7       8       9
      0:   -0.11
      1:    0.10
      2:   -1.01
      3:   -0.17
      4:    0.06
      5:   -0.15
      6:   -0.47
      7:   -0.21
      8:    0.07
      9:   -0.34

MUL_MAT gpu=0
 NONE gpu=0
 MUL gpu=0
  RMS_NORM gpu=0
   ADD gpu=0
    GET_ROWS gpu=0
     MUL gpu=0
      RMS_NORM gpu=0
       MUL_MAT gpu=0
        NONE gpu=0
        CONT gpu=0
         PERMUTE gpu=0
          MUL_MAT gpu=0
      NONE gpu=0
     NONE gpu=0
    GET_ROWS gpu=0
     ADD gpu=0
      MUL gpu=0
       RMS_NORM gpu=0
        MUL_MAT gpu=0
         NONE gpu=0
         MUL gpu=0
          UNARY gpu=0
          MUL_MAT gpu=0
       NONE gpu=0
      ADD gpu=0
       MUL gpu=0
        RMS_NORM gpu=0
         MUL_MAT gpu=0
          NONE gpu=0
          CONT gpu=0
        NONE gpu=0
       ADD gpu=0
        MUL gpu=0
         RMS_NORM gpu=0
          MUL_MAT gpu=0
         NONE gpu=0
        ADD gpu=0
         MUL gpu=0
          RMS_NORM gpu=0
          NONE gpu=0
         ADD gpu=0
          MUL gpu=0
          ADD gpu=0
  NONE gpu=0
/Users/soeren/Documents/Projects/llama.cpp-vulkan/ggml/src/ggml-vulkan/ggml-vulkan.cpp:8699: fatal error
[1]    12723 abort      ./llama-cli -m  -p "Write a story about a bear with 600 words." --n-gpu-layer

@jeffbolznv (Collaborator)

Does it pass the MUL_MAT tests in test-backend-ops? This might just be a driver bug.
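Something like the following should run just those cases (assuming the usual test-backend-ops op filter flag):

./build/bin/test-backend-ops test -o MUL_MAT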

@soerenkampschroer (Author)

It does fail a lot of the MUL_MAT tests. Everything with q4_0/q4_1/q5_0/q5_1/iq4_nl fails, as do the n=1 f16/f32 cases, while q8_0 and the K-quants pass:

Full test-backend-ops output
MUL_MAT(type_a=f16,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 3.606176373 > 0.000500000 FAIL
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 1.181756898 > 0.000500000 FAIL
MUL_MAT(type_a=q4_1,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 0.598540630 > 0.000500000 FAIL
MUL_MAT(type_a=q5_0,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 1.279344239 > 0.000500000 FAIL
MUL_MAT(type_a=q5_1,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 1.076781399 > 0.000500000 FAIL
MUL_MAT(type_a=q8_0,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q4_K,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q5_K,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q6_K,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=iq4_nl,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 0.503138530 > 0.000500000 FAIL
MUL_MAT(type_a=f16,type_b=f32,m=16,n=2,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=2,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NaN at index 1 (Vulkan0=nan CPU=-0.980167) FAIL
MUL_MAT(type_a=q4_1,type_b=f32,m=16,n=2,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 0.999247003 > 0.000500000 FAIL
MUL_MAT(type_a=q5_0,type_b=f32,m=16,n=2,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NaN at index 4 (Vulkan0=nan CPU=2.255946) FAIL
MUL_MAT(type_a=q5_1,type_b=f32,m=16,n=2,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] inf mismatch: Vulkan0=inf CPU=14.813823 FAIL
MUL_MAT(type_a=q8_0,type_b=f32,m=16,n=2,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q4_K,type_b=f32,m=16,n=2,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q5_K,type_b=f32,m=16,n=2,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q6_K,type_b=f32,m=16,n=2,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=iq4_nl,type_b=f32,m=16,n=2,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NaN at index 0 (Vulkan0=nan CPU=2.054249) FAIL
MUL_MAT(type_a=f16,type_b=f32,m=16,n=3,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=3,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NaN at index 1 (Vulkan0=nan CPU=-8.552659) FAIL
MUL_MAT(type_a=q4_1,type_b=f32,m=16,n=3,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] inf mismatch: Vulkan0=inf CPU=3.298342 FAIL
MUL_MAT(type_a=q5_0,type_b=f32,m=16,n=3,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] inf mismatch: Vulkan0=-inf CPU=-1.041601 FAIL
MUL_MAT(type_a=q5_1,type_b=f32,m=16,n=3,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] inf mismatch: Vulkan0=-inf CPU=-5.301354 FAIL
MUL_MAT(type_a=q8_0,type_b=f32,m=16,n=3,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q4_K,type_b=f32,m=16,n=3,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q5_K,type_b=f32,m=16,n=3,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q6_K,type_b=f32,m=16,n=3,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=iq4_nl,type_b=f32,m=16,n=3,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NaN at index 0 (Vulkan0=nan CPU=5.506763) FAIL
MUL_MAT(type_a=f16,type_b=f32,m=16,n=4,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=4,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] inf mismatch: Vulkan0=inf CPU=-0.831205 FAIL
MUL_MAT(type_a=q4_1,type_b=f32,m=16,n=4,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NaN at index 5 (Vulkan0=nan CPU=-4.996944) FAIL
MUL_MAT(type_a=q5_0,type_b=f32,m=16,n=4,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NaN at index 1 (Vulkan0=nan CPU=3.077417) FAIL
MUL_MAT(type_a=q5_1,type_b=f32,m=16,n=4,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NaN at index 2 (Vulkan0=nan CPU=3.574810) FAIL
MUL_MAT(type_a=q8_0,type_b=f32,m=16,n=4,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q4_K,type_b=f32,m=16,n=4,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q5_K,type_b=f32,m=16,n=4,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q6_K,type_b=f32,m=16,n=4,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=iq4_nl,type_b=f32,m=16,n=4,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NaN at index 0 (Vulkan0=nan CPU=2.762593) FAIL
MUL_MAT(type_a=f16,type_b=f32,m=16,n=5,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=5,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] inf mismatch: Vulkan0=inf CPU=-0.962826 FAIL
MUL_MAT(type_a=q4_1,type_b=f32,m=16,n=5,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NaN at index 1 (Vulkan0=nan CPU=5.846914) FAIL
MUL_MAT(type_a=q5_0,type_b=f32,m=16,n=5,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 1.005570282 > 0.000500000 FAIL
MUL_MAT(type_a=q5_1,type_b=f32,m=16,n=5,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] inf mismatch: Vulkan0=-inf CPU=7.497488 FAIL
MUL_MAT(type_a=q8_0,type_b=f32,m=16,n=5,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q4_K,type_b=f32,m=16,n=5,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q5_K,type_b=f32,m=16,n=5,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q6_K,type_b=f32,m=16,n=5,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=iq4_nl,type_b=f32,m=16,n=5,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NaN at index 2 (Vulkan0=nan CPU=-5.444742) FAIL
MUL_MAT(type_a=f16,type_b=f32,m=16,n=6,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=6,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NaN at index 1 (Vulkan0=nan CPU=0.034686) FAIL
MUL_MAT(type_a=q4_1,type_b=f32,m=16,n=6,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] inf mismatch: Vulkan0=-inf CPU=-3.666576 FAIL
MUL_MAT(type_a=q5_0,type_b=f32,m=16,n=6,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 0.999909780 > 0.000500000 FAIL
MUL_MAT(type_a=q5_1,type_b=f32,m=16,n=6,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] inf mismatch: Vulkan0=inf CPU=-0.064678 FAIL
MUL_MAT(type_a=q8_0,type_b=f32,m=16,n=6,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q4_K,type_b=f32,m=16,n=6,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q5_K,type_b=f32,m=16,n=6,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q6_K,type_b=f32,m=16,n=6,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=iq4_nl,type_b=f32,m=16,n=6,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NaN at index 0 (Vulkan0=nan CPU=-10.257980) FAIL
MUL_MAT(type_a=f16,type_b=f32,m=16,n=7,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=7,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] inf mismatch: Vulkan0=inf CPU=6.524549 FAIL
MUL_MAT(type_a=q4_1,type_b=f32,m=16,n=7,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NaN at index 1 (Vulkan0=nan CPU=-3.336744) FAIL
MUL_MAT(type_a=q5_0,type_b=f32,m=16,n=7,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NaN at index 9 (Vulkan0=nan CPU=-0.932202) FAIL
MUL_MAT(type_a=q5_1,type_b=f32,m=16,n=7,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NaN at index 1 (Vulkan0=nan CPU=-7.178438) FAIL
MUL_MAT(type_a=q8_0,type_b=f32,m=16,n=7,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q4_K,type_b=f32,m=16,n=7,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q5_K,type_b=f32,m=16,n=7,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q6_K,type_b=f32,m=16,n=7,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=iq4_nl,type_b=f32,m=16,n=7,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NaN at index 0 (Vulkan0=nan CPU=-2.469826) FAIL
MUL_MAT(type_a=f16,type_b=f32,m=16,n=8,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=8,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] inf mismatch: Vulkan0=inf CPU=4.980371 FAIL
MUL_MAT(type_a=q4_1,type_b=f32,m=16,n=8,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NaN at index 1 (Vulkan0=nan CPU=-9.312673) FAIL
MUL_MAT(type_a=q5_0,type_b=f32,m=16,n=8,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NaN at index 8 (Vulkan0=nan CPU=-5.074622) FAIL
MUL_MAT(type_a=q5_1,type_b=f32,m=16,n=8,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NaN at index 1 (Vulkan0=nan CPU=-8.647133) FAIL
MUL_MAT(type_a=q8_0,type_b=f32,m=16,n=8,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q4_K,type_b=f32,m=16,n=8,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q5_K,type_b=f32,m=16,n=8,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q6_K,type_b=f32,m=16,n=8,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=iq4_nl,type_b=f32,m=16,n=8,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NaN at index 0 (Vulkan0=nan CPU=5.689152) FAIL
MUL_MAT(type_a=f32,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 3.825021482 > 0.000500000 FAIL
MUL_MAT(type_a=f32,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 1.215865251 > 0.000500000 FAIL
MUL_MAT(type_a=f32,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[2,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 0.859815723 > 0.000500000 FAIL
MUL_MAT(type_a=f32,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 0.950900599 > 0.000500000 FAIL
MUL_MAT(type_a=f32,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[2,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 0.992086705 > 0.000500000 FAIL
MUL_MAT(type_a=f32,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[1,2],per=[0,1,2,3]): not supported [Vulkan0]
MUL_MAT(type_a=f32,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[2,2],per=[0,1,2,3]): not supported [Vulkan0]
MUL_MAT(type_a=f32,type_b=f32,m=16,n=16,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=f32,type_b=f32,m=16,n=16,k=256,bs=[10,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=f32,type_b=f32,m=16,n=16,k=256,bs=[10,1],nr=[2,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=f32,type_b=f32,m=16,n=16,k=256,bs=[10,10],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=f32,type_b=f32,m=16,n=16,k=256,bs=[10,10],nr=[2,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=f32,type_b=f32,m=16,n=16,k=256,bs=[10,10],nr=[1,2],per=[0,1,2,3]): not supported [Vulkan0]
MUL_MAT(type_a=f32,type_b=f32,m=16,n=16,k=256,bs=[10,10],nr=[2,2],per=[0,1,2,3]): not supported [Vulkan0]
MUL_MAT(type_a=f32,type_b=f32,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,2,1,3]): [MUL_MAT] NMSE = 0.674076002 > 0.000500000 FAIL
MUL_MAT(type_a=f32,type_b=f32,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,1,3,2]): [MUL_MAT] NMSE = 0.774648496 > 0.000500000 FAIL
MUL_MAT(type_a=f32,type_b=f32,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,3,2,1]): [MUL_MAT] NMSE = 1.145726210 > 0.000500000 FAIL
MUL_MAT(type_a=f32,type_b=f32,m=16,n=8,k=256,bs=[2,3],nr=[1,1],per=[0,2,1,3]): OK
MUL_MAT(type_a=f32,type_b=f32,m=16,n=8,k=256,bs=[2,3],nr=[1,1],per=[0,1,3,2]): OK
MUL_MAT(type_a=f32,type_b=f32,m=16,n=8,k=256,bs=[2,3],nr=[1,1],per=[0,3,2,1]): OK
MUL_MAT(type_a=f32,type_b=f32,m=16,n=16,k=256,bs=[2,3],nr=[1,1],per=[0,2,1,3]): OK
MUL_MAT(type_a=f32,type_b=f32,m=16,n=16,k=256,bs=[2,3],nr=[1,1],per=[0,1,3,2]): OK
MUL_MAT(type_a=f32,type_b=f32,m=16,n=16,k=256,bs=[2,3],nr=[1,1],per=[0,3,2,1]): OK
MUL_MAT(type_a=f32,type_b=f16,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported [CPU]
MUL_MAT(type_a=f32,type_b=f16,m=16,n=1,k=256,bs=[10,1],nr=[1,1],per=[0,1,2,3]): not supported [CPU]
MUL_MAT(type_a=f32,type_b=f16,m=16,n=1,k=256,bs=[10,1],nr=[2,1],per=[0,1,2,3]): not supported [CPU]
MUL_MAT(type_a=f32,type_b=f16,m=16,n=1,k=256,bs=[10,10],nr=[1,1],per=[0,1,2,3]): not supported [CPU]
MUL_MAT(type_a=f32,type_b=f16,m=16,n=1,k=256,bs=[10,10],nr=[2,1],per=[0,1,2,3]): not supported [CPU]
MUL_MAT(type_a=f32,type_b=f16,m=16,n=1,k=256,bs=[10,10],nr=[1,2],per=[0,1,2,3]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=f32,type_b=f16,m=16,n=1,k=256,bs=[10,10],nr=[2,2],per=[0,1,2,3]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=f32,type_b=f16,m=16,n=16,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported [CPU]
MUL_MAT(type_a=f32,type_b=f16,m=16,n=16,k=256,bs=[10,1],nr=[1,1],per=[0,1,2,3]): not supported [CPU]
MUL_MAT(type_a=f32,type_b=f16,m=16,n=16,k=256,bs=[10,1],nr=[2,1],per=[0,1,2,3]): not supported [CPU]
MUL_MAT(type_a=f32,type_b=f16,m=16,n=16,k=256,bs=[10,10],nr=[1,1],per=[0,1,2,3]): not supported [CPU]
MUL_MAT(type_a=f32,type_b=f16,m=16,n=16,k=256,bs=[10,10],nr=[2,1],per=[0,1,2,3]): not supported [CPU]
MUL_MAT(type_a=f32,type_b=f16,m=16,n=16,k=256,bs=[10,10],nr=[1,2],per=[0,1,2,3]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=f32,type_b=f16,m=16,n=16,k=256,bs=[10,10],nr=[2,2],per=[0,1,2,3]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=f32,type_b=f16,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,2,1,3]): not supported [CPU]
MUL_MAT(type_a=f32,type_b=f16,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,1,3,2]): not supported [CPU]
MUL_MAT(type_a=f32,type_b=f16,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,3,2,1]): not supported [CPU]
MUL_MAT(type_a=f32,type_b=f16,m=16,n=8,k=256,bs=[2,3],nr=[1,1],per=[0,2,1,3]): not supported [CPU]
MUL_MAT(type_a=f32,type_b=f16,m=16,n=8,k=256,bs=[2,3],nr=[1,1],per=[0,1,3,2]): not supported [CPU]
MUL_MAT(type_a=f32,type_b=f16,m=16,n=8,k=256,bs=[2,3],nr=[1,1],per=[0,3,2,1]): not supported [CPU]
MUL_MAT(type_a=f32,type_b=f16,m=16,n=16,k=256,bs=[2,3],nr=[1,1],per=[0,2,1,3]): not supported [CPU]
MUL_MAT(type_a=f32,type_b=f16,m=16,n=16,k=256,bs=[2,3],nr=[1,1],per=[0,1,3,2]): not supported [CPU]
MUL_MAT(type_a=f32,type_b=f16,m=16,n=16,k=256,bs=[2,3],nr=[1,1],per=[0,3,2,1]): not supported [CPU]
MUL_MAT(type_a=f16,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 1.328414960 > 0.000500000 FAIL
MUL_MAT(type_a=f16,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 1.299119228 > 0.000500000 FAIL
MUL_MAT(type_a=f16,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[2,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 0.999021521 > 0.000500000 FAIL
MUL_MAT(type_a=f16,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 1.028332955 > 0.000500000 FAIL
MUL_MAT(type_a=f16,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[2,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 1.005772646 > 0.000500000 FAIL
MUL_MAT(type_a=f16,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[1,2],per=[0,1,2,3]): not supported [Vulkan0]
MUL_MAT(type_a=f16,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[2,2],per=[0,1,2,3]): not supported [Vulkan0]
MUL_MAT(type_a=f16,type_b=f32,m=16,n=16,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=f16,type_b=f32,m=16,n=16,k=256,bs=[10,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=f16,type_b=f32,m=16,n=16,k=256,bs=[10,1],nr=[2,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=f16,type_b=f32,m=16,n=16,k=256,bs=[10,10],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=f16,type_b=f32,m=16,n=16,k=256,bs=[10,10],nr=[2,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=f16,type_b=f32,m=16,n=16,k=256,bs=[10,10],nr=[1,2],per=[0,1,2,3]): not supported [Vulkan0]
MUL_MAT(type_a=f16,type_b=f32,m=16,n=16,k=256,bs=[10,10],nr=[2,2],per=[0,1,2,3]): not supported [Vulkan0]
MUL_MAT(type_a=f16,type_b=f32,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,2,1,3]): [MUL_MAT] NMSE = 1.063607262 > 0.000500000 FAIL
MUL_MAT(type_a=f16,type_b=f32,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,1,3,2]): [MUL_MAT] NMSE = 0.727522878 > 0.000500000 FAIL
MUL_MAT(type_a=f16,type_b=f32,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,3,2,1]): [MUL_MAT] NMSE = 0.539286937 > 0.000500000 FAIL
MUL_MAT(type_a=f16,type_b=f32,m=16,n=8,k=256,bs=[2,3],nr=[1,1],per=[0,2,1,3]): OK
MUL_MAT(type_a=f16,type_b=f32,m=16,n=8,k=256,bs=[2,3],nr=[1,1],per=[0,1,3,2]): OK
MUL_MAT(type_a=f16,type_b=f32,m=16,n=8,k=256,bs=[2,3],nr=[1,1],per=[0,3,2,1]): OK
MUL_MAT(type_a=f16,type_b=f32,m=16,n=16,k=256,bs=[2,3],nr=[1,1],per=[0,2,1,3]): OK
MUL_MAT(type_a=f16,type_b=f32,m=16,n=16,k=256,bs=[2,3],nr=[1,1],per=[0,1,3,2]): OK
MUL_MAT(type_a=f16,type_b=f32,m=16,n=16,k=256,bs=[2,3],nr=[1,1],per=[0,3,2,1]): OK
MUL_MAT(type_a=f16,type_b=f16,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 0.511533048 > 0.000500000 FAIL
MUL_MAT(type_a=f16,type_b=f16,m=16,n=1,k=256,bs=[10,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 0.953053376 > 0.000500000 FAIL
MUL_MAT(type_a=f16,type_b=f16,m=16,n=1,k=256,bs=[10,1],nr=[2,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 0.964825300 > 0.000500000 FAIL
MUL_MAT(type_a=f16,type_b=f16,m=16,n=1,k=256,bs=[10,10],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 1.069852662 > 0.000500000 FAIL
MUL_MAT(type_a=f16,type_b=f16,m=16,n=1,k=256,bs=[10,10],nr=[2,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 1.039402009 > 0.000500000 FAIL
MUL_MAT(type_a=f16,type_b=f16,m=16,n=1,k=256,bs=[10,10],nr=[1,2],per=[0,1,2,3]): not supported [Vulkan0]
MUL_MAT(type_a=f16,type_b=f16,m=16,n=1,k=256,bs=[10,10],nr=[2,2],per=[0,1,2,3]): not supported [Vulkan0]
MUL_MAT(type_a=f16,type_b=f16,m=16,n=16,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=f16,type_b=f16,m=16,n=16,k=256,bs=[10,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=f16,type_b=f16,m=16,n=16,k=256,bs=[10,1],nr=[2,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=f16,type_b=f16,m=16,n=16,k=256,bs=[10,10],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=f16,type_b=f16,m=16,n=16,k=256,bs=[10,10],nr=[2,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=f16,type_b=f16,m=16,n=16,k=256,bs=[10,10],nr=[1,2],per=[0,1,2,3]): not supported [Vulkan0]
MUL_MAT(type_a=f16,type_b=f16,m=16,n=16,k=256,bs=[10,10],nr=[2,2],per=[0,1,2,3]): not supported [Vulkan0]
MUL_MAT(type_a=f16,type_b=f16,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,2,1,3]): [MUL_MAT] NMSE = 0.980613477 > 0.000500000 FAIL
MUL_MAT(type_a=f16,type_b=f16,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,1,3,2]): [MUL_MAT] NMSE = 1.105234343 > 0.000500000 FAIL
MUL_MAT(type_a=f16,type_b=f16,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,3,2,1]): [MUL_MAT] NMSE = 0.853081975 > 0.000500000 FAIL
MUL_MAT(type_a=f16,type_b=f16,m=16,n=8,k=256,bs=[2,3],nr=[1,1],per=[0,2,1,3]): OK
MUL_MAT(type_a=f16,type_b=f16,m=16,n=8,k=256,bs=[2,3],nr=[1,1],per=[0,1,3,2]): OK
MUL_MAT(type_a=f16,type_b=f16,m=16,n=8,k=256,bs=[2,3],nr=[1,1],per=[0,3,2,1]): OK
MUL_MAT(type_a=f16,type_b=f16,m=16,n=16,k=256,bs=[2,3],nr=[1,1],per=[0,2,1,3]): OK
MUL_MAT(type_a=f16,type_b=f16,m=16,n=16,k=256,bs=[2,3],nr=[1,1],per=[0,1,3,2]): OK
MUL_MAT(type_a=f16,type_b=f16,m=16,n=16,k=256,bs=[2,3],nr=[1,1],per=[0,3,2,1]): OK
MUL_MAT(type_a=q8_0,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q8_0,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q8_0,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[2,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q8_0,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q8_0,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[2,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q8_0,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[1,2],per=[0,1,2,3]): not supported [Vulkan0]
MUL_MAT(type_a=q8_0,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[2,2],per=[0,1,2,3]): not supported [Vulkan0]
MUL_MAT(type_a=q8_0,type_b=f32,m=16,n=16,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q8_0,type_b=f32,m=16,n=16,k=256,bs=[10,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q8_0,type_b=f32,m=16,n=16,k=256,bs=[10,1],nr=[2,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q8_0,type_b=f32,m=16,n=16,k=256,bs=[10,10],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q8_0,type_b=f32,m=16,n=16,k=256,bs=[10,10],nr=[2,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q8_0,type_b=f32,m=16,n=16,k=256,bs=[10,10],nr=[1,2],per=[0,1,2,3]): not supported [Vulkan0]
MUL_MAT(type_a=q8_0,type_b=f32,m=16,n=16,k=256,bs=[10,10],nr=[2,2],per=[0,1,2,3]): not supported [Vulkan0]
MUL_MAT(type_a=q8_0,type_b=f32,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,2,1,3]): not supported [Vulkan0]
MUL_MAT(type_a=q8_0,type_b=f32,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,1,3,2]): not supported [Vulkan0]
MUL_MAT(type_a=q8_0,type_b=f32,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,3,2,1]): not supported [Vulkan0]
MUL_MAT(type_a=q8_0,type_b=f32,m=16,n=8,k=256,bs=[2,3],nr=[1,1],per=[0,2,1,3]): not supported [Vulkan0]
MUL_MAT(type_a=q8_0,type_b=f32,m=16,n=8,k=256,bs=[2,3],nr=[1,1],per=[0,1,3,2]): not supported [Vulkan0]
MUL_MAT(type_a=q8_0,type_b=f32,m=16,n=8,k=256,bs=[2,3],nr=[1,1],per=[0,3,2,1]): not supported [Vulkan0]
MUL_MAT(type_a=q8_0,type_b=f32,m=16,n=16,k=256,bs=[2,3],nr=[1,1],per=[0,2,1,3]): not supported [Vulkan0]
MUL_MAT(type_a=q8_0,type_b=f32,m=16,n=16,k=256,bs=[2,3],nr=[1,1],per=[0,1,3,2]): not supported [Vulkan0]
MUL_MAT(type_a=q8_0,type_b=f32,m=16,n=16,k=256,bs=[2,3],nr=[1,1],per=[0,3,2,1]): not supported [Vulkan0]
MUL_MAT(type_a=q8_0,type_b=f16,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported [CPU]
MUL_MAT(type_a=q8_0,type_b=f16,m=16,n=1,k=256,bs=[10,1],nr=[1,1],per=[0,1,2,3]): not supported [CPU]
MUL_MAT(type_a=q8_0,type_b=f16,m=16,n=1,k=256,bs=[10,1],nr=[2,1],per=[0,1,2,3]): not supported [CPU]
MUL_MAT(type_a=q8_0,type_b=f16,m=16,n=1,k=256,bs=[10,10],nr=[1,1],per=[0,1,2,3]): not supported [CPU]
MUL_MAT(type_a=q8_0,type_b=f16,m=16,n=1,k=256,bs=[10,10],nr=[2,1],per=[0,1,2,3]): not supported [CPU]
MUL_MAT(type_a=q8_0,type_b=f16,m=16,n=1,k=256,bs=[10,10],nr=[1,2],per=[0,1,2,3]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=q8_0,type_b=f16,m=16,n=1,k=256,bs=[10,10],nr=[2,2],per=[0,1,2,3]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=q8_0,type_b=f16,m=16,n=16,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported [CPU]
MUL_MAT(type_a=q8_0,type_b=f16,m=16,n=16,k=256,bs=[10,1],nr=[1,1],per=[0,1,2,3]): not supported [CPU]
MUL_MAT(type_a=q8_0,type_b=f16,m=16,n=16,k=256,bs=[10,1],nr=[2,1],per=[0,1,2,3]): not supported [CPU]
MUL_MAT(type_a=q8_0,type_b=f16,m=16,n=16,k=256,bs=[10,10],nr=[1,1],per=[0,1,2,3]): not supported [CPU]
MUL_MAT(type_a=q8_0,type_b=f16,m=16,n=16,k=256,bs=[10,10],nr=[2,1],per=[0,1,2,3]): not supported [CPU]
MUL_MAT(type_a=q8_0,type_b=f16,m=16,n=16,k=256,bs=[10,10],nr=[1,2],per=[0,1,2,3]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=q8_0,type_b=f16,m=16,n=16,k=256,bs=[10,10],nr=[2,2],per=[0,1,2,3]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=q8_0,type_b=f16,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,2,1,3]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=q8_0,type_b=f16,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,1,3,2]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=q8_0,type_b=f16,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,3,2,1]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=q8_0,type_b=f16,m=16,n=8,k=256,bs=[2,3],nr=[1,1],per=[0,2,1,3]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=q8_0,type_b=f16,m=16,n=8,k=256,bs=[2,3],nr=[1,1],per=[0,1,3,2]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=q8_0,type_b=f16,m=16,n=8,k=256,bs=[2,3],nr=[1,1],per=[0,3,2,1]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=q8_0,type_b=f16,m=16,n=16,k=256,bs=[2,3],nr=[1,1],per=[0,2,1,3]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=q8_0,type_b=f16,m=16,n=16,k=256,bs=[2,3],nr=[1,1],per=[0,1,3,2]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=q8_0,type_b=f16,m=16,n=16,k=256,bs=[2,3],nr=[1,1],per=[0,3,2,1]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 0.430198620 > 0.000500000 FAIL
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 0.789518887 > 0.000500000 FAIL
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[2,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 0.982780795 > 0.000500000 FAIL
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 0.965847448 > 0.000500000 FAIL
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[2,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 0.961675158 > 0.000500000 FAIL
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[1,2],per=[0,1,2,3]): not supported [Vulkan0]
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[2,2],per=[0,1,2,3]): not supported [Vulkan0]
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=16,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 1.000097220 > 0.000500000 FAIL
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=16,k=256,bs=[10,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NaN at index 0 (Vulkan0=nan CPU=-0.167255) FAIL
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=16,k=256,bs=[10,1],nr=[2,1],per=[0,1,2,3]): [MUL_MAT] NaN at index 0 (Vulkan0=nan CPU=0.519176) FAIL
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=16,k=256,bs=[10,10],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] inf mismatch: Vulkan0=-inf CPU=0.861297 FAIL
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=16,k=256,bs=[10,10],nr=[2,1],per=[0,1,2,3]): [MUL_MAT] inf mismatch: Vulkan0=inf CPU=7.002733 FAIL
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=16,k=256,bs=[10,10],nr=[1,2],per=[0,1,2,3]): not supported [Vulkan0]
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=16,k=256,bs=[10,10],nr=[2,2],per=[0,1,2,3]): not supported [Vulkan0]
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,2,1,3]): not supported [Vulkan0]
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,1,3,2]): not supported [Vulkan0]
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,3,2,1]): not supported [Vulkan0]
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=8,k=256,bs=[2,3],nr=[1,1],per=[0,2,1,3]): not supported [Vulkan0]
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=8,k=256,bs=[2,3],nr=[1,1],per=[0,1,3,2]): not supported [Vulkan0]
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=8,k=256,bs=[2,3],nr=[1,1],per=[0,3,2,1]): not supported [Vulkan0]
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=16,k=256,bs=[2,3],nr=[1,1],per=[0,2,1,3]): not supported [Vulkan0]
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=16,k=256,bs=[2,3],nr=[1,1],per=[0,1,3,2]): not supported [Vulkan0]
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=16,k=256,bs=[2,3],nr=[1,1],per=[0,3,2,1]): not supported [Vulkan0]
MUL_MAT(type_a=q4_0,type_b=f16,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported [CPU]
MUL_MAT(type_a=q4_0,type_b=f16,m=16,n=1,k=256,bs=[10,1],nr=[1,1],per=[0,1,2,3]): not supported [CPU]
MUL_MAT(type_a=q4_0,type_b=f16,m=16,n=1,k=256,bs=[10,1],nr=[2,1],per=[0,1,2,3]): not supported [CPU]
MUL_MAT(type_a=q4_0,type_b=f16,m=16,n=1,k=256,bs=[10,10],nr=[1,1],per=[0,1,2,3]): not supported [CPU]
MUL_MAT(type_a=q4_0,type_b=f16,m=16,n=1,k=256,bs=[10,10],nr=[2,1],per=[0,1,2,3]): not supported [CPU]
MUL_MAT(type_a=q4_0,type_b=f16,m=16,n=1,k=256,bs=[10,10],nr=[1,2],per=[0,1,2,3]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=q4_0,type_b=f16,m=16,n=1,k=256,bs=[10,10],nr=[2,2],per=[0,1,2,3]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=q4_0,type_b=f16,m=16,n=16,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported [CPU]
MUL_MAT(type_a=q4_0,type_b=f16,m=16,n=16,k=256,bs=[10,1],nr=[1,1],per=[0,1,2,3]): not supported [CPU]
MUL_MAT(type_a=q4_0,type_b=f16,m=16,n=16,k=256,bs=[10,1],nr=[2,1],per=[0,1,2,3]): not supported [CPU]
MUL_MAT(type_a=q4_0,type_b=f16,m=16,n=16,k=256,bs=[10,10],nr=[1,1],per=[0,1,2,3]): not supported [CPU]
MUL_MAT(type_a=q4_0,type_b=f16,m=16,n=16,k=256,bs=[10,10],nr=[2,1],per=[0,1,2,3]): not supported [CPU]
MUL_MAT(type_a=q4_0,type_b=f16,m=16,n=16,k=256,bs=[10,10],nr=[1,2],per=[0,1,2,3]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=q4_0,type_b=f16,m=16,n=16,k=256,bs=[10,10],nr=[2,2],per=[0,1,2,3]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=q4_0,type_b=f16,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,2,1,3]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=q4_0,type_b=f16,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,1,3,2]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=q4_0,type_b=f16,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,3,2,1]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=q4_0,type_b=f16,m=16,n=8,k=256,bs=[2,3],nr=[1,1],per=[0,2,1,3]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=q4_0,type_b=f16,m=16,n=8,k=256,bs=[2,3],nr=[1,1],per=[0,1,3,2]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=q4_0,type_b=f16,m=16,n=8,k=256,bs=[2,3],nr=[1,1],per=[0,3,2,1]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=q4_0,type_b=f16,m=16,n=16,k=256,bs=[2,3],nr=[1,1],per=[0,2,1,3]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=q4_0,type_b=f16,m=16,n=16,k=256,bs=[2,3],nr=[1,1],per=[0,1,3,2]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=q4_0,type_b=f16,m=16,n=16,k=256,bs=[2,3],nr=[1,1],per=[0,3,2,1]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=q4_1,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 1.156249268 > 0.000500000 FAIL
MUL_MAT(type_a=q4_1,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 1.355262382 > 0.000500000 FAIL
MUL_MAT(type_a=q4_1,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[2,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 0.968571356 > 0.000500000 FAIL
MUL_MAT(type_a=q4_1,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 0.998303939 > 0.000500000 FAIL
MUL_MAT(type_a=q4_1,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[2,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 1.050766339 > 0.000500000 FAIL
MUL_MAT(type_a=q4_1,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[1,2],per=[0,1,2,3]): not supported [Vulkan0]
MUL_MAT(type_a=q4_1,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[2,2],per=[0,1,2,3]): not supported [Vulkan0]
MUL_MAT(type_a=q4_1,type_b=f32,m=16,n=16,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] inf mismatch: Vulkan0=-inf CPU=-9.967062 FAIL
MUL_MAT(type_a=q4_1,type_b=f32,m=16,n=16,k=256,bs=[10,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] inf mismatch: Vulkan0=inf CPU=1.935561 FAIL
MUL_MAT(type_a=q4_1,type_b=f32,m=16,n=16,k=256,bs=[10,1],nr=[2,1],per=[0,1,2,3]): [MUL_MAT] NaN at index 1 (Vulkan0=nan CPU=8.809542) FAIL
MUL_MAT(type_a=q4_1,type_b=f32,m=16,n=16,k=256,bs=[10,10],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] inf mismatch: Vulkan0=-inf CPU=-0.884140 FAIL
MUL_MAT(type_a=q4_1,type_b=f32,m=16,n=16,k=256,bs=[10,10],nr=[2,1],per=[0,1,2,3]): [MUL_MAT] NaN at index 1 (Vulkan0=nan CPU=3.891183) FAIL
MUL_MAT(type_a=q4_1,type_b=f32,m=16,n=16,k=256,bs=[10,10],nr=[1,2],per=[0,1,2,3]): not supported [Vulkan0]
MUL_MAT(type_a=q4_1,type_b=f32,m=16,n=16,k=256,bs=[10,10],nr=[2,2],per=[0,1,2,3]): not supported [Vulkan0]
MUL_MAT(type_a=q4_1,type_b=f32,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,2,1,3]): not supported [Vulkan0]
MUL_MAT(type_a=q4_1,type_b=f32,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,1,3,2]): not supported [Vulkan0]
MUL_MAT(type_a=q4_1,type_b=f32,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,3,2,1]): not supported [Vulkan0]
MUL_MAT(type_a=q4_1,type_b=f32,m=16,n=8,k=256,bs=[2,3],nr=[1,1],per=[0,2,1,3]): not supported [Vulkan0]
MUL_MAT(type_a=q4_1,type_b=f32,m=16,n=8,k=256,bs=[2,3],nr=[1,1],per=[0,1,3,2]): not supported [Vulkan0]
MUL_MAT(type_a=q4_1,type_b=f32,m=16,n=8,k=256,bs=[2,3],nr=[1,1],per=[0,3,2,1]): not supported [Vulkan0]
MUL_MAT(type_a=q4_1,type_b=f32,m=16,n=16,k=256,bs=[2,3],nr=[1,1],per=[0,2,1,3]): not supported [Vulkan0]
MUL_MAT(type_a=q4_1,type_b=f32,m=16,n=16,k=256,bs=[2,3],nr=[1,1],per=[0,1,3,2]): not supported [Vulkan0]
MUL_MAT(type_a=q4_1,type_b=f32,m=16,n=16,k=256,bs=[2,3],nr=[1,1],per=[0,3,2,1]): not supported [Vulkan0]
MUL_MAT(type_a=q4_1,type_b=f16,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported [CPU]
MUL_MAT(type_a=q4_1,type_b=f16,m=16,n=1,k=256,bs=[10,1],nr=[1,1],per=[0,1,2,3]): not supported [CPU]
MUL_MAT(type_a=q4_1,type_b=f16,m=16,n=1,k=256,bs=[10,1],nr=[2,1],per=[0,1,2,3]): not supported [CPU]
MUL_MAT(type_a=q4_1,type_b=f16,m=16,n=1,k=256,bs=[10,10],nr=[1,1],per=[0,1,2,3]): not supported [CPU]
MUL_MAT(type_a=q4_1,type_b=f16,m=16,n=1,k=256,bs=[10,10],nr=[2,1],per=[0,1,2,3]): not supported [CPU]
MUL_MAT(type_a=q4_1,type_b=f16,m=16,n=1,k=256,bs=[10,10],nr=[1,2],per=[0,1,2,3]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=q4_1,type_b=f16,m=16,n=1,k=256,bs=[10,10],nr=[2,2],per=[0,1,2,3]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=q4_1,type_b=f16,m=16,n=16,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported [CPU]
MUL_MAT(type_a=q4_1,type_b=f16,m=16,n=16,k=256,bs=[10,1],nr=[1,1],per=[0,1,2,3]): not supported [CPU]
MUL_MAT(type_a=q4_1,type_b=f16,m=16,n=16,k=256,bs=[10,1],nr=[2,1],per=[0,1,2,3]): not supported [CPU]
MUL_MAT(type_a=q4_1,type_b=f16,m=16,n=16,k=256,bs=[10,10],nr=[1,1],per=[0,1,2,3]): not supported [CPU]
MUL_MAT(type_a=q4_1,type_b=f16,m=16,n=16,k=256,bs=[10,10],nr=[2,1],per=[0,1,2,3]): not supported [CPU]
MUL_MAT(type_a=q4_1,type_b=f16,m=16,n=16,k=256,bs=[10,10],nr=[1,2],per=[0,1,2,3]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=q4_1,type_b=f16,m=16,n=16,k=256,bs=[10,10],nr=[2,2],per=[0,1,2,3]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=q4_1,type_b=f16,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,2,1,3]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=q4_1,type_b=f16,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,1,3,2]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=q4_1,type_b=f16,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,3,2,1]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=q4_1,type_b=f16,m=16,n=8,k=256,bs=[2,3],nr=[1,1],per=[0,2,1,3]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=q4_1,type_b=f16,m=16,n=8,k=256,bs=[2,3],nr=[1,1],per=[0,1,3,2]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=q4_1,type_b=f16,m=16,n=8,k=256,bs=[2,3],nr=[1,1],per=[0,3,2,1]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=q4_1,type_b=f16,m=16,n=16,k=256,bs=[2,3],nr=[1,1],per=[0,2,1,3]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=q4_1,type_b=f16,m=16,n=16,k=256,bs=[2,3],nr=[1,1],per=[0,1,3,2]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=q4_1,type_b=f16,m=16,n=16,k=256,bs=[2,3],nr=[1,1],per=[0,3,2,1]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=q4_K,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q4_K,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q4_K,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[2,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q4_K,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q4_K,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[2,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q4_K,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[1,2],per=[0,1,2,3]): not supported [Vulkan0]
MUL_MAT(type_a=q4_K,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[2,2],per=[0,1,2,3]): not supported [Vulkan0]
MUL_MAT(type_a=q4_K,type_b=f32,m=16,n=16,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q4_K,type_b=f32,m=16,n=16,k=256,bs=[10,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q4_K,type_b=f32,m=16,n=16,k=256,bs=[10,1],nr=[2,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q4_K,type_b=f32,m=16,n=16,k=256,bs=[10,10],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q4_K,type_b=f32,m=16,n=16,k=256,bs=[10,10],nr=[2,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q4_K,type_b=f32,m=16,n=16,k=256,bs=[10,10],nr=[1,2],per=[0,1,2,3]): not supported [Vulkan0]
MUL_MAT(type_a=q4_K,type_b=f32,m=16,n=16,k=256,bs=[10,10],nr=[2,2],per=[0,1,2,3]): not supported [Vulkan0]
MUL_MAT(type_a=q4_K,type_b=f32,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,2,1,3]): not supported [Vulkan0]
MUL_MAT(type_a=q4_K,type_b=f32,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,1,3,2]): not supported [Vulkan0]
MUL_MAT(type_a=q4_K,type_b=f32,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,3,2,1]): not supported [Vulkan0]
MUL_MAT(type_a=q4_K,type_b=f32,m=16,n=8,k=256,bs=[2,3],nr=[1,1],per=[0,2,1,3]): not supported [Vulkan0]
MUL_MAT(type_a=q4_K,type_b=f32,m=16,n=8,k=256,bs=[2,3],nr=[1,1],per=[0,1,3,2]): not supported [Vulkan0]
MUL_MAT(type_a=q4_K,type_b=f32,m=16,n=8,k=256,bs=[2,3],nr=[1,1],per=[0,3,2,1]): not supported [Vulkan0]
MUL_MAT(type_a=q4_K,type_b=f32,m=16,n=16,k=256,bs=[2,3],nr=[1,1],per=[0,2,1,3]): not supported [Vulkan0]
MUL_MAT(type_a=q4_K,type_b=f32,m=16,n=16,k=256,bs=[2,3],nr=[1,1],per=[0,1,3,2]): not supported [Vulkan0]
MUL_MAT(type_a=q4_K,type_b=f32,m=16,n=16,k=256,bs=[2,3],nr=[1,1],per=[0,3,2,1]): not supported [Vulkan0]
MUL_MAT(type_a=q4_K,type_b=f16,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported [CPU]
MUL_MAT(type_a=q4_K,type_b=f16,m=16,n=1,k=256,bs=[10,1],nr=[1,1],per=[0,1,2,3]): not supported [CPU]
MUL_MAT(type_a=q4_K,type_b=f16,m=16,n=1,k=256,bs=[10,1],nr=[2,1],per=[0,1,2,3]): not supported [CPU]
MUL_MAT(type_a=q4_K,type_b=f16,m=16,n=1,k=256,bs=[10,10],nr=[1,1],per=[0,1,2,3]): not supported [CPU]
MUL_MAT(type_a=q4_K,type_b=f16,m=16,n=1,k=256,bs=[10,10],nr=[2,1],per=[0,1,2,3]): not supported [CPU]
MUL_MAT(type_a=q4_K,type_b=f16,m=16,n=1,k=256,bs=[10,10],nr=[1,2],per=[0,1,2,3]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=q4_K,type_b=f16,m=16,n=1,k=256,bs=[10,10],nr=[2,2],per=[0,1,2,3]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=q4_K,type_b=f16,m=16,n=16,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported [CPU]
MUL_MAT(type_a=q4_K,type_b=f16,m=16,n=16,k=256,bs=[10,1],nr=[1,1],per=[0,1,2,3]): not supported [CPU]
MUL_MAT(type_a=q4_K,type_b=f16,m=16,n=16,k=256,bs=[10,1],nr=[2,1],per=[0,1,2,3]): not supported [CPU]
MUL_MAT(type_a=q4_K,type_b=f16,m=16,n=16,k=256,bs=[10,10],nr=[1,1],per=[0,1,2,3]): not supported [CPU]
MUL_MAT(type_a=q4_K,type_b=f16,m=16,n=16,k=256,bs=[10,10],nr=[2,1],per=[0,1,2,3]): not supported [CPU]
MUL_MAT(type_a=q4_K,type_b=f16,m=16,n=16,k=256,bs=[10,10],nr=[1,2],per=[0,1,2,3]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=q4_K,type_b=f16,m=16,n=16,k=256,bs=[10,10],nr=[2,2],per=[0,1,2,3]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=q4_K,type_b=f16,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,2,1,3]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=q4_K,type_b=f16,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,1,3,2]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=q4_K,type_b=f16,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,3,2,1]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=q4_K,type_b=f16,m=16,n=8,k=256,bs=[2,3],nr=[1,1],per=[0,2,1,3]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=q4_K,type_b=f16,m=16,n=8,k=256,bs=[2,3],nr=[1,1],per=[0,1,3,2]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=q4_K,type_b=f16,m=16,n=8,k=256,bs=[2,3],nr=[1,1],per=[0,3,2,1]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=q4_K,type_b=f16,m=16,n=16,k=256,bs=[2,3],nr=[1,1],per=[0,2,1,3]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=q4_K,type_b=f16,m=16,n=16,k=256,bs=[2,3],nr=[1,1],per=[0,1,3,2]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=q4_K,type_b=f16,m=16,n=16,k=256,bs=[2,3],nr=[1,1],per=[0,3,2,1]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported [Vulkan0]
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[1,1],per=[0,1,2,3]): not supported [Vulkan0]
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[2,1],per=[0,1,2,3]): not supported [Vulkan0]
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[1,1],per=[0,1,2,3]): not supported [Vulkan0]
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[2,1],per=[0,1,2,3]): not supported [Vulkan0]
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[1,2],per=[0,1,2,3]): not supported [Vulkan0]
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[2,2],per=[0,1,2,3]): not supported [Vulkan0]
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=16,n=16,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported [Vulkan0]
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=16,n=16,k=256,bs=[10,1],nr=[1,1],per=[0,1,2,3]): not supported [Vulkan0]
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=16,n=16,k=256,bs=[10,1],nr=[2,1],per=[0,1,2,3]): not supported [Vulkan0]
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=16,n=16,k=256,bs=[10,10],nr=[1,1],per=[0,1,2,3]): not supported [Vulkan0]
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=16,n=16,k=256,bs=[10,10],nr=[2,1],per=[0,1,2,3]): not supported [Vulkan0]
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=16,n=16,k=256,bs=[10,10],nr=[1,2],per=[0,1,2,3]): not supported [Vulkan0]
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=16,n=16,k=256,bs=[10,10],nr=[2,2],per=[0,1,2,3]): not supported [Vulkan0]
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,2,1,3]): not supported [Vulkan0]
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,1,3,2]): not supported [Vulkan0]
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,3,2,1]): not supported [Vulkan0]
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=16,n=8,k=256,bs=[2,3],nr=[1,1],per=[0,2,1,3]): not supported [Vulkan0]
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=16,n=8,k=256,bs=[2,3],nr=[1,1],per=[0,1,3,2]): not supported [Vulkan0]
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=16,n=8,k=256,bs=[2,3],nr=[1,1],per=[0,3,2,1]): not supported [Vulkan0]
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=16,n=16,k=256,bs=[2,3],nr=[1,1],per=[0,2,1,3]): not supported [Vulkan0]
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=16,n=16,k=256,bs=[2,3],nr=[1,1],per=[0,1,3,2]): not supported [Vulkan0]
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=16,n=16,k=256,bs=[2,3],nr=[1,1],per=[0,3,2,1]): not supported [Vulkan0]
MUL_MAT(type_a=iq2_xxs,type_b=f16,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=iq2_xxs,type_b=f16,m=16,n=1,k=256,bs=[10,1],nr=[1,1],per=[0,1,2,3]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=iq2_xxs,type_b=f16,m=16,n=1,k=256,bs=[10,1],nr=[2,1],per=[0,1,2,3]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=iq2_xxs,type_b=f16,m=16,n=1,k=256,bs=[10,10],nr=[1,1],per=[0,1,2,3]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=iq2_xxs,type_b=f16,m=16,n=1,k=256,bs=[10,10],nr=[2,1],per=[0,1,2,3]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=iq2_xxs,type_b=f16,m=16,n=1,k=256,bs=[10,10],nr=[1,2],per=[0,1,2,3]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=iq2_xxs,type_b=f16,m=16,n=1,k=256,bs=[10,10],nr=[2,2],per=[0,1,2,3]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=iq2_xxs,type_b=f16,m=16,n=16,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=iq2_xxs,type_b=f16,m=16,n=16,k=256,bs=[10,1],nr=[1,1],per=[0,1,2,3]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=iq2_xxs,type_b=f16,m=16,n=16,k=256,bs=[10,1],nr=[2,1],per=[0,1,2,3]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=iq2_xxs,type_b=f16,m=16,n=16,k=256,bs=[10,10],nr=[1,1],per=[0,1,2,3]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=iq2_xxs,type_b=f16,m=16,n=16,k=256,bs=[10,10],nr=[2,1],per=[0,1,2,3]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=iq2_xxs,type_b=f16,m=16,n=16,k=256,bs=[10,10],nr=[1,2],per=[0,1,2,3]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=iq2_xxs,type_b=f16,m=16,n=16,k=256,bs=[10,10],nr=[2,2],per=[0,1,2,3]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=iq2_xxs,type_b=f16,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,2,1,3]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=iq2_xxs,type_b=f16,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,1,3,2]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=iq2_xxs,type_b=f16,m=16,n=1,k=256,bs=[2,3],nr=[1,1],per=[0,3,2,1]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=iq2_xxs,type_b=f16,m=16,n=8,k=256,bs=[2,3],nr=[1,1],per=[0,2,1,3]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=iq2_xxs,type_b=f16,m=16,n=8,k=256,bs=[2,3],nr=[1,1],per=[0,1,3,2]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=iq2_xxs,type_b=f16,m=16,n=8,k=256,bs=[2,3],nr=[1,1],per=[0,3,2,1]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=iq2_xxs,type_b=f16,m=16,n=16,k=256,bs=[2,3],nr=[1,1],per=[0,2,1,3]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=iq2_xxs,type_b=f16,m=16,n=16,k=256,bs=[2,3],nr=[1,1],per=[0,1,3,2]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=iq2_xxs,type_b=f16,m=16,n=16,k=256,bs=[2,3],nr=[1,1],per=[0,3,2,1]): not supported [Vulkan0] not supported [CPU]
MUL_MAT(type_a=q4_1,type_b=f32,m=16,n=1,k=32,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 0.504356424 > 0.000500000 FAIL
MUL_MAT(type_a=q4_1,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 0.652043374 > 0.000500000 FAIL
MUL_MAT(type_a=q5_0,type_b=f32,m=16,n=1,k=32,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 1.219345195 > 0.000500000 FAIL
MUL_MAT(type_a=q5_0,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 1.354649835 > 0.000500000 FAIL
MUL_MAT(type_a=q5_1,type_b=f32,m=16,n=1,k=32,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 1.200800874 > 0.000500000 FAIL
MUL_MAT(type_a=q5_1,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 1.159317903 > 0.000500000 FAIL
MUL_MAT(type_a=q8_0,type_b=f32,m=16,n=1,k=32,bs=[1,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q8_0,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q2_K,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q3_K,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q5_K,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=q6_K,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=iq2_xs,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported [Vulkan0]
MUL_MAT(type_a=iq2_s,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported [Vulkan0]
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported [Vulkan0]
MUL_MAT(type_a=iq1_s,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported [Vulkan0]
MUL_MAT(type_a=iq1_m,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported [Vulkan0]
MUL_MAT(type_a=iq4_nl,type_b=f32,m=16,n=1,k=32,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 36.697929355 > 0.000500000 FAIL
MUL_MAT(type_a=iq4_nl,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 1.766079206 > 0.000500000 FAIL
MUL_MAT(type_a=iq3_s,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported [Vulkan0]
MUL_MAT(type_a=iq4_xs,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported [Vulkan0]
MUL_MAT(type_a=bf16,type_b=f32,m=16,n=1,k=1,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported [Vulkan0]
MUL_MAT(type_a=bf16,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported [Vulkan0]
MUL_MAT(type_a=f16,type_b=f32,m=64,n=2,k=128,bs=[8,1],nr=[1,1],per=[0,1,2,3]): sentinel mismatch: sent_3 FAIL
MUL_MAT(type_a=f16,type_b=f32,m=83,n=2,k=128,bs=[8,1],nr=[4,1],per=[0,1,2,3]): sentinel mismatch: sent_3 FAIL
MUL_MAT(type_a=f16,type_b=f32,m=64,n=2,k=64,bs=[8,1],nr=[4,1],per=[0,1,2,3]): sentinel mismatch: sent_3 FAIL
MUL_MAT(type_a=f16,type_b=f32,m=83,n=2,k=64,bs=[8,1],nr=[4,1],per=[0,1,2,3]): sentinel mismatch: sent_3 FAIL
MUL_MAT(type_a=f16,type_b=f32,m=64,n=45,k=128,bs=[8,1],nr=[4,1],per=[0,1,2,3]): OK
MUL_MAT(type_a=f16,type_b=f32,m=128,n=45,k=64,bs=[8,1],nr=[4,1],per=[0,1,2,3]): OK
MUL_MAT_ID(type_a=f32,type_b=f32,n_mats=4,n_used=1,b=0,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 0.986926430 > 0.000500000 FAIL
MUL_MAT_ID(type_a=f32,type_b=f32,n_mats=4,n_used=1,b=0,m=512,n=32,k=256): OK
MUL_MAT_ID(type_a=f32,type_b=f32,n_mats=4,n_used=1,b=1,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 1.099847455 > 0.000500000 FAIL
MUL_MAT_ID(type_a=f32,type_b=f32,n_mats=4,n_used=1,b=1,m=512,n=32,k=256): OK
MUL_MAT_ID(type_a=f32,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 1.185234078 > 0.000500000 FAIL
MUL_MAT_ID(type_a=f32,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=32,k=256): OK
MUL_MAT_ID(type_a=f32,type_b=f32,n_mats=4,n_used=2,b=1,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 0.858165909 > 0.000500000 FAIL
MUL_MAT_ID(type_a=f32,type_b=f32,n_mats=4,n_used=2,b=1,m=512,n=32,k=256): OK
MUL_MAT_ID(type_a=f32,type_b=f32,n_mats=4,n_used=4,b=0,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 1.047854906 > 0.000500000 FAIL
MUL_MAT_ID(type_a=f32,type_b=f32,n_mats=4,n_used=4,b=0,m=512,n=32,k=256): OK
MUL_MAT_ID(type_a=f32,type_b=f32,n_mats=4,n_used=4,b=1,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 0.911872931 > 0.000500000 FAIL
MUL_MAT_ID(type_a=f32,type_b=f32,n_mats=4,n_used=4,b=1,m=512,n=32,k=256): OK
MUL_MAT_ID(type_a=f32,type_b=f32,n_mats=8,n_used=1,b=0,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 1.031253255 > 0.000500000 FAIL
MUL_MAT_ID(type_a=f32,type_b=f32,n_mats=8,n_used=1,b=0,m=512,n=32,k=256): OK
MUL_MAT_ID(type_a=f32,type_b=f32,n_mats=8,n_used=1,b=1,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 1.125535055 > 0.000500000 FAIL
MUL_MAT_ID(type_a=f32,type_b=f32,n_mats=8,n_used=1,b=1,m=512,n=32,k=256): OK
MUL_MAT_ID(type_a=f32,type_b=f32,n_mats=8,n_used=2,b=0,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 1.053029493 > 0.000500000 FAIL
MUL_MAT_ID(type_a=f32,type_b=f32,n_mats=8,n_used=2,b=0,m=512,n=32,k=256): OK
MUL_MAT_ID(type_a=f32,type_b=f32,n_mats=8,n_used=2,b=1,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 0.823866441 > 0.000500000 FAIL
MUL_MAT_ID(type_a=f32,type_b=f32,n_mats=8,n_used=2,b=1,m=512,n=32,k=256): OK
MUL_MAT_ID(type_a=f32,type_b=f32,n_mats=8,n_used=4,b=0,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 0.945100501 > 0.000500000 FAIL
MUL_MAT_ID(type_a=f32,type_b=f32,n_mats=8,n_used=4,b=0,m=512,n=32,k=256): OK
MUL_MAT_ID(type_a=f32,type_b=f32,n_mats=8,n_used=4,b=1,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 1.008700795 > 0.000500000 FAIL
MUL_MAT_ID(type_a=f32,type_b=f32,n_mats=8,n_used=4,b=1,m=512,n=32,k=256): OK
MUL_MAT_ID(type_a=f16,type_b=f32,n_mats=4,n_used=1,b=0,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 1.154115083 > 0.000500000 FAIL
MUL_MAT_ID(type_a=f16,type_b=f32,n_mats=4,n_used=1,b=0,m=512,n=32,k=256): OK
MUL_MAT_ID(type_a=f16,type_b=f32,n_mats=4,n_used=1,b=1,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 1.075883655 > 0.000500000 FAIL
MUL_MAT_ID(type_a=f16,type_b=f32,n_mats=4,n_used=1,b=1,m=512,n=32,k=256): OK
MUL_MAT_ID(type_a=f16,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 0.837078855 > 0.000500000 FAIL
MUL_MAT_ID(type_a=f16,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=32,k=256): OK
MUL_MAT_ID(type_a=f16,type_b=f32,n_mats=4,n_used=2,b=1,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 0.902566320 > 0.000500000 FAIL
MUL_MAT_ID(type_a=f16,type_b=f32,n_mats=4,n_used=2,b=1,m=512,n=32,k=256): OK
MUL_MAT_ID(type_a=f16,type_b=f32,n_mats=4,n_used=4,b=0,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 1.054161283 > 0.000500000 FAIL
MUL_MAT_ID(type_a=f16,type_b=f32,n_mats=4,n_used=4,b=0,m=512,n=32,k=256): OK
MUL_MAT_ID(type_a=f16,type_b=f32,n_mats=4,n_used=4,b=1,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 0.970386541 > 0.000500000 FAIL
MUL_MAT_ID(type_a=f16,type_b=f32,n_mats=4,n_used=4,b=1,m=512,n=32,k=256): OK
MUL_MAT_ID(type_a=f16,type_b=f32,n_mats=8,n_used=1,b=0,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 0.789987918 > 0.000500000 FAIL
MUL_MAT_ID(type_a=f16,type_b=f32,n_mats=8,n_used=1,b=0,m=512,n=32,k=256): OK
MUL_MAT_ID(type_a=f16,type_b=f32,n_mats=8,n_used=1,b=1,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 0.959842255 > 0.000500000 FAIL
MUL_MAT_ID(type_a=f16,type_b=f32,n_mats=8,n_used=1,b=1,m=512,n=32,k=256): OK
MUL_MAT_ID(type_a=f16,type_b=f32,n_mats=8,n_used=2,b=0,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 0.917264359 > 0.000500000 FAIL
MUL_MAT_ID(type_a=f16,type_b=f32,n_mats=8,n_used=2,b=0,m=512,n=32,k=256): OK
MUL_MAT_ID(type_a=f16,type_b=f32,n_mats=8,n_used=2,b=1,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 0.945457908 > 0.000500000 FAIL
MUL_MAT_ID(type_a=f16,type_b=f32,n_mats=8,n_used=2,b=1,m=512,n=32,k=256): OK
MUL_MAT_ID(type_a=f16,type_b=f32,n_mats=8,n_used=4,b=0,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 0.936030729 > 0.000500000 FAIL
MUL_MAT_ID(type_a=f16,type_b=f32,n_mats=8,n_used=4,b=0,m=512,n=32,k=256): OK
MUL_MAT_ID(type_a=f16,type_b=f32,n_mats=8,n_used=4,b=1,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 1.032995041 > 0.000500000 FAIL
MUL_MAT_ID(type_a=f16,type_b=f32,n_mats=8,n_used=4,b=1,m=512,n=32,k=256): OK
MUL_MAT_ID(type_a=q8_0,type_b=f32,n_mats=4,n_used=1,b=0,m=512,n=1,k=256): OK
MUL_MAT_ID(type_a=q8_0,type_b=f32,n_mats=4,n_used=1,b=0,m=512,n=32,k=256): OK
MUL_MAT_ID(type_a=q8_0,type_b=f32,n_mats=4,n_used=1,b=1,m=512,n=1,k=256): OK
MUL_MAT_ID(type_a=q8_0,type_b=f32,n_mats=4,n_used=1,b=1,m=512,n=32,k=256): OK
MUL_MAT_ID(type_a=q8_0,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=1,k=256): OK
MUL_MAT_ID(type_a=q8_0,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=32,k=256): OK
MUL_MAT_ID(type_a=q8_0,type_b=f32,n_mats=4,n_used=2,b=1,m=512,n=1,k=256): OK
MUL_MAT_ID(type_a=q8_0,type_b=f32,n_mats=4,n_used=2,b=1,m=512,n=32,k=256): OK
MUL_MAT_ID(type_a=q8_0,type_b=f32,n_mats=4,n_used=4,b=0,m=512,n=1,k=256): OK
MUL_MAT_ID(type_a=q8_0,type_b=f32,n_mats=4,n_used=4,b=0,m=512,n=32,k=256): OK
MUL_MAT_ID(type_a=q8_0,type_b=f32,n_mats=4,n_used=4,b=1,m=512,n=1,k=256): OK
MUL_MAT_ID(type_a=q8_0,type_b=f32,n_mats=4,n_used=4,b=1,m=512,n=32,k=256): OK
MUL_MAT_ID(type_a=q8_0,type_b=f32,n_mats=8,n_used=1,b=0,m=512,n=1,k=256): OK
MUL_MAT_ID(type_a=q8_0,type_b=f32,n_mats=8,n_used=1,b=0,m=512,n=32,k=256): OK
MUL_MAT_ID(type_a=q8_0,type_b=f32,n_mats=8,n_used=1,b=1,m=512,n=1,k=256): OK
MUL_MAT_ID(type_a=q8_0,type_b=f32,n_mats=8,n_used=1,b=1,m=512,n=32,k=256): OK
MUL_MAT_ID(type_a=q8_0,type_b=f32,n_mats=8,n_used=2,b=0,m=512,n=1,k=256): OK
MUL_MAT_ID(type_a=q8_0,type_b=f32,n_mats=8,n_used=2,b=0,m=512,n=32,k=256): OK
MUL_MAT_ID(type_a=q8_0,type_b=f32,n_mats=8,n_used=2,b=1,m=512,n=1,k=256): OK
MUL_MAT_ID(type_a=q8_0,type_b=f32,n_mats=8,n_used=2,b=1,m=512,n=32,k=256): OK
MUL_MAT_ID(type_a=q8_0,type_b=f32,n_mats=8,n_used=4,b=0,m=512,n=1,k=256): OK
MUL_MAT_ID(type_a=q8_0,type_b=f32,n_mats=8,n_used=4,b=0,m=512,n=32,k=256): OK
MUL_MAT_ID(type_a=q8_0,type_b=f32,n_mats=8,n_used=4,b=1,m=512,n=1,k=256): OK
MUL_MAT_ID(type_a=q8_0,type_b=f32,n_mats=8,n_used=4,b=1,m=512,n=32,k=256): OK
MUL_MAT_ID(type_a=q4_0,type_b=f32,n_mats=4,n_used=1,b=0,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 0.881440061 > 0.000500000 FAIL
MUL_MAT_ID(type_a=q4_0,type_b=f32,n_mats=4,n_used=1,b=0,m=512,n=32,k=256): [MUL_MAT_ID] inf mismatch: Vulkan0=inf CPU=-1.292153 FAIL
MUL_MAT_ID(type_a=q4_0,type_b=f32,n_mats=4,n_used=1,b=1,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 1.041080903 > 0.000500000 FAIL
MUL_MAT_ID(type_a=q4_0,type_b=f32,n_mats=4,n_used=1,b=1,m=512,n=32,k=256): [MUL_MAT_ID] NaN at index 9 (Vulkan0=nan CPU=3.110385) FAIL
MUL_MAT_ID(type_a=q4_0,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 0.926790394 > 0.000500000 FAIL
MUL_MAT_ID(type_a=q4_0,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=32,k=256): [MUL_MAT_ID] NaN at index 31 (Vulkan0=nan CPU=3.455102) FAIL
MUL_MAT_ID(type_a=q4_0,type_b=f32,n_mats=4,n_used=2,b=1,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 1.060266617 > 0.000500000 FAIL
MUL_MAT_ID(type_a=q4_0,type_b=f32,n_mats=4,n_used=2,b=1,m=512,n=32,k=256): [MUL_MAT_ID] inf mismatch: Vulkan0=inf CPU=-1.640945 FAIL
MUL_MAT_ID(type_a=q4_0,type_b=f32,n_mats=4,n_used=4,b=0,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 0.998599239 > 0.000500000 FAIL
MUL_MAT_ID(type_a=q4_0,type_b=f32,n_mats=4,n_used=4,b=0,m=512,n=32,k=256): [MUL_MAT_ID] inf mismatch: Vulkan0=inf CPU=-6.011796 FAIL
MUL_MAT_ID(type_a=q4_0,type_b=f32,n_mats=4,n_used=4,b=1,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 1.009796868 > 0.000500000 FAIL
MUL_MAT_ID(type_a=q4_0,type_b=f32,n_mats=4,n_used=4,b=1,m=512,n=32,k=256): [MUL_MAT_ID] inf mismatch: Vulkan0=-inf CPU=-0.715701 FAIL
MUL_MAT_ID(type_a=q4_0,type_b=f32,n_mats=8,n_used=1,b=0,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 0.824422922 > 0.000500000 FAIL
MUL_MAT_ID(type_a=q4_0,type_b=f32,n_mats=8,n_used=1,b=0,m=512,n=32,k=256): [MUL_MAT_ID] NaN at index 259 (Vulkan0=nan CPU=-0.453429) FAIL
MUL_MAT_ID(type_a=q4_0,type_b=f32,n_mats=8,n_used=1,b=1,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 1.063046564 > 0.000500000 FAIL
MUL_MAT_ID(type_a=q4_0,type_b=f32,n_mats=8,n_used=1,b=1,m=512,n=32,k=256): [MUL_MAT_ID] inf mismatch: Vulkan0=inf CPU=2.817191 FAIL
MUL_MAT_ID(type_a=q4_0,type_b=f32,n_mats=8,n_used=2,b=0,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 1.153902236 > 0.000500000 FAIL
MUL_MAT_ID(type_a=q4_0,type_b=f32,n_mats=8,n_used=2,b=0,m=512,n=32,k=256): [MUL_MAT_ID] inf mismatch: Vulkan0=-inf CPU=-1.327769 FAIL
MUL_MAT_ID(type_a=q4_0,type_b=f32,n_mats=8,n_used=2,b=1,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 1.008917242 > 0.000500000 FAIL
MUL_MAT_ID(type_a=q4_0,type_b=f32,n_mats=8,n_used=2,b=1,m=512,n=32,k=256): [MUL_MAT_ID] inf mismatch: Vulkan0=-inf CPU=-7.528884 FAIL
MUL_MAT_ID(type_a=q4_0,type_b=f32,n_mats=8,n_used=4,b=0,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 0.957528669 > 0.000500000 FAIL
MUL_MAT_ID(type_a=q4_0,type_b=f32,n_mats=8,n_used=4,b=0,m=512,n=32,k=256): [MUL_MAT_ID] inf mismatch: Vulkan0=inf CPU=8.505741 FAIL
MUL_MAT_ID(type_a=q4_0,type_b=f32,n_mats=8,n_used=4,b=1,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 1.058791222 > 0.000500000 FAIL
MUL_MAT_ID(type_a=q4_0,type_b=f32,n_mats=8,n_used=4,b=1,m=512,n=32,k=256): [MUL_MAT_ID] inf mismatch: Vulkan0=-inf CPU=-1.138852 FAIL
MUL_MAT_ID(type_a=q4_1,type_b=f32,n_mats=4,n_used=1,b=0,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 0.846585690 > 0.000500000 FAIL
MUL_MAT_ID(type_a=q4_1,type_b=f32,n_mats=4,n_used=1,b=0,m=512,n=32,k=256): [MUL_MAT_ID] NaN at index 31 (Vulkan0=nan CPU=2.210225) FAIL
MUL_MAT_ID(type_a=q4_1,type_b=f32,n_mats=4,n_used=1,b=1,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 1.090335979 > 0.000500000 FAIL
MUL_MAT_ID(type_a=q4_1,type_b=f32,n_mats=4,n_used=1,b=1,m=512,n=32,k=256): [MUL_MAT_ID] inf mismatch: Vulkan0=-inf CPU=-4.838865 FAIL
MUL_MAT_ID(type_a=q4_1,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 1.081435556 > 0.000500000 FAIL
MUL_MAT_ID(type_a=q4_1,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=32,k=256): [MUL_MAT_ID] NaN at index 1 (Vulkan0=nan CPU=3.867324) FAIL
MUL_MAT_ID(type_a=q4_1,type_b=f32,n_mats=4,n_used=2,b=1,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 1.048734970 > 0.000500000 FAIL
MUL_MAT_ID(type_a=q4_1,type_b=f32,n_mats=4,n_used=2,b=1,m=512,n=32,k=256): [MUL_MAT_ID] NaN at index 29 (Vulkan0=nan CPU=3.776835) FAIL
MUL_MAT_ID(type_a=q4_1,type_b=f32,n_mats=4,n_used=4,b=0,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 0.956178276 > 0.000500000 FAIL
MUL_MAT_ID(type_a=q4_1,type_b=f32,n_mats=4,n_used=4,b=0,m=512,n=32,k=256): [MUL_MAT_ID] NaN at index 6 (Vulkan0=nan CPU=-8.525706) FAIL
MUL_MAT_ID(type_a=q4_1,type_b=f32,n_mats=4,n_used=4,b=1,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 1.048047361 > 0.000500000 FAIL
MUL_MAT_ID(type_a=q4_1,type_b=f32,n_mats=4,n_used=4,b=1,m=512,n=32,k=256): [MUL_MAT_ID] inf mismatch: Vulkan0=-inf CPU=0.020288 FAIL
MUL_MAT_ID(type_a=q4_1,type_b=f32,n_mats=8,n_used=1,b=0,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 1.014720463 > 0.000500000 FAIL
MUL_MAT_ID(type_a=q4_1,type_b=f32,n_mats=8,n_used=1,b=0,m=512,n=32,k=256): [MUL_MAT_ID] NaN at index 1 (Vulkan0=nan CPU=-1.028401) FAIL
MUL_MAT_ID(type_a=q4_1,type_b=f32,n_mats=8,n_used=1,b=1,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 1.019145924 > 0.000500000 FAIL
MUL_MAT_ID(type_a=q4_1,type_b=f32,n_mats=8,n_used=1,b=1,m=512,n=32,k=256): [MUL_MAT_ID] inf mismatch: Vulkan0=-inf CPU=-4.853091 FAIL
MUL_MAT_ID(type_a=q4_1,type_b=f32,n_mats=8,n_used=2,b=0,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 1.049934331 > 0.000500000 FAIL
MUL_MAT_ID(type_a=q4_1,type_b=f32,n_mats=8,n_used=2,b=0,m=512,n=32,k=256): [MUL_MAT_ID] inf mismatch: Vulkan0=-inf CPU=-7.027616 FAIL
MUL_MAT_ID(type_a=q4_1,type_b=f32,n_mats=8,n_used=2,b=1,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 0.978062698 > 0.000500000 FAIL
MUL_MAT_ID(type_a=q4_1,type_b=f32,n_mats=8,n_used=2,b=1,m=512,n=32,k=256): [MUL_MAT_ID] inf mismatch: Vulkan0=inf CPU=5.804152 FAIL
MUL_MAT_ID(type_a=q4_1,type_b=f32,n_mats=8,n_used=4,b=0,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 1.128771285 > 0.000500000 FAIL
MUL_MAT_ID(type_a=q4_1,type_b=f32,n_mats=8,n_used=4,b=0,m=512,n=32,k=256): [MUL_MAT_ID] inf mismatch: Vulkan0=inf CPU=0.958249 FAIL
MUL_MAT_ID(type_a=q4_1,type_b=f32,n_mats=8,n_used=4,b=1,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 0.955672229 > 0.000500000 FAIL
MUL_MAT_ID(type_a=q4_1,type_b=f32,n_mats=8,n_used=4,b=1,m=512,n=32,k=256): [MUL_MAT_ID] NaN at index 259 (Vulkan0=nan CPU=4.616232) FAIL
MUL_MAT_ID(type_a=q4_K,type_b=f32,n_mats=4,n_used=1,b=0,m=512,n=1,k=256): OK
MUL_MAT_ID(type_a=q4_K,type_b=f32,n_mats=4,n_used=1,b=0,m=512,n=32,k=256): OK
MUL_MAT_ID(type_a=q4_K,type_b=f32,n_mats=4,n_used=1,b=1,m=512,n=1,k=256): OK
MUL_MAT_ID(type_a=q4_K,type_b=f32,n_mats=4,n_used=1,b=1,m=512,n=32,k=256): OK
MUL_MAT_ID(type_a=q4_K,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=1,k=256): OK
MUL_MAT_ID(type_a=q4_K,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=32,k=256): OK
MUL_MAT_ID(type_a=q4_K,type_b=f32,n_mats=4,n_used=2,b=1,m=512,n=1,k=256): OK
MUL_MAT_ID(type_a=q4_K,type_b=f32,n_mats=4,n_used=2,b=1,m=512,n=32,k=256): OK
MUL_MAT_ID(type_a=q4_K,type_b=f32,n_mats=4,n_used=4,b=0,m=512,n=1,k=256): OK
MUL_MAT_ID(type_a=q4_K,type_b=f32,n_mats=4,n_used=4,b=0,m=512,n=32,k=256): OK
MUL_MAT_ID(type_a=q4_K,type_b=f32,n_mats=4,n_used=4,b=1,m=512,n=1,k=256): OK
MUL_MAT_ID(type_a=q4_K,type_b=f32,n_mats=4,n_used=4,b=1,m=512,n=32,k=256): OK
MUL_MAT_ID(type_a=q4_K,type_b=f32,n_mats=8,n_used=1,b=0,m=512,n=1,k=256): OK
MUL_MAT_ID(type_a=q4_K,type_b=f32,n_mats=8,n_used=1,b=0,m=512,n=32,k=256): OK
MUL_MAT_ID(type_a=q4_K,type_b=f32,n_mats=8,n_used=1,b=1,m=512,n=1,k=256): OK
MUL_MAT_ID(type_a=q4_K,type_b=f32,n_mats=8,n_used=1,b=1,m=512,n=32,k=256): OK
MUL_MAT_ID(type_a=q4_K,type_b=f32,n_mats=8,n_used=2,b=0,m=512,n=1,k=256): OK
MUL_MAT_ID(type_a=q4_K,type_b=f32,n_mats=8,n_used=2,b=0,m=512,n=32,k=256): OK
MUL_MAT_ID(type_a=q4_K,type_b=f32,n_mats=8,n_used=2,b=1,m=512,n=1,k=256): OK
MUL_MAT_ID(type_a=q4_K,type_b=f32,n_mats=8,n_used=2,b=1,m=512,n=32,k=256): OK
MUL_MAT_ID(type_a=q4_K,type_b=f32,n_mats=8,n_used=4,b=0,m=512,n=1,k=256): OK
MUL_MAT_ID(type_a=q4_K,type_b=f32,n_mats=8,n_used=4,b=0,m=512,n=32,k=256): OK
MUL_MAT_ID(type_a=q4_K,type_b=f32,n_mats=8,n_used=4,b=1,m=512,n=1,k=256): OK
MUL_MAT_ID(type_a=q4_K,type_b=f32,n_mats=8,n_used=4,b=1,m=512,n=32,k=256): OK
MUL_MAT_ID(type_a=iq2_xxs,type_b=f32,n_mats=4,n_used=1,b=0,m=512,n=1,k=256): not supported [Vulkan0]
MUL_MAT_ID(type_a=iq2_xxs,type_b=f32,n_mats=4,n_used=1,b=0,m=512,n=32,k=256): not supported [Vulkan0]
MUL_MAT_ID(type_a=iq2_xxs,type_b=f32,n_mats=4,n_used=1,b=1,m=512,n=1,k=256): not supported [Vulkan0]
MUL_MAT_ID(type_a=iq2_xxs,type_b=f32,n_mats=4,n_used=1,b=1,m=512,n=32,k=256): not supported [Vulkan0]
MUL_MAT_ID(type_a=iq2_xxs,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=1,k=256): not supported [Vulkan0]
MUL_MAT_ID(type_a=iq2_xxs,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=32,k=256): not supported [Vulkan0]
MUL_MAT_ID(type_a=iq2_xxs,type_b=f32,n_mats=4,n_used=2,b=1,m=512,n=1,k=256): not supported [Vulkan0]
MUL_MAT_ID(type_a=iq2_xxs,type_b=f32,n_mats=4,n_used=2,b=1,m=512,n=32,k=256): not supported [Vulkan0]
MUL_MAT_ID(type_a=iq2_xxs,type_b=f32,n_mats=4,n_used=4,b=0,m=512,n=1,k=256): not supported [Vulkan0]
MUL_MAT_ID(type_a=iq2_xxs,type_b=f32,n_mats=4,n_used=4,b=0,m=512,n=32,k=256): not supported [Vulkan0]
MUL_MAT_ID(type_a=iq2_xxs,type_b=f32,n_mats=4,n_used=4,b=1,m=512,n=1,k=256): not supported [Vulkan0]
MUL_MAT_ID(type_a=iq2_xxs,type_b=f32,n_mats=4,n_used=4,b=1,m=512,n=32,k=256): not supported [Vulkan0]
MUL_MAT_ID(type_a=iq2_xxs,type_b=f32,n_mats=8,n_used=1,b=0,m=512,n=1,k=256): not supported [Vulkan0]
MUL_MAT_ID(type_a=iq2_xxs,type_b=f32,n_mats=8,n_used=1,b=0,m=512,n=32,k=256): not supported [Vulkan0]
MUL_MAT_ID(type_a=iq2_xxs,type_b=f32,n_mats=8,n_used=1,b=1,m=512,n=1,k=256): not supported [Vulkan0]
MUL_MAT_ID(type_a=iq2_xxs,type_b=f32,n_mats=8,n_used=1,b=1,m=512,n=32,k=256): not supported [Vulkan0]
MUL_MAT_ID(type_a=iq2_xxs,type_b=f32,n_mats=8,n_used=2,b=0,m=512,n=1,k=256): not supported [Vulkan0]
MUL_MAT_ID(type_a=iq2_xxs,type_b=f32,n_mats=8,n_used=2,b=0,m=512,n=32,k=256): not supported [Vulkan0]
MUL_MAT_ID(type_a=iq2_xxs,type_b=f32,n_mats=8,n_used=2,b=1,m=512,n=1,k=256): not supported [Vulkan0]
MUL_MAT_ID(type_a=iq2_xxs,type_b=f32,n_mats=8,n_used=2,b=1,m=512,n=32,k=256): not supported [Vulkan0]
MUL_MAT_ID(type_a=iq2_xxs,type_b=f32,n_mats=8,n_used=4,b=0,m=512,n=1,k=256): not supported [Vulkan0]
MUL_MAT_ID(type_a=iq2_xxs,type_b=f32,n_mats=8,n_used=4,b=0,m=512,n=32,k=256): not supported [Vulkan0]
MUL_MAT_ID(type_a=iq2_xxs,type_b=f32,n_mats=8,n_used=4,b=1,m=512,n=1,k=256): not supported [Vulkan0]
MUL_MAT_ID(type_a=iq2_xxs,type_b=f32,n_mats=8,n_used=4,b=1,m=512,n=32,k=256): not supported [Vulkan0]
MUL_MAT_ID(type_a=q4_1,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 1.128282608 > 0.000500000 FAIL
MUL_MAT_ID(type_a=q4_1,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=32,k=256): [MUL_MAT_ID] inf mismatch: Vulkan0=-inf CPU=5.707647 FAIL
MUL_MAT_ID(type_a=q5_0,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 1.081316054 > 0.000500000 FAIL
MUL_MAT_ID(type_a=q5_0,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=32,k=256): [MUL_MAT_ID] NaN at index 31 (Vulkan0=nan CPU=-5.651719) FAIL
MUL_MAT_ID(type_a=q5_1,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 0.955878241 > 0.000500000 FAIL
MUL_MAT_ID(type_a=q5_1,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=32,k=256): [MUL_MAT_ID] inf mismatch: Vulkan0=inf CPU=2.807147 FAIL
MUL_MAT_ID(type_a=q8_0,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=1,k=256): OK
MUL_MAT_ID(type_a=q8_0,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=32,k=256): OK
MUL_MAT_ID(type_a=q2_K,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=1,k=256): OK
MUL_MAT_ID(type_a=q2_K,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=32,k=256): OK
MUL_MAT_ID(type_a=q3_K,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=1,k=256): OK
MUL_MAT_ID(type_a=q3_K,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=32,k=256): OK
MUL_MAT_ID(type_a=q5_K,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=1,k=256): OK
MUL_MAT_ID(type_a=q5_K,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=32,k=256): OK
MUL_MAT_ID(type_a=q6_K,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=1,k=256): OK
MUL_MAT_ID(type_a=q6_K,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=32,k=256): OK
MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=1,k=256): not supported [Vulkan0]
MUL_MAT_ID(type_a=iq2_xs,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=32,k=256): not supported [Vulkan0]
MUL_MAT_ID(type_a=iq2_s,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=1,k=256): not supported [Vulkan0]
MUL_MAT_ID(type_a=iq2_s,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=32,k=256): not supported [Vulkan0]
MUL_MAT_ID(type_a=iq3_xxs,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=1,k=256): not supported [Vulkan0]
MUL_MAT_ID(type_a=iq3_xxs,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=32,k=256): not supported [Vulkan0]
MUL_MAT_ID(type_a=iq1_s,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=1,k=256): not supported [Vulkan0]
MUL_MAT_ID(type_a=iq1_s,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=32,k=256): not supported [Vulkan0]
MUL_MAT_ID(type_a=iq1_m,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=1,k=256): not supported [Vulkan0]
MUL_MAT_ID(type_a=iq1_m,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=32,k=256): not supported [Vulkan0]
MUL_MAT_ID(type_a=iq4_nl,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 0.840918769 > 0.000500000 FAIL
MUL_MAT_ID(type_a=iq4_nl,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=32,k=256): [MUL_MAT_ID] NaN at index 1 (Vulkan0=nan CPU=17.060640) FAIL
MUL_MAT_ID(type_a=iq3_s,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=1,k=256): not supported [Vulkan0]
MUL_MAT_ID(type_a=iq3_s,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=32,k=256): not supported [Vulkan0]
MUL_MAT_ID(type_a=iq4_xs,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=1,k=256): not supported [Vulkan0]
MUL_MAT_ID(type_a=iq4_xs,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=32,k=256): not supported [Vulkan0]
MUL_MAT_ID(type_a=bf16,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=1,k=256): not supported [Vulkan0]
MUL_MAT_ID(type_a=bf16,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=32,k=256): not supported [Vulkan0]

@soerenkampschroer
Author

I've compiled a few different versions using the same settings and tested with the same model as before, gemma-2-9b-it.Q5_K_M.gguf.

Using your branch #10987, the output is broken and ./test-backend-ops passes 1984/2216 tests.

Using the current master branch at commit d79d8f3, the output corrupts and generation stops after three words. test-backend-ops sometimes crashes with libc++abi: terminating due to uncaught exception of type vk::DeviceLostError: vk::Device::waitForFences: ErrorDeviceLost, and other times also passes 1984/2216 tests.

Rolling back to commit 9ba399d reproduces the behavior originally described in this issue: output is fine for about 400 words, then corrupts. test-backend-ops passes 2012/2216 tests on that commit.
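
For reference, one way to build a comparison like the one below is to diff the FAIL lines from two test-backend-ops runs. A minimal Python sketch, assuming each run's stdout was saved to a file (the file names are hypothetical):

  # Sketch: diff the FAIL lines of two test-backend-ops runs.
  # File names are hypothetical; each file is the captured stdout of one run.
  def failing_tests(path: str) -> set[str]:
      # Collect every line the harness marked FAIL.
      with open(path) as f:
          return {line.strip() for line in f if line.rstrip().endswith("FAIL")}

  old_fails = failing_tests("test-ops-9ba399d.log")
  new_fails = failing_tests("test-ops-d79d8f3.log")

  print("newly failing tests:")
  for line in sorted(new_fails - old_fails):
      print(" ", line)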

I've compared the runs, and these are the tests that now fail relative to 9ba399d:

  MUL_MAT(type_a=q4_K,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 1.831905801 > 0.000500000 FAIL
  MUL_MAT(type_a=q5_K,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 2.115625044 > 0.000500000 FAIL
  MUL_MAT(type_a=q6_K,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 0.353522990 > 0.000500000 FAIL

  MUL_MAT(type_a=q4_K,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 0.419946961 > 0.000500000 FAIL
  MUL_MAT(type_a=q4_K,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 1.154934542 > 0.000500000 FAIL
  MUL_MAT(type_a=q4_K,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[2,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 1.102566146 > 0.000500000 FAIL
  MUL_MAT(type_a=q4_K,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 1.027769317 > 0.000500000 FAIL
  MUL_MAT(type_a=q4_K,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[2,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 1.034284787 > 0.000500000 FAIL

  MUL_MAT(type_a=q2_K,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 0.742422317 > 0.000500000 FAIL
  MUL_MAT(type_a=q3_K,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 1.272784687 > 0.000500000 FAIL
  MUL_MAT(type_a=q5_K,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 1.995990571 > 0.000500000 FAIL
  MUL_MAT(type_a=q6_K,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3]): [MUL_MAT] NMSE = 0.634035607 > 0.000500000 FAIL

  MUL_MAT_ID(type_a=q4_K,type_b=f32,n_mats=4,n_used=1,b=0,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 1.106713149 > 0.000500000 FAIL
  MUL_MAT_ID(type_a=q4_K,type_b=f32,n_mats=4,n_used=1,b=1,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 0.985503945 > 0.000500000 FAIL
  MUL_MAT_ID(type_a=q4_K,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 0.867873373 > 0.000500000 FAIL
  MUL_MAT_ID(type_a=q4_K,type_b=f32,n_mats=4,n_used=2,b=1,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 0.868738311 > 0.000500000 FAIL
  MUL_MAT_ID(type_a=q4_K,type_b=f32,n_mats=4,n_used=4,b=0,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 1.065565540 > 0.000500000 FAIL
  MUL_MAT_ID(type_a=q4_K,type_b=f32,n_mats=4,n_used=4,b=1,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 1.006414435 > 0.000500000 FAIL
  MUL_MAT_ID(type_a=q4_K,type_b=f32,n_mats=8,n_used=1,b=0,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 0.917579160 > 0.000500000 FAIL
  MUL_MAT_ID(type_a=q4_K,type_b=f32,n_mats=8,n_used=1,b=1,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 0.938599950 > 0.000500000 FAIL
  MUL_MAT_ID(type_a=q4_K,type_b=f32,n_mats=8,n_used=2,b=0,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 0.958178623 > 0.000500000 FAIL
  MUL_MAT_ID(type_a=q4_K,type_b=f32,n_mats=8,n_used=2,b=1,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 0.953278081 > 0.000500000 FAIL
  MUL_MAT_ID(type_a=q4_K,type_b=f32,n_mats=8,n_used=4,b=0,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 1.051134100 > 0.000500000 FAIL
  MUL_MAT_ID(type_a=q4_K,type_b=f32,n_mats=8,n_used=4,b=1,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 1.050653084 > 0.000500000 FAIL

  MUL_MAT_ID(type_a=q2_K,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 1.109375953 > 0.000500000 FAIL
  MUL_MAT_ID(type_a=q3_K,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 1.041066709 > 0.000500000 FAIL
  MUL_MAT_ID(type_a=q5_K,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 1.038174842 > 0.000500000 FAIL
  MUL_MAT_ID(type_a=q6_K,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=1,k=256): [MUL_MAT_ID] NMSE = 1.008714661 > 0.000500000 FAIL

These tests all passed before; the sketch below shows what the NMSE figures measure.
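
For context on those numbers: test-backend-ops runs each op on the backend and on the CPU, then fails the op when the normalized mean squared error against the CPU reference exceeds the threshold (0.000500000 in these logs), or when the backend output contains NaN/inf that the CPU output does not. A simplified Python sketch of that check (not the actual harness code):

  import numpy as np

  def nmse(backend_out: np.ndarray, cpu_ref: np.ndarray) -> float:
      # Squared error normalized by the reference signal's energy.
      err = backend_out - cpu_ref
      return float(np.sum(err * err) / np.sum(cpu_ref * cpu_ref))

  def op_passes(backend_out: np.ndarray, cpu_ref: np.ndarray,
                threshold: float = 5e-4) -> bool:
      # Non-finite values in the backend output fail immediately,
      # matching the "NaN at index ..." / "inf mismatch" lines above.
      if not np.all(np.isfinite(backend_out)):
          return False
      return nmse(backend_out, cpu_ref) <= threshold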

Going through the results, I noticed that the new q5_K tests fail, and since I had been testing with a q5_K model, I tried a q8_0 model instead. With that model I get the old behavior on all three compiled versions of llama.cpp. So there is definitely a regression of some sort on my machine, though of course I'm not sure whether stable output is even achievable.

@jeffbolznv
Collaborator

I suspect commit d79d8f3 is OK and this is more likely a compiler bug (or a MoltenVK bug, though I didn't see anything obviously wrong in the MSL generated for this shader); that change probably just perturbed things enough to hit it. Can you try #10991? Maybe it will perturb things again so the bug is no longer hit.

@soerenkampschroer
Author

soerenkampschroer commented Dec 28, 2024

#10991 behaves almost the same as d79d8f3: the Q5_K_M model's output is completely corrupted, and the Q8_0 model works for about 400 words, then corrupts and freezes. It also passes fewer tests, at 1969/2216.
