vulkan: TQ4_1S support for model weights #69
TheTom merged 3 commits into TheTom:feature/turboquant-kv-cache
Conversation
On a qwen3.5-35B build run with Turbo3:
system info: n_threads = 24, n_threads_batch = 24, total_threads = 48
system_info: n_threads = 24 (n_threads_batch = 24) / 48 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1
Running without SSL
However, with Turbo2:
system info: n_threads = 24, n_threads_batch = 24, total_threads = 48
system_info: n_threads = 24 (n_threads_batch = 24) / 48 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1
Running without SSL
This GDB supports auto-downloading debuginfo from the following URLs:
@twobombs you posted a long log, is there something you're trying to say? Please put it in a collapsible block.
Head-to-head benchmarks vs TheTom and Duster. Key finding: TBQ is accidentally 1-bit quantization with temperature scaling. 16 new experiment action items from analysis. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
5-14% PPL improvement at ALL context lengths for both 3-bit and 2-bit TCQ. Multiplies stored norm by 1.2 to sharpen attention logits. Beats every competitor at every context length at both bit rates. Override via TURBO_TCQ_ALPHA env var. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed from 45f8a06 to 1073622
Hey @Titaniumtown — we went through the open PRs today and want to merge this, but it needs a rebase onto `feature/turboquant-kv-cache`. Could you rebase onto the latest branch tip?
vulkan: add TQ4_1S weight compression support
Adds Vulkan shader support for TQ4_1S (4-bit WHT-rotated weight
compression with 16 Lloyd-Max centroids, 32-element blocks).
Shaders:
- dequant_tq4_1s.comp: standalone dequant with WHT inverse via
subgroupShuffleXor (32-thread workgroup, 5-stage butterfly)
- mul_mat_vec_tq4_1s.comp: specialized MUL_MAT_VEC with inline
activation pre-rotation (forward RHT on activation, centroid*scale
dequant without inverse RHT)
- copy_from_quant.comp: TQ4_1S dequant path with full WHT inverse
- copy_to_quant.comp: TQ4_1S SET_ROWS quantization path with forward
RHT, dual half-block RMS scales, 16-centroid quantization
- types.glsl: block_tq4_1s struct (d0, d1, qs[16])
- dequant_funcs.glsl: TQ4_1S centroid*scale dequant (no RHT)
Pipeline wiring (ggml-vulkan.cpp):
- MUL_MAT, SET_ROWS, CPY supports_op
- pipeline_dequant, pipeline_set_rows, pipeline_cpy_quant_f32
- Specialized MUL_MAT_VEC with forced subgroup workgroup size
Tests:
- test_set_rows_tq4_1s: SET_ROWS round-trip validation
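For orientation, here is a minimal sketch of the block layout and the no-RHT centroid dequant described above. The field widths follow the commit message (d0, d1, qs[16]); the centroid table values and nibble order are illustrative assumptions, not the shipped code:

```glsl
#extension GL_EXT_shader_explicit_arithmetic_types_float16 : require
#extension GL_EXT_shader_explicit_arithmetic_types_int8 : require

// One TQ4_1S block covers 32 elements (layout per the commit message).
struct block_tq4_1s {
    float16_t d0;     // RMS scale for elements 0..15
    float16_t d1;     // RMS scale for elements 16..31
    uint8_t   qs[16]; // 32 x 4-bit Lloyd-Max centroid indices, two per byte
};

// 16 Lloyd-Max centroid levels -- placeholder values for illustration only.
const float centroid_table[16] = float[16](
    -2.0, -1.5, -1.1, -0.8, -0.55, -0.35, -0.18, -0.05,
     0.05,  0.18,  0.35,  0.55,  0.8,   1.1,   1.5,   2.0);

// centroid*scale dequant without the inverse RHT, mirroring the
// dequant_funcs.glsl path described above (i in [0, 32)).
float dequant_tq4_1s_elem(const in block_tq4_1s b, uint i) {
    const uint  q = (uint(b.qs[i >> 1]) >> ((i & 1u) * 4u)) & 0xFu;
    const float d = (i < 16u) ? float(b.d0) : float(b.d1);
    return centroid_table[q] * d;
}
```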
vulkan: add fused mul_mat_vec kernel for TQ4_1S
Adds a specialised MUL_MAT_VEC shader for TQ4_1S weights so the
per-decode-step matrix-vector product no longer has to dequant the
full weight tensor to f16 and then go through the generic matmul
path. The kernel pre-rotates the activation via a forward
Walsh-Hadamard Transform in shared memory and dot-products against
the raw centroid*scale stored weights, folding the inverse-WHT on
the weight side into the activation by the symmetry H = H^T.
Math:
w[k] = sign[k] * INV_SQRT32 * (H @ stored)[k]
sum_k w[k] * a[k] = INV_SQRT32 * sum_j stored[j] * (H @ (sign * a))[j]
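Written out (with ⊙ the elementwise product), the second line follows from the first by linearity and the symmetry of H:

```latex
\begin{aligned}
\sum_k w_k a_k
&= \tfrac{1}{\sqrt{32}} \sum_k (H\,\mathrm{stored})_k\,(\mathrm{sign}\odot a)_k \\
&= \tfrac{1}{\sqrt{32}} \sum_j \mathrm{stored}_j \big(H^{\mathsf T}(\mathrm{sign}\odot a)\big)_j
 = \tfrac{1}{\sqrt{32}} \sum_j \mathrm{stored}_j \big(H(\mathrm{sign}\odot a)\big)_j,
\end{aligned}
```

where the last step uses H = H^T, so the inverse rotation on the weight side can be applied to the activation instead.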
Portability choices:
- Workgroup size is pinned to 32 threads regardless of the
DMMV_WG_SIZE bucket the rest of the mul_mat_vec family picks for
the current architecture. The butterfly operates on 32-element
blocks with one element per thread; that contract is fixed by the
quantization format, not by the GPU. Earlier revisions used
`gl_WorkGroupSize.x` as the stride unit, which silently skipped
half the work on Intel drivers that force the subgroup to 16
(tests passed via NMSE tolerance while real inference output was
garbage).
- Butterfly implementation is shared memory only (see the sketch after
this list). A subgroup-shuffle
variant (`subgroupShuffleXor`) was prototyped and measured on Intel
Arc A380 with Mesa Xe HPG: it ran ~60-85% slower than the
explicit shared-memory butterfly, because Mesa emulates subgroup
shuffles via LDS and ends up doing the same LDS traffic with extra
driver overhead. The shared-memory butterfly is correct on every
device regardless of subgroup-op support, is the fastest path on
every device we can actually measure, and leaves the
`pipeline_dequant_mul_mat_vec_f32_f32[w][TQ4_1S]` slot uniform
across all DMMV_WG_SIZE buckets.
- Reduction is the shared-memory tree reduction (no subgroupAdd), for
the same reason: on Intel Arc the subgroupAdd is also LDS-backed
and the hybrid reduction path was measurably slower. Future
vendor-specific heuristics can switch to the hybrid or pure-subgroup
reduction variants on NVIDIA / AMD RDNA if hardware subgroup ops
turn out to beat the LDS roundtrip there; the existing reduction
modes in `mul_mat_vec_base.glsl` already provide the necessary
variants.
- NUM_ROWS is 8 so the butterfly cost amortises across 8 output rows
per workgroup. Each thread holds one position of each of the 8
weight blocks and pairs them with the shared rotated activation.
- `mul_mm` and `flash_attn_cm2` shader generation is skipped for
TQ4_1S because it is a weight-only format that never reaches the
coopmat2 matmul or the KV cache flash-attention paths.
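To make the first two choices concrete, here is a minimal, hedged sketch of a 32-thread shared-memory WHT butterfly followed by a shared-memory tree reduction. Buffer bindings, names, and the final store are illustrative assumptions, not the shipped mul_mat_vec_tq4_1s.comp:

```glsl
#version 450
layout(local_size_x = 32, local_size_y = 1, local_size_z = 1) in;

layout(std430, binding = 0) readonly  buffer A { float a[];   }; // activation
layout(std430, binding = 1) writeonly buffer D { float dst[]; };

shared float sm_act[32];
shared float sm_red[32];

const float INV_SQRT32 = 0.1767766952966369; // 1/sqrt(32)

void main() {
    const uint tid = gl_LocalInvocationID.x;

    // One element per thread. The 32-element contract comes from the
    // quantization format, so the constant 32 is used as the stride unit,
    // never gl_WorkGroupSize.x.
    sm_act[tid] = a[gl_WorkGroupID.x * 32u + tid];
    barrier();

    // 5-stage Walsh-Hadamard butterfly: strides 1, 2, 4, 8, 16.
    for (uint stride = 1u; stride < 32u; stride <<= 1u) {
        const float mine  = sm_act[tid];
        const float other = sm_act[tid ^ stride];
        barrier();
        // Lower partner takes the sum, upper partner the difference.
        sm_act[tid] = ((tid & stride) == 0u) ? (mine + other) : (other - mine);
        barrier();
    }

    // Stand-in for the dot-product partials: reduce with a plain
    // shared-memory tree (no subgroupAdd), correct on any subgroup size.
    sm_red[tid] = sm_act[tid] * INV_SQRT32;
    barrier();
    for (uint s = 16u; s > 0u; s >>= 1u) {
        if (tid < s) sm_red[tid] += sm_red[tid + s];
        barrier();
    }
    if (tid == 0u) dst[gl_WorkGroupID.x] = sm_red[0];
}
```

Because every step is workgroup-barrier synchronised, the kernel behaves identically whether the driver picks a subgroup of 16 or 32, which is exactly the Intel failure mode the pinned workgroup size guards against.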
Tests:
- `test-backend-ops` MUL_MAT tolerance tightened from 2.0 to 0.01
NMSE so real defects can't hide behind a loose check.
- Added Gemma-4 E2B, Qwen, Phi and Llama dimensional coverage
(k in {1536, 2048, 2304, 3072, 4096}, m in {256, 1152, 1536,
2048, 5120, 6144}, n in {1..8, 16, 64, 256}). 148 MUL_MAT test
cases total.
Verification (Intel Arc A380, 6 GB VRAM, Vulkan ANV / Mesa Xe HPG,
`llama-bench -p 512 -n 128 -r 3` and `llama-perplexity -c 512
--chunks 20 wiki.test.raw`):
| Model | Config | Size (MiB) | Reduction | PPL Δ | pp512 vs Q8_0 | tg128 vs Q8_0 |
|---------------|---------|----------:|----------:|-------:|---------:|---------:|
| Qwen2.5-1.5B | I | 1570→1082 | -31.1% | +4.66% | 53.9% | 107.5% |
| Phi-3.5-mini | I | 3873→2839 | -26.7% | +5.36% | 57.6% | 52.8% |
| Llama-3.2-3B | hybrid | 3263→2147 | -34.2% | +2.03% | 82.4% | 84.2% |
| Llama-3.2-3B | premium | 3263→2577 | -21.0% | +0.98% | 71.3% | 67.3% |
Qwen2.5-1.5B is faster than its own Q8_0 baseline with Config I:
the compressed model fits in less VRAM, and on a small model the
TQ4_1S compute cost is offset by the reduced memory traffic.
All four models produce coherent output end-to-end and the
reductions line up with the TurboQuant paper's validation matrix
(§5.8). The remaining gap to Q8_0 on the bigger models is
compute-bound on the A380; it closes further on GPUs with more raw
throughput.
vulkan: restructure TQ4_1S inner loop for cross-row smem reuse
Splits the dequant+accumulate phase into two sub-loops:
1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
scale multiply, reads from weight buffer only).
2. Read the rotated activation from shared memory ONCE per column,
then FMA across all rows in a tight register loop.
This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR #57, optimisation #2). It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.
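A hedged sketch of the two-phase structure (names and helper parameters are assumptions, not the shipped shader; centroid_table is the 16-entry Lloyd-Max table from the earlier sketch):

```glsl
#define NUM_ROWS 8

shared float sm_act[32]; // rotated activation, written by the WHT phase

const float centroid_table[16] = float[16](
    -2.0, -1.5, -1.1, -0.8, -0.55, -0.35, -0.18, -0.05,
     0.05,  0.18,  0.35,  0.55,  0.8,   1.1,   1.5,   2.0); // placeholder

void accumulate_column(const uint col, const float scales[NUM_ROWS],
                       const uint qidx[NUM_ROWS], inout float temp[NUM_ROWS]) {
    // Phase 1: dequantize this column's weight for all NUM_ROWS rows
    // (centroid lookup + scale multiply; weight-buffer data only).
    float w_vals[NUM_ROWS];
    for (uint n = 0u; n < NUM_ROWS; ++n) {
        w_vals[n] = centroid_table[qidx[n]] * scales[n];
    }

    // Phase 2: read the rotated activation from shared memory ONCE,
    // then FMA across all rows in registers -- the LDS load is now
    // explicitly loop-invariant across rows.
    const float act = sm_act[col];
    for (uint n = 0u; n < NUM_ROWS; ++n) {
        temp[n] = fma(w_vals[n], act, temp[n]);
    }
}
```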
Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression). The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
Force-pushed from 9f2bab5 to b5be42e
@TheTom done!
Will add it to my test queue, thank you!
Tested — Metal builds clean, turbo3 KV passes sanity check, CUDA changes are additive only (f16+turbo mixed FA instances). No regressions. Merging, thank you @Titaniumtown!
Merged commit 8ba9f12 into TheTom:feature/turboquant-kv-cache
* vulkan: add TQ4_1S weight compression support
* vulkan: add fused mul_mat_vec kernel for TQ4_1S
* vulkan: restructure TQ4_1S inner loop for cross-row smem reuse
Overview
TQ4_1S model weight quantization support for the Vulkan backend.
Additional information
Port of #45 and #57 to Vulkan.
Requirements