vulkan: add SET_ROWS support for turbo2_0 and turbo4_0 (#50) by TheTom · Pull Request #118 · TheTom/llama-cpp-turboquant

TheTom · 2026-05-02T14:42:31Z

Fixes #50.

Bug

@dpblnt's matrix on RX 9060 XT (RADV gfx1200, Mesa 26.0.4) — see comment 4358790411 — isolates the failure cleanly:

| ctk / ctv | result |
|---|---|
| q4_0 / q4_0 | ✅ works |
| q4_0 / turbo3 | ✅ works (apollosenvy PR #87 fix) |
| q4_0 / turbo4 | ❌ aborts at `SET_ROWS` on `cache_v_l3 (view)` |
| q4_0 / turbo2 | ❌ aborts at `SET_ROWS` on `cache_v_l3 (view)` |

Same ggml_backend_sched_backend_id_from_cur → ggml_backend_sched_split_graph → llama_context::sched_reserve stack as turbo3 had pre-#87. Same root cause: the Vulkan backend has no SET_ROWS pipeline registered for turbo2 and turbo4 V types, so the graph scheduler aborts at init.
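A toy model of that failure mode, for readers who have not seen the scheduler path. This is a sketch only: `KvType`, `ToyBackend` and `can_reserve` are hypothetical names made up for illustration, not ggml types or functions, and the real check lives in the backend's supports_op logic.

```cpp
#include <set>

// Hypothetical stand-ins for illustration; not the real ggml types or API.
enum class KvType { Q4_0, Turbo2_0, Turbo3_0, Turbo4_0 };

struct ToyBackend {
    // Cache types for which a SET_ROWS pipeline has been registered.
    std::set<KvType> set_rows_pipelines;

    bool supports_set_rows(KvType t) const {
        return set_rows_pipelines.count(t) != 0;
    }
};

// Mimics a reserve-time capability check: both the K and the V cache tensor
// are pre-allocated on this backend, so the backend must be able to run the
// op that writes into each of them. No compute has happened at this point.
bool can_reserve(const ToyBackend & be, KvType ctk, KvType ctv) {
    return be.supports_set_rows(ctk) && be.supports_set_rows(ctv);
}
```

With only `Q4_0` and `Turbo3_0` registered, `can_reserve(be, KvType::Q4_0, KvType::Turbo4_0)` is false, which corresponds to the abort rows in the matrix; registering the two missing types flips it to true.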

Fix

Mirror of @apollosenvy's turbo3 Vulkan SET_ROWS port (#33 + #87) to the other two turbo types. 4 files, +278/-3 lines, all mechanical:

- vulkan-shaders/types.glsl: block_turbo2_0 and block_turbo4_0 struct declarations matching the C side (ggml/src/ggml-common.h):
  - turbo2_0: float16_t norm + uint8_t qs[32] (2-bit, 4 indices/byte, 34 bytes total)
  - turbo4_0: float16_t norm + float16_t rnorm (reserved) + uint8_t qs[64] (4-bit nibble, 2 indices/byte, 68 bytes total; matches the production TURBO4_USE_4BIT layout)
- vulkan-shaders/copy_to_quant.comp: SET_ROWS quantize main() blocks for turbo2 and turbo4. WHT setup and reduction structure are identical to turbo3 (QK = 128 across all three). Centroid + midpoint tables are ported from CENTROIDS_2BIT and CENTROIDS_4BIT in ggml/src/ggml-turbo-quant.c. turbo2 packs via the same subgroupShuffle pattern as turbo3 (4-per-byte); turbo4 uses pair-shuffle (2-per-byte nibble pack).
- vulkan-shaders/vulkan-shaders-gen.cpp: turbo2_0 and turbo4_0 added to the set_rows iteration list at line ~789.
- ggml-vulkan.cpp, three sites:
  - SET_ROWS pipeline registrations (SET_ROWS(_i32) / SET_ROWS(_i64) macro)
  - supports_op switch for SET_ROWS
  - Dispatch element-count (turbo2/turbo4 also use 128 threads/block like turbo3)
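For reviewers who want a scalar reference to compare shader output against, the packing described above can be sketched on the CPU as follows. This is a sketch under assumptions, not the shader code: the bit order (lowest-numbered element in the lowest bits of each byte) is assumed, `pack_2bit`, `pack_4bit` and `index_from_midpoints` are hypothetical helper names, and the midpoint table is passed in by the caller rather than copied from CENTROIDS_2BIT / CENTROIDS_4BIT.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t QK = 128;  // elements per block, per the PR text

// qs payload sizes implied by the bit widths quoted in the PR.
static_assert(QK * 2 / 8 == 32, "turbo2_0: 2-bit indices, 4 per byte");
static_assert(QK * 4 / 8 == 64, "turbo4_0: 4-bit indices, 2 per byte");

// Nearest-centroid index from the sorted midpoints between adjacent centroids:
// the index equals the number of midpoints that x exceeds.
template <std::size_t N>
std::uint8_t index_from_midpoints(float x, const std::array<float, N> & midpoints) {
    std::uint8_t idx = 0;
    for (float m : midpoints) {
        if (x > m) {
            ++idx;
        }
    }
    return idx;
}

// ASSUMED bit order: element i goes to bits 2*(i%4) .. 2*(i%4)+1 of byte i/4.
std::array<std::uint8_t, QK / 4> pack_2bit(const std::array<std::uint8_t, QK> & idx) {
    std::array<std::uint8_t, QK / 4> qs{};
    for (std::size_t i = 0; i < QK; ++i) {
        qs[i / 4] |= static_cast<std::uint8_t>((idx[i] & 0x3u) << (2 * (i % 4)));
    }
    return qs;
}

// ASSUMED nibble order: even element in the low nibble, odd element in the high nibble.
std::array<std::uint8_t, QK / 2> pack_4bit(const std::array<std::uint8_t, QK> & idx) {
    std::array<std::uint8_t, QK / 2> qs{};
    for (std::size_t i = 0; i < QK; ++i) {
        qs[i / 2] |= static_cast<std::uint8_t>((idx[i] & 0xFu) << (4 * (i % 2)));
    }
    return qs;
}
```

If the real layout uses the opposite bit or nibble order, only the shift expressions change; the payload sizes stay at 32 and 64 bytes.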

Verified on llvmpipe Vulkan (CPU software fallback, AMD MI300X cloud droplet)

Could not validate on real GPU — the MI300X SR-IOV VF on the cloud doesn't expose itself to RADV, and amdvlk isn't packaged for Ubuntu. Patched ggml-vulkan.cpp temporarily during repro to allow llvmpipe (normally filtered out as eCpu); patch reverted before commit.

The SET_ROWS abort is a backend-capability check at graph build time, so it fires regardless of GPU vs CPU Vulkan backend. Both reproduction (pre-fix) and verification (post-fix) work correctly on llvmpipe.

| ctk / ctv | tg16 (t/s) | status |
|---|---:|---|
| q4_0 / q4_0 | 17.68 | baseline |
| q4_0 / turbo3 | 5.91 | already worked (regression check) |
| q4_0 / turbo4 | 6.14 | was aborting → FIXED |
| q4_0 / turbo2 | 5.65 | was aborting → FIXED |

llvmpipe perf numbers are not meaningful (CPU-emulated Vulkan); they're reported only to confirm the abort is gone and the kernels run end-to-end.

Needs GPU validation — please test if you can

What this PR proves on llvmpipe:

✅ Pipeline registration is complete (no SET_ROWS abort)
✅ Shader source compiles under llvmpipe
✅ End-to-end run reaches generation phase

What this PR cannot prove on llvmpipe:

❓ Subgroup shuffle / ballot behavior on real GPU subgroup sizes (RDNA wave 64, NVIDIA wave 32)
❓ Shader compilation under non-llvmpipe Vulkan drivers (RADV, amdvlk, NVIDIA)
❓ PPL / quality on the actual quantization math vs CUDA + Metal references

@dpblnt @apollosenvy if either of you has cycles, would deeply appreciate:

1. Rebuild this branch with your usual cmake flags
2. Confirm llama-bench -ctk q4_0 -ctv turbo4 on Vulkan no longer aborts
3. Confirm output coherence on a small generation (-p "The capital of France is" -n 16); it should produce English, not garbage UTF-8
4. Optional: PPL on wikitext-2 to confirm the quantization math matches CUDA / Metal

If something breaks, the most likely culprit is the subgroupShuffle pattern in the new turbo4 nibble-pack code (lines marked "Pack qs: 2 elements per byte (4-bit nibble each)"). Easy to swap for an alternative pack if the shuffle ordering doesn't match expectations on RDNA.
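One way to reason about that risk before touching hardware is a CPU emulation of the pair-shuffle against a plain scalar pack. The sketch below is not the shader: `emulated_shuffle` is a hypothetical stand-in for GLSL `subgroupShuffle`, and it assumes that consecutive elements sit in consecutive lanes of the same subgroup and that the even lane supplies the low nibble. Whether a given driver actually lays lanes out that way is exactly the open question.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t LANES = 128;  // one invocation per element, per the PR text

using Lanes  = std::array<std::uint8_t, LANES>;
using Packed = std::array<std::uint8_t, LANES / 2>;

// Stand-in for subgroupShuffle(value, id): a lane reads the value held by
// lane `src_in_subgroup` of its OWN subgroup. `width` models the subgroup size.
std::uint8_t emulated_shuffle(const Lanes & v, std::size_t self,
                              std::size_t src_in_subgroup, std::size_t width) {
    const std::size_t base = (self / width) * width;
    return v[base + (src_in_subgroup % width)];
}

// Scalar reference: even element in the low nibble, odd element in the high nibble.
Packed pack_reference(const Lanes & idx) {
    Packed qs{};
    for (std::size_t i = 0; i < LANES; i += 2) {
        qs[i / 2] = static_cast<std::uint8_t>((idx[i] & 0xFu) | ((idx[i + 1] & 0xFu) << 4));
    }
    return qs;
}

// Pair-shuffle pack: every lane fetches its right-hand neighbour's value,
// then only the even lanes write a byte.
Packed pack_pair_shuffle(const Lanes & idx, std::size_t width) {
    assert(width >= 2 && width % 2 == 0 && LANES % width == 0);
    Packed qs{};
    for (std::size_t lane = 0; lane < LANES; ++lane) {
        const std::uint8_t hi = emulated_shuffle(idx, lane, (lane % width) + 1, width);
        if (lane % 2 == 0) {
            qs[lane / 2] = static_cast<std::uint8_t>((idx[lane] & 0xFu) | ((hi & 0xFu) << 4));
        }
    }
    return qs;
}

// True when the emulated shuffle pack agrees with the scalar reference.
bool shuffle_pack_matches_reference(std::size_t width) {
    Lanes idx{};
    for (std::size_t i = 0; i < LANES; ++i) {
        idx[i] = static_cast<std::uint8_t>((i * 7) % 16);
    }
    return pack_pair_shuffle(idx, width) == pack_reference(idx);
}
```

Under these assumptions the result is the same for subgroup widths of 32 and 64, because each (even, odd) pair never straddles a subgroup boundary. A mismatch on real hardware would therefore point at the lane-to-element mapping rather than at the subgroup size itself.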

Closes

Closes #50 (Eval bug: pre-allocated tensor (cache_k_l3 (view)) in a buffer that cannot run the operation (SET_ROWS)).

🤖 Generated with Claude Code

@apollosenvy

Mirror of @apollosenvy's turbo3_0 Vulkan SET_ROWS port (PR #33 + #87) to the other two turbo types.

Reported by @dpblnt in #50 with a clean matrix on RX 9060 XT showing turbo3 V works on Vulkan but turbo2/turbo4 V abort with:

    pre-allocated tensor (cache_v_l*) in a buffer (Vulkan0) that cannot run the operation (SET_ROWS)

at llama_context::sched_reserve() time, before any compute runs.

Mechanical port across 4 files:

- vulkan-shaders/types.glsl: block_turbo2_0 + block_turbo4_0 struct declarations matching the C side (ggml-common.h).
- vulkan-shaders/copy_to_quant.comp: SET_ROWS quantize main() blocks for turbo2 (4 centroids, 2-bit pack, no signs byte) and turbo4 (16 centroids, 4-bit nibble pack, no signs byte). WHT setup and reduction structure identical to turbo3 (QK = 128 across all three). Centroid + midpoint tables ported from CENTROIDS_2BIT and CENTROIDS_4BIT in ggml-turbo-quant.c.
- vulkan-shaders/vulkan-shaders-gen.cpp: turbo2_0 and turbo4_0 added to the set_rows iteration list at line ~789.
- ggml-vulkan.cpp: SET_ROWS pipeline registrations + supports_op switch + dispatch element-count all extended with TURBO2_0 and TURBO4_0 cases.

## Verified on llvmpipe Vulkan (CPU software, AMD MI300X cloud droplet)

Patched ggml-vulkan.cpp temporarily during repro to allow llvmpipe (normally filtered out as eCpu); patch reverted before commit. The SET_ROWS abort is a backend-capability check at graph build time so it fires regardless of GPU vs CPU Vulkan backend.

| ctk / ctv | tg16 (t/s) | status |
|---|---:|---|
| q4_0 / q4_0 | 17.68 | baseline |
| q4_0 / turbo3 | 5.91 | already worked |
| q4_0 / turbo4 | 6.14 | was aborting |
| q4_0 / turbo2 | 5.65 | was aborting |

llvmpipe perf numbers are not meaningful (CPU-emulated Vulkan); they are reported here only to confirm the abort is gone and the kernels run end-to-end without divergence.

## Needs GPU validation

Cannot validate GPU shader correctness on the droplet (MI300X SR-IOV VF does not expose itself to RADV/amdvlk on cloud). Specifically:

- Subgroup shuffle / ballot behavior on real GPU subgroup sizes
- Shader compilation under non-llvmpipe Vulkan drivers
- PPL / quality on the actual quantization math

@dpblnt @apollosenvy if either of you has cycles, would appreciate a quick rebuild on RDNA Vulkan (gfx1100/gfx1200) to confirm:

1. The SET_ROWS abort that triggered #50 is gone
2. Output coherence on turbo4 V (not garbage tokens)
3. PPL stays in the expected ballpark vs the CUDA / Metal implementations of the same quants

Closes #50.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions Bot added ggml Vulkan labels May 2, 2026

TheTom mentioned this pull request May 2, 2026

Eval bug: pre-allocated tensor (cache_k_l3 (view)) in a buffer that cannot run the operation (SET_ROWS) #50

Closed

TheTom merged commit 60fc495 into feature/turboquant-kv-cache May 3, 2026
23 of 51 checks passed

TheTom deleted the fix/issue-50-vulkan-set-rows-turbo24 branch May 3, 2026 13:48

TheTom mentioned this pull request May 3, 2026

sycl: add SET_ROWS support for turbo2/turbo3/turbo4 V cache #120

Open
