vulkan: add SET_ROWS support for turbo2_0 and turbo4_0 (#50) #118

Merged
TheTom merged 1 commit into feature/turboquant-kv-cache from fix/issue-50-vulkan-set-rows-turbo24 on May 3, 2026

@TheTom (Owner) commented May 2, 2026

Fixes #50.

Bug

@dpblnt's matrix on RX 9060 XT (RADV gfx1200, Mesa 26.0.4) — see comment 4358790411 — isolates the failure cleanly:

| ctk / ctv     | result                                   |
|---------------|------------------------------------------|
| q4_0 / q4_0   | ✅ works                                 |
| q4_0 / turbo3 | ✅ works (apollosenvy PR #87 fix)        |
| q4_0 / turbo4 | ❌ aborts at SET_ROWS on cache_v_l3 (view) |
| q4_0 / turbo2 | ❌ aborts at SET_ROWS on cache_v_l3 (view) |

Same ggml_backend_sched_backend_id_from_cur → ggml_backend_sched_split_graph → llama_context::sched_reserve stack as turbo3 had pre-#87. Same root cause: the Vulkan backend has no SET_ROWS pipeline registered for the turbo2 and turbo4 V types, so the graph scheduler aborts at init.

Fix

Mirror of @apollosenvy's turbo3 Vulkan SET_ROWS port (#33 + #87) to the other two turbo types. 4 files, +278/-3 lines, all mechanical:

  1. vulkan-shaders/types.glsl: block_turbo2_0 and block_turbo4_0 struct declarations matching the C side (ggml/src/ggml-common.h):

    • turbo2_0: float16_t norm + uint8_t qs[32] (2-bit, 4 indices/byte, 10 bytes total)
    • turbo4_0: float16_t norm + float16_t rnorm (reserved) + uint8_t qs[64] (4-bit nibble, 2 indices/byte, 68 bytes total — matches the production TURBO4_USE_4BIT layout)
  2. vulkan-shaders/copy_to_quant.comp — SET_ROWS quantize main() blocks for turbo2 and turbo4. WHT setup and reduction structure identical to turbo3 (QK = 128 across all three). Centroid + midpoint tables ported from CENTROIDS_2BIT and CENTROIDS_4BIT in ggml/src/ggml-turbo-quant.c. turbo2 packs via the same subgroupShuffle pattern as turbo3 (4-per-byte); turbo4 uses pair-shuffle (2-per-byte nibble pack).

  3. vulkan-shaders/vulkan-shaders-gen.cpp: turbo2_0 and turbo4_0 added to the set_rows iteration list at line ~789.

  4. ggml-vulkan.cpp — three sites:

    • SET_ROWS pipeline registrations (SET_ROWS(_i32) / SET_ROWS(_i64) macro)
    • supports_op switch for SET_ROWS
    • Dispatch element-count (turbo2/turbo4 also use 128 threads/block like turbo3)
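
The two packing schemes described in the list above can be sketched on the CPU as follows. This is a minimal illustration, not the shader code: the function names are hypothetical, and the bit order within each byte (lowest-numbered element in the lowest bits) is an assumption that the PR text does not state. The static_asserts only restate arithmetic: 128 four-bit indices plus two fp16 fields give the 68 bytes quoted for turbo4_0. By the same arithmetic, qs[32] plus a 2-byte norm comes to 34 bytes, whereas a 10-byte block would correspond to qs[8], so ggml-common.h remains the authority on which turbo2_0 figure is intended.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t QK = 128;  // elements per quantization group, per the PR text

// Bytes needed to store n indices of `bits` bits each.
constexpr std::size_t packed_bytes(std::size_t n, std::size_t bits) {
    return n * bits / 8;
}

// turbo4_0 as described: fp16 norm + fp16 rnorm + 64 bytes of nibbles = 68 bytes.
static_assert(packed_bytes(QK, 4) + 2 + 2 == 68, "turbo4_0 block size");
// 128 two-bit indices occupy 32 bytes; 32 two-bit indices occupy 8 bytes.
static_assert(packed_bytes(QK, 2) == 32, "2-bit payload for 128 elements");
static_assert(packed_bytes(32, 2) + 2 == 10, "10 bytes = fp16 norm + qs[8]");

// 2-bit pack, 4 indices per byte.
// ASSUMPTION: element i occupies bits [2*(i%4), 2*(i%4)+1] of byte i/4.
std::array<uint8_t, packed_bytes(QK, 2)> pack_2bit(const std::array<uint8_t, QK>& idx) {
    std::array<uint8_t, packed_bytes(QK, 2)> qs{};
    for (std::size_t i = 0; i < QK; ++i) {
        assert(idx[i] < 4);
        qs[i / 4] |= static_cast<uint8_t>(idx[i] << (2 * (i % 4)));
    }
    return qs;
}

// 4-bit nibble pack, 2 indices per byte.
// ASSUMPTION: even elements go to the low nibble, odd elements to the high nibble.
std::array<uint8_t, packed_bytes(QK, 4)> pack_4bit(const std::array<uint8_t, QK>& idx) {
    std::array<uint8_t, packed_bytes(QK, 4)> qs{};
    for (std::size_t i = 0; i < QK; ++i) {
        assert(idx[i] < 16);
        qs[i / 2] |= static_cast<uint8_t>(idx[i] << (4 * (i % 2)));
    }
    return qs;
}
```

Calling pack_4bit on 128 indices yields a 64-byte array and pack_2bit a 32-byte array; if the real shaders use the opposite bit order, only the shift expressions change.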

Verified on llvmpipe Vulkan (CPU software fallback, AMD MI300X cloud droplet)

Could not validate on real GPU — the MI300X SR-IOV VF on the cloud doesn't expose itself to RADV, and amdvlk isn't packaged for Ubuntu. Patched ggml-vulkan.cpp temporarily during repro to allow llvmpipe (normally filtered out as eCpu); patch reverted before commit.

The SET_ROWS abort is a backend-capability check at graph build time, so it fires regardless of GPU vs CPU Vulkan backend. Both reproduction (pre-fix) and verification (post-fix) work correctly on llvmpipe.
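
A toy model of why the abort is device-independent, assuming the capability check is a pure switch on the tensor type. All names here (CacheType, supports_op_before_fix, reserve_graph) are hypothetical stand-ins and not the ggml API; the exception stands in for the scheduler abort.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-ins for the cache types and the op discussed in this PR.
enum class CacheType { Q4_0, Turbo2_0, Turbo3_0, Turbo4_0 };
enum class Op { SetRows };

// Models the pre-fix capability answer: only types with a registered
// SET_ROWS pipeline return true. No device property is consulted.
bool supports_op_before_fix(Op op, CacheType t) {
    if (op != Op::SetRows) return false;
    switch (t) {
        case CacheType::Q4_0:
        case CacheType::Turbo3_0:
            return true;
        default:
            return false;
    }
}

// Models the post-fix answer: turbo2 and turbo4 are added to the switch.
bool supports_op_after_fix(Op op, CacheType t) {
    if (op != Op::SetRows) return false;
    switch (t) {
        case CacheType::Q4_0:
        case CacheType::Turbo2_0:
        case CacheType::Turbo3_0:
        case CacheType::Turbo4_0:
            return true;
    }
    return false;
}

// Models graph reservation: the check runs before any compute is dispatched,
// so it fails the same way on a GPU driver and on llvmpipe.
void reserve_graph(bool (*supports)(Op, CacheType), CacheType v_type) {
    if (!supports(Op::SetRows, v_type)) {
        throw std::runtime_error("buffer cannot run the operation (SET_ROWS)");
    }
}
```

In this model, reserve_graph(supports_op_before_fix, CacheType::Turbo4_0) throws while the same call with supports_op_after_fix returns normally, which mirrors the reproduce-then-verify sequence run on llvmpipe.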

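| ctk / ctv     | tg16 (t/s) | status                            |
|---------------|-----------:|-----------------------------------|
| q4_0 / q4_0   | 17.68      | baseline                          |
| q4_0 / turbo3 | 5.91       | already worked (regression check) |
| q4_0 / turbo4 | 6.14       | was aborting → FIXED              |
| q4_0 / turbo2 | 5.65       | was aborting → FIXED              |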

llvmpipe perf numbers are not meaningful (CPU-emulated Vulkan); they're reported only to confirm the abort is gone and the kernels run end-to-end.

Needs GPU validation — please test if you can

What this PR proves on llvmpipe:

  • ✅ Pipeline registration is complete (no SET_ROWS abort)
  • ✅ Shader source compiles under llvmpipe
  • ✅ End-to-end run reaches generation phase

What this PR cannot prove on llvmpipe:

  • ❓ Subgroup shuffle / ballot behavior on real GPU subgroup sizes (RDNA wave 64, NVIDIA wave 32)
  • ❓ Shader compilation under non-llvmpipe Vulkan drivers (RADV, amdvlk, NVIDIA)
  • ❓ PPL / quality on the actual quantization math vs CUDA + Metal references

@dpblnt @apollosenvy if either of you has cycles, would deeply appreciate:

  1. Rebuild this branch with your usual cmake flags
  2. Confirm llama-bench -ctk q4_0 -ctv turbo4 on Vulkan no longer aborts
  3. Confirm output coherence on a small generation (-p "The capital of France is" -n 16) — should produce English, not garbage UTF-8
  4. Optional: PPL on wikitext-2 to confirm the quantization math matches CUDA / Metal

If something breaks, the most likely culprit is the subgroupShuffle pattern in the new turbo4 nibble-pack code (lines marked "Pack qs: 2 elements per byte (4-bit nibble each)"). Easy to swap for an alternative pack if the shuffle ordering doesn't match expectations on RDNA.
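
For testers who want a reference to compare against, here is a CPU emulation of a pair-shuffle nibble pack. It is a sketch under stated assumptions: the partner lane is taken to be lane ^ 1, even lanes write the byte, and the even element lands in the low nibble. The shader in this PR may choose differently, and emulated_shuffle is a hypothetical helper, not a Vulkan or GLSL call.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Emulates a subgroup shuffle within one subgroup: read the value held by src_lane.
uint8_t emulated_shuffle(const std::vector<uint8_t>& lanes, std::size_t src_lane) {
    assert(src_lane < lanes.size());
    return lanes[src_lane];
}

// Packs 4-bit indices two per byte, one subgroup at a time.
// ASSUMPTION: thread t holds idx[t] and threads are split into consecutive
// subgroups of `subgroup_size` lanes.
std::vector<uint8_t> nibble_pack_with_shuffle(const std::vector<uint8_t>& idx,
                                              std::size_t subgroup_size) {
    assert(subgroup_size % 2 == 0 && idx.size() % subgroup_size == 0);
    std::vector<uint8_t> qs(idx.size() / 2, 0);
    for (std::size_t base = 0; base < idx.size(); base += subgroup_size) {
        const auto first = idx.begin() + static_cast<std::ptrdiff_t>(base);
        const std::vector<uint8_t> lanes(first, first + static_cast<std::ptrdiff_t>(subgroup_size));
        for (std::size_t lane = 0; lane < subgroup_size; ++lane) {
            // Every lane performs the shuffle; only even lanes store a byte.
            const uint8_t partner = emulated_shuffle(lanes, lane ^ 1u);
            if (lane % 2 == 0) {
                qs[(base + lane) / 2] =
                    static_cast<uint8_t>((lanes[lane] & 0x0F) | ((partner & 0x0F) << 4));
            }
        }
    }
    return qs;
}

// Shuffle-free reference with the same assumed nibble order.
std::vector<uint8_t> nibble_pack_reference(const std::vector<uint8_t>& idx) {
    std::vector<uint8_t> qs(idx.size() / 2, 0);
    for (std::size_t i = 0; i + 1 < idx.size(); i += 2) {
        qs[i / 2] = static_cast<uint8_t>((idx[i] & 0x0F) | ((idx[i + 1] & 0x0F) << 4));
    }
    return qs;
}
```

Because lanes 2k and 2k+1 always fall in the same subgroup when the subgroup size is even, this particular pattern gives identical bytes for subgroup sizes 32 and 64. A pattern that shuffles across a wider stride would not have that property, which is the kind of mismatch worth looking for on real hardware.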

Closes

🤖 Generated with Claude Code

Mirror of @apollosenvy's turbo3_0 Vulkan SET_ROWS port (PR #33 + #87)
to the other two turbo types. Reported by @dpblnt in #50 with a clean
matrix on RX 9060 XT showing turbo3 V works on Vulkan but turbo2/turbo4
V abort with:

  pre-allocated tensor (cache_v_l*) in a buffer (Vulkan0)
  that cannot run the operation (SET_ROWS)

at llama_context::sched_reserve() time, before any compute runs.

Mechanical port across 4 files:

- vulkan-shaders/types.glsl: block_turbo2_0 + block_turbo4_0 struct
  declarations matching the C side (ggml-common.h).

- vulkan-shaders/copy_to_quant.comp: SET_ROWS quantize main() blocks
  for turbo2 (4 centroids, 2-bit pack, no signs byte) and turbo4
  (16 centroids, 4-bit nibble pack, no signs byte). WHT setup and
  reduction structure identical to turbo3 (QK = 128 across all three).
  Centroid + midpoint tables ported from CENTROIDS_2BIT and
  CENTROIDS_4BIT in ggml-turbo-quant.c.

- vulkan-shaders/vulkan-shaders-gen.cpp: turbo2_0 and turbo4_0 added
  to the set_rows iteration list at line ~789.

- ggml-vulkan.cpp: SET_ROWS pipeline registrations + supports_op
  switch + dispatch element-count all extended with TURBO2_0 and
  TURBO4_0 cases.
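
A sketch of how a centroid table and its midpoint table are typically used together: for sorted centroids, the nearest-centroid index equals the number of midpoints that lie below the input. The centroid values in the usage note are placeholders chosen for illustration, not the contents of CENTROIDS_2BIT or CENTROIDS_4BIT, and both function names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Builds the N-1 midpoints between adjacent centroids.
// Precondition: centroids are strictly increasing.
template <std::size_t N>
std::array<float, N - 1> make_midpoints(const std::array<float, N>& centroids) {
    static_assert(N >= 2, "need at least two centroids");
    std::array<float, N - 1> mid{};
    for (std::size_t i = 0; i + 1 < N; ++i) {
        assert(centroids[i] < centroids[i + 1]);
        mid[i] = 0.5f * (centroids[i] + centroids[i + 1]);
    }
    return mid;
}

// Returns the index of the nearest centroid by counting midpoints below x.
// A value exactly on a midpoint resolves to the lower centroid.
template <std::size_t M>
uint8_t quantize_index(float x, const std::array<float, M>& midpoints) {
    uint8_t index = 0;
    for (float m : midpoints) {
        if (x > m) ++index;
    }
    return index;
}
```

With placeholder centroids {-1.5, -0.5, 0.5, 1.5} the midpoints are {-1, 0, 1}, so an input of 0.3 maps to index 2 and an input of -2 maps to index 0. The comparison count is fixed per table size, which suits a shader where every thread should execute the same instruction sequence.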

## Verified on llvmpipe Vulkan (CPU software, AMD MI300X cloud droplet)

Patched ggml-vulkan.cpp temporarily during repro to allow llvmpipe
(normally filtered out as eCpu); patch reverted before commit. The
SET_ROWS abort is a backend-capability check at graph build time so
it fires regardless of GPU vs CPU Vulkan backend.

| ctk / ctv         | tg16 (t/s) | status        |
|-------------------|-----------:|---------------|
| q4_0 / q4_0       | 17.68      | baseline      |
| q4_0 / turbo3     | 5.91       | already worked|
| q4_0 / turbo4     | 6.14       | was aborting  |
| q4_0 / turbo2     | 5.65       | was aborting  |

llvmpipe perf numbers are not meaningful (CPU-emulated Vulkan); they
are reported here only to confirm the abort is gone and the kernels
run end-to-end without divergence.

## Needs GPU validation

Cannot validate GPU shader correctness on the droplet (MI300X SR-IOV
VF does not expose itself to RADV/amdvlk on cloud). Specifically:
- Subgroup shuffle / ballot behavior on real GPU subgroup sizes
- Shader compilation under non-llvmpipe Vulkan drivers
- PPL / quality on the actual quantization math

@dpblnt @apollosenvy if either of you has cycles, would appreciate
a quick rebuild on RDNA Vulkan (gfx1100/gfx1200) to confirm:
1. The SET_ROWS abort that triggered #50 is gone
2. Output coherence on turbo4 V (not garbage tokens)
3. PPL stays in the expected ballpark vs the CUDA / Metal
   implementations of the same quants

Closes #50.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@TheTom merged commit 60fc495 into feature/turboquant-kv-cache on May 3, 2026
23 of 51 checks passed
@TheTom deleted the fix/issue-50-vulkan-set-rows-turbo24 branch on May 3, 2026 at 13:48
Development

Successfully merging this pull request may close these issues.

Eval bug: pre-allocated tensor (cache_k_l3 (view)) in a buffer that cannot run the operation (SET_ROWS)
