vulkan: add SET_ROWS support for turbo2_0 and turbo4_0 (#50) #118
Merged
TheTom merged 1 commit into feature/turboquant-kv-cache on May 3, 2026
Mirror of @apollosenvy's turbo3_0 Vulkan SET_ROWS port (PR #33 + #87) to the other two turbo types. Reported by @dpblnt in #50 with a clean matrix on RX 9060 XT showing turbo3 V works on Vulkan but turbo2/turbo4 V abort with:

    pre-allocated tensor (cache_v_l*) in a buffer (Vulkan0) that cannot run the operation (SET_ROWS)

at llama_context::sched_reserve() time, before any compute runs.

Mechanical port across 4 files:

- vulkan-shaders/types.glsl: block_turbo2_0 + block_turbo4_0 struct declarations matching the C side (ggml-common.h).
- vulkan-shaders/copy_to_quant.comp: SET_ROWS quantize main() blocks for turbo2 (4 centroids, 2-bit pack, no signs byte) and turbo4 (16 centroids, 4-bit nibble pack, no signs byte). WHT setup and reduction structure identical to turbo3 (QK = 128 across all three). Centroid + midpoint tables ported from CENTROIDS_2BIT and CENTROIDS_4BIT in ggml-turbo-quant.c.
- vulkan-shaders/vulkan-shaders-gen.cpp: turbo2_0 and turbo4_0 added to the set_rows iteration list at line ~789.
- ggml-vulkan.cpp: SET_ROWS pipeline registrations + supports_op switch + dispatch element-count all extended with TURBO2_0 and TURBO4_0 cases.

## Verified on llvmpipe Vulkan (CPU software, AMD MI300X cloud droplet)

Patched ggml-vulkan.cpp temporarily during repro to allow llvmpipe (normally filtered out as eCpu); patch reverted before commit. The SET_ROWS abort is a backend-capability check at graph build time so it fires regardless of GPU vs CPU Vulkan backend.

| ctk / ctv     | tg16 (t/s) | status         |
|---------------|-----------:|----------------|
| q4_0 / q4_0   |      17.68 | baseline       |
| q4_0 / turbo3 |       5.91 | already worked |
| q4_0 / turbo4 |       6.14 | was aborting   |
| q4_0 / turbo2 |       5.65 | was aborting   |

llvmpipe perf numbers are not meaningful (CPU-emulated Vulkan); they are reported here only to confirm the abort is gone and the kernels run end-to-end without divergence.

## Needs GPU validation

Cannot validate GPU shader correctness on the droplet (MI300X SR-IOV VF does not expose itself to RADV/amdvlk on cloud). Specifically:

- Subgroup shuffle / ballot behavior on real GPU subgroup sizes
- Shader compilation under non-llvmpipe Vulkan drivers
- PPL / quality on the actual quantization math

@dpblnt @apollosenvy if either of you has cycles, would appreciate a quick rebuild on RDNA Vulkan (gfx1100/gfx1200) to confirm:

1. The SET_ROWS abort that triggered #50 is gone
2. Output coherence on turbo4 V (not garbage tokens)
3. PPL stays in the expected ballpark vs the CUDA / Metal implementations of the same quants

Closes #50.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fixes #50.
Bug
@dpblnt's matrix on RX 9060 XT (RADV gfx1200, Mesa 26.0.4) — see comment 4358790411 — isolates the failure cleanly:
The failing entries in that matrix both abort at `SET_ROWS` on `cache_v_l3 (view)`.

Same `ggml_backend_sched_backend_id_from_cur` → `ggml_backend_sched_split_graph` → `llama_context::sched_reserve` stack as turbo3 had pre-#87. Same root cause: the Vulkan backend has no SET_ROWS pipeline registered for turbo2 and turbo4 V types, so the graph scheduler aborts at init.

Fix
Mirror of @apollosenvy's turbo3 Vulkan SET_ROWS port (#33 + #87) to the other two turbo types. 4 files, +278/-3 lines, all mechanical:
- `vulkan-shaders/types.glsl`: `block_turbo2_0` and `block_turbo4_0` struct declarations matching the C side (`ggml/src/ggml-common.h`):
  - `block_turbo2_0`: `float16_t norm` + `uint8_t qs[32]` (2-bit, 4 indices/byte, 34 bytes total)
  - `block_turbo4_0`: `float16_t norm` + `float16_t rnorm` (reserved) + `uint8_t qs[64]` (4-bit nibble, 2 indices/byte, 68 bytes total; matches the production `TURBO4_USE_4BIT` layout)
- `vulkan-shaders/copy_to_quant.comp`: SET_ROWS quantize `main()` blocks for turbo2 and turbo4. WHT setup and reduction structure identical to turbo3 (QK = 128 across all three). Centroid + midpoint tables ported from `CENTROIDS_2BIT` and `CENTROIDS_4BIT` in `ggml/src/ggml-turbo-quant.c`. turbo2 packs via the same subgroupShuffle pattern as turbo3 (4-per-byte); turbo4 uses pair-shuffle (2-per-byte nibble pack).
- `vulkan-shaders/vulkan-shaders-gen.cpp`: `turbo2_0` and `turbo4_0` added to the set_rows iteration list at line ~789.
- `ggml-vulkan.cpp`: three sites:
  - pipeline registrations (the `SET_ROWS(_i32)` / `SET_ROWS(_i64)` macro)
  - `supports_op` switch for SET_ROWS
  - dispatch element count

Verified on llvmpipe Vulkan (CPU software fallback, AMD MI300X cloud droplet)
Could not validate on real GPU: the MI300X SR-IOV VF on the cloud doesn't expose itself to RADV, and amdvlk isn't packaged for Ubuntu. Patched `ggml-vulkan.cpp` temporarily during repro to allow llvmpipe (normally filtered out as `eCpu`); patch reverted before commit.

The SET_ROWS abort is a backend-capability check at graph build time, so it fires regardless of GPU vs CPU Vulkan backend. Both reproduction (pre-fix) and verification (post-fix) work correctly on llvmpipe.
llvmpipe perf numbers are not meaningful (CPU-emulated Vulkan); they're reported only to confirm the abort is gone and the kernels run end-to-end.
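A cheap host-side check can complement the llvmpipe run: the two block layouts listed under Fix can be mirrored in C++ and their sizes asserted at compile time. The sketch below is illustrative only. The `_sketch` struct names, the `half_bits` stand-in for a 16-bit float, and the `QK_TURBO` constant are hypothetical; the real definitions live in `ggml/src/ggml-common.h` and may differ in detail. The expected sizes follow from the listed fields (2 + 32 = 34 bytes for turbo2, 2 + 2 + 64 = 68 bytes for turbo4).

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical host-side mirrors of the block layouts described in this PR.
// These are NOT the ggml definitions; names and types are stand-ins.
constexpr int QK_TURBO = 128;        // elements per block (QK = 128 per the PR)
using half_bits = std::uint16_t;     // stand-in for a 16-bit float

struct block_turbo2_0_sketch {
    half_bits    norm;               // per-block norm
    std::uint8_t qs[QK_TURBO / 4];   // 2-bit indices, 4 per byte -> 32 bytes
};

struct block_turbo4_0_sketch {
    half_bits    norm;               // per-block norm
    half_bits    rnorm;              // reserved
    std::uint8_t qs[QK_TURBO / 2];   // 4-bit indices, 2 per byte -> 64 bytes
};

// Sizes derived from the fields above; a mismatch fails the build.
static_assert(sizeof(block_turbo2_0_sketch) == 34, "turbo2 block: 2 + 32 bytes");
static_assert(sizeof(block_turbo4_0_sketch) == 68, "turbo4 block: 2 + 2 + 64 bytes");
static_assert(offsetof(block_turbo4_0_sketch, qs) == 4, "qs follows norm + rnorm");
```

If the GLSL declarations in `types.glsl` disagree with these sizes, the shader would write blocks at the wrong stride, which is the kind of error a run that only checks for the abort would not catch.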
Needs GPU validation — please test if you can
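What this PR proves on llvmpipe:

- The SET_ROWS abort is gone
- The turbo2 and turbo4 kernels run end-to-end

What this PR cannot prove on llvmpipe:

- Subgroup shuffle / ballot behavior on real GPU subgroup sizes
- Shader compilation under non-llvmpipe Vulkan drivers
- PPL / quality on the actual quantization math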
@dpblnt @apollosenvy if either of you has cycles, would deeply appreciate:
- `llama-bench -ctk q4_0 -ctv turbo4` on Vulkan no longer aborts
- Output coherence (… `-p "The capital of France is" -n 16`): should produce English, not garbage UTF-8

If something breaks, the most likely culprit is the subgroupShuffle pattern in the new turbo4 nibble-pack code (lines marked "Pack qs: 2 elements per byte (4-bit nibble each)"). Easy to swap for an alternative pack if the shuffle ordering doesn't match expectations on RDNA.
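For anyone debugging the pack step, a scalar CPU reference makes it easier to tell a shuffle-ordering bug from a quantization bug. The sketch below packs 128 indices at 2 elements per byte (turbo4 style) and 4 elements per byte (turbo2 style) and unpacks them again. The bit order is an assumption (lowest-numbered element in the lowest bits) and has not been checked against the shader or `ggml-turbo-quant.c`; all function names are hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t QK = 128;  // elements per block, per the PR

// Pack 4-bit indices, two per byte.
// ASSUMPTION: even element -> low nibble, odd element -> high nibble.
std::array<std::uint8_t, QK / 2> pack_nibbles(const std::array<std::uint8_t, QK>& idx) {
    std::array<std::uint8_t, QK / 2> qs{};
    for (std::size_t i = 0; i < QK / 2; ++i) {
        const unsigned lo = idx[2 * i] & 0x0Fu;
        const unsigned hi = idx[2 * i + 1] & 0x0Fu;
        qs[i] = static_cast<std::uint8_t>(lo | (hi << 4));
    }
    return qs;
}

// Pack 2-bit indices, four per byte.
// ASSUMPTION: element j of each group of four lands at bit offset 2*j.
std::array<std::uint8_t, QK / 4> pack_2bit(const std::array<std::uint8_t, QK>& idx) {
    std::array<std::uint8_t, QK / 4> qs{};
    for (std::size_t i = 0; i < QK / 4; ++i) {
        unsigned b = 0;
        for (std::size_t j = 0; j < 4; ++j) {
            b |= (idx[4 * i + j] & 0x03u) << (2 * j);
        }
        qs[i] = static_cast<std::uint8_t>(b);
    }
    return qs;
}

// Inverse helpers: recover the index of element e from the packed bytes.
std::uint8_t unpack_nibble(const std::array<std::uint8_t, QK / 2>& qs, std::size_t e) {
    return static_cast<std::uint8_t>((qs[e / 2] >> (4 * (e % 2))) & 0x0Fu);
}

std::uint8_t unpack_2bit(const std::array<std::uint8_t, QK / 4>& qs, std::size_t e) {
    return static_cast<std::uint8_t>((qs[e / 4] >> (2 * (e % 4))) & 0x03u);
}
```

Comparing the `qs` bytes written by the shader against a reference like this for a known input block would show directly whether the pair-shuffle delivers neighbours in the expected order on a given subgroup size.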
Closes
🤖 Generated with Claude Code