This fork targets ROCm and Vulkan as first-class backends. Every feature is expected to reach parity across both; features that cannot are explicitly documented as such.
This document defines the policy, lists the per-feature status, and tracks the Vulkan-port backlog.
A feature progresses through three states:
| State | Definition |
|---|---|
| ROCm-landed | Feature works correctly on ROCm. Passes PPL regression on at least one mandatory ROCm target. Merged to trunk. |
| Vulkan-landed | Vulkan implementation works correctly. Passes PPL regression on Vulkan. Merged to trunk (typically on a follow-up branch). |
| Released | Both ROCm-landed AND Vulkan-landed AND cross-backend PPL match within tolerance (defined per type in TYPE_ASSIGNMENTS.md). Listed in user-facing release notes. |
Vulkan parity is high-priority, not a hard gate. A feature ROCm-lands first; the Vulkan port is scheduled as a follow-up sub-layer. The feature does NOT appear as a released feature until both backends are on trunk and cross-backend PPL matches.
A feature is shipped ROCm-only if and only if the Vulkan port has been attempted and documented as non-viable (e.g., a fundamental compute-shader limitation around Viterbi decoding or stateful KV reuse). Non-viability requires a written rationale in this document.
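The state rules above can be read as a small decision function. The sketch below only restates the policy text; the function name, flag names, and the intermediate labels are hypothetical and do not correspond to any script in this repository.

```python
def feature_state(rocm_landed, vulkan_landed, parity_ok,
                  vulkan_nonviable_documented=False):
    """Map landing facts to a policy state (illustrative names only)."""
    if rocm_landed and vulkan_landed and parity_ok:
        # Both backends on trunk and cross-backend PPL within tolerance.
        return "Released"
    if rocm_landed and vulkan_nonviable_documented:
        # Only reachable with a written non-viability rationale.
        return "Shipped ROCm-only"
    if rocm_landed and vulkan_landed:
        return "Vulkan-landed, parity pending"
    if rocm_landed:
        return "ROCm-landed"
    return "Not landed"
```

Under this reading, a feature with both backends merged but an unverified cross-backend PPL match is still not reported as released.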
| Target | Priority | Hardware | Notes |
|---|---|---|---|
| gfx1150 | Mandatory | Strix Halo APU | Primary development target; all ROCm features must work here. |
| gfx1102 | Nice-to-have | RDNA3 mobile | AMD's upstream ROCm support is the real bottleneck (stock Tensile lacks GEMM kernels for many dtypes/shapes). Use Vulkan instead unless the side-quest below produces working ROCm. |
| gfx1103 | Nice-to-have | RDNA3 mobile | Same as gfx1102. |
A feature ROCm-lands when it works on gfx1150. gfx1102/1103 are not release-blocking.
Vulkan is expected to be the more portable backend across mixed hardware. Targets:
- AMD RDNA3 (gfx1102/gfx1103) — primary Vulkan use case (systems where ROCm is blocked on upstream AMD)
- AMD RDNA3.5 (gfx1150) — first-class Vulkan target alongside ROCm
- Anything else that runs Mesa RADV — best-effort
A feature Vulkan-lands when it works on both RDNA3 and RDNA3.5 Vulkan.
The PPL regression harness runs every supported quant type on BOTH ROCm and Vulkan after both have landed, on the SAME pinned wikitext slice. Cross-backend delta tolerance:
- PPL delta < 0.5% vs ROCm baseline → parity verified.
- 0.5% ≤ delta < 1.0% → parity warning; investigate before release.
- delta ≥ 1.0% → parity failure; release blocked until fixed.
The harness pins the wikitext slice to avoid the wikitext-mismatch trap that's burned past porting work. Same file, same byte offset, both backends.
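As a minimal sketch of the tolerance bands, the function below classifies a pair of PPL values measured on the same pinned slice. The function name is hypothetical, and taking the absolute value of the delta is an assumption: the policy text does not say whether the delta is signed.

```python
def classify_parity(ppl_rocm, ppl_vulkan):
    """Classify the cross-backend PPL delta relative to the ROCm baseline."""
    if ppl_rocm <= 0:
        raise ValueError("ROCm baseline PPL must be positive")
    # Assumption: the bands apply to the magnitude of the relative delta.
    delta_pct = abs(ppl_vulkan - ppl_rocm) / ppl_rocm * 100.0
    if delta_pct < 0.5:
        return "verified", delta_pct
    if delta_pct < 1.0:
        return "warning", delta_pct
    return "failure", delta_pct
```

For example, a ROCm baseline of 10.0 against a Vulkan result of 10.07 is a 0.7% delta and lands in the warning band.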
Status updated per layer landing. Initial state derived from BACKEND_PARITY.md audit, 2026-05-12.
| Feature | Layer | ROCm status | Vulkan status | Vulkan port priority |
|---|---|---|---|---|
| Zyphra ZAYA1-8B model arch (LLM_ARCH_ZAYA) | model port (not phased) | RELEASED on gfx1150; compiles on gfx1102/1103 but runtime dead per Tensile/hipBLAS gap | RELEASED on RDNA3 (gfx1103); single-seq + multi-seq PPL within ±0.5% across F16/Q8_0/Q5_K_M/IQ4_XS-imat-guq5k | n/a (released; pure-graph port, no new kernels or types) |
| TurboQuant KV (TURBOQ2/3/4_0) | 1 | RELEASED (gfx1150 first-class; gfx1102/1103 smoke-only via HSA_OVERRIDE_GFX_VERSION=11.0.2) | RELEASED (RDNA3 + RDNA3.5; cross-backend Δ ≤ +0.17%) | n/a (released) |
| TurboQuant8 KV (TURBOQ8_0, source: buun) | 1 | wired — CPU vec_dot + CUDA/HIP fattn-vec instances on main (b22b6492d); 8-bit uniform-grid codec; quality bench pending (measure-first) | landed — K-cache port on main efb8832cd (get_rows + dequant + scalar-FA binding + gate); coopmat turboq8 FA = follow-on scope | Vulkan K-cache landed efb8832cd; coopmat FA follow-on pending |
| WHT weight quants (WHT3_0) | 1 | RELEASED | RELEASED | n/a (released) |
| WHT weight quants (WHT4_0) | 1 | RELEASED | RELEASED (cross-backend Δ +0.057%) | n/a (released) |
| GGML_OP_TURBO_WHT | 1 | RELEASED | RELEASED | n/a (released) |
| Boundary V / TURBO_LAYER_ADAPTIVE | 1 | RELEASED | RELEASED | n/a (default-off; backend-agnostic plumbing) |
| MTP spec-decode spine + Migration 0-3 | 2 | RELEASED (speculative driver + loader + graph converged to b9246; V-J gap closed 705ffccb8) | no novel GPU kernels — Vulkan parity via inherited mainline paths | n/a (CPU/scheduler only; no backend-specific kernels) |
| NLD (Nemotron-Labs Diffusion) | model port (not phased) | RELEASED ROCm — CLI 49f88e18a; server self-spec 1cb8c4218 | RELEASED Vulkan (smoke + 5ch PPL PASS on gfx1103 RADV, 2026-05-24; llama-cli load+gen + llama-perplexity validated) | n/a (released; backend-agnostic inference path, no new kernels) |
| TCQ KV (TURBOQ2/3_TCQ) | 3 | source has CUDA only, no HIP | source has none | P1 — Viterbi-in-shader is hard; investigate viability |
| TriAttention | 9 (revived) | RELEASED — Phase A+B in-graph K/V capture harness 6cbc9e06c + HIP guard + Gemma-4 ISWA fix cbd071632 + legacy compaction evictor 6f93b4e5d; GQA CPU smoke GREEN 3/3; Phase C GPU GQA kernel HIP 51a64b43c + Vulkan 0d13ac92b SHIPPED; SWA-layer capture for Gemma-4 hybrid models SHIPPED 086c8508f (Phase C Part 2) | RELEASED — Vulkan GPU GQA kernel 0d13ac92b + SWA capture 086c8508f; cross-backend parity achieved | n/a (released) |
| EAGLE3 | 5 | RELEASED — hidden-state extrapolation ported c0f3c1486; fc dtype-aware fix 4c38845c4 (BF16/F16→F32); struct rebase fixup e109b17d8; compact-vocab (SpecForge 32K-draft-vocab + d2t) b2766ef47 (2026-05-31, PR #18039) | backend-agnostic; no novel GPU kernels | n/a (backend-agnostic) |
| PHANTOM-X | 5 | RELEASED — speculator ported d6dc63224; Phase 2 dispatch 388169995 | backend-agnostic; no novel GPU kernels | n/a (backend-agnostic) |
| TurboMind allocator | 5 | gfx1030-specific in source | source has none | Investigate; may not be needed |
| Wave32 RDNA2 kernels | 5 | ROCm-only by design (RDNA2 SIMD32) | not applicable | ROCm-only by design |
| IK quants base-K (IQ2_K, IQ3_K, IQ4_K) | 5 (5b-1a) | RELEASED — ROCm + Vulkan (PPL Δ < 0.0045 vs reference) | RELEASED — Vulkan batched mul_mat SEGV fixed 5fe804bcd | n/a (released) |
| IK quants row-meta KS/KT (IQ4_KS, IQ4_KSS, IQ3_KS, IQ4_KT) | 5 (5b-1b) | RELEASED — ROCm + Vulkan (PPL gate 20-chunk Δ ≤ 0.043) | RELEASED — Vulkan SEGV fixed via is_empty() dequant-to-f16 fallback | n/a (released) |
| IK quants extended (IQ5_K, IQ6_K) | 5 (5b-2) | RELEASED — Phase 5b-2 CPU+CUDA/HIP 8e19be061 | RELEASED — Vulkan dequant + matvec shaders 0ade7ff86 2026-05-25 | n/a (released) |
| IK quant IQ2_KL (2.6875 bpw) | 5 (5b-1c) | RELEASED — Phase 5b-1c CPU+CUDA/HIP f18a92a42 | RELEASED — Vulkan dequant + matvec shaders 3723c1f61 2026-05-25 | n/a (released) |
| IK Trellis IQ2_KT (Phase P3a) | 5 (Trellis P3a) | RELEASED (§-FLAG do-not-use) — CPU+ROCm + Vulkan; 0dac276d9 + cluster-accel 1e8501e46 2026-05-25; known issues: 0.8B PPL anomaly open; cluster-accel PPL +8.3% above ≤+5% gate | RELEASED — Vulkan dequant + mul_mat_vec shaders (main) | Formal PPL parity gate pending |
| BitNet (IQ1_BN, IQ2_BN, I2_S) | 6 | source has CUDA + implicit HIP | source has none | P1 — ternary decode is simple |
| MLA / FlashMLA | 6 | source has CUDA | source has none | P2 — very high port cost |
| Fused MoE | 6 | source has CUDA | source has none | P2 |
| Trellis weight quants (IQ3_KT, IQ1_KT) | 5 (Trellis P3b, P3c) | RELEASED — IQ3_KT CPU+ROCm+Vulkan (c809225f6, CLOSED — gfx1150 GPU 99% util / 7.66s-pass; gfx1102 warmup crash is a separate Tensile confound); IQ1_KT queued | RELEASED — Vulkan dequant + mul_mat_vec shaders (main) | RESOLVED on Vulkan: IQ3_KT 8.4299 / IQ3_K 6.8348 (9B, 20ch); ROCm gfx1150 PASS (CLOSED) |
| Q*_K row-interleaved (_R4/_R8) | 6 | CUDA only in ik_llama | none | P2 — CPU variants exist; GPU optional |
| RaBitQ TQ3 weights (RBQ3_*) | 7 | source has CUDA; HIP branch not yet merged | source has none | P1 |
| DFlash spec-decode (S1+S2+S3+converter) | 7a | RELEASED — S1 loader b6a75e524 + S2 dispatch ef80c728c + S3 GPU ring buffer + server spec_type 9b7ab4e83 + mask_token_id u32 fix 1436d1890 + DFlashDraftModel safetensors→GGUF converter ee7d4f896; end-to-end smoke GREEN @ 2726a56c0 | backend-agnostic at S3 dispatch level; GPU ring buffer is CUDA/HIP | n/a (released) |
| PFlash prompt compression | 8 | mostly CPU/scheduler | mostly CPU | P3 — backend-agnostic |
| --hugepages | 9 | Linux kernel feature; backend-agnostic | same | n/a |
| gfx1030 normalization | 9 | ROCm-only by design | not applicable | ROCm-only by design |
Priorities:
- P0 = port concurrent with ROCm landing (target: same week)
- P1 = port within 2 weeks of ROCm landing
- P2 = port within 4 weeks of ROCm landing
- P3 = port within 8 weeks of ROCm landing or backend-agnostic
- declined = will not port unless explicitly revisited
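The priority windows translate directly into port deadlines. The helper below is a sketch under two assumptions that are not in the policy text: "same week" for P0 is read as 7 days, and the name `vulkan_port_deadline` is purely illustrative.

```python
from datetime import date, timedelta

# Port windows in days after ROCm landing.
# Assumption: P0 "same week" is treated as 7 days.
PORT_WINDOW_DAYS = {"P0": 7, "P1": 14, "P2": 28, "P3": 56}

def vulkan_port_deadline(rocm_landed_on, priority):
    """Return the latest Vulkan-port date, or None for declined features."""
    if priority == "declined":
        return None
    return rocm_landed_on + timedelta(days=PORT_WINDOW_DAYS[priority])
```

A P1 feature that ROCm-landed on 2026-05-12 would, under these assumptions, be due on Vulkan by 2026-05-26.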
For each Vulkan port, follow this recipe (informed by community experience with the Vulkan SET_ROWS 5-place wiring pattern and the Vulkan buffer-sizing pitfall for IK-family row-meta types):
- Audit upstream Vulkan dispatch. Map the ROCm code path to its Vulkan counterpart in `ggml-vulkan.cpp`. Identify the 5 places SET_ROWS must be wired for any new quant type.
- Use `ggml_nbytes(src0)` for descriptor sizing, never `type_size * x_ne / blck_size`. The latter undersizes IK-family row-meta types and produces silently wrong results.
- Write the compute shader. Start from the most similar existing shader as a template.
- Wire dispatch sizing. Missing this for a new type produces silently wrong K cache and 0 tokens generated.
- PPL-test on a pinned wikitext slice. Run the SAME slice on ROCm to confirm cross-backend parity. Different wikitext slices produce different absolute PPLs and have produced phantom 0.42-PPL "regressions" that were really file-mismatch artifacts.
- Submit as a follow-up topic branch. Name: `vulkan/<feature>`.
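The descriptor-sizing step in the recipe is easier to see with numbers. The sketch below models a quant type whose rows each carry a few bytes of per-row metadata in addition to their blocks. That layout model and every constant in it are hypothetical, chosen only to make the gap visible; they are not taken from any real IK type definition, and the sketch does not reimplement `ggml_nbytes`.

```python
def naive_bytes(type_size, blck_size, n_elements):
    """Size computed as type_size * n_elements / blck_size (no row metadata)."""
    return type_size * n_elements // blck_size

def row_meta_bytes(type_size, blck_size, ne0, n_rows, row_meta_size):
    """Size when each row stores row_meta_size extra bytes plus its blocks."""
    row_bytes = row_meta_size + (ne0 // blck_size) * type_size
    return n_rows * row_bytes

# Hypothetical numbers, not a real type definition.
TYPE_SIZE, BLCK_SIZE, ROW_META = 128, 256, 4
ne0, n_rows = 4096, 32

naive = naive_bytes(TYPE_SIZE, BLCK_SIZE, ne0 * n_rows)
actual = row_meta_bytes(TYPE_SIZE, BLCK_SIZE, ne0, n_rows, ROW_META)
shortfall = actual - naive  # bytes the naive descriptor would not cover
print(naive, actual, shortfall)
```

In this model the naive formula comes up short by exactly the per-row metadata, `n_rows * ROW_META` bytes, so the tail of the tensor falls outside the bound descriptor range.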
AMD's upstream ROCm support for RDNA3 mobile (gfx1102/1103) is incomplete: stock Tensile lacks GEMM kernels for many dtype/shape combinations encountered in production workloads. ROCm is therefore unworkable for production inference and calibration on RDNA3-mobile systems; Vulkan is the practical alternative for those workloads.
Project decision (2026-05-12, refined 2026-05-21): This fork treats
gfx1102/1103 ROCm as out of scope for production inference / calibration,
in scope as a regression-smoke target. The build catches HIP-shim
breakage early (e.g., new __shfl_xor_sync call sites, missing
cudaStreamCapture* shims, undefined-symbol link errors when a new
fattn-vec template instance is added). Cross-host PPL parity is validated
against gfx1150 on models that fit gfx1103's GEMM coverage
(empirically: TurboQuant KV types + WHT weight quants pass; mainline
Q4_K_M passes; production-class quantize/calibrate workloads still hit
Tensile gaps).
Build recipe. Single-target gfx1102 build (dual-target gfx1102+gfx1103 install hangs at relink). Runtime requires the HSA override to load the gfx1102 binary on gfx1103 hardware via RDNA3-family ISA compatibility:
```bash
cmake -B build-rocm-gfx1102 -S . -G Ninja \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_INSTALL_PREFIX=/opt/llama.cpp \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS="gfx1102" \
  -DGGML_AVX512=ON -DGGML_AVX512_VBMI=ON \
  -DGGML_AVX512_VNNI=ON -DGGML_AVX512_BF16=ON \
  -DCMAKE_C_FLAGS="-march=native -O3" \
  -DCMAKE_CXX_FLAGS="-march=native -O3"
cmake --build build-rocm-gfx1102 -j

# Runtime: ALWAYS set HSA_OVERRIDE
HSA_OVERRIDE_GFX_VERSION=11.0.2 ./build-rocm-gfx1102/bin/llama-perplexity ...
```

If AMD ships upstream support for gfx1102/1103 GEMM kernels at some future point, this fork will inherit it via the standard cmake recipe without project-side work, and this section will collapse back to "first-class".
If a feature's Vulkan port is attempted and abandoned, document it here with:
- Feature name and layer
- What was attempted
- What failed (technical reason)
- ROCm-only marker added to feature documentation
- Date of decision
A feature shipped ROCm-only will be clearly marked in user-facing release notes and the README's feature table.
- Wave32 RDNA2 kernels (carlosfundora) — RDNA2-specific SIMD32 path. Vulkan equivalent would be a generic compute shader; the optimization doesn't translate.
- gfx1030 normalization (carlosfundora) — ROCm gfx1030 build-fixup patches. Vulkan-irrelevant.
(none yet)
Feature: MTP (Multi-Token Prediction) speculative decoding
Backends affected: Vulkan (historical)
Status: Resolved — n_max >= 2 now works on Vulkan; the --spec-draft-n-max 1 constraint is no longer required.
Symptom (historical): The inference process hung when running Vulkan MTP speculative decoding with n_max >= 2 (multi-token draft batches). Single-token drafts (n_max = 1) were unaffected.
Root cause: A partial-acceptance checkpoint-restore livelock in examples/speculative-simple. On a partial accept the draft prefix was restored verbatim, but the MTP draft head's per-step state was not rolled back, so the resampled tail desynced the next verification trace and the restore loop never made progress (reachable only when n_max >= 2 can leave more than one pending draft id). It presented as a GPU-consuming hang because every restore re-ran the verify decode. The equivalent fix already existed in the server speculative path; the standalone example was missing it.
Fix: Gate the checkpoint pop_back on MTP drafts that still have more than one pending id, so the restore loop converges. Vulkan MTP speculative decoding now completes correctly at n_max >= 2.
The harness MUST:
- Run on both ROCm (gfx1150) and Vulkan from day one.
- Pin the wikitext slice (specific file, specific byte range) and never compare across slices.
- Store baseline PPLs per (model, type, backend) tuple.
- Flag cross-backend deltas exceeding the parity tolerance bands above.
- Run before any layer-landing merge.
A test that runs on only one backend is not a complete regression test.
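One way to enforce the slice-pinning and baseline-storage requirements is sketched below. All class and function names are hypothetical and the storage is an in-memory dictionary; the real harness may be organized quite differently. The point is only that a cross-backend comparison is refused unless both baselines were measured on the same bytes.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class SlicePin:
    """Identity of a pinned evaluation slice: content hash plus byte range."""
    sha256: str
    byte_offset: int
    byte_length: int

def pin_slice(data, byte_offset, byte_length):
    chunk = data[byte_offset:byte_offset + byte_length]
    return SlicePin(hashlib.sha256(chunk).hexdigest(), byte_offset, byte_length)

class BaselineStore:
    def __init__(self):
        self._ppl = {}  # (model, qtype, backend) -> (SlicePin, ppl)

    def record(self, model, qtype, backend, pin, ppl):
        self._ppl[(model, qtype, backend)] = (pin, ppl)

    def cross_backend_delta_pct(self, model, qtype):
        pin_r, ppl_r = self._ppl[(model, qtype, "rocm")]
        pin_v, ppl_v = self._ppl[(model, qtype, "vulkan")]
        if pin_r != pin_v:
            # Different slices give different absolute PPLs; never compare them.
            raise ValueError("refusing to compare PPL across different slices")
        return (ppl_v - ppl_r) / ppl_r * 100.0
```

The returned percentage can then be checked against the parity tolerance bands before a layer-landing merge.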
Every weight quantization type that lands in this fork must support the imatrix (importance-matrix) mechanism. Adding a weight quant without imatrix support is a layer-landing failure.
Scope: Applies to ALL weight quants (RBQ3_1S/4S,
IQ*_K, IQ*_KS, IQ*_KSS, BitNet IQ1_BN/IQ2_BN/I2_S,
trellis IQ*_KT if revived, and any future weight quant). Does NOT
apply to KV-cache quants (TURBOQ*_0, TURBOQ*_TCQ, RQ_*).
Exception — WHT3_0/WHT4_0: imatrix is intentionally disabled (quantizer audit a6ccf0bfa). The Walsh-Hadamard rotation mixes all 32 block columns, so post-rotation coefficient buf[j] no longer corresponds to original column j. Weighting the rotated residual by original-basis importance iw[j] misaligns importance and measurably degrades quality (−15.5 % PPL penalty). Both types quantize unweighted by design; tensor_requires_imatrix() returns false.
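The column-mixing argument can be checked with a plain Walsh-Hadamard transform. The sketch below uses a textbook unnormalized fast WHT on a 32-element block; the fork's actual rotation may add scaling or sign patterns, so treat this as an illustration of the mixing property rather than the quantizer's code.

```python
def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform (length must be a power of two)."""
    out = list(values)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b  # butterfly step
        h *= 2
    return out

BLOCK = 32
column = 5  # arbitrary original column
unit = [0.0] * BLOCK
unit[column] = 1.0

rotated = fwht(unit)
touched = sum(1 for c in rotated if c != 0.0)
print(touched)
```

A single original column contributes with magnitude 1 to all 32 rotated coefficients, so an importance weight attached to original column `j` has no single rotated coefficient to apply to.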
Per-type integration requires:
- `quantize_row_<type>` (or equivalent mmq precursor) accepts an imatrix and uses it for per-tensor importance weighting.
- `src/llama-quant.cpp` integrates with `qs.has_imatrix` / `requires_imatrix` per-ftype logic (matching the existing pattern for IQ3_S, IQ3_XXS, IQ4_NL).
- `llama-imatrix` recorder computes the right activation statistic for the type.
- `docs/TYPE_ASSIGNMENTS.md` documents which imatrix axis/group applies.
- PPL regression run BOTH with and without imatrix; results recorded.
Rationale: weight-quant quality degrades sharply without activation-weighted importance at low bit-rates (≤4-bit). Mainline already requires imatrix for some quant types; this fork's contributing forks target ≤4-bit quants where imatrix is not optional.
Initial backend coverage audit performed against fork tips:
- thetom/feature/turboquant-kv-cache (originally measured against feature/alpha-scaling; TQ-KV is a strict superset, see `recon/06-thetom-branches.md`. Re-audit Vulkan shader counts after Phase 1 cherry-pick.)
- buun/master
- carlosfundora/1-bit-turbo
- turbotan/main + turbotan/experiment/hip-tq3-support
- domvox/feature/triattention-scoring
- ikllama/main
Key finding: no contributing fork has Vulkan implementations for its distinctive features. Vulkan shader counts are at or below mainline for all forks; ik_llama is 56 Vulkan shaders behind mainline.
The implication: this fork bears the entire Vulkan-port burden in-house. Community work in the wider llama.cpp ecosystem (e.g., experiments outside the audited forks) may have Vulkan implementations of some of these features; sweep regularly.
- v1 (2026-05-12) — initial policy + audit baseline.
- v2 (2026-05-21) — Phase 1 release: TurboQuant KV (TURBOQ2/3/4_0) + WHT4_0 released on both backends; WHT3_0 Vulkan released (`37737a197`). gfx1102/1103 ROCm scope refined from "out of scope" to "partial scope: smoke target" (HSA_OVERRIDE recipe documented).
- v3 (2026-05-22 to 2026-05-24) — Phase 5b-1a (IQ2_K/IQ3_K/IQ4_K) + Phase 5b-1b (IQ4_KS/IQ4_KSS/IQ3_KS/IQ4_KT) released on both backends (Vulkan batched mul_mat SEGV fixed via `is_empty()` guard `5fe804bcd`). MTP spine rows updated: migration phases 0-3 complete; V-J accept-rate gap closed. NLD (Nemotron-Labs Diffusion) added as ROCm-released model port. Per-feature table expanded with 5b-1a/1b/2 + NLD + MTP migration rows.
- v4 (2026-05-24) — Phase 5b-2 (IQ5_K/IQ6_K) and Phase 5b-1c (IQ2_KL) released rows updated. EAGLE3 + PHANTOM-X marked RELEASED (backend-agnostic). DFlash S1 model loader noted. NLD Vulkan RELEASED confirmed (gfx1103 RADV, `7da3a8378`). Vulkan base-K MUL_MAT_ID fix (`c4da029f3`) recorded.
- v5 (2026-06-22) — TURBOQ8_0 Vulkan status updated: K-cache port landed on main `efb8832cd` (get_rows + dequant + scalar-FA binding + gate; was "not present"). TURBOQ4_0 Vulkan verified accurate — `cd2d0224c` (coopmat1 FA path fix, no stale limitation text to remove). provenance-docfix-2026-06-22 sweep.