
docs(hip): perf diagnosis + 4-tier optimization plan (rocprofv3 evidence on gfx1100/1151/1201)#156

Open
Kaden-Schutt wants to merge 2 commits into Luce-Org:main from Kaden-Schutt:docs/hip-perf-plan

Conversation

@Kaden-Schutt

Summary

Adds dflash/docs/HIP_PERF_PLAN.md — a rocprofv3-grounded diagnosis of where the HIP backend's decode tax lives, plus a ranked 4-tier optimization plan for the Luce-Org/llama.cpp-dflash-ggml fork.

Docs-only PR. Zero behavior change. Intended as the design artifact to review before any kernel work touches the codebase.

Key findings

Captured via rocprofv3 + per-kernel ISA scan on the canonical DFlash bench (Qwen3.6-27B-Q4_K_M + z-lab DFlash drafter, --fast-rollback --ddtree --ddtree-budget=22, HE-style 128-tok prompt md5 4280413edc0b45c2b09e1a45f4f5ee60, n_gen=256):

  1. mul_mat_q<q4_K, 32> dominates — 76% of decode GPU time on gfx1100. Flash attention is 0.5%. The "HIP tax" is in mmq.cuh's MMA path for q4_K/q4_0/q5_0, not in the missing custom flashprefill_kernels.hip.cu.
  2. Dispatch trace: MMVQ_MAX_BATCH_SIZE = 8 (mmvq.cuh:3). DDTree budget=22 → batch>8 → always falls to MMQ, which wastes ~31% of every batch-32 tile on unused columns.
  3. Tier 1 empirically verified: setting --ddtree-budget=8 on gfx1100 lifts decode from 49.81 → 76.02 tok/s (+53%) with zero kernel work. RDNA3.5 (gfx1151) and RDNA4 (gfx1201) prefer the existing budget=22 path; ship is gfx110x-specific.
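The dispatch gate and the proposed arch-aware default can be modeled in a few lines of Python. This is an illustrative sketch only: `MMVQ_MAX_BATCH_SIZE = 8` comes from `mmvq.cuh:3` as cited above, but the function names and the prefix-based arch check are hypothetical, not actual ggml or server.py symbols.

```python
# Model of the dispatch behavior described in finding 2, plus the
# Tier 1 arch-aware default suggested in finding 3. Sketch only.
MMVQ_MAX_BATCH_SIZE = 8  # mmvq.cuh:3

def decode_matmul_path(batch_cols: int) -> str:
    """Which quantized-matmul kernel a decode batch lands on."""
    return "MMVQ" if batch_cols <= MMVQ_MAX_BATCH_SIZE else "MMQ"

def default_ddtree_budget(gfx_arch: str) -> int:
    """Hypothetical arch-aware default for the daemon spawn / server.py."""
    # gfx110x: budget=8 keeps decode on MMVQ (+53% measured on gfx1100);
    # gfx1151 / gfx1201 measured faster on the existing budget=22 MMQ path.
    return 8 if gfx_arch.startswith("gfx110") else 22

assert decode_matmul_path(22) == "MMQ"   # budget=22 exceeds the gate -> MMQ
assert decode_matmul_path(8) == "MMVQ"   # budget=8 stays on MMVQ
assert default_ddtree_budget("gfx1100") == 8
assert default_ddtree_budget("gfx1151") == 22
assert default_ddtree_budget("gfx1201") == 22
```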

3-arch empirical table

n_gen=256 on canonical HE bench, warmup + 2 measurement runs:

| Arch | Card | budget=22 (MMQ) | budget=8 (MMVQ) | Delta |
|---|---|---|---|---|
| gfx1100 | 7900 XTX | 49.81 tok/s | 76.02 tok/s | +53% |
| gfx1151 | Strix Halo iGPU | 34.78 tok/s | 30.71 tok/s | -13% |
| gfx1201 | R9700 | 84.70 tok/s | 77.23 tok/s | -9% |

Why I'm proposing this as a PR before any kernel work

Lucebox has limited HIP profiling history. Before anyone (us, you, an upstream contributor) commits days to a kernel change, the rocprof evidence + dispatch trace + per-arch sensitivity should be on the record. This doc is that record. If the diagnosis is wrong on any axis, better to surface it via review on a docs PR than to find out mid-kernel-port.

The plan also identifies that Path B (rocWMMA port of flashprefill_kernels.cu) is orthogonal to this — Path B addresses long-context prefill TTFT (compress + target_prefill), not decode tok/s. Both should ship; this doc covers the decode side.

Test plan

  • No code changes — docs only
  • Markdown renders correctly on GitHub
  • Reviewers: sanity-check the dispatch trace at ggml-cuda.cu:2294 and mmvq.cuh:3 against your read of the codebase
  • Reviewers: comment if Tier 2's "extend MMVQ to ncols_dst ≤ 32" approach has prior art / known issues / preferred alternatives in the ggml-org upstream

Companion artifacts

  • Profiling wrapper: scripts/lucebox_kernel_atlas.py (adapts hipfire's Kernel-Atlas methodology to lucebox's test_dflash invocation — runs rocprofv3 kernel-trace, joins with HSACO ISA metadata, emits an Atlas-format row). Not in this PR — would land separately if there's interest.
  • Companion HIP support PR (separate branch, separate review): adds the shim aliases needed for the canonical-fork submodule's HIP path to build cleanly on gfx1010/1030/1100/1151/1201 + adds rocm-smi/UMA-aware boot probe to PFlash.

🤖 Generated with Claude Code

Captures the rocprofv3 evidence from a canonical DFlash bench
(Qwen3.6-27B-Q4_K_M + z-lab DFlash drafter, --fast-rollback --ddtree
--ddtree-budget=22, HE-style 128-tok prompt, n_gen=256) on gfx1100,
gfx1151, and gfx1201, plus the dispatch trace that explains why
DFlash decode lands on MMQ when DDTree budget exceeds
MMVQ_MAX_BATCH_SIZE=8 (mmvq.cuh:3) — and why that path wastes ~31%
of a batch-32 MMA tile on the spec-verify shape.

Four tiers, ranked by effort/impact:

Tier 1 (config-only, 15 min): empirically validated +53% on gfx1100
from --ddtree-budget=8 routing through MMVQ. Win is gfx110x-specific;
gfx1151 (RDNA3.5 UMA) and gfx1201 (RDNA4) prefer the existing
budget=22 path by ~10-13%. Suggested ship: arch-aware default in the
daemon spawn or server.py.

Tier 2 (3-5 days): extend MMVQ template instantiations to
ncols_dst <= 32 for q4_K/q5_K/q6_K/q4_0/q5_0/q5_1/q8_0 on RDNA3+ with
per-arch nwarps tuning. Lets budget=22 route MMVQ on all archs and
recovers the +53% benefit at higher acceptance rate. Upstream-able to
ggml-org/llama.cpp after landing on the dflash fork.

Tier 3 (1-2 weeks): multi-row decode GEMV for q4_K in the
hipfire-style multirow pattern (R=4 output cols/warp, register-packed
batch dim). Projected ~3x over today's gfx1100. Most engineering
work, biggest decode-side payoff.

Tier 4 (3-5 days): scalar-fallback score-blocks kernel for
gfx1010/gfx1030 (no v_dot4 on RDNA1; no WMMA on RDNA2). Unblocks
PFlash on RDNA1/RDNA2 cards where today the kernel hangs or runs ~7x
slower than Strix Halo.

This doc precedes any kernel work — captures the diagnosis so the
choice of which tier to attack first can be made on numbers, not
speculation. The rocprofv3 capture script lives at
.skills/hipfire-kernel-atlas + scripts/lucebox_kernel_atlas.py
(adapted from hipfire's kernel-atlas methodology).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@cubic-dev-ai cubic-dev-ai Bot left a comment


No issues found across 1 file

Three-arch 10-prompt HumanEval A/B against the canonical
Qwen3.6-27B-Q4_K_M + matched z-lab DFlash drafter on the proposed
Tier 2 implementation (Kaden-Schutt/llama.cpp-dflash-ggml@feat/mmvq-rdna3-batch16,
default-off env-gated) shows uniform regression at the default
--ddtree-budget=22 workload:

| GPU | Arch | Budget | Δ tok/s |
|---|---|---|---|
| 7900 XTX | gfx1100 (RDNA3) | 22 | −42.7 % (40.91 → 23.46) |
| R9700 | gfx1201 (RDNA4) | 22 | −68.9 % (77.53 → 24.13) |
| Strix Halo iGPU | gfx1151 (RDNA3.5) | 22 | −43.9 % (26.36 → 14.79) |

Budget=8 cells are bench noise (paths identical at ne[1]=9 since
9 ∉ {1..8, 16, 23}); FP-order drift only.

Why: MMVQ at extended ncols_dst pays activation re-read traffic per
column and forfeits MMQ's WMMA matrix-core throughput. RDNA4 worst
case directly contradicts the doc's prior claim that RDNA4's tile
shapes make wasted columns cheap.

Updates:
- tl;dr notes the falsification + pivot to Tier 3
- Tier 2 section rewritten as a documented negative result with full
  bench table, four-point root-cause analysis, and class-of-result
  context (fifth synth-win→prod-falsify cycle on RDNA3+ ROCm 7.2.x
  per the hipfire catalogue)
- Ranked priority promotes Tier 3 to primary kernel-side lever
- PR sequence drops PR-A / PR-B (Tier 2 expansion), reroutes the
  multi-row GEMV plan to ne[1] ≤ 8 (the regime where MMVQ already
  wins) instead of competing with MMQ above 8

Disposition: feat/mmvq-rdna3-batch16 stays on the fork as a research
artifact for reproduction. Not opened as a perf PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Kaden-Schutt
Author

Update 2026-05-11 — Tier 2 tested and falsified empirically; doc updated in f841b02.

Implemented the proposed Tier 2 approach on a research branch (Kaden-Schutt/llama.cpp-dflash-ggml@feat/mmvq-rdna3-batch16, default-off behind GGML_MMVQ_NO_EXTENDED=1 opt-out) and ran the canonical 10-prompt HumanEval A/B against it across all three RDNA3+ archs available to me. Result:

| GPU | Arch | Budget | baseline (MMQ) tok/s | tier2 (MMVQ extended) tok/s | Δ |
|---|---|---|---|---|---|
| 7900 XTX | gfx1100 (RDNA3) | 22 | 40.91 | 23.46 | −42.7 % |
| 7900 XTX | gfx1100 (RDNA3) | 8 | 62.36 | 65.72 | +5.4 % (noise) |
| R9700 | gfx1201 (RDNA4) | 22 | 77.53 | 24.13 | −68.9 % |
| R9700 | gfx1201 (RDNA4) | 8 | 64.88 | 70.77 | +9.1 % (noise) |
| Strix Halo iGPU | gfx1151 (RDNA3.5) | 22 | 26.36 | 14.79 | −43.9 % |
| Strix Halo iGPU | gfx1151 (RDNA3.5) | 8 | 28.73 | 26.70 | −7.1 % (noise) |

Same binary per arch, GGML_MMVQ_NO_EXTENDED=1 toggled for baseline cell vs tier2 cell, byte-identical pre-tokenized prompts, 27B Q4_K_M + matched z-lab/Qwen3.6-27B-DFlash drafter, n_gen=256, ROCm 7.2.2. Budget=8 cells are within-noise because the new gate doesn't route ne[1]=9 through MMVQ either (9 ∉ {1..8, 16, 23}), so paths are identical modulo FP-reduction-order drift.
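The within-noise claim for the budget=8 cells follows from the gate set itself. A quick sketch of the membership check (the set {1..8, 16, 23} is taken from the text above; the function name and gate shape are hypothetical, not the actual fork code):

```python
# Batch sizes the extended-MMVQ build routes through MMVQ, per the
# text: the original 1..8 plus the new 16 and 23 instantiations.
EXTENDED_MMVQ_NE1 = set(range(1, 9)) | {16, 23}

def routes_to_mmvq(ne1: int, extended: bool) -> bool:
    """Hypothetical gate: baseline stops at 8; extended adds 16 and 23."""
    if not extended:
        return 1 <= ne1 <= 8
    return ne1 in EXTENDED_MMVQ_NE1

# budget=22 workload (ne[1]=23): only the extended build takes MMVQ,
# which is the A/B difference the table above measures.
assert routes_to_mmvq(23, extended=False) is False
assert routes_to_mmvq(23, extended=True) is True
# The budget=8 cells hit the dispatcher at ne[1]=9 (per the text):
# neither build routes 9 through MMVQ, so both cells run the same
# MMQ kernel and differ only by FP-reduction-order drift.
assert routes_to_mmvq(9, extended=False) is False
assert routes_to_mmvq(9, extended=True) is False
```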

Why the projection was wrong — full discussion in the updated Tier 2 section of the plan doc, summary:

  1. MMQ's WMMA matrix cores beat MMVQ's scalar v_dot4 even at the supposedly-wasteful 32-wide tile with ne[1]=23.
  2. MMVQ at extended ncols_dst re-reads activations per column inside the K-block loop — 23× redundant traffic. MMQ stages once to LDS.
  3. Per-thread tmp[23] accumulator increases VGPR pressure enough to drop wave-occupancy on RDNA3.5 and RDNA4.
  4. FP non-associativity between the two reduction orders shifts AL ±7 % across archs even when raw tok/s is held constant.
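Point 2 can be made concrete with a back-of-envelope traffic model. This is purely illustrative arithmetic; nothing here is real kernel code, and the only input taken from the text is ncols_dst = 23:

```python
# Relative activation reads per K-block (arbitrary units):
# - extended MMVQ re-reads the activations once per output column
#   inside the K-block loop;
# - MMQ stages the activation tile to LDS once and reuses it for
#   every column of the tile.
def activation_reads_per_kblock(ncols_dst: int, staged_in_lds: bool) -> int:
    return 1 if staged_in_lds else ncols_dst

mmvq_extended = activation_reads_per_kblock(23, staged_in_lds=False)
mmq = activation_reads_per_kblock(23, staged_in_lds=True)
assert mmvq_extended == 23 * mmq  # the "23x redundant traffic" in point 2
```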

Disposition: dropped Tier 2 from the PR sequence. The feat/mmvq-rdna3-batch16 branch stays on my fork as a documented research artifact (default-off, env-gated reproduction). Pivoted the kernel-side roadmap to Tier 3 (multi-row q4_K decode GEMV at ne[1] ≤ 8) — the hipfire pattern that consistently ships on RDNA3+ without either failure mode.

Doc commit: f841b02. New length 338 lines.


@cubic-dev-ai cubic-dev-ai Bot left a comment


1 issue found across 1 file (changes from recent commits).

<file name="dflash/docs/HIP_PERF_PLAN.md">

<violation number="1" location="dflash/docs/HIP_PERF_PLAN.md:315">
P2: Tier 3 is scoped to the wrong batch range: it says the kernel is for wide decode batches, but the PR-C dispatch line limits it to `ne[1] ≤ 8`, which would miss the stated target workload.</violation>
</file>


Inline review comment on dflash/docs/HIP_PERF_PLAN.md, line 315:

> remaining payoff). Lands a new
> `mul_mat_vec_q_multirow_rdna_<arch>.cu` template alongside the
> existing MMVQ kernel; dispatched from `ggml_cuda_mul_mat` at
> `ne[1] ≤ 8` (the regime where the existing MMVQ already wins) to
> capture R-row sharing without touching the > 8 dispatcher (which
> Tier 2 proved is MMQ's territory).

P2: Tier 3 is scoped to the wrong batch range: it says the kernel is for wide decode batches, but the PR-C dispatch line limits it to `ne[1] ≤ 8`, which would miss the stated target workload.

