docs(hip): perf diagnosis + 4-tier optimization plan (rocprofv3 evidence on gfx1100/1151/1201) #156

Kaden-Schutt wants to merge 2 commits into
Conversation
Captures the rocprofv3 evidence from a canonical DFlash bench (Qwen3.6-27B-Q4_K_M + z-lab DFlash drafter, --fast-rollback --ddtree --ddtree-budget=22, HE-style 128-tok prompt, n_gen=256) on gfx1100, gfx1151, and gfx1201, plus the dispatch trace that explains why DFlash decode lands on MMQ when the DDTree budget exceeds MMVQ_MAX_BATCH_SIZE=8 (mmvq.cuh:3) — and why that path wastes ~31% of a batch-32 MMA tile on the spec-verify shape.

Four tiers, ranked by effort/impact:

- Tier 1 (config-only, 15 min): empirically validated +53% on gfx1100 from --ddtree-budget=8 routing through MMVQ. The win is gfx110x-specific; gfx1151 (RDNA3.5 UMA) and gfx1201 (RDNA4) prefer the existing budget=22 path by ~10-13%. Suggested ship: arch-aware default in the daemon spawn or server.py.
- Tier 2 (3-5 days): extend the MMVQ template instantiations to ncols_dst <= 32 for q4_K/q5_K/q6_K/q4_0/q5_0/q5_1/q8_0 on RDNA3+ with per-arch nwarps tuning. Lets budget=22 route through MMVQ on all archs and recovers the +53% benefit at a higher acceptance rate. Upstream-able to ggml-org/llama.cpp after landing on the dflash fork.
- Tier 3 (1-2 weeks): multi-row decode GEMV for q4_K in the hipfire-style multirow pattern (R=4 output cols/warp, register-packed batch dim). Projected ~3x over today's gfx1100. The most engineering work, but the biggest decode-side payoff.
- Tier 4 (3-5 days): scalar-fallback score-blocks kernel for gfx1010/gfx1030 (no v_dot4 on RDNA1; no WMMA on RDNA2). Unblocks PFlash on RDNA1/RDNA2 cards where the kernel today hangs or runs ~7x slower than Strix Halo.

This doc precedes any kernel work: it captures the diagnosis so the choice of which tier to attack first can be made on numbers, not speculation. The rocprofv3 capture script lives at .skills/hipfire-kernel-atlas + scripts/lucebox_kernel_atlas.py (adapted from hipfire's kernel-atlas methodology).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
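Tier 1's "arch-aware default" can be sketched as a small routing helper. This is a hypothetical sketch: `pick_ddtree_budget` and its placement in the daemon spawn arguments are illustrative names, not the shipped API; the per-arch numbers are the ones reported above.

```python
def pick_ddtree_budget(gcn_arch: str) -> int:
    """Hypothetical arch-aware default for --ddtree-budget.

    gfx110x (RDNA3 dGPUs) gained +53% decode tok/s at budget=8, because
    batches <= 8 route through MMVQ instead of MMQ. gfx1151 (RDNA3.5 UMA)
    and gfx1201 (RDNA4) preferred the existing budget=22 path by ~10-13%.
    """
    if gcn_arch.startswith("gfx110"):  # gfx1100/1101/1102: MMVQ-friendly
        return 8
    return 22                          # gfx1151, gfx1201: keep budget=22


# Example of how a daemon spawn might consume it (argument list illustrative).
args = ["--fast-rollback", "--ddtree",
        f"--ddtree-budget={pick_ddtree_budget('gfx1100')}"]
```

The point of centralizing the default in one helper is that a future Tier 2/3 kernel landing only has to change this one table, not every spawn site.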
Three-arch 10-prompt HumanEval A/B against the canonical
Qwen3.6-27B-Q4_K_M + matched z-lab DFlash drafter on the proposed
Tier 2 implementation (Kaden-Schutt/llama.cpp-dflash-ggml@feat/mmvq-rdna3-batch16,
default-off env-gated) shows uniform regression at the default
--ddtree-budget=22 workload:
GPU               Arch                Budget   Δ tok/s
7900 XTX          gfx1100 (RDNA3)     22       −42.7 % (40.91 → 23.46)
R9700             gfx1201 (RDNA4)     22       −68.9 % (77.53 → 24.13)
Strix Halo iGPU   gfx1151 (RDNA3.5)   22       −43.9 % (26.36 → 14.79)
Budget=8 cells are bench noise (paths identical at ne[1]=9 since
9 ∉ {1..8, 16, 23}); FP-order drift only.
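The bench-noise claim follows from the dispatch gate: with MMVQ_MAX_BATCH_SIZE = 8 (mmvq.cuh:3), an ne[1] that is neither ≤ 8 nor one of the Tier 2 branch's extra instantiations falls to MMQ on both binaries, so at ne[1]=9 the two budgets exercise identical kernels. A toy model of that routing follows; the instantiation set {1..8, 16, 23} is taken from this commit message, not read out of the code.

```python
MMVQ_MAX_BATCH_SIZE = 8                       # mmvq.cuh:3
TIER2_NCOLS = set(range(1, 9)) | {16, 23}     # per this commit message

def route(ne1: int, tier2: bool = False) -> str:
    """Toy model: which mul_mat kernel a decode batch of ne1 lands on."""
    if tier2:
        return "MMVQ" if ne1 in TIER2_NCOLS else "MMQ"
    return "MMVQ" if ne1 <= MMVQ_MAX_BATCH_SIZE else "MMQ"

# ne[1]=9 routes identically on both branches, so any budget=8 delta
# between baseline and feat/mmvq-rdna3-batch16 is FP-order drift only.
assert route(9) == route(9, tier2=True) == "MMQ"
```

The regression rows above are the cells where the branches do diverge: the Tier 2 branch routes ne[1]=23 to MMVQ while the baseline keeps it on MMQ.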
Why: MMVQ at extended ncols_dst pays activation re-read traffic per
column and forfeits MMQ's WMMA matrix-core throughput. The RDNA4 worst
case directly contradicts the doc's prior claim that RDNA4's tile
shapes make wasted columns cheap.
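The "~31% of a batch-32 tile" figure quoted in the diagnosis is back-of-envelope arithmetic on the tile shape. The sketch below assumes the spec-verify batch fills 22 of the 32 tile columns, which reproduces the ~31% number; if the verify batch is budget+1 (as the ne[1]=9 note for budget=8 suggests), the waste is 9/32 ≈ 28% instead.

```python
TILE_COLS = 32   # batch-32 MMA tile width in mmq.cuh's MMA path
ne1 = 22         # assumed spec-verify batch at --ddtree-budget=22

wasted = TILE_COLS - ne1                  # columns computed but unused
waste_pct = 100.0 * wasted / TILE_COLS
assert abs(waste_pct - 31.25) < 1e-9      # the "~31%" in the diagnosis
```

Either way the waste is a property of the fixed tile width, which is why Tier 2 tried to avoid the tile entirely rather than shrink it.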
Updates:
- tl;dr notes the falsification + pivot to Tier 3
- Tier 2 section rewritten as a documented negative result with full
bench table, four-point root-cause analysis, and class-of-result
context (fifth synth-win→prod-falsify cycle on RDNA3+ ROCm 7.2.x
per the hipfire catalogue)
- Ranked priority promotes Tier 3 to primary kernel-side lever
- PR sequence drops PR-A / PR-B (Tier 2 expansion), reroutes the
multi-row GEMV plan to ne[1] ≤ 8 (the regime where MMVQ already
wins) instead of competing with MMQ above 8
Disposition: feat/mmvq-rdna3-batch16 stays on the fork as a research
artifact for reproduction. Not opened as a perf PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
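The multirow pattern the plan reroutes to ne[1] ≤ 8 (R=4 output rows sharing each activation read) can be illustrated with a toy scalar GEMV. This is a hypothetical Python sketch of the access pattern only, not the q4_K kernel: shapes and names are illustrative, and the real kernel holds quantized blocks in registers.

```python
R = 4  # output rows per "warp" pass (the R=4 multirow pattern)

def gemv_multirow(W, x):
    """y = W @ x, computing R rows per pass so each element of x is
    loaded once per group of R rows instead of once per row."""
    n_rows, n_cols = len(W), len(x)
    y = [0.0] * n_rows
    for r0 in range(0, n_rows, R):
        rows = range(r0, min(r0 + R, n_rows))
        for j in range(n_cols):
            xj = x[j]               # single activation read...
            for r in rows:          # ...shared by up to R output rows
                y[r] += W[r][j] * xj
    return y

W = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
x = [1.0, 0.5]
assert gemv_multirow(W, x) == [2.0, 5.0, 8.0, 11.0, 14.0]
```

Compared with a row-at-a-time GEMV, activation traffic drops by roughly a factor of R, which is the mechanism behind the projected decode-side win at small ne[1].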
Update 2026-05-11 — Tier 2 tested and falsified empirically; doc updated in […]. Implemented the proposed Tier 2 approach on a research branch ([…]). Same binary per arch ([…]). Why the projection was wrong — full discussion in the updated Tier 2 section of the plan doc; summary in the commit message above.

Disposition: dropped Tier 2 from the PR sequence; the branch stays on the fork as a research artifact. Doc commit: […]
1 issue found across 1 file (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="dflash/docs/HIP_PERF_PLAN.md">
<violation number="1" location="dflash/docs/HIP_PERF_PLAN.md:315">
P2: Tier 3 is scoped to the wrong batch range: it says the kernel is for wide decode batches, but the PR-C dispatch line limits it to `ne[1] ≤ 8`, which would miss the stated target workload.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
remaining payoff). Lands a new
`mul_mat_vec_q_multirow_rdna_<arch>.cu` template alongside the
existing MMVQ kernel; dispatched from `ggml_cuda_mul_mat` at
`ne[1] ≤ 8` (the regime where the existing MMVQ already wins) to
P2: Tier 3 is scoped to the wrong batch range: it says the kernel is for wide decode batches, but the PR-C dispatch line limits it to ne[1] ≤ 8, which would miss the stated target workload.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/docs/HIP_PERF_PLAN.md, line 315:
<comment>Tier 3 is scoped to the wrong batch range: it says the kernel is for wide decode batches, but the PR-C dispatch line limits it to `ne[1] ≤ 8`, which would miss the stated target workload.</comment>
<file context>
@@ -203,16 +306,24 @@ should ship.
+ remaining payoff). Lands a new
+ `mul_mat_vec_q_multirow_rdna_<arch>.cu` template alongside the
+ existing MMVQ kernel; dispatched from `ggml_cuda_mul_mat` at
+ `ne[1] ≤ 8` (the regime where the existing MMVQ already wins) to
+ capture R-row sharing without touching the > 8 dispatcher (which
+ Tier 2 proved is MMQ's territory).
</file context>
Summary

Adds `dflash/docs/HIP_PERF_PLAN.md` — a rocprofv3-grounded diagnosis of where the HIP backend's decode tax lives, plus a ranked 4-tier optimization plan for the `Luce-Org/llama.cpp-dflash-ggml` fork.

Docs-only PR. Zero behavior change. Intended as the design artifact to review before any kernel work touches the codebase.
Key findings
Captured via rocprofv3 + per-kernel ISA scan on the canonical DFlash bench (Qwen3.6-27B-Q4_K_M + z-lab DFlash drafter, `--fast-rollback --ddtree --ddtree-budget=22`, HE-style 128-tok prompt md5 `4280413edc0b45c2b09e1a45f4f5ee60`, n_gen=256):

- `mul_mat_q<q4_K, 32>` dominates — 76% of decode GPU time on gfx1100. Flash attention is 0.5%. The "HIP tax" is in `mmq.cuh`'s MMA path for `q4_K`/`q4_0`/`q5_0`, not in the missing custom `flashprefill_kernels.hip.cu`.
- `MMVQ_MAX_BATCH_SIZE = 8` (`mmvq.cuh:3`). DDTree budget=22 → batch > 8 → always falls to MMQ, which wastes ~31% of every batch-32 tile on unused columns.
- `--ddtree-budget=8` on gfx1100 lifts decode from 49.81 → 76.02 tok/s (+53%) with zero kernel work. RDNA3.5 (gfx1151) and RDNA4 (gfx1201) prefer the existing budget=22 path; the ship is gfx110x-specific.

3-arch empirical table
n_gen=256 on canonical HE bench, warmup + 2 measurement runs:
Why I'm proposing this as a PR before any kernel work
Lucebox has limited HIP profiling history. Before anyone (us, you, an upstream contributor) commits days to a kernel change, the rocprof evidence + dispatch trace + per-arch sensitivity should be on the record. This doc is that record. If the diagnosis is wrong on any axis, better to surface it via review on a docs PR than to find out mid-kernel-port.
The plan also identifies that Path B (rocWMMA port of `flashprefill_kernels.cu`) is orthogonal to this — Path B addresses long-context prefill TTFT (compress + target_prefill), not decode tok/s. Both should ship; this doc covers the decode side.

Test plan

- `ggml-cuda.cu:2294` and `mmvq.cuh:3` — check against your read of the codebase

Companion artifacts
`scripts/lucebox_kernel_atlas.py` (adapts hipfire's Kernel-Atlas methodology to lucebox's `test_dflash` invocation — runs a rocprofv3 kernel-trace, joins it with HSACO ISA metadata, and emits an Atlas-format row). Not in this PR — would land separately if there's interest.

🤖 Generated with Claude Code