
docs(hip): perf diagnosis + 4-tier optimization plan (rocprofv3 evidence on gfx1100/1151/1201)#156

Open
Kaden-Schutt wants to merge 2 commits into Luce-Org:main from Kaden-Schutt:docs/hip-perf-plan

Conversation

@Kaden-Schutt

Summary

Adds dflash/docs/HIP_PERF_PLAN.md — a rocprofv3-grounded diagnosis of where the HIP backend's decode tax lives, plus a ranked 4-tier optimization plan for the Luce-Org/llama.cpp-dflash-ggml fork.

Docs-only PR. Zero behavior change. Intended as the design artifact to review before any kernel work touches the codebase.

Key findings

Captured via rocprofv3 + per-kernel ISA scan on the canonical DFlash bench (Qwen3.6-27B-Q4_K_M + z-lab DFlash drafter, --fast-rollback --ddtree --ddtree-budget=22, HE-style 128-tok prompt md5 4280413edc0b45c2b09e1a45f4f5ee60, n_gen=256):

  1. mul_mat_q<q4_K, 32> dominates — 76% of decode GPU time on gfx1100. Flash attention is 0.5%. The "HIP tax" is in mmq.cuh's MMA path for q4_K/q4_0/q5_0, not in the missing custom flashprefill_kernels.hip.cu.
  2. Dispatch trace: MMVQ_MAX_BATCH_SIZE = 8 (mmvq.cuh:3). DDTree budget=22 → batch>8 → always falls to MMQ, which wastes ~31% of every batch-32 tile on unused columns.
  3. Tier 1 empirically verified: setting --ddtree-budget=8 on gfx1100 lifts decode from 49.81 → 76.02 tok/s (+53%) with zero kernel work. RDNA3.5 (gfx1151) and RDNA4 (gfx1201) prefer the existing budget=22 path; ship is gfx110x-specific.
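The dispatch gate and the proposed arch-aware default can be modeled in a few lines of Python. This is an illustrative sketch only: `MMVQ_MAX_BATCH_SIZE = 8` comes from `mmvq.cuh:3` as cited above, but the function names and the prefix-based arch check are hypothetical, not actual ggml or server.py symbols.

```python
# Model of the dispatch behavior described in finding 2, plus the
# Tier 1 arch-aware default suggested in finding 3. Sketch only.
MMVQ_MAX_BATCH_SIZE = 8  # mmvq.cuh:3

def decode_matmul_path(batch_cols: int) -> str:
    """Which quantized-matmul kernel a decode batch lands on."""
    return "MMVQ" if batch_cols <= MMVQ_MAX_BATCH_SIZE else "MMQ"

def default_ddtree_budget(gfx_arch: str) -> int:
    """Hypothetical arch-aware default for the daemon spawn / server.py."""
    # gfx110x: budget=8 keeps decode on MMVQ (+53% measured on gfx1100);
    # gfx1151 / gfx1201 measured faster on the existing budget=22 MMQ path.
    return 8 if gfx_arch.startswith("gfx110") else 22

assert decode_matmul_path(22) == "MMQ"   # budget=22 exceeds the gate -> MMQ
assert decode_matmul_path(8) == "MMVQ"   # budget=8 stays on MMVQ
assert default_ddtree_budget("gfx1100") == 8
assert default_ddtree_budget("gfx1151") == 22
assert default_ddtree_budget("gfx1201") == 22
```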

3-arch empirical table

n_gen=256 on canonical HE bench, warmup + 2 measurement runs:

| Arch | Card | budget=22 (MMQ) | budget=8 (MMVQ) | Delta |
|---|---|---|---|---|
| gfx1100 | 7900 XTX | 49.81 tok/s | 76.02 tok/s | +53% |
| gfx1151 | Strix Halo iGPU | 34.78 tok/s | 30.71 tok/s | -13% |
| gfx1201 | R9700 | 84.70 tok/s | 77.23 tok/s | -9% |

Why I'm proposing this as a PR before any kernel work

Lucebox has limited HIP profiling history. Before anyone (us, you, an upstream contributor) commits days to a kernel change, the rocprof evidence + dispatch trace + per-arch sensitivity should be on the record. This doc is that record. If the diagnosis is wrong on any axis, better to surface it via review on a docs PR than to find out mid-kernel-port.

The plan also identifies that Path B (rocWMMA port of flashprefill_kernels.cu) is orthogonal to this — Path B addresses long-context prefill TTFT (compress + target_prefill), not decode tok/s. Both should ship; this doc covers the decode side.

Test plan

  • No code changes — docs only
  • Markdown renders correctly on GitHub
  • Reviewers: sanity-check the dispatch trace at ggml-cuda.cu:2294 and mmvq.cuh:3 against your read of the codebase
  • Reviewers: comment if Tier 2's "extend MMVQ to ncols_dst ≤ 32" approach has prior art / known issues / preferred alternatives in the ggml-org upstream

Companion artifacts

  • Profiling wrapper: scripts/lucebox_kernel_atlas.py (adapts hipfire's Kernel-Atlas methodology to lucebox's test_dflash invocation — runs rocprofv3 kernel-trace, joins with HSACO ISA metadata, emits an Atlas-format row). Not in this PR — would land separately if there's interest.
  • Companion HIP support PR (separate branch, separate review): adds the shim aliases needed for the canonical-fork submodule's HIP path to build cleanly on gfx1010/1030/1100/1151/1201 + adds rocm-smi/UMA-aware boot probe to PFlash.

🤖 Generated with Claude Code

Captures the rocprofv3 evidence from a canonical DFlash bench
(Qwen3.6-27B-Q4_K_M + z-lab DFlash drafter, --fast-rollback --ddtree
--ddtree-budget=22, HE-style 128-tok prompt, n_gen=256) on gfx1100,
gfx1151, and gfx1201, plus the dispatch trace that explains why
DFlash decode lands on MMQ when DDTree budget exceeds
MMVQ_MAX_BATCH_SIZE=8 (mmvq.cuh:3) — and why that path wastes ~31%
of a batch-32 MMA tile on the spec-verify shape.

Four tiers, ranked by effort/impact:

Tier 1 (config-only, 15 min): empirically validated +53% on gfx1100
from --ddtree-budget=8 routing through MMVQ. Win is gfx110x-specific;
gfx1151 (RDNA3.5 UMA) and gfx1201 (RDNA4) prefer the existing
budget=22 path by ~10-13%. Suggested ship: arch-aware default in the
daemon spawn or server.py.

Tier 2 (3-5 days): extend MMVQ template instantiations to
ncols_dst <= 32 for q4_K/q5_K/q6_K/q4_0/q5_0/q5_1/q8_0 on RDNA3+ with
per-arch nwarps tuning. Lets budget=22 route MMVQ on all archs and
recovers the +53% benefit at higher acceptance rate. Upstream-able to
ggml-org/llama.cpp after landing on the dflash fork.

Tier 3 (1-2 weeks): multi-row decode GEMV for q4_K in the
hipfire-style multirow pattern (R=4 output cols/warp, register-packed
batch dim). Projected ~3x over today's gfx1100. Most engineering
work, biggest decode-side payoff.

Tier 4 (3-5 days): scalar-fallback score-blocks kernel for
gfx1010/gfx1030 (no v_dot4 on RDNA1; no WMMA on RDNA2). Unblocks
PFlash on RDNA1/RDNA2 cards where today the kernel hangs or runs ~7x
slower than Strix Halo.

This doc precedes any kernel work — captures the diagnosis so the
choice of which tier to attack first can be made on numbers, not
speculation. The rocprofv3 capture script lives at
.skills/hipfire-kernel-atlas + scripts/lucebox_kernel_atlas.py
(adapted from hipfire's kernel-atlas methodology).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@cubic-dev-ai cubic-dev-ai Bot left a comment


No issues found across 1 file

Three-arch 10-prompt HumanEval A/B against the canonical
Qwen3.6-27B-Q4_K_M + matched z-lab DFlash drafter on the proposed
Tier 2 implementation (Kaden-Schutt/llama.cpp-dflash-ggml@feat/mmvq-rdna3-batch16,
default-off env-gated) shows uniform regression at the default
--ddtree-budget=22 workload:

| GPU | Arch | Budget | Δ tok/s |
|---|---|---|---|
| 7900 XTX | gfx1100 (RDNA3) | 22 | −42.7 % (40.91 → 23.46) |
| R9700 | gfx1201 (RDNA4) | 22 | −68.9 % (77.53 → 24.13) |
| Strix Halo iGPU | gfx1151 (RDNA3.5) | 22 | −43.9 % (26.36 → 14.79) |

Budget=8 cells are bench noise (paths identical at ne[1]=9 since
9 ∉ {1..8, 16, 23}); FP-order drift only.

Why: MMVQ at extended ncols_dst pays activation re-read traffic per
column and forfeits MMQ's WMMA matrix-core throughput. RDNA4 worst
case directly contradicts the doc's prior claim that RDNA4's tile
shapes make wasted columns cheap.

Updates:
- tl;dr notes the falsification + pivot to Tier 3
- Tier 2 section rewritten as a documented negative result with full
  bench table, four-point root-cause analysis, and class-of-result
  context (fifth synth-win→prod-falsify cycle on RDNA3+ ROCm 7.2.x
  per the hipfire catalogue)
- Ranked priority promotes Tier 3 to primary kernel-side lever
- PR sequence drops PR-A / PR-B (Tier 2 expansion), reroutes the
  multi-row GEMV plan to ne[1] ≤ 8 (the regime where MMVQ already
  wins) instead of competing with MMQ above 8

Disposition: feat/mmvq-rdna3-batch16 stays on the fork as a research
artifact for reproduction. Not opened as a perf PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Kaden-Schutt
Author

Update 2026-05-11 — Tier 2 tested and falsified empirically; doc updated in f841b02.

Implemented the proposed Tier 2 approach on a research branch (Kaden-Schutt/llama.cpp-dflash-ggml@feat/mmvq-rdna3-batch16, default-off behind GGML_MMVQ_NO_EXTENDED=1 opt-out) and ran the canonical 10-prompt HumanEval A/B against it across all three RDNA3+ archs available to me. Result:

| GPU | Arch | Budget | baseline (MMQ) tok/s | tier2 (MMVQ extended) tok/s | Δ |
|---|---|---|---|---|---|
| 7900 XTX | gfx1100 (RDNA3) | 22 | 40.91 | 23.46 | −42.7 % |
| 7900 XTX | gfx1100 (RDNA3) | 8 | 62.36 | 65.72 | +5.4 % (noise) |
| R9700 | gfx1201 (RDNA4) | 22 | 77.53 | 24.13 | −68.9 % |
| R9700 | gfx1201 (RDNA4) | 8 | 64.88 | 70.77 | +9.1 % (noise) |
| Strix Halo iGPU | gfx1151 (RDNA3.5) | 22 | 26.36 | 14.79 | −43.9 % |
| Strix Halo iGPU | gfx1151 (RDNA3.5) | 8 | 28.73 | 26.70 | −7.1 % (noise) |

Same binary per arch, GGML_MMVQ_NO_EXTENDED=1 toggled for baseline cell vs tier2 cell, byte-identical pre-tokenized prompts, 27B Q4_K_M + matched z-lab/Qwen3.6-27B-DFlash drafter, n_gen=256, ROCm 7.2.2. Budget=8 cells are within-noise because the new gate doesn't route ne[1]=9 through MMVQ either (9 ∉ {1..8, 16, 23}), so paths are identical modulo FP-reduction-order drift.
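The within-noise claim for the budget=8 cells follows from the gate set itself. A quick sketch of the membership check (the set {1..8, 16, 23} is taken from the text above; the function name and gate shape are hypothetical, not the actual fork code):

```python
# Batch sizes the extended-MMVQ build routes through MMVQ, per the
# text: the original 1..8 plus the new 16 and 23 instantiations.
EXTENDED_MMVQ_NE1 = set(range(1, 9)) | {16, 23}

def routes_to_mmvq(ne1: int, extended: bool) -> bool:
    """Hypothetical gate: baseline stops at 8; extended adds 16 and 23."""
    if not extended:
        return 1 <= ne1 <= 8
    return ne1 in EXTENDED_MMVQ_NE1

# budget=22 workload (ne[1]=23): only the extended build takes MMVQ,
# which is the A/B difference the table above measures.
assert routes_to_mmvq(23, extended=False) is False
assert routes_to_mmvq(23, extended=True) is True
# The budget=8 cells hit the dispatcher at ne[1]=9 (per the text):
# neither build routes 9 through MMVQ, so both cells run the same
# MMQ kernel and differ only by FP-reduction-order drift.
assert routes_to_mmvq(9, extended=False) is False
assert routes_to_mmvq(9, extended=True) is False
```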

Why the projection was wrong — full discussion in the updated Tier 2 section of the plan doc, summary:

  1. MMQ's WMMA matrix cores beat MMVQ's scalar v_dot4 even at the supposedly-wasteful 32-wide tile with ne[1]=23.
  2. MMVQ at extended ncols_dst re-reads activations per column inside the K-block loop — 23× redundant traffic. MMQ stages once to LDS.
  3. Per-thread tmp[23] accumulator increases VGPR pressure enough to drop wave-occupancy on RDNA3.5 and RDNA4.
  4. FP non-associativity between the two reduction orders shifts AL ±7 % across archs even when raw tok/s is held constant.
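Point 2 can be made concrete with a back-of-envelope traffic model. This is purely illustrative arithmetic; nothing here is real kernel code, and the only input taken from the text is ncols_dst = 23:

```python
# Relative activation reads per K-block (arbitrary units):
# - extended MMVQ re-reads the activations once per output column
#   inside the K-block loop;
# - MMQ stages the activation tile to LDS once and reuses it for
#   every column of the tile.
def activation_reads_per_kblock(ncols_dst: int, staged_in_lds: bool) -> int:
    return 1 if staged_in_lds else ncols_dst

mmvq_extended = activation_reads_per_kblock(23, staged_in_lds=False)
mmq = activation_reads_per_kblock(23, staged_in_lds=True)
assert mmvq_extended == 23 * mmq  # the "23x redundant traffic" in point 2
```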

Disposition: dropped Tier 2 from the PR sequence. The feat/mmvq-rdna3-batch16 branch stays on my fork as a documented research artifact (default-off, env-gated reproduction). Pivoted the kernel-side roadmap to Tier 3 (multi-row q4_K decode GEMV at ne[1] ≤ 8) — the hipfire pattern that consistently ships on RDNA3+ without either failure mode.

Doc commit: f841b02. New length 338 lines.


@cubic-dev-ai cubic-dev-ai Bot left a comment


1 issue found across 1 file (changes from recent commits).

<file name="dflash/docs/HIP_PERF_PLAN.md">

<violation number="1" location="dflash/docs/HIP_PERF_PLAN.md:315">
P2: Tier 3 is scoped to the wrong batch range: it says the kernel is for wide decode batches, but the PR-C dispatch line limits it to `ne[1] ≤ 8`, which would miss the stated target workload.</violation>
</file>


Inline review comment on dflash/docs/HIP_PERF_PLAN.md, line 315:

> remaining payoff). Lands a new
> `mul_mat_vec_q_multirow_rdna_<arch>.cu` template alongside the
> existing MMVQ kernel; dispatched from `ggml_cuda_mul_mat` at
> `ne[1] ≤ 8` (the regime where the existing MMVQ already wins) to
> capture R-row sharing without touching the > 8 dispatcher (which
> Tier 2 proved is MMQ's territory).

P2: Tier 3 is scoped to the wrong batch range: it says the kernel is for wide decode batches, but the PR-C dispatch line limits it to `ne[1] ≤ 8`, which would miss the stated target workload.

