Add HIP/ROCm support for Strix Halo (gfx1151)#119
Conversation
Companion to the short-context RTX 5090 section (HumanEval / Math500 / GSM8K, added in Luce-Org#86). Different layer of validation: short-context tests pure speculative decoding at 50–250 token prompts (PFlash compression not engaged), long-context tests the full PFlash drafter scoring + ~20× compression + DFlash decode pipeline at 117K tokens. Validated config: keep_ratio=0.05, alpha=0.70, ddtree_budget=22, fa_window=4096, kv_tq3=0 (FP16 KV — 5090 has the VRAM headroom). 20/20 NIAH across 2 runs of n=10. Decode avg 210.7 t/s sustained over 117K context. Notable cross-reference: my long-context budget sweep converges on budget=22 as throughput-optimal, matching Luce-Org#86's short-context finding. This makes budget=22 a stable default across context regimes for Qwen3.6-27B on Blackwell, not a knob that needs per-regime tuning. Includes notes on kv_tq3 (skip on 32 GB at this context, keep on 24 GB) and alpha (the docs default of 0.85 fails 2/10 NIAH at this setup; 0.60–0.70 pass 10/10). Depends on Luce-Org#86. This PR is a pure append to dflash/RESULTS.md after the same anchor point Luce-Org#86 appends to, so the rebase once Luce-Org#86 lands is mechanical — my section just stacks below Luce-Org#86's section.
…ance
Six review-feedback fixes on the new RTX 5090 long-context section:
1-3. Throughput unit consistency: rename Decode t/s → Decode tok/s in all
three throughput tables (headline, budget sweep, alpha sweep) to
match the rest of dflash/RESULTS.md and the short-context section.
4. Runtime config interface ambiguity: replace shorthand names
(keep_ratio=, alpha=, ddtree_budget=, ...) with the actual
reproducible interfaces — bench_niah_cpp.py CLI flags added in Luce-Org#90
(--keep-ratio=, --alpha=, --ddtree-budget=, --fa-window=, --kv-tq3=,
--n-gen=) and daemon env vars (DFLASH_FP_USE_BSA=, optional
DFLASH27B_FA_WINDOW=). One concrete way to reproduce, not three
shorthand aliases.
5. niah_gen.py path: reference becomes a relative markdown link to
../pflash/tests/niah_gen.py.
6. P2 — sweep provenance and apparent headline-vs-table mismatch:
the headline 210.7 tok/s comes from Phase 4 (n=20 reliability run at
the V4 config: keep=0.05/alpha=0.70/budget=22), but the budget table
was at keep=0.08/budget=varying and the alpha table was at
keep=0.08/budget=28/alpha=varying. A reader who notices both
"budget=22 → 217.4" (sweep) and "budget=22 → 210.7" (headline)
would correctly conclude the section is mixing configs.
Fix: each sweep table now declares its held-fixed parameters in the
subsection title (Phase 1 / 2 / 3, with "held: ..."). Added the
missing keep-ratio sweep (Phase 3) so the reader can see why
keep=0.05 was chosen over the per-token-faster keep=0.08 (TTFT and
compression-ratio tradeoff). New paragraph between the headline and
the sweeps spells out the methodology so the cross-table comparison
is grounded.
… Q8_0
Four review-feedback fixes on the new long-context section:
1+3. The runtime config and the alpha-sweep section both documented
`--alpha=0.70` as a CLI flag, but bench_niah_cpp.py on main does
not have a --alpha flag (that flag is in the still-open
bsa-cli-flag PR). Alpha is configured via the daemon env var
DFLASH_FP_ALPHA. Renamed all references:
- Runtime config bullet now reads
"DFLASH_FP_USE_BSA=1 and DFLASH_FP_ALPHA=0.70 ... (both are
daemon env vars)".
- Phase 1 section heading + table column header changed from
"--alpha" to "DFLASH_FP_ALPHA".
- "The docs default of --alpha=0.85" → "DFLASH_FP_ALPHA=0.85".
- Phase 2 + Phase 3 "held:" lists also corrected from
"--alpha=0.70" to "DFLASH_FP_ALPHA=0.70".
2. Runtime config preface said "the bracketed names are the exact
interfaces" but no brackets in the list. Reworded to "each bullet
leads with the exact interface" so prose matches presentation.
4. KV-cache type relabeled from FP16 → Q8_0. The reviewer correctly
notes that --kv-tq3=0 alone leaves the daemon at its Q8_0 KV
default (per dflash/src/kv_quant.h); FP16 would require setting
DFLASH27B_KV_K=f16 / DFLASH27B_KV_V=f16 explicitly, which the
recorded runtime config does not include. Conservative
interpretation: the actual benchmark used Q8_0 (the daemon
default). Updated:
- Runtime config bullet: "Q8_0 KV cache — the daemon default
when TQ3_0 is disabled and no other KV type is set".
- "Note on --kv-tq3" subsection: explains the Q8_0 default and
points users at DFLASH27B_KV_K/V for explicit overrides.
The "5090 has VRAM headroom; TQ3_0 not needed" rationale holds for
either Q8_0 or FP16, so the section's argument is unaffected.
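
To make the precedence concrete, here is a minimal C++ sketch of the resolution order described above: legacy TQ3 flag first, then the explicit `DFLASH27B_KV_K`/`DFLASH27B_KV_V` overrides, else the Q8_0 default. The function and enum names are illustrative; the real logic lives in dflash/src/kv_quant.h and may differ in detail.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical sketch of the precedence described above, not the daemon's code.
enum class KvType { Q8_0, F16, TQ3_0 };

static KvType resolve_kv_type(bool kv_tq3_flag, const char* env_name) {
    if (kv_tq3_flag) return KvType::TQ3_0;           // legacy --kv-tq3=1 path
    const char* v = std::getenv(env_name);           // DFLASH27B_KV_K / DFLASH27B_KV_V
    if (v && std::strcmp(v, "f16") == 0) return KvType::F16;
    if (v && std::strncmp(v, "tq3", 3) == 0) return KvType::TQ3_0;
    return KvType::Q8_0;                             // daemon default when nothing else is set
}
```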
9 issues found across 20 files
<file name="dflash/src/prefix_cache.cpp">
<violation number="1" location="dflash/src/prefix_cache.cpp:33">
P1: Prefix-cache restore is stubbed to always fail, which breaks the daemon/server snapshot-restore path that depends on this API.</violation>
</file>
<file name="pflash/.gitignore">
<violation number="1" location="pflash/.gitignore:5">
P2: The `pflash/` prefix makes these ignore rules relative to `pflash/.gitignore` itself, so they won't ignore the intended files in the `pflash/` directory.</violation>
</file>
<file name="dflash/CMakeLists.txt">
<violation number="1" location="dflash/CMakeLists.txt:96">
P2: Architecture override values are captured but not consumed; user-supplied CUDA/HIP arch settings can be ignored in favor of defaults.</violation>
</file>
<file name="dflash/src/qwen3_drafter.cpp">
<violation number="1" location="dflash/src/qwen3_drafter.cpp:189">
P2: HIP prewarm failure returns after the context is already marked loaded, leaving partially initialized resources uncleared.</violation>
<violation number="2" location="dflash/src/qwen3_drafter.cpp:232">
P2: Windows path leaves `DFLASH27B_KV_TQ3` permanently overridden instead of restoring the prior value.</violation>
<violation number="3" location="dflash/src/qwen3_drafter.cpp:615">
P2: Qwen3.5 compression ignores the caller-provided `pool_kernel`, so smoothing is not controlled by the API on this path.</violation>
</file>
<file name="dflash/src/qwen3_0p6b_graph.cpp">
<violation number="1" location="dflash/src/qwen3_0p6b_graph.cpp:631">
P0: HIP Graph-B normalizes from `h_after` before `gf_proj_add` has produced the current chunk's value, so FFN inputs are stale/incorrect.</violation>
</file>
<file name="dflash/src/gguf_target_loader.cpp">
<violation number="1" location="dflash/src/gguf_target_loader.cpp:534">
P2: `out.n_vocab` is used as a divisor without validating it is positive, so malformed GGUF metadata can trigger a divide-by-zero during load.</violation>
</file>
<file name="dflash/src/qwen35_target_graph.cpp">
<violation number="1" location="dflash/src/qwen35_target_graph.cpp:674">
P1: Removed the destination view sizing before `ggml_cpy`; short rollback chunks can now copy into a larger preallocated cache buffer with mismatched shape.</violation>
</file>
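
For the gguf_target_loader divide-by-zero finding, the expected shape of the fix is a hard guard on the divisor before the load proceeds. A sketch, with the struct and error handling invented for illustration:

```cpp
#include <cstdio>

struct TargetMeta { int n_vocab; };   // stand-in for the loader's real metadata struct

// Illustrative guard only: validate the divisor before any division uses it.
static bool validate_vocab(const TargetMeta& out) {
    if (out.n_vocab <= 0) {
        std::fprintf(stderr, "gguf_target_loader: invalid n_vocab=%d\n", out.n_vocab);
        return false;   // fail the load instead of dividing by zero later
    }
    return true;
}
```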
…prefill-last-logits fix(dflash): resolve target split prefill OOM from full-prompt logits
@smpurkis thanks for the great contribution! We can't wait to integrate AMD; can you check and fix the P0–P1 from cubic?
Imports rocWMMA-native flashprefill kernels (mean / score-GEMM / select / sparse-FA) from PR Luce-Org#117 behind DFLASH27B_HIP_SM80_EQUIV=ON. Phase 1 (default) keeps the ggml q8 fallback unchanged. On gfx1151 / ROCm 7.2 the FP-kernel speedup vs Phase 1 grows with seq length: 2.2x@2K, 3.9x@8K, 5.0x@16K. End-to-end NIAH compress at S=7270 goes 4.21s -> 2.70s (1.56x), accuracy 1/1 on both phases, output byte-identical. Bug fixed in the same commit: use_bf16_fp was hardcoded false on HIP in qwen3_0p6b_graph.cpp, so the new kernels were linked but never reached. Gate now keys off DFLASH27B_HAVE_FLASHPREFILL && MIN_SM>=80. Also addresses cubic-dev-ai review on PR Luce-Org#119: - P0: graph-B reorder in qwen3_0p6b (gf_proj_add before reading h_after) - P1: restore prefix-cache impls in qwen35_target_graph; drop the stub prefix_cache.cpp - P1: cap->conv_input view sizing - P2: rename DFLASH27B_USER_CUDA_ARCHITECTURES -> _GPU_ for HIP - P2: n_vocab guard in gguf_target_loader - P2: free_drafter on prewarm fail; portable env restore via lambda (Windows _dupenv_s/_putenv_s); thread pool_kernel through qwen35_score_and_compress - P2: drop broken pflash/.gitignore Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
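
The "portable env restore via lambda" item amounts to the save/override/restore pattern sketched below. This is a hedged sketch: it assumes nothing about the actual call sites in qwen3_drafter.cpp beyond the Windows `_dupenv_s`/`_putenv_s` pairing the commit names.

```cpp
#include <cstdlib>
#include <functional>
#include <string>

// Sketch of a portable override-then-restore helper; call the returned
// closure exactly once to restore the prior state.
static std::function<void()> override_env(const char* name, const char* value) {
#ifdef _WIN32
    char* old = nullptr;
    size_t len = 0;
    _dupenv_s(&old, &len, name);                 // copy prior value (may be null)
    _putenv_s(name, value);
    return [name, old]() {
        _putenv_s(name, old ? old : "");         // empty string clears the var on Windows
        free(old);
    };
#else
    const char* prev = std::getenv(name);
    std::string old = prev ? prev : "";
    bool had = prev != nullptr;
    setenv(name, value, 1);
    return [name, old, had]() {
        if (had) setenv(name, old.c_str(), 1);
        else     unsetenv(name);
    };
#endif
}

// Usage: auto restore = override_env("DFLASH27B_KV_TQ3", "1"); ... restore();
```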
Thanks for the review! All P0–P2 from cubic addressed in commit f285bff. Re-validated post-fix on gfx1151 / ROCm 7.2.
RESULTS.md: add RTX 5090 long-context NIAH validation section
Nice work pushing the ROCm/HIP path forward. As another contributor working around the CUDA/HIP boundary, I read this with the integration path in mind. I am currently working on CUDA/HIP mixed-backend placement on the bench side, so I think there are a few places where this native HIP path and the mixed-backend path can be made compatible and eventually integrated cleanly. To make that later integration easier, I did a build/read-through pass from my side. In a local build-only check with ROCm 7.2.1, HIP Phase 1 configured and …
I locally reproduced a HIP build failure here. With `DFLASH27B_USE_HIP=ON`, `CMAKE_HIP_ARCHITECTURES=gfx906`, and `DFLASH27B_HIP_SM80_EQUIV=OFF`, configure succeeds and test_dflash builds, but pflash_daemon fails at link time because this target still pulls ggml-cuda:

```
/usr/bin/ld: cannot find -lggml-cuda: No such file or directory
```

Could this link against the selected backend target, e.g. `${_dflash27b_ggml_backend_lib}`, or be guarded as CUDA-only? spike_thin_copy appears to have the same direct ggml-cuda link.
```cpp
// rope dimension_sections (array of 4 uint32)
int rope_sections[4] = {0, 0, 0, 0};
```
This still appears to weaken target-loader validation. Missing/short qwen35.rope.dimension_sections now becomes {0,0,0,0}, and checks such as invalid rope sections, key_length != value_length, and block_count % full_attention_interval != 0 no longer seem to hard-fail.
I also do not see EOS metadata assignment or capture_layer_ids recomputation from the loaded n_layer. For normal target generation, output.weight also still seems required when plan.load_output is true. Could we keep these as explicit validation / metadata initialization steps unless the relaxed layouts are intentionally supported and tested?
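
The hard-fail validation this comment asks to keep could look roughly like the sketch below. Parameter names follow the comment; the loader's real API and return conventions are assumed.

```cpp
// Illustrative hard-fail checks in the spirit of the comment above.
static bool validate_target_metadata(const int sections[4], int head_dim,
                                     int key_length, int value_length,
                                     int n_layer, int full_attention_interval) {
    int sum = 0;
    for (int i = 0; i < 4; ++i) {
        if (sections[i] < 0) return false;        // negative section is malformed
        sum += sections[i];
    }
    if (sum == 0 || sum > head_dim) return false; // missing key collapsed to {0,0,0,0}
    if (key_length != value_length) return false; // kl != vl not supported
    if (full_attention_interval <= 0 ||
        n_layer % full_attention_interval != 0) return false;
    return true;
}
```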
```cpp
if (kv_bytes > kv_buf_cap) {
    if (kv_buf_K) hipFree(kv_buf_K);
    if (kv_buf_V) hipFree(kv_buf_V);
    hipMalloc(&kv_buf_K, kv_bytes);
```
These hipMalloc calls are unchecked; if either allocation fails, kv_buf_cap is still updated and later kernels may run with null or stale buffers while this function returns 0.
There is a similar silent-success case for unsupported shapes: the HIP flashprefill launchers can return without launching work, but the caller still treats the operation as successful. Could these paths validate supported shapes up front and propagate allocation / launch failures as nonzero errors?
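
A checked version of the quoted allocation path might look like the following sketch. Variable names follow the snippet above; the enclosing function is assumed.

```cpp
#include <hip/hip_runtime.h>

// Sketch of the checked-allocation pattern suggested above.
static int ensure_kv_buffers(void** kv_buf_K, void** kv_buf_V,
                             size_t* kv_buf_cap, size_t kv_bytes) {
    if (kv_bytes <= *kv_buf_cap) return 0;             // existing buffers suffice
    if (*kv_buf_K) hipFree(*kv_buf_K);
    if (*kv_buf_V) hipFree(*kv_buf_V);
    *kv_buf_K = *kv_buf_V = nullptr;
    *kv_buf_cap = 0;                                   // never advertise capacity we lack
    if (hipMalloc(kv_buf_K, kv_bytes) != hipSuccess) return -1;
    if (hipMalloc(kv_buf_V, kv_bytes) != hipSuccess) {
        hipFree(*kv_buf_K);                            // roll back the partial allocation
        *kv_buf_K = nullptr;
        return -1;
    }
    *kv_buf_cap = kv_bytes;                            // update cap only on full success
    return 0;
}
```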
…sponses Parse `<think>...</think>` tags from Qwen model output and expose reasoning separately:
- OpenAI: `reasoning_content` field on the message / streaming delta
- Anthropic: thinking content blocks with `thinking_delta` events

Streaming uses a sliding-window state machine (`consume_stream_piece`) to detect tag boundaries across token boundaries. Respects `chat_template_kwargs.enable_thinking` — when disabled, tokens stream as plain content without tag parsing.
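
As a rough illustration of the sliding-window idea (the repo's `consume_stream_piece` lives in the Python server; this C++ sketch only mirrors the mechanism): hold back any suffix that could begin the next tag, so a tag split across token boundaries is still detected.

```cpp
#include <algorithm>
#include <string>
#include <utility>

// Illustrative only; names are ours, not the server's.
struct ThinkParser {
    bool in_think = false;
    std::string pending;                       // carry-over across pieces

    // Returns {reasoning_delta, content_delta} for this piece.
    std::pair<std::string, std::string> consume(const std::string& piece) {
        pending += piece;
        std::string reasoning, content;
        for (;;) {                             // consume any complete tags
            const std::string tag = in_think ? "</think>" : "<think>";
            size_t pos = pending.find(tag);
            if (pos == std::string::npos) break;
            (in_think ? reasoning : content) += pending.substr(0, pos);
            pending.erase(0, pos + tag.size());
            in_think = !in_think;
        }
        // Emit everything except a trailing proper prefix of the next tag.
        const std::string tag = in_think ? "</think>" : "<think>";
        size_t keep = 0;
        for (size_t k = std::min(pending.size(), tag.size() - 1); k > 0; --k) {
            if (pending.compare(pending.size() - k, k, tag, 0, k) == 0) { keep = k; break; }
        }
        (in_think ? reasoning : content) += pending.substr(0, pending.size() - keep);
        pending.erase(0, pending.size() - keep);
        // A final flush of `pending` at stream end is omitted for brevity.
        return {reasoning, content};
    }
};
```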
The host-side `build_causal_mask` sizes the mask buffer as
`align_up(kv_len, g_kq_stride_pad)`. `g_kq_stride_pad` is bumped from
`KQ_MASK_PAD=32` to `256` so the mask matches the TurboQuant chunked-FA
driver, which pads `kv_len` to `align_up(kv_len, FATTN_KQ_STRIDE=256)`
when KV is TQ3_0.
The bump used to fire only on the legacy `DFLASH27B_KV_TQ3=1` path. When
the KV type is selected via `-ctk tq3_0` / `-ctv tq3_0` (which sets
`DFLASH27B_KV_K` / `DFLASH27B_KV_V` instead), the bump never ran. With
stride at 32 the mask was sized at `align_up(N, 32)` while the kernel
read at `align_up(N, 256)`, walking off the end of the mask buffer into
uninitialized GPU memory. The OOB fp16 garbage occasionally encoded NaN,
producing per-thread NaN at any prefill ubatch `N` where
`align_up(N, 256) > align_up(N, 32)` — notably 256, 512, 1024, 2048.
Re-check `DFLASH27B_KV_K` / `DFLASH27B_KV_V` after argv parsing and bump
`g_kq_stride_pad` to 256 if either starts with "tq3" (case-insensitive).
The legacy `DFLASH27B_KV_TQ3=1` block is left in place; this is
additive.
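A minimal sketch of that re-check, assuming only the names the commit message gives (`g_kq_stride_pad`, `KQ_MASK_PAD`, the two env vars) and the usual `align_up` semantics:

```cpp
#include <cctype>
#include <cstdlib>

static int g_kq_stride_pad = 32;               // KQ_MASK_PAD default

static inline int align_up(int n, int a) { return (n + a - 1) / a * a; }

static bool is_tq3(const char* v) {            // case-insensitive "tq3" prefix
    return v && std::tolower((unsigned char)v[0]) == 't'
             && std::tolower((unsigned char)v[1]) == 'q'
             && v[2] == '3';
}

// Run after argv parsing, so -ctk/-ctv tq3_0 (which sets DFLASH27B_KV_K/V)
// is seen even when the legacy DFLASH27B_KV_TQ3=1 path never fires.
static void bump_stride_for_tq3_kv() {
    if (is_tq3(std::getenv("DFLASH27B_KV_K")) || is_tq3(std::getenv("DFLASH27B_KV_V")))
        g_kq_stride_pad = 256;                 // mask now sized align_up(kv_len, 256)
}
```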
Reproducer: `-ctk tq3_0 -ctv tq3_0` with prefill ubatch in {256, 512,
1024, 2048} produced random non-determinism on RTX 5090 (sm_120) before
this fix, and produces the same output across runs after.
The PFlash compress flow parks the target + draft to VRAM-backed scratch before loading the drafter, then restores them after scoring. The park/restore protects 24 GB cards (target ~16 GB Q4_K_M + draft ~3 GB + drafter KV ~1.3 GB + BSA scratch ~600 MB exceeds 24 GB without it) but adds ~2 s of fixed cost per compress call. On a 32 GB card (RTX 5090) all three models fit co-resident, so the park/restore round trip is pure overhead.

Plumbs a `skip_park` flag from CLI to daemon:
- `--prefill-skip-park` argparse flag (default off, behaviour unchanged for existing users).
- `PrefillConfig.skip_park` carries the choice through both server.py and server_tools.py compress paths.
- `compress_text_via_daemon` skips the park target / park draft and free drafter / unpark target / unpark draft round-trips when `skip_park=True`.
- Server start sets `DFLASH_COMPRESS_NO_PARK=1` in the daemon env when the flag is on; the daemon compress handler then leaves the scorer resident, avoiding the ~2 s reload on every subsequent request.

The flag is opt-in. Users on 24 GB cards who pass it will OOM during compress; the help text calls out the 32 GB minimum.
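
On the daemon side the gate reduces to an env check ahead of the park/unpark calls. An illustrative sketch only; the compress handler's actual code is not shown in this thread.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical daemon-side gate for the DFLASH_COMPRESS_NO_PARK env var.
static bool compress_should_park() {
    const char* v = std::getenv("DFLASH_COMPRESS_NO_PARK");
    return !(v && std::strcmp(v, "1") == 0);   // park unless explicitly disabled
}
```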
The daemon-spawn path was POSIX-only. Three things break on Windows:
1. `LD_LIBRARY_PATH` is meaningless; Windows uses `PATH` for DLL resolution.
2. `subprocess.Popen(pass_fds=...)` is unsupported on Windows.
3. The child receives the parent file descriptor via `--stream-fd`, which on Windows must be a Win32 HANDLE (an `int`-castable handle value), not a CRT fd integer.

Fixes:
- Add a `sys.platform == "win32"` branch when extending the loader search path. The `else` branch is unchanged, so Linux behaviour is byte-identical to before.
- On Windows, mark the pipe write-end inheritable (`os.set_inheritable`) and pass `msvcrt.get_osfhandle(...)` as `--stream-fd`. The daemon already accepts a HANDLE value here on Windows.
- On Windows, spawn with `close_fds=False` so the child inherits the marked handle; on POSIX, keep the existing `pass_fds=(self.w_pipe,)` path.

The pattern matches the in-process port already used in `dflash/scripts/server.py`.
Three small, related changes around Qwen3.x thinking-mode handling and
the NIAH harness.
server.py: default `enable_thinking=False`
The OpenAI-compatible endpoint applied the tokenizer chat template with
no `enable_thinking` kwarg, which falls through to the Qwen3.x template
default of `True`. With thinking on, every reply burns its `n_gen`
budget on a chain-of-thought rollout that DFlash's drafter cannot
predict, collapsing acceptance rate (typically ~60% -> single digits)
and frequently exhausting `max_tokens` before the answer block starts.
The default flips to `False`. Clients keep full control by sending
`"chat_template_kwargs": {"enable_thinking": true}` per request, which
the existing template-kwargs passthrough merges in. This repo only
serves Qwen3.5/3.6, so the default is safe; the kwarg is silently
ignored by tokenizers whose templates do not reference it.
bench_niah_cpp.py: `--no-thinking` flag
The NIAH bench builds its own chat prompt (independent of the server
path) and hit the same problem at 117K context: Qwen3.6 spent the full
generation budget thinking and never reached the answer. Adds a
`--no-thinking` flag that injects `enable_thinking=False` into the
target's `apply_chat_template` call. Default is off, so existing
invocations are unchanged.
niah_gen.py: `--seed-base` flag
Replaces the hardcoded `seed=42 + i` with a `--seed-base` argument
(default 42). Lets you generate independent case sets without changing
`--n`, e.g. for held-out validation passes.
perf: Replace Q8_0 format for KV with Q4_0 + Rotation, fix window_filled for long context
…flags feat: disable Qwen3 thinking by default + NIAH bench QoL flags
…ge-vram feat(dflash): --prefill-skip-park for 32 GB+ GPUs
fix(pflash): support Windows in DflashClient
- target_loader: pin tok_embd to mmap host before backend alloc so the GPU allocator skips it (saves 110 MiB VRAM on RTX 3090, verified 19842 -> 19732 MiB at idle).
- target_loader: validate expert_used_count <= expert_count.
- bench_laguna_pflash: drop double free_drafter on the embed-failure path; drafter was already freed before the laguna target load.
- parity_laguna: cache HF reference model in a module-level dict so it is loaded once across context lengths instead of per iteration.
- RESULTS.md: relabel 27 364 tok/s as effective on the 131 072-token original prompt (target processes 13 120 compressed); the prior label "on the 13 120-token compressed prompt" was the wrong basis.

Co-Authored-By: WOZCODE <contact@withwoz.com>
fix(dflash): pad KV mask to 256 when TQ3_0 is selected via -ctk/-ctv
Resolve conflict in dflash/test/test_dflash.cpp: keep both main's TQ3_0 KV-stride padding block (q4rot fix from PR Luce-Org#108) and the PR's laguna-aware early-exit (skips the prompt/out positional-arg check when the target is laguna and dispatches to run_laguna_daemon). Co-Authored-By: WOZCODE <contact@withwoz.com>
feat(dflash): scaffold Laguna-XS.2 target arch (loader + graph + bench)
Adds a Supported Models table near the top with TTFT and Decode speedups on RTX 3090 vs vendored llama.cpp for megakernel Qwen3.5-0.8B, dflash Qwen3.5/3.6-27B, and Laguna-XS.2. Removes the dedicated section 04 Laguna-XS.2 writeup. Laguna is a supported model on the existing dflash + PFlash path, not a new speedup, so it belongs in the table only. Hero-card list and "Three projects today" updated to match. Co-Authored-By: WOZCODE <contact@withwoz.com>
…ernel in table Co-Authored-By: WOZCODE <contact@withwoz.com>
…odels docs(readme): supported-models table, drop dedicated Laguna section
fix(dflash): expose qwen reasoning in streaming and non-streaming response
ggml's flash_attn_ext asserts Q->type == GGML_TYPE_F32 on all SM targets. The q8 fallback path was passing BF16 Q tensors from the drafter's persistent buffers, causing an assertion failure on Turing (sm_75) GPUs during pflash drafter forward. Add a ggml_cast(Q, F32) when the source type is not already F32. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
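
The fix boils down to a type guard before building the attention op. A sketch using the public `ggml_cast` API; the surrounding graph-builder context (where `ctx` and `q` come from) is assumed.

```cpp
#include "ggml.h"

// Sketch of the guard described above; the actual call site is in the
// q8 fallback path of the pflash drafter forward.
static struct ggml_tensor* q_for_flash_attn(struct ggml_context* ctx,
                                            struct ggml_tensor* q) {
    // ggml's flash_attn_ext asserts Q is F32, so cast BF16 Q before use.
    return q->type == GGML_TYPE_F32 ? q : ggml_cast(ctx, q, GGML_TYPE_F32);
}
```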
unpark draft unconditionally called load_draft_safetensors(), failing with 'bad header length' when the draft is a GGUF file. Use the same extension-based dispatch (.gguf → load_draft_gguf) already used at initial load time. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
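
A minimal sketch of the extension-based dispatch; both loader functions are named per the commit text, and their signatures are assumed (declared here, defined elsewhere).

```cpp
#include <string>

bool load_draft_gguf(const std::string& path);          // assumed signatures
bool load_draft_safetensors(const std::string& path);

// Same dispatch as initial load: route by file extension on unpark.
static bool load_draft(const std::string& path) {
    const bool is_gguf = path.size() >= 5 &&
                         path.compare(path.size() - 5, 5, ".gguf") == 0;
    return is_gguf ? load_draft_gguf(path)
                   : load_draft_safetensors(path);
}
```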
Follow-up to merged PR Luce-Org#70 ("in-process speculative prefill for dflash daemon"). PR Luce-Org#70 implements the kernels (mean_K / score / select / sparse_fwd), the daemon protocol (compress / free drafter / park / unpark) and the --prefill-* CLI flags. Running that as a multi-tenant production lane exposed two small operator-level rough edges that are worth documenting in pflash/README.md so other operators do not rediscover them.

1. Queue budget. A 64K PFlash request takes ~24 s end-to-end on a live Qwen3.6 lane. Default queue budget (max_queue_requests=4, queue_timeout_s=12.0) produced avoidable timeouts on bursts of long prompts. Recommend tuning to 8 / 90.0 for PFlash long-context lanes.
2. Drafter selection. PFlash compress works best with the Qwen3-0.6B drafter (BF16 safetensors, ~5 attention layers). Decode after unpark uses the larger DFlash drafter. Document this so the dual-drafter layout is explicit.

This commit also documents an apples-to-apples long-context measurement methodology vs Ollama native /api/chat on the same 64K unique-prompt summary task (Ollama 68.614 s vs Lucebox + PFlash compress 23.748 s; ~2.89x). The speedup comes from Luce-Org#70; this section only records how to reproduce it without prompt-prefix cache artefacts (the Ollama OpenAI-compat endpoint returned empty content with the response in message.reasoning, so the comparison was done on the native endpoint).

This is a docs-only PR. No kernels or daemon protocol are changed. The bulk of the integration (the OpenAI-compatible proxy, the YAML manifest, the benchmark harness) lives in a downstream wrapper and stays local.

Verification vs existing community PRs: COMP-COMPL with Luce-Org#70 (merged 2026-04-30). Luce-Org#70 introduces the in-process pflash daemon mode; this PR only adds operator notes on top of it.

Author: Javier Pazo <xabicasa@gmail.com>
…tor-notes docs(pflash): operator notes for queue budget + drafter selection
daemon: auto-detect .gguf draft format on unpark
flashprefill_q8: cast Q to F32 for ggml flash_attn_ext on sm_75
This branch should only add HIP/ROCm support for the dflash C++ exact path (drafter scoring + qwen35-0.8b dispatch). The Python approx path (PromptCompressor + llama.cpp CLI wrapper + RULER comparison harness) landed in 28fd493 as part of "strix halo rocm compatibility" but is unrelated to the HIP port and is being removed.

- git rm --cached the 8 approx-only files (kept on disk via .gitignore)
- revert pflash/__init__.py, pyproject.toml, README.md to main (all diffs in those three were approx-related)
- add pflash/.gitignore so the local files stay untracked
- keep tracked: dflash_client.py (drafter_arch + ROCm VRAM telemetry), platform.py, bench_niah_cpp.py (--drafter-arch flag), all dflash/src
Strix Halo branch should minimize diff against main: only HIP/ROCm code support, no doc churn or branch-marker prose.

- Untrack dflash/docs/STRIX_HALO_PFLASH.md via dflash/.gitignore (kept on disk; restore via `git show 388867a:dflash/docs/STRIX_HALO_PFLASH.md`)
- Revert dflash/README.md, README.md, pflash/README.md to origin/main (the additions were Strix Halo prose; not load-bearing for the build)
Strip extras that landed in 28fd493 / 4a93505 alongside the actual HIP/ROCm port. Keeps the same end-to-end smoke test passing (qwen3-0.6B drafter + 27B target on gfx1151, NIAH single-needle).

- pflash/pflash/platform.py removed; the only consumer was dflash_client's boot wait. Replaced with an inline _query_nvidia_vram_mib() that silently falls back to a 5s time-based check when nvidia-smi is absent (Strix Halo / any non-NVIDIA box).
- dflash/src/flashprefill.cpp: keep only the cuda_runtime.h -> device_runtime.h header swap. Drop the per-device cudaMalloc scratch cache (CUDA-only perf opt; HIP path uses flash_prefill_forward_q8 so this code never runs on HIP).
- dflash/src/qwen3_0p6b_loader.cpp: drop the src-vs-dst type/byte defensive checks (unrelated to HIP).
- dflash/include/dflash27b.h: drop QWEN35_9B macro alt-config block (macro-guarded, no internal callers).
- dflash/test/test_flashprefill_kernels.cpp: fully revert to origin/main. Gate test build on NOT DFLASH27B_USE_HIP in CMakeLists; this test exercises BF16 WMMA kernels that the HIP path force-disables anyway.
- root .gitignore: drop dflash/build-hip and pflash/models entries.

Net diff vs origin/main: 28 -> 20 files, +2376/-613 -> +1734/-587.
…alloc checks
- CMakeLists.txt: pflash_daemon and spike_thin_copy now link against
${_dflash27b_ggml_backend_lib} instead of hardcoded ggml-cuda, fixing
HIP build link failures reported by weicj.
- gguf_target_loader.cpp: restore structural validation removed in
prior cleanup — kl!=vl check, n_layer%fai divisibility, full
rope_sections validation (presence, count, bounds vs head_dim),
EOS metadata assignment, capture_layer_ids recomputation, and
output.weight requirement in top-level tensor check.
- bsa_launcher_hip.cu: check hipMalloc return values; on failure,
roll back partial allocations, reset kv_buf_cap, and return -1
instead of silently proceeding with null buffers.
- README.md: take upstream's GPU column + RTX 5090 row in supported models table
- pflash/tests/bench_niah_cpp.py: take upstream's simplified user prompt
- .gitmodules: repoint llama.cpp submodule from smpurkis/llama.cpp:master to Luce-Org/llama.cpp-dflash-ggml:luce-dflash (companion PR Luce-Org#8 merged 2026-05-08, HIP fixes are now upstream)
- dflash/deps/llama.cpp: advance to c79573c9b (includes HIP dflash aliases + TQ3_0 FA guard + turbo_wht + laguna arch)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…flicts
Resolve merge conflicts from upstream/main which refactored the CMake
build system from DFLASH27B_USE_HIP boolean to DFLASH27B_GPU_BACKEND
string option ("cuda"/"hip"). Key changes:
- Adopt main's DFLASH27B_GPU_BACKEND / DFLASH27B_GGML_BACKEND_TARGET
naming throughout CMakeLists.txt
- Preserve our branch's HIP-specific additions: ROCm rpath, gfx1151
default arch, rocWMMA Phase 2 flashprefill, BSA HIP launcher
- Update DFLASH27B_USE_HIP → DFLASH27B_BACKEND_HIP compile definition
across all source files (device_runtime.h, qwen3_0p6b_graph.cpp,
qwen3_drafter.cpp, flashprefill.cpp, flashprefill_kernels.cu)
- Take main's ggml-hip compile_definitions shim and hip_compat include
- Take main's cuda_cross_device_copy.cpp for CUDA builds
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The file was listed in _dflash27b_sources but never committed to either branch. Caught by cmake configure during build verification. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
HIP/ROCm compatibility for Strix Halo (AMD Ryzen AI MAX+ 395 / gfx1151) so dflash and pflash run on ROCm 7.2. Pure code-support — no CUDA behavior changes.
- Based on main, 6 commits; build with `-DDFLASH27B_USE_HIP=ON`.
- `dflash/src/qwen3_0p6b_graph.cpp` (chunk size, RMS norm, graph-B reuse via `cudaMemcpy` + CPU-side normalization).
- `dflash/src/qwen35_target_graph.cpp` generalized to back the 0.8B drafter as well as the 27B target — hardcoded constants removed.
- `dflash/src/device_runtime.h` aliases CUDA→HIP types/symbols at the dflash layer (see the sketch below).
- `.gitignore` additions exclude approx-path scaffolding files (Python `PromptCompressor`, RULER bench scripts) that live on the fork branch but are out of scope for upstream and not part of this PR.
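
As a rough picture of what such an alias header does (the real device_runtime.h covers far more symbols, and the gate macro was later renamed to `DFLASH27B_BACKEND_HIP` in the rebase commit):

```cpp
// Sketch of the CUDA->HIP aliasing approach; illustrative subset only.
#if defined(DFLASH27B_USE_HIP)
  #include <hip/hip_runtime.h>
  #define cudaError_t        hipError_t
  #define cudaSuccess        hipSuccess
  #define cudaMalloc         hipMalloc
  #define cudaFree           hipFree
  #define cudaMemcpy         hipMemcpy
  #define cudaMemcpyDefault  hipMemcpyDefault
  #define cudaStream_t       hipStream_t
  #define cudaStreamCreate   hipStreamCreate
#else
  #include <cuda_runtime.h>
#endif
```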
Llama.cpp dependency

The submodule needs three HIP fixes that are not yet on Luce-Org/llama.cpp-dflash-ggml:luce-dflash. Companion PR: Luce-Org/llama.cpp-dflash-ggml#8 (cublas/cudaStream aliases + TQ3_0 FA guard). Until that lands, this branch's `.gitmodules` points at smpurkis/llama.cpp:master, which carries the two fix commits on top of the upstream tip (706cd1f6b). After #8 is merged, `.gitmodules` can be repointed at Luce-Org/llama.cpp-dflash-ggml:luce-dflash in a one-line follow-up.

Build (Strix Halo / gfx1151 / ROCm 7.2)
Run
Verified
Test plan
- `-DDFLASH27B_USE_HIP=ON` against the HIP-fixed submodule

Pre-existing CMake note (not in scope here)

Two `dflash/CMakeLists.txt` targets on main hardcode ggml-cuda and break a wide `cmake --build .` on HIP: `pflash_daemon` (de31881) and `spike_thin_copy` (b833dce). Not introduced by this PR; the canonical `--target ...` set above sidesteps both. Worth a follow-up cleanup PR to use `${_dflash27b_ggml_backend_lib}`.