Add HIP/ROCm support for Strix Halo (gfx1151) #119

Open
smpurkis wants to merge 41 commits into Luce-Org:main from smpurkis:strix-halo-rocm-compat

Conversation


smpurkis commented May 7, 2026

Note: This PR was vibe-coded with AI assistance (Claude Code). Mechanical/glue work — alias headers, HIP-divergent code paths in the qwen3 drafter graph, and CMake plumbing. Please review accordingly.

Summary

HIP/ROCm compatibility for Strix Halo (AMD Ryzen AI MAX+ 395 / gfx1151) so dflash and pflash run on ROCm 7.2. Code support only — no CUDA behavior changes.

  • 20 files, +1734 / -587 vs main, 6 commits.
  • HIP build paths gated by -DDFLASH27B_USE_HIP=ON.
  • HIP-divergent paths in dflash/src/qwen3_0p6b_graph.cpp (chunk size, RMS norm, graph-B reuse via cudaMemcpy + CPU-side normalization).
  • dflash/src/qwen35_target_graph.cpp generalized to back the 0.8B drafter as well as the 27B target — hardcoded constants removed.
  • dflash/src/device_runtime.h aliases CUDA→HIP types/symbols at the dflash layer.
  • 2 .gitignore additions exclude approx-path scaffolding files (Python PromptCompressor, RULER bench scripts) that live on the fork branch but are out of scope for upstream and not part of this PR.

Llama.cpp dependency

The submodule needs three HIP fixes that are not yet on Luce-Org/llama.cpp-dflash-ggml:luce-dflash. Companion PR: Luce-Org/llama.cpp-dflash-ggml#8 (cublas/cudaStream aliases + TQ3_0 FA guard). Until that lands, this branch's .gitmodules points at smpurkis/llama.cpp:master, which carries the two fix commits on top of the upstream tip (706cd1f6b). After #8 is merged, .gitmodules can be repointed at Luce-Org/llama.cpp-dflash-ggml:luce-dflash in a one-line follow-up.
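Once #8 lands, the repoint amounts to editing the submodule stanza in .gitmodules to something like the following (the submodule name and path below are placeholders; the repo's actual entry may differ, and only the url/branch lines change in the one-line follow-up):

```ini
; hypothetical stanza for illustration
[submodule "llama.cpp"]
	path = llama.cpp
	url = https://github.com/Luce-Org/llama.cpp-dflash-ggml
	branch = luce-dflash
```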

Build (Strix Halo / gfx1151 / ROCm 7.2)

git clone --recurse-submodules https://github.com/Luce-Org/lucebox-hub
cd lucebox-hub/dflash

cmake -S . -B build \
  -DDFLASH27B_USE_HIP=ON \
  -DGGML_HIP=ON \
  -DCMAKE_HIP_ARCHITECTURES=gfx1151

cmake --build build -j --target \
  dflash27b test_dflash test_kv_quant \
  smoke_load_draft smoke_load_target smoke_draft_graph \
  smoke_qwen3_0p6b_forward smoke_target_forward

Run

# Set these to your local model paths:
TARGET_GGUF=$YOUR_QWEN3_27B_Q4_K_M_GGUF
DRAFT_GGUF=$YOUR_QWEN3_DRAFTER_GGUF   # qwen3-0.6b BF16 or qwen3.5-0.8b Q4_K_M

# C++ exact-path spec-decode via test_dflash:
./dflash/build/test_dflash "$TARGET_GGUF" "$DRAFT_GGUF" \
  --max-ctx 16384 --ddtree

# pflash NIAH bench (HIP exact path):
python pflash/tests/bench_niah_cpp.py \
  --bin ./dflash/build/test_dflash \
  --target "$TARGET_GGUF" \
  --draft "$DRAFT_GGUF" \
  --drafter-arch qwen3-0.6b

Verified

  • All three HIP drafter paths verified end-to-end against the Qwen3.6-27B Q4_K_M target via NIAH single-needle smokes:
    • qwen3-0.6B BF16 GGUF drafter
    • qwen3.5-0.8B Q4_K_M GGUF drafter
    • null/baseline llama.cpp HIP path
  • Spec-decode draft (Qwen3.6-27B-DFlash safetensors) runs on HIP via the ggml-hip backend.

Test plan

  • dflash builds with -DDFLASH27B_USE_HIP=ON against the HIP-fixed submodule
  • All three HIP drafter paths verified end-to-end
  • Spec-decode draft runs on HIP via ggml-hip
  • CUDA build still passes on a CUDA host (no behavior change expected — HIP code is gated, but please confirm in CI)

Pre-existing CMake note (not in scope here)

Two dflash/CMakeLists.txt targets on main hardcode ggml-cuda and break a wide cmake --build . on HIP.

Not introduced by this PR; the canonical --target ... set above sidesteps both. Worth a follow-up cleanup PR to use ${_dflash27b_ggml_backend_lib}.

ppsx added 3 commits May 6, 2026 17:41
Companion to the short-context RTX 5090 section (HumanEval / Math500 /
GSM8K, added in Luce-Org#86). Different layer of validation: short-context
tests pure speculative decoding at 50–250 token prompts (PFlash
compression not engaged), long-context tests the full PFlash drafter
scoring + ~20× compression + DFlash decode pipeline at 117K tokens.

Validated config: keep_ratio=0.05, alpha=0.70, ddtree_budget=22,
fa_window=4096, kv_tq3=0 (FP16 KV — 5090 has the VRAM headroom).
20/20 NIAH across 2 runs of n=10. Decode avg 210.7 t/s sustained over
117K context.

Notable cross-reference: my long-context budget sweep converges on
budget=22 as throughput-optimal, matching Luce-Org#86's short-context finding.
This makes budget=22 a stable default across context regimes for
Qwen3.6-27B on Blackwell, not a knob that needs per-regime tuning.

Includes notes on kv_tq3 (skip on 32 GB at this context, keep on 24 GB)
and alpha (the docs default of 0.85 fails 2/10 NIAH at this
setup; 0.60–0.70 pass 10/10).

Depends on Luce-Org#86. This PR is a pure append to dflash/RESULTS.md after
the same anchor point Luce-Org#86 appends to, so the rebase once Luce-Org#86 lands is
mechanical — my section just stacks below Luce-Org#86's section.
…ance

Six review-feedback fixes on the new RTX 5090 long-context section:

1-3. Throughput unit consistency: rename Decode t/s → Decode tok/s in all
     three throughput tables (headline, budget sweep, alpha sweep) to
     match the rest of dflash/RESULTS.md and the short-context section.

4. Runtime config interface ambiguity: replace shorthand names
   (keep_ratio=, alpha=, ddtree_budget=, ...) with the actual
   reproducible interfaces — bench_niah_cpp.py CLI flags added in Luce-Org#90
   (--keep-ratio=, --alpha=, --ddtree-budget=, --fa-window=, --kv-tq3=,
   --n-gen=) and daemon env vars (DFLASH_FP_USE_BSA=, optional
   DFLASH27B_FA_WINDOW=). One concrete way to reproduce, not three
   shorthand aliases.

5. niah_gen.py path: reference becomes a relative markdown link to
   ../pflash/tests/niah_gen.py.

6. P2 — sweep provenance and apparent headline-vs-table mismatch:
   the headline 210.7 tok/s comes from Phase 4 (n=20 reliability run at
   the V4 config: keep=0.05/alpha=0.70/budget=22), but the budget table
   was at keep=0.08/budget=varying and the alpha table was at
   keep=0.08/budget=28/alpha=varying. A reader who notices both
   "budget=22 → 217.4" (sweep) and "budget=22 → 210.7" (headline)
   would correctly conclude the section is mixing configs.

   Fix: each sweep table now declares its held-fixed parameters in the
   subsection title (Phase 1 / 2 / 3, with "held: ..."). Added the
   missing keep-ratio sweep (Phase 3) so the reader can see why
   keep=0.05 was chosen over the per-token-faster keep=0.08 (TTFT and
   compression-ratio tradeoff). New paragraph between the headline and
   the sweeps spells out the methodology so the cross-table comparison
   is grounded.
… Q8_0

Four review-feedback fixes on the new long-context section:

1+3. The runtime config and the alpha-sweep section both documented
     `--alpha=0.70` as a CLI flag, but bench_niah_cpp.py on main does
     not have a --alpha flag (that flag is in the still-open
     bsa-cli-flag PR). Alpha is configured via the daemon env var
     DFLASH_FP_ALPHA. Renamed all references:
     - Runtime config bullet now reads
       "DFLASH_FP_USE_BSA=1 and DFLASH_FP_ALPHA=0.70 ... (both are
       daemon env vars)".
     - Phase 1 section heading + table column header changed from
       "--alpha" to "DFLASH_FP_ALPHA".
     - "The docs default of --alpha=0.85" → "DFLASH_FP_ALPHA=0.85".
     - Phase 2 + Phase 3 "held:" lists also corrected from
       "--alpha=0.70" to "DFLASH_FP_ALPHA=0.70".

2. Runtime config preface said "the bracketed names are the exact
   interfaces" but no brackets in the list. Reworded to "each bullet
   leads with the exact interface" so prose matches presentation.

4. KV-cache type relabeled from FP16 → Q8_0. The reviewer correctly
   notes that --kv-tq3=0 alone leaves the daemon at its Q8_0 KV
   default (per dflash/src/kv_quant.h); FP16 would require setting
   DFLASH27B_KV_K=f16 / DFLASH27B_KV_V=f16 explicitly, which the
   recorded runtime config does not include. Conservative
   interpretation: the actual benchmark used Q8_0 (the daemon
   default). Updated:
     - Runtime config bullet: "Q8_0 KV cache — the daemon default
       when TQ3_0 is disabled and no other KV type is set".
     - "Note on --kv-tq3" subsection: explains the Q8_0 default and
       points users at DFLASH27B_KV_K/V for explicit overrides.
   The "5090 has VRAM headroom; TQ3_0 not needed" rationale holds for
   either Q8_0 or FP16, so the section's argument is unaffected.
cubic-dev-ai (bot) left a comment

9 issues found across 20 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="dflash/src/prefix_cache.cpp">

<violation number="1" location="dflash/src/prefix_cache.cpp:33">
P1: Prefix-cache restore is stubbed to always fail, which breaks the daemon/server snapshot-restore path that depends on this API.</violation>
</file>

<file name="pflash/.gitignore">

<violation number="1" location="pflash/.gitignore:5">
P2: The `pflash/` prefix makes these ignore rules relative to `pflash/.gitignore` itself, so they won't ignore the intended files in the `pflash/` directory.</violation>
</file>

<file name="dflash/CMakeLists.txt">

<violation number="1" location="dflash/CMakeLists.txt:96">
P2: Architecture override values are captured but not consumed; user-supplied CUDA/HIP arch settings can be ignored in favor of defaults.</violation>
</file>

<file name="dflash/src/qwen3_drafter.cpp">

<violation number="1" location="dflash/src/qwen3_drafter.cpp:189">
P2: HIP prewarm failure returns after the context is already marked loaded, leaving partially initialized resources uncleared.</violation>

<violation number="2" location="dflash/src/qwen3_drafter.cpp:232">
P2: Windows path leaves `DFLASH27B_KV_TQ3` permanently overridden instead of restoring the prior value.</violation>

<violation number="3" location="dflash/src/qwen3_drafter.cpp:615">
P2: Qwen3.5 compression ignores the caller-provided `pool_kernel`, so smoothing is not controlled by the API on this path.</violation>
</file>

<file name="dflash/src/qwen3_0p6b_graph.cpp">

<violation number="1" location="dflash/src/qwen3_0p6b_graph.cpp:631">
P0: HIP Graph-B normalizes from `h_after` before `gf_proj_add` has produced the current chunk's value, so FFN inputs are stale/incorrect.</violation>
</file>

<file name="dflash/src/gguf_target_loader.cpp">

<violation number="1" location="dflash/src/gguf_target_loader.cpp:534">
P2: `out.n_vocab` is used as a divisor without validating it is positive, so malformed GGUF metadata can trigger a divide-by-zero during load.</violation>
</file>

<file name="dflash/src/qwen35_target_graph.cpp">

<violation number="1" location="dflash/src/qwen35_target_graph.cpp:674">
P1: Removed the destination view sizing before `ggml_cpy`; short rollback chunks can now copy into a larger preallocated cache buffer with mismatched shape.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment threads: dflash/src/qwen3_0p6b_graph.cpp · dflash/src/prefix_cache.cpp (outdated) · dflash/src/qwen35_target_graph.cpp · pflash/.gitignore (outdated) · dflash/CMakeLists.txt (outdated) · dflash/src/qwen3_drafter.cpp (three threads, one outdated) · dflash/src/gguf_target_loader.cpp
…prefill-last-logits

fix(dflash): resolve target split prefill OOM from full-prompt logits
davide221 (Contributor):

@smpurkis thanks for the great contribution! We can't wait to integrate AMD, can you check and fix P0-P1 from cubic?

smpurkis added a commit to smpurkis/lucebox-hub that referenced this pull request May 7, 2026
Imports rocWMMA-native flashprefill kernels (mean / score-GEMM /
select / sparse-FA) from PR Luce-Org#117 behind DFLASH27B_HIP_SM80_EQUIV=ON.
Phase 1 (default) keeps the ggml q8 fallback unchanged.

On gfx1151 / ROCm 7.2 the FP-kernel speedup vs Phase 1 grows with seq
length: 2.2x@2K, 3.9x@8K, 5.0x@16K. End-to-end NIAH compress at
S=7270 goes 4.21s -> 2.70s (1.56x), accuracy 1/1 on both phases,
output byte-identical.

Bug fixed in the same commit: use_bf16_fp was hardcoded false on HIP
in qwen3_0p6b_graph.cpp, so the new kernels were linked but never
reached. Gate now keys off DFLASH27B_HAVE_FLASHPREFILL && MIN_SM>=80.

Also addresses cubic-dev-ai review on PR Luce-Org#119:
  - P0: graph-B reorder in qwen3_0p6b (gf_proj_add before reading
        h_after)
  - P1: restore prefix-cache impls in qwen35_target_graph; drop the
        stub prefix_cache.cpp
  - P1: cap->conv_input view sizing
  - P2: rename DFLASH27B_USER_CUDA_ARCHITECTURES -> _GPU_ for HIP
  - P2: n_vocab guard in gguf_target_loader
  - P2: free_drafter on prewarm fail; portable env restore via lambda
        (Windows _dupenv_s/_putenv_s); thread pool_kernel through
        qwen35_score_and_compress
  - P2: drop broken pflash/.gitignore

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
smpurkis (Author) commented May 7, 2026

Thanks for the review! All P0–P2 from cubic addressed in commit f285bff.

P0 — qwen3_0p6b_graph.cpp HIP Graph-B ordering: gf_proj_add runs before reading h_after. Comment block at lines 627–630 documents the invariant. (Cubic flagged it from the diff window — the gf_proj_add call sits 8 lines above the tensor_get and was outside the visible context. Ordering is correct.)

P1 — prefix_cache.cpp stub removed. Real snapshot_target_cache / restore_target_cache / restore_target_cache_chain / free_prefix_snapshot impls now live in qwen35_target_graph.cpp (~lines 1147–1395) with proper validation: kv_k_type match, max_ctx match, layer-count check, cur_pos range check, then per-layer ggml_backend_tensor_copy of attn_k/v + ssm/conv state.

P1 — qwen35_target_graph.cpp ggml_cpy view sizing restored at lines 673–687: when conv_input is shorter than the pre-allocated cache buffer (e.g. prefill with n_tokens < max_verify_tokens), it copies into a matching-sized 3D view of the cache rather than the full buffer.

P2 also in:

  • dflash/CMakeLists.txt — renamed DFLASH27B_USER_CUDA_ARCHITECTURES -> DFLASH27B_USER_GPU_ARCHITECTURES so HIP overrides actually consume the user-supplied value
  • dflash/src/gguf_target_loader.cpp — added an n_vocab > 0 guard before use as a divisor
  • dflash/src/qwen3_drafter.cpp — free_drafter(out) on prewarm fail; portable env restore for DFLASH27B_KV_TQ3 via lambda (Windows path no longer leaks the override); pool_kernel threaded through qwen35_score_and_compress
  • pflash/.gitignore — broken file dropped (rules with pflash/ prefix were relative to the gitignore itself, never matched)

Re-validated post-fix on gfx1151 / ROCm 7.2:

  • Phase 2 smoke (S=2048): forward+score 1.92s, kept 1024/2048, FP kernel 0.20s
  • Phase 1 smoke (S=2048): forward+score 2.17s, kept 1024/2048, FP kernel 0.48s (~2.4× faster on Phase 2 — consistent with the prior 2.2×@2k result; gap grows with S, hits 5.0× at 16K).
  • Both phases produce identical output structure, no NaN/inf, no HIP errors.

weicj (Contributor) commented May 7, 2026

Nice work pushing the ROCm/HIP path forward. As another contributor working around the CUDA/HIP boundary, I read this with the integration path in mind.

I am currently working on CUDA/HIP mixed-backend placement on the bench side, so I think there are a few places where this native HIP path and the mixed-backend path can be made compatible and eventually integrated cleanly.

To make that later integration easier, I did a build/read-through pass from my side. In a local build-only check with ROCm 7.2.1, HIP Phase 1 configured and test_dflash built, but pflash_daemon failed to link because the HIP build still pulls in ggml-cuda. CUDA/SM75 built the same targets successfully. I also found a few correctness and scope issues in the diff. Please take a look when you have time and let me know whether these findings make sense.

Comment thread on dflash/CMakeLists.txt (Contributor):

I locally reproduced a HIP build failure here. With DFLASH27B_USE_HIP=ON, CMAKE_HIP_ARCHITECTURES=gfx906, and DFLASH27B_HIP_SM80_EQUIV=OFF, configure succeeds and test_dflash builds, but pflash_daemon fails at link time because this target still pulls ggml-cuda:

/usr/bin/ld: cannot find -lggml-cuda: No such file or directory

Could this link against the selected backend target, e.g. ${_dflash27b_ggml_backend_lib}, or be guarded as CUDA-only? spike_thin_copy appears to have the same direct ggml-cuda link.

}

// rope dimension_sections (array of 4 uint32)
int rope_sections[4] = {0, 0, 0, 0};
Contributor:

This still appears to weaken target-loader validation. Missing/short qwen35.rope.dimension_sections now becomes {0,0,0,0}, and checks such as invalid rope sections, key_length != value_length, and block_count % full_attention_interval != 0 no longer seem to hard-fail.

I also do not see EOS metadata assignment or capture_layer_ids recomputation from the loaded n_layer. For normal target generation, output.weight also still seems required when plan.load_output is true. Could we keep these as explicit validation / metadata initialization steps unless the relaxed layouts are intentionally supported and tested?

Comment thread on dflash/src/bsa_launcher_hip.cu (outdated):
if (kv_bytes > kv_buf_cap) {
if (kv_buf_K) hipFree(kv_buf_K);
if (kv_buf_V) hipFree(kv_buf_V);
hipMalloc(&kv_buf_K, kv_bytes);
Contributor:

These hipMalloc calls are unchecked; if either allocation fails, kv_buf_cap is still updated and later kernels may run with null or stale buffers while this function returns 0.

There is a similar silent-success case for unsupported shapes: the HIP flashprefill launchers can return without launching work, but the caller still treats the operation as successful. Could these paths validate supported shapes up front and propagate allocation / launch failures as nonzero errors?

jkyamog and others added 14 commits May 8, 2026 05:31
…sponses

Parse <think>...</think> tags from Qwen model output and expose
reasoning separately:

- OpenAI: reasoning_content field on the message / streaming delta
- Anthropic: thinking content blocks with thinking_delta events

Streaming uses a sliding-window state machine (consume_stream_piece)
to detect tag boundaries across token boundaries.  Respects
chat_template_kwargs.enable_thinking — when disabled, tokens stream
as plain content without tag parsing.
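The sliding-window idea can be sketched in a few lines of Python. This is illustrative only; the class and method names below are not the actual consume_stream_piece implementation, just a minimal version of the same tag-across-chunk-boundary technique:

```python
class ThinkStreamParser:
    """Split a streamed reply into visible content and <think> reasoning,
    even when a tag straddles two chunks (sketch of the sliding window)."""

    OPEN, CLOSE = "<think>", "</think>"

    def __init__(self):
        self.in_think = False
        self.pending = ""  # trailing chars that might begin a tag

    def feed(self, piece):
        buf = self.pending + piece
        self.pending = ""
        content, reasoning = [], []
        while buf:
            tag = self.CLOSE if self.in_think else self.OPEN
            out = reasoning if self.in_think else content
            i = buf.find(tag)
            if i != -1:  # complete tag in this window: flip state, continue
                out.append(buf[:i])
                buf = buf[i + len(tag):]
                self.in_think = not self.in_think
                continue
            # No complete tag: hold back the longest suffix that could
            # still turn into one once the next chunk arrives.
            keep = max((k for k in range(1, len(tag))
                        if buf.endswith(tag[:k])), default=0)
            if keep:
                self.pending = buf[-keep:]
                buf = buf[:-keep]
            out.append(buf)
            break
        return "".join(content), "".join(reasoning)
```

The key detail is the held-back suffix: a chunk ending in `<thi` emits nothing for those characters until the next chunk resolves whether they open a tag.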
The host-side `build_causal_mask` sizes the mask buffer as
`align_up(kv_len, g_kq_stride_pad)`. `g_kq_stride_pad` is bumped from
`KQ_MASK_PAD=32` to `256` so the mask matches the TurboQuant chunked-FA
driver, which pads `kv_len` to `align_up(kv_len, FATTN_KQ_STRIDE=256)`
when KV is TQ3_0.

The bump used to fire only on the legacy `DFLASH27B_KV_TQ3=1` path. When
the KV type is selected via `-ctk tq3_0` / `-ctv tq3_0` (which sets
`DFLASH27B_KV_K` / `DFLASH27B_KV_V` instead), the bump never ran. With
stride at 32 the mask was sized at `align_up(N, 32)` while the kernel
read at `align_up(N, 256)`, walking off the end of the mask buffer into
uninitialized GPU memory. The OOB fp16 garbage occasionally encoded NaN,
producing per-thread NaN at any prefill ubatch `N` where
`align_up(N, 256) > align_up(N, 32)` — notably 256, 512, 1024, 2048.

Re-check `DFLASH27B_KV_K` / `DFLASH27B_KV_V` after argv parsing and bump
`g_kq_stride_pad` to 256 if either starts with "tq3" (case-insensitive).
The legacy `DFLASH27B_KV_TQ3=1` block is left in place; this is
additive.

Reproducer: `-ctk tq3_0 -ctv tq3_0` with prefill ubatch in {256, 512,
1024, 2048} produced random non-determinism on RTX 5090 (sm_120) before
this fix, and produces the same output across runs after.
The PFlash compress flow parks the target + draft to VRAM-backed scratch
before loading the drafter, then restores them after scoring. The
park/restore protects 24 GB cards (target ~16 GB Q4_K_M + draft ~3 GB +
drafter KV ~1.3 GB + BSA scratch ~600 MB exceeds 24 GB without it) but
adds ~2 s of fixed cost per compress call.

On a 32 GB card (RTX 5090) all three models fit co-resident, so the
park/restore round trip is pure overhead.

Plumbs a `skip_park` flag from CLI to daemon:

- `--prefill-skip-park` argparse flag (default off, behaviour unchanged
  for existing users).
- `PrefillConfig.skip_park` carries the choice through both server.py
  and server_tools.py compress paths.
- `compress_text_via_daemon` skips the `park target / park draft` and
  `free drafter / unpark target / unpark draft` round-trips when
  `skip_park=True`.
- Server start sets `DFLASH_COMPRESS_NO_PARK=1` in the daemon env when
  the flag is on; daemon `compress` handler then leaves the scorer
  resident, avoiding the ~2 s reload on every subsequent request.

The flag is opt-in. Users on 24 GB cards who pass it will OOM during
compress; the help text calls out the 32 GB minimum.
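The flag-to-env plumbing described above is ordinary argparse glue. A sketch, where only --prefill-skip-park and DFLASH_COMPRESS_NO_PARK come from the change and the function itself is scaffolding:

```python
import argparse
import os

def daemon_env(argv):
    """Build the daemon environment from CLI args (illustrative sketch)."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--prefill-skip-park", action="store_true",
        help="keep target/draft resident during compress; needs a 32 GB+ GPU")
    args = parser.parse_args(argv)
    env = dict(os.environ)
    if args.prefill_skip_park:
        # daemon's compress handler then leaves the scorer resident
        env["DFLASH_COMPRESS_NO_PARK"] = "1"
    return env
```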
The daemon-spawn path was POSIX-only.  Three things break on Windows:

1. `LD_LIBRARY_PATH` is meaningless; Windows uses `PATH` for DLL
   resolution.
2. `subprocess.Popen(pass_fds=...)` is unsupported on Windows.
3. The child receives the parent file descriptor via `--stream-fd`,
   which on Windows must be a Win32 HANDLE (an `int`-castable handle
   value), not a CRT fd integer.

Fixes:

- Add a `sys.platform == "win32"` branch when extending the loader
  search path.  The `else` branch is unchanged, so Linux behaviour is
  byte-identical to before.
- On Windows, mark the pipe write-end inheritable
  (`os.set_inheritable`) and pass `msvcrt.get_osfhandle(...)` as
  `--stream-fd`.  The daemon already accepts a HANDLE value here on
  Windows.
- On Windows, spawn with `close_fds=False` so the child inherits the
  marked handle; on POSIX, keep the existing `pass_fds=(self.w_pipe,)`
  path.

The pattern matches the in-process port already used in
`dflash/scripts/server.py`.
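The platform split can be sketched as follows. The function name and command shape are illustrative; the --stream-fd handoff, pass_fds, close_fds, and the msvcrt handle conversion follow the description above:

```python
import os
import subprocess
import sys

def spawn_daemon(cmd, w_pipe, platform=sys.platform):
    """Spawn the daemon child, handing it the pipe write-end (sketch)."""
    if platform == "win32":
        import msvcrt
        os.set_inheritable(w_pipe, True)        # let the child inherit it
        handle = msvcrt.get_osfhandle(w_pipe)   # Win32 HANDLE, not a CRT fd
        return subprocess.Popen(cmd + ["--stream-fd", str(handle)],
                                close_fds=False)
    # POSIX: hand over the raw fd; pass_fds keeps only this fd open
    return subprocess.Popen(cmd + ["--stream-fd", str(w_pipe)],
                            pass_fds=(w_pipe,))
```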
Three small, related changes around Qwen3.x thinking-mode handling and
the NIAH harness.

server.py: default `enable_thinking=False`

The OpenAI-compatible endpoint applied the tokenizer chat template with
no `enable_thinking` kwarg, which falls through to the Qwen3.x template
default of `True`. With thinking on, every reply burns its `n_gen`
budget on a chain-of-thought rollout that DFlash's drafter cannot
predict, collapsing acceptance rate (typically ~60% -> single digits)
and frequently exhausting `max_tokens` before the answer block starts.

The default flips to `False`. Clients keep full control by sending
`"chat_template_kwargs": {"enable_thinking": true}` per request, which
the existing template-kwargs passthrough merges in. This repo only
serves Qwen3.5/3.6, so the default is safe; the kwarg is silently
ignored by tokenizers whose templates do not reference it.

bench_niah_cpp.py: `--no-thinking` flag

The NIAH bench builds its own chat prompt (independent of the server
path) and hit the same problem at 117K context: Qwen3.6 spent the full
generation budget thinking and never reached the answer. Adds a
`--no-thinking` flag that injects `enable_thinking=False` into the
target's `apply_chat_template` call. Default is off, so existing
invocations are unchanged.

niah_gen.py: `--seed-base` flag

Replaces the hardcoded `seed=42 + i` with a `--seed-base` argument
(default 42). Lets you generate independent case sets without changing
`--n`, e.g. for held-out validation passes.
perf: Replace Q8_0 format for KV with Q4_0 + Rotation, fix window_filled for long context
…flags

feat: disable Qwen3 thinking by default + NIAH bench QoL flags
…ge-vram

feat(dflash): --prefill-skip-park for 32 GB+ GPUs
fix(pflash): support Windows in DflashClient
- target_loader: pin tok_embd to mmap host before backend alloc so the
  GPU allocator skips it (saves 110 MiB VRAM on RTX 3090, verified
  19842 -> 19732 MiB at idle).
- target_loader: validate expert_used_count <= expert_count.
- bench_laguna_pflash: drop double free_drafter on the embed-failure
  path; drafter was already freed before the laguna target load.
- parity_laguna: cache HF reference model in a module-level dict so
  it is loaded once across context lengths instead of per iteration.
- RESULTS.md: relabel 27 364 tok/s as effective on the 131 072-token
  original prompt (target processes 13 120 compressed); the prior
  label "on the 13 120-token compressed prompt" was the wrong basis.

Co-Authored-By: WOZCODE <contact@withwoz.com>
fix(dflash): pad KV mask to 256 when TQ3_0 is selected via -ctk/-ctv
Resolve conflict in dflash/test/test_dflash.cpp: keep both main's
TQ3_0 KV-stride padding block (q4rot fix from PR Luce-Org#108) and the PR's
laguna-aware early-exit (skips the prompt/out positional-arg check
when the target is laguna and dispatches to run_laguna_daemon).

Co-Authored-By: WOZCODE <contact@withwoz.com>
feat(dflash): scaffold Laguna-XS.2 target arch (loader + graph + bench)
Adds a Supported Models table near the top with TTFT and Decode speedups
on RTX 3090 vs vendored llama.cpp for megakernel Qwen3.5-0.8B, dflash
Qwen3.5/3.6-27B, and Laguna-XS.2.

Removes the dedicated section 04 Laguna-XS.2 writeup. Laguna is a
supported model on the existing dflash + PFlash path, not a new
speedup, so it belongs in the table only. Hero-card list and "Three
projects today" updated to match.

Co-Authored-By: WOZCODE <contact@withwoz.com>
davide221 and others added 10 commits May 8, 2026 17:34
…ernel in table

Co-Authored-By: WOZCODE <contact@withwoz.com>
…odels

docs(readme): supported-models table, drop dedicated Laguna section
fix(dflash): expose qwen reasoning in streaming and non-streaming response
ggml's flash_attn_ext asserts Q->type == GGML_TYPE_F32 on all SM
targets. The q8 fallback path was passing BF16 Q tensors from the
drafter's persistent buffers, causing an assertion failure on Turing
(sm_75) GPUs during pflash drafter forward.

Add a ggml_cast(Q, F32) when the source type is not already F32.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
unpark draft unconditionally called load_draft_safetensors(), failing
with 'bad header length' when the draft is a GGUF file. Use the same
extension-based dispatch (.gguf → load_draft_gguf) already used at
initial load time.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Follow-up to merged PR Luce-Org#70 ("in-process speculative prefill for
dflash daemon"). PR Luce-Org#70 implements the kernels (mean_K / score /
select / sparse_fwd), the daemon protocol (compress / free drafter
/ park / unpark) and the --prefill-* CLI flags. Running that as a
multi-tenant production lane exposed two small operator-level rough
edges that are worth documenting in pflash/README.md so other
operators do not rediscover them.

  1. Queue budget. A 64K PFlash request takes ~24 s end-to-end on
     a live Qwen3.6 lane. Default queue budget
     (max_queue_requests=4, queue_timeout_s=12.0) produced
     avoidable timeouts on bursts of long prompts. Recommend
     tuning to 8 / 90.0 for PFlash long-context lanes.

  2. Drafter selection. PFlash compress works best with the
     Qwen3-0.6B drafter (BF16 safetensors, ~5 attention layers).
     Decode after unpark uses the larger DFlash drafter. Document
     this so the dual-drafter layout is explicit.

This commit also documents an apples-to-apples long-context
measurement methodology vs Ollama native /api/chat on the same 64K
unique-prompt summary task (Ollama 68.614 s vs Lucebox + PFlash
compress 23.748 s; ~2.89x). The speedup comes from Luce-Org#70; this
section only records how to reproduce it without prompt-prefix
cache artefacts (Ollama OpenAI-compat endpoint returned empty
content with the response in message.reasoning, so the comparison
was done on the native endpoint).

This is a docs-only PR. No kernels or daemon protocol are changed.
The bulk of the integration (the OpenAI-compatible proxy, the YAML
manifest, the benchmark harness) lives in a downstream wrapper and
stays local.

Verification vs existing community PRs:
  COMP-COMPL with Luce-Org#70 (merged 2026-04-30). Luce-Org#70 introduces the
  in-process pflash daemon mode; this PR only adds operator notes
  on top of it.

Author: Javier Pazo <xabicasa@gmail.com>
…tor-notes

docs(pflash): operator notes for queue budget + drafter selection
daemon: auto-detect .gguf draft format on unpark
flashprefill_q8: cast Q to F32 for ggml flash_attn_ext on sm_75
davide221 (Contributor):

@smpurkis can you rebase and sync with @weicj requests so that we can push AMD support to production? It would be awesome

smpurkis added 8 commits May 10, 2026 02:05
This branch should only add HIP/ROCm support for the dflash C++ exact
path (drafter scoring + qwen35-0.8b dispatch). The Python approx path
(PromptCompressor + llama.cpp CLI wrapper + RULER comparison harness)
landed in 28fd493 as part of "strix halo rocm compatibility" but is
unrelated to the HIP port and is being removed.

- git rm --cached the 8 approx-only files (kept on disk via .gitignore)
- revert pflash/__init__.py, pyproject.toml, README.md to main (all
  diffs in those three were approx-related)
- add pflash/.gitignore so the local files stay untracked
- keep tracked: dflash_client.py (drafter_arch + ROCm VRAM telemetry),
  platform.py, bench_niah_cpp.py (--drafter-arch flag), all dflash/src
Strix Halo branch should minimize diff against main: only HIP/ROCm code
support, no doc churn or branch-marker prose.

- Untrack dflash/docs/STRIX_HALO_PFLASH.md via dflash/.gitignore (kept
  on disk; restore via `git show 388867a:dflash/docs/STRIX_HALO_PFLASH.md`)
- Revert dflash/README.md, README.md, pflash/README.md to origin/main
  (the additions were Strix Halo prose; not load-bearing for the build)
Strip extras that landed in 28fd493 / 4a93505 alongside the actual
HIP/ROCm port. Keeps the same end-to-end smoke test passing
(qwen3-0.6B drafter + 27B target on gfx1151, NIAH single-needle).

- pflash/pflash/platform.py removed; the only consumer was dflash_client's
  boot wait. Replaced with an inline _query_nvidia_vram_mib() that
  silently falls back to a 5s time-based check when nvidia-smi is absent
  (Strix Halo / any non-NVIDIA box).
- dflash/src/flashprefill.cpp: keep only the cuda_runtime.h ->
  device_runtime.h header swap. Drop the per-device cudaMalloc scratch
  cache (CUDA-only perf opt; HIP path uses flash_prefill_forward_q8 so
  this code never runs on HIP).
- dflash/src/qwen3_0p6b_loader.cpp: drop the src-vs-dst type/byte
  defensive checks (unrelated to HIP).
- dflash/include/dflash27b.h: drop QWEN35_9B macro alt-config block
  (macro-guarded, no internal callers).
- dflash/test/test_flashprefill_kernels.cpp: fully revert to origin/main.
  Gate test build on NOT DFLASH27B_USE_HIP in CMakeLists; this test
  exercises BF16 WMMA kernels that the HIP path force-disables anyway.
- root .gitignore: drop dflash/build-hip and pflash/models entries.

Net diff vs origin/main: 28 -> 20 files, +2376/-613 -> +1734/-587.
Imports rocWMMA-native flashprefill kernels (mean / score-GEMM /
select / sparse-FA) from PR Luce-Org#117 behind DFLASH27B_HIP_SM80_EQUIV=ON.
Phase 1 (default) keeps the ggml q8 fallback unchanged.

On gfx1151 / ROCm 7.2 the FP-kernel speedup vs Phase 1 grows with seq
length: 2.2x@2K, 3.9x@8K, 5.0x@16K. End-to-end NIAH compress at
S=7270 goes 4.21s -> 2.70s (1.56x), accuracy 1/1 on both phases,
output byte-identical.

Bug fixed in the same commit: use_bf16_fp was hardcoded false on HIP
in qwen3_0p6b_graph.cpp, so the new kernels were linked but never
reached. Gate now keys off DFLASH27B_HAVE_FLASHPREFILL && MIN_SM>=80.

Also addresses cubic-dev-ai review on PR Luce-Org#119:
  - P0: graph-B reorder in qwen3_0p6b (gf_proj_add before reading
        h_after)
  - P1: restore prefix-cache impls in qwen35_target_graph; drop the
        stub prefix_cache.cpp
  - P1: cap->conv_input view sizing
  - P2: rename DFLASH27B_USER_CUDA_ARCHITECTURES -> _GPU_ for HIP
  - P2: n_vocab guard in gguf_target_loader
  - P2: free_drafter on prewarm fail; portable env restore via lambda
        (Windows _dupenv_s/_putenv_s); thread pool_kernel through
        qwen35_score_and_compress
  - P2: drop broken pflash/.gitignore
…alloc checks

- CMakeLists.txt: pflash_daemon and spike_thin_copy now link against
  ${_dflash27b_ggml_backend_lib} instead of hardcoded ggml-cuda, fixing
  HIP build link failures reported by weicj.
- gguf_target_loader.cpp: restore structural validation removed in
  prior cleanup — kl!=vl check, n_layer%fai divisibility, full
  rope_sections validation (presence, count, bounds vs head_dim),
  EOS metadata assignment, capture_layer_ids recomputation, and
  output.weight requirement in top-level tensor check.
- bsa_launcher_hip.cu: check hipMalloc return values; on failure,
  roll back partial allocations, reset kv_buf_cap, and return -1
  instead of silently proceeding with null buffers.
@smpurkis smpurkis force-pushed the strix-halo-rocm-compat branch from f285bff to 5d15549 on May 10, 2026 01:07
smpurkis added a commit to smpurkis/lucebox-hub that referenced this pull request May 10, 2026
@smpurkis smpurkis force-pushed the strix-halo-rocm-compat branch from 5d15549 to 73678fa on May 10, 2026 07:39
smpurkis and others added 3 commits May 10, 2026 14:00
- README.md: take upstream's GPU column + RTX 5090 row in supported models table
- pflash/tests/bench_niah_cpp.py: take upstream's simplified user prompt
- .gitmodules: repoint llama.cpp submodule from smpurkis/llama.cpp:master to
  Luce-Org/llama.cpp-dflash-ggml:luce-dflash (companion PR Luce-Org#8 merged 2026-05-08,
  HIP fixes are now upstream)
- dflash/deps/llama.cpp: advance to c79573c9b (includes HIP dflash aliases +
  TQ3_0 FA guard + turbo_wht + laguna arch)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…flicts

Resolve merge conflicts from upstream/main which refactored the CMake
build system from DFLASH27B_USE_HIP boolean to DFLASH27B_GPU_BACKEND
string option ("cuda"/"hip"). Key changes:

- Adopt main's DFLASH27B_GPU_BACKEND / DFLASH27B_GGML_BACKEND_TARGET
  naming throughout CMakeLists.txt
- Preserve our branch's HIP-specific additions: ROCm rpath, gfx1151
  default arch, rocWMMA Phase 2 flashprefill, BSA HIP launcher
- Update DFLASH27B_USE_HIP → DFLASH27B_BACKEND_HIP compile definition
  across all source files (device_runtime.h, qwen3_0p6b_graph.cpp,
  qwen3_drafter.cpp, flashprefill.cpp, flashprefill_kernels.cu)
- Take main's ggml-hip compile_definitions shim and hip_compat include
- Take main's cuda_cross_device_copy.cpp for CUDA builds

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The file was listed in _dflash27b_sources but never committed to
either branch. Caught by cmake configure during build verification.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>