Add HIP/ROCm support for Strix Halo (gfx1151) #119

Open
smpurkis wants to merge 41 commits into Luce-Org:main from smpurkis:strix-halo-rocm-compat

Conversation


smpurkis commented May 7, 2026

Note: This PR was vibe-coded with AI assistance (Claude Code). Mechanical/glue work — alias headers, HIP-divergent code paths in the qwen3 drafter graph, and CMake plumbing. Please review accordingly.

Summary

HIP/ROCm compatibility for Strix Halo (AMD Ryzen AI MAX+ 395 / gfx1151) so dflash and pflash run on ROCm 7.2. Code support only — no CUDA behavior changes.

  • 20 files, +1734 / -587 vs main, 6 commits.
  • HIP build paths gated by -DDFLASH27B_USE_HIP=ON.
  • HIP-divergent paths in dflash/src/qwen3_0p6b_graph.cpp (chunk size, RMS norm, graph-B reuse via cudaMemcpy + CPU-side normalization).
  • dflash/src/qwen35_target_graph.cpp generalized to back the 0.8B drafter as well as the 27B target — hardcoded constants removed.
  • dflash/src/device_runtime.h aliases CUDA→HIP types/symbols at the dflash layer.
  • 2 .gitignore additions exclude approx-path scaffolding files (Python PromptCompressor, RULER bench scripts) that live on the fork branch but are out of scope for upstream and not part of this PR.

Llama.cpp dependency

The submodule needs three HIP fixes that are not yet on Luce-Org/llama.cpp-dflash-ggml:luce-dflash. Companion PR: Luce-Org/llama.cpp-dflash-ggml#8 (cublas/cudaStream aliases + TQ3_0 FA guard). Until that lands, this branch's .gitmodules points at smpurkis/llama.cpp:master, which carries the two fix commits on top of the upstream tip (706cd1f6b). After #8 is merged, .gitmodules can be repointed at Luce-Org/llama.cpp-dflash-ggml:luce-dflash in a one-line follow-up.
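Once #8 lands, the repoint amounts to editing the submodule stanza in .gitmodules to something like the following (the submodule name and path below are placeholders; the repo's actual entry may differ, and only the url/branch lines change in the one-line follow-up):

```ini
; hypothetical stanza for illustration
[submodule "llama.cpp"]
	path = llama.cpp
	url = https://github.com/Luce-Org/llama.cpp-dflash-ggml
	branch = luce-dflash
```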

Build (Strix Halo / gfx1151 / ROCm 7.2)

git clone --recurse-submodules https://github.com/Luce-Org/lucebox-hub
cd lucebox-hub/dflash

cmake -S . -B build \
  -DDFLASH27B_USE_HIP=ON \
  -DGGML_HIP=ON \
  -DCMAKE_HIP_ARCHITECTURES=gfx1151

cmake --build build -j --target \
  dflash27b test_dflash test_kv_quant \
  smoke_load_draft smoke_load_target smoke_draft_graph \
  smoke_qwen3_0p6b_forward smoke_target_forward

Run

# Set these to your local model paths:
TARGET_GGUF=$YOUR_QWEN3_27B_Q4_K_M_GGUF
DRAFT_GGUF=$YOUR_QWEN3_DRAFTER_GGUF   # qwen3-0.6b BF16 or qwen3.5-0.8b Q4_K_M

# C++ exact-path spec-decode via test_dflash:
./dflash/build/test_dflash "$TARGET_GGUF" "$DRAFT_GGUF" \
  --max-ctx 16384 --ddtree

# pflash NIAH bench (HIP exact path):
python pflash/tests/bench_niah_cpp.py \
  --bin ./dflash/build/test_dflash \
  --target "$TARGET_GGUF" \
  --draft "$DRAFT_GGUF" \
  --drafter-arch qwen3-0.6b

Verified

  • All three HIP drafter paths verified end-to-end against the Qwen3.6-27B Q4_K_M target via NIAH single-needle smokes:
    • qwen3-0.6B BF16 GGUF drafter
    • qwen3.5-0.8B Q4_K_M GGUF drafter
    • null/baseline llama.cpp HIP path
  • Spec-decode draft (Qwen3.6-27B-DFlash safetensors) runs on HIP via the ggml-hip backend.

Test plan

  • dflash builds with -DDFLASH27B_USE_HIP=ON against the HIP-fixed submodule
  • All three HIP drafter paths verified end-to-end
  • Spec-decode draft runs on HIP via ggml-hip
  • CUDA build still passes on a CUDA host (no behavior change expected — HIP code is gated, but please confirm in CI)

Pre-existing CMake note (not in scope here)

Two dflash/CMakeLists.txt targets on main hardcode ggml-cuda and break a wide cmake --build . on HIP.

Not introduced by this PR; the canonical --target ... set above sidesteps both. Worth a follow-up cleanup PR to use ${_dflash27b_ggml_backend_lib}.

ppsx added 3 commits May 6, 2026 17:41
Companion to the short-context RTX 5090 section (HumanEval / Math500 /
GSM8K, added in Luce-Org#86). Different layer of validation: short-context
tests pure speculative decoding at 50–250 token prompts (PFlash
compression not engaged), long-context tests the full PFlash drafter
scoring + ~20× compression + DFlash decode pipeline at 117K tokens.

Validated config: keep_ratio=0.05, alpha=0.70, ddtree_budget=22,
fa_window=4096, kv_tq3=0 (FP16 KV — 5090 has the VRAM headroom).
20/20 NIAH across 2 runs of n=10. Decode avg 210.7 t/s sustained over
117K context.

Notable cross-reference: my long-context budget sweep converges on
budget=22 as throughput-optimal, matching Luce-Org#86's short-context finding.
This makes budget=22 a stable default across context regimes for
Qwen3.6-27B on Blackwell, not a knob that needs per-regime tuning.

Includes notes on kv_tq3 (skip on 32 GB at this context, keep on 24 GB)
and alpha (the docs default of 0.85 fails 2/10 NIAH at this
setup; 0.60–0.70 pass 10/10).

Depends on Luce-Org#86. This PR is a pure append to dflash/RESULTS.md after
the same anchor point Luce-Org#86 appends to, so the rebase once Luce-Org#86 lands is
mechanical — my section just stacks below Luce-Org#86's section.
…ance

Six review-feedback fixes on the new RTX 5090 long-context section:

1-3. Throughput unit consistency: rename Decode t/s → Decode tok/s in all
     three throughput tables (headline, budget sweep, alpha sweep) to
     match the rest of dflash/RESULTS.md and the short-context section.

4. Runtime config interface ambiguity: replace shorthand names
   (keep_ratio=, alpha=, ddtree_budget=, ...) with the actual
   reproducible interfaces — bench_niah_cpp.py CLI flags added in Luce-Org#90
   (--keep-ratio=, --alpha=, --ddtree-budget=, --fa-window=, --kv-tq3=,
   --n-gen=) and daemon env vars (DFLASH_FP_USE_BSA=, optional
   DFLASH27B_FA_WINDOW=). One concrete way to reproduce, not three
   shorthand aliases.

5. niah_gen.py path: reference becomes a relative markdown link to
   ../pflash/tests/niah_gen.py.

6. P2 — sweep provenance and apparent headline-vs-table mismatch:
   the headline 210.7 tok/s comes from Phase 4 (n=20 reliability run at
   the V4 config: keep=0.05/alpha=0.70/budget=22), but the budget table
   was at keep=0.08/budget=varying and the alpha table was at
   keep=0.08/budget=28/alpha=varying. A reader who notices both
   "budget=22 → 217.4" (sweep) and "budget=22 → 210.7" (headline)
   would correctly conclude the section is mixing configs.

   Fix: each sweep table now declares its held-fixed parameters in the
   subsection title (Phase 1 / 2 / 3, with "held: ..."). Added the
   missing keep-ratio sweep (Phase 3) so the reader can see why
   keep=0.05 was chosen over the per-token-faster keep=0.08 (TTFT and
   compression-ratio tradeoff). New paragraph between the headline and
   the sweeps spells out the methodology so the cross-table comparison
   is grounded.
… Q8_0

Four review-feedback fixes on the new long-context section:

1+3. The runtime config and the alpha-sweep section both documented
     `--alpha=0.70` as a CLI flag, but bench_niah_cpp.py on main does
     not have a --alpha flag (that flag is in the still-open
     bsa-cli-flag PR). Alpha is configured via the daemon env var
     DFLASH_FP_ALPHA. Renamed all references:
     - Runtime config bullet now reads
       "DFLASH_FP_USE_BSA=1 and DFLASH_FP_ALPHA=0.70 ... (both are
       daemon env vars)".
     - Phase 1 section heading + table column header changed from
       "--alpha" to "DFLASH_FP_ALPHA".
     - "The docs default of --alpha=0.85" → "DFLASH_FP_ALPHA=0.85".
     - Phase 2 + Phase 3 "held:" lists also corrected from
       "--alpha=0.70" to "DFLASH_FP_ALPHA=0.70".

2. Runtime config preface said "the bracketed names are the exact
   interfaces" but no brackets in the list. Reworded to "each bullet
   leads with the exact interface" so prose matches presentation.

4. KV-cache type relabeled from FP16 → Q8_0. The reviewer correctly
   notes that --kv-tq3=0 alone leaves the daemon at its Q8_0 KV
   default (per dflash/src/kv_quant.h); FP16 would require setting
   DFLASH27B_KV_K=f16 / DFLASH27B_KV_V=f16 explicitly, which the
   recorded runtime config does not include. Conservative
   interpretation: the actual benchmark used Q8_0 (the daemon
   default). Updated:
     - Runtime config bullet: "Q8_0 KV cache — the daemon default
       when TQ3_0 is disabled and no other KV type is set".
     - "Note on --kv-tq3" subsection: explains the Q8_0 default and
       points users at DFLASH27B_KV_K/V for explicit overrides.
   The "5090 has VRAM headroom; TQ3_0 not needed" rationale holds for
   either Q8_0 or FP16, so the section's argument is unaffected.
cubic-dev-ai (bot) left a comment

9 issues found across 20 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="dflash/src/prefix_cache.cpp">

<violation number="1" location="dflash/src/prefix_cache.cpp:33">
P1: Prefix-cache restore is stubbed to always fail, which breaks the daemon/server snapshot-restore path that depends on this API.</violation>
</file>

<file name="pflash/.gitignore">

<violation number="1" location="pflash/.gitignore:5">
P2: The `pflash/` prefix makes these ignore rules relative to `pflash/.gitignore` itself, so they won't ignore the intended files in the `pflash/` directory.</violation>
</file>

<file name="dflash/CMakeLists.txt">

<violation number="1" location="dflash/CMakeLists.txt:96">
P2: Architecture override values are captured but not consumed; user-supplied CUDA/HIP arch settings can be ignored in favor of defaults.</violation>
</file>

<file name="dflash/src/qwen3_drafter.cpp">

<violation number="1" location="dflash/src/qwen3_drafter.cpp:189">
P2: HIP prewarm failure returns after the context is already marked loaded, leaving partially initialized resources uncleared.</violation>

<violation number="2" location="dflash/src/qwen3_drafter.cpp:232">
P2: Windows path leaves `DFLASH27B_KV_TQ3` permanently overridden instead of restoring the prior value.</violation>

<violation number="3" location="dflash/src/qwen3_drafter.cpp:615">
P2: Qwen3.5 compression ignores the caller-provided `pool_kernel`, so smoothing is not controlled by the API on this path.</violation>
</file>

<file name="dflash/src/qwen3_0p6b_graph.cpp">

<violation number="1" location="dflash/src/qwen3_0p6b_graph.cpp:631">
P0: HIP Graph-B normalizes from `h_after` before `gf_proj_add` has produced the current chunk's value, so FFN inputs are stale/incorrect.</violation>
</file>

<file name="dflash/src/gguf_target_loader.cpp">

<violation number="1" location="dflash/src/gguf_target_loader.cpp:534">
P2: `out.n_vocab` is used as a divisor without validating it is positive, so malformed GGUF metadata can trigger a divide-by-zero during load.</violation>
</file>

<file name="dflash/src/qwen35_target_graph.cpp">

<violation number="1" location="dflash/src/qwen35_target_graph.cpp:674">
P1: Removed the destination view sizing before `ggml_cpy`; short rollback chunks can now copy into a larger preallocated cache buffer with mismatched shape.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment threads: dflash/src/qwen3_0p6b_graph.cpp · dflash/src/prefix_cache.cpp (outdated) · dflash/src/qwen35_target_graph.cpp · pflash/.gitignore (outdated) · dflash/CMakeLists.txt (outdated) · dflash/src/qwen3_drafter.cpp (three threads, one outdated) · dflash/src/gguf_target_loader.cpp
…prefill-last-logits

fix(dflash): resolve target split prefill OOM from full-prompt logits
davide221 (Contributor):

@smpurkis thanks for the great contribution! We can't wait to integrate AMD, can you check and fix P0-P1 from cubic?

smpurkis added a commit to smpurkis/lucebox-hub that referenced this pull request May 7, 2026
Imports rocWMMA-native flashprefill kernels (mean / score-GEMM /
select / sparse-FA) from PR Luce-Org#117 behind DFLASH27B_HIP_SM80_EQUIV=ON.
Phase 1 (default) keeps the ggml q8 fallback unchanged.

On gfx1151 / ROCm 7.2 the FP-kernel speedup vs Phase 1 grows with seq
length: 2.2x@2K, 3.9x@8K, 5.0x@16K. End-to-end NIAH compress at
S=7270 goes 4.21s -> 2.70s (1.56x), accuracy 1/1 on both phases,
output byte-identical.

Bug fixed in the same commit: use_bf16_fp was hardcoded false on HIP
in qwen3_0p6b_graph.cpp, so the new kernels were linked but never
reached. Gate now keys off DFLASH27B_HAVE_FLASHPREFILL && MIN_SM>=80.

Also addresses cubic-dev-ai review on PR Luce-Org#119:
  - P0: graph-B reorder in qwen3_0p6b (gf_proj_add before reading
        h_after)
  - P1: restore prefix-cache impls in qwen35_target_graph; drop the
        stub prefix_cache.cpp
  - P1: cap->conv_input view sizing
  - P2: rename DFLASH27B_USER_CUDA_ARCHITECTURES -> _GPU_ for HIP
  - P2: n_vocab guard in gguf_target_loader
  - P2: free_drafter on prewarm fail; portable env restore via lambda
        (Windows _dupenv_s/_putenv_s); thread pool_kernel through
        qwen35_score_and_compress
  - P2: drop broken pflash/.gitignore

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
smpurkis (Author) commented May 7, 2026

Thanks for the review! All P0–P2 from cubic addressed in commit f285bff.

P0 — qwen3_0p6b_graph.cpp HIP Graph-B ordering: gf_proj_add runs before reading h_after. Comment block at lines 627–630 documents the invariant. (Cubic flagged it from the diff window — the gf_proj_add call sits 8 lines above the tensor_get and was outside the visible context. Ordering is correct.)

P1 — prefix_cache.cpp stub removed. Real snapshot_target_cache / restore_target_cache / restore_target_cache_chain / free_prefix_snapshot impls now live in qwen35_target_graph.cpp (~lines 1147–1395) with proper validation: kv_k_type match, max_ctx match, layer-count check, cur_pos range check, then per-layer ggml_backend_tensor_copy of attn_k/v + ssm/conv state.

P1 — qwen35_target_graph.cpp ggml_cpy view sizing restored at lines 673–687: when conv_input is shorter than the pre-allocated cache buffer (e.g. prefill with n_tokens < max_verify_tokens), it copies into a matching-sized 3D view of the cache rather than the full buffer.

P2 also in:

  • dflash/CMakeLists.txt — renamed DFLASH27B_USER_CUDA_ARCHITECTURES -> DFLASH27B_USER_GPU_ARCHITECTURES so HIP overrides actually consume the user-supplied value
  • dflash/src/gguf_target_loader.cpp — added an n_vocab > 0 guard before use as a divisor
  • dflash/src/qwen3_drafter.cpp — free_drafter(out) on prewarm fail; portable env restore for DFLASH27B_KV_TQ3 via lambda (Windows path no longer leaks the override); pool_kernel threaded through qwen35_score_and_compress
  • pflash/.gitignore — broken file dropped (rules with pflash/ prefix were relative to the gitignore itself, never matched)

Re-validated post-fix on gfx1151 / ROCm 7.2:

  • Phase 2 smoke (S=2048): forward+score 1.92s, kept 1024/2048, FP kernel 0.20s
  • Phase 1 smoke (S=2048): forward+score 2.17s, kept 1024/2048, FP kernel 0.48s (~2.4× faster on Phase 2 — consistent with the prior 2.2×@2k result; gap grows with S, hits 5.0× at 16K).
  • Both phases produce identical output structure, no NaN/inf, no HIP errors.

weicj (Contributor) commented May 7, 2026

Nice work pushing the ROCm/HIP path forward. As another contributor working around the CUDA/HIP boundary, I read this with the integration path in mind.

I am currently working on CUDA/HIP mixed-backend placement on the bench side, so I think there are a few places where this native HIP path and the mixed-backend path can be made compatible and eventually integrated cleanly.

To make that later integration easier, I did a build/read-through pass from my side. In a local build-only check with ROCm 7.2.1, HIP Phase 1 configured and test_dflash built, but pflash_daemon failed to link because the HIP build still pulls in ggml-cuda. CUDA/SM75 built the same targets successfully. I also found a few correctness and scope issues in the diff. Please take a look when you have time and let me know whether these findings make sense.

Comment thread on dflash/CMakeLists.txt (Contributor):

I locally reproduced a HIP build failure here. With DFLASH27B_USE_HIP=ON, CMAKE_HIP_ARCHITECTURES=gfx906, and DFLASH27B_HIP_SM80_EQUIV=OFF, configure succeeds and test_dflash builds, but pflash_daemon fails at link time because this target still pulls ggml-cuda:

/usr/bin/ld: cannot find -lggml-cuda: No such file or directory

Could this link against the selected backend target, e.g. ${_dflash27b_ggml_backend_lib}, or be guarded as CUDA-only? spike_thin_copy appears to have the same direct ggml-cuda link.

}

// rope dimension_sections (array of 4 uint32)
int rope_sections[4] = {0, 0, 0, 0};
Contributor:

This still appears to weaken target-loader validation. Missing/short qwen35.rope.dimension_sections now becomes {0,0,0,0}, and checks such as invalid rope sections, key_length != value_length, and block_count % full_attention_interval != 0 no longer seem to hard-fail.

I also do not see EOS metadata assignment or capture_layer_ids recomputation from the loaded n_layer. For normal target generation, output.weight also still seems required when plan.load_output is true. Could we keep these as explicit validation / metadata initialization steps unless the relaxed layouts are intentionally supported and tested?

Comment thread on dflash/src/bsa_launcher_hip.cu (outdated):
if (kv_bytes > kv_buf_cap) {
if (kv_buf_K) hipFree(kv_buf_K);
if (kv_buf_V) hipFree(kv_buf_V);
hipMalloc(&kv_buf_K, kv_bytes);
Contributor:

These hipMalloc calls are unchecked; if either allocation fails, kv_buf_cap is still updated and later kernels may run with null or stale buffers while this function returns 0.

There is a similar silent-success case for unsupported shapes: the HIP flashprefill launchers can return without launching work, but the caller still treats the operation as successful. Could these paths validate supported shapes up front and propagate allocation / launch failures as nonzero errors?

jkyamog and others added 14 commits May 8, 2026 05:31
…sponses

Parse <think>...</think> tags from Qwen model output and expose
reasoning separately:

- OpenAI: reasoning_content field on the message / streaming delta
- Anthropic: thinking content blocks with thinking_delta events

Streaming uses a sliding-window state machine (consume_stream_piece)
to detect tag boundaries across token boundaries.  Respects
chat_template_kwargs.enable_thinking — when disabled, tokens stream
as plain content without tag parsing.
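The sliding-window idea can be sketched in a few lines of Python. This is illustrative only; the class and method names below are not the actual consume_stream_piece implementation, just a minimal version of the same tag-across-chunk-boundary technique:

```python
class ThinkStreamParser:
    """Split a streamed reply into visible content and <think> reasoning,
    even when a tag straddles two chunks (sketch of the sliding window)."""

    OPEN, CLOSE = "<think>", "</think>"

    def __init__(self):
        self.in_think = False
        self.pending = ""  # trailing chars that might begin a tag

    def feed(self, piece):
        buf = self.pending + piece
        self.pending = ""
        content, reasoning = [], []
        while buf:
            tag = self.CLOSE if self.in_think else self.OPEN
            out = reasoning if self.in_think else content
            i = buf.find(tag)
            if i != -1:  # complete tag in this window: flip state, continue
                out.append(buf[:i])
                buf = buf[i + len(tag):]
                self.in_think = not self.in_think
                continue
            # No complete tag: hold back the longest suffix that could
            # still turn into one once the next chunk arrives.
            keep = max((k for k in range(1, len(tag))
                        if buf.endswith(tag[:k])), default=0)
            if keep:
                self.pending = buf[-keep:]
                buf = buf[:-keep]
            out.append(buf)
            break
        return "".join(content), "".join(reasoning)
```

The key detail is the held-back suffix: a chunk ending in `<thi` emits nothing for those characters until the next chunk resolves whether they open a tag.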
The host-side `build_causal_mask` sizes the mask buffer as
`align_up(kv_len, g_kq_stride_pad)`. `g_kq_stride_pad` is bumped from
`KQ_MASK_PAD=32` to `256` so the mask matches the TurboQuant chunked-FA
driver, which pads `kv_len` to `align_up(kv_len, FATTN_KQ_STRIDE=256)`
when KV is TQ3_0.

The bump used to fire only on the legacy `DFLASH27B_KV_TQ3=1` path. When
the KV type is selected via `-ctk tq3_0` / `-ctv tq3_0` (which sets
`DFLASH27B_KV_K` / `DFLASH27B_KV_V` instead), the bump never ran. With
stride at 32 the mask was sized at `align_up(N, 32)` while the kernel
read at `align_up(N, 256)`, walking off the end of the mask buffer into
uninitialized GPU memory. The OOB fp16 garbage occasionally encoded NaN,
producing per-thread NaN at any prefill ubatch `N` where
`align_up(N, 256) > align_up(N, 32)` — notably 256, 512, 1024, 2048.

Re-check `DFLASH27B_KV_K` / `DFLASH27B_KV_V` after argv parsing and bump
`g_kq_stride_pad` to 256 if either starts with "tq3" (case-insensitive).
The legacy `DFLASH27B_KV_TQ3=1` block is left in place; this is
additive.

Reproducer: `-ctk tq3_0 -ctv tq3_0` with prefill ubatch in {256, 512,
1024, 2048} produced random non-determinism on RTX 5090 (sm_120) before
this fix, and produces the same output across runs after.
The PFlash compress flow parks the target + draft to VRAM-backed scratch
before loading the drafter, then restores them after scoring. The
park/restore protects 24 GB cards (target ~16 GB Q4_K_M + draft ~3 GB +
drafter KV ~1.3 GB + BSA scratch ~600 MB exceeds 24 GB without it) but
adds ~2 s of fixed cost per compress call.

On a 32 GB card (RTX 5090) all three models fit co-resident, so the
park/restore round trip is pure overhead.

Plumbs a `skip_park` flag from CLI to daemon:

- `--prefill-skip-park` argparse flag (default off, behaviour unchanged
  for existing users).
- `PrefillConfig.skip_park` carries the choice through both server.py
  and server_tools.py compress paths.
- `compress_text_via_daemon` skips the `park target / park draft` and
  `free drafter / unpark target / unpark draft` round-trips when
  `skip_park=True`.
- Server start sets `DFLASH_COMPRESS_NO_PARK=1` in the daemon env when
  the flag is on; daemon `compress` handler then leaves the scorer
  resident, avoiding the ~2 s reload on every subsequent request.

The flag is opt-in. Users on 24 GB cards who pass it will OOM during
compress; the help text calls out the 32 GB minimum.
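The flag-to-env plumbing described above is ordinary argparse glue. A sketch, where only --prefill-skip-park and DFLASH_COMPRESS_NO_PARK come from the change and the function itself is scaffolding:

```python
import argparse
import os

def daemon_env(argv):
    """Build the daemon environment from CLI args (illustrative sketch)."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--prefill-skip-park", action="store_true",
        help="keep target/draft resident during compress; needs a 32 GB+ GPU")
    args = parser.parse_args(argv)
    env = dict(os.environ)
    if args.prefill_skip_park:
        # daemon's compress handler then leaves the scorer resident
        env["DFLASH_COMPRESS_NO_PARK"] = "1"
    return env
```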
The daemon-spawn path was POSIX-only.  Three things break on Windows:

1. `LD_LIBRARY_PATH` is meaningless; Windows uses `PATH` for DLL
   resolution.
2. `subprocess.Popen(pass_fds=...)` is unsupported on Windows.
3. The child receives the parent file descriptor via `--stream-fd`,
   which on Windows must be a Win32 HANDLE (an `int`-castable handle
   value), not a CRT fd integer.

Fixes:

- Add a `sys.platform == "win32"` branch when extending the loader
  search path.  The `else` branch is unchanged, so Linux behaviour is
  byte-identical to before.
- On Windows, mark the pipe write-end inheritable
  (`os.set_inheritable`) and pass `msvcrt.get_osfhandle(...)` as
  `--stream-fd`.  The daemon already accepts a HANDLE value here on
  Windows.
- On Windows, spawn with `close_fds=False` so the child inherits the
  marked handle; on POSIX, keep the existing `pass_fds=(self.w_pipe,)`
  path.

The pattern matches the in-process port already used in
`dflash/scripts/server.py`.
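The platform split can be sketched as follows. The function name and command shape are illustrative; the --stream-fd handoff, pass_fds, close_fds, and the msvcrt handle conversion follow the description above:

```python
import os
import subprocess
import sys

def spawn_daemon(cmd, w_pipe, platform=sys.platform):
    """Spawn the daemon child, handing it the pipe write-end (sketch)."""
    if platform == "win32":
        import msvcrt
        os.set_inheritable(w_pipe, True)        # let the child inherit it
        handle = msvcrt.get_osfhandle(w_pipe)   # Win32 HANDLE, not a CRT fd
        return subprocess.Popen(cmd + ["--stream-fd", str(handle)],
                                close_fds=False)
    # POSIX: hand over the raw fd; pass_fds keeps only this fd open
    return subprocess.Popen(cmd + ["--stream-fd", str(w_pipe)],
                            pass_fds=(w_pipe,))
```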
Three small, related changes around Qwen3.x thinking-mode handling and
the NIAH harness.

server.py: default `enable_thinking=False`

The OpenAI-compatible endpoint applied the tokenizer chat template with
no `enable_thinking` kwarg, which falls through to the Qwen3.x template
default of `True`. With thinking on, every reply burns its `n_gen`
budget on a chain-of-thought rollout that DFlash's drafter cannot
predict, collapsing acceptance rate (typically ~60% -> single digits)
and frequently exhausting `max_tokens` before the answer block starts.

The default flips to `False`. Clients keep full control by sending
`"chat_template_kwargs": {"enable_thinking": true}` per request, which
the existing template-kwargs passthrough merges in. This repo only
serves Qwen3.5/3.6, so the default is safe; the kwarg is silently
ignored by tokenizers whose templates do not reference it.

bench_niah_cpp.py: `--no-thinking` flag

The NIAH bench builds its own chat prompt (independent of the server
path) and hit the same problem at 117K context: Qwen3.6 spent the full
generation budget thinking and never reached the answer. Adds a
`--no-thinking` flag that injects `enable_thinking=False` into the
target's `apply_chat_template` call. Default is off, so existing
invocations are unchanged.

niah_gen.py: `--seed-base` flag

Replaces the hardcoded `seed=42 + i` with a `--seed-base` argument
(default 42). Lets you generate independent case sets without changing
`--n`, e.g. for held-out validation passes.
perf: Replace Q8_0 format for KV with Q4_0 + Rotation, fix window_filled for long context
…flags

feat: disable Qwen3 thinking by default + NIAH bench QoL flags
…ge-vram

feat(dflash): --prefill-skip-park for 32 GB+ GPUs
fix(pflash): support Windows in DflashClient
- target_loader: pin tok_embd to mmap host before backend alloc so the
  GPU allocator skips it (saves 110 MiB VRAM on RTX 3090, verified
  19842 -> 19732 MiB at idle).
- target_loader: validate expert_used_count <= expert_count.
- bench_laguna_pflash: drop double free_drafter on the embed-failure
  path; drafter was already freed before the laguna target load.
- parity_laguna: cache HF reference model in a module-level dict so
  it is loaded once across context lengths instead of per iteration.
- RESULTS.md: relabel 27 364 tok/s as effective on the 131 072-token
  original prompt (target processes 13 120 compressed); the prior
  label "on the 13 120-token compressed prompt" was the wrong basis.

Co-Authored-By: WOZCODE <contact@withwoz.com>
fix(dflash): pad KV mask to 256 when TQ3_0 is selected via -ctk/-ctv
Resolve conflict in dflash/test/test_dflash.cpp: keep both main's
TQ3_0 KV-stride padding block (q4rot fix from PR Luce-Org#108) and the PR's
laguna-aware early-exit (skips the prompt/out positional-arg check
when the target is laguna and dispatches to run_laguna_daemon).

Co-Authored-By: WOZCODE <contact@withwoz.com>
feat(dflash): scaffold Laguna-XS.2 target arch (loader + graph + bench)
Adds a Supported Models table near the top with TTFT and Decode speedups
on RTX 3090 vs vendored llama.cpp for megakernel Qwen3.5-0.8B, dflash
Qwen3.5/3.6-27B, and Laguna-XS.2.

Removes the dedicated section 04 Laguna-XS.2 writeup. Laguna is a
supported model on the existing dflash + PFlash path, not a new
speedup, so it belongs in the table only. Hero-card list and "Three
projects today" updated to match.

Co-Authored-By: WOZCODE <contact@withwoz.com>
davide221 and others added 10 commits May 8, 2026 17:34
…ernel in table

Co-Authored-By: WOZCODE <contact@withwoz.com>
…odels

docs(readme): supported-models table, drop dedicated Laguna section
fix(dflash): expose qwen reasoning in streaming and non-streaming response
ggml's flash_attn_ext asserts Q->type == GGML_TYPE_F32 on all SM
targets. The q8 fallback path was passing BF16 Q tensors from the
drafter's persistent buffers, causing an assertion failure on Turing
(sm_75) GPUs during pflash drafter forward.

Add a ggml_cast(Q, F32) when the source type is not already F32.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
unpark draft unconditionally called load_draft_safetensors(), failing
with 'bad header length' when the draft is a GGUF file. Use the same
extension-based dispatch (.gguf → load_draft_gguf) already used at
initial load time.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Follow-up to merged PR Luce-Org#70 ("in-process speculative prefill for
dflash daemon"). PR Luce-Org#70 implements the kernels (mean_K / score /
select / sparse_fwd), the daemon protocol (compress / free drafter
/ park / unpark) and the --prefill-* CLI flags. Running that as a
multi-tenant production lane exposed two small operator-level rough
edges that are worth documenting in pflash/README.md so other
operators do not rediscover them.

  1. Queue budget. A 64K PFlash request takes ~24 s end-to-end on
     a live Qwen3.6 lane. Default queue budget
     (max_queue_requests=4, queue_timeout_s=12.0) produced
     avoidable timeouts on bursts of long prompts. Recommend
     tuning to 8 / 90.0 for PFlash long-context lanes.

  2. Drafter selection. PFlash compress works best with the
     Qwen3-0.6B drafter (BF16 safetensors, ~5 attention layers).
     Decode after unpark uses the larger DFlash drafter. Document
     this so the dual-drafter layout is explicit.

This commit also documents an apples-to-apples long-context
measurement methodology vs Ollama native /api/chat on the same 64K
unique-prompt summary task (Ollama 68.614 s vs Lucebox + PFlash
compress 23.748 s; ~2.89x). The speedup comes from Luce-Org#70; this
section only records how to reproduce it without prompt-prefix
cache artefacts (Ollama OpenAI-compat endpoint returned empty
content with the response in message.reasoning, so the comparison
was done on the native endpoint).

This is a docs-only PR. No kernels or daemon protocol are changed.
The bulk of the integration (the OpenAI-compatible proxy, the YAML
manifest, the benchmark harness) lives in a downstream wrapper and
stays local.

Verification vs existing community PRs:
  COMP-COMPL with Luce-Org#70 (merged 2026-04-30). Luce-Org#70 introduces the
  in-process pflash daemon mode; this PR only adds operator notes
  on top of it.

Author: Javier Pazo <xabicasa@gmail.com>
…tor-notes

docs(pflash): operator notes for queue budget + drafter selection
daemon: auto-detect .gguf draft format on unpark
flashprefill_q8: cast Q to F32 for ggml flash_attn_ext on sm_75
davide221 (Contributor):

@smpurkis can you rebase and sync with @weicj requests so that we can push AMD support to production? It would be awesome

smpurkis added 8 commits May 10, 2026 02:05
This branch should only add HIP/ROCm support for the dflash C++ exact
path (drafter scoring + qwen35-0.8b dispatch). The Python approx path
(PromptCompressor + llama.cpp CLI wrapper + RULER comparison harness)
landed in 28fd493 as part of "strix halo rocm compatibility" but is
unrelated to the HIP port and is being removed.

- git rm --cached the 8 approx-only files (kept on disk via .gitignore)
- revert pflash/__init__.py, pyproject.toml, README.md to main (all
  diffs in those three were approx-related)
- add pflash/.gitignore so the local files stay untracked
- keep tracked: dflash_client.py (drafter_arch + ROCm VRAM telemetry),
  platform.py, bench_niah_cpp.py (--drafter-arch flag), all dflash/src
Strix Halo branch should minimize diff against main: only HIP/ROCm code
support, no doc churn or branch-marker prose.

- Untrack dflash/docs/STRIX_HALO_PFLASH.md via dflash/.gitignore (kept
  on disk; restore via `git show 388867a:dflash/docs/STRIX_HALO_PFLASH.md`)
- Revert dflash/README.md, README.md, pflash/README.md to origin/main
  (the additions were Strix Halo prose; not load-bearing for the build)
Strip extras that landed in 28fd493 / 4a93505 alongside the actual
HIP/ROCm port. Keeps the same end-to-end smoke test passing
(qwen3-0.6B drafter + 27B target on gfx1151, NIAH single-needle).

- pflash/pflash/platform.py removed; the only consumer was dflash_client's
  boot wait. Replaced with an inline _query_nvidia_vram_mib() that
  silently falls back to a 5s time-based check when nvidia-smi is absent
  (Strix Halo / any non-NVIDIA box).
- dflash/src/flashprefill.cpp: keep only the cuda_runtime.h ->
  device_runtime.h header swap. Drop the per-device cudaMalloc scratch
  cache (CUDA-only perf opt; HIP path uses flash_prefill_forward_q8 so
  this code never runs on HIP).
- dflash/src/qwen3_0p6b_loader.cpp: drop the src-vs-dst type/byte
  defensive checks (unrelated to HIP).
- dflash/include/dflash27b.h: drop QWEN35_9B macro alt-config block
  (macro-guarded, no internal callers).
- dflash/test/test_flashprefill_kernels.cpp: fully revert to origin/main.
  Gate test build on NOT DFLASH27B_USE_HIP in CMakeLists; this test
  exercises BF16 WMMA kernels that the HIP path force-disables anyway.
- root .gitignore: drop dflash/build-hip and pflash/models entries.

Net diff vs origin/main: 28 -> 20 files, +2376/-613 -> +1734/-587.
Imports rocWMMA-native flashprefill kernels (mean / score-GEMM /
select / sparse-FA) from PR Luce-Org#117 behind DFLASH27B_HIP_SM80_EQUIV=ON.
Phase 1 (default) keeps the ggml q8 fallback unchanged.

On gfx1151 / ROCm 7.2 the FP-kernel speedup vs Phase 1 grows with seq
length: 2.2x@2K, 3.9x@8K, 5.0x@16K. End-to-end NIAH compress at
S=7270 goes 4.21s -> 2.70s (1.56x), accuracy 1/1 on both phases,
output byte-identical.

Bug fixed in the same commit: use_bf16_fp was hardcoded false on HIP
in qwen3_0p6b_graph.cpp, so the new kernels were linked but never
reached. Gate now keys off DFLASH27B_HAVE_FLASHPREFILL && MIN_SM>=80.

Also addresses cubic-dev-ai review on PR Luce-Org#119:
  - P0: graph-B reorder in qwen3_0p6b (gf_proj_add before reading
        h_after)
  - P1: restore prefix-cache impls in qwen35_target_graph; drop the
        stub prefix_cache.cpp
  - P1: cap->conv_input view sizing
  - P2: rename DFLASH27B_USER_CUDA_ARCHITECTURES -> _GPU_ for HIP
  - P2: n_vocab guard in gguf_target_loader
  - P2: free_drafter on prewarm fail; portable env restore via lambda
        (Windows _dupenv_s/_putenv_s); thread pool_kernel through
        qwen35_score_and_compress
  - P2: drop broken pflash/.gitignore
…alloc checks

- CMakeLists.txt: pflash_daemon and spike_thin_copy now link against
  ${_dflash27b_ggml_backend_lib} instead of hardcoded ggml-cuda, fixing
  HIP build link failures reported by weicj.
- gguf_target_loader.cpp: restore structural validation removed in
  prior cleanup — kl!=vl check, n_layer%fai divisibility, full
  rope_sections validation (presence, count, bounds vs head_dim),
  EOS metadata assignment, capture_layer_ids recomputation, and
  output.weight requirement in top-level tensor check.
- bsa_launcher_hip.cu: check hipMalloc return values; on failure,
  roll back partial allocations, reset kv_buf_cap, and return -1
  instead of silently proceeding with null buffers.
@smpurkis smpurkis force-pushed the strix-halo-rocm-compat branch from f285bff to 5d15549 on May 10, 2026 01:07
smpurkis added a commit to smpurkis/lucebox-hub that referenced this pull request May 10, 2026
@smpurkis smpurkis force-pushed the strix-halo-rocm-compat branch from 5d15549 to 73678fa on May 10, 2026 07:39
smpurkis and others added 3 commits May 10, 2026 14:00
- README.md: take upstream's GPU column + RTX 5090 row in supported models table
- pflash/tests/bench_niah_cpp.py: take upstream's simplified user prompt
- .gitmodules: repoint llama.cpp submodule from smpurkis/llama.cpp:master to
  Luce-Org/llama.cpp-dflash-ggml:luce-dflash (companion PR Luce-Org#8 merged 2026-05-08,
  HIP fixes are now upstream)
- dflash/deps/llama.cpp: advance to c79573c9b (includes HIP dflash aliases +
  TQ3_0 FA guard + turbo_wht + laguna arch)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…flicts

Resolve merge conflicts from upstream/main which refactored the CMake
build system from DFLASH27B_USE_HIP boolean to DFLASH27B_GPU_BACKEND
string option ("cuda"/"hip"). Key changes:

- Adopt main's DFLASH27B_GPU_BACKEND / DFLASH27B_GGML_BACKEND_TARGET
  naming throughout CMakeLists.txt
- Preserve our branch's HIP-specific additions: ROCm rpath, gfx1151
  default arch, rocWMMA Phase 2 flashprefill, BSA HIP launcher
- Update DFLASH27B_USE_HIP → DFLASH27B_BACKEND_HIP compile definition
  across all source files (device_runtime.h, qwen3_0p6b_graph.cpp,
  qwen3_drafter.cpp, flashprefill.cpp, flashprefill_kernels.cu)
- Take main's ggml-hip compile_definitions shim and hip_compat include
- Take main's cuda_cross_device_copy.cpp for CUDA builds

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The file was listed in _dflash27b_sources but never committed to
either branch. Caught by cmake configure during build verification.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>