
Commit 0e4fc06

Merge origin/main into feature/gemma4-support — post-Luce-Org#138/Luce-Org#119/Luce-Org#149 reorg
Brings in HIP/Strix Halo backend (PRs Luce-Org#119, Luce-Org#149), dflash source-layout reorg (Luce-Org#138 — qwen35/, draft/, qwen3/ subdirs), GGUF draft loader fixes, daemon ubatch defaults, prefix cache + streaming tool-call fixes.

Conflicts resolved:

- dflash/CMakeLists.txt: take main's reorganized source paths; keep our gemma4_*.cpp entries; preserve the DFLASH27B_MIN_SM backwards-compat shim so gemma4_dflash_graph.cpp:621 keeps building under main's renamed _dflash27b_cuda_min_sm variable.
- dflash/deps/llama.cpp: keep our submodule pointer (eb3676f40 on feature/tq3-kv-cache-clean). Main's c79573c9b lacks the TQ3 dispatcher fixes required for Gemma4 KV correctness; if useful upstream commits land there, they should be cherry-picked into our submodule branch separately.

Verified: TQ3 64K MTP gamma=2 pflash post-merge: decode 10.58 tok/s, prefill 463 tok/s, accept 0.78 — matches pre-merge baseline (10.25 / 445 / 0.78) within noise.
2 parents 80881ca + 9f47ab9 commit 0e4fc06

49 files changed

Lines changed: 4678 additions & 979 deletions


README.md

Lines changed: 28 additions & 0 deletions
````diff
@@ -11,6 +11,7 @@
 <p align="center">
   <a href="LICENSE"><img src="https://img.shields.io/badge/License-Apache_2.0-e8e8ed?style=for-the-badge&labelColor=090909" alt="Apache 2.0"></a>
   <a href="https://developer.nvidia.com/cuda-toolkit"><img src="https://img.shields.io/badge/CUDA-12%2B-76b900?style=for-the-badge&logo=nvidia&logoColor=76b900&labelColor=090909" alt="CUDA 12+"></a>
+  <a href="https://rocm.docs.amd.com/projects/HIP/en/latest/"><img src="https://img.shields.io/badge/HIP-7%2B-ed1c24?style=for-the-badge&logo=amd&logoColor=ed1c24&labelColor=090909" alt="HIP 7+"></a>
   <a href="https://isocpp.org"><img src="https://img.shields.io/badge/C%2B%2B-17-e8e8ed?style=for-the-badge&logo=cplusplus&logoColor=e8e8ed&labelColor=090909" alt="C++17"></a>
 </p>
@@ -49,6 +50,7 @@ All speedups measured vs vendored llama.cpp (`-fa 1`, matching KV quant).
 | RTX 3090 | Qwen 3.6-27B Q4_K_M (DFlash + PFlash) | **10.4×** @ 128K | **~** vs AR |
 | RTX 3090 | Laguna-XS.2 33B-A3B Q4_K_M (DFlash + PFlash) | **5.4×** @ 128K | AR (draft pending) |
 | RTX 5090 | Qwen 3.6-27B Q4_K_M (DFlash + DDTree) || **4.84×** vs AR (205 tok/s) |
+| Ryzen AI MAX+ 395 (gfx1151) | Qwen 3.5-27B Q4_K_M (DFlash + PFlash, HIP) | **2.24×** @ 16K | **3.08×** vs llama.cpp HIP AR (37 tok/s) |
 
 ## 01 · Megakernel Qwen3.5 0.8B on RTX 3090
 
@@ -232,6 +234,32 @@ DFLASH_FP_PROFILE=1 # log mean / score / select / forward stage timings
 
 ---
 
+## AMD Strix Halo (HIP backend)
+
+**Same DFlash + PFlash stack on an AMD iGPU.** PR #119 ports the Phase 2 rocWMMA flashprefill kernels to HIP. End-to-end on a single Ryzen AI MAX+ 395 box (Radeon 8060S iGPU, gfx1151, 128 GiB LPDDR5X-8000 unified): **37.0 tok/s** DFlash decode on Qwen3.5-27B Q4_K_M, **27.6 s** TTFT at 16K context with NIAH retrieval intact. That is **3.08×** decode and **2.24×** prefill over llama.cpp HIP AR on the same iGPU. End-to-end wall clock at a realistic 16K prompt + 1K generation workload: **2.66×** faster than vanilla llama.cpp.
+
+```bash
+git clone --recurse-submodules https://github.com/Luce-Org/lucebox-hub && cd lucebox-hub/dflash
+
+# Build for gfx1151 (Strix Halo). Swap the arch for gfx1100 / gfx1201.
+cmake -B build -S . \
+  -DCMAKE_BUILD_TYPE=Release \
+  -DDFLASH27B_GPU_BACKEND=hip \
+  -DDFLASH27B_HIP_ARCHITECTURES=gfx1151 \
+  -DDFLASH27B_HIP_SM80_EQUIV=ON
+cmake --build build --target test_dflash -j
+```
+
+`DFLASH27B_HIP_SM80_EQUIV=ON` enables the rocWMMA Phase 2 flashprefill kernels (the path that delivers the prefill speedup). `OFF` falls back to ggml's `flash_attn_ext` (slower, but no rocWMMA headers needed).
+
+**Per-arch DDTree tuning**: gfx1151 (Strix Halo iGPU, bandwidth-bound on LPDDR5X) peaks at `--ddtree-budget=22`. gfx1100 (7900 XTX, GDDR6) prefers `budget=8` per the [PR #156 cross-arch perf plan](https://github.com/Luce-Org/lucebox-hub/pull/156). Run `scripts/bench_he.py --ddtree-budget N` to verify on your card.
+
+**Drafter recipe for max decode**: target = Qwen3.5-27B Q4_K_M, drafter = same gen quantized to Q8_0 via `dflash/scripts/quantize_draft_q8.py`. The matching Q8_0 GGUF on the unsloth Qwen3.6 target needs `DFLASH27B_DRAFT_SWA=2048` for sliding-window correctness.
+
+[Blog post →](https://lucebox.com/blog/amd) · [PR #119](https://github.com/Luce-Org/lucebox-hub/pull/119) · [PR #156 cross-arch perf plan →](https://github.com/Luce-Org/lucebox-hub/pull/156)
+
+---
+
 ## Why this exists
 
 Local AI should be a default, not a privilege: private data, no per-token bill, no vendor lock-in. The hardware to run capable models already sits on desks. The software to run those chips well doesn't.
````

dflash/CMakeLists.txt

Lines changed: 159 additions & 63 deletions

dflash/docs/SPEC_PREFILL.md

Lines changed: 19 additions & 7 deletions
```diff
@@ -7,7 +7,7 @@ This doc is the build / runtime / tunables reference for the C++ daemon
 path described in [`pflash/README.md`](../../pflash/README.md) and on the
 [blog post](https://lucebox.com/blog/pflash):
 
-- **Drafter** (Qwen3-0.6B) loaded via a custom forward (`qwen3_0p6b_*`)
+- **Drafter** (Qwen3-0.6B) loaded via a custom forward (`qwen3_*`)
   with the FlashPrefill block-sparse attention kernel for long-context
   scoring.
 - **Target** (Qwen3.6-27B Q4_K_M) loaded directly via ggml.
@@ -81,17 +81,29 @@ src/
   flashprefill_select.cpp        Host fallback for block_select (rarely used)
   bsa_launcher.cu                BSA launcher: blockmask conversion + Flash_fwd_params
   bsa_fwd_inst.cu                Single-TU instantiation of BSA's hdim128 kernel
-  qwen3_0p6b_loader.cpp          GGUF → Qwen3-0.6B BF16 weight tensors
-  qwen3_0p6b_graph.cpp           Custom Qwen3-0.6B forward (per-layer A/FP/B graphs)
-  qwen3_drafter.{h,cpp}          drafter_score_and_compress() entry point
-  qwen35_target_graph.cpp        Qwen3.5/3.6 target graph (ggml)
-  qwen3_dflash_graph.cpp         DFlash speculative draft head
+  qwen3/                         Qwen3-0.6B drafter model code
+    qwen3_loader.cpp             GGUF → Qwen3-0.6B BF16 weight tensors
+    qwen3_graph.cpp              Custom Qwen3-0.6B forward (per-layer A/FP/B graphs)
+    qwen3_drafter.{h,cpp}        drafter_score_and_compress() entry point
+  qwen35/                        Qwen3.5/3.6 target + DFlash draft model code
+    qwen35_target_graph.cpp      Qwen3.5/3.6 target graph (ggml)
+    gguf_target_loader.cpp       Qwen3.5 target GGUF loader
+  draft/                         Special DFlash draft model code
+    draft_dflash_graph.cpp       DFlash speculative draft head
+    draft_gguf_loader.cpp        Draft GGUF loader
+    draft_safetensors_loader.cpp Draft safetensors loader
+  laguna/                        Laguna target + daemon model code
+    laguna_target_loader.cpp     Laguna GGUF loader
+    laguna_target_graph.cpp      Laguna forward graph
+    laguna_daemon.{h,cpp}        Laguna daemon protocol/runtime
+  common/                        Shared runtime helpers
+    sampler.{h,cpp}              Shared CPU sampler chain
   kv_cache.cpp / kv_quant.cpp    Q4_0 KV cache + asymmetric quant
 test/
   test_dflash.cpp                daemon executable; supports
                                  `compress / generate / park / unpark / free drafter`
   test_flashprefill_kernels.cpp  parity tests for the 4 FP kernels
-  smoke_qwen3_0p6b_forward.cpp   drafter forward smoke at S=8K-128K
+  smoke_qwen3_forward.cpp        drafter forward smoke at S=8K-128K
 deps/
   llama.cpp/                     submodule (ggml only; libllama not built)
   Block-Sparse-Attention/        submodule (BSA + cutlass)
```

dflash/hip_compat/cuda_bf16.h

Lines changed: 66 additions & 0 deletions
```cpp
// HIP compatibility shim for <cuda_bf16.h>
#pragma once

// cuda_runtime.h (our compat) must be included first to ensure __HIP_PLATFORM_AMD__
// is set before hip_bfloat16.h is parsed. If included in isolation, set it now.
#if !defined(__HIP_PLATFORM_AMD__) && !defined(__HIP_PLATFORM_NVIDIA__)
#  define __HIP_PLATFORM_AMD__
#endif

#include <hip/hip_bfloat16.h>

#include <cstdint>  // uint16_t / uint32_t for raw bit reinterpretation
#include <cstring>  // memcpy for raw bit reinterpretation on host

// Type alias: CUDA __nv_bfloat16 → AMD hip_bfloat16
using __nv_bfloat16 = hip_bfloat16;

// hip_bfloat162 does not exist in all ROCm versions; skip the alias.
// Tests and source code that reference __nv_bfloat162 will need guarding.

// Conversion intrinsics.
//
// When compiled by hipcc, hip_bfloat16's constructor and operator float() are
// __host__ __device__. When compiled by g++ (plain CXX sources), __HOST_DEVICE__
// collapses to __device__, making them unavailable on the host.
//
// Provide host-side helpers via raw bit manipulation so that test code and
// pure-CXX source files can use these conversions without the device compiler.

#ifdef __HIPCC__
// hipcc path: use the type's own constructors / conversions
__device__ __host__ inline float __bfloat162float(hip_bfloat16 x) {
    return static_cast<float>(x);
}
__device__ __host__ inline hip_bfloat16 __float2bfloat16(float x) {
    return hip_bfloat16(x);
}
__device__ __host__ inline hip_bfloat16 __float2bfloat16_rn(float x) {
    return hip_bfloat16(x);
}
#else
// g++ / plain CXX path: bit-cast approach, no device attributes
namespace __hip_bf16_compat_detail {
// Truncating float→bf16: drop the lower 16 mantissa bits.
inline uint16_t float_to_bf16_bits(float f) {
    uint32_t u;
    std::memcpy(&u, &f, sizeof(u));
    return static_cast<uint16_t>(u >> 16);
}
inline float bf16_bits_to_float(uint16_t b) {
    uint32_t u = static_cast<uint32_t>(b) << 16;
    float f;
    std::memcpy(&f, &u, sizeof(f));
    return f;
}
}  // namespace __hip_bf16_compat_detail

inline float __bfloat162float(hip_bfloat16 x) {
    return __hip_bf16_compat_detail::bf16_bits_to_float(x.data);
}
inline hip_bfloat16 __float2bfloat16(float x) {
    hip_bfloat16 r;
    r.data = __hip_bf16_compat_detail::float_to_bf16_bits(x);
    return r;
}
inline hip_bfloat16 __float2bfloat16_rn(float x) {
    return __float2bfloat16(x);
}
#endif
```

dflash/hip_compat/cuda_fp16.h

Lines changed: 6 additions & 0 deletions
```cpp
// HIP compatibility shim for <cuda_fp16.h>
#pragma once
#include <hip/hip_fp16.h>

// __half is the same name in HIP — no alias needed.
// Intrinsics like __half2float, __float2half, __hadd, etc. are available directly.
```

dflash/hip_compat/cuda_runtime.h

Lines changed: 92 additions & 0 deletions
```cpp
// HIP compatibility shim: maps <cuda_runtime.h> to HIP equivalents.
// Included transparently when building with -I hip_compat on ROCm.
#pragma once

// hip/hip_runtime.h requires exactly one of __HIP_PLATFORM_AMD__ or
// __HIP_PLATFORM_NVIDIA__ to be defined. hipcc sets it automatically;
// g++ (used for plain CXX sources in the dflash build) does not.
#if !defined(__HIP_PLATFORM_AMD__) && !defined(__HIP_PLATFORM_NVIDIA__)
#  define __HIP_PLATFORM_AMD__
#endif

#include <hip/hip_runtime.h>
#include <hip/hip_runtime_api.h>

// Type aliases
using cudaStream_t   = hipStream_t;
using cudaEvent_t    = hipEvent_t;
using cudaError_t    = hipError_t;
using cudaMemcpyKind = hipMemcpyKind;
using cudaDeviceProp = hipDeviceProp_t;

// Memcpy kind constants
#define cudaMemcpyHostToHost     hipMemcpyHostToHost
#define cudaMemcpyHostToDevice   hipMemcpyHostToDevice
#define cudaMemcpyDeviceToHost   hipMemcpyDeviceToHost
#define cudaMemcpyDeviceToDevice hipMemcpyDeviceToDevice
#define cudaMemcpyDefault        hipMemcpyDefault

// Error codes
#define cudaSuccess           hipSuccess
#define cudaErrorInvalidValue hipErrorInvalidValue

// Memory functions
#define cudaMalloc          hipMalloc
#define cudaMallocHost      hipHostMalloc
#define cudaFree            hipFree
#define cudaFreeHost        hipHostFree
#define cudaMemcpy          hipMemcpy
#define cudaMemcpyAsync     hipMemcpyAsync
#define cudaMemcpy2DAsync   hipMemcpy2DAsync
#define cudaMemcpyPeerAsync hipMemcpyPeerAsync
#define cudaMemset          hipMemset
#define cudaMemsetAsync     hipMemsetAsync

// Stream functions
#define cudaStreamCreate      hipStreamCreate
#define cudaStreamDestroy     hipStreamDestroy
#define cudaStreamSynchronize hipStreamSynchronize
#define cudaStreamDefault     hipStreamDefault
#define cudaStreamNonBlocking hipStreamNonBlocking

// Device functions
#define cudaGetDevice           hipGetDevice
#define cudaSetDevice           hipSetDevice
#define cudaDeviceSynchronize   hipDeviceSynchronize
#define cudaGetDeviceProperties hipGetDeviceProperties
#define cudaDeviceReset         hipDeviceReset

// Event functions
#define cudaEventCreate          hipEventCreate
#define cudaEventDestroy         hipEventDestroy
#define cudaEventRecord          hipEventRecord
#define cudaEventSynchronize     hipEventSynchronize
#define cudaEventElapsedTime     hipEventElapsedTime
#define cudaEventCreateWithFlags hipEventCreateWithFlags
#define cudaEventDisableTiming   hipEventDisableTiming

// Kernel attribute
#define cudaFuncSetAttribute hipFuncSetAttribute
#define cudaFuncAttributeMaxDynamicSharedMemorySize hipFuncAttributeMaxDynamicSharedMemorySize

// Error checking
#define cudaGetLastError   hipGetLastError
#define cudaGetErrorString hipGetErrorString

// Launch bounds
#define __launch_bounds__ __launch_bounds__

// Stream capture status (added CUDA 10.0 — ROCm compat headers may omit this)
#define cudaStreamCaptureStatus            hipStreamCaptureStatus
#define cudaStreamCaptureStatusNone        hipStreamCaptureStatusNone
#define cudaStreamCaptureStatusActive      hipStreamCaptureStatusActive
#define cudaStreamCaptureStatusInvalidated hipStreamCaptureStatusInvalidated
#define cudaStreamIsCapturing              hipStreamIsCapturing

// Peer device access
#define cudaDeviceCanAccessPeer           hipDeviceCanAccessPeer
#define cudaDeviceEnablePeerAccess        hipDeviceEnablePeerAccess
#define cudaErrorPeerAccessAlreadyEnabled hipErrorPeerAccessAlreadyEnabled

// Device count
#define cudaGetDeviceCount hipGetDeviceCount
```
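
The shim only takes effect if the build prepends `hip_compat/` to the include path ahead of any real CUDA headers. A hypothetical CMake sketch of that wiring (the `test_dflash` target and `DFLASH27B_GPU_BACKEND` option appear in this commit; the exact conditional is illustrative, not copied from dflash/CMakeLists.txt):

```cmake
# Illustrative: when the HIP backend is selected, expose the shim headers so
# existing CUDA-flavored sources (#include <cuda_runtime.h>) compile unchanged
# under hipcc and plain g++.
if(DFLASH27B_GPU_BACKEND STREQUAL "hip")
  target_include_directories(test_dflash BEFORE PRIVATE
      ${CMAKE_CURRENT_SOURCE_DIR}/hip_compat)
  # Plain CXX sources are compiled by g++, which does not set the HIP
  # platform macro itself; the shim's #if guard covers that case too.
  target_compile_definitions(test_dflash PRIVATE __HIP_PLATFORM_AMD__)
endif()
```

Because the mapping is done with `#define` rather than wrapper functions, it costs nothing at runtime and keeps error-code comparisons (`cudaSuccess`, etc.) working as integer compares against the HIP enums.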

dflash/hip_compat/mma.h

Lines changed: 15 additions & 0 deletions
```cpp
// HIP compatibility shim for <mma.h> (NVIDIA WMMA).
//
// Phase 1: empty — flashprefill_kernels.cu is excluded from the Phase 1 build
// (DFLASH27B_HAVE_FLASHPREFILL not defined), so this file is never reached.
//
// Phase 2: replace nvcuda::wmma with rocwmma. Add:
//   #include <rocwmma/rocwmma.hpp>
//   namespace nvcuda { namespace wmma = rocwmma; }  // approximate alias
// Then fix the accumulator fragment register layout in sparse_flash_forward_kernel_bf16
// (lines 408-443 of flashprefill_kernels.cu) to match AMD's m16n16k16 layout.
//
// NOTE: a namespace alias is not sufficient — the fragment register layouts differ
// between NVIDIA sm_80 and AMD gfx1151. The manual row/col extraction code in
// kernel 4 must be rewritten per the rocWMMA accumulator layout docs.
#pragma once
```

dflash/include/dflash27b.h

Lines changed: 2 additions & 2 deletions
```diff
@@ -23,8 +23,8 @@ extern "C" {
 // dimensions (z-lab draft: 32 Q heads, 8 KV heads, 128 head_dim). The TARGET
 // Qwen3.5-27B qwen35 hybrid uses 24 Q heads, 4 KV heads, 256 head_dim, which
 // live in `src/internal.h` (n_embd_head_k/v, N_HEAD, N_HEAD_KV). Naming is
-// historical — do not change without updating safetensors_draft.cpp +
-// qwen3_dflash_graph.cpp which consume these as draft-side constants.
+// historical — do not change without updating draft_safetensors_loader.cpp +
+// draft_dflash_graph.cpp which consume these as draft-side constants.
 #define DFLASH27B_TARGET_N_HEADS 32
 #define DFLASH27B_TARGET_N_KV_HEADS 8
 #define DFLASH27B_TARGET_HEAD_DIM 128
```
