
Commit 0e4fc06

Merge origin/main into feature/gemma4-support — post-Luce-Org#138/Luce-Org#119/Luce-Org#149 reorg
Brings in HIP/Strix Halo backend (PRs Luce-Org#119, Luce-Org#149), dflash source-layout reorg (Luce-Org#138 — qwen35/, draft/, qwen3/ subdirs), GGUF draft loader fixes, daemon ubatch defaults, prefix cache + streaming tool-call fixes.

Conflicts resolved:

- dflash/CMakeLists.txt: take main's reorganized source paths; keep our gemma4_*.cpp entries; preserve the DFLASH27B_MIN_SM backwards-compat shim so gemma4_dflash_graph.cpp:621 keeps building under main's renamed _dflash27b_cuda_min_sm variable.
- dflash/deps/llama.cpp: keep our submodule pointer (eb3676f40 on feature/tq3-kv-cache-clean). Main's c79573c9b lacks the TQ3 dispatcher fixes required for Gemma4 KV correctness; if useful upstream commits land there, they should be cherry-picked into our submodule branch separately.

Verified: TQ3 64K MTP gamma=2 pflash post-merge: decode 10.58 tok/s, prefill 463 tok/s, accept 0.78 — matches pre-merge baseline (10.25 / 445 / 0.78) within noise.
2 parents 80881ca + 9f47ab9 commit 0e4fc06

49 files changed

Lines changed: 4678 additions & 979 deletions


README.md

Lines changed: 28 additions & 0 deletions
````diff
@@ -11,6 +11,7 @@
 <p align="center">
   <a href="LICENSE"><img src="https://img.shields.io/badge/License-Apache_2.0-e8e8ed?style=for-the-badge&labelColor=090909" alt="Apache 2.0"></a>
   <a href="https://developer.nvidia.com/cuda-toolkit"><img src="https://img.shields.io/badge/CUDA-12%2B-76b900?style=for-the-badge&logo=nvidia&logoColor=76b900&labelColor=090909" alt="CUDA 12+"></a>
+  <a href="https://rocm.docs.amd.com/projects/HIP/en/latest/"><img src="https://img.shields.io/badge/HIP-7%2B-ed1c24?style=for-the-badge&logo=amd&logoColor=ed1c24&labelColor=090909" alt="HIP 7+"></a>
   <a href="https://isocpp.org"><img src="https://img.shields.io/badge/C%2B%2B-17-e8e8ed?style=for-the-badge&logo=cplusplus&logoColor=e8e8ed&labelColor=090909" alt="C++17"></a>
 </p>
@@ -49,6 +50,7 @@ All speedups measured vs vendored llama.cpp (`-fa 1`, matching KV quant).
 | RTX 3090 | Qwen 3.6-27B Q4_K_M (DFlash + PFlash) | **10.4×** @ 128K | **~** vs AR |
 | RTX 3090 | Laguna-XS.2 33B-A3B Q4_K_M (DFlash + PFlash) | **5.4×** @ 128K | AR (draft pending) |
 | RTX 5090 | Qwen 3.6-27B Q4_K_M (DFlash + DDTree) || **4.84×** vs AR (205 tok/s) |
+| Ryzen AI MAX+ 395 (gfx1151) | Qwen 3.5-27B Q4_K_M (DFlash + PFlash, HIP) | **2.24×** @ 16K | **3.08×** vs llama.cpp HIP AR (37 tok/s) |
 
 ## 01 · Megakernel Qwen3.5 0.8B on RTX 3090
 
@@ -232,6 +234,32 @@ DFLASH_FP_PROFILE=1 # log mean / score / select / forward stage timings
 
 ---
 
+## AMD Strix Halo (HIP backend)
+
+**Same DFlash + PFlash stack on an AMD iGPU.** PR #119 ports the Phase 2 rocWMMA flashprefill kernels to HIP. End-to-end on a single Ryzen AI MAX+ 395 box (Radeon 8060S iGPU, gfx1151, 128 GiB LPDDR5X-8000 unified): **37.0 tok/s** DFlash decode on Qwen3.5-27B Q4_K_M, **27.6 s** TTFT at 16K context with NIAH retrieval intact. That is **3.08×** decode and **2.24×** prefill over llama.cpp HIP AR on the same iGPU. End-to-end wall clock at a realistic 16K prompt + 1K generation workload: **2.66×** faster than vanilla llama.cpp.
+
+```bash
+git clone --recurse-submodules https://github.com/Luce-Org/lucebox-hub && cd lucebox-hub/dflash
+
+# Build for gfx1151 (Strix Halo). Swap the arch for gfx1100 / gfx1201.
+cmake -B build -S . \
+  -DCMAKE_BUILD_TYPE=Release \
+  -DDFLASH27B_GPU_BACKEND=hip \
+  -DDFLASH27B_HIP_ARCHITECTURES=gfx1151 \
+  -DDFLASH27B_HIP_SM80_EQUIV=ON
+cmake --build build --target test_dflash -j
+```
+
+`DFLASH27B_HIP_SM80_EQUIV=ON` enables the rocWMMA Phase 2 flashprefill kernels (the path that delivers the prefill speedup). `OFF` falls back to ggml's `flash_attn_ext` (slower, but no rocWMMA headers needed).
+
+**Per-arch DDTree tuning**: gfx1151 (Strix Halo iGPU, bandwidth-bound on LPDDR5X) peaks at `--ddtree-budget=22`. gfx1100 (7900 XTX, GDDR6) prefers `budget=8` per the [PR #156 cross-arch perf plan](https://github.com/Luce-Org/lucebox-hub/pull/156). Run `scripts/bench_he.py --ddtree-budget N` to verify on your card.
+
+**Drafter recipe for max decode**: target = Qwen3.5-27B Q4_K_M, drafter = same gen quantized to Q8_0 via `dflash/scripts/quantize_draft_q8.py`. The matching Q8_0 GGUF on the unsloth Qwen3.6 target needs `DFLASH27B_DRAFT_SWA=2048` for sliding-window correctness.
+
+[Blog post →](https://lucebox.com/blog/amd) · [PR #119](https://github.com/Luce-Org/lucebox-hub/pull/119) · [PR #156 cross-arch perf plan →](https://github.com/Luce-Org/lucebox-hub/pull/156)
+
+---
+
 ## Why this exists
 
 Local AI should be a default, not a privilege: private data, no per-token bill, no vendor lock-in. The hardware to run capable models already sits on desks. The software to run those chips well doesn't.
````

dflash/CMakeLists.txt

Lines changed: 159 additions & 63 deletions

dflash/docs/SPEC_PREFILL.md

Lines changed: 19 additions & 7 deletions
```diff
@@ -7,7 +7,7 @@ This doc is the build / runtime / tunables reference for the C++ daemon
 path described in [`pflash/README.md`](../../pflash/README.md) and on the
 [blog post](https://lucebox.com/blog/pflash):
 
-- **Drafter** (Qwen3-0.6B) loaded via a custom forward (`qwen3_0p6b_*`)
+- **Drafter** (Qwen3-0.6B) loaded via a custom forward (`qwen3_*`)
   with the FlashPrefill block-sparse attention kernel for long-context
   scoring.
 - **Target** (Qwen3.6-27B Q4_K_M) loaded directly via ggml.
@@ -81,17 +81,29 @@ src/
   flashprefill_select.cpp        Host fallback for block_select (rarely used)
   bsa_launcher.cu                BSA launcher: blockmask conversion + Flash_fwd_params
   bsa_fwd_inst.cu                Single-TU instantiation of BSA's hdim128 kernel
-  qwen3_0p6b_loader.cpp          GGUF → Qwen3-0.6B BF16 weight tensors
-  qwen3_0p6b_graph.cpp           Custom Qwen3-0.6B forward (per-layer A/FP/B graphs)
-  qwen3_drafter.{h,cpp}          drafter_score_and_compress() entry point
-  qwen35_target_graph.cpp        Qwen3.5/3.6 target graph (ggml)
-  qwen3_dflash_graph.cpp         DFlash speculative draft head
+  qwen3/                         Qwen3-0.6B drafter model code
+    qwen3_loader.cpp             GGUF → Qwen3-0.6B BF16 weight tensors
+    qwen3_graph.cpp              Custom Qwen3-0.6B forward (per-layer A/FP/B graphs)
+    qwen3_drafter.{h,cpp}        drafter_score_and_compress() entry point
+  qwen35/                        Qwen3.5/3.6 target + DFlash draft model code
+    qwen35_target_graph.cpp      Qwen3.5/3.6 target graph (ggml)
+    gguf_target_loader.cpp       Qwen3.5 target GGUF loader
+  draft/                         Special DFlash draft model code
+    draft_dflash_graph.cpp       DFlash speculative draft head
+    draft_gguf_loader.cpp        Draft GGUF loader
+    draft_safetensors_loader.cpp Draft safetensors loader
+  laguna/                        Laguna target + daemon model code
+    laguna_target_loader.cpp     Laguna GGUF loader
+    laguna_target_graph.cpp      Laguna forward graph
+    laguna_daemon.{h,cpp}        Laguna daemon protocol/runtime
+  common/                        Shared runtime helpers
+    sampler.{h,cpp}              Shared CPU sampler chain
   kv_cache.cpp / kv_quant.cpp    Q4_0 KV cache + asymmetric quant
 test/
   test_dflash.cpp                daemon executable; supports
                                  `compress / generate / park / unpark / free drafter`
   test_flashprefill_kernels.cpp  parity tests for the 4 FP kernels
-  smoke_qwen3_0p6b_forward.cpp   drafter forward smoke at S=8K-128K
+  smoke_qwen3_forward.cpp        drafter forward smoke at S=8K-128K
 deps/
   llama.cpp/                     submodule (ggml only; libllama not built)
   Block-Sparse-Attention/        submodule (BSA + cutlass)
```

dflash/hip_compat/cuda_bf16.h

Lines changed: 66 additions & 0 deletions
```cpp
// HIP compatibility shim for <cuda_bf16.h>
#pragma once

// cuda_runtime.h (our compat) must be included first to ensure __HIP_PLATFORM_AMD__
// is set before hip_bfloat16.h is parsed. If included in isolation, set it now.
#if !defined(__HIP_PLATFORM_AMD__) && !defined(__HIP_PLATFORM_NVIDIA__)
#  define __HIP_PLATFORM_AMD__
#endif

#include <hip/hip_bfloat16.h>

#include <cstdint>  // uint16_t / uint32_t for raw bit reinterpretation
#include <cstring>  // memcpy for raw bit reinterpretation on host

// Type alias: CUDA __nv_bfloat16 → AMD hip_bfloat16
using __nv_bfloat16 = hip_bfloat16;

// hip_bfloat162 does not exist in all ROCm versions; skip the alias.
// Tests and source code that reference __nv_bfloat162 will need guarding.

// Conversion intrinsics.
//
// When compiled by hipcc, hip_bfloat16's constructor and operator float() are
// __host__ __device__. When compiled by g++ (plain CXX sources), __HOST_DEVICE__
// collapses to __device__, making them unavailable on the host.
//
// Provide host-side helpers via raw bit manipulation so that test code and
// pure-CXX source files can use these conversions without the device compiler.

#ifdef __HIPCC__
// hipcc path: use the type's own constructors / conversions
__device__ __host__ inline float __bfloat162float(hip_bfloat16 x) {
    return static_cast<float>(x);
}
__device__ __host__ inline hip_bfloat16 __float2bfloat16(float x) {
    return hip_bfloat16(x);
}
__device__ __host__ inline hip_bfloat16 __float2bfloat16_rn(float x) {
    return hip_bfloat16(x);
}
#else
// g++ / plain CXX path: bit-cast approach, no device attributes
namespace __hip_bf16_compat_detail {
// Truncating float→bf16: drop the lower 16 mantissa bits.
inline uint16_t float_to_bf16_bits(float f) {
    uint32_t u;
    std::memcpy(&u, &f, sizeof(u));
    return static_cast<uint16_t>(u >> 16);
}
inline float bf16_bits_to_float(uint16_t b) {
    uint32_t u = static_cast<uint32_t>(b) << 16;
    float f;
    std::memcpy(&f, &u, sizeof(f));
    return f;
}
}  // namespace __hip_bf16_compat_detail

inline float __bfloat162float(hip_bfloat16 x) {
    return __hip_bf16_compat_detail::bf16_bits_to_float(x.data);
}
inline hip_bfloat16 __float2bfloat16(float x) {
    hip_bfloat16 r;
    r.data = __hip_bf16_compat_detail::float_to_bf16_bits(x);
    return r;
}
inline hip_bfloat16 __float2bfloat16_rn(float x) {
    return __float2bfloat16(x);
}
#endif
```

dflash/hip_compat/cuda_fp16.h

Lines changed: 6 additions & 0 deletions
```cpp
// HIP compatibility shim for <cuda_fp16.h>
#pragma once
#include <hip/hip_fp16.h>

// __half is the same name in HIP — no alias needed.
// Intrinsics like __half2float, __float2half, __hadd, etc. are available directly.
```

dflash/hip_compat/cuda_runtime.h

Lines changed: 92 additions & 0 deletions
```cpp
// HIP compatibility shim: maps <cuda_runtime.h> to HIP equivalents.
// Included transparently when building with -I hip_compat on ROCm.
#pragma once

// hip/hip_runtime.h requires exactly one of __HIP_PLATFORM_AMD__ or
// __HIP_PLATFORM_NVIDIA__ to be defined. hipcc sets it automatically;
// g++ (used for plain CXX sources in the dflash build) does not.
#if !defined(__HIP_PLATFORM_AMD__) && !defined(__HIP_PLATFORM_NVIDIA__)
#  define __HIP_PLATFORM_AMD__
#endif

#include <hip/hip_runtime.h>
#include <hip/hip_runtime_api.h>

// Type aliases
using cudaStream_t   = hipStream_t;
using cudaEvent_t    = hipEvent_t;
using cudaError_t    = hipError_t;
using cudaMemcpyKind = hipMemcpyKind;
using cudaDeviceProp = hipDeviceProp_t;

// Memcpy kind constants
#define cudaMemcpyHostToHost     hipMemcpyHostToHost
#define cudaMemcpyHostToDevice   hipMemcpyHostToDevice
#define cudaMemcpyDeviceToHost   hipMemcpyDeviceToHost
#define cudaMemcpyDeviceToDevice hipMemcpyDeviceToDevice
#define cudaMemcpyDefault        hipMemcpyDefault

// Error codes
#define cudaSuccess           hipSuccess
#define cudaErrorInvalidValue hipErrorInvalidValue

// Memory functions
#define cudaMalloc          hipMalloc
#define cudaMallocHost      hipHostMalloc
#define cudaFree            hipFree
#define cudaFreeHost        hipHostFree
#define cudaMemcpy          hipMemcpy
#define cudaMemcpyAsync     hipMemcpyAsync
#define cudaMemcpy2DAsync   hipMemcpy2DAsync
#define cudaMemcpyPeerAsync hipMemcpyPeerAsync
#define cudaMemset          hipMemset
#define cudaMemsetAsync     hipMemsetAsync

// Stream functions
#define cudaStreamCreate      hipStreamCreate
#define cudaStreamDestroy     hipStreamDestroy
#define cudaStreamSynchronize hipStreamSynchronize
#define cudaStreamDefault     hipStreamDefault
#define cudaStreamNonBlocking hipStreamNonBlocking

// Device functions
#define cudaGetDevice           hipGetDevice
#define cudaSetDevice           hipSetDevice
#define cudaDeviceSynchronize   hipDeviceSynchronize
#define cudaGetDeviceProperties hipGetDeviceProperties
#define cudaDeviceReset         hipDeviceReset

// Event functions
#define cudaEventCreate          hipEventCreate
#define cudaEventDestroy         hipEventDestroy
#define cudaEventRecord          hipEventRecord
#define cudaEventSynchronize     hipEventSynchronize
#define cudaEventElapsedTime     hipEventElapsedTime
#define cudaEventCreateWithFlags hipEventCreateWithFlags
#define cudaEventDisableTiming   hipEventDisableTiming

// Kernel attribute
#define cudaFuncSetAttribute hipFuncSetAttribute
#define cudaFuncAttributeMaxDynamicSharedMemorySize hipFuncAttributeMaxDynamicSharedMemorySize

// Error checking
#define cudaGetLastError   hipGetLastError
#define cudaGetErrorString hipGetErrorString

// Launch bounds
#define __launch_bounds__ __launch_bounds__

// Stream capture status (added CUDA 10.0 — ROCm compat headers may omit this)
#define cudaStreamCaptureStatus            hipStreamCaptureStatus
#define cudaStreamCaptureStatusNone        hipStreamCaptureStatusNone
#define cudaStreamCaptureStatusActive      hipStreamCaptureStatusActive
#define cudaStreamCaptureStatusInvalidated hipStreamCaptureStatusInvalidated
#define cudaStreamIsCapturing              hipStreamIsCapturing

// Peer device access
#define cudaDeviceCanAccessPeer           hipDeviceCanAccessPeer
#define cudaDeviceEnablePeerAccess        hipDeviceEnablePeerAccess
#define cudaErrorPeerAccessAlreadyEnabled hipErrorPeerAccessAlreadyEnabled

// Device count
#define cudaGetDeviceCount hipGetDeviceCount
```
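
The shim only takes effect if the build prepends `hip_compat/` to the include path ahead of any real CUDA headers. A hypothetical CMake sketch of that wiring (the `test_dflash` target and `DFLASH27B_GPU_BACKEND` option appear in this commit; the exact conditional is illustrative, not copied from dflash/CMakeLists.txt):

```cmake
# Illustrative: when the HIP backend is selected, expose the shim headers so
# existing CUDA-flavored sources (#include <cuda_runtime.h>) compile unchanged
# under hipcc and plain g++.
if(DFLASH27B_GPU_BACKEND STREQUAL "hip")
  target_include_directories(test_dflash BEFORE PRIVATE
      ${CMAKE_CURRENT_SOURCE_DIR}/hip_compat)
  # Plain CXX sources are compiled by g++, which does not set the HIP
  # platform macro itself; the shim's #if guard covers that case too.
  target_compile_definitions(test_dflash PRIVATE __HIP_PLATFORM_AMD__)
endif()
```

Because the mapping is done with `#define` rather than wrapper functions, it costs nothing at runtime and keeps error-code comparisons (`cudaSuccess`, etc.) working as integer compares against the HIP enums.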

dflash/hip_compat/mma.h

Lines changed: 15 additions & 0 deletions
```cpp
// HIP compatibility shim for <mma.h> (NVIDIA WMMA).
//
// Phase 1: empty — flashprefill_kernels.cu is excluded from the Phase 1 build
// (DFLASH27B_HAVE_FLASHPREFILL not defined), so this file is never reached.
//
// Phase 2: replace nvcuda::wmma with rocwmma. Add:
//   #include <rocwmma/rocwmma.hpp>
//   namespace nvcuda { namespace wmma = rocwmma; }  // approximate alias
// Then fix the accumulator fragment register layout in sparse_flash_forward_kernel_bf16
// (lines 408-443 of flashprefill_kernels.cu) to match AMD's m16n16k16 layout.
//
// NOTE: a namespace alias is not sufficient — the fragment register layouts differ
// between NVIDIA sm_80 and AMD gfx1151. The manual row/col extraction code in
// kernel 4 must be rewritten per the rocWMMA accumulator layout docs.
#pragma once
```

dflash/include/dflash27b.h

Lines changed: 2 additions & 2 deletions
```diff
@@ -23,8 +23,8 @@ extern "C" {
 // dimensions (z-lab draft: 32 Q heads, 8 KV heads, 128 head_dim). The TARGET
 // Qwen3.5-27B qwen35 hybrid uses 24 Q heads, 4 KV heads, 256 head_dim, which
 // live in `src/internal.h` (n_embd_head_k/v, N_HEAD, N_HEAD_KV). Naming is
-// historical — do not change without updating safetensors_draft.cpp +
-// qwen3_dflash_graph.cpp which consume these as draft-side constants.
+// historical — do not change without updating draft_safetensors_loader.cpp +
+// draft_dflash_graph.cpp which consume these as draft-side constants.
 #define DFLASH27B_TARGET_N_HEADS 32
 #define DFLASH27B_TARGET_N_KV_HEADS 8
 #define DFLASH27B_TARGET_HEAD_DIM 128
```
