Gemma4 support: pFlash + DFlash + chunked prefill, daemon mode, server routing #131
dusterbloom wants to merge 60 commits into
Conversation
Full implementation of the Gemma4 architecture for lucebox-hub DFlash.

Target model (GGUF loader + forward pass graph builder):
- Per-layer head_count_kv array (8 for SWA, 2 for full-attention)
- Dual head_dim: 256 (SWA) / 512 (full-attention) with correct cache sizing
- V=K sharing on full-attention layers (attention_k_eq_v)
- MoE FFN: 128 experts, top-8 routing with shared expert + softmax gating
- Sliding window attention pattern from BOOL GGUF array
- Proportional RoPE (p-RoPE) with per-layer freq_factors
- Embedding scaled by sqrt(hidden_size) per HF reference
- CUDA FA 256-alignment for head_dim>=512 (FATTN_KQ_STRIDE)
- TurboQuant TQ3_0 KV cache with 256-byte alignment padding
- Logit softcapping: 30 * tanh(logits / 30)

Draft model (safetensors loader + forward pass):
- 5-layer transformer with SwiGLU FFN
- FC projection: 6 * target_hidden -> draft_hidden
- Tied LM head using target tok_embd
- Block-diffusion speculative decoding architecture
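The logit softcapping formula above is simple enough to check numerically. A minimal sketch in pure Python (names illustrative; the real implementation runs on GPU tensors inside the graph builder):

```python
import math

def softcap_logits(logits, cap=30.0):
    """Soft-cap logits into (-cap, cap): cap * tanh(x / cap).

    cap=30 is the value quoted in the commit message. Near zero the
    mapping is ~identity; large magnitudes saturate smoothly instead
    of being hard-clipped.
    """
    return [cap * math.tanh(x / cap) for x in logits]

capped = softcap_logits([0.0, 1.0, 100.0, -500.0])
# every output lies strictly inside (-30, 30)
assert all(-30.0 < v < 30.0 for v in capped)
```

This is also why the smoke test below can assert logits in [-30, 30]: the cap bounds the output range by construction.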
5 smoke tests validating the Gemma4 implementation:
- smoke_load_gemma4_target: GGUF metadata, per-layer head_kv, SWA pattern
- smoke_gemma4_target_forward: full 26B-A4B forward pass, logits in [-30, 30]
- smoke_load_gemma4_draft: safetensors loading, fc/layer shape validation
- smoke_gemma4_draft_forward: draft forward with injected target tok_embd
- test_gemma4_kv_tq3: TQ3 cache 256-alignment, shared layer donors

Plus a test_gemma4_dflash driver for combined target+draft benchmarking.
The evenly-spaced formula produced wrong IDs for both Gemma4 variants.
Use the actual values from the z-lab DFlash draft model config.json:
- 26B-A4B (30 layers): {1, 6, 11, 17, 22, 27}
- 31B (60 layers): {1, 12, 23, 35, 46, 57}
Fall back to evenly-spaced for unknown layer counts.
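The lookup-plus-fallback logic can be sketched as follows. The hardcoded tables come from the values quoted above; the evenly-spaced fallback formula is a plausible stand-in, not necessarily the repo's exact one:

```python
# Known-good capture-layer IDs from the z-lab DFlash draft config.json
# (values quoted in the commit message above).
KNOWN_LAYER_IDS = {
    30: [1, 6, 11, 17, 22, 27],   # 26B-A4B
    60: [1, 12, 23, 35, 46, 57],  # 31B
}

def target_layer_ids(n_layers, n_capture=6):
    """Return capture-layer IDs: exact table when known, else an
    evenly-spaced fallback (one plausible formula; the repo's actual
    fallback may differ)."""
    if n_layers in KNOWN_LAYER_IDS:
        return KNOWN_LAYER_IDS[n_layers]
    step = n_layers / n_capture
    return [int(i * step + step / 2) for i in range(n_capture)]
```

Note that for n_layers=30 this fallback would yield [2, 7, 12, 17, 22, 27], not the config's [1, 6, 11, 17, 22, 27] — which is exactly why the hardcoded tables are needed.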
The draft model was stateless (no KV cache), giving 0% speculative acceptance. Add prefix-direct KV materialization: target features are projected through FC → hidden_norm → per-layer K/V, stored in a dedicated draft KV cache. The draft forward now attends to this cache, matching the SGLang/vLLM DFlash architecture. Gemma4-26B-A4B with draft: avg 10.67 tokens accepted per step, ~250 tok/s decode on RTX 3090 (vs ~67 tok/s baseline).
Replace single-token autoregressive prefill with chunked batched forward. Each chunk processes up to swa_window tokens in a single GPU dispatch, cutting prefill from ~66 tok/s to ~830-1060 tok/s on RTX 3090. Add swa_mask to GemmaGraphInputs so SWA attention layers use a sliding-window mask during batched prefill while full-attention layers keep the standard causal mask.
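The chunking scheme can be sketched as follows; `forward_batch` stands in for the real batched GPU forward and is purely illustrative:

```python
def chunked_prefill(tokens, chunk_size, forward_batch):
    """Feed the prompt in chunks instead of one token at a time.

    forward_batch(chunk, kv_start) is a stand-in for the real batched
    forward pass (one GPU dispatch per chunk); names are illustrative,
    not the repo's API.
    """
    kv_start = 0
    for i in range(0, len(tokens), chunk_size):
        chunk = tokens[i:i + chunk_size]
        forward_batch(chunk, kv_start)  # one dispatch per chunk
        kv_start += len(chunk)
    return kv_start

calls = []
chunked_prefill(list(range(10)), chunk_size=4,
                forward_batch=lambda chunk, kv_start: calls.append((len(chunk), kv_start)))
assert calls == [(4, 0), (4, 4), (2, 8)]  # three dispatches, not ten
```

The speedup comes from amortizing the per-dispatch overhead: ten tokens cost three dispatches here, and a real prompt costs ceil(n/chunk_size) instead of n.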
Add --csv flag for direct use with test_gemma4_dflash --tokens. Default model changed to google/gemma-4-26b-a4b-it. Add --verbose flag, local_files_only caching, and --add-bos option.
Converts BF16 safetensors draft weights to Q8_0 GGUF format. Projection weights quantized to Q8_0 (~50% size), norms kept F32. Includes Gemma4-specific GGUF metadata (sliding_window, logit_softcap, target_layer_ids). Requires a C++ GGUF loader to be used at inference.
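For reference, ggml's Q8_0 scheme quantizes each 32-value block with a shared scale. A minimal sketch of one block (the real script additionally packs the scale as FP16 and serialises the tensors into GGUF):

```python
def quantize_q8_0_block(block):
    """Quantize one 32-value block to Q8_0: scale d = amax / 127,
    int8 codes q = round(x / d). Standard ggml Q8_0 scheme."""
    assert len(block) == 32
    amax = max(abs(x) for x in block)
    d = amax / 127.0 if amax > 0 else 0.0
    qs = [int(round(x / d)) if d else 0 for x in block]
    return d, qs

def dequantize_q8_0_block(d, qs):
    """Reconstruct approximate values from scale + int8 codes."""
    return [d * q for q in qs]
```

One FP16 scale plus 32 int8 codes is 34 bytes per 32 BF16 values (64 bytes), which is where the "~50% size" figure for the projection weights comes from.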
Three bugs prevented coherent speculative decoding output:
1. Missing BOS token: Gemma4 requires BOS (token 2) at position 0. Auto-prepend from GGUF bos_token_id when not already present.
2. Missing EOT fallback: many Gemma4 GGUFs omit eot_token_id, so eos_chat_id stayed -1 and <end_of_turn> (107) was never caught. Default to 107 when the key is absent.
3. Uninitialized SWA mask in speculative verify: when n_tokens > 1, build_gemma4_step allocates swa_mask but only attn_mask was filled. SWA layers used garbage memory, corrupting all hidden states and collapsing output to token 0 (padding) from step 2 onward.

Verified: DFlash now produces identical output to the AR baseline and stops at EOS. Gemma4-31B Q4_K_M + TQ3_0 KV = 80.82 tok/s (2.37x over AR 34.14 tok/s) on RTX 3090.
… script load_gemma4_draft_gguf() reads Q8_0-quantized draft weights from GGUF, auto-detected by .gguf extension on --draft path. Q8_0 drafter matches BF16 acceptance (AL=6.74) while loading 44% faster and using 380MB less VRAM. quantize_gemma4_draft_q8.py now reads config.json for model dimensions instead of hardcoding 26B constants, supporting both 26B-A4B and 31B drafters.
…attention: Layer-by-layer prefill using FlashPrefill block-sparse WMMA attention for full-attention layers and ggml FA for SWA layers. Includes a gallocr pre-reserve to eliminate graph-allocator overhead and fused [B+SWA] graphs to reduce hidden_buf round-trips. Benchmarks at 6K tokens (26B-A4B): 4073 tok/s (+12% over chunked prefill). Real gains are expected at 64K+, where attention density drops below 10%.
Add --pflash, --pflash-alpha, and --tokens-file flags to test harness. --tokens-file reads comma-separated IDs from a file, bypassing ARG_MAX limits for prompts >16K tokens. Fix draft KV cache overflow crash when prompt exceeds draft sliding window (2096 slots). Clamp prefill to trailing window, adjust ring-buffer offset, and add defensive assert in build_draft_kv_prefill_graph().
… in FA FWHT rotation for TQ3_0 KV cache is now handled inside the Flash Attention CUDA kernel via warp-cooperative shuffle. Remove the separate ggml_turbo_wht graph ops from build_swa_attn_block() and build_full_attn_block().
SWA layers only need swa_window slots, not the full context. At 64K with Gemma4 (50 SWA, 10 full-attn layers), this saves 81.8% of KV VRAM. Ring-buffer read/write positions use modular arithmetic so SWA cache views never exceed tensor boundaries at long contexts. Verified: 31B Dense at 64K uses 22.06 GB (target-only), 24.00 GB (full stack with Q8_0 draft + TQ3_0 KV + DFlash decode at 29.26 tok/s).
After prefill fills all 2096 draft KV slots, the first decode step would crash with "draft KV overflow". Now wraps draft_kv_pos with modulo arithmetic, treating the draft cache as a ring buffer.
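The wrap behaviour can be sketched in a few lines; `DraftKVRing` is a hypothetical name, and the real cache stores K/V tensors per slot rather than bare indices:

```python
class DraftKVRing:
    """Fixed-capacity KV slot allocator treated as a ring buffer.

    Illustrative sketch: instead of crashing on overflow, the write
    position wraps with modulo arithmetic, overwriting the oldest slot.
    """
    def __init__(self, capacity):
        self.capacity = capacity
        self.pos = 0          # total tokens ever written

    def next_slot(self):
        slot = self.pos % self.capacity
        self.pos += 1
        return slot

ring = DraftKVRing(2096)          # the draft window size quoted above
for _ in range(2096):
    ring.next_slot()
assert ring.next_slot() == 0      # first post-overflow write wraps to slot 0
```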
Decouple Graph A/B chunk size (32K) from SWA window (1K-2K). Batch consecutive SWA layers into single ggml graphs to reduce graph build overhead. SWA_CHUNK now tracks actual cache allocation. Full-attn layers keep the existing Graph A → pFlash → Graph B path. pFlash integration into single-graph-per-chunk architecture is next.
Replaces the layer-by-layer gemma4_pflash_prefill() with a single-graph-per-chunk path using the new GGML_OP_FLASH_ATTN_SPARSE op for full-attention layers. SWA layers continue to use ggml_flash_attn_ext.

Perf (MoE 26B-A4B at 64K, RTX 3090, Q8_0 KV):
- chunked baseline: 1867 tok/s prefill, 100.6 tok/s decode, 10.67/16 accept
- + --pflash: 3374 tok/s prefill (1.81x), 101.8 tok/s decode

Changes:
- Adapter (pflash_ggml_adapter.cpp/h) registers the pFlash CUDA kernel with the ggml op. Maps alpha>=1.0 to fully-dense mode.
- build_full_attn_block() conditionally uses ggml_flash_attn_sparse when use_pflash is set.
- attn_mask is skipped (in graph + driver) when use_pflash=true, since the sparse op applies block-level causal masking internally.
- gemma4_pflash_prefill.cpp removed (replaced by the chunked path).
- test/test_flash_attn_sparse.cpp: TDD coverage for the ggml op (dense vs sparse @ alpha=1.0 within BF16 precision; alpha<1.0 liveness).

Ported upstream fixes:
- TQ3_0 mask stride (PR Luce-Org#128): bump g_kq_stride_pad to 256 when KV is selected via DFLASH27B_KV_K/V env vars. Prevents NaN at chunk sizes 256/512/1024/2048 with TQ3_0 KV.
- last_token_logits_only (PR Luce-Org#108): skip the lm_head matmul over all but the last token during prefill chunks. Saves ~1GB of output tensor and ~1000x lm_head compute per chunk on Gemma4-31B (vocab=262144).
Three correctness fixes after benchmarking exposed silent corruption when --pflash was combined with quantized KV:
1. Graph-level type check in build_full_attn_block: dispatch to ggml_flash_attn_sparse only when K/V are F16/Q8_0/Q4_0. TQ3_0 falls back to ggml_flash_attn_ext because TQ3's WHT rotation requires special handling not yet in the sparse path.
2. Always allocate attn_mask in test_gemma4_dflash (previously skipped when use_pflash=true). When some full-attn layers fall back to dense FA (non-supported KV types), the mask is required.
3. Guard ggml_backend_tensor_set on attn_mask/swa_mask buffer existence: when all full-attn layers use sparse FA, the mask tensor is unreferenced by any compute op, so gallocr leaves its buffer NULL. ggml_set_output is added as a hint but doesn't force allocation; skip the write when the buffer is NULL. swa_mask gets the same defensive check.

Measured on Gemma-4-31B Q4_K_M, RTX 3090, Q8_0 KV:
- 4K: 1348 -> 1483 tok/s prefill (+10%), output matches baseline
- 8K: 1441 -> 1546 tok/s prefill (+7.3%), block-sparse approximation

The earlier MoE 64K "1.81x speedup" claim was on the broken sparse path (reading Q8 bytes as F16); that data point is invalid. The current numbers are on verified-correct execution. TQ3_0 + the chunked path is broken independently of pflash (produces token 0); it needs separate debugging.
The host-built SWA causal mask was filled in absolute KV coordinates (mask[q][abs_k] = 0 for valid keys), but the FA CUDA kernel reads it indexed by view position (k_view = 0..effective_win_len-1, where slot 0 = the cache offset where the K view starts). For every prefill chunk where kv_start > 0, the K view starts at ring_win_start in the cache (computed in build_swa_attn_block as kv_start - swa_window aligned to the ring buffer). The mask cell [q][k_view=0] was written assuming absolute slot 0, which is far before the window's lo bound, so it stayed -inf. The kernel then saw every K-view position as -inf for q rows touching that chunk.

Symptoms:
- Q8/F16 KV: degraded but plausible-looking output (NaNs absorbed by saturating arithmetic; argmax landed on some non-zero index)
- TQ3_0 KV: clean NaN propagation through the WHT-rotated FA path; argmax over NaN-containing logits returns 0 (because `if (x[i] > best)` is false for NaN). This is why "TQ3 produces token 0" was the visible failure mode.

Fix:
- Add a SwaView struct + compute_swa_view() helper in internal.h / gemma4_target_graph.cpp encapsulating the (abs_win_start, effective_win_len, ring_win_start) math
- build_swa_attn_block calls the helper instead of inlining
- build_swa_causal_mask in the test driver takes (abs_win_start, win_len, n_tokens, kv_start, swa_window); writes mask[q][k_view] for k_view in [0, win_len), using abs_win_start + k_view to check the absolute causal window
- swa_mask tensor sized [align_up(effective_win_len, g_kq_stride_pad), q_pad] instead of [align_up(kv_len, g_kq_stride_pad), q_pad]
- Both the prefill chunk loop and the spec-decode verify loop call the helper to get matching geometry

Measured impact (Gemma-4-31B Q4_K_M, RTX 3090):
- 8K Q8 baseline last sampled token: 236770 (broken) -> 236799 (correct)
- 8K Q8 +pflash: 1284 -> 1497 tok/s (+16.6%)

The bug entered with chunked prefill (commit 7ce68ac); the SWA ring-buffer (commit f2c36bc) made the offset non-monotonic in kv_start.
The reference Qwen3.5 driver (test/test_dflash.cpp:547-565) already had this correct via `out_mask[q*kv_pad + (k - win_start)]`. TQ3_0 still produces token 0 after this fix; that is a separate TQ3-specific bug.
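A sketch of the corrected indexing, with names borrowed from the commit message but simplified signatures (the real helper also handles ring alignment and stride padding):

```python
NEG_INF = float("-inf")

def build_swa_causal_mask(n_tokens, kv_start, abs_win_start, win_len, swa_window):
    """Fill mask[q][k_view] in VIEW coordinates: k_view 0 maps to the
    absolute position abs_win_start, not absolute slot 0 (the bug).
    A key is visible iff it is causal and inside the sliding window."""
    mask = [[NEG_INF] * win_len for _ in range(n_tokens)]
    for q in range(n_tokens):
        q_abs = kv_start + q
        lo = max(0, q_abs - swa_window + 1)       # window lower bound
        for k_view in range(win_len):
            k_abs = abs_win_start + k_view        # view -> absolute
            if lo <= k_abs <= q_abs:
                mask[q][k_view] = 0.0
    return mask
```

With the old absolute indexing, k_view=0 was checked as if k_abs were 0, which for any chunk past the first falls below `lo` — so the whole row collapsed to -inf, exactly the failure described above.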
Wires the Gemma4 binary into scripts/server.py so the OpenAI-compatible HTTP server can serve Gemma-4-31B and Gemma-4-26B-A4B (with the pFlash + DFlash + chunked prefill stack we built this session).

## test/test_gemma4_dflash.cpp

Added a daemon mode that mirrors the IPC protocol used by test_dflash (the Qwen3.5 binary):
- New flags: --daemon, --stream-fd=N, --max-ctx=N (alias for --ctx-size)
- No-op flags accepted for cmdline compatibility with server.py: --fast-rollback, --ddtree, --ddtree-budget=B, --ddtree-temp=F, --ddtree-no-chain-seed
- After model load, prints "[daemon] ready" to stdout and enters a stdin loop reading line-based commands
- Supported command: <prompt_bin_path> <n_gen> [samp=t,p,k,r[,seed]]
- prompt_bin_path is a binary file of int32 LE token IDs
- Each generated token is written as int32 LE to stream_fd; a -1 sentinel marks end of generation
- Unsupported commands (RESTORE, SNAPSHOT, compress, park, ...) are acknowledged with the -1 sentinel for now (out of scope for v1)

## scripts/server.py

- _read_gguf_architecture() reads general.architecture from a GGUF
- main() detects "gemma4" and switches DEFAULT_BIN to test_gemma4_dflash
- For Gemma4 the draft argument stays a directory (matching the binary's CLI); for Qwen3 it stays a file as before
- The daemon command is built differently per arch: Gemma4 uses --model / --draft named flags and accepts --pflash; Qwen3 keeps the existing positional form
- New top-level --pflash flag passes through to the Gemma4 daemon

Smoke-tested locally with the 26B-A4B model + a 4096-token prompt, n_gen=16: the daemon prints "[daemon] ready", consumes the binary prompt file, runs chunked prefill, decodes 16 tokens streamed as int32 LE on fd=3, and emits the -1 sentinel. Tokens are valid Gemma4 vocab IDs.
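The token-stream framing can be sketched from the client side; `encode_tokens` / `decode_stream` are hypothetical helper names illustrating the int32-LE + sentinel format described above:

```python
import struct

SENTINEL = -1  # end-of-generation marker on the stream fd

def encode_tokens(tokens):
    """Pack token IDs as int32 little-endian, terminated by the -1
    sentinel — the wire format the daemon writes to stream_fd."""
    return b"".join(struct.pack("<i", t) for t in tokens) + struct.pack("<i", SENTINEL)

def decode_stream(data):
    """Read int32 LE values until the -1 sentinel (client side)."""
    tokens = []
    for (t,) in struct.iter_unpack("<i", data):
        if t == SENTINEL:
            break
        tokens.append(t)
    return tokens

assert decode_stream(encode_tokens([2, 107, 5001])) == [2, 107, 5001]
```

The prompt file uses the same int32-LE encoding (without the sentinel), so a server.py-style client can write prompts and read generations with one `struct` format string.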
The parent's submodule pointer references commits that live only on github.com/dusterbloom/llama-cpp-turboquant-cuda (our pflash sparse-FA work). Update .gitmodules so cloners fetch from that fork instead of the upstream Luce-Org/llama.cpp-dflash-ggml repo (which doesn't have these commits). Maintainer can rewrite this URL post-merge if the commits get mirrored to a Luce-Org repo.
11 issues found across 20 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="dflash/src/errors.cpp">
<violation number="1" location="dflash/src/errors.cpp:30">
P2: Returns a pointer into shared mutable error storage after unlocking, so concurrent `set_last_error()` calls can invalidate the returned `const char *`.</violation>
</file>
<file name="dflash/scripts/quantize_gemma4_draft_q8.py">
<violation number="1" location="dflash/scripts/quantize_gemma4_draft_q8.py:227">
P2: Missing validation for empty `target_layer_ids` can crash quantization with modulo-by-zero when computing `TARGET_HIDDEN`.</violation>
</file>
<file name="dflash/scripts/server.py">
<violation number="1" location="dflash/scripts/server.py:69">
P2: Swallowing GGUF read errors here makes Gemma4 detection fail open, so the server silently takes the non-Gemma4 daemon path and uses the wrong argv shape instead of failing explicitly.</violation>
<violation number="2" location="dflash/scripts/server.py:893">
P2: Gemma4 draft path validation accepts non-directory paths by falling back to the parent directory, masking typos and using the wrong draft directory.</violation>
</file>
<file name="dflash/src/gemma4_target_loader.cpp">
<violation number="1" location="dflash/src/gemma4_target_loader.cpp:604">
P2: Failure paths after allocating `out.buf` return without cleaning up partial `GemmaTargetWeights`, so load errors can leak backend memory unless every caller manually frees on failure.</violation>
<violation number="2" location="dflash/src/gemma4_target_loader.cpp:675">
P2: Missing validation that `tok_embd_sz` is divisible by `n_vocab` before deriving `row_bytes` can corrupt embedding row strides for malformed GGUFs.</violation>
</file>
<file name="dflash/CMakeLists.txt">
<violation number="1" location="dflash/CMakeLists.txt:157">
P2: pFlash is gated by the first CUDA arch entry instead of the true minimum SM, which can wrongly enable sm80-only sources for unsorted mixed-arch builds.</violation>
</file>
<file name="dflash/test/test_flash_attn_sparse.cpp">
<violation number="1" location="dflash/test/test_flash_attn_sparse.cpp:107">
P2: The dense-vs-sparse correctness check is too permissive and can mask bad outputs, including non-finite values.</violation>
</file>
<file name="dflash/src/gemma4_dflash_graph.cpp">
<violation number="1" location="dflash/src/gemma4_dflash_graph.cpp:184">
P2: Missing bounds validation for kv_start + n_tokens before KV-cache writes in build_gemma4_draft_graph().</violation>
</file>
<file name="dflash/test/test_gemma4_dflash.cpp">
<violation number="1" location="dflash/test/test_gemma4_dflash.cpp:906">
P2: Daemon requests with the default seed 0 never reseed the shared RNG, so sampling becomes order-dependent across requests.</violation>
<violation number="2" location="dflash/test/test_gemma4_dflash.cpp:1591">
P2: Resetting `draft_kv_pos` to 0 on cache overflow discards the draft context instead of preserving a valid context length, so speculative decoding runs with an empty draft KV cache once capacity is reached.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
@dusterbloom can you fix the merge conflict? Great contribution!!
Bundled defensive fixes from code review:
1. errors.cpp: thread-local snapshot of last_error before c_str() return —
prevents concurrent set_last_error() from invalidating the returned
pointer across threads.
2. server.py:69: log GGUF read failures to stderr instead of silently
returning ""; prevents Gemma4 detection from failing open and using
the wrong daemon argv shape.
3. server.py:893: explicit branches for is_dir / is_file / not-found
on --draft path; no more silent fallback to parent directory that
masks user typos.
4. quantize_gemma4_draft_q8.py: confirmed existing N_TARGET_LAYERS == 0
guard at line 215 prevents the modulo-by-zero (no edit required).
6. gemma4_target_loader.cpp: cleanup_out lambda frees out.buf and
resets state on every failure path after the buffer allocation —
prevents backend memory leak on load errors.
6. gemma4_target_loader.cpp: validate tok_embd_sz % n_vocab == 0
before computing row_bytes — fails fast on malformed GGUFs instead
of corrupting embedding strides.
7. CMakeLists.txt: replace list(GET _dflash27b_archs 0 ...) with an
explicit min loop over all configured CUDA arches — pFlash now
correctly disables when ANY arch in the list is below sm_80.
8. test_flash_attn_sparse.cpp: add explicit non-finite (NaN/inf) check
in the dense-vs-sparse comparison; printf reports nonfinite=YES/no
and the return value requires both finite values and max_diff < 1.0.
9. gemma4_dflash_graph.cpp: GGML_ABORT on out-of-bounds kv_start +
n_tokens at the top of build_gemma4_draft_graph — catch at graph
build time instead of corrupted-memory crash later.
10. test_gemma4_dflash.cpp daemon: always reseed the RNG per request
(random_device when seed=0); prevents order-dependent sampling
across concurrent daemon requests.
11. test_gemma4_dflash.cpp draft KV overflow: replace the hard reset
cache.draft_kv_pos = 0 with a sliding-window re-prefill from the
last `keep = dkv_cap - q_len` accepted tokens. This was discarding
ALL draft context once the ring filled, causing DFlash speculative
acceptance to collapse from 10.67/16 (32K) to 1.23/16 (64K) — matching
the LongSpec arXiv:2502.17421 long-context regression mode for
EAGLE-style drafters.
Also includes the WIP TQ3 rotation infrastructure (submodule pointer
bump). Self-test DFLASH_TQ3_VERIFY=1 confirms the rotation is
mathematically reversible (max_diff=0.000000 on roundtrip). TQ3 chunked
output still wrong; the bug is downstream of rotation.
# Conflicts:
#	dflash/deps/llama.cpp
#	dflash/scripts/server.py
Two interlocking bugs were silently corrupting Gemma4 multi-chunk prefill, producing all-zero decoded tokens (and an artificially high spec accept rate, because target and drafter both predict token 0 deterministically).

1. The SWA ring optimization (swa_ctx_alloc = swa_window + headroom) saves VRAM at long contexts but ring-wraps during multi-chunk prefill. The K view is constrained to a single contiguous ring slice [ring_win_start, ring_size), which on wrap covers only the pre-wrap portion. Post-wrap tokens (the latest writes) are silently omitted — queries at positions spanning the wrap can't attend to themselves or recent context.
   Pragmatic fix: swa_ctx_alloc = max_ctx_alloc unconditionally. SWA layers behave like full-attn during prefill. We lose the VRAM optimization but restore correctness. Future work: implement double-view SWA reads (concat pre-wrap + post-wrap views) so the memory savings can come back without a correctness regression.
2. The SWA ring-wrap also produced a non-256-aligned win_len_padded clamp for TQ3_0 (which requires FATTN_KQ_STRIDE=256), causing a SIGSEGV. Snap ring_win_start down to the nearest 256-multiple so the K view length stays aligned. The mask already excludes the extra padded tokens. Now redundant given (1), but kept as a safety net.

Also adds an env-gated [CACHE-WRITE-PROBE] in the test driver (DFLASH_TQ3_PROBE_CACHE_WRITE=1) for future debugging.

Submodule bump pulls in:
- fix(ggml-cuda): honor view_offs in cpy data pointer
- perf(ggml-cuda): skip cudaMemGetInfo on chunked-FA hot path

Verified end-to-end on RTX 3090:
- Dense 31B + Q8 + draft @ 2.5K = real tokens (was: all zeros)
- Dense 31B + TQ3 + draft @ 2.5K = real tokens (was: SIGSEGV)
- MoE 26B + TQ3 + draft @ 16K = real tokens, 1969 tok/s prefill
- Dense 31B + TQ3 + draft @ 4K = real tokens, 480 tok/s prefill
Replace the disable-fix (swa_ctx_alloc = max_ctx_alloc) with a properly-
sized ring + non-monotonic mask formula. Restores 70-95% SWA cache
VRAM savings at long contexts while keeping multi-chunk correctness.
Architecture:
- Ring sized to hold the last R = 2 * swa_window keys (= 2 chunks worth).
Always contains the relevant key window for any chunk, but in non-
monotonic order after wrap (newest tokens land in pre-wrap slots).
- K view is ALWAYS the full ring (ring_win_start = 0, len = ring_size).
The kernel reads the full ring; correctness comes from the mask.
- build_swa_causal_mask uses an abs_pos formula:
latest_slot = (kv_end - 1) % ring_size
offset_back = (latest_slot - k_view + R) % R
abs_k = (kv_end - 1) - offset_back
This handles any wrap pattern correctly.
- K/V WRITE path splits on wrap: when kv_start % R + n_tokens > R,
issue two ggml_cpy ops (pre-wrap [write_pos, R) + post-wrap [0, post_n)).
- compute_swa_view returns full-ring geometry; no truncation, no
alignment-snap, no contiguous-segment assertion.
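The abs_pos formula above can be sanity-checked directly (here with a toy ring size; in the commit, R equals the ring size):

```python
def abs_pos(k_view, kv_end, ring_size):
    """Map a full-ring view slot to its absolute token position using
    the non-monotonic formula from the commit message (R == ring_size)."""
    latest_slot = (kv_end - 1) % ring_size
    offset_back = (latest_slot - k_view + ring_size) % ring_size
    return (kv_end - 1) - offset_back

# With ring_size=8 and kv_end=13, the ring holds positions 5..12 in
# wrapped (non-monotonic) slot order; the mapping recovers each slot's
# absolute position, so the mask can apply the causal window correctly.
positions = sorted(abs_pos(k, 13, 8) for k in range(8))
assert positions == list(range(5, 13))
```

Since the kernel always reads the full ring, any slot holding a stale or out-of-window position simply gets -inf from the mask — correctness lives entirely in this mapping.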
Verified on RTX 3090, ~15 min run including TQ3 trifecta:
T1 single-chunk @ 900 (Q8 + draft): sampled=236774, real tokens
T2 2-chunk @ 2.5K (Q8 + draft): decoded 514, 4755, 822, 2864...
T3 ring-wrapping @ 8K (Q8 + draft): 1340 tok/s, real tokens
T4 MoE 16K + TQ3 + draft (the one): 2489 tok/s, swa=2048, saved 72.9%
VRAM at 64K Gemma4-31B: previously 5.5 GB SWA cache (disable-fix),
now ~0.18 GB (50 SWA layers * 2048 * 1792B = 30x reduction).
Submodule bump pulls in the [TQ3-DEQ] printf re-gate.
Adds per-layer KV type machinery + a narrow override that forces Q8_0
on the small subset of full-attn layers whose hidden states are captured
for the DFlash drafter (target_feat ring). Mirrors vLLM's
kv-cache-dtype-skip-layers pattern.
Why: upstream FA dispatch (deps/llama.cpp/.../fattn.cu:441) routes
TQ3_0 + Q->ne[0]>256 to slow CHUNKED kernel. On Dense Gemma4-31B
(full-attn head_dim=512), this is a perf trap. Forcing the drafter's
captured layers to Q8 unblocks the pflash sparse fast path for the
slice the draft consumes.
Gate: kv_type==TQ3 && head_dim>256 && draft wired (capture_layer_ids
non-empty). SWA layers always exempt (don't hit the trap).
Empirical impact (RTX 3090, Dense 31B Q4_K_M + TQ3 + draft + pflash @ 4K):
- Dense override fires on 2 of 10 full-attn layers (capture IDs 12, 46)
- Prefill 48 -> 50 tok/s (marginal; 8 remaining full-attn still slow)
- MoE override fires on 2 of 4 captured (3 keep TQ3); no regression
(1464 tok/s under GPU contention vs 2489 dedicated)
- Q8 control unchanged (gate requires TQ3)
Recommendation for production: Dense 31B + draft -> use Q8_0 KV
(505 tok/s prefill in our testing) until an upstream MMA-F16 TQ3
dequant kernel for head_dim=512 lands. TQ3 KV remains optimal for
MoE 26B-A4B (2489 tok/s @ 16K).
Per-layer machinery (kv_k_type_per_layer, kv_v_type_per_layer) is kept
infrastructure for future asymmetric experiments.
Submodule commit 580246202 adds an opt-in (DFLASH_TQ3_MMA=1) route for TQ3_0 KV through the MMA-F16 tensor-core path:
- New k_tq3_0_dequant_f16_full bulk-dequant kernel
- Intercept in ggml_cuda_flash_attn_ext_mma_f16 with pool-allocated f16 K/V temp buffers
- tq3_needs_chunked guard lifted when the env var is set

Target prefill (Dense 31B + TQ3 + pflash, no draft): 420 -> 610 tok/s.

Note: with --draft enabled, Dense+TQ3 still hits the 9x penalty bug (separate from FA dispatch). The MMA fix is a building block toward closing the gap.
When --draft is a directory containing both draft-q8_0.gguf (1.6 GB) and model.safetensors (3 GB BF16), prefer the GGUF. The BF16 safetensors draft pushed Dense+TQ3 over the 24 GB VRAM ceiling on a 3090, which fragmented the allocator and triggered host-side cudaStreamSynchronize stalls (per nsys: 67% of total CUDA time, max sync 1.5s) — collapsing target prefill from 800+ tok/s to 41 tok/s. The fix detects this case, logs a warning so the user knows what happened, and loads the GGUF.

Empirical impact (RTX 3090, draft path = directory):
- Dense 31B + TQ3 + draft + pflash @ 4K: 41 -> 797-852 tok/s (~20×)
- MoE 26B + TQ3 + draft + pflash @ 16K: 2489 -> 3089 tok/s (+24%)
- VRAM (MoE 16K): 24.0 GB -> 19.3 GB

This makes 852 tok/s the new ceiling for our Dense-31B + TQ3 + spec-decode trifecta on a single RTX 3090, beating the prior best-known by ~6× (stock llama.cpp/ollama hangs at 3-4K — see ollama#15350).

Bonus: explicit `--draft .../draft-q8_0.gguf` already worked; this just removes the foot-gun for users passing the directory.
…le bench harnesses

This commit captures the empirical scaffolding used in this session to validate three earlier fixes (TQ3 dispatcher d758ed9bf, SWA mask 7b62c07, head_dim=512 mask f1f811e). Together those fixes unlocked TQ3 KV at all contexts and let MTP+Q8/Q8 run past step ~110.

Contents:
- .sisyphus/plans/gemma4-context-scaling.md — phased plan to test all configs at 1k/4k/8k/32k/64k/256k for the user-facing tuning guide.
- 6 BPE-tokenised prompt files (Gemma 4 vocab; the HF google/gemma-3-27b-it tokeniser is byte-identical to the GGUF) so benches are reproducible:
  - short_chat (27 tok) — pangram-explanation chat
  - long_open (40 tok) — robot-painting open prompt
  - long_2k (2611 tok) — Alice in Wonderland Ch. 1
  - long_50k (49904 tok) — Tiny Shakespeare summarisation
  - long_code_50k (50002 tok) — concatenated HumanEval+ tasks (code)
  - humaneval_2 (139 tok) — single HE task, EvalPlus chat format
  Each prompt has a .meta sidecar with tokenizer + chat-template + source.
- generate_prompts.py — the original tokenizer harness used to produce short_chat / long_open / long_2k. The 50k prompts were generated by inline scripts since they pulled from disk-local sources (Tiny Shakespeare; the in-repo HumanEval+ jsonl).
- 5 reproducible bench runners (run_*.sh):
  - run_matrix_v3.sh — pre-fix 4-cell target/MTP × Q8/TQ3 matrix
  - run_64k_drafter_ab.sh — 3-way drafter A/B at 64k (pre-fix snapshot)
  - run_64k_v2.sh — 3-cell post-fix 64k re-run
  - run_scaling.sh — dense Q8 64k verify + MoE Q8 16k→256k sweep
  - run_dm_sweep.sh — MoE dm sweep on 50k code prompt at 64k+256k
- SUMMARY.md — headline numbers from each completed matrix.
Headline numbers from the committed SUMMARYs:
- 31B dense + Q8/Q8 + dflash + dm=16 + HumanEval/2 @ 4K: decode 97.81 tok/s, AL 6.56 (~30% under PR Luce-Org#131's 149 ref; the gap is task-mix variance, not a regression)
- 31B dense + MTP + Q8/Q8 + HumanEval/2 @ 4K: decode 34.36 tok/s, accept_rate 0.87 (was aborting at step ~112 pre-f1f811e)
- 31B dense + Q8/Q8 + pflash @ 64K (long_50k Shakespeare prompt): prefill 1402 tok/s, decode 7.96 tok/s, VRAM 22.60 GB (proves Q8/Q8 fits in 24 GB at 64K; was previously assumed to OOM)
- 31B dense + TQ3/TQ3 + pflash @ 64K: prefill 585 tok/s, decode 6.90 tok/s, VRAM 21.25 GB
- MoE 26B + dflash + Q8/Q8 + dm=4 + pflash + ctx=256K: fits at VRAM 21.74 GB on a 24 GB 3090; decode ~30 tok/s (the production-relevant 256K config — fits with 2.3 GB to spare)

The dm-sweep results dir is intentionally NOT committed here (run still in progress). Per-cell raw .log files are also omitted to keep the commit slim; they're reproducible from the runners + prompts on disk.
…on effect, dm sweep, ship config for 24 GB GPUs

Companion narrative to PR Luce-Org#131's amended benchmark section. Documents the path from a contaminated `0.22 accept_rate` baseline (byte-fallback tokenisation on out-of-distribution input) through three correctness fixes to a 24 GB RTX 3090 ship config that runs Gemma-4-26B-A4B MoE + dflash + Q8/Q8 + pflash + dm=4 at 35-37 tok/s decode across 64K-256K context with ~22 GB VRAM.

Three commits referenced:
- d758ed9bf (submodule) fix(fattn): force chunked path for TQ3 K to avoid the MMA-intercept FWHT mismatch
- 7b62c07 (parent) fix(gemma4): allocate+fill SWA mask for n_tokens==1 decode + bump llama.cpp
- f1f811e (parent) fix(mtp): always provide FA mask for head_dim>=512

Sections:
1. The setting (hardware, models, drafters, stack)
2. Day 0: the contaminated baseline (byte-fallback tokens)
3. Bug 1: SWA mask missing for single-token decode
4. The bisect that proved the bug was older
5. Bug 2: TQ3 K dequant intercept silently strips the FWHT rotation
6. Bug 3: head_dim=512 + Q8/Q8 MMA gqa-opt requires a non-null mask
7. The HumanEval surprise: drafter quality is prompt-distribution-bound
8. DM sweep: PR Luce-Org#131's 64K result was over-speculation, not collapse
9. Scaling: MoE 26B + dflash fits 256K on 24 GB at 35-37 tok/s
10. What still hurts: bandwidth, not bugs (24% of theoretical ceiling)
11. Lessons that would have saved us a weekend
12. Production ship config table
13. Open questions (drafter cache size, decode-time KV sparsity, SWA-wrap branch, MoE MTP head training, head_dim=512 kernel cleanup)

Aimed at engineers maintaining a fork of llama.cpp for speculative decoding (DFlash, MTP, Medusa-class) on consumer GPUs targeting Gemma-4 or any Gemma-4-like SWA + GQA + chunked-prefill model.
Three corrections to the original blog post (commit b441587):
1. Confirms pFlash IS active in the dense ladder runs — both MoE and dense logs show `[chunked+pflash, chunk_size=1024]`. The 15× prefill gap (4912 vs 319 tok/s at 64K) is architectural (MoE has ~4B active params/token over 30 layers; dense has 31B over 60 layers; the ~15× compute ratio matches the observed prefill ratio), plus VRAM-cap contention on the dense path. pflash works; it just can only skip attention, not FFN compute.
2. Adds the full dense 31B + dflash + Q8/Q8 + dm=8 + pflash ladder:
   - 64K: 1.78 tok/s decode, AL 1.94 ← anomaly, paged at the 24/24 GB cap
   - 128K: 24.89 tok/s decode, AL 7.11 (89% accept) ← healthy
   - 256K: 23.87 tok/s decode, AL 7.11 ← healthy
   The 64K-specific decode collapse, with the same drafter + same config that decodes fine at 128K/256K, is an open puzzle — likely a VRAM-allocator edge case at the 64K cache size, where the 50K-token prompt fills 78% of the cache.
3. Updates the "Production ship config" table:
   - Adds a prefill tok/s column (was missing — that's what triggered the amend; the prefill numbers tell the dense-vs-MoE story)
   - Reframes the dense long-context cell as "viable at 128K/256K once prefill is paid" rather than the previous "not viable" claim
   - Adds an explicit avoid-list entry for dense + drafter + ctx=64K

Net headline unchanged: MoE 26B + dflash + Q8/Q8 + pflash + dm=4 fits 256K context on a 24 GB 3090 at 35-37 tok/s decode and 4.9K tok/s prefill. Dense 31B is now positioned as "viable at long ctx for users willing to pay 3.5 minutes of prefill on a 50K prompt", not "not viable".
… dm sweep with GPU power telemetry

Adds run_scientific.sh and the resulting 24-cell results.csv. Each cell profiles GPU power at ~5 Hz via background nvidia-smi polling, integrates trapezoidal energy across the cell, and apportions prefill vs decode energy by the time-fraction of the binary's reported phases.

Cells: 2 models (Gemma4-31B dense, Gemma4-26B-A4B MoE) × 2 prompt distributions (HumanEval/2 code, long_open creative) × 6 draft-max budgets (1, 2, 4, 8, 16, 32). All Q8/Q8 KV, 4K ctx, n_predict=256, temp=0 seed=0 --ignore-eos, pflash on.

Headlines:
- Best decode tok/s: MoE+code+dm≥16 = 132 (plateau at dm=16; dm=32 wastes)
- Best efficiency real-spec: MoE+creative+dm=2 = 6.6 J/tok
- Dense max: 82 tok/s creative+dm=16 (the dense drafter generalises better OOD than the smaller MoE drafter — 5.12 vs 2.49 AL on creative)
- MoE+code AL plateau 5.22; MoE+creative AL plateau 2.49 — the MoE drafter is code-distribution-trained, weaker OOD
- VRAM: dense 22.1 GB, MoE 18.9 GB across all dms

Per-cell columns in results.csv: cell, rc, wall_s, prefill_ms, decode_ms, first_tok_ms, prefill_tok_s, decode_tok_s, AL, VRAM_GB, avg_power_W, total_energy_J, prefill_energy_J, decode_energy_J, decode_J_per_tok.

Hardware: RTX 3090 24 GB, CUDA 13.1, 399W TDP. Active-window peak ~395W (~99% TDP) on dense+code; MoE peaks lower (~130W avg).
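The energy integration can be sketched as follows, assuming (timestamp_s, power_W) sample pairs from the nvidia-smi poller (function name illustrative):

```python
def trapezoid_energy(samples):
    """Integrate (timestamp_s, power_W) samples into joules via the
    trapezoidal rule — the per-cell scheme described above for
    run_scientific.sh."""
    energy = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        energy += 0.5 * (p0 + p1) * (t1 - t0)
    return energy

# 2 s at a constant 100 W is 200 J
assert trapezoid_energy([(0.0, 100.0), (1.0, 100.0), (2.0, 100.0)]) == 200.0
```

Dividing a cell's decode-phase share of this integral by the tokens decoded yields the decode_J_per_tok column in results.csv.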
You may need to provide the patch for ggml.
Why doesn't the existing quantize-draft-to-Q8 script work for you? Better: enhance it to support both qwen and gemma4.
Addressed in 4935293. Merged quantize_gemma4_draft_q8.py back into quantize_draft_q8.py with --arch {qwen,gemma4} (auto-detect from config.json's model_type when present). Q8_0 scale formula and tensor-name mapping were already identical between the two scripts, so the merge is a clean parameterisation behind two _ARCH_DEFAULTS dicts. Net -134 lines. Old gemma4 file deleted; no external callers needed updating.
I purposely removed this file a couple of days ago. We can use ggml's convert function to do the same thing.
Addressed in 9ccd827. Removed f16_convert.cu; replaced 10 call sites (5 distinct blocks) with copy_target_feat_bf16_to_f32() using ggml_cpy(ctx, bf16_view, f32_view) — dispatched to ggml-cuda's native BF16→F32 path in cpy.cu. ggml_view_2d with byte offset handles ring-buffer slot arithmetic; ggml_backend_graph_compute synchronises internally so the explicit cudaDeviceSynchronize is no longer needed. Net -76 lines. CI will verify the full CUDA build.
We may want to build our loader in a generic way. For this PR, I think it is fine, but refactoring this should be a high-priority work item.
Acknowledged — agreed, the loader is gemma4-specific and a generic loader would be cleaner. Tracking as a follow-up work item alongside the qwen3↔gemma4 dflash graph unification (which has 5 material deviations per r3215263575). Both refactors estimated ~3-5 engineer-days combined; will scope a dedicated PR after this one lands.
My read is that the draft model is the same as qwen3.5's. Did I miss anything?
Thanks for looking carefully. The gemma4 dflash graph is not a rename of the qwen3 one — there are five material architectural differences:
- KV cache — gemma4 splits into two functions (`build_draft_kv_prefill_graph` + `build_gemma4_draft_graph`) sharing a persistent `GemmaTargetCache`; qwen3 is stateless (recomputes context K/V every call).
- Logit softcapping — gemma4 applies `cap * tanh(logits / cap)` (cap=30) and owns a tied lm_head; qwen3 borrows the target's lm_head with no cap.
- Embedding scaling — gemma4 scales input embeddings by `sqrt(hidden_size)` (Gemma family trait); qwen3 does not.
- 6 vs 5 captured target layers — gemma4's FC input width is `6 * target_hidden`, not 5.
- YaRN RoPE + mask-aware SWA truncation — gemma4 adds opt-in YaRN extrapolation and must slice the attention mask tensor when truncating SWA layers (because the mask is causal over the cache); qwen3's SWA only slices `positions_k`.
A unified parameterised builder is feasible but would require addressing all five axes plus the loader divergence, and would change the calling convention. Estimated ~3-5 engineer-days for the refactor + bilateral re-validation. Deferring to a follow-up refactor is the right call for this PR — happy to scope as a separate work item.
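For reference, the softcapping formula cap * tanh(logits / cap) with cap = 30 can be sanity-checked numerically. A toy sketch (inputs are illustrative; awk has no tanh builtin, so it is expanded via exp):

```shell
# cap * tanh(x / cap): near-identity for small x, saturates toward +/-cap
# for large x, which is why capped logits always land in [-30, 30].
for logit in 5 60 -120; do
    awk -v x="$logit" -v cap=30 'BEGIN {
        t = (exp(2 * x / cap) - 1) / (exp(2 * x / cap) + 1)   # tanh(x/cap)
        printf "%s -> %.2f\n", x, cap * t
    }'
done
# prints: 5 -> 4.95, 60 -> 28.92, -120 -> -29.98
```

This also matches the smoke test's assertion that target logits stay in [-30, 30].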
10 issues found across 32 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name=".sisyphus/notes/gemma4-baseline/matrix-v3/SUMMARY.md">
<violation number="1" location=".sisyphus/notes/gemma4-baseline/matrix-v3/SUMMARY.md:6">
P2: Third matrix entry is missing its rc/result line, leaving the summary incomplete and ambiguous.</violation>
</file>
<file name=".sisyphus/notes/gemma4-baseline/run_scientific.sh">
<violation number="1" location=".sisyphus/notes/gemma4-baseline/run_scientific.sh:36">
P1: Return code always 0 because `|| true` swallows command exit status before `rc=$?` captures it, making failed benchmark cells indistinguishable from successes in analysis results.</violation>
</file>
<file name=".sisyphus/notes/gemma4-baseline/run_64k_v2.sh">
<violation number="1" location=".sisyphus/notes/gemma4-baseline/run_64k_v2.sh:49">
P2: `set -e` causes the script to abort when grep finds no matches in a log file, because the unguarded `grep -E` returns exit code 1 on no-match and `set -e` exits the shell immediately. This skips the rest of the summary generation and the decoded-text comparison.</violation>
</file>
<file name=".sisyphus/notes/gemma4-baseline/run_dm_sweep.sh">
<violation number="1" location=".sisyphus/notes/gemma4-baseline/run_dm_sweep.sh:21">
P2: Exit code capture is broken: `|| true` masks the benchmark's exit status, so `rc=$?` always reports 0 even when the binary crashes.</violation>
</file>
<file name=".sisyphus/notes/gemma4-baseline/run_scaling.sh">
<violation number="1" location=".sisyphus/notes/gemma4-baseline/run_scaling.sh:25">
P2: Exit-code capture is broken: `$?` after `cmd || true` is always 0, so failures/OOMs are reported as rc=0 in SUMMARY.md.</violation>
<violation number="2" location=".sisyphus/notes/gemma4-baseline/run_scaling.sh:61">
P2: `grep` in summary loop can abort the script under `set -e` when no lines match; non-zero grep exit is treated as a fatal error.</violation>
</file>
<file name=".sisyphus/notes/gemma4-baseline/run_matrix_v3.sh">
<violation number="1" location=".sisyphus/notes/gemma4-baseline/run_matrix_v3.sh:8">
P2: Missing `pipefail` lets the Python summary step fail silently while the script still reaches `DONE`.</violation>
<violation number="2" location=".sisyphus/notes/gemma4-baseline/run_matrix_v3.sh:8">
P2: `set -e` causes expected failing cells to abort the whole matrix run, preventing full summary generation.</violation>
</file>
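Both run_matrix_v3.sh findings can be addressed together. A sketch of the assumed script shape (not the actual file): enable `pipefail` so the piped summary step cannot fail silently, and wrap expected-to-fail cells in an `if` guard so `set -e` does not abort the matrix. Assumes bash:

```shell
set -eo pipefail   # a failing producer in `cmd | tee ...` now propagates

# Run a matrix cell that is allowed to fail: the if-guard suspends set -e
# for "$@" and records the real exit code instead of aborting the run.
run_cell() {
    if "$@"; then rc=0; else rc=$?; fi
    echo "cell rc=$rc"
}

run_cell sh -c 'exit 3'   # an expected-to-crash cell (e.g. the N4 crash case)
run_cell true
echo DONE
# prints: cell rc=3, cell rc=0, DONE
```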
<file name=".sisyphus/notes/gemma4-baseline/matrix-64k/SUMMARY.md">
<violation number="1" location=".sisyphus/notes/gemma4-baseline/matrix-64k/SUMMARY.md:61">
P2: Benchmark token extraction in benchmark summary is polluted by telemetry/debug numbers, making the reported 'first 80 generated tokens' unreliable and misleading.</violation>
</file>
<file name=".sisyphus/notes/gemma4-baseline/run_64k_drafter_ab.sh">
<violation number="1" location=".sisyphus/notes/gemma4-baseline/run_64k_drafter_ab.sh:5">
P2: Hardcoded absolute local path (`/home/peppi/Dev/lucebox-hub`) makes the script non-portable and causes immediate failure on any other machine due to `set -e`</violation>
</file>
Tip: Review your code locally with the cubic CLI to iterate faster.
local POW_PID=$!

local t0=$(date +%s.%N)
./dflash/build/test_gemma4_dflash \
P1: Return code always 0 because || true swallows command exit status before rc=$? captures it, making failed benchmark cells indistinguishable from successes in analysis results.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At .sisyphus/notes/gemma4-baseline/run_scientific.sh, line 36:
<comment>Return code always 0 because `|| true` swallows command exit status before `rc=$?` captures it, making failed benchmark cells indistinguishable from successes in analysis results.</comment>
<file context>
@@ -0,0 +1,164 @@
+ local POW_PID=$!
+
+ local t0=$(date +%s.%N)
+ ./dflash/build/test_gemma4_dflash \
+ --model $model \
+ --draft $draft \
</file context>
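One way to fix this class of bug (a sketch, not the script's actual code): suspend `set -e` around the benchmark so the real exit status is captured, instead of masking it with `|| true`:

```shell
set -e

run_cell() {
    set +e        # allow the benchmark to fail without aborting the script
    "$@"
    rc=$?         # the genuine exit status; `cmd || true` would force 0 here
    set -e
}

run_cell false    # stand-in for a crashing benchmark cell
echo "rc=$rc"
run_cell true
echo "rc=$rc"
# prints: rc=1 then rc=0
```

Failed cells then show up as rc!=0 in results.csv instead of being indistinguishable from successes.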
echo "" >> $LOGDIR/SUMMARY.md
echo "### ${cell}" >> $LOGDIR/SUMMARY.md
echo '```' >> $LOGDIR/SUMMARY.md
grep -E "kv types|narrow asymmetric|^\[draft\] KV|prefill.*tokens in|context_used|tok/s=|VRAM used|^\[mtp\] steps|^\[spec\]|GGML_ABORT" $log >> $LOGDIR/SUMMARY.md
P2: set -e causes the script to abort when grep finds no matches in a log file, because the unguarded grep -E returns exit code 1 on no-match and set -e exits the shell immediately. This skips the rest of the summary generation and the decoded-text comparison.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At .sisyphus/notes/gemma4-baseline/run_64k_v2.sh, line 49:
<comment>`set -e` causes the script to abort when grep finds no matches in a log file, because the unguarded `grep -E` returns exit code 1 on no-match and `set -e` exits the shell immediately. This skips the rest of the summary generation and the decoded-text comparison.</comment>
<file context>
@@ -0,0 +1,77 @@
+ echo "" >> $LOGDIR/SUMMARY.md
+ echo "### ${cell}" >> $LOGDIR/SUMMARY.md
+ echo '```' >> $LOGDIR/SUMMARY.md
+ grep -E "kv types|narrow asymmetric|^\[draft\] KV|prefill.*tokens in|context_used|tok/s=|VRAM used|^\[mtp\] steps|^\[spec\]|GGML_ABORT" $log >> $LOGDIR/SUMMARY.md
+ echo '```' >> $LOGDIR/SUMMARY.md
+done
</file context>
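A minimal guard for this pattern (sketch; the pattern string is shortened): treat grep's exit 1 (no match) as benign while still letting real grep errors (exit >= 2) abort under `set -e`:

```shell
set -e
log=$(mktemp)
echo "nothing interesting here" > "$log"

# Unguarded, this grep would exit 1 (no match) and kill the script under
# set -e. The guard accepts exit 1 but re-raises anything worse.
grep -E "tok/s=|VRAM used" "$log" || [ $? -eq 1 ]

echo "summary generation continues"
rm -f "$log"
```

The simpler `|| true` also works here, since the grep output is informational; the `[ $? -eq 1 ]` variant just keeps hard grep failures visible.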
run() {
    local tag=$1; local ctx=$2; local dm=$3
    echo "=== ${tag} (ctx=${ctx} dm=${dm}) starting at $(date +%H:%M:%S) ===" | tee -a $LOGDIR/SUMMARY.md
    ./dflash/build/test_gemma4_dflash \
P2: Exit code capture is broken: || true masks the benchmark's exit status, so rc=$? always reports 0 even when the binary crashes.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At .sisyphus/notes/gemma4-baseline/run_dm_sweep.sh, line 21:
<comment>Exit code capture is broken: `|| true` masks the benchmark's exit status, so `rc=$?` always reports 0 even when the binary crashes.</comment>
<file context>
@@ -0,0 +1,58 @@
+run() {
+ local tag=$1; local ctx=$2; local dm=$3
+ echo "=== ${tag} (ctx=${ctx} dm=${dm}) starting at $(date +%H:%M:%S) ===" | tee -a $LOGDIR/SUMMARY.md
+ ./dflash/build/test_gemma4_dflash \
+ --model $MOE \
+ --draft $MOE_DFLASH \
</file context>
echo "" >> $LOGDIR/SUMMARY.md
echo "### $tag" >> $LOGDIR/SUMMARY.md
echo '```' >> $LOGDIR/SUMMARY.md
grep -E "context_used|kv types|prefill.*tokens in|tok/s=|VRAM used|^\[mtp\] steps|^\[spec\]|GGML_ABORT|out of memory|cudaMalloc|^\[draft\] KV" $log >> $LOGDIR/SUMMARY.md
P2: grep in summary loop can abort the script under set -e when no lines match; non-zero grep exit is treated as a fatal error.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At .sisyphus/notes/gemma4-baseline/run_scaling.sh, line 61:
<comment>`grep` in summary loop can abort the script under `set -e` when no lines match; non-zero grep exit is treated as a fatal error.</comment>
<file context>
@@ -0,0 +1,66 @@
+ echo "" >> $LOGDIR/SUMMARY.md
+ echo "### $tag" >> $LOGDIR/SUMMARY.md
+ echo '```' >> $LOGDIR/SUMMARY.md
+ grep -E "context_used|kv types|prefill.*tokens in|tok/s=|VRAM used|^\[mtp\] steps|^\[spec\]|GGML_ABORT|out of memory|cudaMalloc|^\[draft\] KV" $log >> $LOGDIR/SUMMARY.md
+ echo '```' >> $LOGDIR/SUMMARY.md
+done
</file context>
Suggested change:

- grep -E "context_used|kv types|prefill.*tokens in|tok/s=|VRAM used|^\[mtp\] steps|^\[spec\]|GGML_ABORT|out of memory|cudaMalloc|^\[draft\] KV" $log >> $LOGDIR/SUMMARY.md
+ grep -E "context_used|kv types|prefill.*tokens in|tok/s=|VRAM used|^\[mtp\] steps|^\[spec\]|GGML_ABORT|out of memory|cudaMalloc|^\[draft\] KV" $log >> $LOGDIR/SUMMARY.md || true
local tag=$1; local logfile=$LOGDIR/${tag}.log
shift
echo "=== $tag starting at $(date +%H:%M:%S) ===" | tee -a $LOGDIR/SUMMARY.md
./dflash/build/test_gemma4_dflash "$@" \
P2: Exit-code capture is broken: $? after cmd || true is always 0, so failures/OOMs are reported as rc=0 in SUMMARY.md.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At .sisyphus/notes/gemma4-baseline/run_scaling.sh, line 25:
<comment>Exit-code capture is broken: `$?` after `cmd || true` is always 0, so failures/OOMs are reported as rc=0 in SUMMARY.md.</comment>
<file context>
@@ -0,0 +1,66 @@
+ local tag=$1; local logfile=$LOGDIR/${tag}.log
+ shift
+ echo "=== $tag starting at $(date +%H:%M:%S) ===" | tee -a $LOGDIR/SUMMARY.md
+ ./dflash/build/test_gemma4_dflash "$@" \
+ --n-predict 256 --temp 0 --seed 0 --ignore-eos \
+ > $logfile 2>&1 || true
</file context>
# N2: target-only K=Q8 V=Q8 (control, expect coherent — was M2 in v2)
# N3: MTP K=Q8 V=TQ3 (the production ship target — measure accept_rate)
# N4: MTP K=Q8 V=Q8 (previous safe baseline — was M4 in v2; expect crash ~step 210)
set -e
P2: Missing pipefail lets the Python summary step fail silently while the script still reaches DONE.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At .sisyphus/notes/gemma4-baseline/run_matrix_v3.sh, line 8:
<comment>Missing `pipefail` lets the Python summary step fail silently while the script still reaches `DONE`.</comment>
<file context>
@@ -0,0 +1,97 @@
+# N2: target-only K=Q8 V=Q8 (control, expect coherent — was M2 in v2)
+# N3: MTP K=Q8 V=TQ3 (the production ship target — measure accept_rate)
+# N4: MTP K=Q8 V=Q8 (previous safe baseline — was M4 in v2; expect crash ~step 210)
+set -e
+cd /home/peppi/Dev/lucebox-hub
+export PATH=/usr/local/cuda-13.1/bin:$PATH
</file context>
# N2: target-only K=Q8 V=Q8 (control, expect coherent — was M2 in v2)
# N3: MTP K=Q8 V=TQ3 (the production ship target — measure accept_rate)
# N4: MTP K=Q8 V=Q8 (previous safe baseline — was M4 in v2; expect crash ~step 210)
set -e
P2: set -e causes expected failing cells to abort the whole matrix run, preventing full summary generation.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At .sisyphus/notes/gemma4-baseline/run_matrix_v3.sh, line 8:
<comment>`set -e` causes expected failing cells to abort the whole matrix run, preventing full summary generation.</comment>
<file context>
@@ -0,0 +1,97 @@
+# N2: target-only K=Q8 V=Q8 (control, expect coherent — was M2 in v2)
+# N3: MTP K=Q8 V=TQ3 (the production ship target — measure accept_rate)
+# N4: MTP K=Q8 V=Q8 (previous safe baseline — was M4 in v2; expect crash ~step 210)
+set -e
+cd /home/peppi/Dev/lucebox-hub
+export PATH=/usr/local/cuda-13.1/bin:$PATH
</file context>
# 64k context, TQ3 KV, pFlash on, dense 31B.
# Compare drafters: target-only vs MTP vs dflash.
set -e
cd /home/peppi/Dev/lucebox-hub
P2: Hardcoded absolute local path (/home/peppi/Dev/lucebox-hub) makes the script non-portable and causes immediate failure on any other machine due to set -e
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At .sisyphus/notes/gemma4-baseline/run_64k_drafter_ab.sh, line 5:
<comment>Hardcoded absolute local path (`/home/peppi/Dev/lucebox-hub`) makes the script non-portable and causes immediate failure on any other machine due to `set -e`</comment>
<file context>
@@ -0,0 +1,96 @@
+# 64k context, TQ3 KV, pFlash on, dense 31B.
+# Compare drafters: target-only vs MTP vs dflash.
+set -e
+cd /home/peppi/Dev/lucebox-hub
+export PATH=/usr/local/cuda-13.1/bin:$PATH
+
</file context>
Suggested change:

- cd /home/peppi/Dev/lucebox-hub
+ cd "$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)/../../.."
Adapts PR Luce-Org#129 (howard0su/swa) — sliding-window attention truncation for the qwen3 draft graph — to the gemma4 cached-KV draft layout.

draft graph (gemma4_dflash_graph.cpp):
* draft_swa_trunc_enabled() — opt-in via DFLASH_DRAFT_SWA_TRUNC=1. For SWA layers (the first n-1; the final layer stays full attention), restrict the K_full / V_full views to the last (sliding_window + n_tokens) slots and copy a contiguous mask slice (ggml_cont) for FA.
* draft_rope() — single wrapper around ggml_rope_ext used at all three draft RoPE sites. Optional YaRN scaling via DFLASH_DRAFT_YARN=1 (default n_ctx_orig=32768, override via DFLASH_DRAFT_YARN_NCTX_ORIG).

test harness (test_gemma4_dflash.cpp):
* --draft-swa-trunc CLI flag mirroring the env var.
* Bundles the bench-harness infrastructure that has been in progress on this branch: adaptive draft_max, --draft-kv-cap override, --mem-diag, --fa-window plumbing through build_gemma4_step.

Bench (RTX 3090, gemma-4-31B-Q4_K_M target + qwen3 5-layer draft, 50K-token prompt, n_predict=64, ctx=65536, NO_VMM=1):

 cap  | SWA | AL   | decode tok/s
------+-----+------+--------------
 2096 | off | 1.36 | 1.31
 2096 | on  | 1.73 | 1.68
 8192 | off | 1.02 | 4.29
 8192 | on  | 1.68 | 6.96  <-- +65% AL, +62% decode

The SWA truncation does not fix the underlying long-position acceptance collapse (the qwen3 draft model itself appears to have been effectively trained at <=32K positions; see the comment near the sliding re-prefill block in test_gemma4_dflash.cpp). It is a real partial improvement shippable today; the residual cliff needs a long-context drafter.

The diff is large because it bundles the pre-existing harness infrastructure noted above; happy to split if reviewers prefer.
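The slot-window arithmetic behind the truncation is simple. A sketch with illustrative sizes (in the real code the values come from GGUF metadata and the call site, not hardcoded constants):

```shell
# For an SWA layer, attend only to the most recent
# (sliding_window + n_tokens) cache slots instead of the full prefix.
sliding_window=4096
n_tokens=64
kv_len=50000

keep=$(( sliding_window + n_tokens ))
start=$(( kv_len > keep ? kv_len - keep : 0 ))
echo "SWA view: rows [$start, $kv_len), $keep slots"
# prints: SWA view: rows [45840, 50000), 4160 slots
```

The K_full / V_full views (and the matching mask slice) start at `start`, which is why the mask copy has to be position-aware.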
Update — SWA truncation port + YaRN opt-in (98f72c1)

Just pushed: adapts PR #129 (howard0su/swa, qwen3) to gemma4's cached-KV draft graph layout. Also touches the same surface as #140 / #141 — happy to coordinate with those. What's in:
Bench (RTX 3090, gemma-4-31B-Q4_K_M target + qwen3 5L draft, 50K-token prompt, n_predict=64, ctx=65536, NO_VMM=1):
Caveat: SWA truncation does not lift AL to "healthy" (~5+). Paired probe (prose-50k vs code-50k at ctx=65536) confirms the residual AL=1 cliff is purely position-driven, not content-distribution-driven — the qwen3 draft model itself appears to have been trained at ≤32K positions (matching the in-source comment about "32K → 64K acceptance collapse"). Real fix needs a long-context drafter; this port is a meaningful partial improvement that ships today. YaRN is included as an opt-in lever for users who want to experiment with draft-side RoPE scaling; default-off, no behavior change. |
The 32 GiB CUDA VMM pool reservation fragments badly inside the last few hundred MB on a 24 GiB card. Measured impact at ctx=65536 dense Q8/Q8 with the long-prompt code workload:

  default (VMM on):    1.79 tok/s (verify_ms p50 ~1738)
  GGML_CUDA_NO_VMM=1:  2.78 tok/s (verify_ms p50 ~975)   +55%

Prefill drops from 193 to 179 tok/s (-7%); the net win is decisive.

Auto-detect: cudaGetDeviceProperties.totalGlobalMem <= 25 GiB AND the user has not explicitly set GGML_CUDA_NO_VMM. Override with GGML_CUDA_NO_VMM=0 if you need VMM for some reason.

Banner: "[auto] GGML_CUDA_NO_VMM=1 set (GPU has N GiB; override with GGML_CUDA_NO_VMM=0)" so it is obvious in logs.
1 issue found across 1 file (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="dflash/test/test_gemma4_dflash.cpp">
<violation number="1">
P1: Auto-disabling VMM is wired to an unsupported runtime env-var contract, so the 24 GB GPU safeguard likely never affects ggml's CUDA backend.</violation>
</file>
The submodule's ggml_turbo_wht kernel writes dst using src strides
(turbo-wht.cu:20-21), so a non-contiguous Q (post ggml_permute) gets
scattered writes and corrupts the result. Always wrap with ggml_cont
BEFORE rotating, never after permute alone.
Six edit sites (target_graph SWA + full attn blocks; mtp_graph Q
rotate + O rotate sites):
ggml_tensor * Qfa = ggml_cont(ctx, ggml_permute(ctx, Qcur, ...));
if (q_rotate) Qfa = ggml_turbo_wht(ctx, Qfa, 0);
...
if (out_rotate) {
attn = ggml_cont(ctx, attn);
attn = ggml_turbo_wht(ctx, attn, 1);
}
This is the OUTER-REPO half of a path-asymmetric contract: graph
pre-rotates Q forward and post-rotates FA output backward; FA backends
(chunked, VEC) consume rotated K/V directly. The kernel-side half
lives in the submodule fork.
Verified: TQ3/TQ3 target-only on MoE 26B-A4B + humaneval_2 + ctx=4096
produces coherent prose matching the Q8 control's continuation:
Q8: 1106 6596 108 2063 102267 236779 5640 ...
TQ3: 1106 6596 ... (same logits, max=21.250@6596).
Plus optional cap-hint params on create_gemma4_cache /
create_draft_kv_cache (used by --draft-kv-cap and --mem-diag CLI
flags introduced in 98f72c1).
…clean

Rebases the fork from merge-base fae3a2807 (early May) onto current ggml-org/llama.cpp:master HEAD 0b047287f. Addresses PR Luce-Org#131 review comment r3214286746 from howard0su ("You may need to provide the patch for ggml").

Old pin: dusterbloom/llama-cpp-turboquant-cuda feature/tq3-kv-cache d758ed9bf — 12 commits ahead of fae3a2807, 1 of 12 patches applied clean to ggml/master.
New pin: dusterbloom/llama-cpp-turboquant-cuda feature/tq3-kv-cache-clean daef232a6 — 11 commits on top of ggml/master 0b047287f.

Commit triage applied during rebase (per user decisions in ~/.claude/plans/do-the-fork-rebase-vast-kurzweil.md):
- Dropped: debug probe commits (45e492b13, 3f65b59c4), the broken TQ3 MMA dequant intercept (580246202), and its safety gate (d758ed9bf).
- Squashed: WIP rotation kernel commit (e2af945b9) into the FWHT fuse commit (fd8710abc → 694cea5e1).
- New: TQ3_0 → F16 cpy dispatch (666df462d), graph-level FWHT contract refactor (6d5ca8c4b), fused MoE kernel (992aac8ac), force-chunked-for-TQ3 dispatcher fix (daef232a6).

Verified TQ3 chunked correctness post-rebase against the previous working baseline:
  next=6596 max=21.250@6596 (matches Q8 control)
  43.97 tok/s on n_predict=16, 6.58 tok/s on n_predict=64

Phase-1 upstream candidate: commit 14f90bf60 (cpy view_offs fix) applies clean to ggml/master.
Update — Submodule rebase complete
| Comment | Status |
|---|---|
| r3214286746 patch for ggml | ✅ This update — rebased + Phase-1 patches identified |
| r3214287342 quantize_gemma4_draft_q8.py refactor | ⏳ Separate refactor, not in this PR |
| r3214289240 f16_convert.cu purposely removed | ✅ Not reintroduced in the rebase |
| r3214290950 generic loader refactor | ⏳ High-priority refactor backlog |
| r3214291841 draft model is same as qwen3.5 | ✅ Confirmed empirically. Internal architectural audit (dflash/REFACTOR_DFLASH_UNIFIED.md proposed) shows the qwen3 and gemma4 drafts are the same 5L Qwen3-style block-diffusion; ~4-5 engineer-day unification refactor planned, not in this PR. |
| r3214292648 group gemma4 tests | ⏳ Process suggestion |
| cubic-dev-ai P2 bots (8 items) | All addressable in follow-ups |
Phase-1 upstream PR opened

The first generic bugfix from the rebased fork is now an open PR upstream:
This is the Phase-1 patch identified during the rebase. Once it lands upstream, we can drop the corresponding fork-side commit.
…CONFLICTING state

# Conflicts:
#   dflash/deps/llama.cpp
#   dflash/scripts/server.py
- errors.cpp: set_last_error returns thread-local copy (race fix)
- quantize_gemma4_draft_q8.py: validate non-empty target_layer_ids
- server.py: log GGUF detection failures explicitly
- gemma4_target_loader.cpp: validate tok_embd_sz divisibility
- gemma4_target_loader.cpp: free out.buf on load failure paths
- CMakeLists.txt: gate pFlash on minimum CUDA arch, not first
- gemma4_dflash_graph.cpp: bounds-check kv_start + n_tokens
- test_mtp_loader.cpp: assert exact donor-layer mapping
- gemma4_mtp_graph.cpp: validate centroid-head invariants

Fixes 1-7 were already implemented in 5fb516d. This commit adds the two remaining items:

Fix 8 (test_mtp_loader.cpp): strengthen Assertion 7 from a simple bounds check [0,60) to an exact semantic check: each MTP layer's donor_target_layer must equal 58 (last full-attn target layer, Dense 31B even-indexed) or 59 (last SWA target layer, odd-indexed) per the loader's own documented contract, not just any value in range.

Fix 9 (gemma4_mtp_graph.cpp): add four GGML_ASSERT guards at the top of the centroid-head construction block — n_vocab > 0, n_centroids > 0, n_vocab % n_centroids == 0, and top_k in [1, n_centroids] — preventing silent corruption or div-by-zero on mismatched vocab sizes or out-of-range top_k values.

Refs cubic comments: 3210041139, 3210041163, 3210041171, 3210041174, 3210041179, 3210041182, 3210041198, 3213492728, 3213492729.
You're iterating quickly on this pull request. To help protect your rate limits, cubic has paused automatic reviews on new pushes for now — when you're ready for another review, request one with a comment.
Previous commit e05afcd called setenv("GGML_CUDA_NO_VMM", "1") at runtime, but that env var is compile-time only — ggml's CMakeLists.txt converts it to add_compile_definitions, and the source guards on #if defined(GGML_USE_VMM), never on getenv.

This commit:
- Replaces the no-op setenv block with runtime detection that emits a clear warning when <=24 GiB CUDA devices are seen and the binary was built without -DGGML_CUDA_NO_VMM=ON.
- Adds a small-VRAM build hint to dflash/README.md.

Refs cubic comment: 3214928808 (P1).
Merge quantize_gemma4_draft_q8.py into quantize_draft_q8.py with
--arch {qwen,gemma4} flag. Auto-detects from config.json's model_type
when present, falls back to per-arch hardcoded defaults. Removes the
duplicate script (367 lines deleted, 233 added — net -134 lines).
Q8_0 scale formula and tensor-name mapping were already identical
between the two scripts, so the merge is a clean parameterisation
behind two _ARCH_DEFAULTS dicts and one model_type sniffer.
Per howard0su's review on PR Luce-Org#131 (r3214287342).
…sion

Howard0su explicitly removed this file from the codebase a few days ago, noting that ggml's built-in convert (ggml_cpy with target dtype) does the same thing. We re-added it as part of the gemma4 work; this commit removes it again and replaces 10 call sites (5 distinct blocks: daemon prefill, standard prefill, single-slot warmup, sliding re-prefill, spec-decode commit) with copy_target_feat_bf16_to_f32(), which uses ggml_cpy(ctx, bf16_view, f32_view) — dispatched to ggml-cuda's native BF16→F32 path in cpy.cu.

ggml_view_2d with a byte offset handles the ring-buffer slot arithmetic that was previously raw pointer math; ggml_backend_graph_compute synchronises internally, so cudaDeviceSynchronize is no longer needed.

Per howard0su's review on PR Luce-Org#131 (r3214289240).
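Why a plain cpy path suffices here: BF16 to F32 is a pure bit widening, since BF16 is the top 16 bits of an IEEE-754 float32. A quick illustration of the bit-level relationship (not the ggml code itself):

```shell
# Widening BF16 -> F32 is a 16-bit left shift with zero fill; no arithmetic
# conversion is needed, which is why a native copy kernel handles it.
bf16=0x3FC0                      # bf16 bit pattern for 1.5
f32=$(( bf16 << 16 ))            # low 16 bits zero, value preserved
printf "f32 pattern: 0x%08X\n" "$f32"    # 0x3FC00000 is float32 1.5
```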
Move 8 gemma4-specific test files into a dedicated subdirectory:
  test_gemma4_dflash.cpp
  test_gemma4_kv_tq3.cpp
  test_mtp_graph_shapes.cpp
  test_mtp_loader.cpp
  smoke_gemma4_draft_forward.cpp
  smoke_gemma4_target_forward.cpp
  smoke_load_gemma4_draft.cpp
  smoke_load_gemma4_target.cpp

dflash/test/test_flash_attn_sparse.cpp stays at the root (it's generic, no gemma4 references). dflash/CMakeLists.txt source paths updated; cmake configure passes (ggml commit daef232a6 detected, build files regenerated).

Per howard0su's review on PR Luce-Org#131 (r3214292648).
Implements multi-token chained MTP speculative decoding for Gemma4
Dense 31B + assistant. Previously γ=1 was a correctness gate that
ran the drafter after every target forward without batched-verify,
yielding *slower* throughput than no-MTP (25.28 vs 26.69 tok/s).
At γ=4, the chained path now amortizes drafter cost across K+1
batched target verifications.
Architecture matches Google's HF reference (modeling_gemma4_assistant.py):
- capture point post-final-RMSNorm of last layer (`gemma4_target_graph.cpp`)
- concat [token_emb, h_target] → pre_projection
- 4 drafter blocks with Q-only cross-attn into target's KV
- constant position_ids within a chain (empirically beats incrementing)
- autoregressive h_post → h_prev feedback between chain steps
CLI surface:
- --gamma N (1..16): chain length. γ=1 routes through the existing
correctness-gate loop unchanged. γ>1 routes through the new path.
- --mtp-pos-mode {const,incr}: A/B falsification harness for the
position-semantic question. const is default (matches Google).
- γ>1 + stochastic sampling is currently fatal (--temp must be 0);
Leviathan-style rescaling for stochastic γ>1 is out of scope here.
Phase 2 correctness fix: `cache.mtp_h_prev_row` parameterizes the
slice index in the capture site (`gemma4_target_graph.cpp:1106-1123`).
γ=1 leaves the sentinel -1 (n_tokens-1, current behavior). γ>1 uses
approach A: a 1-token re-capture target forward when accept_drafts < K
to refresh the hidden state at the last accepted position.
Measured (Dense 31B + TQ3/TQ3 KV, 4K ctx, n_predict=256, greedy):
M1 no-MTP : 18.92 tok/s
M3 γ=1 : 18.64 tok/s, accept 0.660
γ=1 regression : 19.04 tok/s, accept 0.660 (regression-free)
γ=2 const : 16.82 tok/s, accept 0.441 (chain overhead > savings)
γ=4 const : 20.75 tok/s, accept 0.478 (+9.7% over M1)
γ=4 incr : 20.59 tok/s, accept 0.464 (const empirically wins)
Approach A's per-chain re-capture costs ~40 ms; switching to approach B
(multi-row mtp_h_prev capture in one verify pass, no re-capture) is
projected to push γ=4 to ~32 tok/s = 1.7× M1. Tracked separately.
Evidence logs in .sisyphus/notes/gemma4-baseline/mtp-gamma/.
Submodule pin moves from daef232a6 to 6715acf13 (TQ3 debug printf removal).
Plan: /home/peppi/.claude/plans/wild-growing-ember.md
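As a rough model of why the chain amortizes drafter cost (an i.i.d. per-token acceptance approximation, not the measured chain dynamics): with per-token acceptance p and chain length gamma, one batched verify commits on average (1 - p^(gamma+1)) / (1 - p) tokens:

```shell
# Expected committed tokens per verify pass for the measured gamma=4
# acceptance p = 0.478; illustrative, assumes independent per-token accepts.
awk -v p=0.478 -v g=4 'BEGIN {
    printf "expected tokens/verify = %.2f\n", (1 - p^(g+1)) / (1 - p)
}'
# prints: expected tokens/verify = 1.87
```

So each target pass commits ~1.87 tokens instead of 1, and the chain wins whenever the drafter's extra forwards cost less than the ~0.87 saved target forwards, which is broadly consistent with the measured +9.7% at gamma=4.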
Sweep matrix on Dense 31B + TQ3/TQ3 KV, γ ∈ {1,2,4,8}, ctx ∈ {4K,16K,64K}.
Decode tok/s:
no-MTP γ=1 γ=2 γ=4 γ=8
4K 19.63 19.18 20.21 16.54 16.54
16K 12.99 10.20 8.40 5.37 6.86
64K 6.55 5.54 8.42 6.54 5.33
Three headline findings:
1. γ=2 wins at both 4K (+3%) and 64K (+29%) over no-MTP.
2. Long-context MTP accept rate is HIGH (0.69–0.73 at 64K), contradicting
the pre-rebase 0.02 figure documented in OPEN_QUESTIONS.md. The
submodule Bug-2 fix (daef232a6) plus the γ>1 chain produces the high
accept regime. OPEN_QUESTIONS.md should be updated.
3. 16K is a "dead zone" — every MTP config loses to no-MTP at 16K.
γ=2 accept drops to 0.33 there. Worth investigating separately.
γ=4 and γ=8 collapse at 64K because approach A's re-capture single-token
target forward reads ~64K KV slots (~80 ms cost), paid per chain with
partial accept. Approach B (multi-row mtp_h_prev capture, no re-capture)
is projected to flip γ=4 and γ=8 from losses to wins at long context.
That's the next implementation step.
User-facing recommendation table (until approach B lands):
≤8K : --draft-method mtp --gamma 2
8K–32K : no MTP
≥32K : --draft-method mtp --gamma 2
Logs: .sisyphus/notes/gemma4-baseline/mtp-gamma/phase4/
Script: .sisyphus/notes/gemma4-baseline/mtp-gamma/run_sweep.sh
Replaces the per-chain re-capture single-token target forward (approach A)
with a batched capture into mtp_h_prev_batch [n_embd, 17] during the verify
call. Host-side picks column accept_drafts after greedy match and copies
it into mtp_h_prev for the next MTP chain.
Result: the re-capture cost was approach A's dominant per-chain overhead
at long context (~80 ms at 64K, ~30 ms at 16K). Removing it flips the
γ × ctx matrix decisively in favor of MTP at every context.
Decode tok/s sweep (Dense 31B + TQ3/TQ3 KV, n_predict=64, --temp 0):
no-MTP γ=1 γ=2 γ=4 γ=8
4K A 19.63 19.18 20.21 16.54 16.54
4K B 18.40 18.46 25.10 25.58 22.61 γ=4: +39% over no-MTP
16K A 12.99 10.20 8.40 5.37 6.86
16K B 12.43 9.79 12.56 13.06 9.33 γ=4: +5% (was -35%)
64K A 6.55 5.54 8.42 6.54 5.33
64K B 6.26 5.31 10.07 8.29 7.58 γ=2: +61% (was +29%)
The "16K dead zone" documented in Phase 4-A RESULTS.md was entirely the
re-capture overhead; approach B eliminates it. γ=4 at 16K now slightly
beats no-MTP.
γ=1 path is regression-safe: 18.46 tok/s vs the 18.64 M3 baseline (−1%,
within noise), accept_rate identical at 0.66. γ=1 never sets
mtp_h_prev_capture_mode = 1, so the batch-capture branch is dead code on
that path.
Files:
- dflash/src/internal.h: add mtp_h_prev_batch tensor + mtp_h_prev_capture_mode
- dflash/src/gemma4_target_graph.cpp: branch in capture site, write all
n_tokens rows when mode==1 instead of slicing one
- dflash/test/gemma4/test_gemma4_dflash.cpp: allocate mtp_h_prev_batch,
set capture_mode=1 around the γ>1 verify call, replace re-capture with
21 KB host-side staging copy
User-facing recommendation (revised):
≤8K --draft-method mtp --gamma 4 ≈ 25 tok/s
8K–32K --draft-method mtp --gamma 4 ≈ 13 tok/s
32K+ --draft-method mtp --gamma 2 ≈ 10 tok/s at 64K
Evidence:
- .sisyphus/notes/gemma4-baseline/mtp-gamma/phase3p5/ — three acceptance tests
- .sisyphus/notes/gemma4-baseline/mtp-gamma/phase4-b/ — full re-sweep + RESULTS.md
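The revised recommendation table can be encoded as a tiny hypothetical helper; the thresholds and values mirror the table above, and nothing here is the repo's actual API:

```python
def recommended_mtp_config(ctx_tokens):
    """Hypothetical helper encoding the user-facing gamma recommendations above."""
    if ctx_tokens <= 8 * 1024:
        return {"draft_method": "mtp", "gamma": 4}  # ~25 tok/s
    if ctx_tokens <= 32 * 1024:
        return {"draft_method": "mtp", "gamma": 4}  # ~13 tok/s
    return {"draft_method": "mtp", "gamma": 2}      # ~10 tok/s at 64K
```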
Matching update to the public Pages bundle just pushed. The site's prior claim that "MTP is uneconomic at 64K, ≤2% accept, open investigation" was pre-rebase noise. Post-rebase + γ>1 chain, the real numbers are:
- 4K γ=4: 25.58 tok/s (+39% over no-MTP)
- 16K γ=4: 13.06 tok/s (dead zone resolved)
- 64K γ=2: 10.07 tok/s (+61% over no-MTP, 0.73 accept rate)
Sections affected: summary coda, M3/V2 0.02-accept rows, hard-failures list, timeline (May 10 eve + May 11 AM), Open Questions (split into resolved vs still-open), new §γ>1 MTP, TOC entries.
…ted claims

Single source of truth: EVIDENCE.md + site rewrite based on one-variable-at-a-time (OVAT) measurements at the MoE 64K code reference cell, plus the Q8 KV-ceiling sweep on MoE 128K-512K and Dense 32K-64K, plus the triple-falsification of MoE 1M DFlash (cold-fresh = 26 ± 5 tok/s, not 4.86 as the sweep-churn outlier suggested).

Findings that survived rigour:
- Q8 + pflash + DFlash dm=4 is the MoE long-ctx winner: 60-67 tok/s, 6-8 J/tok at 128K-512K.
- TQ3 only earns its place above 512K (Q8 pages at 1M).
- MTP γ=2 is the right drafter for Dense >=32K where DFlash would breach VRAM.

Inflated claims walked back (cited in EVIDENCE.md §2.2):
- "γ=2 at MoE 64K = +61% over no-MTP" — was vs a TQ3-handicapped baseline. Vs naive Q8: tied (-0.6%). DFlash dm=8 is the actual winner (+49%).
- "TQ3 is 2.5× faster than Q8 at Dense 64K" — conflated MoE OVAT with Dense. Withdrawn.
- "DFlash collapses at MoE 1M" — sweep-churn artifact; cold = 26 ± 5 tok/s.
- "MTP uneconomic at 64K, <=2% accept" (v1 dossier) — pre-Bug-2 noise. Post-rebase: 0.55-0.78 accept.

Scope: 414 file changes: EVIDENCE.md + EVIDENCE.v1.md.bak archival + site/index.html rewrite + 14 sweep harnesses + sweep dirs covering OVAT, the Q8 ceiling, the triple-falsification, paired-prompt code/prose matrices, and the MoE scientific sweep with energy telemetry. The public site at dusterbloom/gemma4-evidence is already updated (commit e1310b6); this commit lands the matching state on the PR branch.
…r history)

Upstream rebased feature/tq3-kv-cache onto a clean linear history and named it feature/tq3-kv-cache-clean. Same effective state — the top commit is still "chore(fattn): remove stray TQ3-DEQ debug printf in chunked dequant" — just with a tighter base.
# Conflicts:
#   dflash/CMakeLists.txt
#   dflash/scripts/server.py
Joule integration over paired-matrix/power/*.csv during the decode window gives 33.79 J/tok at 64K and 34.50 J/tok at 128K. Cells were previously marked "awaiting" pending power instrumentation; the paired-matrix sweep captured the watt traces. The MoE 1M DFlash Q8-pages cell remains "awaiting" — the most recent measurement was on TQ3/TQ3 (39.91 J/tok decode), not Q8 pages, so the row's labeled config is still unmeasured.
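The J/tok computation described here amounts to trapezoid integration of the watt trace over the decode window, divided by tokens decoded. A minimal sketch, assuming a two-column (seconds, watts) CSV; the function name and CSV layout are assumptions, not the repo's exact harness:

```python
import csv

def joules_per_token(power_csv, t_start, t_end, n_tokens):
    """Integrate a (t_seconds, watts) trace over [t_start, t_end] with the
    trapezoid rule, then divide by the number of tokens decoded."""
    ts, ws = [], []
    with open(power_csv) as f:
        for t, w in csv.reader(f):
            t, w = float(t), float(w)
            if t_start <= t <= t_end:
                ts.append(t)
                ws.append(w)
    joules = sum((ts[i + 1] - ts[i]) * (ws[i] + ws[i + 1]) / 2
                 for i in range(len(ts) - 1))
    return joules / n_tokens
```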
The .sisyphus/ tree (sweep logs, power CSVs, EVIDENCE drafts, site/index.html, plans, journey notes) was inflating the PR diff to 91,927 lines across 486 files. The dossier is preserved off-repo at ~/lucebox-hub-evidence/feature-gemma4-support-*/ and is now gitignored so future sweep runs do not re-track it. Net effect: the PR drops from 512 files / +91,927 lines to ~24 files of actual code (dflash src/test/scripts, CMakeLists, README, submodule pin, .gitignore).
Gemma4 support: pFlash + DFlash + chunked prefill, daemon mode, server routing
Summary
Brings the Gemma4 family (31B Dense, 26B-A4B MoE) to production parity with Qwen3.5: chunked batched prefill, DFlash speculative decoding, pFlash block-sparse attention via a new ggml op, and a daemon mode wired into
`scripts/server.py` so the OpenAI-compatible HTTP server can serve Gemma4 with `--pflash`.

Update 2026-05-10 — post-fix benchmarks
Three commits in this branch fix correctness bugs that were holding back the original numbers:
- `d758ed9bf` (submodule) `fix(fattn): force chunked path for TQ3 K to avoid MMA-intercept FWHT mismatch` — the MMA dequant intercept was dequantising TQ3_0 K to F16 without inverting the FWHT rotation that quantisation applies on write. The chunked / vec FA paths handle this correctly (they forward-rotate Q to match); the MMA intercept silently bypassed those hooks. Fix: force chunked for ALL TQ3 cases unless `DFLASH_TQ3_MMA` opts in. This is the fix for the "TQ3_0 + chunked prefill produces token 0" item in the original Known Limitations section.
- `7b62c07` (parent) `fix(gemma4): allocate+fill SWA mask for n_tokens==1 decode + bump llama.cpp` — the SWA mask was only allocated for `n_tokens > 1` ("batched prefill only" per the comment), so single-token decode fell back to `attn_mask` sized for the full kv_len padding. The K view was 2048 wide, the mask was 256 wide — the dimensional mismatch let FA read past the populated region into uninitialised cudaMalloc bytes. Catastrophic with TQ3_0 (it turned the cache noise into a `<unused94>`/`'en'` repetition loop), benign-but-wrong with Q8_0.
- `f1f811e` (parent) `fix(mtp): always provide FA mask for head_dim>=512 (any K type)` — the CUDA MMA dispatcher's `gqa_opt_applies` requires both `K->ne[1] % FATTN_KQ_STRIDE == 0` AND `mask != nullptr`. The previous predicate `(kv_is_tq3 && head_dim_fa >= 512) || needs_kv_pad` left Q8 + already-aligned KV with no mask, so the dispatcher returned `BEST_FATTN_KERNEL_NONE` and aborted at `fattn.cu:659`. Reproduced as M4 (MTP+Q8/Q8) crashing at step ~210 in the original benchmarks. Fix: always set `need_mask` when `head_dim_fa >= 512`.

New benchmarks (RTX 3090, single GPU, Q4_K_M weights, Q8_0 KV unless noted)
Gemma-4-26B-A4B MoE — pFlash + DFlash with `--draft-max 4`

The PR's original report had the 64K MoE cell at decode 13 tok/s, accept 1.23/16, because the implicit `--draft-max 16` over-speculated at long context. A `--draft-max ∈ {1, 2, 4, 8}` sweep on a 50K-token code prompt (concatenated HumanEval+ tasks) finds dm=4 is the optimum for MoE. With dm=4 the same model + same drafter at 64K runs at 36.57 tok/s — 2.8× the previously published number. dm=4 also holds at 256K.

256K fits a 24 GB RTX 3090 with ~2.3 GB of headroom. Decode tok/s is essentially flat from 64K to 256K (variance ±5%) because actual KV usage is held constant at 50K and only cache-allocation overhead scales — confirming that long-ctx decode is bound by per-step KV bandwidth at the populated position count, not by the cache-size allocation.
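A back-of-envelope sketch of the bandwidth argument: per-step KV bytes depend only on populated positions, not on the allocated cache size. All parameters below are illustrative assumptions, not Gemma4's exact per-layer head mix:

```python
# Illustrative model of why decode tok/s is flat from 64K to 256K allocated
# context when the populated KV is held at 50K: each decode step reads only the
# POPULATED slots. Layer/head/dim/byte values are assumptions for the sketch.
def kv_bytes_per_step(populated_tokens, n_layers=48, n_kv_heads=8, head_dim=256,
                      bytes_per_elem=1.0625):  # ~Q8_0 incl. scale overhead
    per_token = n_layers * n_kv_heads * head_dim * 2 * bytes_per_elem  # K + V
    return populated_tokens * per_token
```

Note that the allocated cache size never appears: `kv_bytes_per_step(50_000)` is the same whether the cache was sized for 64K or 256K, which matches the flat decode observed.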
Gemma-4-31B Dense — DFlash + Q8/Q8 + pFlash, code prompt

For long-context production on a 24 GB GPU, MoE 26B is the model. The dense 31B can run at 64K target-only, but the dflash drafter pushes it into the swap regime and decode collapses. The MTP drafter unblocks itself at small ctx with the head_dim=512 fix (4K + Q8/Q8 + MTP runs to 256 steps with accept_rate 0.87, decode 34.4 tok/s), but no MoE MTP head exists yet.
Drafter quality is a function of (drafter × target × prompt distribution)
The dflash drafter is trained on code-distribution data. On a creative-writing prompt (`long_open`: "Write a short story about a robot…") the drafter accept rate collapses to ~0.20–0.25; on a HumanEval prompt the same drafter at the same context and config runs at 0.64–0.67. Bench numbers must therefore specify the prompt distribution. PR #131's original 10.67/16 was on code prompts (it matches our 6.56/16 dm=16 on HumanEval/2 within task variance); on creative writing the same drafter would have shown the same collapse.

Updated Known Limitations (vs the original list)
- `TQ3_0 + chunked prefill produces token 0 (independent of pflash)` → RESOLVED by submodule commit `d758ed9bf` (force chunked for all TQ3 cases) and parent commit `7b62c07` (proper SWA mask for n_tokens==1 decode).
- `DFlash spec acceptance drops at 64K (avg 1.23/16 vs 10.67/16 at 32K)` → MITIGATED. The root cause was a combination of (a) over-speculation — `--draft-max 16` is too aggressive at long ctx for MoE — and (b) the drafter's 2096-slot KV cap, which means it sees only the last ~4% of a 50K prompt. Switching to `--draft-max 4` recovers 36.57 tok/s at 64K and 35.30 tok/s at 256K. Drafter cache size remains an open improvement.

How to reproduce
The bench scripts and BPE-tokenised prompts are in this branch under `.sisyphus/notes/gemma4-baseline/`:

- `prompts/{short_chat,long_open,long_2k,long_50k,long_code_50k,humaneval_2}.txt` — comma-separated token IDs (the HF `google/gemma-3-27b-it` tokeniser is byte-identical to the GGUF; verified that `vocab_size=262144 BOS=2 EOS=106` match).
- `run_dm_sweep.sh` — the dm sweep that gave the 64K dm=4 result above.
- `run_scaling.sh` — the 16K → 256K MoE ladder.
- `run_64k_v2.sh` — the post-fix dense 64K matrix.

Each runner emits a `SUMMARY.md` with prefill tok/s, decode tok/s, AL, and VRAM per cell; raw per-cell logs are reproducible.

A longer write-up of the debugging journey is in `.sisyphus/notes/gemma4-journey.md` for anyone wanting the narrative behind these numbers.

Original benchmark section
(retained for history; see "Update 2026-05-10" above for the post-fix numbers)
Gemma-4-31B Dense
Gemma-4-26B-A4B MoE — pFlash big wins at long context
* Decode at 64K shows low DFlash acceptance at the original `--draft-max 16` budget; the post-fix numbers above show this lifts to 36.57 tok/s with `--draft-max 4`. Prefill speedup is unchanged.

VRAM stays under 21 GB at 64K on the 24 GB RTX 3090.
Speedup scaling
pFlash savings scale with KV-len because attention block selection skips blocks proportional to context length. The progression (4.5% → 16.6% → 53.7% → 101.7%) matches the design.
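The proportionality can be illustrated with a toy block-selection model. The block size and block budget below are assumptions for the sketch, not pFlash's actual parameters: with a bounded number of blocks kept per query, the skipped fraction rises toward 1 as KV length grows.

```python
# Toy model of block-sparse attention savings: selection keeps at most
# max_blocks_kept KV blocks per query, so longer contexts skip a larger
# fraction of blocks. Parameters are illustrative, not pFlash's.
def skipped_fraction(kv_len, block_size=64, max_blocks_kept=128):
    n_blocks = (kv_len + block_size - 1) // block_size
    kept = min(n_blocks, max_blocks_kept)
    return 1.0 - kept / n_blocks
```

At 4K context nothing is skipped (the budget covers every block); by 64K the model skips most blocks, which is the regime where the measured speedup approaches 100%.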
Highlights
New `GGML_OP_FLASH_ATTN_SPARSE` ggml op (submodule)

- `ggml-cuda/fattn-sparse.cu` with S↔H transpose for ggml ↔ pFlash layout conversion
- falls back to `ggml_flash_attn_ext` when no kernel is registered
- `ggml_get_to_fp16_cuda` dequant before the sparse path
- commits: `5be140d feat(ggml): add GGML_OP_FLASH_ATTN_SPARSE ...`, `866688b feat(ggml-cuda): dequantize K/V to F16 in sparse FA path`

pFlash + DFlash + chunked prefill on Gemma4
- `ggml_flash_attn_sparse` wired into `build_full_attn_block()` (full-attention layers only; SWA layers stay on dense FA)
- `pflash_ggml_adapter` with `pflash_supports = {F16, Q8_0, Q4_0}`; TQ3_0 falls back to dense
- `last_token_logits_only` ported from upstream PR "perf: Replace Q8_0 format for KV with Q4_0 + Rotation, fix window_filled for long context" #108 — saves an ~1 GB output tensor and ~1000× lm_head compute per non-last prefill chunk

Major correctness fix: SWA mask coordinate frame
The host-built SWA causal mask was in absolute KV coordinates but the FA kernel reads it indexed by view position. For every prefill chunk where
`kv_start > 0`, the mask was misaligned by `ring_win_start` columns → the kernel saw all `-inf` for SWA layers → softmax NaN. 236770 (broken) to 236799 (correct) after this fix.

The fix introduces a shared
`compute_swa_view()` helper used by both the graph builder and the test driver so the K view + mask stay in lockstep.

Daemon mode + server routing
- `test_gemma4_dflash --daemon` mirrors the IPC protocol of `test_dflash`: line-based stdin commands, int32 LE token stream on `--stream-fd=N`, `-1` sentinel
- `scripts/server.py` detects the GGUF architecture; routes to `test_gemma4_dflash` for `gemma4`, keeps `test_dflash` for Qwen3.5
- `--pflash` server flag passes through to the daemon
- `-1` sentinel streamed on fd=3 with 26B-A4B + 4096-token prompt

Run the server
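On the client side, the daemon's int32 little-endian token stream with its `-1` sentinel can be consumed with a few lines. This is a hypothetical sketch — the function and its use of a file-like stream are illustrative, not the repo's actual client:

```python
import struct

def read_tokens(stream):
    """Read int32 little-endian tokens from a binary stream until the -1
    sentinel, returning the tokens received (sentinel excluded)."""
    tokens = []
    while True:
        buf = stream.read(4)
        if len(buf) < 4:
            raise EOFError("stream closed before -1 sentinel")
        (tok,) = struct.unpack("<i", buf)
        if tok == -1:
            return tokens
        tokens.append(tok)
```

In practice the stream would be the pipe dup'd onto the daemon's `--stream-fd`.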
Test plan
- `test_flash_attn_sparse` (TDD: dense vs sparse @ alpha=1.0 within BF16 tolerance; alpha<1.0 liveness) — passes
- `test_gemma4_dflash --pflash` at 4K/8K Q8 on 31B Dense — output matches baseline at 4K, sparse approx at 8K
- `-1` sentinel streamed on fd=3

Submodule
The submodule pointer is bumped to `dusterbloom/llama-cpp-turboquant-cuda` branch `feature/tq3-kv-cache`. The `.gitmodules` URL was updated to that fork because the upstream `Luce-Org/llama.cpp-dflash-ggml.git` doesn't have these commits. The maintainer can rewrite the URL post-merge if the commits get mirrored to a Luce-Org repo.