Gemma4 support: pFlash + DFlash + chunked prefill, daemon mode, server routing #131

Open
dusterbloom wants to merge 60 commits into Luce-Org:main from dusterbloom:feature/gemma4-support

Conversation

dusterbloom (Contributor) commented May 8, 2026

Gemma4 support: pFlash + DFlash + chunked prefill, daemon mode, server routing

Update 2026-05-10 — three correctness fixes landed in this branch unblock TQ3 KV across all contexts and recover the 64K MoE drafter, lifting decode there from 13 → 36.57 tok/s. New section below: "Update 2026-05-10". The original benchmarks remain for history.

Summary

Brings the Gemma4 family (31B Dense, 26B-A4B MoE) to production parity with Qwen3.5: chunked batched prefill, DFlash speculative decoding, pFlash block-sparse attention via a new ggml op, and a daemon mode wired into scripts/server.py so the OpenAI-compatible HTTP server can serve Gemma4 with --pflash.

Update 2026-05-10 — post-fix benchmarks

Three commits in this branch fix correctness bugs that were holding back the original numbers:

  • d758ed9bf (submodule) fix(fattn): force chunked path for TQ3 K to avoid MMA-intercept FWHT mismatch — the MMA dequant intercept was dequantising TQ3_0 K to F16 without inverting the FWHT rotation that quantisation applies on write. The chunked / vec FA paths handle this correctly (they forward-rotate Q to match); the MMA intercept silently bypassed those hooks. Fix: force chunked for ALL TQ3 cases unless DFLASH_TQ3_MMA opts in. This is the fix for the "TQ3_0 + chunked prefill produces token 0" item in the original Known Limitations section.

  • 7b62c07 (parent) fix(gemma4): allocate+fill SWA mask for n_tokens==1 decode + bump llama.cpp — the SWA mask was only allocated for n_tokens > 1 ("batched prefill only" per the comment), so single-token decode fell back to attn_mask sized for the full kv_len padding. The K view was 2048 wide, the mask was 256 wide — dimensional mismatch let FA read past the populated region into uninitialised cudaMalloc bytes. Catastrophic with TQ3_0 (turned the cache noise into a <unused94>/'en' repetition loop), benign-but-wrong with Q8_0.

  • f1f811e (parent) fix(mtp): always provide FA mask for head_dim>=512 (any K type) — the CUDA MMA dispatcher's gqa_opt_applies requires both K->ne[1] % FATTN_KQ_STRIDE == 0 AND mask != nullptr. The previous predicate (kv_is_tq3 && head_dim_fa >= 512) || needs_kv_pad left Q8 + already-aligned KV with no mask, so the dispatcher returned BEST_FATTN_KERNEL_NONE and aborted at fattn.cu:659. Reproduced as M4 (MTP+Q8/Q8) crashing at step ~210 in the original benchmarks. Fix: always set need_mask when head_dim_fa >= 512.
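For the third fix, a minimal sketch of the predicate change; identifier names follow the bullet above and are illustrative, not the literal dispatch code:

```cpp
// Hedged sketch of the f1f811e predicate change (names are illustrative).
static bool fa_need_mask(bool kv_is_tq3, bool needs_kv_pad, int head_dim_fa) {
    // Old predicate: Q8 KV with head_dim_fa >= 512 and an already-aligned cache got
    // no mask, so gqa_opt_applies failed and the dispatcher returned
    // BEST_FATTN_KERNEL_NONE:
    //   return (kv_is_tq3 && head_dim_fa >= 512) || needs_kv_pad;
    (void) kv_is_tq3; // the K type no longer matters for this decision
    return head_dim_fa >= 512 || needs_kv_pad;
}
```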

New benchmarks (RTX 3090, single GPU, Q4_K_M weights, Q8_0 KV unless noted)

Gemma-4-26B-A4B MoE — pFlash + DFlash with --draft-max 4

PR's original report had the 64K MoE cell at decode 13 tok/s, accept 1.23/16 because the implicit --draft-max 16 over-speculated at long context. A --draft-max ∈ {1, 2, 4, 8} sweep on a 50K-token code prompt (concatenated HumanEval+ tasks) finds dm=4 is the optimum for MoE. With dm=4 the same model + same drafter at 64K runs at 36.57 tok/s — 2.8× the previous published number. dm=4 also holds at 256K.

| Context | --draft-max | Decode tok/s | AL | VRAM |
| --- | --- | --- | --- | --- |
| 4K (HumanEval/2) | 4 | 111.92 | 2.64 | 18.95 GB |
| 16K (long_2k Shakespeare) | 4 | 72.32 | 1.66 | 19.27 GB |
| 32K (long_2k) | 4 | 70.54 | 1.66 | 19.45 GB |
| 64K (long_code_50k) | 4 | 36.57 | 1.79 | 19.74 GB |
| 128K (long_code_50k) | 4 | 35.21 | 1.77 | 20.40 GB |
| 256K (long_code_50k) | 4 | 35.30 / 36.63 | 1.79 | 21.73 GB |

256K fits a 24 GB RTX 3090 with ~2.3 GB headroom. Decode tok/s is essentially flat from 64K to 256K (variance ±5%) because actual KV usage is held constant at 50K tokens and only the cache-allocation overhead grows — confirming that long-context decode is bound by per-step KV bandwidth at the populated position count, not by the allocated cache size.

Gemma-4-31B Dense — DFlash + Q8/Q8 + pFlash, code prompt

| Context | --draft-max | Decode tok/s | AL | VRAM | Notes |
| --- | --- | --- | --- | --- | --- |
| 4K (HumanEval/2) | 8 | 56.12 | 5.12 | 22.16 GB | code prompt |
| 4K (HumanEval/2) | 16 | 97.81 | 6.56 | 22.18 GB | matches PR's 149/16 reference within task variance |
| 4K (creative long_open) | 8 | 23.77 | 2.13 | 22.10 GB | drafter is OOD on creative writing — see "prompt distribution" note below |
| 64K (long_code_50k) | 8 | 1.78 | 1.94 | 24.00 / 24.00 GB | VRAM saturated, paging; dense + drafter + Q8/Q8 + 64K is not viable on 24 GB |
| 64K (target-only, Q8/Q8) | n/a | 7.96 | n/a | 22.60 GB | without drafter, Q8/Q8 dense at 64K fits |
| 64K (target-only, TQ3/TQ3) | n/a | 6.90 | n/a | 21.25 GB | TQ3 minimum-VRAM dense at 64K |

For long-context production on a 24 GB GPU, MoE 26B is the model to use. The dense 31B can run at 64K target-only, but the dflash drafter pushes it into the swap regime and decode collapses. The MTP drafter is unblocked at small contexts by the head_dim=512 fix (4K + Q8/Q8 + MTP runs to 256 steps with accept_rate 0.87, decode 34.4 tok/s), but no MoE MTP head exists yet.

Drafter quality is a function of (drafter × target × prompt distribution)

The dflash drafter is trained on code-distribution data. On a creative-writing prompt (long_open: "Write a short story about a robot…") the drafter accept rate collapses to ~0.20–0.25; on a HumanEval prompt the same drafter at the same context and config runs at 0.64–0.67. Bench numbers must specify prompt distribution. PR #131's original 10.67/16 was on code prompts (matches our 6.56/16 dm=16 on HumanEval/2 within task variance); on creative writing the same drafter would have shown the same collapse.

Updated Known Limitations (vs the original list)

  • TQ3_0 + chunked prefill produces token 0 (independent of pflash): RESOLVED by submodule commit d758ed9bf (force chunked for all TQ3 cases) and parent commit 7b62c07 (proper SWA mask for n_tokens==1 decode).
  • DFlash spec acceptance drops at 64K (avg 1.23/16 vs 10.67/16 at 32K): MITIGATED. Root cause was a combination of (a) over-speculation: --draft-max 16 is too aggressive at long ctx for MoE, and (b) the drafter's 2096-slot KV cap, which means it sees only the last ~4% of a 50K prompt. Switching to --draft-max 4 recovers 36.57 tok/s at 64K and 35.30 tok/s at 256K. Drafter cache size remains an open improvement.
  • NEW: dense 31B + drafter + Q8/Q8 + 64K saturates 24 GB and decode collapses to ~2 tok/s. Use MoE 26B for long-context on 24 GB consumer GPUs.

How to reproduce

The bench scripts and BPE-tokenised prompts are in this branch under .sisyphus/notes/gemma4-baseline/:

  • prompts/{short_chat,long_open,long_2k,long_50k,long_code_50k,humaneval_2}.txt — comma-separated token IDs (HF google/gemma-3-27b-it tokeniser is byte-identical to the GGUF; verified vocab_size=262144 BOS=2 EOS=106 match).
  • run_dm_sweep.sh — the dm sweep that gave the 64K dm=4 result above.
  • run_scaling.sh — the 16K → 256K MoE ladder.
  • run_64k_v2.sh — the post-fix dense 64K matrix.

Each runner emits a SUMMARY.md with prefill tok/s, decode tok/s, AL, and VRAM per cell; raw per-cell logs are not committed but can be regenerated from the runners and prompts.
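The prompt files are plain comma-separated integer token IDs, so any harness can consume them; an illustrative C++ parser (not the driver's actual --tokens-file code) looks like this:

```cpp
// Hedged sketch: load a comma-separated token-ID prompt file such as
// prompts/long_code_50k.txt (format per the list above).
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

static std::vector<int32_t> load_token_csv(const std::string & path) {
    std::ifstream in(path);
    std::vector<int32_t> ids;
    std::string tok;
    while (std::getline(in, tok, ',')) {
        if (!tok.empty()) ids.push_back((int32_t) std::stol(tok));
    }
    return ids;
}
```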

A longer write-up of the debugging journey is in .sisyphus/notes/gemma4-journey.md for anyone wanting the narrative behind these numbers.


Original benchmark section

(retained for history; see "Update 2026-05-10" above for the post-fix numbers)

Gemma-4-31B Dense

| Context | Prefill baseline | Prefill +pflash | Speedup | Decode +pflash | Accept |
| --- | --- | --- | --- | --- | --- |
| 4K | 1438 tok/s | 1502 tok/s | +4.5% | 149 tok/s | 10.67/16 |
| 8K | 1284 tok/s | 1497 tok/s | +16.6% | 100 tok/s | 10.67/16 |

Gemma-4-26B-A4B MoE — pFlash big wins at long context

| Context | Prefill baseline | Prefill +pflash | Speedup | Decode +pflash | Accept |
| --- | --- | --- | --- | --- | --- |
| 8K | 3400 tok/s | (TBD) | n/a | 117 tok/s | 8.0/16 |
| 32K | 2530 tok/s | 3888 tok/s | +53.7% | 133 tok/s | 10.67/16 |
| 64K | 1997 tok/s | 4028 tok/s | +101.7% | 13 tok/s* | 1.23/16* |

* Decode at 64K shows low DFlash acceptance at the original --draft-max 16 budget; the post-fix numbers above show this lifts to 36.57 tok/s with --draft-max 4. Prefill speedup is unchanged.

VRAM stays under 21 GB at 64K on the 24 GB RTX 3090.

Speedup scaling

pFlash savings scale with KV-len because attention block selection skips blocks proportional to context length. The progression (4.5% → 16.6% → 53.7% → 101.7%) matches the design.

Highlights

New GGML_OP_FLASH_ATTN_SPARSE ggml op (submodule)

  • CUDA dispatch in ggml-cuda/fattn-sparse.cu with S↔H transpose for ggml ↔ pFlash layout conversion
  • BF16 fast path; falls back to dense ggml_flash_attn_ext when no kernel registered
  • Q8_0 / Q4_0 K/V supported via ggml_get_to_fp16_cuda dequant before the sparse path
  • Submodule commits: 5be140d feat(ggml): add GGML_OP_FLASH_ATTN_SPARSE ..., 866688b feat(ggml-cuda): dequantize K/V to F16 in sparse FA path

pFlash + DFlash + chunked prefill on Gemma4

  • ggml_flash_attn_sparse wired into build_full_attn_block() (full-attention layers only; SWA layers stay on dense FA)
  • pFlash CUDA kernel registered via new pflash_ggml_adapter
  • Type-aware dispatch: pflash_supports = {F16, Q8_0, Q4_0}; TQ3_0 falls back to dense
  • last_token_logits_only ported from upstream PR #108 (perf: Replace Q8_0 format for KV with Q4_0 + Rotation, fix window_filled for long context) — saves ~1GB output tensor and ~1000x lm_head compute per non-last prefill chunk
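As a rough illustration of the type-aware dispatch (the real check lives in build_full_attn_block(); the enum values are ggml's, the helper name is ours):

```cpp
#include "ggml.h"

// Hedged sketch of the K/V-type gate described above: only types the sparse kernel
// handles take the GGML_OP_FLASH_ATTN_SPARSE path; everything else (e.g. TQ3_0)
// falls back to dense ggml_flash_attn_ext.
static bool pflash_supports(enum ggml_type kv_type) {
    switch (kv_type) {
        case GGML_TYPE_F16:
        case GGML_TYPE_Q8_0:
        case GGML_TYPE_Q4_0:
            return true;
        default:
            return false;
    }
}
```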

Major correctness fix: SWA mask coordinate frame

The host-built SWA causal mask was in absolute KV coordinates but the FA kernel reads it indexed by view position. For every prefill chunk where kv_start > 0, the mask was misaligned by ring_win_start columns → kernel saw all -inf for SWA layers → softmax NaN.

  • Q8/F16 KV: degraded but plausible-looking output (NaN absorbed by saturating arithmetic). Visible: 8K Q8 baseline output token changed from 236770 (broken) to 236799 (correct) after this fix.
  • TQ3_0 KV: clean NaN propagation → argmax returns 0.

Fix introduces a shared compute_swa_view() helper used by both the graph builder and the test driver so K view + mask stay in lockstep.
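A minimal sketch of the view-coordinate mask fill; the signature mirrors the description above and is illustrative, not the test driver's exact code:

```cpp
#include <cstdint>
#include <limits>
#include <vector>

// Hedged sketch: fill the SWA causal mask indexed by K-view column (k_view), checking
// causality and the sliding window in absolute positions via abs_win_start + k_view.
static void build_swa_causal_mask(float * mask, int n_tokens, int kv_pad,
                                  int64_t abs_win_start, int win_len,
                                  int64_t kv_start, int swa_window) {
    const float NEG_INF = -std::numeric_limits<float>::infinity();
    for (int q = 0; q < n_tokens; ++q) {
        const int64_t q_abs = kv_start + q;                 // absolute query position
        for (int k_view = 0; k_view < kv_pad; ++k_view) {
            const int64_t k_abs = abs_win_start + k_view;   // absolute key position
            const bool valid = k_view < win_len
                            && k_abs <= q_abs               // causal
                            && k_abs >  q_abs - swa_window; // inside the sliding window
            mask[(size_t) q * kv_pad + k_view] = valid ? 0.0f : NEG_INF;
        }
    }
}
```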

Daemon mode + server routing

  • test_gemma4_dflash --daemon mirrors the IPC protocol of test_dflash: line-based stdin commands, int32 LE token stream on --stream-fd=N, -1 sentinel
  • scripts/server.py detects GGUF architecture; routes to test_gemma4_dflash for gemma4, keeps test_dflash for Qwen3.5
  • New --pflash server flag passes through to the daemon
  • Smoke test: 16 valid Gemma4 vocab tokens + -1 sentinel streamed on fd=3 with 26B-A4B + 4096-token prompt
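The stream protocol is easy to consume from any client; an illustrative reader (assuming the stream is on fd 3, as in the smoke test, and a little-endian host):

```cpp
#include <cstdint>
#include <cstdio>
#include <unistd.h>
#include <vector>

// Hedged sketch of a daemon-stream consumer: read int32 little-endian token IDs from
// the stream fd until the -1 end-of-generation sentinel (protocol per the bullets above).
static std::vector<int32_t> read_token_stream(int fd) {
    std::vector<int32_t> tokens;
    int32_t tok = 0;
    while (read(fd, &tok, sizeof(tok)) == (ssize_t) sizeof(tok)) {
        if (tok == -1) break;
        tokens.push_back(tok);
    }
    return tokens;
}

int main() {
    std::vector<int32_t> toks = read_token_stream(3); // --stream-fd=3
    std::fprintf(stderr, "received %zu tokens\n", toks.size());
    return 0;
}
```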

Run the server

python3 dflash/scripts/server.py \
  --target /path/to/gemma-4-26B-A4B-it-UD-Q4_K_M.gguf \
  --draft  /path/to/gemma4-26b-a4b-dflash/ \
  --pflash \
  --port 8000

Test plan

  • test_flash_attn_sparse (TDD: dense vs sparse @ alpha=1.0 within BF16 tolerance; alpha<1.0 liveness) — passes
  • test_gemma4_dflash --pflash at 4K/8K Q8 on 31B Dense — output matches baseline at 4K, sparse approx at 8K
  • Gemma-4-26B-A4B MoE pFlash + DFlash 32K, 64K — see updated numbers above
  • NEW: post-fix MoE 16K → 256K ladder + dm sweep — see updated numbers above
  • Daemon mode smoke test — 16 valid Gemma4 vocab tokens + -1 sentinel streamed on fd=3
  • Server end-to-end OpenAI API request against deployed Gemma4 (manual)

Submodule

Submodule pointer is bumped on dusterbloom/llama-cpp-turboquant-cuda feature/tq3-kv-cache. The .gitmodules URL was updated to that fork because the upstream Luce-Org/llama.cpp-dflash-ggml.git doesn't have these commits. Maintainer can rewrite the URL post-merge if the commits get mirrored to a Luce-Org repo.

dusterbloom added 20 commits May 7, 2026 10:38
Full implementation of Gemma4 architecture for lucebox-hub DFlash:

Target model (GGUF loader + forward pass graph builder):
- Per-layer head_count_kv array (8 for SWA, 2 for full-attention)
- Dual head_dim: 256 (SWA) / 512 (full-attention) with correct cache sizing
- V=K sharing on full-attention layers (attention_k_eq_v)
- MoE FFN: 128 experts, top-8 routing with shared expert + softmax gating
- Sliding window attention pattern from BOOL GGUF array
- Proportional RoPE (p-RoPE) with per-layer freq_factors
- Embedding scaled by sqrt(hidden_size) per HF reference
- CUDA FA 256-alignment for head_dim>=512 (FATTN_KQ_STRIDE)
- TurboQuant TQ3_0 KV cache with 256-byte alignment padding
- Logit softcapping: 30 * tanh(logits / 30)

Draft model (safetensors loader + forward pass):
- 5-layer transformer with SwiGLU FFN
- FC projection: 6 * target_hidden -> draft_hidden
- Tied LM head using target tok_embd
- Block-diffusion speculative decoding architecture
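Two of the Gemma-specific scalar transforms listed above are simple element-wise ops; a hedged sketch of the math (cap=30 and sqrt(hidden_size) scaling per the commit, helper names illustrative):

```cpp
#include <cmath>
#include <vector>

// Hedged sketch of the logit softcapping and embedding scaling described above.
static void apply_logit_softcap(std::vector<float> & logits, float cap = 30.0f) {
    for (float & x : logits) x = cap * std::tanh(x / cap);
}

static void scale_input_embeddings(std::vector<float> & embd, int hidden_size) {
    const float s = std::sqrt((float) hidden_size);
    for (float & x : embd) x *= s;
}
```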
5 smoke tests validating the Gemma4 implementation:
- smoke_load_gemma4_target: GGUF metadata, per-layer head_kv, SWA pattern
- smoke_gemma4_target_forward: full 26B-A4B forward pass, logits in [-30,30]
- smoke_load_gemma4_draft: safetensors loading, fc/layer shape validation
- smoke_gemma4_draft_forward: draft forward with injected target tok_embd
- test_gemma4_kv_tq3: TQ3 cache 256-alignment, shared layer donors

Plus test_gemma4_dflash driver for combined target+draft benchmarking.
The evenly-spaced formula produced wrong IDs for both Gemma4 variants.
Use the actual values from the z-lab DFlash draft model config.json:
- 26B-A4B (30 layers): {1, 6, 11, 17, 22, 27}
- 31B (60 layers): {1, 12, 23, 35, 46, 57}
Fall back to evenly-spaced for unknown layer counts.
The draft model was stateless (no KV cache), giving 0% speculative
acceptance.  Add prefix-direct KV materialization: target features are
projected through FC → hidden_norm → per-layer K/V, stored in a
dedicated draft KV cache.  The draft forward now attends to this
cache, matching the SGLang/vLLM DFlash architecture.

Gemma4-26B-A4B with draft: avg 10.67 tokens accepted per step,
~250 tok/s decode on RTX 3090 (vs ~67 tok/s baseline).
Replace single-token autoregressive prefill with chunked batched forward.
Each chunk processes up to swa_window tokens in a single GPU dispatch,
cutting prefill from ~66 tok/s to ~830-1060 tok/s on RTX 3090.

Add swa_mask to GemmaGraphInputs so SWA attention layers use a
sliding-window mask during batched prefill while full-attention layers
keep the standard causal mask.
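An illustrative shape of the chunk loop (the real driver also rebuilds masks and cache views per chunk; the forward callback is hypothetical):

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <vector>

// Hedged sketch of chunked batched prefill as described above: the prompt is consumed
// in chunks of up to swa_window tokens, each submitted as a single batched forward.
static void chunked_prefill(const std::vector<int32_t> & prompt, int swa_window,
                            const std::function<void(const int32_t *, int, int)> & forward_chunk) {
    const int n_prompt = (int) prompt.size();
    int kv_start = 0;
    while (kv_start < n_prompt) {
        const int n = std::min(swa_window, n_prompt - kv_start);
        forward_chunk(prompt.data() + kv_start, n, kv_start); // one GPU dispatch
        kv_start += n;
    }
}
```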
Add --csv flag for direct use with test_gemma4_dflash --tokens.
Default model changed to google/gemma-4-26b-a4b-it. Add --verbose
flag, local_files_only caching, and --add-bos option.
Converts BF16 safetensors draft weights to Q8_0 GGUF format.
Projection weights quantized to Q8_0 (~50% size), norms kept F32.
Includes Gemma4-specific GGUF metadata (sliding_window, logit_softcap,
target_layer_ids). Requires a C++ GGUF loader to be used at inference.
Three bugs prevented coherent speculative decoding output:

1. Missing BOS token: Gemma4 requires BOS (token 2) at position 0.
   Auto-prepend from GGUF bos_token_id when not already present.

2. Missing EOT fallback: many Gemma4 GGUFs omit eot_token_id, so
   eos_chat_id stayed -1 and <end_of_turn> (107) was never caught.
   Default to 107 when the key is absent.

3. Uninitialized SWA mask in speculative verify: when n_tokens > 1,
   build_gemma4_step allocates swa_mask but only attn_mask was filled.
   SWA layers used garbage memory, corrupting all hidden states and
   collapsing output to token 0 (padding) from step 2 onward.

Verified: DFlash now produces identical output to AR baseline and
stops at EOS. Gemma4-31B Q4_K_M + TQ3_0 KV = 80.82 tok/s (2.37x
over AR 34.14 tok/s) on RTX 3090.
… script

load_gemma4_draft_gguf() reads Q8_0-quantized draft weights from GGUF,
auto-detected by .gguf extension on --draft path. Q8_0 drafter matches
BF16 acceptance (AL=6.74) while loading 44% faster and using 380MB less VRAM.

quantize_gemma4_draft_q8.py now reads config.json for model dimensions
instead of hardcoding 26B constants, supporting both 26B-A4B and 31B drafters.
…ttention

Layer-by-layer prefill using FlashPrefill block-sparse WMMA attention for
full-attention layers and ggml FA for SWA layers. Includes gallocr
pre-reserve to eliminate graph allocator overhead and fused [B+SWA] graphs
to reduce hidden_buf round-trips.

Benchmarks at 6K tokens (26B-A4B): 4073 tok/s (+12% over chunked prefill).
Real gains expected at 64K+ where attention density drops below 10%.
Add --pflash, --pflash-alpha, and --tokens-file flags to test harness.
--tokens-file reads comma-separated IDs from a file, bypassing ARG_MAX
limits for prompts >16K tokens.

Fix draft KV cache overflow crash when prompt exceeds draft sliding window
(2096 slots). Clamp prefill to trailing window, adjust ring-buffer offset,
and add defensive assert in build_draft_kv_prefill_graph().
… in FA

FWHT rotation for TQ3_0 KV cache is now handled inside the Flash Attention
CUDA kernel via warp-cooperative shuffle. Remove the separate ggml_turbo_wht
graph ops from build_swa_attn_block() and build_full_attn_block().
SWA layers only need swa_window slots, not the full context. At 64K with
Gemma4 (50 SWA, 10 full-attn layers), this saves 81.8% of KV VRAM.

Ring-buffer read/write positions use modular arithmetic so SWA cache views
never exceed tensor boundaries at long contexts.

Verified: 31B Dense at 64K uses 22.06 GB (target-only), 24.00 GB (full stack
with Q8_0 draft + TQ3_0 KV + DFlash decode at 29.26 tok/s).
After prefill fills all 2096 draft KV slots, the first decode step would
crash with "draft KV overflow". Now wraps draft_kv_pos with modulo
arithmetic, treating the draft cache as a ring buffer.
Decouple Graph A/B chunk size (32K) from SWA window (1K-2K). Batch
consecutive SWA layers into single ggml graphs to reduce graph build
overhead. SWA_CHUNK now tracks actual cache allocation.

Full-attn layers keep the existing Graph A → pFlash → Graph B path.
pFlash integration into single-graph-per-chunk architecture is next.
Replaces the layer-by-layer gemma4_pflash_prefill() with a single-graph-
per-chunk path using the new GGML_OP_FLASH_ATTN_SPARSE op for full-
attention layers. SWA layers continue to use ggml_flash_attn_ext.

Perf (MoE 26B-A4B at 64K, RTX 3090, Q8_0 KV):
  chunked baseline:  1867 tok/s prefill, 100.6 tok/s decode, 10.67/16 accept
  + --pflash:        3374 tok/s prefill (1.81x), 101.8 tok/s decode

Changes:
- Adapter (pflash_ggml_adapter.cpp/h) registers the pFlash CUDA kernel
  with the ggml op. Maps alpha>=1.0 to fully-dense mode.
- build_full_attn_block() conditionally uses ggml_flash_attn_sparse
  when use_pflash is set.
- attn_mask is skipped (in graph + driver) when use_pflash=true since
  the sparse op applies block-level causal internally.
- gemma4_pflash_prefill.cpp removed (replaced by chunked path).
- test/test_flash_attn_sparse.cpp: TDD coverage for the ggml op
  (dense vs sparse @ alpha=1.0 within BF16 precision; alpha<1.0 liveness).

Ported upstream fixes:
- TQ3_0 mask stride (PR Luce-Org#128): bump g_kq_stride_pad to 256 when KV is
  selected via DFLASH27B_KV_K/V env vars. Prevents NaN at chunk sizes
  256/512/1024/2048 with TQ3_0 KV.
- last_token_logits_only (PR Luce-Org#108): skip lm_head matmul over all but
  last token during prefill chunks. Saves ~1GB output tensor and
  ~1000x lm_head compute per chunk on Gemma4-31B (vocab=262144).
Three correctness fixes after benchmarking exposed silent corruption
when --pflash was combined with quantized KV:

1. Graph-level type check in build_full_attn_block: dispatch to
   ggml_flash_attn_sparse only when K/V are F16/Q8_0/Q4_0. TQ3_0 falls
   back to ggml_flash_attn_ext because TQ3's WHT rotation requires
   special handling not yet in the sparse path.

2. Always allocate attn_mask in test_gemma4_dflash (previously skipped
   when use_pflash=true). When some full-attn layers fall back to dense
   FA (non-supported KV types), the mask is required.

3. Guard ggml_backend_tensor_set on attn_mask/swa_mask buffer existence:
   when all full-attn layers use sparse FA, the mask tensor is
   unreferenced by any compute op so gallocr leaves its buffer NULL.
   ggml_set_output is added as a hint but doesn't force allocation;
   skip the write when buffer is NULL. swa_mask gets the same defensive
   check.

Measured on Gemma-4-31B Q4_K_M, RTX 3090, Q8_0 KV:
  4K: 1348 -> 1483 tok/s prefill (+10%), output matches baseline
  8K: 1441 -> 1546 tok/s prefill (+7.3%), block-sparse approximation

Earlier MoE 64K "1.81x speedup" claim was on the broken sparse path
(reading Q8 bytes as F16); that data point is invalid. The current
numbers are on verified-correct execution.

TQ3_0 + chunked path is broken independently of pflash (produces token
0); needs separate debug.
The host-built SWA causal mask was filled in absolute KV coordinates
(mask[q][abs_k] = 0 for valid keys) but the FA CUDA kernel reads it
indexed by view position (k_view = 0..effective_win_len-1, where slot 0
= the cache offset where the K view starts).

For every prefill chunk where kv_start > 0, the K view starts at
ring_win_start in the cache (computed in build_swa_attn_block as
kv_start - swa_window aligned to the ring buffer). The mask cell
[q][k_view=0] was written assuming absolute slot 0, which is far before
the window's lo bound, so it stayed -inf. The kernel then saw every
K-view position as -inf for q rows touching that chunk.

Symptoms:
- Q8/F16 KV: degraded but plausible-looking output (NaNs absorbed by
  saturating arithmetic; argmax landed on some non-zero index)
- TQ3_0 KV: clean NaN propagation through WHT-rotated FA path; argmax
  over NaN-containing logits returns 0 (because `if (x[i] > best)` is
  false for NaN). This is why "TQ3 produces token 0" was the visible
  failure mode.

Fix:
- Add SwaView struct + compute_swa_view() helper in internal.h /
  gemma4_target_graph.cpp encapsulating the
  (abs_win_start, effective_win_len, ring_win_start) math
- build_swa_attn_block calls the helper instead of inlining
- build_swa_causal_mask in test driver takes (abs_win_start, win_len,
  n_tokens, kv_start, swa_window); writes mask[q][k_view] for k_view
  in [0, win_len), using abs_win_start + k_view to check the absolute
  causal window
- swa_mask tensor sized [align_up(effective_win_len, g_kq_stride_pad),
  q_pad] instead of [align_up(kv_len, g_kq_stride_pad), q_pad]
- Both prefill chunk loop and spec-decode verify loop call the helper
  to get matching geometry

Measured impact (Gemma-4-31B Q4_K_M, RTX 3090):
  8K Q8 baseline last sampled token: 236770 (broken) -> 236799 (correct)
  8K Q8 +pflash:                                          1284 -> 1497 tok/s (+16.6%)

Bug entered with chunked prefill (commit 7ce68ac); SWA ring-buffer
(commit f2c36bc) made the offset non-monotonic in kv_start.

The reference Qwen3.5 driver (test/test_dflash.cpp:547-565) already had
this correct via `out_mask[q*kv_pad + (k - win_start)]`.

TQ3_0 still produces token 0 after this fix; that is a separate
TQ3-specific bug.
Wires the Gemma4 binary into scripts/server.py so the OpenAI-compatible
HTTP server can serve Gemma-4-31B and Gemma-4-26B-A4B (with the pFlash
+ DFlash + chunked prefill stack we built this session).

## test/test_gemma4_dflash.cpp

Added a daemon mode that mirrors the IPC protocol used by test_dflash
(Qwen3.5 binary):

- New flags: --daemon, --stream-fd=N, --max-ctx=N (alias for --ctx-size)
- No-op flags accepted for cmdline compatibility with server.py:
  --fast-rollback, --ddtree, --ddtree-budget=B, --ddtree-temp=F,
  --ddtree-no-chain-seed
- After model load, prints "[daemon] ready" to stdout and enters a
  stdin loop reading line-based commands
- Supported command: <prompt_bin_path> <n_gen> [samp=t,p,k,r[,seed]]
- prompt_bin_path is a binary file of int32 LE token IDs
- Each generated token is written as int32 LE to stream_fd; -1 sentinel
  marks end of generation
- Unsupported commands (RESTORE, SNAPSHOT, compress, park, ...) are
  acknowledged with -1 sentinel for now (out of scope for v1)

## scripts/server.py

- _read_gguf_architecture() reads general.architecture from a GGUF
- main() detects "gemma4" and switches DEFAULT_BIN to test_gemma4_dflash
- For Gemma4 the draft argument stays as a directory (matching the
  binary's CLI); for Qwen3 it stays a file as before
- Daemon command is built differently per arch: Gemma4 uses --model /
  --draft named flags and accepts --pflash, Qwen3 keeps the existing
  positional form
- New top-level --pflash flag passes through to the Gemma4 daemon

Smoke-tested locally with the 26B-A4B model + 4096-token prompt, n_gen=16:
daemon prints "[daemon] ready", consumes the binary prompt file, runs
chunked prefill, decodes 16 tokens streamed as int32 LE on fd=3, and
emits the -1 sentinel. Tokens are valid Gemma4 vocab IDs.
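To drive the daemon, the prompt file is just raw little-endian int32 token IDs; an illustrative writer (assuming a little-endian host):

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Hedged sketch: write a prompt_bin_path file as raw int32 LE token IDs, the format
// the daemon command described above expects.
static bool write_prompt_bin(const char * path, const std::vector<int32_t> & tokens) {
    FILE * f = std::fopen(path, "wb");
    if (!f) return false;
    const size_t n = std::fwrite(tokens.data(), sizeof(int32_t), tokens.size(), f);
    std::fclose(f);
    return n == tokens.size();
}
```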
The parent's submodule pointer references commits that live only on
github.com/dusterbloom/llama-cpp-turboquant-cuda (our pflash sparse-FA
work). Update .gitmodules so cloners fetch from that fork instead of
the upstream Luce-Org/llama.cpp-dflash-ggml repo (which doesn't have
these commits).

Maintainer can rewrite this URL post-merge if the commits get
mirrored to a Luce-Org repo.

@cubic-dev-ai (Bot) left a comment

11 issues found across 20 files



<file name="dflash/src/errors.cpp">

<violation number="1" location="dflash/src/errors.cpp:30">
P2: Returns a pointer into shared mutable error storage after unlocking, so concurrent `set_last_error()` calls can invalidate the returned `const char *`.</violation>
</file>

<file name="dflash/scripts/quantize_gemma4_draft_q8.py">

<violation number="1" location="dflash/scripts/quantize_gemma4_draft_q8.py:227">
P2: Missing validation for empty `target_layer_ids` can crash quantization with modulo-by-zero when computing `TARGET_HIDDEN`.</violation>
</file>

<file name="dflash/scripts/server.py">

<violation number="1" location="dflash/scripts/server.py:69">
P2: Swallowing GGUF read errors here makes Gemma4 detection fail open, so the server silently takes the non-Gemma4 daemon path and uses the wrong argv shape instead of failing explicitly.</violation>

<violation number="2" location="dflash/scripts/server.py:893">
P2: Gemma4 draft path validation accepts non-directory paths by falling back to the parent directory, masking typos and using the wrong draft directory.</violation>
</file>

<file name="dflash/src/gemma4_target_loader.cpp">

<violation number="1" location="dflash/src/gemma4_target_loader.cpp:604">
P2: Failure paths after allocating `out.buf` return without cleaning up partial `GemmaTargetWeights`, so load errors can leak backend memory unless every caller manually frees on failure.</violation>

<violation number="2" location="dflash/src/gemma4_target_loader.cpp:675">
P2: Missing validation that `tok_embd_sz` is divisible by `n_vocab` before deriving `row_bytes` can corrupt embedding row strides for malformed GGUFs.</violation>
</file>

<file name="dflash/CMakeLists.txt">

<violation number="1" location="dflash/CMakeLists.txt:157">
P2: pFlash is gated by the first CUDA arch entry instead of the true minimum SM, which can wrongly enable sm80-only sources for unsorted mixed-arch builds.</violation>
</file>

<file name="dflash/test/test_flash_attn_sparse.cpp">

<violation number="1" location="dflash/test/test_flash_attn_sparse.cpp:107">
P2: The dense-vs-sparse correctness check is too permissive and can mask bad outputs, including non-finite values.</violation>
</file>

<file name="dflash/src/gemma4_dflash_graph.cpp">

<violation number="1" location="dflash/src/gemma4_dflash_graph.cpp:184">
P2: Missing bounds validation for kv_start + n_tokens before KV-cache writes in build_gemma4_draft_graph().</violation>
</file>

<file name="dflash/test/test_gemma4_dflash.cpp">

<violation number="1" location="dflash/test/test_gemma4_dflash.cpp:906">
P2: Daemon requests with the default seed 0 never reseed the shared RNG, so sampling becomes order-dependent across requests.</violation>

<violation number="2" location="dflash/test/test_gemma4_dflash.cpp:1591">
P2: Resetting `draft_kv_pos` to 0 on cache overflow discards the draft context instead of preserving a valid context length, so speculative decoding runs with an empty draft KV cache once capacity is reached.</violation>
</file>


@davide221 (Contributor):

@dusterbloom can you fix the merge conflict? Great contribution!!

Bundled defensive fixes from code review:

1. errors.cpp: thread-local snapshot of last_error before c_str() return —
   prevents concurrent set_last_error() from invalidating the returned
   pointer across threads.

2. server.py:69: log GGUF read failures to stderr instead of silently
   returning ""; prevents Gemma4 detection from failing open and using
   the wrong daemon argv shape.

3. server.py:893: explicit branches for is_dir / is_file / not-found
   on --draft path; no more silent fallback to parent directory that
   masks user typos.

4. quantize_gemma4_draft_q8.py: confirmed existing N_TARGET_LAYERS == 0
   guard at line 215 prevents the modulo-by-zero (no edit required).

5. gemma4_target_loader.cpp: cleanup_out lambda frees out.buf and
   resets state on every failure path after the buffer allocation —
   prevents backend memory leak on load errors.

6. gemma4_target_loader.cpp: validate tok_embd_sz % n_vocab == 0
   before computing row_bytes — fails fast on malformed GGUFs instead
   of corrupting embedding strides.

7. CMakeLists.txt: replace list(GET _dflash27b_archs 0 ...) with an
   explicit min loop over all configured CUDA arches — pFlash now
   correctly disables when ANY arch in the list is below sm_80.

8. test_flash_attn_sparse.cpp: add explicit non-finite (NaN/inf) check
   in the dense-vs-sparse comparison; printf reports nonfinite=YES/no
   and the return value requires both finite values and max_diff < 1.0.

9. gemma4_dflash_graph.cpp: GGML_ABORT on out-of-bounds kv_start +
   n_tokens at the top of build_gemma4_draft_graph — catch at graph
   build time instead of corrupted-memory crash later.

10. test_gemma4_dflash.cpp daemon: always reseed the RNG per request
    (random_device when seed=0); prevents order-dependent sampling
    across concurrent daemon requests.

11. test_gemma4_dflash.cpp draft KV overflow: replace the hard reset
    cache.draft_kv_pos = 0 with a sliding-window re-prefill from the
    last `keep = dkv_cap - q_len` accepted tokens (see the sketch below).
    The hard reset was discarding ALL draft context once the ring filled,
    collapsing DFlash speculative acceptance from 10.67/16 (32K) to
    1.23/16 (64K) — matching the LongSpec arXiv:2502.17421 long-context
    regression mode for EAGLE-style drafters.

Also includes the WIP TQ3 rotation infrastructure (submodule pointer
bump). Self-test DFLASH_TQ3_VERIFY=1 confirms the rotation is
mathematically reversible (max_diff=0.000000 on roundtrip). TQ3 chunked
output still wrong; the bug is downstream of rotation.
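A hedged sketch of the item-11 replacement (names dkv_cap and q_len follow the commit text; the helper is illustrative, not the shipped code):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hedged sketch of the draft-KV overflow handling described in item 11: instead of
// resetting draft_kv_pos to 0 (discarding all draft context), re-prefill the drafter
// from the last `keep = dkv_cap - q_len` accepted tokens so the ring keeps a valid
// trailing window.
static std::vector<int32_t> draft_overflow_replay(const std::vector<int32_t> & accepted,
                                                  int64_t dkv_cap, int64_t q_len) {
    const int64_t keep = std::max<int64_t>(0, dkv_cap - q_len);
    const int64_t n    = std::min<int64_t>(keep, (int64_t) accepted.size());
    return std::vector<int32_t>(accepted.end() - n, accepted.end());
}
```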
# Conflicts:
#	dflash/deps/llama.cpp
#	dflash/scripts/server.py
Two interlocking bugs were silently corrupting Gemma4 multi-chunk prefill,
producing all-zero decoded tokens (artificially high spec accept rate
because target and drafter both predict token 0 deterministically).

1. SWA ring optimization (swa_ctx_alloc = swa_window + headroom) saves
   VRAM at long contexts but ring-wraps during multi-chunk prefill. The
   K view is constrained to a single contiguous ring slice [ring_win_start,
   ring_size), which on wrap covers only the pre-wrap portion. Post-wrap
   tokens (the latest writes) are silently omitted — queries at positions
   spanning the wrap can't attend to themselves or recent context.

   Pragmatic fix: swa_ctx_alloc = max_ctx_alloc unconditionally. SWA
   layers behave like full-attn during prefill. We lose the VRAM
   optimization but restore correctness. Future work: implement
   double-view SWA reads (concat pre-wrap + post-wrap views) so the
   memory savings can come back without correctness regression.

2. SWA ring-wrap also produced a non-256-aligned win_len_padded clamp
   for TQ3_0 (which requires FATTN_KQ_STRIDE=256), causing SIGSEGV.
   Snap ring_win_start down to the nearest 256-multiple so the K view
   length stays aligned. The mask already excludes the extra padded
   tokens. Now redundant given (1) but kept as a safety net.

Also adds an env-gated [CACHE-WRITE-PROBE] in the test driver
(DFLASH_TQ3_PROBE_CACHE_WRITE=1) for future debugging.

Submodule bump pulls in:
- fix(ggml-cuda): honor view_offs in cpy data pointer
- perf(ggml-cuda): skip cudaMemGetInfo on chunked-FA hot path

Verified end-to-end on RTX 3090:
  Dense 31B + Q8 + draft @ 2.5K  = real tokens (was: all zeros)
  Dense 31B + TQ3 + draft @ 2.5K = real tokens (was: SIGSEGV)
  MoE 26B + TQ3 + draft @ 16K    = real tokens, 1969 tok/s prefill
  Dense 31B + TQ3 + draft @ 4K   = real tokens, 480 tok/s prefill
Replace the disable-fix (swa_ctx_alloc = max_ctx_alloc) with a properly-
sized ring + non-monotonic mask formula. Restores 70-95% SWA cache
VRAM savings at long contexts while keeping multi-chunk correctness.

Architecture:
  - Ring sized to hold the last R = 2 * swa_window keys (= 2 chunks worth).
    Always contains the relevant key window for any chunk, but in non-
    monotonic order after wrap (newest tokens land in pre-wrap slots).
  - K view is ALWAYS the full ring (ring_win_start = 0, len = ring_size).
    The kernel reads the full ring; correctness comes from the mask.
  - build_swa_causal_mask uses an abs_pos formula:
      latest_slot = (kv_end - 1) % ring_size
      offset_back = (latest_slot - k_view + R) % R
      abs_k       = (kv_end - 1) - offset_back
    This handles any wrap pattern correctly.
  - K/V WRITE path splits on wrap: when kv_start % R + n_tokens > R,
    issue two ggml_cpy ops (pre-wrap [write_pos, R) + post-wrap [0, post_n)).
  - compute_swa_view returns full-ring geometry; no truncation, no
    alignment-snap, no contiguous-segment assertion.
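A hedged sketch of the slot-to-absolute-position mapping quoted above (assuming ring_size == R; names mirror the formula and are illustrative):

```cpp
#include <cstdint>

// Hedged sketch: map a K-view slot in the non-monotonic ring back to its absolute
// position by walking backwards from the newest write, per the abs_pos formula above.
static int64_t ring_abs_pos(int64_t kv_end, int64_t ring_size, int64_t k_view) {
    const int64_t latest_slot = (kv_end - 1) % ring_size;                        // slot of the newest key
    const int64_t offset_back = (latest_slot - k_view + ring_size) % ring_size;  // distance behind the newest
    return (kv_end - 1) - offset_back;                                           // absolute position of this slot
}
```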

Verified on RTX 3090, ~15 min run including TQ3 trifecta:
  T1 single-chunk @ 900 (Q8 + draft):   sampled=236774, real tokens
  T2 2-chunk @ 2.5K (Q8 + draft):       decoded 514, 4755, 822, 2864...
  T3 ring-wrapping @ 8K (Q8 + draft):   1340 tok/s, real tokens
  T4 MoE 16K + TQ3 + draft (the one):   2489 tok/s, swa=2048, saved 72.9%

VRAM at 64K Gemma4-31B: previously 5.5 GB SWA cache (disable-fix),
now ~0.18 GB (50 SWA layers * 2048 * 1792B = 30x reduction).

Submodule bump pulls in the [TQ3-DEQ] printf re-gate.
Adds per-layer KV type machinery + a narrow override that forces Q8_0
on the small subset of full-attn layers whose hidden states are captured
for the DFlash drafter (target_feat ring). Mirrors vLLM's
kv-cache-dtype-skip-layers pattern.

Why: upstream FA dispatch (deps/llama.cpp/.../fattn.cu:441) routes
TQ3_0 + Q->ne[0]>256 to slow CHUNKED kernel. On Dense Gemma4-31B
(full-attn head_dim=512), this is a perf trap. Forcing the drafter's
captured layers to Q8 unblocks the pflash sparse fast path for the
slice the draft consumes.

Gate: kv_type==TQ3 && head_dim>256 && draft wired (capture_layer_ids
non-empty). SWA layers always exempt (don't hit the trap).

Empirical impact (RTX 3090, Dense 31B Q4_K_M + TQ3 + draft + pflash @ 4K):
  - Dense override fires on 2 of 10 full-attn layers (capture IDs 12, 46)
  - Prefill 48 -> 50 tok/s (marginal; 8 remaining full-attn still slow)
  - MoE override fires on 2 of 4 captured (3 keep TQ3); no regression
    (1464 tok/s under GPU contention vs 2489 dedicated)
  - Q8 control unchanged (gate requires TQ3)

Recommendation for production: Dense 31B + draft -> use Q8_0 KV
(505 tok/s prefill in our testing) until an upstream MMA-F16 TQ3
dequant kernel for head_dim=512 lands. TQ3 KV remains optimal for
MoE 26B-A4B (2489 tok/s @ 16K).

Per-layer machinery (kv_k_type_per_layer, kv_v_type_per_layer) is kept
infrastructure for future asymmetric experiments.
Submodule commit 580246202 adds an opt-in (DFLASH_TQ3_MMA=1) route
for TQ3_0 KV through the MMA-F16 tensor-core path:
- New k_tq3_0_dequant_f16_full bulk-dequant kernel
- Intercept in ggml_cuda_flash_attn_ext_mma_f16 with pool-allocated
  f16 K/V temp buffers
- tq3_needs_chunked guard lifted when env var set

Target prefill (Dense 31B + TQ3 + pflash, no draft): 420 -> 610 tok/s.

Note: with --draft enabled, Dense+TQ3 still hits the 9x penalty bug
(separate from FA dispatch). MMA fix is a building block toward closing
the gap.
When --draft is a directory containing both draft-q8_0.gguf (1.6 GB)
and model.safetensors (3 GB BF16), prefer the GGUF. The BF16 safetensors
draft pushed Dense+TQ3 over the 24 GB VRAM ceiling on a 3090, which
fragmented the allocator and triggered host-side cudaStreamSynchronize
stalls (per nsys: 67% of total CUDA time, max sync 1.5s) — collapsing
target prefill from 800+ tok/s to 41 tok/s.

The fix detects this case, logs a warning so the user knows what
happened, and loads the GGUF.

Empirical impact (RTX 3090, draft path = directory):
  Dense 31B + TQ3 + draft + pflash @ 4K:   41 -> 797-852 tok/s  (~20×)
  MoE 26B + TQ3 + draft + pflash @ 16K:    2489 -> 3089 tok/s   (+24%)
  VRAM (MoE 16K):                          24.0 GB -> 19.3 GB

This makes 852 tok/s the new ceiling for our Dense-31B + TQ3 + spec-decode
trifecta on a single RTX 3090, beating the prior best-known by ~6×
(stock llama.cpp/ollama hangs at 3-4K — see ollama#15350).

Bonus: explicit `--draft .../draft-q8_0.gguf` already worked; this
just removes the foot-gun for users passing the directory.
…le bench harnesses

This commit captures the empirical scaffolding used in this session to
validate three earlier fixes (TQ3 dispatcher d758ed9bf, SWA mask 7b62c07,
head_dim=512 mask f1f811e). Together those fixes unlocked TQ3 KV at all
contexts and let MTP+Q8/Q8 run past step ~110.

Contents:

- .sisyphus/plans/gemma4-context-scaling.md — phased plan to test all
  configs at 1k/4k/8k/32k/64k/256k for the user-facing tuning guide.

- 6 BPE-tokenised prompt files (Gemma 4 vocab; HF google/gemma-3-27b-it
  tokeniser is byte-identical to the GGUF) so benches are reproducible:
    short_chat (27 tok)         — pangram-explanation chat
    long_open  (40 tok)         — robot-painting open prompt
    long_2k    (2611 tok)       — Alice in Wonderland Ch. 1
    long_50k   (49904 tok)      — Tiny Shakespeare summarisation
    long_code_50k (50002 tok)   — concatenated HumanEval+ tasks (code)
    humaneval_2 (139 tok)       — single HE task, EvalPlus chat format
  Each prompt has a .meta sidecar with tokenizer + chat-template + source.

- generate_prompts.py — the original tokenizer harness used to produce
  short_chat / long_open / long_2k. The 50k prompts were generated by
  inline scripts since they pulled from disk-local sources (Tiny
  Shakespeare; the in-repo HumanEval+ jsonl).

- 5 reproducible bench runners (run_*.sh):
    run_matrix_v3.sh        — pre-fix 4-cell target/MTP × Q8/TQ3 matrix
    run_64k_drafter_ab.sh   — 3-way drafter A/B at 64k (pre-fix snapshot)
    run_64k_v2.sh           — 3-cell post-fix 64k re-run
    run_scaling.sh          — dense Q8 64k verify + MoE Q8 16k→256k sweep
    run_dm_sweep.sh         — MoE dm sweep on 50k code prompt at 64k+256k

- SUMMARY.md headline numbers from each completed matrix.

Headline numbers from the committed SUMMARYs:

- 31B dense + Q8/Q8 + dflash + dm=16 + HumanEval/2 @ 4K
    decode  97.81 tok/s, AL 6.56  (~30% under PR Luce-Org#131's 149 ref;
                                   gap is task-mix variance, not regression)
- 31B dense + MTP + Q8/Q8 + HumanEval/2 @ 4K
    decode  34.36 tok/s, accept_rate 0.87  (was aborting at step ~112
                                            pre f1f811e)
- 31B dense + Q8/Q8 + pflash @ 64K  (long_50k Shakespeare prompt)
    prefill 1402 tok/s, decode 7.96 tok/s, VRAM 22.60 GB
    (proves Q8/Q8 fits in 24 GB at 64K; was previously assumed to OOM)
- 31B dense + TQ3/TQ3 + pflash @ 64K
    prefill  585 tok/s, decode 6.90 tok/s, VRAM 21.25 GB
- MoE 26B + dflash + Q8/Q8 + dm=4 + pflash + ctx=256K
    fits at VRAM 21.74 GB on a 24 GB 3090; decode ~30 tok/s
    (the production-relevant 256K config — fits with 2.3 GB to spare)

The dm-sweep results dir is intentionally NOT committed here (run still
in progress). Per-cell raw .log files also omitted to keep the commit
slim; they're reproducible from the runners + prompts on disk.
…on effect, dm sweep, ship config for 24 GB GPUs

Companion narrative to PR Luce-Org#131's amended benchmark section. Documents
the path from a contaminated `0.22 accept_rate` baseline (byte-fallback
tokenisation on out-of-distribution input) through three correctness
fixes to a 24 GB-RTX-3090 ship config that runs Gemma-4-26B-A4B MoE +
dflash + Q8/Q8 + pflash + dm=4 at 35-37 tok/s decode across 64K-256K
context with ~22 GB VRAM.

Three commits referenced:
- d758ed9bf (submodule) fix(fattn): force chunked path for TQ3 K
  to avoid MMA-intercept FWHT mismatch
- 7b62c07 (parent) fix(gemma4): allocate+fill SWA mask for
  n_tokens==1 decode + bump llama.cpp
- f1f811e (parent) fix(mtp): always provide FA mask for head_dim>=512

Sections:
1. The setting (hardware, models, drafters, stack)
2. Day 0: the contaminated baseline (byte-fallback tokens)
3. Bug 1: SWA mask missing for single-token decode
4. The bisect that proved the bug was older
5. Bug 2: TQ3 K dequant intercept silently strips FWHT rotation
6. Bug 3: head_dim=512 + Q8/Q8 MMA gqa-opt requires non-null mask
7. The HumanEval surprise: drafter quality is prompt-distribution-bound
8. DM sweep: PR Luce-Org#131's 64K result was over-speculation, not collapse
9. Scaling: MoE 26B + dflash fits 256K on 24 GB at 35-37 tok/s
10. What still hurts: bandwidth, not bugs (24% of theoretical ceiling)
11. Lessons that would have saved us a weekend
12. Production ship config table
13. Open questions (drafter cache size, decode-time KV sparsity,
    SWA-wrap branch, MoE MTP head training, head_dim=512 kernel cleanup)

Aimed at engineers maintaining a fork of llama.cpp for speculative
decoding (DFlash, MTP, Medusa-class) on consumer GPUs targeting Gemma-4
or any Gemma-4-like SWA + GQA + chunked-prefill model.
Three corrections to the original blog post (commit b441587):

1. Confirms pFlash IS active in the dense ladder runs — both MoE and
   dense logs show `[chunked+pflash, chunk_size=1024]`. The 15× prefill
   gap (4912 vs 319 tok/s at 64K) is architectural (MoE has ~4B active
   params/token over 30 layers; dense has 31B over 60 layers, ~15×
   compute ratio matches the observed prefill ratio), plus VRAM-cap
   contention on the dense path. pflash works; it just can only skip
   attention, not FFN compute.

2. Adds the full dense 31B + dflash + Q8/Q8 + dm=8 + pflash ladder:
   - 64K: 1.78 tok/s decode, AL 1.94 ← anomaly, paged at 24/24 GB cap
   - 128K: 24.89 tok/s decode, AL 7.11 (89% accept) ← healthy
   - 256K: 23.87 tok/s decode, AL 7.11 ← healthy
   The 64K-specific decode collapse with the same drafter + same config
   that decodes fine at 128K/256K is an open puzzle, likely a VRAM
   allocator edge case at the 64K-cache size where 50K-token prompt
   fills 78% of the cache.

3. Updates the "Production ship config" table:
   - Adds prefill tok/s column (was missing — that's what triggered the
     amend; the prefill numbers tell the dense-vs-MoE story)
   - Reframes dense long-context cell as "viable at 128K/256K once
     prefill is paid" rather than the previous "not viable" claim
   - Adds an explicit avoid-list entry for dense + drafter + ctx=64K

Net headline unchanged: MoE 26B + dflash + Q8/Q8 + pflash + dm=4 fits
256K context on a 24 GB 3090 at 35-37 tok/s decode and 4.9K tok/s
prefill. Dense 31B is now positioned as "viable at long ctx for users
willing to pay 3.5 minutes prefill on a 50K prompt", not "not viable".
… dm sweep with GPU power telemetry

Adds run_scientific.sh and the resulting 24-cell results.csv. Each cell
profiles GPU power at ~5 Hz via background nvidia-smi polling, integrates
trapezoidal energy across the cell, and apportions prefill vs decode
energy by time-fraction of the binary's reported phases.

Cells: 2 models (Gemma4-31B dense, Gemma4-26B-A4B MoE) × 2 prompt
distributions (HumanEval/2 code, long_open creative) × 6 draft-max
budgets (1, 2, 4, 8, 16, 32). All Q8/Q8 KV, 4K ctx, n_predict=256,
temp=0 seed=0 --ignore-eos, pflash on.

Headlines:
- Best decode tok/s: MoE+code+dm≥16 = 132 (plateau at dm=16; dm=32 wastes)
- Best efficiency real-spec: MoE+creative+dm=2 = 6.6 J/tok
- Dense max: 82 tok/s creative+dm=16 (the dense drafter generalises
  better OOD than the smaller MoE drafter — 5.12 vs 2.49 AL on creative)
- MoE+code AL plateau 5.22; MoE+creative AL plateau 2.49 — MoE drafter
  is code-distribution-trained, weaker OOD
- VRAM: dense 22.1 GB, MoE 18.9 GB across all dms

Per-cell columns in results.csv: cell, rc, wall_s, prefill_ms,
decode_ms, first_tok_ms, prefill_tok_s, decode_tok_s, AL, VRAM_GB,
avg_power_W, total_energy_J, prefill_energy_J, decode_energy_J,
decode_J_per_tok.

Hardware: RTX 3090 24 GB, CUDA 13.1, 399W TDP. Active-window peak
~395W (~99% TDP) on dense+code, MoE peaks lower (~130W avg).
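A hedged sketch of the energy accounting described above (trapezoidal integration over (t, W) samples, then a time-fraction split; the column names come from results.csv, the functions are illustrative):

```cpp
#include <utility>
#include <vector>

// Hedged sketch: integrate (t_seconds, power_watts) samples from the ~5 Hz nvidia-smi
// poll into joules, then apportion a phase's energy by its time fraction.
static double integrate_energy_j(const std::vector<std::pair<double, double>> & samples) {
    double joules = 0.0;
    for (size_t i = 1; i < samples.size(); ++i) {
        const double dt = samples[i].first - samples[i - 1].first;
        const double pw = 0.5 * (samples[i].second + samples[i - 1].second);
        joules += pw * dt; // trapezoid between consecutive samples
    }
    return joules;
}

static double phase_energy_j(double total_j, double phase_ms, double wall_ms) {
    return wall_ms > 0.0 ? total_j * (phase_ms / wall_ms) : 0.0;
}
```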
Comment thread dflash/deps/llama.cpp
Contributor:
You may need to provide the patch for ggml.

Contributor:
Why doesn't the existing quantize-draft-to-Q8 script work for you? Better to enhance it to support both Qwen and Gemma4.

Contributor Author:
Addressed in 4935293. Merged quantize_gemma4_draft_q8.py back into quantize_draft_q8.py with --arch {qwen,gemma4} (auto-detect from config.json's model_type when present). Q8_0 scale formula and tensor-name mapping were already identical between the two scripts, so the merge is a clean parameterisation behind two _ARCH_DEFAULTS dicts. Net -134 lines. Old gemma4 file deleted; no external callers needed updating.

Comment thread dflash/src/f16_convert.cu Outdated
Contributor:
I purposely removed this file a couple of days ago. We can use the ggml convert function to do the same thing.

Contributor Author:
Addressed in 9ccd827. Removed f16_convert.cu; replaced 10 call sites (5 distinct blocks) with copy_target_feat_bf16_to_f32() using ggml_cpy(ctx, bf16_view, f32_view) — dispatched to ggml-cuda's native BF16→F32 path in cpy.cu. ggml_view_2d with byte offset handles ring-buffer slot arithmetic; ggml_backend_graph_compute synchronises internally so the explicit cudaDeviceSynchronize is no longer needed. Net -76 lines. CI will verify the full CUDA build.
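For context, an illustrative shape of that replacement (assumes a live ggml_context and the two ring tensors; this is a sketch, not the literal helper):

```cpp
#include "ggml.h"

// Hedged sketch of copy_target_feat_bf16_to_f32(): view one ring slot in each tensor
// and let ggml_cpy dispatch the backend's native BF16->F32 conversion.
static struct ggml_tensor * copy_target_feat_bf16_to_f32(
        struct ggml_context * ctx,
        struct ggml_tensor  * feat_bf16,  // [n_embd, n_slots] BF16 ring buffer
        struct ggml_tensor  * feat_f32,   // [n_embd, n_slots] F32 ring buffer
        int64_t slot) {
    struct ggml_tensor * src = ggml_view_2d(ctx, feat_bf16, feat_bf16->ne[0], 1,
                                            feat_bf16->nb[1], slot * feat_bf16->nb[1]);
    struct ggml_tensor * dst = ggml_view_2d(ctx, feat_f32,  feat_f32->ne[0], 1,
                                            feat_f32->nb[1],  slot * feat_f32->nb[1]);
    return ggml_cpy(ctx, src, dst); // add the returned node to the graph to run the copy
}
```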

Contributor:
We may want to build our loader in a generic way. For this PR I think it is fine, but refactoring it should be a high-priority follow-up work item.

Contributor Author:
Acknowledged — agreed, the loader is gemma4-specific and a generic loader would be cleaner. Tracking as a follow-up work item alongside the qwen3↔gemma4 dflash graph unification (which has 5 material deviations per r3215263575). Both refactors estimated ~3-5 engineer-days combined; will scope a dedicated PR after this one lands.

Contributor:
My read is that the draft model is the same as the Qwen3.5 one. Did I miss anything?

Contributor Author:
Thanks for looking carefully. The gemma4 dflash graph is not a rename of the qwen3 one — there are five material architectural differences:

  1. KV cache — gemma4 splits into two functions (build_draft_kv_prefill_graph + build_gemma4_draft_graph) sharing a persistent GemmaTargetCache; qwen3 is stateless (recomputes context K/V every call).
  2. Logit softcapping — gemma4 applies cap * tanh(logits / cap) (cap=30) and owns a tied lm_head; qwen3 borrows the target's lm_head with no cap.
  3. Embedding scaling — gemma4 scales input embeddings by sqrt(hidden_size) (Gemma family trait); qwen3 does not.
  4. 6 vs 5 captured target layers — gemma4's FC input width is 6 * target_hidden, not 5.
  5. YaRN RoPE + mask-aware SWA truncation — gemma4 adds opt-in YaRN extrapolation and must slice the attention mask tensor when truncating SWA layers (because the mask is causal over the cache); qwen3's SWA only slices positions_k.

A unified parameterised builder is feasible but would require addressing all five axes plus the loader divergence, and would change the calling convention. Estimated ~3-5 engineer-days for the refactor + bilateral re-validation. Deferring to a follow-up refactor is the right call for this PR — happy to scope as a separate work item.

Comment thread dflash/test/gemma4/smoke_gemma4_draft_forward.cpp
@cubic-dev-ai (Bot) left a comment

10 issues found across 32 files (changes from recent commits).



<file name=".sisyphus/notes/gemma4-baseline/matrix-v3/SUMMARY.md">

<violation number="1" location=".sisyphus/notes/gemma4-baseline/matrix-v3/SUMMARY.md:6">
P2: Third matrix entry is missing its rc/result line, leaving the summary incomplete and ambiguous.</violation>
</file>

<file name=".sisyphus/notes/gemma4-baseline/run_scientific.sh">

<violation number="1" location=".sisyphus/notes/gemma4-baseline/run_scientific.sh:36">
P1: Return code always 0 because `|| true` swallows command exit status before `rc=$?` captures it, making failed benchmark cells indistinguishable from successes in analysis results.</violation>
</file>

<file name=".sisyphus/notes/gemma4-baseline/run_64k_v2.sh">

<violation number="1" location=".sisyphus/notes/gemma4-baseline/run_64k_v2.sh:49">
P2: `set -e` causes the script to abort when grep finds no matches in a log file, because the unguarded `grep -E` returns exit code 1 on no-match and `set -e` exits the shell immediately. This skips the rest of the summary generation and the decoded-text comparison.</violation>
</file>

<file name=".sisyphus/notes/gemma4-baseline/run_dm_sweep.sh">

<violation number="1" location=".sisyphus/notes/gemma4-baseline/run_dm_sweep.sh:21">
P2: Exit code capture is broken: `|| true` masks the benchmark's exit status, so `rc=$?` always reports 0 even when the binary crashes.</violation>
</file>

<file name=".sisyphus/notes/gemma4-baseline/run_scaling.sh">

<violation number="1" location=".sisyphus/notes/gemma4-baseline/run_scaling.sh:25">
P2: Exit-code capture is broken: `$?` after `cmd || true` is always 0, so failures/OOMs are reported as rc=0 in SUMMARY.md.</violation>

<violation number="2" location=".sisyphus/notes/gemma4-baseline/run_scaling.sh:61">
P2: `grep` in summary loop can abort the script under `set -e` when no lines match; non-zero grep exit is treated as a fatal error.</violation>
</file>

<file name=".sisyphus/notes/gemma4-baseline/run_matrix_v3.sh">

<violation number="1" location=".sisyphus/notes/gemma4-baseline/run_matrix_v3.sh:8">
P2: Missing `pipefail` lets the Python summary step fail silently while the script still reaches `DONE`.</violation>

<violation number="2" location=".sisyphus/notes/gemma4-baseline/run_matrix_v3.sh:8">
P2: `set -e` causes expected failing cells to abort the whole matrix run, preventing full summary generation.</violation>
</file>

<file name=".sisyphus/notes/gemma4-baseline/matrix-64k/SUMMARY.md">

<violation number="1" location=".sisyphus/notes/gemma4-baseline/matrix-64k/SUMMARY.md:61">
P2: Benchmark token extraction in benchmark summary is polluted by telemetry/debug numbers, making the reported 'first 80 generated tokens' unreliable and misleading.</violation>
</file>

<file name=".sisyphus/notes/gemma4-baseline/run_64k_drafter_ab.sh">

<violation number="1" location=".sisyphus/notes/gemma4-baseline/run_64k_drafter_ab.sh:5">
P2: Hardcoded absolute local path (`/home/peppi/Dev/lucebox-hub`) makes the script non-portable and causes immediate failure on any other machine due to `set -e`</violation>
</file>


local POW_PID=$!

local t0=$(date +%s.%N)
./dflash/build/test_gemma4_dflash \
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: Return code always 0 because || true swallows command exit status before rc=$? captures it, making failed benchmark cells indistinguishable from successes in analysis results.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At .sisyphus/notes/gemma4-baseline/run_scientific.sh, line 36:

<comment>Return code always 0 because `|| true` swallows command exit status before `rc=$?` captures it, making failed benchmark cells indistinguishable from successes in analysis results.</comment>

<file context>
@@ -0,0 +1,164 @@
+  local POW_PID=$!
+
+  local t0=$(date +%s.%N)
+  ./dflash/build/test_gemma4_dflash \
+    --model $model \
+    --draft $draft \
</file context>

Tip: Review your code locally with the cubic CLI to iterate faster.

Comment thread .sisyphus/notes/gemma4-baseline/matrix-v3/SUMMARY.md Outdated
echo "" >> $LOGDIR/SUMMARY.md
echo "### ${cell}" >> $LOGDIR/SUMMARY.md
echo '```' >> $LOGDIR/SUMMARY.md
grep -E "kv types|narrow asymmetric|^\[draft\] KV|prefill.*tokens in|context_used|tok/s=|VRAM used|^\[mtp\] steps|^\[spec\]|GGML_ABORT" $log >> $LOGDIR/SUMMARY.md
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: set -e causes the script to abort when grep finds no matches in a log file, because the unguarded grep -E returns exit code 1 on no-match and set -e exits the shell immediately. This skips the rest of the summary generation and the decoded-text comparison.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At .sisyphus/notes/gemma4-baseline/run_64k_v2.sh, line 49:

<comment>`set -e` causes the script to abort when grep finds no matches in a log file, because the unguarded `grep -E` returns exit code 1 on no-match and `set -e` exits the shell immediately. This skips the rest of the summary generation and the decoded-text comparison.</comment>

<file context>
@@ -0,0 +1,77 @@
+  echo "" >> $LOGDIR/SUMMARY.md
+  echo "### ${cell}" >> $LOGDIR/SUMMARY.md
+  echo '```' >> $LOGDIR/SUMMARY.md
+  grep -E "kv types|narrow asymmetric|^\[draft\] KV|prefill.*tokens in|context_used|tok/s=|VRAM used|^\[mtp\] steps|^\[spec\]|GGML_ABORT" $log >> $LOGDIR/SUMMARY.md
+  echo '```' >> $LOGDIR/SUMMARY.md
+done
</file context>

run() {
local tag=$1; local ctx=$2; local dm=$3
echo "=== ${tag} (ctx=${ctx} dm=${dm}) starting at $(date +%H:%M:%S) ===" | tee -a $LOGDIR/SUMMARY.md
./dflash/build/test_gemma4_dflash \
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Exit code capture is broken: || true masks the benchmark's exit status, so rc=$? always reports 0 even when the binary crashes.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At .sisyphus/notes/gemma4-baseline/run_dm_sweep.sh, line 21:

<comment>Exit code capture is broken: `|| true` masks the benchmark's exit status, so `rc=$?` always reports 0 even when the binary crashes.</comment>

<file context>
@@ -0,0 +1,58 @@
+run() {
+  local tag=$1; local ctx=$2; local dm=$3
+  echo "=== ${tag} (ctx=${ctx} dm=${dm}) starting at $(date +%H:%M:%S) ===" | tee -a $LOGDIR/SUMMARY.md
+  ./dflash/build/test_gemma4_dflash \
+    --model $MOE \
+    --draft $MOE_DFLASH \
</file context>

echo "" >> $LOGDIR/SUMMARY.md
echo "### $tag" >> $LOGDIR/SUMMARY.md
echo '```' >> $LOGDIR/SUMMARY.md
grep -E "context_used|kv types|prefill.*tokens in|tok/s=|VRAM used|^\[mtp\] steps|^\[spec\]|GGML_ABORT|out of memory|cudaMalloc|^\[draft\] KV" $log >> $LOGDIR/SUMMARY.md

P2: grep in summary loop can abort the script under set -e when no lines match; non-zero grep exit is treated as a fatal error.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At .sisyphus/notes/gemma4-baseline/run_scaling.sh, line 61:

<comment>`grep` in summary loop can abort the script under `set -e` when no lines match; non-zero grep exit is treated as a fatal error.</comment>

<file context>
@@ -0,0 +1,66 @@
+  echo "" >> $LOGDIR/SUMMARY.md
+  echo "### $tag" >> $LOGDIR/SUMMARY.md
+  echo '```' >> $LOGDIR/SUMMARY.md
+  grep -E "context_used|kv types|prefill.*tokens in|tok/s=|VRAM used|^\[mtp\] steps|^\[spec\]|GGML_ABORT|out of memory|cudaMalloc|^\[draft\] KV" $log >> $LOGDIR/SUMMARY.md
+  echo '```' >> $LOGDIR/SUMMARY.md
+done
</file context>
Suggested change
grep -E "context_used|kv types|prefill.*tokens in|tok/s=|VRAM used|^\[mtp\] steps|^\[spec\]|GGML_ABORT|out of memory|cudaMalloc|^\[draft\] KV" $log >> $LOGDIR/SUMMARY.md
grep -E "context_used|kv types|prefill.*tokens in|tok/s=|VRAM used|^\[mtp\] steps|^\[spec\]|GGML_ABORT|out of memory|cudaMalloc|^\[draft\] KV" $log >> $LOGDIR/SUMMARY.md || true

local tag=$1; local logfile=$LOGDIR/${tag}.log
shift
echo "=== $tag starting at $(date +%H:%M:%S) ===" | tee -a $LOGDIR/SUMMARY.md
./dflash/build/test_gemma4_dflash "$@" \

P2: Exit-code capture is broken: $? after cmd || true is always 0, so failures/OOMs are reported as rc=0 in SUMMARY.md.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At .sisyphus/notes/gemma4-baseline/run_scaling.sh, line 25:

<comment>Exit-code capture is broken: `$?` after `cmd || true` is always 0, so failures/OOMs are reported as rc=0 in SUMMARY.md.</comment>

<file context>
@@ -0,0 +1,66 @@
+  local tag=$1; local logfile=$LOGDIR/${tag}.log
+  shift
+  echo "=== $tag starting at $(date +%H:%M:%S) ===" | tee -a $LOGDIR/SUMMARY.md
+  ./dflash/build/test_gemma4_dflash "$@" \
+    --n-predict 256 --temp 0 --seed 0 --ignore-eos \
+    > $logfile 2>&1 || true
</file context>

# N2: target-only K=Q8 V=Q8 (control, expect coherent — was M2 in v2)
# N3: MTP K=Q8 V=TQ3 (the production ship target — measure accept_rate)
# N4: MTP K=Q8 V=Q8 (previous safe baseline — was M4 in v2; expect crash ~step 210)
set -e

P2: Missing pipefail lets the Python summary step fail silently while the script still reaches DONE.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At .sisyphus/notes/gemma4-baseline/run_matrix_v3.sh, line 8:

<comment>Missing `pipefail` lets the Python summary step fail silently while the script still reaches `DONE`.</comment>

<file context>
@@ -0,0 +1,97 @@
+#   N2: target-only K=Q8 V=Q8     (control, expect coherent — was M2 in v2)
+#   N3: MTP        K=Q8 V=TQ3   (the production ship target — measure accept_rate)
+#   N4: MTP        K=Q8 V=Q8     (previous safe baseline — was M4 in v2; expect crash ~step 210)
+set -e
+cd /home/peppi/Dev/lucebox-hub
+export PATH=/usr/local/cuda-13.1/bin:$PATH
</file context>

# N2: target-only K=Q8 V=Q8 (control, expect coherent — was M2 in v2)
# N3: MTP K=Q8 V=TQ3 (the production ship target — measure accept_rate)
# N4: MTP K=Q8 V=Q8 (previous safe baseline — was M4 in v2; expect crash ~step 210)
set -e

P2: set -e causes expected failing cells to abort the whole matrix run, preventing full summary generation.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At .sisyphus/notes/gemma4-baseline/run_matrix_v3.sh, line 8:

<comment>`set -e` causes expected failing cells to abort the whole matrix run, preventing full summary generation.</comment>

<file context>
@@ -0,0 +1,97 @@
+#   N2: target-only K=Q8 V=Q8     (control, expect coherent — was M2 in v2)
+#   N3: MTP        K=Q8 V=TQ3   (the production ship target — measure accept_rate)
+#   N4: MTP        K=Q8 V=Q8     (previous safe baseline — was M4 in v2; expect crash ~step 210)
+set -e
+cd /home/peppi/Dev/lucebox-hub
+export PATH=/usr/local/cuda-13.1/bin:$PATH
</file context>

Comment thread .sisyphus/notes/gemma4-baseline/matrix-64k/SUMMARY.md Outdated
# 64k context, TQ3 KV, pFlash on, dense 31B.
# Compare drafters: target-only vs MTP vs dflash.
set -e
cd /home/peppi/Dev/lucebox-hub

P2: Hardcoded absolute local path (/home/peppi/Dev/lucebox-hub) makes the script non-portable and causes immediate failure on any other machine due to set -e

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At .sisyphus/notes/gemma4-baseline/run_64k_drafter_ab.sh, line 5:

<comment>Hardcoded absolute local path (`/home/peppi/Dev/lucebox-hub`) makes the script non-portable and causes immediate failure on any other machine due to `set -e`</comment>

<file context>
@@ -0,0 +1,96 @@
+# 64k context, TQ3 KV, pFlash on, dense 31B.
+# Compare drafters: target-only vs MTP vs dflash.
+set -e
+cd /home/peppi/Dev/lucebox-hub
+export PATH=/usr/local/cuda-13.1/bin:$PATH
+
</file context>
Suggested change
cd /home/peppi/Dev/lucebox-hub
+cd "$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)/../../.."

Adapts PR Luce-Org#129 (howard0su/swa) — sliding-window attention truncation
for the qwen3 draft graph — to the gemma4 cached-KV draft layout.

draft graph (gemma4_dflash_graph.cpp):
  * draft_swa_trunc_enabled() — opt-in via DFLASH_DRAFT_SWA_TRUNC=1.
    For SWA layers (first n-1; final layer stays full attn), restrict
    K_full / V_full views to the last (sliding_window + n_tokens) slots
    and copy a contiguous mask slice (ggml_cont) for FA.
  * draft_rope() — single wrapper around ggml_rope_ext used at all
    three draft RoPE sites. Optional YaRN scaling via
    DFLASH_DRAFT_YARN=1 (default n_ctx_orig=32768, override via
    DFLASH_DRAFT_YARN_NCTX_ORIG).

test harness (test_gemma4_dflash.cpp):
  * --draft-swa-trunc CLI flag mirroring the env var.
  * Bundles the bench-harness infrastructure that has been in-progress
    on this branch: adaptive draft_max, --draft-kv-cap override,
    --mem-diag, --fa-window plumbing through build_gemma4_step.

Bench (RTX 3090, gemma-4-31B-Q4_K_M target + qwen3 5-layer draft,
50K-token prompt, n_predict=64, ctx=65536, NO_VMM=1):

  cap   | SWA | AL   | decode tok/s
  ------+-----+------+--------------
  2096  | off | 1.36 | 1.31
  2096  |  on | 1.73 | 1.68
  8192  | off | 1.02 | 4.29
  8192  |  on | 1.68 | 6.96   <-- +65% AL, +62% decode

The SWA truncation does not fix the underlying long-position
acceptance collapse (the qwen3 draft model itself appears to have
been effectively trained at <=32K positions; see comment near the
sliding re-prefill block in test_gemma4_dflash.cpp). It is a real
partial improvement shippable today; the residual cliff needs a
long-context drafter.

The diff is large because it bundles the pre-existing harness
infrastructure noted above; happy to split if reviewers prefer.
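
For reviewers who want the shape of the K/V view restriction without opening the diff, here is a minimal sketch, assuming the draft K cache is a contiguous [head_dim, kv_len, n_kv_heads] tensor; the helper name swa_truncate_k and the layout are illustrative, not the exact code in gemma4_dflash_graph.cpp:

```cpp
#include "ggml.h"

// Illustrative sketch only: for SWA layers, keep just the last
// (sliding_window + n_tokens) KV slots of the draft cache.
static ggml_tensor * swa_truncate_k(ggml_context * ctx, ggml_tensor * K_full,
                                    int64_t kv_len, int64_t sliding_window, int64_t n_tokens) {
    const int64_t keep = sliding_window + n_tokens;
    if (kv_len <= keep) {
        return K_full;                      // window not exceeded: use the full view
    }
    const int64_t skip = kv_len - keep;     // oldest slots that fall outside the window
    return ggml_view_3d(ctx, K_full,
                        K_full->ne[0], keep, K_full->ne[2],   // new extents
                        K_full->nb[1], K_full->nb[2],         // original strides
                        skip * K_full->nb[1]);                // byte offset past skipped slots
}
```

V gets the same treatment, and the FA mask gets the matching column slice wrapped in ggml_cont so the kernel sees contiguous rows.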
@dusterbloom
Contributor Author

Update — SWA truncation port + YaRN opt-in (98f72c1)

Just pushed feat(gemma4): port SWA truncation to draft graph + YaRN opt-in (commit 98f72c1).

Adapts PR #129 (howard0su/swa, qwen3) to gemma4's cached-KV draft graph layout. Also touches the same surface as #140 / #141 — happy to coordinate with those.

What's in:

  • gemma4_dflash_graph.cpp: draft_swa_trunc_enabled() (opt-in via DFLASH_DRAFT_SWA_TRUNC=1 or --draft-swa-trunc); restricts K_full / V_full views and the FA mask (with ggml_cont) for SWA layers when kv_len > sliding_window + n_tokens.
  • gemma4_dflash_graph.cpp: draft_rope() wrapper around ggml_rope_ext; opt-in YaRN via DFLASH_DRAFT_YARN=1 (default n_ctx_orig=32768, override DFLASH_DRAFT_YARN_NCTX_ORIG).
  • test_gemma4_dflash.cpp: --draft-swa-trunc CLI flag mirror.

Bench (RTX 3090, gemma-4-31B-Q4_K_M target + qwen3 5L draft, 50K-token prompt, n_predict=64, ctx=65536, NO_VMM=1):

| draft cap | SWA trunc | AL | decode tok/s |
| --- | --- | --- | --- |
| 2096 (default) | off | 1.36 | 1.31 |
| 2096 (default) | on | 1.73 | 1.68 |
| 8192 | off | 1.02 | 4.29 |
| 8192 | on | 1.68 | 6.96 (+65% AL, +62% decode, ~5× over default-cap baseline) |

Caveat: SWA truncation does not lift AL to "healthy" (~5+). Paired probe (prose-50k vs code-50k at ctx=65536) confirms the residual AL=1 cliff is purely position-driven, not content-distribution-driven — the qwen3 draft model itself appears to have been trained at ≤32K positions (matching the in-source comment about "32K → 64K acceptance collapse"). Real fix needs a long-context drafter; this port is a meaningful partial improvement that ships today.

YaRN is included as an opt-in lever for users who want to experiment with draft-side RoPE scaling; default-off, no behavior change.

The 32 GiB CUDA VMM pool reservation fragments badly inside the last
few hundred MB on a 24 GiB card. Measured impact at ctx=65536 dense
Q8/Q8 with the long-prompt code workload:

  default (VMM on):    1.79 tok/s  (verify_ms p50 ~1738)
  GGML_CUDA_NO_VMM=1:  2.78 tok/s  (verify_ms p50 ~975)   +55%

Prefill does drop from 193 to 179 tok/s (-7%), but the net win is decisive.

Auto-detect: cudaGetDeviceProperties.totalGlobalMem <= 25 GiB AND
the user has not explicitly set GGML_CUDA_NO_VMM. Override with
GGML_CUDA_NO_VMM=0 if you need VMM for some reason.

Banner: "[auto] GGML_CUDA_NO_VMM=1 set (GPU has N GiB; override with
GGML_CUDA_NO_VMM=0)" so it is obvious in logs.

@cubic-dev-ai cubic-dev-ai Bot left a comment


1 issue found across 1 file (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="dflash/test/test_gemma4_dflash.cpp">

<violation number="1">
P1: Auto-disabling VMM is wired to an unsupported runtime env-var contract, so the 24 GB GPU safeguard likely never affects ggml's CUDA backend.</violation>
</file>


Comment thread dflash/test/gemma4/test_gemma4_dflash.cpp
The submodule's ggml_turbo_wht kernel writes dst using src strides
(turbo-wht.cu:20-21), so a non-contiguous Q (post ggml_permute) gets
scattered writes and corrupts the result. Always wrap with ggml_cont
BEFORE rotating, never after permute alone.

Six edit sites (target_graph SWA + full attn blocks; mtp_graph Q
rotate + O rotate sites):

  ggml_tensor * Qfa = ggml_cont(ctx, ggml_permute(ctx, Qcur, ...));
  if (q_rotate) Qfa = ggml_turbo_wht(ctx, Qfa, 0);
  ...
  if (out_rotate) {
      attn = ggml_cont(ctx, attn);
      attn = ggml_turbo_wht(ctx, attn, 1);
  }

This is the OUTER-REPO half of a path-asymmetric contract: graph
pre-rotates Q forward and post-rotates FA output backward; FA backends
(chunked, VEC) consume rotated K/V directly. The kernel-side half
lives in the submodule fork.

Verified: TQ3/TQ3 target-only on MoE 26B-A4B + humaneval_2 + ctx=4096
produces coherent prose matching the Q8 control's continuation:
  Q8: 1106 6596 108 2063 102267 236779 5640 ...
  TQ3: 1106 6596 ... (same logits, max=21.250@6596).

Plus optional cap-hint params on create_gemma4_cache /
create_draft_kv_cache (used by --draft-kv-cap and --mem-diag CLI
flags introduced in 98f72c1).
…clean

Rebases the fork from merge-base fae3a2807 (early May) onto current
ggml-org/llama.cpp:master HEAD 0b047287f. Addresses PR Luce-Org#131 review
comment r3214286746 from howard0su ("You may need to provide the
patch for ggml").

Old pin: dusterbloom/llama-cpp-turboquant-cuda feature/tq3-kv-cache
         d758ed9bf 12 commits ahead of fae3a2807, 1 of 12 patches
         applied clean to ggml/master.

New pin: dusterbloom/llama-cpp-turboquant-cuda feature/tq3-kv-cache-clean
         daef232a6 11 commits on top of ggml/master 0b047287f.

Commit triage applied during rebase (per user decisions in
~/.claude/plans/do-the-fork-rebase-vast-kurzweil.md):
  - Dropped: debug probe commits (45e492b13, 3f65b59c4) and the broken
    TQ3 MMA dequant intercept (580246202) and its safety gate (d758ed9bf).
  - Squashed: WIP rotation kernel commit (e2af945b9) into the FWHT
    fuse commit (fd8710abc → 694cea5e1).
  - New: TQ3_0 → F16 cpy dispatch (666df462d), graph-level FWHT
    contract refactor (6d5ca8c4b), fused MoE kernel (992aac8ac),
    force-chunked-for-TQ3 dispatcher fix (daef232a6).

Verified TQ3 chunked correctness post-rebase against the previous
working baseline:
  next=6596 max=21.250@6596 (matches Q8 control)
  43.97 tok/s on n_predict=16, 6.58 tok/s on n_predict=64

Phase-1 upstream candidate: commit 14f90bf60 (cpy view_offs fix)
applies clean to ggml/master.
@dusterbloom
Contributor Author

Update — Submodule rebase complete (f008033)

Two new commits on feature/gemma4-support:

  • 4b0c158 fix(gemma4): TQ3 graph-level FWHT rotation contract — outer-repo half of the TQ3 fix; wraps ggml_permute(Q) in ggml_cont before ggml_turbo_wht (turbo-wht.cu's kernel writes dst with src strides, so non-contig input scatters writes).
  • f008033 chore(submodule): bump dflash/deps/llama.cpp to feature/tq3-kv-cache-clean — addresses r3214286746 by rebasing the fork from a 4-week-old merge-base (fae3a2807) onto current ggml/master HEAD (0b047287f).

Submodule rebase summary

Old pin: dusterbloom/llama-cpp-turboquant-cuda feature/tq3-kv-cache@d758ed9bf — 12 commits ahead of fae3a2807, 1 of 12 patches applied cleanly to current ggml/master.

New pin: dusterbloom/llama-cpp-turboquant-cuda feature/tq3-kv-cache-clean@daef232a6 — 11 commits on top of 0b047287f, 4 of 11 apply cleanly.

Commit triage during the rebase:

  • Dropped debug probes (45e492b13, 3f65b59c4)
  • Dropped the broken TQ3 MMA dequant intercept (580246202) and its safety gate (d758ed9bf) — that intercept produced degenerate output (token loop) and is being held for a kernel-level rework
  • Squashed WIP rotation kernel (e2af945b9) into the FWHT fuse commit (fd8710abc)
  • New atomic commits: cpy TQ3→F16 dispatch, graph-level FWHT contract refactor, fused MoE kernel, and a force-chunked-for-TQ3 dispatcher fix that restores SWA-decode correctness post-rebase

Verification

  • ✅ Submodule build clean
  • ✅ FWHT primitives roundtrip self-test: max_diff=0.000000
  • ✅ Q8/Q8 control byte-identical to pre-rebase baseline
  • ✅ TQ3/TQ3 chunked produces coherent output: next=6596 max=21.250@6596 (matches Q8 control's first prediction); 6.58 tok/s on n_predict=64 — same as the freshly-rebuilt pre-rebase baseline binary

Phase-1 upstream-ready patches

Two of the 11 apply cleanly to ggml-org/llama.cpp:master and are immediately submittable:

  1. 0006-fix-ggml-cuda-honor-view_offs-in-cpy-data-pointer.patch — small, generic bugfix; multi-chunk Gemma4 prefill was silently corrupting KV cache state because ggml_cuda_cpy() was using src->data raw, ignoring view_offs. The fix adds a ggml_cuda_cpy_data() helper that resolves view_src + view_offs correctly. Applies cleanly to current upstream HEAD; a sketch of the idea follows this list.

  2. 0010-feat-ggml-cuda-fused-MoE-kernel-for-IQ2_XS-IQ3_XXS-IQ4_XS.patch — generic perf, uses standard upstream quant types only, no TQ3 dependency.
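
To make item 1 above concrete, here is a sketch of the idea as described; the real helper in the patch is ggml_cuda_cpy_data(), and this is not the literal diff:

```cpp
#include "ggml.h"

// Illustrative: resolve the backing allocation plus view_offs instead of
// trusting t->data on a view tensor. For non-views, view_src is NULL and
// view_offs is 0, so the same expression covers both cases.
static const char * cpy_src_base(const ggml_tensor * t) {
    const ggml_tensor * base = t->view_src ? t->view_src : t;
    return (const char *) base->data + t->view_offs;
}
```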

Full mapping at submodule-patches/README.md.

Other open feedback (status table)

| Comment | Status |
| --- | --- |
| r3214286746 (patch for ggml) | This update — rebased + Phase-1 patches identified |
| r3214287342 (quantize_gemma4_draft_q8.py refactor) | ⏳ Separate refactor, not in this PR |
| r3214289240 (f16_convert.cu purposely removed) | ✅ Not reintroduced in the rebase |
| r3214290950 (generic loader refactor) | ⏳ High-priority refactor backlog |
| r3214291841 (draft model is same as qwen3.5) | Confirmed empirically. Internal architectural audit (dflash/REFACTOR_DFLASH_UNIFIED.md proposed) shows the qwen3 and gemma4 drafts are the same 5L Qwen3-style block-diffusion; ~4-5 engineer-day unification refactor planned, not in this PR. |
| r3214292648 (group gemma4 tests) | ⏳ Process suggestion |
| cubic-dev-ai P2 bots (8 items) | All addressable in follow-ups |

@dusterbloom
Contributor Author

Phase-1 upstream PR opened

The first generic bugfix from the rebased fork is now an open PR upstream:

ggml-org/llama.cpp#22913 (ggml-cuda: honor view_offs in cpy data pointer)

This is patch 0006 from the rebased feature/tq3-kv-cache-clean branch, applied as a single commit on top of current ggml-org/llama.cpp:master. Self-contained, generic, no TQ3 dependency.

Once it lands upstream, we can drop 14f90bf60 from the fork and shrink the patch series further. 0010 (fused MoE kernel) is the next Phase-1 candidate — happy to send that one too once you sign off.

…CONFLICTING state

# Conflicts:
#	dflash/deps/llama.cpp
#	dflash/scripts/server.py
- errors.cpp: set_last_error returns thread-local copy (race fix)
- quantize_gemma4_draft_q8.py: validate non-empty target_layer_ids
- server.py: log GGUF detection failures explicitly
- gemma4_target_loader.cpp: validate tok_embd_sz divisibility
- gemma4_target_loader.cpp: free out.buf on load failure paths
- CMakeLists.txt: gate pFlash on minimum CUDA arch, not first
- gemma4_dflash_graph.cpp: bounds-check kv_start + n_tokens
- test_mtp_loader.cpp: assert exact donor-layer mapping
- gemma4_mtp_graph.cpp: validate centroid-head invariants

Fixes 1-7 were already implemented in 5fb516d. This commit adds
the two remaining items:

Fix 8 (test_mtp_loader.cpp): strengthen Assertion 7 from a simple
bounds check [0,60) to an exact semantic check: each MTP layer's
donor_target_layer must equal 58 (last full-attn target layer, Dense
31B even-indexed) or 59 (last SWA target layer, odd-indexed) per
the loader's own documented contract, not just any value in range.

Fix 9 (gemma4_mtp_graph.cpp): add four GGML_ASSERT guards at the
top of the centroid-head construction block — n_vocab > 0,
n_centroids > 0, n_vocab % n_centroids == 0, and top_k in
[1, n_centroids] — preventing silent corruption or div-by-zero on
mismatched vocab sizes or out-of-range top_k values.
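
Spelled out, the Fix 9 guards read roughly as follows; the variable names follow the commit text above, and the exact expressions live at the top of the centroid-head block in gemma4_mtp_graph.cpp:

```cpp
#include "ggml.h"
#include <cstdint>

// Illustrative sketch of the centroid-head invariants (Fix 9).
static void check_centroid_head_invariants(int64_t n_vocab, int64_t n_centroids, int64_t top_k) {
    GGML_ASSERT(n_vocab > 0);
    GGML_ASSERT(n_centroids > 0);
    GGML_ASSERT(n_vocab % n_centroids == 0);          // each centroid owns an equal vocab slice
    GGML_ASSERT(top_k >= 1 && top_k <= n_centroids);  // top_k must stay within the centroid count
}
```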

Refs cubic comments: 3210041139, 3210041163, 3210041171, 3210041174,
3210041179, 3210041182, 3210041198, 3213492728, 3213492729.
@cubic-dev-ai

cubic-dev-ai Bot commented May 10, 2026

You're iterating quickly on this pull request. To help protect your rate limits, cubic has paused automatic reviews on new pushes for now—when you're ready for another review, comment @cubic-dev-ai review.

Previous commit e05afcd called setenv("GGML_CUDA_NO_VMM", "1")
at runtime, but that env-var is compile-time only — ggml's
CMakeLists.txt converts it to add_compile_definitions, and the
source guards on #if defined(GGML_USE_VMM), never on getenv.

This commit:
- Replaces the no-op setenv block with a runtime detection that
  emits a clear warning when <=24 GiB CUDA devices are seen and the
  binary was built without -DGGML_CUDA_NO_VMM=ON.
- Adds a small-VRAM build hint to dflash/README.md.

Refs cubic comment: 3214928808 (P1).
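
The runtime detection boils down to something like the sketch below. It is illustrative only: the actual threshold, message text, and the check for whether the binary was built with -DGGML_CUDA_NO_VMM=ON live in test_gemma4_dflash.cpp.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Illustrative: warn on small-VRAM devices. This can only warn; VMM itself is a
// compile-time choice (-DGGML_CUDA_NO_VMM=ON), which is the point of the fix above.
static void warn_if_small_vram() {
    int n_dev = 0;
    if (cudaGetDeviceCount(&n_dev) != cudaSuccess) return;
    for (int d = 0; d < n_dev; ++d) {
        cudaDeviceProp prop{};
        if (cudaGetDeviceProperties(&prop, d) != cudaSuccess) continue;
        const double gib = (double) prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0);
        if (gib <= 24.5) {
            fprintf(stderr,
                    "[warn] CUDA device %d has %.1f GiB; rebuild with -DGGML_CUDA_NO_VMM=ON "
                    "to avoid VMM pool fragmentation on small-VRAM cards\n", d, gib);
        }
    }
}
```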
Merge quantize_gemma4_draft_q8.py into quantize_draft_q8.py with
--arch {qwen,gemma4} flag. Auto-detects from config.json's model_type
when present, falls back to per-arch hardcoded defaults. Removes the
duplicate script (367 lines deleted, 233 added — net -134 lines).

Q8_0 scale formula and tensor-name mapping were already identical
between the two scripts, so the merge is a clean parameterisation
behind two _ARCH_DEFAULTS dicts and one model_type sniffer.

Per howard0su's review on PR Luce-Org#131 (r3214287342).
…sion

Howard0su explicitly removed this file from the codebase a few days ago,
noting that ggml's built-in convert (ggml_cpy with target dtype) does
the same thing. We re-added it as part of the gemma4 work; this commit
removes it again and replaces 10 call sites (5 distinct blocks: daemon
prefill, standard prefill, single-slot warmup, sliding re-prefill,
spec-decode commit) with copy_target_feat_bf16_to_f32() which uses
ggml_cpy(ctx, bf16_view, f32_view) — dispatched to ggml-cuda's native
BF16→F32 path in cpy.cu.

ggml_view_2d with byte offset handles the ring-buffer slot arithmetic
that was previously raw pointer math; ggml_backend_graph_compute
synchronises internally so cudaDeviceSynchronize is no longer needed.

Per howard0su's review on PR Luce-Org#131 (r3214289240).
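
A minimal sketch of the replacement pattern, assuming a BF16 feature ring buffer laid out [n_embd, n_slots] and an F32 destination of the same shape; the helper name and graph handling here are illustrative, not the exact call sites:

```cpp
#include "ggml.h"

// Illustrative: copy one ring-buffer slot from the BF16 feature buffer into an F32
// buffer via ggml_cpy; the CUDA backend lowers this to its native BF16->F32 cpy path.
static void copy_slot_bf16_to_f32(ggml_context * ctx, ggml_cgraph * gf,
                                  ggml_tensor * feat_bf16,   // [n_embd, n_slots], BF16
                                  ggml_tensor * feat_f32,    // [n_embd, n_slots], F32
                                  int64_t slot) {
    ggml_tensor * src = ggml_view_2d(ctx, feat_bf16,
            feat_bf16->ne[0], 1, feat_bf16->nb[1], slot * feat_bf16->nb[1]);
    ggml_tensor * dst = ggml_view_2d(ctx, feat_f32,
            feat_f32->ne[0], 1, feat_f32->nb[1], slot * feat_f32->nb[1]);
    // The byte offsets above replace the old raw-pointer slot arithmetic.
    ggml_build_forward_expand(gf, ggml_cpy(ctx, src, dst));
}
```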
Move 8 gemma4-specific test files into a dedicated subdirectory:
  test_gemma4_dflash.cpp
  test_gemma4_kv_tq3.cpp
  test_mtp_graph_shapes.cpp
  test_mtp_loader.cpp
  smoke_gemma4_draft_forward.cpp
  smoke_gemma4_target_forward.cpp
  smoke_load_gemma4_draft.cpp
  smoke_load_gemma4_target.cpp

dflash/test/test_flash_attn_sparse.cpp stays at the root (it's
generic, no gemma4 references).

dflash/CMakeLists.txt source paths updated; cmake configure passes
(ggml commit daef232a6 detected, build files regenerated).

Per howard0su's review on PR Luce-Org#131 (r3214292648).
Implements multi-token chained MTP speculative decoding for Gemma4
Dense 31B + assistant.  Previously γ=1 was a correctness gate that
ran the drafter after every target forward without batched-verify,
yielding *slower* throughput than no-MTP (25.28 vs 26.69 tok/s).
At γ=4, the chained path now amortizes drafter cost across K+1
batched target verifications.

Architecture matches Google's HF reference (modeling_gemma4_assistant.py):
- capture point post-final-RMSNorm of last layer (`gemma4_target_graph.cpp`)
- concat [token_emb, h_target] → pre_projection
- 4 drafter blocks with Q-only cross-attn into target's KV
- constant position_ids within a chain (empirically beats incrementing)
- autoregressive h_post → h_prev feedback between chain steps

CLI surface:
- --gamma N (1..16): chain length.  γ=1 routes through the existing
  correctness-gate loop unchanged.  γ>1 routes through the new path.
- --mtp-pos-mode {const,incr}: A/B falsification harness for the
  position-semantic question.  const is default (matches Google).
- γ>1 + stochastic sampling is currently fatal (--temp must be 0);
  Leviathan-style rescaling for stochastic γ>1 is out of scope here.

Phase 2 correctness fix: `cache.mtp_h_prev_row` parameterizes the
slice index in the capture site (`gemma4_target_graph.cpp:1106-1123`).
γ=1 leaves the sentinel -1 (n_tokens-1, current behavior).  γ>1 uses
approach A: a 1-token re-capture target forward when accept_drafts < K
to refresh the hidden state at the last accepted position.

Measured (Dense 31B + TQ3/TQ3 KV, 4K ctx, n_predict=256, greedy):
  M1 no-MTP            : 18.92 tok/s
  M3 γ=1               : 18.64 tok/s, accept 0.660
  γ=1 regression       : 19.04 tok/s, accept 0.660   (regression-free)
  γ=2 const            : 16.82 tok/s, accept 0.441   (chain overhead > savings)
  γ=4 const            : 20.75 tok/s, accept 0.478   (+9.7% over M1)
  γ=4 incr             : 20.59 tok/s, accept 0.464   (const empirically wins)

Approach A's per-chain re-capture costs ~40 ms; switching to approach B
(multi-row mtp_h_prev capture in one verify pass, no re-capture) is
projected to push γ=4 to ~32 tok/s = 1.7× M1.  Tracked separately.

Evidence logs in .sisyphus/notes/gemma4-baseline/mtp-gamma/.
Submodule pin moves from daef232a6 to 6715acf13 (TQ3 debug printf removal).
Plan: /home/peppi/.claude/plans/wild-growing-ember.md
Sweep matrix on Dense 31B + TQ3/TQ3 KV, γ ∈ {1,2,4,8}, ctx ∈ {4K,16K,64K}.

Decode tok/s:
              no-MTP   γ=1     γ=2     γ=4     γ=8
  4K         19.63   19.18   20.21   16.54   16.54
  16K        12.99   10.20    8.40    5.37    6.86
  64K         6.55    5.54    8.42    6.54    5.33

Three headline findings:
1. γ=2 wins at both 4K (+3%) and 64K (+29%) over no-MTP.
2. Long-context MTP accept rate is HIGH (0.69–0.73 at 64K), contradicting
   the pre-rebase 0.02 figure documented in OPEN_QUESTIONS.md.  The
   submodule Bug-2 fix (daef232a6) plus the γ>1 chain produces the high
   accept regime.  OPEN_QUESTIONS.md should be updated.
3. 16K is a "dead zone" — every MTP config loses to no-MTP at 16K.
   γ=2 accept drops to 0.33 there.  Worth investigating separately.

γ=4 and γ=8 collapse at 64K because approach A's re-capture single-token
target forward reads ~64K KV slots (~80 ms cost), paid per chain with
partial accept.  Approach B (multi-row mtp_h_prev capture, no re-capture)
is projected to flip γ=4 and γ=8 from losses to wins at long context.
That's the next implementation step.

User-facing recommendation table (until approach B lands):
  ≤8K     : --draft-method mtp --gamma 2
  8K–32K  : no MTP
  ≥32K    : --draft-method mtp --gamma 2

Logs: .sisyphus/notes/gemma4-baseline/mtp-gamma/phase4/
Script: .sisyphus/notes/gemma4-baseline/mtp-gamma/run_sweep.sh
Replaces the per-chain re-capture single-token target forward (approach A)
with a batched capture into mtp_h_prev_batch [n_embd, 17] during the verify
call.  Host-side picks column accept_drafts after greedy match and copies
it into mtp_h_prev for the next MTP chain.

Result: the re-capture cost was approach A's dominant per-chain overhead
at long context (~80 ms at 64K, ~30 ms at 16K).  Removing it flips the
γ × ctx matrix decisively in favor of MTP at every context.

Decode tok/s sweep (Dense 31B + TQ3/TQ3 KV, n_predict=64, --temp 0):

              no-MTP   γ=1     γ=2     γ=4     γ=8
  4K    A     19.63   19.18   20.21   16.54   16.54
  4K    B     18.40   18.46   25.10   25.58   22.61    γ=4: +39% over no-MTP
  16K   A     12.99   10.20    8.40    5.37    6.86
  16K   B     12.43    9.79   12.56   13.06    9.33    γ=4: +5%   (was -35%)
  64K   A      6.55    5.54    8.42    6.54    5.33
  64K   B      6.26    5.31   10.07    8.29    7.58    γ=2: +61%  (was +29%)

The "16K dead zone" documented in Phase 4-A RESULTS.md was entirely the
re-capture overhead; approach B eliminates it.  γ=4 at 16K now slightly
beats no-MTP.

γ=1 path is regression-safe: 18.46 tok/s vs the 18.64 M3 baseline (−1%,
within noise), accept_rate identical at 0.66.  γ=1 never sets
mtp_h_prev_capture_mode = 1, so the batch-capture branch is dead code on
that path.

Files:
- dflash/src/internal.h: add mtp_h_prev_batch tensor + mtp_h_prev_capture_mode
- dflash/src/gemma4_target_graph.cpp: branch in capture site, write all
  n_tokens rows when mode==1 instead of slicing one
- dflash/test/gemma4/test_gemma4_dflash.cpp: allocate mtp_h_prev_batch,
  set capture_mode=1 around the γ>1 verify call, replace re-capture with
  21 KB host-side staging copy

User-facing recommendation (revised):
  ≤8K       --draft-method mtp --gamma 4   ≈ 25 tok/s
  8K–32K    --draft-method mtp --gamma 4   ≈ 13 tok/s
  32K+      --draft-method mtp --gamma 2   ≈ 10 tok/s at 64K

Evidence:
- .sisyphus/notes/gemma4-baseline/mtp-gamma/phase3p5/ — three acceptance tests
- .sisyphus/notes/gemma4-baseline/mtp-gamma/phase4-b/ — full re-sweep + RESULTS.md
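
The host-side staging copy amounts to roughly the following; n_embd, accept_drafts and the two tensors are assumed to exist as described above, with mtp_h_prev_batch stored as a contiguous F32 [n_embd, gamma+1] matrix (a sketch, not the exact harness code):

```cpp
#include "ggml.h"
#include "ggml-backend.h"
#include <vector>

// Illustrative: after the batched verify pass, pick the column for the last accepted
// position and stage it into mtp_h_prev for the next chain (~n_embd * 4 bytes of host traffic).
static void stage_h_prev(ggml_tensor * mtp_h_prev_batch,  // [n_embd, gamma+1], F32
                         ggml_tensor * mtp_h_prev,        // [n_embd],          F32
                         int64_t n_embd, int64_t accept_drafts) {
    std::vector<float> staging(n_embd);
    // column `accept_drafts` starts accept_drafts * n_embd floats into the batch tensor
    ggml_backend_tensor_get(mtp_h_prev_batch, staging.data(),
                            accept_drafts * n_embd * sizeof(float),
                            n_embd * sizeof(float));
    ggml_backend_tensor_set(mtp_h_prev, staging.data(), 0, n_embd * sizeof(float));
}
```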
Matching update to the public Pages bundle just pushed.  The site's
prior claim that "MTP is uneconomic at 64K, ≤ 2% accept, open
investigation" was pre-rebase noise.  Post-rebase + γ>1 chain, the
real numbers are:
  4K  γ=4: 25.58 tok/s (+39% over no-MTP)
  16K γ=4: 13.06 tok/s (resolved the dead zone)
  64K γ=2: 10.07 tok/s (+61% over no-MTP, 0.73 accept rate)

Sections affected: summary coda, M3/V2 0.02-accept rows, hard-failures
list, timeline (May 10 eve + May 11 AM), Open Questions (split into
resolved vs still-open), new §γ>1 MTP, TOC entries.
…ted claims

Single source of truth EVIDENCE.md + site rewrite based on one-variable-at-a-
time (OVAT) measurements at the MoE 64K code reference cell, plus the Q8 KV-
ceiling sweep on MoE 128K-512K and Dense 32K-64K, plus the triple-falsification
of MoE 1M DFlash (cold-fresh = 26 ± 5 tok/s, not 4.86 as the sweep-churn
outlier suggested).

Findings that survived rigour
- Q8 + pflash + DFlash dm=4 is the MoE long-ctx winner: 60-67 tok/s, 6-8 J/tok
  at 128K-512K.
- TQ3 only earns its place above 512K (Q8 pages at 1M).
- MTP γ=2 is the right drafter for Dense >=32K where DFlash would breach VRAM.

Inflated claims walked back (cited in EVIDENCE.md §2.2)
- "γ=2 at MoE 64K = +61% over no-MTP" — was vs TQ3-handicapped baseline.
  Vs naive Q8: TIED (-0.6%).  DFlash dm=8 is actual winner (+49%).
- "TQ3 is 2.5× faster than Q8 at Dense 64K" — conflated MoE OVAT with Dense.
  Withdrawn.
- "DFlash collapses at MoE 1M" — sweep-churn artifact, cold = 26 ± 5 tok/s.
- "MTP uneconomic at 64K, <=2% accept" (v1 dossier) — pre-Bug-2 noise.
  Post-rebase: 0.55-0.78 accept.

Scope
- 414 file changes: EVIDENCE.md + EVIDENCE.v1.md.bak archival + site/index.html
  rewrite + 14 sweep harnesses + sweep dirs covering OVAT, Q8 ceiling,
  triple-falsification, paired-prompt code/prose matrices, MoE scientific
  sweep with energy telemetry.

Public site at dusterbloom/gemma4-evidence already updated (commit e1310b6).
This commit lands the matching state on the PR branch.
…r history)

Upstream rebased feature/tq3-kv-cache onto a clean linear history and named
it feature/tq3-kv-cache-clean.  Same effective state — top commit is still
"chore(fattn): remove stray TQ3-DEQ debug printf in chunked dequant" — just
with a tighter base.
# Conflicts:
#	dflash/CMakeLists.txt
#	dflash/scripts/server.py
Joule integration over paired-matrix/power/*.csv during the decode window
gives 33.79 J/tok at 64K and 34.50 J/tok at 128K. Cells were previously
marked "awaiting" pending power instrumentation; the paired-matrix sweep
captured the watt traces.

The MoE 1M DFlash Q8-pages cell remains "awaiting" — the most recent
measurement was on TQ3/TQ3 (39.91 J/tok decode), not Q8 pages, so the
row's labeled config is still unmeasured.
The .sisyphus/ tree (sweep logs, power CSVs, EVIDENCE drafts, site/index.html,
plans, journey notes) was inflating PR diff to 91,927 lines across 486 files.
The dossier is preserved off-repo at ~/lucebox-hub-evidence/feature-gemma4-support-*/
and is now gitignored so future sweep runs do not re-track it.

Net effect: PR drops from 512 files / +91,927 lines to ~24 files of actual code
(dflash src/test/scripts, CMakeLists, README, submodule pin, .gitignore).