
mtp: prefix-cache WARM hit (perfect + partial via range-warm) #221

Open
dusterbloom wants to merge 12 commits into Luce-Org:main from dusterbloom:feat/mtp-prefix-warm-ghost


@dusterbloom
Contributor

Stacks on top of #214 (feat/mtp-via-daemon). The two new commits in this PR are:

What's in the WARM-hit commit

Plugs MTP requests into the existing server.py prefix-cache protocol so agent loops with repeated prefixes skip the cold prefill + warm path.

  • Inline snap ack (common/prefix_snap.h) — single source for [snap] inline slot=N cur_pos=M. DFlash do_prefill and the MTP orchestrator both funnel through emit_inline_snap_ack so the format can't drift.
  • Mid-prompt snap in the MTP orchestrator — chunked prefill clips to snap_at (server's prepare_inline_snap picks the second-to-last <|im_start|> boundary, almost always mid-prompt). Partial warm_head_kv fills slots [1..snap_at], snapshot_save fires through ModelBackend, ack emits. Full warm runs after to cover [1..prompt_len).
  • Snapshot captures h_{snap_pos-1} via target->last_hidden() + a shape contract (γ / n_head_kv / n_ctx / n_embd). Restore rejects mismatches before touching state.
  • INativeMtp::warm_head_kv_range — same warm graph as warm_head_kv, caller-controlled slot_start, used by partial-WARM restore.
  • restore_and_generate unifies perfect-WARM and partial-WARM:
    • Perfect-WARM (prompt_len == snap_pos): restore head_kv, no prefill.
    • Partial-WARM (prompt_len > snap_pos): restore head_kv, prefill delta, range-warm [snap_pos..prompt_len] using pre_warm_hidden + delta_hiddens. Slot snap_pos is overwritten with the new request's prompt[snap_pos] so cross-request first-after-cut divergence cannot corrupt decode.
    • Contract failures fall through to a cold-restart fallback that discards the snapshot's head_kv and runs full cold prefill — byte-correct against the chain runner's [0..prompt_len) requirement.
  • Thin head_kv snapshot[0..head_kv_pos+1] slice instead of full [key_len, n_ctx, n_head_kv]. ~10× payload reduction at max_ctx=65536.
  • Lazy GPU head_kv alloc — grows from 8K initial slots to n_ctx_max on the first warm that needs more (saves ~256 MiB VRAM on a daemon serving short prompts).
  • Qwen35Backend::init_mtp_ passes cfg_.device.max_ctx to the module so head_kv tracks the backbone (previous hardcoded 8192 broke any prompt > 8K).
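
The three restore outcomes described above can be sketched as a small decision function. This is an illustrative sketch only, not the PR's restore_and_generate: `ShapeContract` and `plan_restore` are hypothetical names, only the shape part of the contract is modeled, and the handling of a prompt shorter than the snapshot is an assumption the PR text does not cover.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShapeContract:
    """Hypothetical stand-in for the snapshot's shape contract."""
    gamma: int
    n_head_kv: int
    n_ctx: int
    n_embd: int


def plan_restore(prompt_len, snap_pos, snap_contract, live_contract):
    """Return which restore path a request would take (illustrative only)."""
    # Contract mismatch: reject before touching state, then cold-restart.
    if snap_contract != live_contract:
        return "cold"
    if prompt_len == snap_pos:
        return "perfect-warm"  # restore head_kv, no prefill
    if prompt_len > snap_pos:
        return "partial-warm"  # restore head_kv, prefill delta, range-warm
    # Assumption: a prompt shorter than the snapshot falls back to cold.
    return "cold"
```

For example, `plan_restore(150, 100, c, c)` with identical contracts yields `"partial-warm"`, while any contract difference yields `"cold"` regardless of lengths.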

Tests

  • 4/4 test_prefix_cache_mtp (perfect bit-equal round-trip, prefill_next mismatch fallback, int32 boundary, shape-contract rejection).
  • 4/4 test_common_mtp_orchestrator (unchanged from #214).
  • Full cmake --build green.
  • Harness probe 7/7 at --max-ctx 65536 (claude_code, codex, opencode, openwebui, pi, hermes, openclaw). Zero chain failures, zero ack-missing warnings, zero capacity overflows.

Bench

dflash/bench/bench_mtp_warm_hit_2turn.py is a 2-turn warm-hit gate that drives a live daemon: prefill_ratio < 0.3 + bit-equal first 16 tokens between turns.
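
A minimal sketch of what such a gate could check is shown below. It is not the code in bench_mtp_warm_hit_2turn.py; `warm_hit_gate` and its parameters are hypothetical, and it assumes prefill_ratio means turn-2 prefill time divided by turn-1 prefill time, which the description does not spell out.

```python
def warm_hit_gate(prefill_s_turn1, prefill_s_turn2,
                  tokens_turn1, tokens_turn2,
                  max_ratio=0.3, n_compare=16):
    """Return True when turn 2 looks like a warm hit (illustrative only)."""
    if prefill_s_turn1 <= 0:
        raise ValueError("turn-1 prefill time must be positive")
    # Assumption: prefill_ratio = warm (turn 2) / cold (turn 1) prefill time.
    prefill_ratio = prefill_s_turn2 / prefill_s_turn1
    # Bit-equal check on the first n_compare generated token ids.
    same_head = tokens_turn1[:n_compare] == tokens_turn2[:n_compare]
    return prefill_ratio < max_ratio and same_head
```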

Follow-ups (not in this PR)

  • Tree-MTP arena (B≥2 sibling drafts) is being prototyped on a separate branch.


@cubic-dev-ai cubic-dev-ai Bot left a comment


4 issues found across 34 files


Comment thread dflash/src/common/gguf_mmap.h
Comment thread dflash/src/common/gguf_mmap.h
Comment thread dflash/bench/bench_mtp_warm_hit_2turn.py
Comment thread dflash/scripts/server.py
@dusterbloom dusterbloom force-pushed the feat/mtp-prefix-warm-ghost branch from 0170c40 to 0d4a531 Compare May 18, 2026 16:05
Ports the Qwen3.6 MTP head onto the qwen35 backbone (same arch, NextN
block at layer n_layer-1). Speculation runs through a new common chain
runner; the existing DFlashTarget adapter handles verify/snapshot/restore.

- common/mtp_interface.h: flavor-tagged IMtpModule + INativeMtp /
  IExternalDrafterMtp mixins. Future Gemma4 drafter plugs in via
  IExternalDrafterMtp without touching the chain runner.
- common/mtp_chain_runner.{h,cpp}: γ-chain propose/verify/accept loop,
  hoisted out of the backend. Three KV-reconciliation paths
  (accept-all / fast rollback / recommit) share a single post-iter
  invariant so AR equivalence holds under recommit.
- common/mtp_orchestrator.{h,cpp}: chunked prefill + warm + dispatch
  to chain runner. Owns only control flow; all compute lives in
  DFlashTarget::verify_batch and INativeMtp::step_batch graphs on the
  backend device.
- qwen36/qwen36_mtp.{h,cpp,_graph.cpp,_loader.cpp}: GGUF tensor
  inventory for Qwen3.6 -MTP-GGUF, GPU warm graph, GPU step graph
  cached on (head_idx, fa_window, fused_lm_head, topk_k). γ is bound
  at attach time as the single source of truth.
- qwen35: supports_mtp()/mtp() exposed through ModelBackend;
  generate() delegates to common::mtp::warm_and_decode when MTP is
  configured. Cache sized for max(γ+1, ddtree_budget+1) verify tokens.
- server.py: --mtp-gguf and --mtp-gamma flags routed through; daemon
  command surface unchanged.
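
For readers unfamiliar with the propose/verify/accept loop, the accept step of generic greedy speculative decoding can be sketched as follows. This is the textbook greedy rule, not the PR's chain runner; `accept_chain` is a hypothetical name and the KV-reconciliation paths are not modeled.

```python
def accept_chain(draft_tokens, target_tokens):
    """Greedy accept for one speculation round (illustrative only).

    draft_tokens:  the gamma tokens proposed by the drafter.
    target_tokens: the target's greedy token at each of the gamma + 1
                   verified positions.
    Returns (n_accepted, next_token).
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("expected gamma + 1 target tokens")
    n_accepted = 0
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted != verified:
            break
        n_accepted += 1
    # On a mismatch this is the target's correction; after a full accept
    # it is the extra token from the last verified position.
    return n_accepted, target_tokens[n_accepted]
```

Because the emitted tokens are always the target's own greedy choices, the output matches plain autoregressive decoding as long as the KV state is reconciled to the accepted length afterwards.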

Tests: 4/4 test_common_mtp_orchestrator. Full build green; harness probe
7/7 (claude_code, codex, opencode, openwebui, pi, hermes, openclaw) at
--max-ctx 65536; MTP decode reports accept_rate 0.43-0.88 on short
agentic prompts.
Plugs MTP requests into the existing server.py prefix-cache protocol so
agent loops with repeated prefixes skip the cold prefill + warm path.

- common/prefix_snap.h: single source for the inline snap ack format
  ("[snap] inline slot=N cur_pos=M") that server.py's bus.await_reply
  matches. DFlash do_prefill and the MTP orchestrator both funnel
  through emit_inline_snap_ack so the format can't drift.
- mtp_orchestrator: chunked prefill clips to snap_at when the server
  requests a mid-prompt cut (prepare_inline_snap picks the
  second-to-last <|im_start|> boundary). Partial warm_head_kv fills
  slots [1..snap_at], snapshot_save fires through ModelBackend, ack
  emits. Full warm runs after to cover [1..prompt_len). End-of-prompt
  snap supported but rarely picked by the server.
- snapshot_save captures h_{snap_pos-1} via target->last_hidden() and a
  shape contract (γ / n_head_kv / n_ctx / n_embd). Restore rejects
  mismatches before touching state.
- INativeMtp gains warm_head_kv_range(prompt, n_prompt, start_slot,
  n_chunk, prefill_next, hiddens) — same warm graph as warm_head_kv,
  with caller-controlled slot_start so partial-WARM restore can fill
  [snap_pos..prompt_len] using pre_warm_hidden + delta_hiddens.
- restore_and_generate unifies perfect-WARM and partial-WARM under a
  single tok/shape/hidden contract. Perfect-WARM (prompt_len ==
  snap_pos): restore head_kv, no prefill. Partial-WARM (prompt_len >
  snap_pos): restore head_kv, prefill delta (kv_offset=snap_pos),
  range-warm [snap_pos..prompt_len]. Slot snap_pos is overwritten with
  the new prompt[snap_pos] so cross-request first-after-cut divergence
  cannot corrupt decode. Contract failures fall through to a
  cold-restart fallback that discards the snapshot's head_kv and runs
  full cold prefill — byte-correct against the chain runner's
  [0..prompt_len) requirement.
- qwen36_mtp: thin head_kv snapshot ([0..head_kv_pos+1] slice instead
  of full [key_len, n_ctx, n_head_kv]); ~10× payload reduction at
  max_ctx=65536. Lazy GPU alloc grows head_kv tensors from 8K initial
  to n_ctx_max on first warm that needs more (saves ~256 MiB VRAM on a
  daemon serving short prompts). F16 dtype guard on snapshot+restore.
- qwen35_backend: head_kv_warm_ flag gates snapshot_save's MTP capture
  so a pre-warm snapshot can't round-trip as valid. init_mtp_ passes
  cfg_.device.max_ctx to the module so head_kv tracks the backbone
  (previous hardcoded 8192 broke any prompt > 8K).
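
The ack format quoted above is simple enough to round-trip in a few lines. The sketch below is a Python illustration of the string format only; `format_snap_ack` and `parse_snap_ack` are hypothetical helpers and do not reproduce emit_inline_snap_ack or the matching logic in server.py.

```python
import re

# Pattern for the ack line quoted in the commit message.
SNAP_ACK_RE = re.compile(r"\[snap\] inline slot=(\d+) cur_pos=(\d+)")


def format_snap_ack(slot, cur_pos):
    """Build the ack line from one place so producers cannot drift."""
    return f"[snap] inline slot={slot} cur_pos={cur_pos}"


def parse_snap_ack(line):
    """Return (slot, cur_pos) if the line carries an ack, else None."""
    match = SNAP_ACK_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```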

Tests: 4/4 test_prefix_cache_mtp (perfect bit-equal round-trip, prefill
mismatch fallback, int32 boundary, shape-contract rejection) + 4/4
test_common_mtp_orchestrator. Full build green. Harness probe 7/7 at
--max-ctx 65536; zero chain failures / ack-missing / capacity overflows.

R3 integration bench (dflash/bench/bench_mtp_warm_hit_2turn.py) drives
a 2-turn warm-hit gate against a live daemon: prefill_ratio < 0.3 +
bit-equal first 16 tokens between turns.
`n_layer = block_count - nextn_predict_layers` is correct for backbone-graph
iteration and the divisibility check, but `plan.layer_end` was also defaulting
to this reduced value — silently filtering blk.{n_layer}.* out of the GPU
load. The MTP loader's `find_tensor(meta_ctx, ...)` then resolved the
descriptor with `data==nullptr` and failed with "14 required NextN tensor(s)
missing".

Fix: default `plan.layer_end` to `n_block_raw` so MTP head blocks are loaded
alongside backbone. Validation upper-bound widened to match. No-op for
non-MTP GGUFs where `nextn_predict_layers=0` (n_block_raw == n_layer).
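
The effect of the default can be illustrated with a short sketch. It assumes `layer_end` is an exclusive upper bound starting from block 0, which the commit message does not state; `gpu_loaded_blocks` and the `fixed` switch are hypothetical and exist only to contrast the two defaults.

```python
def gpu_loaded_blocks(block_count, nextn_predict_layers, fixed=True):
    """Return (n_layer, blocks selected for GPU load). Illustrative only."""
    n_block_raw = block_count
    # Backbone layer count: correct for graph iteration.
    n_layer = block_count - nextn_predict_layers
    # Assumption: layer_end is an exclusive bound over block indices.
    layer_end = n_block_raw if fixed else n_layer
    return n_layer, range(0, layer_end)
```

With one NextN layer, the old default leaves block index `n_layer` out of the load set and the new default includes it; with `nextn_predict_layers=0` both defaults select the same blocks.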
`n_last_chunk = committed % PREFILL_UBATCH` only equals the last prefill
chunk's actual size when prefill started at kv_offset=0. With prefix-cache
partial restore, `restore_and_generate` runs delta-prefill from kv_offset>0,
so the last chunk's `n_tokens` is `prompt_len - kv_offset`, not the modulo
of `committed` over PREFILL_UBATCH. The read offset was then larger than
sg_.argmax_tokens->ne[0], firing the "tensor read out of bounds" assert
on the first DFlash spec-decode request against any prompt the cache had
already seen.

Read the actual last-chunk size from sg_.argmax_tokens->ne[0], which the
graph builder sized to match the bound chunk. No-op when kv_offset==0
(`committed % UBATCH == ne[0]`).
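
The arithmetic behind the mismatch can be checked with a small sketch. `prefill_chunk_sizes` is a hypothetical helper that assumes prefill simply walks the uncached tokens in chunks of at most `ubatch`; the value of PREFILL_UBATCH is not given in the commit message, so it is a parameter here.

```python
def prefill_chunk_sizes(prompt_len, kv_offset, ubatch):
    """Chunk sizes for a prefill that starts at kv_offset (illustrative)."""
    remaining = prompt_len - kv_offset
    sizes = []
    while remaining > 0:
        n_tokens = min(ubatch, remaining)
        sizes.append(n_tokens)
        remaining -= n_tokens
    return sizes


# Cold prefill: the modulo matches the last chunk.
cold = prefill_chunk_sizes(prompt_len=1000, kv_offset=0, ubatch=512)
# Delta prefill after a partial restore: the modulo no longer matches.
warm = prefill_chunk_sizes(prompt_len=1000, kv_offset=900, ubatch=512)
```

Here `cold` ends with a chunk whose size equals `1000 % 512`, while `warm` is a single 100-token chunk, so a read offset derived from the modulo would point past the end of a 100-element tensor.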
@dusterbloom dusterbloom force-pushed the feat/mtp-prefix-warm-ghost branch from 0d4a531 to 5e7594c Compare May 19, 2026 08:47
When --prefill-skip-park is set, compress_text_via_daemon correctly skips
its own `park target` / `park draft` sends, but the C++ handle_compress
parses an independent `nopark` trailing token and parks target+draft
itself when it's absent. The two paths were out of sync: Python honored
skip_park but the daemon ignored it.

This is a no-op for the PFlash-only path (unpark target rebinds nothing
that has stale references). But the MTP path holds tensor pointers into
the backbone's ggml_context across requests — the internal park frees
those tensors, and the immediate unpark rebuilds the context with new
addresses, which makes the MTP graph crash with
GGML_ASSERT(ggml_can_repeat(b, a)) on the next forward.

Pass " nopark" through to the C++ command when skip_park is true, so
both layers agree.
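
A sketch of the agreement between the two layers follows. The real command string built by compress_text_via_daemon is not shown in this PR text, so `base_command` is a placeholder, and both function names are hypothetical; only the trailing " nopark" token comes from the commit message.

```python
def build_compress_command(base_command, skip_park):
    """Append the trailing marker when the Python side skips parking."""
    return base_command + (" nopark" if skip_park else "")


def daemon_will_park(command):
    """Mimic the daemon rule: park unless the last token is 'nopark'."""
    tokens = command.split()
    return not (tokens and tokens[-1] == "nopark")
```

With the marker propagated, `daemon_will_park(build_compress_command(cmd, skip_park))` is the negation of `skip_park` for any command that does not already end in the marker.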
Extends yesterday's 2026-05-17_f031f08 matrix with the workload classes it
did not cover:

  • Agent suite (2k/8k/24k buckets) across 4 configs (MTP+DFlash × q8/tq3 KV)
    plus the stacked PFlash+MTP+TQ3 path. MTP wins agent (53.98) by 4-14%
    over DFlash because the small drafter's accept collapses from 70-90% on
    code/math to 28-29% on chat/tool-use prompts; MTP's accept stays at 0.69.

  • PFlash + MTP + TQ3 stack verified end-to-end for the first time:
    36K NIAH passes in 22.8s wall (20.3× compression, decode 52 tok/s, needle
    recall correct) on a single 3090. All 7 OpenAI-compatible clients
    (claude_code, codex, hermes, openclaw, openwebui, opencode, pi) pass
    every probe against the stacked server.

  • he/gsm/math reproduced under today's branch — DFlash AL is unchanged
    from f031f08 (speculator is healthy); absolute tok/s are lower because
    today's bench harness wraps the full streaming HTTP response while
    f031f08's parsed the daemon's internal tok/s timer.

Three branch fixes that made the matrix runnable:
  - 230c303 MTP loader: include NextN block in GPU load by default
  - 5e7594c do_spec_decode: use sg_.argmax_tokens->ne[0] for last-chunk size
  - af05a23 prefill_hook: propagate skip_park to daemon compress (MTP+PFlash)

Raw per-suite JSONs included alongside summary.md.
…fter-arch protocol

Option A from thoughts/2026-05-19/dual-speculator-architecture.md.

Backend keeps BOTH speculators resident when --draft and --mtp-gguf are
both provided. GenerateRequest grows a `speculator` field (string):
  - "dflash" forces the DFlash drafter + DDTree spec-decode path
  - "mtp"    forces the MTP γ-chain native-heads path
  - "auto"  (default) selects based on prompt size

Rationale for the auto threshold (4096 tokens), pulled from today's bench:
  - DFlash beats MTP 1.75–2.6× on code/math under ~16K ctx (HE 173 vs 64,
    Math 115 vs 61, GSM 102 vs 59 tok/s on bench_matrix)
  - MTP beats DFlash on agent prompts (DFlash drafter accept rate 0.29
    vs MTP 0.69)
  - MTP beats DFlash 2.4–5.7× on PFlash-compressed long context (DFlash
    drafter accept collapses to 0.14–0.21 on gapped compressed sequences;
    MTP heads consume backbone hidden states and are unaffected)
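
The routing rule can be sketched as below. `select_speculator` is a hypothetical name, and the behavior at exactly 4096 tokens is an assumption (routed to MTP here), since the commit message only describes prompts below and above the threshold.

```python
AUTO_THRESHOLD_TOKENS = 4096


def select_speculator(requested, n_prompt_tokens,
                      threshold=AUTO_THRESHOLD_TOKENS):
    """Resolve the `speculator` request field to a concrete path."""
    if requested in ("dflash", "mtp"):
        return requested  # explicit choice always wins
    if requested != "auto":
        raise ValueError(f"unknown speculator: {requested!r}")
    # Assumption: the boundary value itself goes to MTP.
    return "dflash" if n_prompt_tokens < threshold else "mtp"
```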

VRAM on a 24 GB 3090 with dual-load: ~19.9 GB (15.3 target + 1 DFlash
drafter + 0.5 MTP heads + 2 KV TQ3 + 1 activations). Fits.

Boot banner now prints `[speculators] dual-mode: DFlash+MTP both loaded
(auto_select threshold=4096 tokens)` when both are configured.

Same commit also extends the PFlash compress protocol to carry an
optional drafter_arch token so the server can route `qwen3-0.6b` vs
`qwen35-0.8b` per request. New CLI flag --prefill-drafter-arch on the
Python side; daemon's handle_compress parses an optional 4th positional
token between drafter_gguf and the trailing "nopark" marker. Backward
compatible: omitting the token defaults to qwen3-0.6b, which matches the
previous hard-coded behavior.
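
The optional-token parsing could look roughly like the sketch below. It is written in Python for illustration even though handle_compress lives in the C++ daemon; `parse_compress_tail` is a hypothetical helper that receives only the whitespace-split tokens following drafter_gguf, because the earlier positional tokens are not described here.

```python
DEFAULT_DRAFTER_ARCH = "qwen3-0.6b"


def parse_compress_tail(tokens):
    """Return (drafter_arch, nopark) from the tokens after drafter_gguf."""
    tokens = list(tokens)
    nopark = bool(tokens) and tokens[-1] == "nopark"
    if nopark:
        tokens.pop()
    if len(tokens) > 1:
        raise ValueError("unexpected extra tokens: %r" % tokens)
    # Omitting the token keeps the previous hard-coded behavior.
    arch = tokens[0] if tokens else DEFAULT_DRAFTER_ARCH
    return arch, nopark
```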

Smoke tested:
  - speculator=dflash → DFlash path confirmed via [spec-decode] log line
  - speculator=mtp    → MTP path via [mtp_decode] log line
  - speculator=auto on 16-token prompt → routed to DFlash (< 4096)
  - speculator=auto on 5415-token prompt → routed to MTP (> 4096)
  - All 9 existing MTP test binaries still pass
bench_matrix.py + the matrix/ subpackage were added in dflash commit
f031f08 (bench: matrix orchestrator + power sweep + DFlash optimality
audit) but landed on a feature branch that never made it to main, so
they're absent from our HEAD. Restored verbatim from f031f08 (with the
HumanEval/GSM8K/Math500 workloads from follow-up commit 59cd0fa) so
today's apples-to-apples re-runs against yesterday's matrix have an
in-tree harness.

This is the orchestrator that drives test_dflash directly (positional
args, no server), parses the daemon's own decode-only tok/s figure
([dflash] generated N tokens in T s -> X tok/s), and emits per-cell
JSON artifacts with bootstrap CI 95% (1000 resamples, seed=42).
Subsumes the older scattered bench scripts (bench_agent.py,
bench_agent_mtp.py, bench_llm.py) for cross-comparison.
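
For reference, a seeded bootstrap confidence interval over per-run tok/s samples can be sketched as follows. This uses the percentile method, which is an assumption: the commit message gives only the resample count and seed, not the variant bench_matrix.py implements, and `bootstrap_ci95` is a hypothetical name.

```python
import random
import statistics


def bootstrap_ci95(samples, n_resamples=1000, seed=42):
    """Percentile-bootstrap 95% CI of the mean (illustrative only)."""
    if not samples:
        raise ValueError("need at least one sample")
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(samples)
    means = sorted(
        statistics.fmean(rng.choices(samples, k=n))
        for _ in range(n_resamples)
    )
    # Integer index math avoids float rounding at the percentile cuts.
    lo_idx = (n_resamples * 25) // 1000
    hi_idx = (n_resamples * 975) // 1000 - 1
    return means[lo_idx], means[hi_idx]
```

Seeding a private `random.Random` instance rather than the module-level generator keeps the interval stable even if other code draws random numbers in between.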

Layout:
  bench_matrix.py                    — entry point
  render_matrix.py                   — markdown summary writer
  matrix/workload.py                 — Workload base
  matrix/speculator.py               — Speculator base
  matrix/speculators/{ar,dflash,mtp}.py
  matrix/workloads/{humaneval,gsm8k,math500,swe_bench}.py

Usage:
  python3 dflash/scripts/bench_matrix.py \
    --workloads humaneval,gsm8k,math500 \
    --speculators ar,dflash_b22,mtp_d3 \
    --n-gen 256 --n-runs 8 --n-sample 8
…ary update

Three new run directories plus a yesterday-vs-today comparison snapshot:

  2026-05-17T17-40-56_f031f08/  — preserved yesterday's reference matrix
    (n_sample=8, n_runs=8, bootstrap CI 95%). DFlash b22: HE 169.40,
    GSM 104.32, Math 119.36 tok/s. MTP d3: 65.62 / 61.00 / 61.89. AR
    baseline ~34 tok/s across suites. Was untracked in tree because the
    bench_matrix orchestrator landed on a stale branch.

  2026-05-19T11-43-13_83e19d9/  — first matrix re-run on HEAD (HE only,
    MTP_GGUF env unset → mtp_d3 cell empty). DFlash b22: 173.81 tok/s.
    Confirms no DFlash kernel regression vs f031f08.

  2026-05-19T11-54-32_83e19d9/  — full apples-to-apples re-run on HEAD
    with MTP_GGUF set. Result: all 9 cells (3 suites × {AR, DFlash b22,
    MTP d3}) within ±5% of f031f08 mean tok/s. DFlash HE +2.6%, MTP HE
    −2.0%, etc. No regression.

  2026-05-19_mtp-prefix-warm-ghost/summary.md  — updated with the
    apples-to-apples table above, an agent bucket-label-vs-actual-token
    audit (agent_24k prompts are actually ~2.6K), a known-gaps section
    documenting what is NOT yet tested (real CLI agentic loops, NIAH
    > 131K, concurrent sessions, sustained throughput, PR Luce-Org#195 merge).
…resume plan

Three working docs that capture what we learned shipping Option A:

  dual-speculator-architecture.md — full pipeline diagrams (PFlash →
    target prefill → auto_select → DFlash/MTP → verify → stream),
    component cost matrix on RTX 3090, workload-to-config cheat sheet,
    synergy/conflict grid (the one real conflict is DFlash drafter
    accept collapsing on PFlash-compressed prompts), recommended 3090
    default for agentic coding via hermes/opencode/pi.

  pflash-drafter-unification-plan.md — experiment plan for "can we
    use a smaller / different / shared PFlash drafter." Today's E1
    swap to Qwen3.5-0.8B blocked on two loader gaps (DFlash-draft
    variant strips lm_head; full base GGUF has tied embeddings the
    target loader rejects). Documents the structural finding that a
    DFlash draft and a PFlash scorer can't be the same file by
    construction.

  dual-monster-resume-plan.md — milestone-based roadmap (M1..M8)
    organising what's committed, what's uncommitted, what's tested,
    and what to ship next. M1 (this commit set) → M2 real CLI
    sessions → M3 PR → M4 mid-stream Option B → ...

@cubic-dev-ai cubic-dev-ai Bot left a comment


1 issue found across 48 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="dflash/scripts/bench_matrix.py">

<violation number="1" location="dflash/scripts/bench_matrix.py:446">
P2: --no-ar-cache flag prevents speedup computation for all speculators despite help text promising per-pair AR re-runs</violation>
</file>


ar_cache=ar_cache.get(wname),
tmpdir=tmpdir,
)
ar_res = ar_cache.get(wname) if not args.no_ar_cache else None

P2: --no-ar-cache flag prevents speedup computation for all speculators despite help text promising per-pair AR re-runs

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/scripts/bench_matrix.py, line 446:

<comment>--no-ar-cache flag prevents speedup computation for all speculators despite help text promising per-pair AR re-runs</comment>

<file context>
@@ -0,0 +1,468 @@
+                ar_cache=ar_cache.get(wname),
+                tmpdir=tmpdir,
+            )
+            ar_res = ar_cache.get(wname) if not args.no_ar_cache else None
+            ap = _write_artifact(
+                run_dir, wl, spec, meta,
</file context>
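
One possible shape for a fix, offered only as a sketch: resolve the AR baseline through a helper that re-runs AR when the cache is disabled instead of returning None. `resolve_ar_baseline` and the `run_ar` callable are hypothetical and are not part of bench_matrix.py as shown in the file context above.

```python
def resolve_ar_baseline(wname, ar_cache, no_ar_cache, run_ar):
    """Return an AR result to compute speedup against (illustrative)."""
    if no_ar_cache:
        # Fresh AR run for every (workload, speculator) pair.
        return run_ar(wname)
    if wname not in ar_cache:
        ar_cache[wname] = run_ar(wname)
    return ar_cache[wname]
```

With this shape the flag changes how often AR runs, not whether a baseline exists, so speedup can still be computed in both modes.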

Reproducible per-CLI setup for driving the dual-speculator server
(feat/mtp-prefix-warm-ghost) from real client binaries instead of
synthetic POSTs. Tested 2026-05-19 against the server running on
http://127.0.0.1:18080 with PFlash + MTP γ=3 + DFlash b=22 + Q8 KV +
prefix-cache.

Per-CLI working configurations:

  - claude:   --bare is mandatory; without it OAuth keychain wins and
              ANTHROPIC_BASE_URL is ignored. Set ANTHROPIC_AUTH_TOKEN,
              ANTHROPIC_MODEL, DISABLE_AUTOUPDATER, DISABLE_TELEMETRY.
              Real 3-turn session confirmed auto-route DFlash → MTP
              crossing the 4096-token threshold mid-conversation;
              1815-token long response streamed at 49.3 tok/s via MTP.

  - codex:    CODEX_HOME must be outside /tmp (codex refuses to write
              helper binaries there). wire_api = "responses" in
              config.toml (chat is deprecated). Falls back to chat
              endpoint automatically when responses is unavailable.

  - hermes:   context_length: 65536 required (hermes refuses < 64K).
              Server must launch with --max-ctx 65536 to match.
              config.yaml + .env in an isolated HOME dir; provider
              named "lucebox" with api_mode "chat_completions".

  - pi:       Existing ~/.pi/agent/models.json had lucebox pre-wired but
              pointing at stale port 8000. Patch to port 18080 + api
              "openai-completions" (NOT openai-chat — pi doesn't
              register that name). Pi's --provider lucebox path then
              works directly with --mode text.

  - opencode: Old install (0.5.x on Node 22) silently failed at
              provider load. Fix: nvm use --lts (Node 24) then
              npm install -g opencode-ai (1.15.5+). Provider lucebox
              registered in ~/.config/opencode/opencode.json with
              @ai-sdk/openai-compatible. Provider config is loaded
              (opencode models lucebox prints the model) but the
              parallel title+main request pattern stalls the second
              request due to a server-side daemon stdin serialization
              limit — orthogonal to wiring.

Includes server-log signatures so users can verify each CLI is hitting
the right speculator path: [generate] speculator=dflash|mtp lines and
[spec-decode] / [mtp_decode] accept rate lines.
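
A quick way to tally which path each request took is to scan the server log for those signatures. The sketch below assumes the route appears as `speculator=dflash` or `speculator=mtp` somewhere after `[generate]` on the same line; the rest of the line layout is not documented here, and `count_speculator_routes` is a hypothetical helper.

```python
import re
from collections import Counter

# Assumption: "[generate]" and "speculator=<name>" share one log line.
SPECULATOR_RE = re.compile(r"\[generate\].*?speculator=(dflash|mtp)\b")


def count_speculator_routes(log_lines):
    """Count how many requests were routed to each speculator."""
    counts = Counter()
    for line in log_lines:
        match = SPECULATOR_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```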
Engineering memo on porting sapientinc/HRM-Text-1B as a Luce target.

HRM is a dual-timescale recurrent transformer (1B params, two stacks H
and L iterated H_cycles × (L_cycles+1) = 8 passes per token, with
state injection z_L + z_H). Different forward contract from anything
in luce today — needs a new graph builder, 128 effective KV cache slots
(16 layers × 8 invocations), embedding scaling, prefix-LM bidirectional
mask.

Bottom line: don't port now. ~6 engineer-days for AR-only HRM via a new
qwen35-style backend; spec decode (DFlash drafter, MTP heads) is
multi-week + needs training compute since no aligned drafter exists.
Pre-alignment checkpoint, useless for agentic coding without SFT.

The interesting cross-pollination is the OPPOSITE direction: use HRM as
a SPECULATOR for a larger target. The dual-timescale recurrence may
produce drafts that align with target hidden-state evolution better
than a single-pass Qwen3-0.6B does. Worth a separate research spike.

Model is downloaded to /home/peppi/models/hrm-text-1b (2.3 GB
safetensors + custom modeling_hrm_text.py + tokenizer). Memo includes
a 12-line reproducer that runs HRM via transformers main branch
for evaluation without any luce work.