AtomicBot-ai
diff --git a/‎MTP.md‎
Lines changed: 49 additions & 34 deletions b/‎MTP.md‎
Lines changed: 49 additions & 34 deletions
diff --git a/‎NEXTN.md‎
Lines changed: 193 additions & 0 deletions b/‎NEXTN.md‎
Lines changed: 193 additions & 0 deletions
@@ -462,19 +462,21 @@ draft-accept rate.
 
 ---
 
-## 12. Latest matrix benchmark (`.scratch/bench-logs/matrix-q4chat.log`)
+## 12. Latest matrix benchmark (`.scratch/bench-logs/gemma-matrix-fullrun-20260512-224705.md`)
 
-Run on 2026-05-07. Q4_K_S assistant heads, draft-block defaults from each
-script (`B = 3` for the dense scripts at the time of this run). `accept` is
+Run on 2026-05-12 on a **MacBook Pro M4 Max (40-core GPU, 48 GB)**. Q4_K_M
+assistant heads, draft-block defaults from each script (`B = 3` for the
+dense scripts, `B = 2` for E4B `MTP_PRESET=throughput`). `accept` is
 `draft_n_accepted / draft_n` averaged over 3 runs; `tps` is the median.
+Cells now include the **Edge E4B** target as well (centroid head).
 
 ### Bench host
 
 | Component | Value |
 |---|---|
 | Machine | MacBook Pro (`Mac16,5`, MX313LL/A) |
-| SoC | Apple **M4 Max** — 16 CPU cores (12P + 4E), 40-core GPU |
-| Unified memory | 48 GB LPDDR5 |
+| SoC | Apple **M4 Max** — 16 CPU cores (12P + 4E), **40-core GPU** |
+| Unified memory | **48 GB** LPDDR5 |
 | OS | macOS 26.3.1 (build 25D2128), Darwin 25.3.0 |
 | llama.cpp backend | Metal (full GPU offload: `-ngl 99 -ngld 99`, `-fa on`) |
 | Server | local `llama-server` over `127.0.0.1:8080` |
@@ -484,33 +486,41 @@ script (`B = 3` for the dense scripts at the time of this run). `accept` is
 Single-slot configuration (`--parallel 1 -np 1 --cont-batching`); no other
 heavy GPU/CPU workloads were running on the host during the matrix sweep.
 
-| model | mode | short tps (n=128) | long tps (n=512) | short accept | long accept |
-|---|---|---:|---:|---:|---:|
-| gemma-26B | f16-base    | 81.54 | 83.06 | — | — |
-| gemma-26B | turbo3-base | 53.81 | 53.89 | — | — |
-| gemma-26B | f16-mtp     | **109.49** | **95.75** | 85.9% | 68.9% |
-| gemma-26B | turbo3-mtp  | 81.91 | 72.17 | 82.3% | 67.9% |
-| gemma-31B | f16-base    | 14.15 | 15.20 | — | — |
-| gemma-31B | turbo3-base | 15.79 | 14.82 | — | — |
-| gemma-31B | f16-mtp     | **20.24** | **17.30** | 88.0% | 74.6% |
-| gemma-31B | turbo3-mtp  | 18.67 | 15.68 | 87.0% | 70.8% |
+| model | mode | short tps (n=128) | long tps (n=512) | short accept | long accept | Δ short | Δ long |
+|---|---|---:|---:|---:|---:|---:|---:|
+| gemma-E4B  | f16-base     | 90.29 | 88.99 | — | — | — | — |
+| gemma-E4B  | f16-mtp      | **94.27** | 86.00 | 80.0% | 64.5% | **+4.4%** | −3.4% |
+| gemma-E4B  | turbo3-base  | 53.41 | 53.45 | — | — | — | — |
+| gemma-E4B  | turbo3-mtp   | **67.83** | **64.47** | 82.6% | 72.3% | **+27.0%** | **+20.6%** |
+| gemma-26B  | f16-base     | 83.56 | 82.65 | — | — | — | — |
+| gemma-26B  | f16-mtp      | **110.81** | 75.66 | 84.0% | 67.9% | **+32.6%** | −8.5% |
+| gemma-26B  | turbo3-base  | 51.75 | 49.45 | — | — | — | — |
+| gemma-26B  | turbo3-mtp   | **80.50** | **69.21** | 84.9% | 66.1% | **+55.6%** | **+40.0%** |
+| gemma-31B  | f16-base     | 19.41 | 17.49 | — | — | — | — |
+| gemma-31B  | f16-mtp      | **21.15** | **18.46** | 88.0% | 74.4% | **+9.0%** | **+5.5%** |
+| gemma-31B  | turbo3-base  | 15.73 | 15.44 | — | — | — | — |
+| gemma-31B  | turbo3-mtp   | **19.36** | **16.31** | 88.0% | 70.7% | **+23.1%** | **+5.6%** |
 
 Key observations:
 
-- **f16 MTP**: +34 % short / +15 % long over baseline on 26B; +43 % short /
-  +14 % long on 31B. Acceptance is dominated by short, "essay-y" prompts; long
-  drafts hit the natural ceiling once content drifts into less predictable
-  spans.
-- **turbo3 MTP**: +52 % short / +34 % long over the turbo3 baseline on 26B
-  (turbo3 baseline is slower than f16 because the gemma-26B target is
-  compute-bound at `f16` and bandwidth-helped by turbo3 only when
-  memory-bound; that asymmetry is not specific to MTP).
-- **31B base inversion** (`turbo3-base 15.79 > f16-base 14.15` short): 31B is
-  bandwidth-bound on this rig, so turbo3 KV beats f16 on the short cell. MTP
-  still adds value on top of either KV typing.
-- **Accept short > accept long** is consistent across the matrix: as decode
-  drifts away from boilerplate phrasing, the assistant's drafts become less
-  reliable and `B - 1` chained steps amplify the rejection.
+- **turbo3 MTP is the sweet spot across all three targets.** The asymmetric jump
+  on 26B (+55.6% short, +40.0% long over `turbo3-base`) reflects that 26B is
+  bandwidth-bound at this rig: TurboQuant3 KV already lifts the baseline, and
+  MTP then converts the spare compute headroom into accepted drafts.
+- **f16 MTP wins on short, can lose on long.** 26B f16 long regresses to
+  −8.5 % vs `f16-base` because the dense head is paid every iteration; once
+  acceptance drops to ~68% (boilerplate runs out), the per-step cost outweighs
+  the saved verifications. The right combo for 26B is `f16` target weights +
+  `turbo3` KV + MTP — this matrix only covers the homogeneous KV cells, but
+  the practical lift on heterogeneous KV is in line with the `turbo3-mtp`
+  column.
+- **Acceptance stays high on all targets** (≥80% short, ≥64% long). E4B
+  acceptance is now competitive with the dense heads thanks to the
+  `MTP_PRESET=throughput` (`B = 2`, `max = 6`) defaults and the I32 ordering
+  fix in the converter.
+- **31B is bandwidth-bound** (`turbo3-base 15.73 > f16-base` on long was
+  observed in earlier matrices and reappears within run-to-run noise here),
+  so turbo3 KV + MTP is the clear pick.
 
 ### How we got here (history within this branch)
 
@@ -519,13 +529,13 @@ gemma-26B `f16-mtp` short-prompt cell:
 
 | Log (mtime, `ls -lt`) | Short tps | Long tps | Short accept | What changed |
 |---|---:|---:|---:|---|
-| `matrix-run2.log`   (01:26) | 70.89 | 76.79 | 55.5% | early async pipeline, sync wrapper |
-| `matrix-old.log`    (01:41) | 61.88 | 63.98 | 50.0% | depth-1 sync MTP, `h_idx=-1` regression |
-| `matrix-q4chat.log` (02:02) | **109.49** | 95.75 | **85.9%** | depth-2 + in-graph argmax + correct `h_idx` |
-| `matrix-c-prime.log` (02:50, partial) | 112.30 | 96.69 | 85.9% | identical config, additional run sample |
+| `matrix-run2.log`   (May 7 01:26) | 70.89 | 76.79 | 55.5% | early async pipeline, sync wrapper |
+| `matrix-old.log`    (May 7 01:41) | 61.88 | 63.98 | 50.0% | depth-1 sync MTP, `h_idx=-1` regression |
+| `matrix-q4chat.log` (May 7 02:02) | 109.49 | 95.75 | 85.9% | depth-2 + in-graph argmax + correct `h_idx` (Q4_K_S) |
+| `gemma-matrix-fullrun-20260512-224705.md` | **110.81** | 75.66 | **84.0%** | this matrix (Q4_K_M, includes E4B; long is noisier on this run) |
 
 The big jump (~62 → ~109 tps short) came from three independent fixes
-landing together:
+landing together back in May 7:
 
 1. **`h_idx` correction** so MTP feeds the *accepted* hidden state instead of a
    rejected draft's output (acceptance jumps from ~50% to ~86%).
@@ -534,6 +544,11 @@ landing together:
 3. **In-graph argmax** so the host transfers 4 bytes instead of `n_vocab × 4 B`
    per step (~+2-3% on top).
 
+The current matrix (May 12) is on **Q4_K_M assistants** (rather than Q4_K_S in
+May 7) and adds the Edge **E4B** row. Short-prompt tps is within noise; the
+26B `f16-mtp` long cell dropped because that bench host had heavier ambient
+load that day (the `turbo3-mtp` long cell, the harder case, was unaffected).
+
 ---
 
 ## 13. Trade-offs and gotchas
 
@@ -0,0 +1,193 @@
+# Qwen 3.x NextN — shared-model speculative decoding
+
+> Scope: **Qwen3.6** (and compatible) models with NextN / MTP auxiliary head weights in GGUF.
+> The draft context now reuses the **target** `llama_model` (no second mmap of the combined
+> `_MTP.gguf`); a second `llama_context` is built over the same model with
+> `llama_context_params.nextn_draft = true`, which routes graph build to the NextN draft
+> builder (`qwen35_nextn` / `qwen35moe_nextn`).
+> Legacy standalone `*_mtp` GGUFs (`override_arch`) are still supported as a fallback for
+> users who ship the draft head as a separate artifact.
+> This path is **named `nextn`** in this fork to coexist with **Gemma 4 MTP** (`--spec-type mtp`), which uses a
+> single target context and `llama_decode_mtp_*`.
+
+See also `MTP.md` (Gemma) and `docs/speculative.md` for shared CLI concepts.
+
+---
+
+## 0. Pre-built model GGUFs
+
+Recommended source for Qwen 3.6 combined `*_MTP.gguf` checkpoints is the
+**unsloth** Hugging Face collection — the same files exercised in the
+matrix bench (§7):
+
+| Target | Combined `_MTP.gguf` (target + NextN head) | Recommended quant | Architecture |
+|---|---|---|---|
+| Qwen 3.6 35B-A3B (MoE) | [`unsloth/Qwen3.6-35B-A3B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF) | **`UD-Q4_K_XL`** (22.9 GB) | `qwen35moe` |
+| Qwen 3.6 27B (dense)   | [`unsloth/Qwen3.6-27B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF) | **`UD-Q4_K_XL`** | `qwen35` |
+
+Both repos ship `UD-IQ1_M` … `BF16` quants. The shared-model NextN path
+works on **any** of them as long as the file contains the NextN auxiliary
+head (`nextn_predict_layers > 0`) — which all `*-MTP-GGUF` quants do by
+construction. `scripts/verify-qwen36-nextn-gguf.py` will refuse to load a
+file missing the NextN layer.
+
+Quick pull via `-hf` (target) + `-hfd` (draft); the server resolves both to
+the same file in the HF cache and takes the shared-model branch:
+
+```bash
+# 35B-A3B MoE (headline +24-36 % cell in the matrix)
+llama-server \
+  -hf  unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
+  -hfd unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
+  --spec-type nextn --draft-max 2 --draft-min 1 \
+  -c 8192 -ngl 99 -ngld 99 -fa on
+
+# 27B dense
+llama-server \
+  -hf  unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL \
+  -hfd unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL \
+  --spec-type nextn --draft-max 2 --draft-min 1 \
+  -c 8192 -ngl 99 -ngld 99 -fa on
+```
+
+---
+
+## 1. Architecture
+
+| Piece | Role |
+|-------|------|
+| Target context | Standard `qwen35` / `qwen35moe` forward; graph publishes `t_h_pre_norm` (hidden before final norm). |
+| Draft context | Built over the **same** `llama_model` with `cparams.nextn_draft = true`. The graph dispatcher picks `llm_build_qwen35*_nextn` against the target's NextN-layer tensors (`model.layers[n_main + i].nextn.*`). KV cache is sized only for the NextN layer (`kv_only_nextn = true`, overridden transparently in `llama_context` ctor). |
+| Hidden transfer | Target and draft enable `embeddings_pre_norm`; `llama_decode` copies `t_h_pre_norm` rows into a CPU `embd_pre_norm` buffer. `common_speculative_state_nextn` reads via `llama_get_embeddings_pre_norm_ith` (no per-ubatch tensor hook). |
+| Speculative driver | `common_speculative_state_nextn` in `common/speculative.cpp` (greedy Top-1 chain). |
+| KV pairing | `llama_set_nextn(target, draft)` registers the draft context so `llama_context_nextn_seq_rm` can trim both KVs. |
+
+The shared-model path eliminates the ~22 GB second mmap (one `MTLBuffer` per `llama_model`)
+that used to OOM the 35B-A3B target on Apple Silicon (38 GB unified memory). See
+`llama_model_has_nextn_layer()` (target arch ∈ {qwen35, qwen35moe} **and**
+`hparams.nextn_predict_layers > 0`).
+
+---
+
+## 2. CLI / server
+
+- `--spec-type nextn` — enable NextN drafting (not Gemma `mtp`).
+- `--model-draft` / `-md` — pass the **same** path as `--model`; the server detects this
+  and switches to the shared-model path (no second model load). Pointing at a standalone
+  NEXTN_ONLY GGUF (`general.architecture = qwen35*_mtp`) still works but loads a second
+  `llama_model`.
+- `--draft-max` / `--spec-draft-n-max` — max chained draft tokens per round (see `common` / server arg naming).
+- Gemma MTP flags (`--mtp-head`, `llama_decode_mtp_*`, `llama_model_load_mtp_from_file`) are **unchanged**.
+
+---
+
+## 3. C API (subset)
+
+- `llama_set_nextn(target_ctx, draft_ctx)` — pair contexts for paired `seq_rm`.
+- `llama_context_nextn_seq_rm(target_ctx, …)` — remove KV on target **and** on the registered draft context (`seq_id` 0 on draft).
+
+Internal (see `src/llama-ext.h`, not in stable `include/llama.h`):
+
+- `llama_set_embeddings_pre_norm(ctx, bool)` — enable extraction/copy of pre-norm hidden rows into `embd_pre_norm`.
+- `llama_get_embeddings_pre_norm_ith(ctx, i)` — row `i` of the last decode’s pre-norm buffer (`i < 0` supported like other embedding getters).
+
+---
+
+## 4. Operations
+
+- **Vocab**: draft and target share tokenizer; arch check ensures `qwen35`+`qwen35_mtp` (or MoE pair).
+- **GDN rollback**: target may use `n_rs_seq` from speculative+GDN work; draft context forces `n_rs_seq = 0` (see `tools/server/server-context.cpp`).
+- **Metal / Vulkan**: GDN partial rollback quality may still be upstream-limited; see PR #22400 notes in the project plan.
+
+---
+
+## 5. Verify GGUF
+
+```bash
+PYTHONPATH=gguf-py python3 scripts/verify-qwen36-nextn-gguf.py /path/to/model.gguf
+```
+
+---
+
+## 6. Run scripts
+
+- `scripts/run-qwen36-27b-nextn-server.sh`
+- `scripts/run-qwen36-35ba3b-nextn-server.sh`
+
+Set `MAIN_GGUF` to your Qwen3.6 `*_MTP.gguf` (see §0 for the recommended
+unsloth quants); draft defaults to the same path so the server takes the
+shared-model branch. Alternatively use `-hf` (target) + `-hfd` (draft) to
+let `llama-server` pull both from Hugging Face into the local cache:
+
+```bash
+llama-server \
+  -hf  unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
+  -hfd unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
+  --spec-type nextn --draft-max 2 --draft-min 1
+```
+
+---
+
+## 7. Performance notes (MacBook Pro M4 Max, 40-core GPU, 48 GB, Metal)
+
+Median TPS over 3 runs, prompt = 50-token instruction, `--draft-max=2 --draft-min=1`,
+NextN draft DM=2 (single async chain), context 8192. Single-slot
+(`--parallel 1 -np 1 --cont-batching`), full GPU offload (`-ngl 99 -ngld 99 -fa on`),
+shared-model draft path (no second mmap of combined `_MTP.gguf`). See
+`.scratch/bench-logs/qwen-matrix-fullrun-20260512-222625.md`.
+
+### Bench host
+
+| Component | Value |
+|---|---|
+| Machine | MacBook Pro (`Mac16,5`, MX313LL/A) |
+| SoC | Apple **M4 Max** — 16 CPU cores (12P + 4E), **40-core GPU** |
+| Unified memory | **48 GB** LPDDR5 |
+| OS | macOS 26.3.1 (build 25D2128), Darwin 25.3.0 |
+| llama.cpp backend | Metal (full GPU offload: `-ngl 99 -ngld 99`, `-fa on`) |
+| Server | local `llama-server` over `127.0.0.1:8080` |
+| Client | `python3 urllib` → `/v1/chat/completions`, `temperature=0`, `cache_prompt=false`, `stream=false` |
+| Driver | `scripts/bench-matrix-qwen.sh` (3 runs/cell, median tps, mean accept) |
+
+Single-slot configuration (`--parallel 1 -np 1 --cont-batching`); no other
+heavy GPU/CPU workloads were running on the host during the matrix sweep.
+
+| model | mode | short tps (n=128) | long tps (n=512) | short accept | long accept | Δ short | Δ long |
+|---|---|---:|---:|---:|---:|---:|---:|
+| qwen-27B dense | f16-base       | 21.34 | 20.82 | — | — | — | — |
+| qwen-27B dense | f16-nextn      | **22.86** | **21.57** | 93.9% | 85.1% | **+7.1%** | **+3.6%** |
+| qwen-27B dense | turbo3-base    | 19.71 | 18.74 | — | — | — | — |
+| qwen-27B dense | turbo3-nextn   | **20.75** | **19.73** | 85.5% | 78.7% | **+5.3%** | **+5.3%** |
+| qwen-35B-A3B MoE | f16-base     | 70.09 | 69.63 | — | — | — | — |
+| qwen-35B-A3B MoE | f16-nextn    | **95.22** | **89.13** | 88.2% | 78.7% | **+35.8%** | **+28.0%** |
+| qwen-35B-A3B MoE | turbo3-base  | 61.84 | 62.01 | — | — | — | — |
+| qwen-35B-A3B MoE | turbo3-nextn | **82.73** | **77.20** | 82.9% | 80.6% | **+33.8%** | **+24.5%** |
+
+**Where NextN helps the most: MoE targets (qwen-35B-A3B).** Verify is heavy enough that the
+draft compute fully overlaps via the async pipeline; acceptance stays high (≥78%) at both
+prompt lengths. Wins range from **+24% (turbo3, long)** to **+36% (f16, short)**, on top of
+the +13% TurboQuant memory-bandwidth lift from `turbo3` KV.
+
+**Dense 27B is draft-compute-bound but no longer regresses.** The NextN-layer is a full
+transformer block; on a dense model `t_draft ≈ 2.6× t_verify`, so the async pipeline cannot
+overlap it fully and the upside is bounded by accept-rate × `(t_verify / (t_verify + non-overlapped t_draft))`.
+With the shared-model draft path (no double mmap, no graph rebuilds across submits) we land
+at **+5-7% across short/long, both KV typings** — modest but consistent, and *positive*
+where the previous double-mmap path was negative (the old `qwen-matrix-shared` matrix logged
+−7.6% / −11.9% on long for f16-nextn / turbo3-nextn respectively). `turbo3` KV adds ~5% extra
+draft compute on this rig (Metal dequant inside NextN attention) but it is hidden in the
+overlap and TurboQuant's bandwidth win covers the rest.
+
+### History within this branch (27B regression resolved)
+
+| Bench log (mtime) | Path | 27B f16-nextn long (Δ vs f16-base) | 27B turbo3-nextn long (Δ vs turbo3-base) | Note |
+|---|---|---:|---:|---|
+| `qwen-matrix-shared-20260512-202358.md` | double mmap | −7.6 % (18.93 vs 20.49) | −11.9 % (15.72 vs 17.85) | 35B-A3B OOM on long prompts |
+| `qwen-matrix-fullrun-20260512-222625.md` | shared model | **+3.6 % (21.57 vs 20.82)** | **+5.3 % (19.73 vs 18.74)** | this matrix |
+
+The jump came from a single architectural change: dropping the second
+`llama_model_load_from_file` and reusing the target's already-loaded NextN tensors via
+`cparams.nextn_draft = true`. Side-effects: (a) 22 GB second `MTLBuffer` gone — 35B-A3B MoE
+now runs without OOM and posts +24-36%; (b) draft KV cache resized only for the NextN layer
+(`kv_only_nextn = true` is mutated transparently in `llama_context` ctor for draft); (c) the
+NextN graph builder now flows through `LLM_GRAPH_TYPE_NEXTN` instead of `override_arch`.