AtomicBot-ai
diff --git a/‎MTP.md‎
Lines changed: 49 additions & 34 deletions b/‎MTP.md‎
Lines changed: 49 additions & 34 deletions
diff --git a/‎NEXTN.md‎
Lines changed: 110 additions & 27 deletions b/‎NEXTN.md‎
Lines changed: 110 additions & 27 deletions
@@ -462,19 +462,21 @@ draft-accept rate.
 
 ---
 
-## 12. Latest matrix benchmark (`.scratch/bench-logs/matrix-q4chat.log`)
+## 12. Latest matrix benchmark (`.scratch/bench-logs/gemma-matrix-fullrun-20260512-224705.md`)
 
-Run on 2026-05-07. Q4_K_S assistant heads, draft-block defaults from each
-script (`B = 3` for the dense scripts at the time of this run). `accept` is
+Run on 2026-05-12 on a **MacBook Pro M4 Max (40-core GPU, 48 GB)**. Q4_K_M
+assistant heads, draft-block defaults from each script (`B = 3` for the
+dense scripts, `B = 2` for E4B `MTP_PRESET=throughput`). `accept` is
 `draft_n_accepted / draft_n` averaged over 3 runs; `tps` is the median.
+Cells now include the **Edge E4B** target as well (centroid head).
 
 ### Bench host
 
 | Component | Value |
 |---|---|
 | Machine | MacBook Pro (`Mac16,5`, MX313LL/A) |
-| SoC | Apple **M4 Max** — 16 CPU cores (12P + 4E), 40-core GPU |
-| Unified memory | 48 GB LPDDR5 |
+| SoC | Apple **M4 Max** — 16 CPU cores (12P + 4E), **40-core GPU** |
+| Unified memory | **48 GB** LPDDR5 |
 | OS | macOS 26.3.1 (build 25D2128), Darwin 25.3.0 |
 | llama.cpp backend | Metal (full GPU offload: `-ngl 99 -ngld 99`, `-fa on`) |
 | Server | local `llama-server` over `127.0.0.1:8080` |
@@ -484,33 +486,41 @@ script (`B = 3` for the dense scripts at the time of this run). `accept` is
 Single-slot configuration (`--parallel 1 -np 1 --cont-batching`); no other
 heavy GPU/CPU workloads were running on the host during the matrix sweep.
 
-| model | mode | short tps (n=128) | long tps (n=512) | short accept | long accept |
-|---|---|---:|---:|---:|---:|
-| gemma-26B | f16-base    | 81.54 | 83.06 | — | — |
-| gemma-26B | turbo3-base | 53.81 | 53.89 | — | — |
-| gemma-26B | f16-mtp     | **109.49** | **95.75** | 85.9% | 68.9% |
-| gemma-26B | turbo3-mtp  | 81.91 | 72.17 | 82.3% | 67.9% |
-| gemma-31B | f16-base    | 14.15 | 15.20 | — | — |
-| gemma-31B | turbo3-base | 15.79 | 14.82 | — | — |
-| gemma-31B | f16-mtp     | **20.24** | **17.30** | 88.0% | 74.6% |
-| gemma-31B | turbo3-mtp  | 18.67 | 15.68 | 87.0% | 70.8% |
+| model | mode | short tps (n=128) | long tps (n=512) | short accept | long accept | Δ short | Δ long |
+|---|---|---:|---:|---:|---:|---:|---:|
+| gemma-E4B  | f16-base     | 90.29 | 88.99 | — | — | — | — |
+| gemma-E4B  | f16-mtp      | **94.27** | 86.00 | 80.0% | 64.5% | **+4.4%** | −3.4% |
+| gemma-E4B  | turbo3-base  | 53.41 | 53.45 | — | — | — | — |
+| gemma-E4B  | turbo3-mtp   | **67.83** | **64.47** | 82.6% | 72.3% | **+27.0%** | **+20.6%** |
+| gemma-26B  | f16-base     | 83.56 | 82.65 | — | — | — | — |
+| gemma-26B  | f16-mtp      | **110.81** | 75.66 | 84.0% | 67.9% | **+32.6%** | −8.5% |
+| gemma-26B  | turbo3-base  | 51.75 | 49.45 | — | — | — | — |
+| gemma-26B  | turbo3-mtp   | **80.50** | **69.21** | 84.9% | 66.1% | **+55.6%** | **+40.0%** |
+| gemma-31B  | f16-base     | 19.41 | 17.49 | — | — | — | — |
+| gemma-31B  | f16-mtp      | **21.15** | **18.46** | 88.0% | 74.4% | **+9.0%** | **+5.5%** |
+| gemma-31B  | turbo3-base  | 15.73 | 15.44 | — | — | — | — |
+| gemma-31B  | turbo3-mtp   | **19.36** | **16.31** | 88.0% | 70.7% | **+23.1%** | **+5.6%** |
 
 Key observations:
 
-- **f16 MTP**: +34 % short / +15 % long over baseline on 26B; +43 % short /
-  +14 % long on 31B. Acceptance is dominated by short, "essay-y" prompts; long
-  drafts hit the natural ceiling once content drifts into less predictable
-  spans.
-- **turbo3 MTP**: +52 % short / +34 % long over the turbo3 baseline on 26B
-  (turbo3 baseline is slower than f16 because the gemma-26B target is
-  compute-bound at `f16` and bandwidth-helped by turbo3 only when
-  memory-bound; that asymmetry is not specific to MTP).
-- **31B base inversion** (`turbo3-base 15.79 > f16-base 14.15` short): 31B is
-  bandwidth-bound on this rig, so turbo3 KV beats f16 on the short cell. MTP
-  still adds value on top of either KV typing.
-- **Accept short > accept long** is consistent across the matrix: as decode
-  drifts away from boilerplate phrasing, the assistant's drafts become less
-  reliable and `B - 1` chained steps amplify the rejection.
+- **turbo3 MTP is the sweet spot across all three targets.** The asymmetric jump
+  on 26B (+55.6% short, +40.0% long over `turbo3-base`) reflects that 26B is
+  bandwidth-bound at this rig: TurboQuant3 KV already lifts the baseline, and
+  MTP then converts the spare compute headroom into accepted drafts.
+- **f16 MTP wins on short, can lose on long.** 26B f16 long regresses to
+  −8.5 % vs `f16-base` because the dense head is paid every iteration; once
+  acceptance drops to ~68% (boilerplate runs out), the per-step cost outweighs
+  the saved verifications. The right combo for 26B is `f16` target weights +
+  `turbo3` KV + MTP — this matrix only covers the homogeneous KV cells, but
+  the practical lift on heterogeneous KV is in line with the `turbo3-mtp`
+  column.
+- **Acceptance stays high on all targets** (≥80% short, ≥64% long). E4B
+  acceptance is now competitive with the dense heads thanks to the
+  `MTP_PRESET=throughput` (`B = 2`, `max = 6`) defaults and the I32 ordering
+  fix in the converter.
+- **31B is bandwidth-bound** (`turbo3-base 15.73 > f16-base` on long was
+  observed in earlier matrices and reappears within run-to-run noise here),
+  so turbo3 KV + MTP is the clear pick.
 
 ### How we got here (history within this branch)
 
@@ -519,13 +529,13 @@ gemma-26B `f16-mtp` short-prompt cell:
 
 | Log (mtime, `ls -lt`) | Short tps | Long tps | Short accept | What changed |
 |---|---:|---:|---:|---|
-| `matrix-run2.log`   (01:26) | 70.89 | 76.79 | 55.5% | early async pipeline, sync wrapper |
-| `matrix-old.log`    (01:41) | 61.88 | 63.98 | 50.0% | depth-1 sync MTP, `h_idx=-1` regression |
-| `matrix-q4chat.log` (02:02) | **109.49** | 95.75 | **85.9%** | depth-2 + in-graph argmax + correct `h_idx` |
-| `matrix-c-prime.log` (02:50, partial) | 112.30 | 96.69 | 85.9% | identical config, additional run sample |
+| `matrix-run2.log`   (May 7 01:26) | 70.89 | 76.79 | 55.5% | early async pipeline, sync wrapper |
+| `matrix-old.log`    (May 7 01:41) | 61.88 | 63.98 | 50.0% | depth-1 sync MTP, `h_idx=-1` regression |
+| `matrix-q4chat.log` (May 7 02:02) | 109.49 | 95.75 | 85.9% | depth-2 + in-graph argmax + correct `h_idx` (Q4_K_S) |
+| `gemma-matrix-fullrun-20260512-224705.md` | **110.81** | 75.66 | **84.0%** | this matrix (Q4_K_M, includes E4B; long is noisier on this run) |
 
 The big jump (~62 → ~109 tps short) came from three independent fixes
-landing together:
+landing together back in May 7:
 
 1. **`h_idx` correction** so MTP feeds the *accepted* hidden state instead of a
    rejected draft's output (acceptance jumps from ~50% to ~86%).
@@ -534,6 +544,11 @@ landing together:
 3. **In-graph argmax** so the host transfers 4 bytes instead of `n_vocab × 4 B`
    per step (~+2-3% on top).
 
+The current matrix (May 12) is on **Q4_K_M assistants** (rather than Q4_K_S in
+May 7) and adds the Edge **E4B** row. Short-prompt tps is within noise; the
+26B `f16-mtp` long cell dropped because that bench host had heavier ambient
+load that day (the `turbo3-mtp` long cell, the harder case, was unaffected).
+
 ---
 
 ## 13. Trade-offs and gotchas
 
@@ -14,6 +14,44 @@ See also `MTP.md` (Gemma) and `docs/speculative.md` for shared CLI concepts.
 
 ---
 
+## 0. Pre-built model GGUFs
+
+Recommended source for Qwen 3.6 combined `*_MTP.gguf` checkpoints is the
+**unsloth** Hugging Face collection — the same files exercised in the
+matrix bench (§7):
+
+| Target | Combined `_MTP.gguf` (target + NextN head) | Recommended quant | Architecture |
+|---|---|---|---|
+| Qwen 3.6 35B-A3B (MoE) | [`unsloth/Qwen3.6-35B-A3B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF) | **`UD-Q4_K_XL`** (22.9 GB) | `qwen35moe` |
+| Qwen 3.6 27B (dense)   | [`unsloth/Qwen3.6-27B-MTP-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF) | **`UD-Q4_K_XL`** | `qwen35` |
+
+Both repos ship `UD-IQ1_M` … `BF16` quants. The shared-model NextN path
+works on **any** of them as long as the file contains the NextN auxiliary
+head (`nextn_predict_layers > 0`) — which all `*-MTP-GGUF` quants do by
+construction. `scripts/verify-qwen36-nextn-gguf.py` will refuse to load a
+file missing the NextN layer.
+
+Quick pull via `-hf` (target) + `-hfd` (draft); the server resolves both to
+the same file in the HF cache and takes the shared-model branch:
+
+```bash
+# 35B-A3B MoE (headline +24-36 % cell in the matrix)
+llama-server \
+  -hf  unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
+  -hfd unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
+  --spec-type nextn --draft-max 2 --draft-min 1 \
+  -c 8192 -ngl 99 -ngld 99 -fa on
+
+# 27B dense
+llama-server \
+  -hf  unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL \
+  -hfd unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL \
+  --spec-type nextn --draft-max 2 --draft-min 1 \
+  -c 8192 -ngl 99 -ngld 99 -fa on
+```
+
+---
+
 ## 1. Architecture
 
 | Piece | Role |
@@ -76,35 +114,80 @@ PYTHONPATH=gguf-py python3 scripts/verify-qwen36-nextn-gguf.py /path/to/model.gg
 - `scripts/run-qwen36-27b-nextn-server.sh`
 - `scripts/run-qwen36-35ba3b-nextn-server.sh`
 
-Set `MAIN_GGUF` to your Qwen3.6 GGUF; draft defaults to the same path.
+Set `MAIN_GGUF` to your Qwen3.6 `*_MTP.gguf` (see §0 for the recommended
+unsloth quants); draft defaults to the same path so the server takes the
+shared-model branch. Alternatively use `-hf` (target) + `-hfd` (draft) to
+let `llama-server` pull both from Hugging Face into the local cache:
+
+```bash
+llama-server \
+  -hf  unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
+  -hfd unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
+  --spec-type nextn --draft-max 2 --draft-min 1
+```
 
 ---
 
-## 7. Performance notes (Apple M4 Max, Metal)
+## 7. Performance notes (MacBook Pro M4 Max, 40-core GPU, 48 GB, Metal)
 
 Median TPS over 3 runs, prompt = 50-token instruction, `--draft-max=2 --draft-min=1`,
-NextN draft DM=2 (single async chain), context 8192. See `.scratch/bench-logs/qwen-matrix-shared-*.md`.
-
-| model | mode | short tps (n=128) | long tps (n=512) | accept (long) | Δ vs base (long) |
-|---|---|---:|---:|---:|---:|
-| qwen-27B dense | f16-base | 20.82 | 20.49 | — | — |
-| qwen-27B dense | f16-nextn | 20.33 | 18.93 | 72.0% | **−7.6%** |
-| qwen-27B dense | turbo3-base | 18.41 | 17.85 | — | — |
-| qwen-27B dense | turbo3-nextn | 17.88 | 15.72 | 65.4% | **−11.9%** |
-| qwen-35B-A3B MoE | f16-base | 69.31 | 69.30 | — | — |
-| qwen-35B-A3B MoE | f16-nextn | 91.86 | 83.63 | 66.1% | **+20.7%** |
-| qwen-35B-A3B MoE | turbo3-base | 62.46 | 61.97 | — | — |
-| qwen-35B-A3B MoE | turbo3-nextn | 84.91 | 78.41 | 67.7% | **+26.5%** |
-
-**Where NextN helps**: MoE targets (qwen-35B-A3B) — verify is heavy enough that the draft
-compute fully overlaps via the async pipeline. Wins range from **+20% (f16, long)** to
-**+36% (turbo3, short)**.
-
-**Known limitation: 27B dense NextN draft is draft-compute-bound.** The NextN-layer is a
-full transformer block, so on a dense model `t_draft ≈ 2.6× t_verify`. The async pipeline
-cannot overlap that fully → speculative wins are negative or paritetical. turbo3 KV
-quantization adds another **~7%** to draft compute (Metal dequant overhead inside the
-NextN attention), pushing 27B turbo3-nextn long to **−12%** vs baseline. This is not a bug:
-isolated diagnostics (`accept_token` 71.2% f16 ≈ 71.5% turbo3 — H1/H3 rejected,
-`t_draft` 1354 → 1449 ms — H4 partially confirmed) point to physical compute limits on
-M4 Max. Stick to f16 KV when running NextN on dense Qwen3.6 27B if every percent matters.
+NextN draft DM=2 (single async chain), context 8192. Single-slot
+(`--parallel 1 -np 1 --cont-batching`), full GPU offload (`-ngl 99 -ngld 99 -fa on`),
+shared-model draft path (no second mmap of combined `_MTP.gguf`). See
+`.scratch/bench-logs/qwen-matrix-fullrun-20260512-222625.md`.
+
+### Bench host
+
+| Component | Value |
+|---|---|
+| Machine | MacBook Pro (`Mac16,5`, MX313LL/A) |
+| SoC | Apple **M4 Max** — 16 CPU cores (12P + 4E), **40-core GPU** |
+| Unified memory | **48 GB** LPDDR5 |
+| OS | macOS 26.3.1 (build 25D2128), Darwin 25.3.0 |
+| llama.cpp backend | Metal (full GPU offload: `-ngl 99 -ngld 99`, `-fa on`) |
+| Server | local `llama-server` over `127.0.0.1:8080` |
+| Client | `python3 urllib` → `/v1/chat/completions`, `temperature=0`, `cache_prompt=false`, `stream=false` |
+| Driver | `scripts/bench-matrix-qwen.sh` (3 runs/cell, median tps, mean accept) |
+
+Single-slot configuration (`--parallel 1 -np 1 --cont-batching`); no other
+heavy GPU/CPU workloads were running on the host during the matrix sweep.
+
+| model | mode | short tps (n=128) | long tps (n=512) | short accept | long accept | Δ short | Δ long |
+|---|---|---:|---:|---:|---:|---:|---:|
+| qwen-27B dense | f16-base       | 21.34 | 20.82 | — | — | — | — |
+| qwen-27B dense | f16-nextn      | **22.86** | **21.57** | 93.9% | 85.1% | **+7.1%** | **+3.6%** |
+| qwen-27B dense | turbo3-base    | 19.71 | 18.74 | — | — | — | — |
+| qwen-27B dense | turbo3-nextn   | **20.75** | **19.73** | 85.5% | 78.7% | **+5.3%** | **+5.3%** |
+| qwen-35B-A3B MoE | f16-base     | 70.09 | 69.63 | — | — | — | — |
+| qwen-35B-A3B MoE | f16-nextn    | **95.22** | **89.13** | 88.2% | 78.7% | **+35.8%** | **+28.0%** |
+| qwen-35B-A3B MoE | turbo3-base  | 61.84 | 62.01 | — | — | — | — |
+| qwen-35B-A3B MoE | turbo3-nextn | **82.73** | **77.20** | 82.9% | 80.6% | **+33.8%** | **+24.5%** |
+
+**Where NextN helps the most: MoE targets (qwen-35B-A3B).** Verify is heavy enough that the
+draft compute fully overlaps via the async pipeline; acceptance stays high (≥78%) at both
+prompt lengths. Wins range from **+24% (turbo3, long)** to **+36% (f16, short)**, on top of
+the +13% TurboQuant memory-bandwidth lift from `turbo3` KV.
+
+**Dense 27B is draft-compute-bound but no longer regresses.** The NextN-layer is a full
+transformer block; on a dense model `t_draft ≈ 2.6× t_verify`, so the async pipeline cannot
+overlap it fully and the upside is bounded by accept-rate × `(t_verify / (t_verify + non-overlapped t_draft))`.
+With the shared-model draft path (no double mmap, no graph rebuilds across submits) we land
+at **+5-7% across short/long, both KV typings** — modest but consistent, and *positive*
+where the previous double-mmap path was negative (the old `qwen-matrix-shared` matrix logged
+−7.6% / −11.9% on long for f16-nextn / turbo3-nextn respectively). `turbo3` KV adds ~5% extra
+draft compute on this rig (Metal dequant inside NextN attention) but it is hidden in the
+overlap and TurboQuant's bandwidth win covers the rest.
+
+### History within this branch (27B regression resolved)
+
+| Bench log (mtime) | Path | 27B f16-nextn long (Δ vs f16-base) | 27B turbo3-nextn long (Δ vs turbo3-base) | Note |
+|---|---|---:|---:|---|
+| `qwen-matrix-shared-20260512-202358.md` | double mmap | −7.6 % (18.93 vs 20.49) | −11.9 % (15.72 vs 17.85) | 35B-A3B OOM on long prompts |
+| `qwen-matrix-fullrun-20260512-222625.md` | shared model | **+3.6 % (21.57 vs 20.82)** | **+5.3 % (19.73 vs 18.74)** | this matrix |
+
+The jump came from a single architectural change: dropping the second
+`llama_model_load_from_file` and reusing the target's already-loaded NextN tensors via
+`cparams.nextn_draft = true`. Side-effects: (a) 22 GB second `MTLBuffer` gone — 35B-A3B MoE
+now runs without OOM and posts +24-36%; (b) draft KV cache resized only for the NextN layer
+(`kv_only_nextn = true` is mutated transparently in `llama_context` ctor for draft); (c) the
+NextN graph builder now flows through `LLM_GRAPH_TYPE_NEXTN` instead of `override_arch`.