| 1 | +# ADR 0012 — Proposer/verifier value proposition: bounded-memory + recall (all platforms), platform-forked throughput |
| 2 | + |
| 3 | +- **Status**: Accepted (2026-06-13) |
| 4 | +- **Date**: 2026-06-13 |
| 5 | +- **Decision drivers**: |
| 6 | + - A recurring question keeps being re-opened by new contributors (human |
| 7 | + and agent): *"is the proposer still worth it, given Step-1 reaches 1.0× AR |
| 8 | + on Mac without using it?"* and *"is speculative decoding dead on Mac?"* |
| 9 | + This ADR settles the value map so the decision tree is not re-derived |
| 10 | + every time the Mac throughput number looks bad in isolation. |
| 11 | + - 2026-06-12 Mac ctx280 validation (`results/research/k3_mlx_fused_fair_ctx280_n5_gen32_*.json`): |
| 12 | + Step-1 recall 5/5 vs oracle 5/5, bounded resident KV 132.9 MB vs naive |
| 13 | + 1308.9 MB (89.8 % saving) at 4406–5810-token prompts. |
| 14 | + - 2026-06-11 H200 #107 evidence: fused spec-decode 1.27× AR, recall 1.0. |
| 15 | + - 2026-06-13 `verify(L)` calibration sweep (`results/research/verify_l_sweep.json`, |
| 16 | + ctx 4096): measured kernel-dedup headroom 3.92× at L=16, ≈87 % of the |
| 17 | + router-measured expert-union bound (4.52×). |
| 18 | + - Builds on / re-affirms ADR 0001 (proposer sizing + alignment), |
| 19 | + ADR 0004 (alignment data policy), ADR 0006 (local-agent-infra |
| 20 | + positioning), ADR 0008 §11 (dLM K/V-Restoration architecture), |
| 21 | + ADR 0009 (capability exchange), ADR 0010 (full-attention low-precision |
| 22 | + KV / affine4), ADR 0011 (cross-attention coupling, falsified by R1e). |
| 23 | + |
| 24 | +## Context |
| 25 | + |
| 26 | +ADR 0008 §11 (K-series) changed the proposer's primary role from *drafter* |
| 27 | +to *history reconstructor*: the dLM proposer has no KV cache and can produce |
| 28 | +transient K/V for the **entire** history, which is used to restore the |
| 29 | +verifier's attention at structurally-evicted positions. Speculative decoding |
| 30 | +is the **second** product line on the same architecture, not the first. |
| 31 | + |
| 32 | +The trap is to evaluate the architecture on a single cell of its value |
| 33 | +table — "Mac, single host, generic chat, current un-aligned DFlash" — see a |
| 34 | +weak throughput number, and conclude the proposer (or spec-decode) has no |
| 35 | +value. That conclusion does not generalise. This ADR records the full value |
| 36 | +map and prices the open options explicitly. |
| 37 | + |
| 38 | +## Decision |
| 39 | + |
| 40 | +The proposer/verifier value proposition is realised on **two axes**, and its |
| 41 | +status is **platform- and workload-dependent**, not a single scalar: |
| 42 | + |
| 43 | +### 1. The core value is "bounded memory + recall", not "fast" |
| 44 | + |
| 45 | +Since ADR 0008 §11, the proposer's first-class role is **history |
| 46 | +reconstruction**: no KV cache, transient full-history K/V → restore the |
| 47 | +verifier's evicted-position attention. The main line has already **delivered**, but
| 48 | +the value is realised on the **memory axis**, not the throughput axis: |
| 49 | +Step-1 = **1.0× AR throughput + recall 5/5 + KV 132.9 MB vs naive 1308.9 MB |
| 50 | +(89.8 % saving; ~48 MB after affine4 / ADR 0010)**. That is the |
| 51 | +proposer/verifier deliverable. |
| 52 | + |
| 53 | +One caveat, stated for honesty: in the Mac **S5-native** shipping configuration,
| 54 | +Gemma-4's *native hybrid attention* means keeping the **5 full-attention |
| 55 | +layers exact** is already enough to carry recall — so on this specific model |
| 56 | +the f_θ/proposer reconstruction is **replaced by the S5 shortcut**. But on a |
| 57 | +**pure sliding-window architecture** (the K1/K2 Qwen3 case — no |
| 58 | +full-attention layers to preserve) and on the **CUDA full-restoration path**, |
| 59 | +proposer reconstruction remains the **only** source of recall. The |
| 60 | +architecture's domain of applicability is unchanged; Gemma-4 simply handed us |
| 61 | +a free coupon. |
| 62 | + |
| 63 | +### 2. Speculative-decoding value forks by platform — the Mac negative does not extrapolate |
| 64 | + |
| 65 | +- **H200 (#107 measured)**: fused = **1.27× AR, recall 1.0** — the *same* |
| 66 | + proposer/verifier code; on a platform where verify-batch is nearly free, |
| 67 | + spec-decode value holds. |
| 68 | +- **Mac's 0.26×** has a **concrete, movable** bottleneck: real per-token |
| 69 | + acceptance is **30–40 %**. The vLLM reference reports the *same* drafter at |
| 70 | + **44.7 %**, and our own drafter docs say "the precise EAGLE-3 ↔ block-fusion |
| 71 | + alignment is a Stage-2 task". If alignment fine-tuning (the plan queued in
| 72 | + ADR 0001 / 0004 all along) lifts acceptance to **~70 %**, the block-4
| 73 | + arithmetic becomes `3.5 × 43.8 / 140 ≈ 1.1×`, so Mac clears the bar too.
| 74 | + So the Mac status is **"waiting for the alignment asset"**, not |
| 75 | + **"architecture pronounced dead"**. |
| 76 | + |
| 77 | +### 3. Option value of the verification primitive: correctness containment makes any draft source plug-and-play |
| 78 | + |
| 79 | +The v3 loop + byte-level consistency guarantees one thing: a draft source can |
| 80 | +only affect **throughput**, never **pollute output**. This is an open |
| 81 | +interface; the drafter can be swapped for anything. A concrete, this-week, |
| 82 | +testable Mac route: **NGramProposer** (the zero-weight prompt-lookup proposer |
| 83 | +already in PR #105) + the v3 loop — draft cost ≈ 0, no alignment dependency, |
| 84 | +naturally high acceptance on agentic workloads (tool-call JSON, templated |
| 85 | +replies, highly self-repetitive sessions). Arithmetic: `draft ≈ 0 + |
| 86 | +verify(4) = 120 ms`, committing 2.5/block → 0.78×, 3.5/block → 1.1× — on |
| 87 | +Kakeya's target workload (ADR 0006: local agent infrastructure) this is |
| 88 | +entirely plausible. One bridge command verifies it. |
| 89 | + |
| 90 | +### 4. Beyond single-host throughput, the split is the foundation for a multi-host architecture |
| 91 | + |
| 92 | +ADR 0009 / PR #105's capability-exchange plane is built with the |
| 93 | +proposer/verifier roles as primitives: the proposer is a fleet capability that |
| 94 | +can be gossip-discovered and remote-invoked. Even if a single Mac never runs |
| 95 | +spec-decode, the "verifier on host A, proposer capability on host B / cloud" |
| 96 | +shape (including the dev/eval tool plane) is **already running** — the Mac |
| 97 | +bridge used over the last two days is itself an instance of this |
| 98 | +architecture's tool plane. |
| 99 | + |
| 100 | +### Bottom line |
| 101 | + |
| 102 | +If the proposition is narrowed to *"Mac single-host + generic chat + the |
| 103 | +current un-aligned DFlash"* — yes, the proposer has **no runtime value in |
| 104 | +that one cell today**, and Step-1 reaches 1.0× without it. But the |
| 105 | +architecture's value map is: **the realised bounded-memory story (all |
| 106 | +platforms) + the realised throughput story (CUDA) + two explicitly-priced Mac |
| 107 | +throughput options (alignment fine-tuning / n-gram drafting) + the foundation |
| 108 | +for the multi-host capability plane.** |
| 109 | + |
| 110 | +## Consequences |
| 111 | + |
| 112 | +- The proposer/verifier split is **retained**; "Step-1 doesn't use the |
| 113 | + proposer on Mac/Gemma-4" is **not** grounds to deprecate it (it is the only |
| 114 | + recall source on pure-sliding-window models and on CUDA full-restoration). |
| 115 | +- Memory-axis claims (bounded KV, S5, affine4) are the **primary**, |
| 116 | + all-platform deliverable and should be reported as such; throughput claims |
| 117 | + must be qualified by platform (CUDA: realised; Mac: option-pending). |
| 118 | +- Two priced Mac throughput options are tracked as next steps: |
| 119 | + (a) **alignment fine-tuning** (ADR 0001/0004) to lift DFlash acceptance |
| 120 | + toward the 44.7 % reference / ~70 % target; (b) **NGramProposer × v3 loop** |
| 121 | + for agentic workloads (draft ≈ 0, no training). Option (b) is the cheapest |
| 122 | + and is verifiable in a single bridge run. |
| 123 | +- Any future "is spec-decode worth it?" discussion must specify the |
| 124 | + **(platform, workload, drafter)** cell; a negative in one cell does not |
| 125 | + generalise. |
| 126 | + |
| 127 | +## Alternatives considered |
| 128 | + |
| 129 | +- **"The Mac throughput number kills the architecture."** Rejected: the Mac |
| 130 | + cell has a concrete, movable bottleneck (acceptance 30–40 % vs 44.7 % |
| 131 | + reference), and the negative does not extrapolate to CUDA (1.27×, realised) |
| 132 | + or to the memory axis (89.8 % saving, realised, all platforms). |
| 133 | +- **"Deprecate the proposer because Step-1 reaches 1.0× without it on |
| 134 | + Gemma-4."** Rejected: S5 is a Gemma-4-specific free coupon (native hybrid |
| 135 | + attention); on pure sliding-window models (K1/K2 Qwen3) and the CUDA |
| 136 | + full-restoration path the proposer is the only recall source. |
| 137 | +- **"Only ship spec-decode if it beats AR everywhere."** Rejected: the value |
| 138 | + is platform-forked; CUDA already clears it, and the verification primitive's |
| 139 | + correctness containment makes the Mac throughput a strictly-additive option |
| 140 | + (it can never regress correctness). |
| 141 | + |
| 142 | +## Evidence pointers |
| 143 | + |
| 144 | +- Mac bounded-memory + recall: `results/research/k3_mlx_fused_fair_ctx280_n5_gen32_*.json` |
| 145 | + (recall 5/5, KV 132.9 MB vs 1308.9 MB), `docs/pr109-mac-ctx280-validation.md`. |
| 146 | +- CUDA throughput: PR #107, `docs/k3-gpu-beta.md`. |
| 147 | +- verify(L) headroom: `results/research/verify_l_sweep.json` (3.92× measured @ |
| 148 | + L=16 vs 4.52× expert-union bound). |
| 149 | +- Drafter alignment status: ADR 0001/0004; `inference_engine/v04/dflash_drafter.py` |
| 150 | + ("Stage-2" fidelity note). |
| 151 | +- Capability plane / multi-host: ADR 0009, PR #105; `scripts/mac_bridge/`. |
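
The block-level arithmetic quoted in Decision §2 (`3.5 × 43.8 / 140 ≈ 1.1×`) and the memory saving quoted in §1 can be reproduced with a small sketch. It rests on assumptions the ADR does not spell out: that 43.8 is the plain-AR latency in ms per token, that 140 is the wall-clock cost in ms of one draft-plus-verify block, and that 3.5 is the mean number of tokens committed per block. The function names are hypothetical and do not correspond to anything in the repository.

```python
def speedup_vs_ar(committed_per_block: float,
                  ar_ms_per_token: float,
                  block_ms: float) -> float:
    """Throughput of block decoding relative to plain AR decoding.

    Assumed model: AR emits one token every `ar_ms_per_token` ms, while
    block decoding commits `committed_per_block` tokens every `block_ms` ms
    (draft cost plus verify cost).
    """
    if ar_ms_per_token <= 0 or block_ms <= 0:
        raise ValueError("latencies must be positive")
    return committed_per_block * ar_ms_per_token / block_ms


def break_even_committed(ar_ms_per_token: float, block_ms: float) -> float:
    """Mean committed tokens per block needed to match AR throughput (1.0x)."""
    if ar_ms_per_token <= 0 or block_ms <= 0:
        raise ValueError("latencies must be positive")
    return block_ms / ar_ms_per_token


def kv_saving_fraction(bounded_mb: float, naive_mb: float) -> float:
    """Fraction of resident KV memory saved relative to the naive cache."""
    if naive_mb <= 0:
        raise ValueError("naive_mb must be positive")
    return 1.0 - bounded_mb / naive_mb


if __name__ == "__main__":
    # Figures taken from the ADR text; their interpretation is assumed.
    print("speedup at 3.5 committed/block:", speedup_vs_ar(3.5, 43.8, 140.0))
    print("break-even committed/block:", break_even_committed(43.8, 140.0))
    print("KV saving fraction:", kv_saving_fraction(132.9, 1308.9))
```

Under this reading, the break-even point is about 3.2 committed tokens per 140 ms block, which is why the argument turns on lifting acceptance rather than on the architecture itself. The same helper applies to any (platform, workload, drafter) cell once its three inputs have been measured.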