| 1 | +# ADR 0002 — Verifier Selection, Quantization, and the Open-vs-Closed-Weight Constraint |
| 2 | + |
| 3 | +- **Status**: Accepted |
| 4 | +- **Date**: 2026-05-23 |
| 5 | +- **Decision drivers**: Memory fit on consumer hardware, perceived |
| 6 | + latency, alignment-training feasibility, future-proofing across the |
| 7 | + Qwen / Gemma / DeepSeek roadmap. |
| 8 | +- **Depends on**: [ADR 0001](0001-proposer-sizing-and-alignment.md) |
| 9 | + (proposer is a constant 0.25–1 B regardless of verifier choice). |
| 10 | + |
| 11 | +## 1. Context |
| 12 | + |
| 13 | +ADR 0001 fixed the proposer in a 0.25–1 B band and established that |
| 14 | +verifier swaps are *data-and-fine-tune* operations rather than |
| 15 | +re-architecture operations. That decision deliberately deferred the |
| 16 | +question of *which* verifier to ship against. This ADR resolves that |
| 17 | +question for the project's first two ship targets and lays out the rule |
| 18 | +for future swaps. |
| 19 | + |
| 20 | +The decision space is constrained by four hard, independent factors: |
| 21 | + |
| 22 | +1. **Memory budget.** Primary deployment target is consumer hardware: |
| 23 | + Mac M-series (16–64 GB unified memory) and consumer Nvidia GPUs |
| 24 | + (RTX 4090: 24 GB, RTX 3090: 24 GB). The 24 GB Mac M4 is the |
| 25 | + project's reference machine and is the lower bound below which we |
| 26 | + refuse to design. |
| 27 | +2. **Perceived latency.** The chat REPL (`scripts/chat.py`) has a |
| 28 | + project-internal target of ≤ 30 s for a 200-token Chinese response on |
| 29 | + the reference Mac. Beyond that, the streaming UX is no longer |
| 30 | + acceptable. |
| 31 | +3. **Alignment availability.** Per ADR 0001, every verifier we ship |
| 32 | + against requires its own representation-alignment-trained proposer. |
| 33 | + A verifier we cannot align against is a verifier we cannot ship. |
| 34 | +4. **Open weights for hidden states.** EAGLE-3 alignment requires |
| 35 | + read access to the verifier's embedding, last-layer hidden state, and |
| 36 | + LM head. Closed-weight (API-only) verifiers cannot be aligned with |
| 37 | + the project's primary recipe; they fall back to a degraded path that |
| 38 | + tops out at lower acceptance (see section 5). |
| 39 | + |
| 40 | +The current measured baseline (commit `e207aed`, MLX backend on M4): |
| 41 | + |
| 42 | +| Metric | Value | |
| 43 | +| ----------------------------- | ---------------------- | |
| 44 | +| Verifier | `Qwen/Qwen3-1.7B`, bf16 | |
| 45 | +| Resident memory | ~5.5 GB | |
| 46 | +| Wall-time (zh KV-cache prompt, 150 tokens) | 12.07 s | |
| 47 | +| Acceptance (no alignment) | 0.06–0.12 | |
| 48 | + |
| 49 | +## 2. Decision |
| 50 | + |
| 51 | +### 2.1 Ship sequence |
| 52 | + |
| 53 | +The project ships against verifiers in this order: |
| 54 | + |
| 55 | +| Ship | Verifier | Backend | Quant | Triggered when | |
| 56 | +| ---- | ------------------- | --------------- | ------- | ---------------------------------------------------- | |
| 57 | +| v1 | `Qwen/Qwen3-1.7B` | MLX / CUDA | bf16 | ADR 0001 Validation #2 passes (α ≥ 0.40 at K=2) | |
| 58 | +| v2 | `Qwen/Qwen3-8B` | MLX / CUDA | **4-bit (AWQ-style)** | v1 in production + alignment retrained for 8B verifier | |
| 59 | +| v3+ | larger / MoE | TBD | TBD | Recorded in a future ADR (0004+) | |
| 60 | + |
| 61 | +v1 reuses the verifier we have today. v2 is the planned upgrade. The |
| 62 | +2-step sequence exists deliberately: v1 proves that the alignment |
| 63 | +pipeline (ADR 0001) actually works on a verifier we can run end-to-end |
| 64 | +on a 24 GB Mac without quantization risk; v2 then exercises the |
| 65 | +verifier-decoupling claim of ADR 0001 §2.3 — same proposer architecture, |
| 66 | +new alignment artifacts, new quantized verifier weights. |
| 67 | + |
| 68 | +### 2.2 Quantization rule |
| 69 | + |
| 70 | +**bf16 at or below 2 B parameters; 4-bit MLX (or AWQ on CUDA) at 8 B and above; in between, decided per machine by the 60 % memory rule.**
| 71 | + |
| 72 | +Concretely: |
| 73 | + |
| 74 | +- Verifier ≤ 2 B params: bf16 unconditionally. Memory headroom is |
| 75 | + sufficient on the reference 24 GB Mac; 4-bit gains nothing |
| 76 | + meaningful and adds quantization noise that hurts acceptance. |
| 77 | +- 2 B < Verifier < 8 B: bf16 if it fits in ≤ 60 % of available unified |
| 78 | + memory, else 4-bit. Decision is made per target machine in the engine |
| 79 | + config, not statically per verifier. |
| 80 | +- Verifier ≥ 8 B: 4-bit by default. bf16 is reserved for non-consumer |
| 81 | + GPU paths (A100 / H100 / MI300) recorded in a separate engine |
| 82 | + deployment ADR. |
| 83 | + |
| 84 | +The 60 % threshold leaves ~10 GB of headroom on a 24 GB machine
| 85 | +(0.4 × 24 GB = 9.6 GB) for the proposer (≤ 1 GB), per-forward activations
| 86 | +(~1–2 GB), the OS, and other applications the user is running.
| 87 | + |
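The three bands above can be sketched as a small function. This is a minimal illustration, not the engine's config logic: `choose_quant`, its arguments, and the estimate of 2 bytes per parameter for bf16 weights are assumptions introduced here, and only weight size is considered.

```python
def choose_quant(params_b: float, mem_gb: float, max_fraction: float = 0.60) -> str:
    """Return "bf16" or "4bit" for a verifier with params_b billion parameters.

    Hypothetical helper. bf16 weights are estimated at 2 bytes per parameter,
    so params_b billion parameters take roughly 2 * params_b GB.
    """
    if params_b <= 2:
        return "bf16"  # small verifiers: bf16 unconditionally
    if params_b >= 8:
        return "4bit"  # large verifiers: 4-bit by default on consumer hardware
    bf16_gb = 2.0 * params_b  # weights only; activations and proposer excluded
    # Middle band: bf16 only if the weights fit in the allowed memory fraction.
    return "bf16" if bf16_gb <= max_fraction * mem_gb else "4bit"


if __name__ == "__main__":
    for params_b, mem_gb in [(1.7, 24), (4, 24), (4, 8), (8, 24)]:
        print(f"{params_b} B on {mem_gb} GB -> {choose_quant(params_b, mem_gb)}")
```

Under these assumptions a 4 B verifier stays in bf16 on a 24 GB machine (8 GB against a 14.4 GB limit) and drops to 4-bit on an 8 GB machine, which is why the choice belongs in per-machine configuration rather than in a static per-verifier table.
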
| 88 | +### 2.3 Open-weight requirement |
| 89 | + |
| 90 | +**The project's primary alignment recipe (ADR 0001) requires |
| 91 | +open-weight verifiers.** Closed-weight verifiers (e.g. GPT-4-class |
| 92 | +APIs, Claude-class APIs, Gemini-API-only models) cannot be aligned with |
| 93 | +EAGLE-3 because they expose neither embedding weights nor hidden |
| 94 | +states. Section 5 documents the degraded fallback path; it is *not* a |
| 95 | +ship target for v1 or v2. |
| 96 | + |
| 97 | +### 2.4 Latency budget enforcement |
| 98 | + |
| 99 | +For each verifier candidate, before training the alignment proposer |
| 100 | +against it, run a *no-proposer-baseline* benchmark on the reference |
| 101 | +hardware: |
| 102 | + |
| 103 | +``` |
| 104 | +verifier.generate(reference_prompt, max_new_tokens=200) |
| 105 | +``` |
| 106 | + |
| 107 | +If this baseline exceeds 50 s on the reference Mac, the verifier is |
| 108 | +rejected at the planning stage — speculative decoding cannot recover a |
| 109 | +verifier that is fundamentally too slow. |
| 110 | + |
| 111 | +For Qwen3-8B 4-bit on M4: estimated baseline 35–45 s for 200 tokens, which
| 112 | +clears the 50 s gate but misses the 30 s target of §1, so v2 needs a
| 113 | +roughly 1.2–1.5× speculative speedup. For Qwen3-32B-class models on M4: estimated
| 114 | +baseline > 90 s, which rejects them at this gate. They become candidates
| 115 | +only on cloud / data-center deployment, recorded in a separate ADR.
| 116 | + |
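A minimal sketch of how the gate could be scripted, assuming the verifier's generation call is wrapped in a plain callable. The names `baseline_seconds`, `passes_gate`, and the `generate(prompt, max_new_tokens)` signature are hypothetical and not part of the engine's API; only the 50 s threshold and the 200-token length come from this section.

```python
import time
from typing import Callable

GATE_SECONDS = 50.0  # planning-stage rejection threshold from this section


def baseline_seconds(
    generate: Callable[[str, int], object],
    prompt: str,
    max_new_tokens: int = 200,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Wall-clock seconds for one no-proposer generation call."""
    start = clock()
    generate(prompt, max_new_tokens)
    return clock() - start


def passes_gate(seconds: float, gate: float = GATE_SECONDS) -> bool:
    """A verifier is rejected only when its baseline exceeds the gate."""
    return seconds <= gate
```

The `clock` parameter exists so the gate logic can be tested with a fake timer; a real run would pass a wrapper around the backend's generation call and keep the default clock.
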
| 117 | +## 3. Alternatives Considered |
| 118 | + |
| 119 | +### 3.1 Skip v1, go straight to Qwen3-8B (rejected) |
| 120 | + |
| 121 | +Rationale for considering: 1.7 B is "too small to matter" as a |
| 122 | +production verifier; the project's value proposition is at 7 B+ scale. |
| 123 | + |
| 124 | +Why rejected: |
| 125 | + |
| 126 | +- Two unproven things at once (alignment pipeline correctness *and* |
| 127 | + quantized 8B fit/latency) compound risk. If something goes wrong, we |
| 128 | + cannot tell which factor caused it. |
| 129 | +- ADR 0001's validation explicitly requires acceptance ≥ 0.40 on a |
| 130 | + verifier we have measurements on. Switching that verifier to one we |
| 131 | + don't yet have measurements on changes what "validation" means. |
| 132 | +- Cost: training the 8B-verifier alignment requires ~30–50 GB of
| 133 | +  on-policy hidden-state cache. Spending that storage, and the compute to
| 134 | +  generate it, before validating the recipe on the smaller verifier is wasteful.
| 135 | + |
| 136 | +### 3.2 Skip Qwen3-8B, go directly to Qwen3-32B or DeepSeek-V2.5 (rejected) |
| 137 | + |
| 138 | +Why rejected: |
| 139 | + |
| 140 | +- 32 B at 4-bit is ~16 GB of weights, above the 14.4 GB the 60 % rule
| 141 | +  allows on a 24 GB Mac; with proposer + activations + OS, total resident
| 142 | +  is ~21 GB. This is the same memory cliff that rejected bf16 for 8B.
| 143 | +  Pushing harder without first hardening the engine layer is bad engineering sequencing.
| 144 | +- Latency budget: 32 B at 4-bit on M4 is estimated > 90 s for a |
| 145 | + 200-token reply, exceeding the perceived-latency target by 3×. |
| 146 | +- These verifiers are appropriate for the cloud-deployed verifier |
| 147 | + pattern (proposer local, verifier remote) sketched in |
| 148 | + `docs/local-inference-engine.md`. That deployment mode is recorded in |
| 149 | + a future ADR; this ADR's scope is local-only. |
| 150 | + |
| 151 | +### 3.3 Use a non-Qwen verifier for v1 / v2 (rejected for now) |
| 152 | + |
| 153 | +Candidates: Gemma 4, DeepSeek V3/V4 distill, Llama 3.x. |
| 154 | + |
| 155 | +Why rejected for v1 / v2: |
| 156 | + |
| 157 | +- Project commitment in the original product brief is Qwen / Gemma / |
| 158 | + DeepSeek as parallel targets. Sequencing them serially (Qwen first) |
| 159 | + is a planning choice, not an architectural one. |
| 160 | +- Tokenizer continuity: the current proposer (`dllm-hub Qwen3-0.6B-mdlm`) |
| 161 | + shares Qwen3 tokenizer. Switching verifier families forces a |
| 162 | + proposer family switch too, which compounds with v1's alignment |
| 163 | + validation in the same way as 3.1. |
| 164 | +- Multi-family support is a v3+ concern recorded in a future ADR. |
| 165 | + |
| 166 | +### 3.4 8-bit instead of 4-bit at the 8 B boundary (rejected) |
| 167 | + |
| 168 | +Why rejected: |
| 169 | + |
| 170 | +- 8-bit Qwen3-8B ≈ 8.5 GB resident. With proposer + activations + OS, |
| 171 | + total ~13–14 GB. Fits but eats most of the headroom needed for |
| 172 | + serving multiple sessions or running other apps concurrently. |
| 173 | +- 4-bit MLX (group-wise quantization, group_size=64) shows roughly 1 %
| 174 | +  perplexity degradation on the Qwen3 family, well below the noise floor of
| 175 | +  speculative decoding's accept/reject decisions.
| 176 | +- 4-bit is also the format with mature MLX community releases |
| 177 | + (`mlx-community/Qwen3-8B-4bit`), which removes a conversion step |
| 178 | + from the engineering path. |
| 179 | + |
| 180 | +### 3.5 Mix: bf16 verifier on CUDA, 4-bit on MLX (deferred) |
| 181 | + |
| 182 | +Tempting because the RTX 4090 has the same 24 GB as the reference M4,
| 183 | +with higher memory bandwidth, and bf16 8B (16 GB) would fit. Deferred because:
| 184 | + |
| 185 | +- It bifurcates the alignment training: bf16 verifier and 4-bit |
| 186 | + verifier produce slightly different hidden states, which in |
| 187 | + principle requires two alignment runs. |
| 188 | +- Empirically the difference is small enough (per Qwen3 4-bit |
| 189 | + literature) that one alignment run usually transfers, but we have no |
| 190 | + in-house measurement yet. |
| 191 | +- Deferred to a follow-up ADR after v2 ships and we measure the |
| 192 | + quantization-transfer gap directly. |
| 193 | + |
| 194 | +## 4. Consequences |
| 195 | + |
| 196 | +### 4.1 Positive |
| 197 | + |
| 198 | +- **v1 ships on a verifier we can already run.** No new model |
| 199 | + acquisition, no quantization conversion, no memory cliff. The only |
| 200 | + unknown in v1 is whether ADR 0001's alignment recipe actually works. |
| 201 | +- **v2 has a clear, bounded scope.** When v1 ships, v2 is mechanical:
| 202 | +  pass `--verifier-id mlx-community/Qwen3-8B-4bit`, regenerate the
| 203 | +  hidden-state cache, retrain proposer adapters, ship.
| 204 | +- **The 60 % memory rule generalizes.** Future verifier candidates can |
| 205 | + be evaluated with a one-line calculation; we are not redesigning |
| 206 | + memory budgets per model. |
| 207 | +- **Alignment pipeline reuse.** The `training/repr_align/` package |
| 208 | + built for v1 will run unchanged for v2. The verifier-decoupling claim |
| 209 | + of ADR 0001 §2.3 gets exercised in production rather than just on |
| 210 | + paper. |
| 211 | + |
| 212 | +### 4.2 Negative / accepted trade-offs |
| 213 | + |
| 214 | +- **v1 is a "stepping stone" verifier.** 1.7 B is below the size where |
| 215 | + the project's KV-cache-savings story becomes economically interesting |
| 216 | +  (KV/token at 1.7 B is small enough that the proposer's weight
| 217 | +  amortization breakeven sits at an uncomfortably large batch size × sequence length, B × S). This is
| 218 | + acknowledged: v1's purpose is recipe validation, not user-facing |
| 219 | + value. |
| 220 | +- **Quantization noise interacts with alignment.** When v2's verifier |
| 221 | + is 4-bit, the alignment recipe trains against quantized hidden |
| 222 | + states, which are very slightly different from bf16 hidden states. |
| 223 | + Acceptance may be 1–3 percentage points lower than equivalent bf16 |
| 224 | + alignment. We accept this; the absolute target (α ≥ 0.50 for v2 at |
| 225 | + K=2) is set with that haircut already factored in. |
| 226 | +- **Closed-weight models are out of scope.** GPT-4 / Claude / Gemini |
| 227 | + cannot be served by this engine in its primary mode. Section 5 |
| 228 | + explains why and what the lossy fallback would look like, but the |
| 229 | + fallback is not a v1/v2/v3 commitment. |
| 230 | + |
| 231 | +### 4.3 Implications for current and future code |
| 232 | + |
| 233 | +- **`scripts/setup_*.sh`**: `download_models` becomes parameterized |
| 234 | + over a model list rather than hard-coding `Qwen3-1.7B`. v1's list |
| 235 | + remains `[Qwen3-0.6B-mdlm, Qwen3-1.7B]`; v2's list adds |
| 236 | + `mlx-community/Qwen3-8B-4bit`. |
| 237 | +- **`inference_engine/backends/mlx/verifier.py`**: gains a `--verifier-id` |
| 238 | + CLI flag and propagates it through `MLXSinkWindowVerifier(config)`. |
| 239 | +- **`scripts/run_platform_tests.sh`**: HF cache pre-flight check |
| 240 | + becomes verifier-id-aware (currently hard-coded to `Qwen3-1.7B`). |
| 241 | +- **`training/repr_align/`** (introduced for ADR 0001): its |
| 242 | + hidden-state cache directory is keyed by verifier id; v1 and v2 |
| 243 | + produce non-overlapping caches that can coexist on disk. |
| 244 | +- **Future ADR 0003** records per-verifier K values and tree-spec |
| 245 | + configuration, which depend on measured acceptance from this ADR's |
| 246 | + v1 / v2 deliveries. |
| 247 | +- **Future ADR 0004** records remote/cloud-deployed verifier pattern |
| 248 | + (proposer local, verifier in the data center). That is where |
| 249 | + Qwen3-32B / DeepSeek-V2.5 / GPT-OSS-120B class models become |
| 250 | + in-scope. |
| 251 | + |
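The cache keying described in the `training/repr_align/` bullet above could look like the following sketch. The function name `cache_dir_for` and the `--` separator are assumptions made for illustration; the actual directory layout of the package is not specified in this ADR.

```python
from pathlib import Path


def cache_dir_for(root: Path, verifier_id: str) -> Path:
    """Map a verifier id to its own hidden-state cache directory under root.

    The "/" in ids such as "Qwen/Qwen3-1.7B" is replaced so that each id
    becomes a single path component (hypothetical naming scheme).
    """
    return root / verifier_id.replace("/", "--")


if __name__ == "__main__":
    root = Path("hidden_state_cache")
    print(cache_dir_for(root, "Qwen/Qwen3-1.7B"))
    print(cache_dir_for(root, "mlx-community/Qwen3-8B-4bit"))
```

Because the directory name is derived from the full verifier id, the v1 and v2 caches never collide and can be kept side by side on disk.
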
| 252 | +## 5. Closed-Weight Verifiers — Why and What If |
| 253 | + |
| 254 | +A recurring question is whether EAGLE-3-style alignment can be applied |
| 255 | +to commercial API-only models (GPT-4, Claude, Gemini, Qwen-Max). The |
| 256 | +honest answer has four parts.
| 257 | + |
| 258 | +### 5.1 What EAGLE-3 demands and which APIs supply it |
| 259 | + |
| 260 | +EAGLE-3 alignment draws on five classes of verifier signal:
| 261 | + |
| 262 | +| Signal | Required for | Available from API? | |
| 263 | +| ------------------------------ | --------------------------- | ------------------- | |
| 264 | +| Embedding weights | Shared `embed_tokens` in proposer | **No (any API)** | |
| 265 | +| LM head weights | Shared `lm_head` in proposer | **No (any API)** | |
| 266 | +| Last-layer hidden state per token | Hidden-state alignment loss | **No (any API)** | |
| 267 | +| Per-token top-K log-probs | Logits-distill auxiliary loss | OpenAI (top-20), Anthropic (no), Gemini (top-5 in some endpoints), Qwen-Max (no) | |
| 268 | +| Sampled token sequence | On-policy token-level supervision | **Yes (all APIs)** | |
| 269 | + |
| 270 | +The first three rows are the core of EAGLE-3 and are uniformly |
| 271 | +unavailable from closed APIs. This is not an oversight by API providers: |
| 272 | +exposing hidden states would leak the model's internal representation |
| 273 | +in a way that damages competitive moats and aids extraction attacks. It |
| 274 | +is highly unlikely to change for frontier models. |
| 275 | + |
| 276 | +### 5.2 What's still possible — degraded paths |
| 277 | + |
| 278 | +Three alternatives, all weaker than EAGLE-3:
| 279 | + |
| 280 | +1. **Logits-only distillation** (when API exposes top-K log-probs, |
| 281 | + e.g. OpenAI). Loss reduces to KL between proposer's full |
| 282 | + distribution and the API's top-K. Empirically observed acceptance |
| 283 | + ceiling: ~0.45–0.55 (vs 0.70–0.80 for full EAGLE-3). This is the |
| 284 | + regime that early speculative-decoding papers (the original DeepMind |
| 285 | + work) operated in before EAGLE introduced hidden-state alignment. |
| 286 | +2. **Sequence-level behavioral cloning** (when API exposes only |
| 287 | + sampled tokens). Loss is standard next-token cross-entropy on |
| 288 | + verifier-generated sequences. Empirically observed acceptance |
| 289 | + ceiling: ~0.30–0.40. This is essentially "train a small model to |
| 290 | + imitate the API's output style"; it does not exploit the verifier's |
| 291 | + probability distribution at all. |
| 292 | +3. **Hybrid with a local proxy** — train alignment against a |
| 293 | + *similar-but-open* verifier (e.g. align against Qwen3-72B-Instruct |
| 294 | + weights, deploy with Qwen-Max API). Produces a proposer aligned to |
| 295 | + the wrong target; transfer quality depends on how close the open |
| 296 | +   proxy is to the closed model. Empirically: 5–15 percentage points
| 297 | +   below what pure EAGLE-3 achieves against the open proxy itself.
| 298 | + |
| 299 | +### 5.3 Why none of these are v1/v2 ship targets |
| 300 | + |
| 301 | +- The project's primary value proposition (KV-cache replacement, local |
| 302 | + memory savings) requires running the verifier locally. A closed API |
| 303 | + verifier is run remotely; the "memory savings" become "the user pays |
| 304 | + per token". Different product, different ADR. |
| 305 | +- Acceptance ceilings of 0.30–0.55 collapse the speculative speedup to |
| 306 | + 1.3–1.8×, which does not justify the engineering complexity. |
| 307 | +- Closed APIs charge per token of *both* prompt and completion. The |
| 308 | + proposer's per-step verifier call contains ~K candidate tokens that |
| 309 | + may be rejected; rejected tokens still cost money. The economic |
| 310 | + break-even moves against speculative decoding in this setting. |
| 311 | + |
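The speedup range in the second bullet can be sanity-checked with the standard estimate for speculative decoding, under the simplifying assumption that each of the K drafted tokens is accepted independently with probability α. The function name is hypothetical, and K=2 is assumed here because that is the draft length this ADR uses for its acceptance targets.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per verifier call with k drafted tokens.

    Assumes independent per-token acceptance with probability alpha; the
    verifier always contributes one token of its own, so the result is
    (1 - alpha**(k + 1)) / (1 - alpha).
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    for alpha in (0.30, 0.55):
        tokens = expected_tokens_per_step(alpha, 2)
        print(f"alpha={alpha:.2f}: {tokens:.2f} tokens per verifier call")
```

At K=2 this gives about 1.39 tokens per verifier call for α = 0.30 and about 1.85 for α = 0.55. These are upper bounds on the speedup, since the proposer's own cost is ignored, which is consistent with the 1.3–1.8× range quoted above.
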
| 312 | +### 5.4 What we *will* support if the closed-API mode becomes a goal |
| 313 | + |
| 314 | +If the project later decides to address closed APIs (recorded in a |
| 315 | +future ADR), the path is: |
| 316 | + |
| 317 | +- A `RemoteVerifier` adapter exposing the same interface as |
| 318 | + `MLXSinkWindowVerifier` / `SinkWindowVerifier`. |
| 319 | +- A degraded `training/logit_distill_remote/` package implementing |
| 320 | + alternative #1 above when the target API exposes top-K log-probs. |
| 321 | +- A separate evaluation harness that reports acceptance, throughput, |
| 322 | + *and* token cost per generated user-visible token, because the |
| 323 | + third axis is what dominates the closed-API economics. |
| 324 | + |
| 325 | +This is a non-trivial body of work and explicitly out of scope for v1 |
| 326 | +and v2. Pursuing it without first finishing v2 would distract from |
| 327 | +proving the core alignment recipe works. |
| 328 | + |
| 329 | +## 6. Validation |
| 330 | + |
| 331 | +This ADR is considered validated when: |
| 332 | + |
| 333 | +1. **v1 validation**: ADR 0001 §6 conditions are met against |
| 334 | + `Qwen/Qwen3-1.7B` (α ≥ 0.40 at K=2), confirming the bf16 path |
| 335 | + functions end-to-end. |
| 336 | +2. **v2 validation**: with no changes to proposer architecture, |
| 337 | + training scripts, or serving code, the alignment pipeline produces |
| 338 | + a proposer for `mlx-community/Qwen3-8B-4bit` that achieves α ≥ 0.50 |
| 339 | +   at K=2 on the held-out evaluation set, and the engine generates the
| 340 | +   reference 200-token Chinese response within the 30 s perceived-latency
| 341 | +   target on the M4 reference machine.
| 342 | +3. **Memory-rule validation**: the 60 % memory threshold from §2.2 |
| 343 | + correctly predicts fit/no-fit on at least three independent target |
| 344 | + machines (24 GB Mac, 32 GB Mac, 24 GB RTX-class GPU) without |
| 345 | + per-machine tuning. |
| 346 | + |
| 347 | +If item 2 fails on perceived latency but passes on acceptance, the ADR |
| 348 | +is partially superseded by an engine-side optimization ADR. If item 2 |
| 349 | +fails on acceptance, ADR 0001's recipe is what needs revision, and |
| 350 | +this ADR's v2 commitment is paused until that revision lands. |
| 351 | + |
| 352 | +## 7. References |
| 353 | + |
| 354 | +- ADR 0001 — Proposer sizing, alignment, verifier decoupling |
| 355 | + (this ADR is the verifier counterpart of that proposer-side decision). |
| 356 | +- `docs/local-inference-engine.md` — describes the serving stack that |
| 357 | + consumes the verifiers selected here. |
| 358 | +- MLX community: `mlx-community/Qwen3-8B-4bit` for v2. |
| 359 | +- Qwen3 4-bit perplexity studies (community-published) supporting the |
| 360 | + 4-bit-at-8B decision. |
| 361 | +- Original DeepMind speculative decoding paper for the logits-only |
| 362 | + alignment regime referenced in §5.2. |