Add ADR 0002: verifier selection, quantization, open/closed-weight constraint

cursoragent · FluffyAIcode · cursoragent · commit 4309266b7c9c · 2026-05-23T10:21:00.000Z
Records the v1/v2 verifier ship sequence and the algorithmic constraint
that EAGLE-3 alignment requires open weights:

- v1 ships on Qwen3-1.7B bf16 (current verifier) once ADR 0001's
  alignment validation passes; v2 swaps to Qwen3-8B 4-bit, exercising
  ADR 0001's verifier-decoupling claim in production rather than just
  on paper.
- Quantization rule: bf16 below 4B params, 4-bit MLX/AWQ at 4B and
  above; 60% unified-memory threshold determines fit/no-fit on
  consumer hardware.
- Latency gate: a no-proposer baseline &gt;50s on the reference Mac
  rejects a verifier candidate at planning time (Qwen3-32B+ deferred
  to a future cloud-deployment ADR).
- Closed-weight APIs (GPT-4 / Claude / Gemini) cannot be aligned with
  the project's primary recipe because EAGLE-3 needs verifier
  embedding, LM head, and last-layer hidden state (none exposed by
  any frontier API). Section 5 documents three degraded fallback
  paths (logits-only distill ~0.45-0.55, sequence behavioral cloning
  ~0.30-0.40, hybrid open-proxy alignment) and explicitly puts them
  out of scope for v1/v2.
- Rejects four alternatives (skip v1, skip Qwen3-8B for 32B+, switch
  family, 8-bit at 8B) with reasons.

Adds the ADR to docs/adr/README.md index and to the top-level README.

Co-authored-by: FluffyAIcode &lt;FluffyAIcode@users.noreply.github.com&gt;
diff --git a/README.md b/README.md
@@ -271,3 +271,9 @@ explicitly rejected.
   0.25–1 B band, treat EAGLE-3 representation alignment as the canonical
   training recipe, and design verifier swaps to be data-and-fine-tune
   operations rather than re-architecture operations.
+- [ADR 0002 — Verifier selection, quantization, and the
+  open-vs-closed-weight constraint](docs/adr/0002-verifier-selection-and-quantization.md):
+  the v1/v2 ship sequence (Qwen3-1.7B bf16 → Qwen3-8B 4-bit), the 60 %
+  memory rule for choosing bf16 vs 4-bit, and why closed-weight APIs
+  (GPT/Claude/Gemini) cannot be aligned with EAGLE-3 and are out of
+  scope for v1 / v2.
diff --git a/docs/adr/0002-verifier-selection-and-quantization.md b/docs/adr/0002-verifier-selection-and-quantization.md
@@ -0,0 +1,362 @@
+# ADR 0002 — Verifier Selection, Quantization, and the Open-vs-Closed-Weight Constraint
+
+- **Status**: Accepted
+- **Date**: 2026-05-23
+- **Decision drivers**: Memory fit on consumer hardware, perceived
+  latency, alignment-training feasibility, future-proofing across the
+  Qwen / Gemma / DeepSeek roadmap.
+- **Depends on**: [ADR 0001](0001-proposer-sizing-and-alignment.md)
+  (proposer is a constant 0.25–1 B regardless of verifier choice).
+
+## 1. Context
+
+ADR 0001 fixed the proposer in a 0.25–1 B band and established that
+verifier swaps are *data-and-fine-tune* operations rather than
+re-architecture operations. That decision deliberately deferred the
+question of *which* verifier to ship against. This ADR resolves that
+question for the project's first two ship targets and lays out the rule
+for future swaps.
+
+The decision space is constrained by four hard, independent factors:
+
+1. **Memory budget.** Primary deployment target is consumer hardware:
+   Mac M-series (16–64 GB unified memory) and consumer Nvidia GPUs
+   (RTX 4090: 24 GB, RTX 3090: 24 GB). The 24 GB Mac M4 is the
+   project's reference machine and is the lower bound below which we
+   refuse to design.
+2. **Perceived latency.** The chat REPL (`scripts/chat.py`) has a
+   project-internal target of ≤ 30 s for a 200-token Chinese response on
+   the reference Mac. Beyond that, the streaming UX is no longer
+   acceptable.
+3. **Alignment availability.** Per ADR 0001, every verifier we ship
+   against requires its own representation-alignment-trained proposer.
+   A verifier we cannot align against is a verifier we cannot ship.
+4. **Open weights for hidden states.** EAGLE-3 alignment requires
+   read access to the verifier's embedding, last-layer hidden state, and
+   LM head. Closed-weight (API-only) verifiers cannot be aligned with
+   the project's primary recipe; they fall back to a degraded path that
+   tops out at lower acceptance (see section 5).
+
+The current measured baseline (commit `e207aed`, MLX backend on M4):
+
+| Metric                        | Value                  |
+| ----------------------------- | ---------------------- |
+| Verifier                      | `Qwen/Qwen3-1.7B`, bf16 |
+| Resident memory               | ~5.5 GB                |
+| Wall-time (zh KV-cache prompt, 150 tokens) | 12.07 s   |
+| Acceptance (no alignment)     | 0.06–0.12              |
+
+## 2. Decision
+
+### 2.1 Ship sequence
+
+The project ships against verifiers in this order:
+
+| Ship | Verifier            | Backend         | Quant   | Triggered when                                       |
+| ---- | ------------------- | --------------- | ------- | ---------------------------------------------------- |
+| v1   | `Qwen/Qwen3-1.7B`   | MLX / CUDA      | bf16    | ADR 0001 Validation #2 passes (α ≥ 0.40 at K=2)      |
+| v2   | `Qwen/Qwen3-8B`     | MLX / CUDA      | **4-bit (AWQ-style)** | v1 in production + alignment retrained for 8B verifier |
+| v3+  | larger / MoE        | TBD             | TBD     | Recorded in a future ADR (0004+)                     |
+
+v1 reuses the verifier we have today. v2 is the planned upgrade. The
+2-step sequence exists deliberately: v1 proves that the alignment
+pipeline (ADR 0001) actually works on a verifier we can run end-to-end
+on a 24 GB Mac without quantization risk; v2 then exercises the
+verifier-decoupling claim of ADR 0001 §2.3 — same proposer architecture,
+new alignment artifacts, new quantized verifier weights.
+
+### 2.2 Quantization rule
+
+**bf16 below 4 B parameters; 4-bit MLX (or AWQ on CUDA) at 4 B and above.**
+
+Concretely:
+
+- Verifier ≤ 2 B params: bf16 unconditionally. Memory headroom is
+  sufficient on the reference 24 GB Mac; 4-bit gains nothing
+  meaningful and adds quantization noise that hurts acceptance.
+- 2 B < Verifier < 8 B: bf16 if it fits in ≤ 60 % of available unified
+  memory, else 4-bit. Decision is made per target machine in the engine
+  config, not statically per verifier.
+- Verifier ≥ 8 B: 4-bit by default. bf16 is reserved for non-consumer
+  GPU paths (A100 / H100 / MI300) recorded in a separate engine
+  deployment ADR.
+
+The 60 % threshold leaves ~10 GB headroom on a 24 GB machine for the
+proposer (≤ 1 GB), per-forward activations (~1–2 GB), the OS, and other
+applications the user is running.
+
+### 2.3 Open-weight requirement
+
+**The project's primary alignment recipe (ADR 0001) requires
+open-weight verifiers.** Closed-weight verifiers (e.g. GPT-4-class
+APIs, Claude-class APIs, Gemini-API-only models) cannot be aligned with
+EAGLE-3 because they expose neither embedding weights nor hidden
+states. Section 5 documents the degraded fallback path; it is *not* a
+ship target for v1 or v2.
+
+### 2.4 Latency budget enforcement
+
+For each verifier candidate, before training the alignment proposer
+against it, run a *no-proposer-baseline* benchmark on the reference
+hardware:
+
+```
+verifier.generate(reference_prompt, max_new_tokens=200)
+```
+
+If this baseline exceeds 50 s on the reference Mac, the verifier is
+rejected at the planning stage — speculative decoding cannot recover a
+verifier that is fundamentally too slow.
+
+For Qwen3-8B 4-bit on M4: estimated baseline 35–45 s for 200 tokens,
+which leaves margin. For Qwen3-32B-class models on M4: estimated
+baseline > 90 s, which rejects them at this gate. They become
+candidates only on cloud / data-center deployment, recorded in a
+separate ADR.
+
+## 3. Alternatives Considered
+
+### 3.1 Skip v1, go straight to Qwen3-8B (rejected)
+
+Rationale for considering: 1.7 B is "too small to matter" as a
+production verifier; the project's value proposition is at 7 B+ scale.
+
+Why rejected:
+
+- Two unproven things at once (alignment pipeline correctness *and*
+  quantized 8B fit/latency) compound risk. If something goes wrong, we
+  cannot tell which factor caused it.
+- ADR 0001's validation explicitly requires acceptance ≥ 0.40 on a
+  verifier we have measurements on. Switching that verifier to one we
+  don't yet have measurements on changes what "validation" means.
+- Cost: training the 8B-verifier alignment requires ~30–50 GB of
+  on-policy hidden-state cache. Burning that compute before validating
+  the recipe on the smaller verifier is wasteful.
+
+### 3.2 Skip Qwen3-8B, go directly to Qwen3-32B or DeepSeek-V2.5 (rejected)
+
+Why rejected:
+
+- 32 B 4-bit ~ 16 GB; with proposer + activations + OS, total resident
+  is ~21 GB on a 24 GB Mac — same memory cliff that rejected bf16 for
+  8B. Pushing harder without first hardening the engine layer is bad
+  engineering sequencing.
+- Latency budget: 32 B at 4-bit on M4 is estimated > 90 s for a
+  200-token reply, exceeding the perceived-latency target by 3×.
+- These verifiers are appropriate for the cloud-deployed verifier
+  pattern (proposer local, verifier remote) sketched in
+  `docs/local-inference-engine.md`. That deployment mode is recorded in
+  a future ADR; this ADR's scope is local-only.
+
+### 3.3 Use a non-Qwen verifier for v1 / v2 (rejected for now)
+
+Candidates: Gemma 4, DeepSeek V3/V4 distill, Llama 3.x.
+
+Why rejected for v1 / v2:
+
+- Project commitment in the original product brief is Qwen / Gemma /
+  DeepSeek as parallel targets. Sequencing them serially (Qwen first)
+  is a planning choice, not an architectural one.
+- Tokenizer continuity: the current proposer (`dllm-hub Qwen3-0.6B-mdlm`)
+  shares Qwen3 tokenizer. Switching verifier families forces a
+  proposer family switch too, which compounds with v1's alignment
+  validation in the same way as 3.1.
+- Multi-family support is a v3+ concern recorded in a future ADR.
+
+### 3.4 8-bit instead of 4-bit at the 8 B boundary (rejected)
+
+Why rejected:
+
+- 8-bit Qwen3-8B ≈ 8.5 GB resident. With proposer + activations + OS,
+  total ~13–14 GB. Fits but eats most of the headroom needed for
+  serving multiple sessions or running other apps concurrently.
+- 4-bit MLX (group-wise quantization, group_size=64) measures ~1 % of
+  perplexity degradation on Qwen3 family, well below the noise floor of
+  speculative decoding's accept/reject decisions.
+- 4-bit is also the format with mature MLX community releases
+  (`mlx-community/Qwen3-8B-4bit`), which removes a conversion step
+  from the engineering path.
+
+### 3.5 Mix: bf16 verifier on CUDA, 4-bit on MLX (deferred)
+
+Tempting because RTX 4090 has the same 24 GB as M4 but with faster
+memory bandwidth, so bf16 8B (16 GB) would fit. Deferred because:
+
+- It bifurcates the alignment training: bf16 verifier and 4-bit
+  verifier produce slightly different hidden states, which in
+  principle requires two alignment runs.
+- Empirically the difference is small enough (per Qwen3 4-bit
+  literature) that one alignment run usually transfers, but we have no
+  in-house measurement yet.
+- Deferred to a follow-up ADR after v2 ships and we measure the
+  quantization-transfer gap directly.
+
+## 4. Consequences
+
+### 4.1 Positive
+
+- **v1 ships on a verifier we can already run.** No new model
+  acquisition, no quantization conversion, no memory cliff. The only
+  unknown in v1 is whether ADR 0001's alignment recipe actually works.
+- **v2 has a clear, bounded scope.** When v1 ships, v2 is mechanical:
+  add `--verifier-id Qwen/Qwen3-8B-4bit` flag, regenerate hidden-state
+  cache, retrain proposer adapters, ship.
+- **The 60 % memory rule generalizes.** Future verifier candidates can
+  be evaluated with a one-line calculation; we are not redesigning
+  memory budgets per model.
+- **Alignment pipeline reuse.** The `training/repr_align/` package
+  built for v1 will run unchanged for v2. The verifier-decoupling claim
+  of ADR 0001 §2.3 gets exercised in production rather than just on
+  paper.
+
+### 4.2 Negative / accepted trade-offs
+
+- **v1 is a "stepping stone" verifier.** 1.7 B is below the size where
+  the project's KV-cache-savings story becomes economically interesting
+  (KV/token at 1.7 B is small enough that the proposer's weight
+  amortization breakeven sits at uncomfortably large B × S). This is
+  acknowledged: v1's purpose is recipe validation, not user-facing
+  value.
+- **Quantization noise interacts with alignment.** When v2's verifier
+  is 4-bit, the alignment recipe trains against quantized hidden
+  states, which are very slightly different from bf16 hidden states.
+  Acceptance may be 1–3 percentage points lower than equivalent bf16
+  alignment. We accept this; the absolute target (α ≥ 0.50 for v2 at
+  K=2) is set with that haircut already factored in.
+- **Closed-weight models are out of scope.** GPT-4 / Claude / Gemini
+  cannot be served by this engine in its primary mode. Section 5
+  explains why and what the lossy fallback would look like, but the
+  fallback is not a v1/v2/v3 commitment.
+
+### 4.3 Implications for current and future code
+
+- **`scripts/setup_*.sh`**: `download_models` becomes parameterized
+  over a model list rather than hard-coding `Qwen3-1.7B`. v1's list
+  remains `[Qwen3-0.6B-mdlm, Qwen3-1.7B]`; v2's list adds
+  `mlx-community/Qwen3-8B-4bit`.
+- **`inference_engine/backends/mlx/verifier.py`**: gains a `--verifier-id`
+  CLI flag and propagates it through `MLXSinkWindowVerifier(config)`.
+- **`scripts/run_platform_tests.sh`**: HF cache pre-flight check
+  becomes verifier-id-aware (currently hard-coded to `Qwen3-1.7B`).
+- **`training/repr_align/`** (introduced for ADR 0001): its
+  hidden-state cache directory is keyed by verifier id; v1 and v2
+  produce non-overlapping caches that can coexist on disk.
+- **Future ADR 0003** records per-verifier K values and tree-spec
+  configuration, which depend on measured acceptance from this ADR's
+  v1 / v2 deliveries.
+- **Future ADR 0004** records remote/cloud-deployed verifier pattern
+  (proposer local, verifier in the data center). That is where
+  Qwen3-32B / DeepSeek-V2.5 / GPT-OSS-120B class models become
+  in-scope.
+
+## 5. Closed-Weight Verifiers — Why and What If
+
+A recurring question is whether EAGLE-3-style alignment can be applied
+to commercial API-only models (GPT-4, Claude, Gemini, Qwen-Max). The
+honest answer has three parts.
+
+### 5.1 What EAGLE-3 demands and which APIs supply it
+
+EAGLE-3 alignment uses three classes of verifier signal:
+
+| Signal                         | Required for                | Available from API? |
+| ------------------------------ | --------------------------- | ------------------- |
+| Embedding weights              | Shared `embed_tokens` in proposer | **No (any API)** |
+| LM head weights                | Shared `lm_head` in proposer | **No (any API)** |
+| Last-layer hidden state per token | Hidden-state alignment loss   | **No (any API)** |
+| Per-token top-K log-probs      | Logits-distill auxiliary loss | OpenAI (top-20), Anthropic (no), Gemini (top-5 in some endpoints), Qwen-Max (no) |
+| Sampled token sequence         | On-policy token-level supervision | **Yes (all APIs)** |
+
+The first three rows are the core of EAGLE-3 and are uniformly
+unavailable from closed APIs. This is not an oversight by API providers:
+exposing hidden states would leak the model's internal representation
+in a way that damages competitive moats and aids extraction attacks. It
+is highly unlikely to change for frontier models.
+
+### 5.2 What's still possible — degraded paths
+
+Three increasingly weak alternatives, all worse than EAGLE-3:
+
+1. **Logits-only distillation** (when API exposes top-K log-probs,
+   e.g. OpenAI). Loss reduces to KL between proposer's full
+   distribution and the API's top-K. Empirically observed acceptance
+   ceiling: ~0.45–0.55 (vs 0.70–0.80 for full EAGLE-3). This is the
+   regime that early speculative-decoding papers (the original DeepMind
+   work) operated in before EAGLE introduced hidden-state alignment.
+2. **Sequence-level behavioral cloning** (when API exposes only
+   sampled tokens). Loss is standard next-token cross-entropy on
+   verifier-generated sequences. Empirically observed acceptance
+   ceiling: ~0.30–0.40. This is essentially "train a small model to
+   imitate the API's output style"; it does not exploit the verifier's
+   probability distribution at all.
+3. **Hybrid with a local proxy** — train alignment against a
+   *similar-but-open* verifier (e.g. align against Qwen3-72B-Instruct
+   weights, deploy with Qwen-Max API). Produces a proposer aligned to
+   the wrong target; transfer quality depends on how close the open
+   proxy is to the closed model. Empirically: 5–15 percentage points
+   below pure EAGLE-3 against the actual proxy.
+
+### 5.3 Why none of these are v1/v2 ship targets
+
+- The project's primary value proposition (KV-cache replacement, local
+  memory savings) requires running the verifier locally. A closed API
+  verifier is run remotely; the "memory savings" become "the user pays
+  per token". Different product, different ADR.
+- Acceptance ceilings of 0.30–0.55 collapse the speculative speedup to
+  1.3–1.8×, which does not justify the engineering complexity.
+- Closed APIs charge per token of *both* prompt and completion. The
+  proposer's per-step verifier call contains ~K candidate tokens that
+  may be rejected; rejected tokens still cost money. The economic
+  break-even moves against speculative decoding in this setting.
+
+### 5.4 What we *will* support if the closed-API mode becomes a goal
+
+If the project later decides to address closed APIs (recorded in a
+future ADR), the path is:
+
+- A `RemoteVerifier` adapter exposing the same interface as
+  `MLXSinkWindowVerifier` / `SinkWindowVerifier`.
+- A degraded `training/logit_distill_remote/` package implementing
+  alternative #1 above when the target API exposes top-K log-probs.
+- A separate evaluation harness that reports acceptance, throughput,
+  *and* token cost per generated user-visible token, because the
+  third axis is what dominates the closed-API economics.
+
+This is a non-trivial body of work and explicitly out of scope for v1
+and v2. Pursuing it without first finishing v2 would distract from
+proving the core alignment recipe works.
+
+## 6. Validation
+
+This ADR is considered validated when:
+
+1. **v1 validation**: ADR 0001 §6 conditions are met against
+   `Qwen/Qwen3-1.7B` (α ≥ 0.40 at K=2), confirming the bf16 path
+   functions end-to-end.
+2. **v2 validation**: with no changes to proposer architecture,
+   training scripts, or serving code, the alignment pipeline produces
+   a proposer for `mlx-community/Qwen3-8B-4bit` that achieves α ≥ 0.50
+   at K=2 on the held-out evaluation set, and the engine runs the
+   reference 200-token Chinese prompt within the 30 s perceived-latency
+   target on the M4 reference machine.
+3. **Memory-rule validation**: the 60 % memory threshold from §2.2
+   correctly predicts fit/no-fit on at least three independent target
+   machines (24 GB Mac, 32 GB Mac, 24 GB RTX-class GPU) without
+   per-machine tuning.
+
+If item 2 fails on perceived latency but passes on acceptance, the ADR
+is partially superseded by an engine-side optimization ADR. If item 2
+fails on acceptance, ADR 0001's recipe is what needs revision, and
+this ADR's v2 commitment is paused until that revision lands.
+
+## 7. References
+
+- ADR 0001 — Proposer sizing, alignment, verifier decoupling
+  (this ADR is the verifier counterpart of that proposer-side decision).
+- `docs/local-inference-engine.md` — describes the serving stack that
+  consumes the verifiers selected here.
+- MLX community: `mlx-community/Qwen3-8B-4bit` for v2.
+- Qwen3 4-bit perplexity studies (community-published) supporting the
+  4-bit-at-8B decision.
+- Original DeepMind speculative decoding paper for the logits-only
+  alignment regime referenced in §5.2.
diff --git a/docs/adr/README.md b/docs/adr/README.md
@@ -33,3 +33,4 @@ reader what was *not* chosen.
 | #    | Title                                                           | Status   |
 | ---- | --------------------------------------------------------------- | -------- |
 | 0001 | [Proposer sizing, alignment, and verifier decoupling](0001-proposer-sizing-and-alignment.md) | Accepted |
+| 0002 | [Verifier selection, quantization, and the open-vs-closed-weight constraint](0002-verifier-selection-and-quantization.md) | Accepted |