Commit 4309266

Add ADR 0002: verifier selection, quantization, open/closed-weight constraint
Records the v1/v2 verifier ship sequence and the algorithmic constraint that EAGLE-3 alignment requires open weights:

- v1 ships on Qwen3-1.7B bf16 (current verifier) once ADR 0001's alignment validation passes; v2 swaps to Qwen3-8B 4-bit, exercising ADR 0001's verifier-decoupling claim in production rather than just on paper.
- Quantization rule: bf16 below 4B params, 4-bit MLX/AWQ at 4B and above; 60% unified-memory threshold determines fit/no-fit on consumer hardware.
- Latency gate: a no-proposer baseline >50s on the reference Mac rejects a verifier candidate at planning time (Qwen3-32B+ deferred to a future cloud-deployment ADR).
- Closed-weight APIs (GPT-4 / Claude / Gemini) cannot be aligned with the project's primary recipe because EAGLE-3 needs verifier embedding, LM head, and last-layer hidden state (none exposed by any frontier API). Section 5 documents three degraded fallback paths (logits-only distill ~0.45-0.55, sequence behavioral cloning ~0.30-0.40, hybrid open-proxy alignment) and explicitly puts them out of scope for v1/v2.
- Rejects four alternatives (skip v1, skip Qwen3-8B for 32B+, switch family, 8-bit at 8B) with reasons.

Adds the ADR to docs/adr/README.md index and to the top-level README.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
1 parent e207aed commit 4309266

3 files changed

Lines changed: 369 additions & 0 deletions

README.md

Lines changed: 6 additions & 0 deletions
@@ -271,3 +271,9 @@ explicitly rejected.
271271
0.25–1 B band, treat EAGLE-3 representation alignment as the canonical
272272
training recipe, and design verifier swaps to be data-and-fine-tune
273273
operations rather than re-architecture operations.
274+
- [ADR 0002 — Verifier selection, quantization, and the
275+
open-vs-closed-weight constraint](docs/adr/0002-verifier-selection-and-quantization.md):
276+
the v1/v2 ship sequence (Qwen3-1.7B bf16 → Qwen3-8B 4-bit), the 60 %
277+
memory rule for choosing bf16 vs 4-bit, and why closed-weight APIs
278+
(GPT/Claude/Gemini) cannot be aligned with EAGLE-3 and are out of
279+
scope for v1 / v2.
docs/adr/0002-verifier-selection-and-quantization.md

Lines changed: 362 additions & 0 deletions
@@ -0,0 +1,362 @@
1+
# ADR 0002 — Verifier Selection, Quantization, and the Open-vs-Closed-Weight Constraint
2+
3+
- **Status**: Accepted
4+
- **Date**: 2026-05-23
5+
- **Decision drivers**: Memory fit on consumer hardware, perceived
6+
latency, alignment-training feasibility, future-proofing across the
7+
Qwen / Gemma / DeepSeek roadmap.
8+
- **Depends on**: [ADR 0001](0001-proposer-sizing-and-alignment.md)
9+
(proposer is a constant 0.25–1 B regardless of verifier choice).
10+
11+
## 1. Context
12+
13+
ADR 0001 fixed the proposer in a 0.25–1 B band and established that
14+
verifier swaps are *data-and-fine-tune* operations rather than
15+
re-architecture operations. That decision deliberately deferred the
16+
question of *which* verifier to ship against. This ADR resolves that
17+
question for the project's first two ship targets and lays out the rule
18+
for future swaps.
19+
20+
The decision space is constrained by four hard, independent factors:
21+
22+
1. **Memory budget.** Primary deployment target is consumer hardware:
23+
Mac M-series (16–64 GB unified memory) and consumer Nvidia GPUs
24+
(RTX 4090: 24 GB, RTX 3090: 24 GB). The 24 GB Mac M4 is the
25+
project's reference machine and is the lower bound below which we
26+
refuse to design.
27+
2. **Perceived latency.** The chat REPL (`scripts/chat.py`) has a
28+
project-internal target of ≤ 30 s for a 200-token Chinese response on
29+
the reference Mac. Beyond that, the streaming UX is no longer
30+
acceptable.
31+
3. **Alignment availability.** Per ADR 0001, every verifier we ship
32+
against requires its own representation-alignment-trained proposer.
33+
A verifier we cannot align against is a verifier we cannot ship.
34+
4. **Open weights for hidden states.** EAGLE-3 alignment requires
35+
read access to the verifier's embedding, last-layer hidden state, and
36+
LM head. Closed-weight (API-only) verifiers cannot be aligned with
37+
the project's primary recipe; they fall back to a degraded path that
38+
tops out at lower acceptance (see section 5).
39+
40+
The current measured baseline (commit `e207aed`, MLX backend on M4):
41+
42+
| Metric                                     | Value                   |
| ------------------------------------------ | ----------------------- |
| Verifier                                   | `Qwen/Qwen3-1.7B`, bf16 |
| Resident memory                            | ~5.5 GB                 |
| Wall-time (zh KV-cache prompt, 150 tokens) | 12.07 s                 |
| Acceptance (no alignment)                  | 0.06–0.12               |
48+
49+
## 2. Decision
50+
51+
### 2.1 Ship sequence
52+
53+
The project ships against verifiers in this order:
54+
55+
| Ship | Verifier          | Backend    | Quant                 | Triggered when                                             |
| ---- | ----------------- | ---------- | --------------------- | ---------------------------------------------------------- |
| v1   | `Qwen/Qwen3-1.7B` | MLX / CUDA | bf16                  | ADR 0001 Validation #2 passes (acceptance α ≥ 0.40 at K=2) |
| v2   | `Qwen/Qwen3-8B`   | MLX / CUDA | **4-bit (AWQ-style)** | v1 in production + alignment retrained for 8B verifier     |
| v3+  | larger / MoE      | TBD        | TBD                   | Recorded in a future ADR (0004+)                           |
60+
61+
v1 reuses the verifier we have today. v2 is the planned upgrade. The
62+
two-step sequence is deliberate: v1 proves that the alignment
63+
pipeline (ADR 0001) actually works on a verifier we can run end-to-end
64+
on a 24 GB Mac without quantization risk; v2 then exercises the
65+
verifier-decoupling claim of ADR 0001 §2.3 — same proposer architecture,
66+
new alignment artifacts, new quantized verifier weights.
67+
68+
### 2.2 Quantization rule
69+
70+
**bf16 below 4 B parameters; 4-bit MLX (or AWQ on CUDA) at 4 B and above.**
71+
72+
Concretely:
73+
74+
- Verifier ≤ 2 B params: bf16 unconditionally. Memory headroom is
75+
sufficient on the reference 24 GB Mac; 4-bit gains nothing
76+
meaningful and adds quantization noise that hurts acceptance.
77+
- 2 B < Verifier < 8 B: bf16 if it fits in ≤ 60 % of available unified
78+
memory, else 4-bit. Decision is made per target machine in the engine
79+
config, not statically per verifier.
80+
- Verifier ≥ 8 B: 4-bit by default. bf16 is reserved for non-consumer
81+
GPU paths (A100 / H100 / MI300) recorded in a separate engine
82+
deployment ADR.
83+
84+
The 60 % threshold leaves ~10 GB headroom on a 24 GB machine for the
85+
proposer (≤ 1 GB), per-forward activations (~1–2 GB), the OS, and other
86+
applications the user is running.
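As a rough illustration, the three bullets above can be written as a small decision function. This is a sketch under stated assumptions, not the engine's actual config logic: `choose_quantization` and its constants are hypothetical names, "available unified memory" is taken to mean the machine's total memory, and only bf16 weight bytes (2 bytes per parameter) are counted against the 60 % budget. It follows the three bullets rather than the 4 B headline.

```python
BYTES_PER_PARAM_BF16 = 2.0     # bf16 stores each weight in 2 bytes
MEMORY_FRACTION_LIMIT = 0.60   # the 60 % threshold from this section


def choose_quantization(params_billion: float, unified_memory_gb: float) -> str:
    """Return "bf16" or "4bit" for a verifier on a given machine.

    Hypothetical helper: the name, signature, and the weights-only
    memory estimate are assumptions made for illustration.
    """
    if params_billion <= 2.0:
        return "bf16"          # small verifiers: bf16 unconditionally
    if params_billion >= 8.0:
        return "4bit"          # large verifiers: 4-bit by default
    # Middle band: bf16 only if its weights fit inside the 60 % budget.
    bf16_weights_gb = params_billion * BYTES_PER_PARAM_BF16
    budget_gb = MEMORY_FRACTION_LIMIT * unified_memory_gb
    return "bf16" if bf16_weights_gb <= budget_gb else "4bit"


if __name__ == "__main__":
    for params, mem in [(1.7, 24), (4.0, 24), (7.0, 16), (8.0, 24)]:
        print(f"{params} B on {mem} GB -> {choose_quantization(params, mem)}")
```

On the 24 GB reference Mac this picks bf16 for a 1.7 B verifier and 4-bit for an 8 B one, which matches the v1/v2 sequence in §2.1.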
87+
88+
### 2.3 Open-weight requirement
89+
90+
**The project's primary alignment recipe (ADR 0001) requires
91+
open-weight verifiers.** Closed-weight verifiers (e.g. GPT-4-class
92+
APIs, Claude-class APIs, Gemini-API-only models) cannot be aligned with
93+
EAGLE-3 because they expose neither embedding weights nor hidden
94+
states. Section 5 documents the degraded fallback paths; none of them is a
95+
ship target for v1 or v2.
96+
97+
### 2.4 Latency budget enforcement
98+
99+
For each verifier candidate, before training the alignment proposer
100+
against it, run a *no-proposer-baseline* benchmark on the reference
101+
hardware:
102+
103+
```
verifier.generate(reference_prompt, max_new_tokens=200)
```
106+
107+
If this baseline exceeds 50 s on the reference Mac, the verifier is
108+
rejected at the planning stage — speculative decoding cannot recover a
109+
verifier that is fundamentally too slow.
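A minimal sketch of how this gate could be scripted is shown below. It assumes the verifier exposes a callable taking a prompt and a token budget; `passes_latency_gate`, its parameters, and the injectable `clock` are hypothetical names introduced here for illustration, not part of the project's code.

```python
import time
from typing import Callable, Tuple

LATENCY_GATE_SECONDS = 50.0  # rejection threshold stated in this section


def passes_latency_gate(
    generate: Callable[[str, int], object],
    prompt: str,
    max_new_tokens: int = 200,
    gate_seconds: float = LATENCY_GATE_SECONDS,
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[bool, float]:
    """Time one no-proposer generation and compare it with the gate.

    Returns (passed, elapsed_seconds). `generate` stands in for the
    verifier's own generation call; `clock` is injectable so the gate
    logic can be checked without running a real model.
    """
    start = clock()
    generate(prompt, max_new_tokens)
    elapsed = clock() - start
    # "Exceeds 50 s" rejects, so exactly 50 s still passes.
    return elapsed <= gate_seconds, elapsed


if __name__ == "__main__":
    # Stand-in generator; a real run would pass the verifier's method.
    passed, seconds = passes_latency_gate(lambda p, n: None, "reference prompt")
    print(f"passed={passed}, elapsed={seconds:.3f} s")
```

A single timed run is noisy; in practice one would likely warm the model up first and take the median of several runs, which this sketch leaves out for brevity.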
110+
111+
For Qwen3-8B 4-bit on M4: estimated baseline 35–45 s for 200 tokens,
112+
which leaves margin under the 50 s gate. For Qwen3-32B-class models on M4: estimated
113+
baseline > 90 s, which rejects them at this gate. They become
114+
candidates only on cloud / data-center deployment, recorded in a
115+
separate ADR.
116+
117+
## 3. Alternatives Considered
118+
119+
### 3.1 Skip v1, go straight to Qwen3-8B (rejected)
120+
121+
Rationale for considering: 1.7 B is "too small to matter" as a
122+
production verifier; the project's value proposition is at 7 B+ scale.
123+
124+
Why rejected:
125+
126+
- Two unproven things at once (alignment pipeline correctness *and*
127+
quantized 8B fit/latency) compound risk. If something goes wrong, we
128+
cannot tell which factor caused it.
129+
- ADR 0001's validation explicitly requires acceptance ≥ 0.40 on a
130+
verifier we have measurements on. Switching that verifier to one we
131+
don't yet have measurements on changes what "validation" means.
132+
- Cost: training the 8B-verifier alignment requires ~30–50 GB of
133+
on-policy hidden-state cache. Spending that storage and compute before validating
134+
the recipe on the smaller verifier is wasteful.
135+
136+
### 3.2 Skip Qwen3-8B, go directly to Qwen3-32B or DeepSeek-V2.5 (rejected)
137+
138+
Why rejected:
139+
140+
- 32 B 4-bit ~ 16 GB; with proposer + activations + OS, total resident
141+
is ~21 GB on a 24 GB Mac — same memory cliff that rejected bf16 for
142+
8B. Pushing harder without first hardening the engine layer is bad
143+
engineering sequencing.
144+
- Latency budget: 32 B at 4-bit on M4 is estimated > 90 s for a
145+
200-token reply, exceeding the perceived-latency target by 3×.
146+
- These verifiers are appropriate for the cloud-deployed verifier
147+
pattern (proposer local, verifier remote) sketched in
148+
`docs/local-inference-engine.md`. That deployment mode is recorded in
149+
a future ADR; this ADR's scope is local-only.
150+
151+
### 3.3 Use a non-Qwen verifier for v1 / v2 (rejected for now)
152+
153+
Candidates: Gemma 4, DeepSeek V3/V4 distill, Llama 3.x.
154+
155+
Why rejected for v1 / v2:
156+
157+
- Project commitment in the original product brief is Qwen / Gemma /
158+
DeepSeek as parallel targets. Sequencing them serially (Qwen first)
159+
is a planning choice, not an architectural one.
160+
- Tokenizer continuity: the current proposer (`dllm-hub Qwen3-0.6B-mdlm`)
161+
shares the Qwen3 tokenizer. Switching verifier families forces a
162+
proposer family switch too, which compounds with v1's alignment
163+
validation in the same way as 3.1.
164+
- Multi-family support is a v3+ concern recorded in a future ADR.
165+
166+
### 3.4 8-bit instead of 4-bit at the 8 B boundary (rejected)
167+
168+
Why rejected:
169+
170+
- 8-bit Qwen3-8B ≈ 8.5 GB resident. With proposer + activations + OS,
171+
total ~13–14 GB. Fits but eats most of the headroom needed for
172+
serving multiple sessions or running other apps concurrently.
173+
- 4-bit MLX (group-wise quantization, group_size=64) shows roughly 1 %
perplexity degradation on the Qwen3 family, well below the noise floor of
175+
speculative decoding's accept/reject decisions.
176+
- 4-bit is also the format with mature MLX community releases
177+
(`mlx-community/Qwen3-8B-4bit`), which removes a conversion step
178+
from the engineering path.
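The weight sizes quoted in §3.2, §3.4 and §3.5 can be sanity-checked with back-of-envelope arithmetic: raw weight bytes are the parameter count times bits per weight, divided by eight. The helper below is a hypothetical illustration that treats model names as exact parameter counts (for example "8 B" as 8 × 10⁹) and ignores quantization scales, KV cache, and activations, so real resident sizes, such as the ≈ 8.5 GB quoted above for 8-bit Qwen3-8B, come out somewhat higher than this raw estimate.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Raw weight size in GB (10^9 bytes).

    Hypothetical helper for illustration only: excludes quantization
    scales/zero-points, KV cache, activations, and runtime overhead.
    """
    return params_billion * bits_per_weight / 8.0


if __name__ == "__main__":
    for label, params, bits in [
        ("8B bf16", 8.0, 16),
        ("8B 8-bit", 8.0, 8),
        ("8B 4-bit", 8.0, 4),
        ("32B 4-bit", 32.0, 4),
    ]:
        print(f"{label}: {weight_footprint_gb(params, bits):.1f} GB raw weights")
```

Under these assumptions bf16 8B and 4-bit 32B both come to 16 GB of raw weights, which is the "same memory cliff" comparison drawn in §3.2.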
179+
180+
### 3.5 Mix: bf16 verifier on CUDA, 4-bit on MLX (deferred)
181+
182+
Tempting because the RTX 4090 has the same 24 GB as the M4, but as dedicated
VRAM with higher memory bandwidth, so bf16 8B (16 GB) would fit. Deferred because:
184+
185+
- It bifurcates the alignment training: bf16 verifier and 4-bit
186+
verifier produce slightly different hidden states, which in
187+
principle requires two alignment runs.
188+
- Empirically the difference is small enough (per Qwen3 4-bit
189+
literature) that one alignment run usually transfers, but we have no
190+
in-house measurement yet.
191+
- Deferred to a follow-up ADR after v2 ships and we measure the
192+
quantization-transfer gap directly.
193+
194+
## 4. Consequences
195+
196+
### 4.1 Positive
197+
198+
- **v1 ships on a verifier we can already run.** No new model
199+
acquisition, no quantization conversion, no memory cliff. The only
200+
unknown in v1 is whether ADR 0001's alignment recipe actually works.
201+
- **v2 has a clear, bounded scope.** When v1 ships, v2 is mechanical:
202+
add the `--verifier-id mlx-community/Qwen3-8B-4bit` flag, regenerate the hidden-state
203+
cache, retrain proposer adapters, ship.
204+
- **The 60 % memory rule generalizes.** Future verifier candidates can
205+
be evaluated with a one-line calculation; we are not redesigning
206+
memory budgets per model.
207+
- **Alignment pipeline reuse.** The `training/repr_align/` package
208+
built for v1 will run unchanged for v2. The verifier-decoupling claim
209+
of ADR 0001 §2.3 gets exercised in production rather than just on
210+
paper.
211+
212+
### 4.2 Negative / accepted trade-offs
213+
214+
- **v1 is a "stepping stone" verifier.** 1.7 B is below the size where
215+
the project's KV-cache-savings story becomes economically interesting
216+
(KV/token at 1.7 B is small enough that the proposer's weight
217+
amortization breakeven sits at an uncomfortably large batch × sequence-length product, B × S). This is
218+
acknowledged: v1's purpose is recipe validation, not user-facing
219+
value.
220+
- **Quantization noise interacts with alignment.** When v2's verifier
221+
is 4-bit, the alignment recipe trains against quantized hidden
222+
states, which are very slightly different from bf16 hidden states.
223+
Acceptance may be 1–3 percentage points lower than equivalent bf16
224+
alignment. We accept this; the absolute target (α ≥ 0.50 for v2 at
225+
K=2) is set with that haircut already factored in.
226+
- **Closed-weight models are out of scope.** GPT-4 / Claude / Gemini
227+
cannot be served by this engine in its primary mode. Section 5
228+
explains why and what the lossy fallback would look like, but the
229+
fallback is not a v1/v2/v3 commitment.
230+
231+
### 4.3 Implications for current and future code
232+
233+
- **`scripts/setup_*.sh`**: `download_models` becomes parameterized
234+
over a model list rather than hard-coding `Qwen3-1.7B`. v1's list
235+
remains `[Qwen3-0.6B-mdlm, Qwen3-1.7B]`; v2's list adds
236+
`mlx-community/Qwen3-8B-4bit`.
237+
- **`inference_engine/backends/mlx/verifier.py`**: gains a `--verifier-id`
238+
CLI flag and propagates it through `MLXSinkWindowVerifier(config)`.
239+
- **`scripts/run_platform_tests.sh`**: HF cache pre-flight check
240+
becomes verifier-id-aware (currently hard-coded to `Qwen3-1.7B`).
241+
- **`training/repr_align/`** (introduced for ADR 0001): its
242+
hidden-state cache directory is keyed by verifier id; v1 and v2
243+
produce non-overlapping caches that can coexist on disk.
244+
- **Future ADR 0003** records per-verifier K values and tree-spec
245+
configuration, which depend on measured acceptance from this ADR's
246+
v1 / v2 deliveries.
247+
- **Future ADR 0004** records remote/cloud-deployed verifier pattern
248+
(proposer local, verifier in the data center). That is where
249+
Qwen3-32B / DeepSeek-V2.5 / GPT-OSS-120B class models become
250+
in-scope.
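Two of the implications above, the `--verifier-id` flag and a hidden-state cache keyed by verifier id, could look roughly like the sketch below. Everything beyond the flag name is an assumption made for illustration: `cache_dir_for`, the `hidden_state_cache` root directory, and the `--` separator used to flatten the id are hypothetical and not taken from the repository.

```python
import argparse
from pathlib import Path
from typing import Optional, Sequence

DEFAULT_VERIFIER_ID = "Qwen/Qwen3-1.7B"  # the v1 verifier


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse the verifier id from the command line (v1 is the default)."""
    parser = argparse.ArgumentParser(description="Verifier selection sketch")
    parser.add_argument("--verifier-id", default=DEFAULT_VERIFIER_ID)
    return parser.parse_args(argv)


def cache_dir_for(verifier_id: str, root: Path = Path("hidden_state_cache")) -> Path:
    """Map a verifier id to its own cache directory.

    The "/" in a Hugging Face style id is replaced so each verifier gets
    exactly one directory; v1 and v2 caches therefore never overlap.
    """
    return root / verifier_id.replace("/", "--")


if __name__ == "__main__":
    args = parse_args()
    print(cache_dir_for(args.verifier_id))
```

Keying the directory on the full id (organization plus model name) keeps a bf16 and a 4-bit build of the same model apart as long as they are published under different ids.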
251+
252+
## 5. Closed-Weight Verifiers — Why and What If
253+
254+
A recurring question is whether EAGLE-3-style alignment can be applied
255+
to commercial API-only models (GPT-4, Claude, Gemini, Qwen-Max). The
256+
honest answer has three parts.
257+
258+
### 5.1 What EAGLE-3 demands and which APIs supply it
259+
260+
The table lists five verifier signals. EAGLE-3 alignment requires the first three; the last two only support the degraded paths in §5.2:
261+
262+
| Signal                            | Required for                      | Available from API? |
| --------------------------------- | --------------------------------- | ------------------- |
| Embedding weights                 | Shared `embed_tokens` in proposer | **No (any API)** |
| LM head weights                   | Shared `lm_head` in proposer      | **No (any API)** |
| Last-layer hidden state per token | Hidden-state alignment loss       | **No (any API)** |
| Per-token top-K log-probs         | Logits-distill auxiliary loss     | OpenAI (top-20), Anthropic (no), Gemini (top-5 in some endpoints), Qwen-Max (no) |
| Sampled token sequence            | On-policy token-level supervision | **Yes (all APIs)** |
269+
270+
The first three rows are the core of EAGLE-3 and are uniformly
271+
unavailable from closed APIs. This is not an oversight by API providers:
272+
exposing hidden states would leak the model's internal representation
273+
in a way that damages competitive moats and aids extraction attacks. It
274+
is highly unlikely to change for frontier models.
275+
276+
### 5.2 What's still possible — degraded paths
277+
278+
Three alternatives, all weaker than full EAGLE-3:
279+
280+
1. **Logits-only distillation** (when API exposes top-K log-probs,
281+
e.g. OpenAI). Loss reduces to KL between proposer's full
282+
distribution and the API's top-K. Empirically observed acceptance
283+
ceiling: ~0.45–0.55 (vs 0.70–0.80 for full EAGLE-3). This is the
284+
regime that early speculative-decoding papers (the original DeepMind
285+
work) operated in before EAGLE introduced hidden-state alignment.
286+
2. **Sequence-level behavioral cloning** (when API exposes only
287+
sampled tokens). Loss is standard next-token cross-entropy on
288+
verifier-generated sequences. Empirically observed acceptance
289+
ceiling: ~0.30–0.40. This is essentially "train a small model to
290+
imitate the API's output style"; it does not exploit the verifier's
291+
probability distribution at all.
292+
3. **Hybrid with a local proxy** — train alignment against a
293+
*similar-but-open* verifier (e.g. align against Qwen3-72B-Instruct
294+
weights, deploy with Qwen-Max API). Produces a proposer aligned to
295+
the wrong target; transfer quality depends on how close the open
296+
proxy is to the closed model. Empirically: 5–15 percentage points
297+
below what pure EAGLE-3 achieves when the open proxy itself is the verifier.
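To make alternative 1 concrete, here is one possible formulation of a top-K distillation loss for a single token position. It is a sketch, not the project's recipe: the teacher's top-K probabilities are renormalized to sum to one and compared against the proposer's log-probabilities for the same tokens. The function name, the dict-based inputs, and the assumption that the API and the proposer share token strings are all hypothetical.

```python
import math
from typing import Dict


def topk_distill_loss(
    api_topk_logprobs: Dict[str, float],
    proposer_logprobs: Dict[str, float],
) -> float:
    """Truncated KL(teacher_topk || proposer) for one token position.

    `api_topk_logprobs` holds the top-K log-probs returned by the API;
    `proposer_logprobs` holds the proposer's log-probs and must contain
    every top-K token (shared vocabulary is assumed here).
    """
    # Renormalize the truncated teacher distribution over its top-K tokens.
    teacher = {tok: math.exp(lp) for tok, lp in api_topk_logprobs.items()}
    total = sum(teacher.values())
    loss = 0.0
    for tok, mass in teacher.items():
        p = mass / total
        if p > 0.0:  # skip tokens whose probability underflowed to zero
            loss += p * (math.log(p) - proposer_logprobs[tok])
    return loss


if __name__ == "__main__":
    teacher = {"a": math.log(0.5), "b": math.log(0.5)}
    student = {"a": math.log(0.25), "b": math.log(0.75)}
    print(topk_distill_loss(teacher, student))
```

Because the teacher's probability mass outside the top-K is invisible, the proposer receives no signal about the tail of the distribution, which is one intuition for the lower acceptance ceiling quoted above.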
298+
299+
### 5.3 Why none of these are v1/v2 ship targets
300+
301+
- The project's primary value proposition (KV-cache replacement, local
302+
memory savings) requires running the verifier locally. A closed API
303+
verifier is run remotely; the "memory savings" become "the user pays
304+
per token". Different product, different ADR.
305+
- Acceptance ceilings of 0.30–0.55 collapse the speculative speedup to
306+
1.3–1.8×, which does not justify the engineering complexity.
307+
- Closed APIs charge per token of *both* prompt and completion. The
308+
proposer's per-step verifier call contains ~K candidate tokens that
309+
may be rejected; rejected tokens still cost money. The economic
310+
break-even moves against speculative decoding in this setting.
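The 1.3–1.8× range in the second bullet is consistent with the standard speculative-decoding estimate: if each drafted token is accepted independently with probability α and K tokens are drafted per step, the expected number of tokens emitted per verifier call is (1 − α^(K+1)) / (1 − α). The ADR does not say how its range was derived, so the sketch below is only a plausibility check. It borrows K=2 from the validation targets and ignores the proposer's own cost, so it is an upper bound on wall-clock speedup.

```python
def expected_tokens_per_verifier_call(alpha: float, k: int) -> float:
    """Expected tokens emitted per verifier call.

    Assumes i.i.d. per-token acceptance probability `alpha` and `k`
    drafted tokens per step; the verifier always contributes one token,
    so the result lies between 1 and k + 1.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(k + 1)  # every draft accepted, plus the bonus token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    for alpha in (0.30, 0.40, 0.55, 0.75):
        tokens = expected_tokens_per_verifier_call(alpha, k=2)
        print(f"alpha={alpha:.2f}, K=2 -> {tokens:.2f} tokens per call")
```

At K=2 the formula gives about 1.39 tokens per call for α = 0.30 and about 1.85 for α = 0.55, close to the range quoted above.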
311+
312+
### 5.4 What we *will* support if the closed-API mode becomes a goal
313+
314+
If the project later decides to address closed APIs (recorded in a
315+
future ADR), the path is:
316+
317+
- A `RemoteVerifier` adapter exposing the same interface as
318+
`MLXSinkWindowVerifier` / `SinkWindowVerifier`.
319+
- A degraded `training/logit_distill_remote/` package implementing
320+
alternative #1 above when the target API exposes top-K log-probs.
321+
- A separate evaluation harness that reports acceptance, throughput,
322+
*and* token cost per generated user-visible token, because the
323+
third axis is what dominates the closed-API economics.
324+
325+
This is a non-trivial body of work and explicitly out of scope for v1
326+
and v2. Pursuing it without first finishing v2 would distract from
327+
proving the core alignment recipe works.
328+
329+
## 6. Validation
330+
331+
This ADR is considered validated when:
332+
333+
1. **v1 validation**: ADR 0001 §6 conditions are met against
334+
`Qwen/Qwen3-1.7B` (α ≥ 0.40 at K=2), confirming the bf16 path
335+
functions end-to-end.
336+
2. **v2 validation**: with no changes to proposer architecture,
337+
training scripts, or serving code, the alignment pipeline produces
338+
a proposer for `mlx-community/Qwen3-8B-4bit` that achieves α ≥ 0.50
339+
at K=2 on the held-out evaluation set, and the engine generates the
reference 200-token Chinese response within the 30 s perceived-latency
341+
target on the M4 reference machine.
342+
3. **Memory-rule validation**: the 60 % memory threshold from §2.2
343+
correctly predicts fit/no-fit on at least three independent target
344+
machines (24 GB Mac, 32 GB Mac, 24 GB RTX-class GPU) without
345+
per-machine tuning.
346+
347+
If item 2 fails on perceived latency but passes on acceptance, the ADR
348+
is partially superseded by an engine-side optimization ADR. If item 2
349+
fails on acceptance, ADR 0001's recipe is what needs revision, and
350+
this ADR's v2 commitment is paused until that revision lands.
351+
352+
## 7. References
353+
354+
- ADR 0001 — Proposer sizing, alignment, verifier decoupling
355+
(this ADR is the verifier counterpart of that proposer-side decision).
356+
- `docs/local-inference-engine.md` — describes the serving stack that
357+
consumes the verifiers selected here.
358+
- MLX community: `mlx-community/Qwen3-8B-4bit` for v2.
359+
- Qwen3 4-bit perplexity studies (community-published) supporting the
360+
4-bit-at-8B decision.
361+
- Original DeepMind speculative decoding paper for the logits-only
362+
alignment regime referenced in §5.2.

docs/adr/README.md

Lines changed: 1 addition & 0 deletions
@@ -33,3 +33,4 @@ reader what was *not* chosen.
3333
| # | Title | Status |
3434
| ---- | --------------------------------------------------------------- | -------- |
3535
| 0001 | [Proposer sizing, alignment, and verifier decoupling](0001-proposer-sizing-and-alignment.md) | Accepted |
36+
| 0002 | [Verifier selection, quantization, and the open-vs-closed-weight constraint](0002-verifier-selection-and-quantization.md) | Accepted |
