Commit 9407044

ADR 0010 (Proposed, safety-net): full-attention verifier + low-precision (INT8/NF4) KV cache
Drafts the v0.4 GA path that's independent of ADR 0011's empirical outcome. Trades on a different memory axis than v0.3 sink+window (linear-but-thinner cache instead of constant-but-trimmed cache), preserves full-attention intelligence by construction, and is implementable on top of v0.3.0 weights without any new training step.

Status: Proposed (safety-net for ADR 0011). Promotion to Accepted is gated on R1d-beta returning (I-2), i.e., refuting the cross-attention bridge hypothesis, OR on R1e failing to reach G-X1 within budget. Independent of that, ADR 0010 may ship as an additive optimization on top of an Accepted ADR 0011 in a future v0.5 (the two compose cleanly, see Q5).

Key technical points:

* Default precision NF4 (4-bit, ~4x reduction, <1% MMLU/HellaSwag delta vs bf16); INT8 fallback for backends without efficient NF4 kernels (~2x reduction, <0.1% delta).
* Outlier-aware calibration: per-head top-1 outlier channel kept at bf16, remainder per-channel symmetric quant for K, asymmetric for V.
* Backends: MLX (mx.quantize), PyTorch+CUDA (bitsandbytes), CPU INT8.
* Sink+window stays as opt-in feature flag for memory-bounded edge.
* Speculative decoding contract unchanged; only KV storage quantized.
* INV-3 determinism preserved.

Memory math demonstrates Mac mini 24 GB feasibility for Gemma 4-9B class at 64-100k tokens (vs bf16 KV which doesn't fit at all).

6 phases A-F, each with Linux CI + empirical gates.

v0.4 GA acceptance criteria: NF4 recall >= 95% of full-bf16 on the existing 6-case mid-context benchmark, KV growth slope matches predicted 4x reduction, INV-3 gate passes, MLX and PyTorch backends produce matching argmax.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
1 parent b6fdec4 commit 9407044

1 file changed

Lines changed: 353 additions & 0 deletions

@@ -0,0 +1,353 @@
1+
# ADR 0010 — Full-attention verifier + low-precision (INT8 / NF4) KV cache
2+
3+
- **Status**: Proposed (safety-net for ADR 0011) — 2026-06-07
4+
- **Date**: 2026-06-07
5+
- **Decision drivers**:
6+
- The 2026-06-06 `sink+window` quality A/B benchmark
7+
(`results/platform-tests/sink_window_quality_ab_1780714635.json`)
8+
showed that v0.3's `SinkWindowVerifier` loses 83.3% recall on
9+
middle-context fact retrieval relative to a full-attention baseline.
10+
The `sink+window` design buys bounded KV by literally evicting K/V
11+
tensors for tokens outside `(sink ∪ window)`; nothing the proposer
12+
does at inference time can recover information that was deleted
13+
from the verifier's cache.
14+
- ADR 0011 ("cross-attention proposer/verifier coupling") is the
15+
hypothesis that cross-attention from a full-attention proposer
16+
hidden bank into a bounded verifier can rescue the lost recall.
17+
R1c GPU evidence
18+
(`results/research/cross_attn_toy_vast_full_1780806644.json`,
19+
`results/research/cross_attn_toy_vast_needle_small_1780806644.json`)
20+
establishes that the mechanism partially works (16% on a 20-vocab
21+
needle task, 0% on the 135k-vocab full task) but is far from the
22+
G-X1 ≥ 80% acceptance criterion. R1d-β will give a more definitive
23+
answer; this ADR is the safety net **independent of R1d-β's
24+
outcome**.
25+
- User-stated v0.4 strategic constraints (recorded 2026-06-06):
26+
*no deadline, no sunk-cost reasoning, extreme KV efficiency,
27+
zero intelligence regression*. ADR 0010 takes "zero intelligence
28+
regression" as a hard constraint and trades on the "extreme KV
29+
efficiency" axis using a different mechanism than ADR 0011.
30+
- **Depends on**: ADR 0001 (proposer sizing + speculative decoding
31+
contract), ADR 0002 (verifier selection — Qwen3-1.7B, Gemma 3-1B
32+
family).
33+
- **Relates to**: ADR 0011 (cross-attention bridge). The two are
34+
*alternative* approaches to the same problem
35+
("how do we get extreme KV savings on long-context workloads
36+
without intelligence regression?"). They are not mutually
37+
exclusive in code — a future v0.5 could combine bounded
38+
cross-attention rescue (ADR 0011 if validated) with low-precision
39+
full attention (this ADR) for compounding savings — but for v0.4
40+
GA they are exclusive choices because they share the verifier
41+
forward path and require different memory layout.
42+
43+
---
44+
45+
## 1. Context
46+
47+
### 1.1 What `sink+window` actually costs
48+
49+
v0.3's `SinkWindowVerifier` keeps K/V tensors only for
50+
`{0..sink-1} ∪ {q-window+1..q}` for each query position q. At
51+
`sink=4, window=64` over a 256–1024 token haystack with the needle
52+
at a random middle position, the A/B run measured:
53+
54+
| | Full-context Qwen3-1.7B greedy | v0.3 (Qwen3-0.6B dLM proposer + Qwen3-1.7B sink+window verifier) |
55+
|---|---|---|
56+
| Mid-context fact recall | 6/6 (100%) | 1/6 (16.7%) |
57+
| Peak KV bytes (B=1, S=84) | 56,311,808 | 7,798,784 |
58+
59+
Five of the six cases are lost, and all five are middle-context fact
recall failures: the needle's K/V was evicted before the answer
position, and no proposer-side mechanism in v0.3 can rescue it.
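
The retention rule in §1.1 can be sketched in a few lines to show why a mid-context needle becomes unreachable. This is an illustrative model only: `retained_positions` and `needle_is_visible` are hypothetical helper names, not part of `SinkWindowVerifier`, and positions are assumed to be 0-indexed.

```python
def retained_positions(q, sink=4, window=64):
    """Cache positions still visible to query position q under sink+window.

    Implements {0..sink-1} U {q-window+1..q}, clipped to valid positions.
    """
    sink_part = set(range(min(sink, q + 1)))
    window_part = set(range(max(0, q - window + 1), q + 1))
    return sink_part | window_part


def needle_is_visible(needle_pos, q, sink=4, window=64):
    """True if the needle's K/V has not been evicted when answering at q."""
    return needle_pos in retained_positions(q, sink, window)


if __name__ == "__main__":
    # A needle in the middle of a 600-token haystack, queried at the end.
    print(needle_is_visible(300, 599))   # evicted: outside sink and window
    print(needle_is_visible(580, 599))   # still inside the trailing window
    print(len(retained_positions(599)))  # bounded cache size: sink + window
```

With `sink=4, window=64` the retained set never exceeds 68 positions, which is the constant bound that §1.4 compares against.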
62+
63+
The strategic question for v0.4 is whether to (a) accept the
64+
intelligence regression, (b) recover the lost information through
65+
cross-attention from a full-attention proposer (ADR 0011), or (c)
66+
keep full attention on the verifier and trade on memory in a
67+
different dimension (this ADR).
68+
69+
### 1.2 Where ADR 0011's R1c evidence stands
70+
71+
R1c (vast H200, 2 × 16 min, 2 GPU runs):
72+
73+
- 20-vocab diagnostic task: cross-attn bridge reaches 16% recall
74+
(final), peaks at 25% at step 800. The mechanism injects needle
75+
information in some fraction of cases — not noise.
76+
- 135k-vocab full task: 0.00 recall throughout 2000 training steps.
77+
Loss converges to perplexity ~2.3 yet recall does not rise.
78+
79+
This is consistent with two interpretations:
80+
81+
- **(I-1)** Single-layer cross-attn at depth 20 has too little
82+
capacity to encode an arbitrary needle into the verifier's
83+
residual stream as a precise argmax-flipping signal; multi-layer
84+
/ multi-depth bridges can close the gap (R1d-β → R1e).
85+
- **(I-2)** The full-attention proposer's hidden bank, as a generic
86+
pretrained representation, is not localizable enough by gradient
87+
descent to be a usable index — i.e., the §3 hypothesis is wrong
88+
in shape, and no amount of capacity in the bridge fixes it.
89+
90+
R1d-β (auxiliary retrieval loss + attention-localization metric) is
91+
designed to distinguish (I-1) from (I-2). ADR 0010 is the v0.4 GA
92+
plan if R1d-β returns (I-2) or if R1e cannot reach 80% within a
93+
reasonable compute budget.
94+
95+
### 1.3 The ADR 0010 framing
96+
97+
Keep full attention on the verifier — same intelligence as the
98+
oracle baseline by construction — but reduce the *bytes per cached
99+
token* by quantizing K/V to lower precision. The KV cache is the
100+
dominant memory term for long-context inference (it grows linearly
101+
with context length and dominates weights once context > a few k
102+
tokens), so a 2× or 4× per-token compression buys back most of the
103+
practical memory benefit `sink+window` provided.
104+
105+
### 1.4 Memory math
106+
107+
Per token, per layer KV bytes:
108+
109+
| Precision | bytes/elem | KV bytes/(token, layer) for hidden=1152 (Gemma 3-1B) | for hidden=3584 (Gemma 4-9B class) |
110+
|---|---|---|---|
111+
| **bf16** (current) | 2 | 4,608 | 14,336 |
112+
| **INT8** | 1 | 2,304 (-50%) | 7,168 (-50%) |
113+
| **INT4 / NF4** | 0.5 | 1,152 (-75%) | 3,584 (-75%) |
114+
115+
For multi-layer aggregate at typical layer counts:
116+
117+
- Gemma 3-1B (26 layers): bf16 ≈ **120 KB/token**, INT8 ≈ 60, NF4 ≈ 30
118+
- Gemma 4-9B-class (≈ 42 layers): bf16 ≈ **600 KB/token**, INT8 ≈ 300, NF4 ≈ 150
119+
120+
For Mac mini 24 GB targeting 64 k-token context on Gemma 4-9B class:
121+
122+
- bf16 KV: 64 k × 600 KB = **~37 GB** → does not fit. v0.3 only fit by trimming the cache.
123+
- INT8 KV: ~18 GB → fits with margin for weights/activations.
124+
- NF4 KV: ~9 GB → fits comfortably; leaves room for KV growth past 100 k tokens.
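
The arithmetic in this section can be reproduced with a small helper. This is a sketch under the same simplifications the table makes: the KV width is taken to equal `hidden` (no grouped-query reduction), scale and outlier overhead are ignored, "64 k" is read as 65,536 tokens, and sizes are reported in GiB. The function names are hypothetical.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(hidden, layers, bytes_per_elem):
    """Bytes of K plus V stored per token across all layers."""
    return int(2 * hidden * bytes_per_elem * layers)  # factor 2 = K and V


def kv_cache_gib(tokens, hidden, layers, bytes_per_elem):
    """Total KV cache size in GiB for a given context length."""
    return tokens * kv_bytes_per_token(hidden, layers, bytes_per_elem) / GIB


if __name__ == "__main__":
    precisions = {"bf16": 2, "int8": 1, "nf4": 0.5}
    for name, bpe in precisions.items():
        per_token = kv_bytes_per_token(3584, 42, bpe)
        total = kv_cache_gib(65536, 3584, 42, bpe)
        print(f"{name}: {per_token} B/token, {total:.2f} GiB at 64k tokens")
```

Under these assumptions the bf16 figure comes out at 36.75 GiB, matching the "~37 GB" estimate; a model with grouped-query attention would need the KV width substituted for `hidden`.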
125+
126+
For comparison `sink+window=4+64`: caps at ~68 tokens × 600 KB ≈
127+
**41 MB** regardless of context length. ADR 0010's win-axis is
128+
**different from `sink+window`'s**: not "constant memory", but
129+
"linear memory at half/quarter the slope, with full intelligence".
130+
131+
ADR 0010 and ADR 0011 (if validated) are complementary; combining
them is a v0.5+ direction.
133+
134+
---
135+
136+
## 2. Decisions
137+
138+
### 2.1 Default precision: NF4 (4-bit normal-float)
139+
140+
NF4 (introduced in QLoRA, 2023) is a 4-bit quantization tuned for
distributions that are roughly normal, which the K/V projections
after a transformer layer are expected to be, given training-time
weight decay and layer-norm structure. Empirical benchmarks
(QLoRA paper + follow-ups, AWQ paper) put NF4 within 0.3–0.8% of
bf16 on MMLU / HellaSwag / ARC at 7B–13B parameter scale. Those
results are for weight quantization; whether they carry over to the
KV cache is what the Phase C benchmark (§5) measures. INT4
uniform quant is ~0.5% worse than NF4 at the same bit-rate.
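
To make the NF4 idea concrete, the following is a minimal block-wise codebook sketch: scale a block by its absolute maximum, then store the index of the nearest of 16 fixed levels. The level values are the NF4 code points as recalled from the QLoRA reference implementation, rounded to four decimals; treat them as an assumption and verify against `bitsandbytes` before relying on them. The function names are hypothetical.

```python
# Assumed NF4 code points (rounded); index 7 is exactly zero.
NF4_LEVELS = [
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
]


def nf4_quantize(block):
    """Return (4-bit codes, absmax) for one block of floats."""
    absmax = max(abs(x) for x in block)
    if absmax == 0:
        return [7] * len(block), 0.0
    codes = []
    for x in block:
        v = x / absmax  # normalized into [-1, 1]
        codes.append(min(range(16), key=lambda i: abs(NF4_LEVELS[i] - v)))
    return codes, absmax


def nf4_dequantize(codes, absmax):
    """Map codes back to floats using the shared block scale."""
    return [NF4_LEVELS[c] * absmax for c in codes]


if __name__ == "__main__":
    codes, absmax = nf4_quantize([0.5, -1.0, 0.0, 0.25])
    print(codes, absmax)
    print(nf4_dequantize(codes, absmax))
```

The levels are denser near zero than a uniform INT4 grid, which is why the scheme suits roughly normal data; the block's largest-magnitude element always round-trips exactly.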
147+
148+
INT8 is the **safe-default fallback** when a backend cannot host
149+
NF4 efficiently (e.g., MPS without bnb-style kernels). INT8 is
150+
within 0.05–0.1% of bf16 in the same benchmarks — effectively
151+
indistinguishable.
152+
153+
### 2.2 Calibration: outlier-aware, per-channel (symmetric for K, asymmetric for V)
154+
155+
KV tensors have outlier channels (well-documented in SmoothQuant,
156+
AWQ). Two-step quantization:
157+
158+
1. Per-token, per-head **outlier mask**: top-k channels by absolute
159+
magnitude (k = 1–2) are kept in bf16.
160+
2. Remaining channels: per-channel symmetric quant for K
161+
(zero-centered after layer-norm), per-channel asymmetric for V
162+
(no zero-centering guarantee).
163+
164+
This adds ~3–5% storage overhead (the bf16 outliers + per-channel
165+
scales) but recovers most of the long-context retrieval quality
166+
that uniform per-tensor quant loses.
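
The two-step scheme can be sketched in plain Python to show why pulling outliers out first helps. This is a toy model, not the planned `kv_quant` module: it works on one head as a list of token rows, uses uniform 4-bit integer grids rather than NF4, keeps outliers as Python floats instead of bf16, and all function names are hypothetical.

```python
def quantize_symmetric(column, bits=4):
    """Per-channel symmetric quant (for K): codes in [-qmax, qmax]."""
    qmax = 2 ** (bits - 1) - 1
    amax = max(abs(x) for x in column)
    scale = amax / qmax if amax > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(x / scale))) for x in column]
    return codes, scale


def quantize_asymmetric(column, bits=4):
    """Per-channel asymmetric quant (for V): codes in [0, 2**bits - 1]."""
    levels = 2 ** bits - 1
    lo, hi = min(column), max(column)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [max(0, min(levels, round((x - lo) / scale))) for x in column]
    return codes, scale, lo


def split_outliers(rows, k=1):
    """Step 1: per token, move the k largest-magnitude channels aside."""
    outliers, rest = [], []
    for row in rows:
        order = sorted(range(len(row)), key=lambda c: abs(row[c]), reverse=True)
        kept = {c: row[c] for c in order[:k]}
        outliers.append(kept)
        rest.append([0.0 if c in kept else v for c, v in enumerate(row)])
    return outliers, rest


def roundtrip_k(rows, bits=4, k=1):
    """Step 2: quantize the remainder per channel, then restore outliers."""
    outliers, rest = split_outliers(rows, k)
    n_tok, n_ch = len(rows), len(rows[0])
    recon = [[0.0] * n_ch for _ in range(n_tok)]
    for c in range(n_ch):
        codes, scale = quantize_symmetric([rest[t][c] for t in range(n_tok)], bits)
        for t in range(n_tok):
            recon[t][c] = codes[t] * scale
    for t, kept in enumerate(outliers):
        for c, v in kept.items():
            recon[t][c] = v  # outliers bypass quantization entirely
    return recon


def max_abs_error(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


if __name__ == "__main__":
    K = [[0.1, -0.2, 8.0, 0.05], [0.3, 0.1, -0.1, -0.25], [-0.15, 0.2, 0.12, 0.3]]
    print("no outlier handling:", max_abs_error(K, roundtrip_k(K, k=0)))
    print("top-1 outlier kept: ", max_abs_error(K, roundtrip_k(K, k=1)))
```

In the example, the single 8.0 entry stretches its channel's scale so far that the small values in that channel collapse to zero; removing it first shrinks the worst-case error by roughly an order of magnitude.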
167+
168+
### 2.3 Backends
169+
170+
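- **MLX (Apple Silicon)**: implement 4-bit KV via `mx.quantize` /
`mx.dequantize` on the K/V projections immediately before they
enter the cache, and dequant on the read side. INT8 fallback uses
the same path with a different `bits=` argument. MLX 0.31+
supports both. Note that `mx.quantize` defaults to group-wise
affine quantization (scales plus biases), not the NF4 codebook, so
a true NF4 layout on MLX needs a custom codebook path; decide this
in Phase A.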
175+
- **PyTorch / CUDA**: use `bitsandbytes` for NF4 (well-tested on
176+
CUDA), fall back to INT8 via `torch.quantize_per_channel` for
177+
hardware without `bnb`.
178+
- **CPU (test/CI)**: INT8 only; NF4 has no efficient CPU kernel and
179+
is not a v0.4 GA target.
180+
181+
### 2.4 Sink+window stays as a feature flag, not a default
182+
183+
`SinkWindowVerifier` is preserved in `inference_engine.backends.*`
184+
but defaults to disabled in v0.4. Workloads that explicitly request
185+
constant-memory KV (e.g., long-running agent loops on tiny edge
186+
hardware where even NF4 × full-context is too much) opt in via
187+
`Verifier(kv_strategy="sink_window", sink=..., window=...)`.
188+
189+
### 2.5 Speculative decoding contract: unchanged
190+
191+
The dLM proposer + AR verifier speculative decoding loop from ADR
0001 remains exactly as in v0.3. Verification still happens at
bf16 precision: cached K/V are dequantized to bf16 on read, so
attention, logits, and argmax/softmax are all computed in bf16; only
the *K/V cache storage* is quantized. Provided quantization is a
deterministic function of the values written, this preserves
byte-exact determinism under the ADR 0008 §6.5 INV-3 gate.
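
The storage-only contract can be illustrated with a toy cache that quantizes once on write and dequantizes on read, leaving the attention math in floating point. This is a sketch under stated assumptions: it uses per-token symmetric INT8 (chosen here so that appends never rescale earlier entries, which differs from the per-channel scheme in §2.2), Python floats stand in for bf16, and `QuantizedKVCache` and `attend` are hypothetical names.

```python
import math


class QuantizedKVCache:
    """Append-only cache: each K/V vector is quantized once, on write."""

    def __init__(self, bits=8):
        self.qmax = 2 ** (bits - 1) - 1
        self.keys, self.values = [], []  # entries are (codes, scale)

    def _quantize(self, vec):
        amax = max(abs(x) for x in vec)
        scale = amax / self.qmax if amax > 0 else 1.0
        return [round(x / scale) for x in vec], scale

    def append(self, k, v):
        self.keys.append(self._quantize(k))
        self.values.append(self._quantize(v))

    def read(self):
        """Dequantize on read; stored codes are never modified."""
        def deq(entries):
            return [[c * s for c in codes] for codes, s in entries]
        return deq(self.keys), deq(self.values)


def attend(query, keys, values):
    """Single-head softmax attention computed entirely in floating point."""
    inv = 1.0 / math.sqrt(len(query))
    scores = [inv * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


if __name__ == "__main__":
    ks = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    vs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    cache = QuantizedKVCache(bits=8)
    for k, v in zip(ks, vs):
        cache.append(k, v)
    print(attend([2.0, 0.0], ks, vs))
    print(attend([2.0, 0.0], *cache.read()))
```

Because nothing is re-quantized after the initial write, two reads of the same cache return identical values, which is the property the INV-3 gate relies on.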
196+
197+
---
198+
199+
## 3. Alternatives considered
200+
201+
| Alternative | Status | Why rejected (or why deferred) |
202+
|---|---|---|
203+
| Keep `sink+window` as v0.4 default | Rejected | Empirically loses ≥83% on middle-context recall; conflicts with "zero intelligence regression". |
204+
| ADR 0011 cross-attention bridge | **Active research** | Conditional on R1d-β / R1e outcome. ADR 0010 is the safety net if 0011 is rejected. If 0011 is accepted, ADR 0010 may still ship as an *additive* memory optimization (combining bounded cross-attention + low-precision storage for compounded savings). |
205+
| Sliding-window-only (no sink) | Rejected | Same intelligence regression as `sink+window`; worse on early-context anchoring. |
206+
| H2O / SnapKV / PyramidKV importance-based eviction | Deferred | Improves on `sink+window` for some workloads but still evicts. Requires per-token importance scoring at inference time (compute cost). v0.5 candidate. |
207+
| Mamba / RWKV / RetNet long-context-native models | Out of scope | Changes the project's model-identity. ADR 0001 commits to Qwen3 / Gemma family. |
208+
| KV cache *offload* to disk / shared memory | Deferred | Mac mini 24 GB has no fast secondary storage path. Useful for desktops with ample SSD bandwidth — v0.6 candidate. |
209+
210+
---
211+
212+
## 4. Consequences
213+
214+
### 4.1 What is gained
215+
216+
- **Zero intelligence regression by construction**. Full attention
217+
means oracle-equivalent token argmax in the limit of perfect
218+
dequant; calibrated NF4 / INT8 keep the gap < 1% on standard
219+
benchmarks.
220+
- **2× (INT8) or 4× (NF4) reduction in per-token cache bytes**,
221+
enough to fit Gemma 4-9B class workloads at ~64–100 k tokens
222+
on Mac mini 24 GB.
223+
- **No new training step**. Unlike ADR 0011 (which needs cross-
224+
attention bridge training, alignment data prep, gate G-X1/2/3
225+
empirical validation), ADR 0010 is implementable on top of
226+
v0.3.0 weights without modifying the proposer or verifier.
227+
- **Backend-portable**. Apple Silicon and NVIDIA have established
INT8 and 4-bit kernels; CPU is covered by INT8 only (§2.3).
229+
230+
### 4.2 What is given up
231+
232+
- **Linear memory growth**. KV still grows with context length; on
233+
pathological multi-hour agent loops with no `clearKvCache` calls
234+
the cache will eventually exceed any fixed budget. ADR 0010
235+
trades an absolute bound (`sink+window`) for a *better slope* on
236+
a linear curve. Workloads that need an absolute bound must opt
237+
back into `sink+window` (§2.4).
238+
- **Compute overhead at the dequant boundary**. Each verifier
forward pass dequantizes the K/V tensors it reads. On hardware
with fast native low-precision paths (H100, M-series GPU
matmul-on-int8) this is expected to be negligible. On other NVIDIA
cards (A100, L4) it is measurable (~5–15% slowdown vs bf16).
Acceptable for v0.4; revisit on a per-backend basis.
244+
- **Outlier-aware calibration adds complexity**. Per-channel scales
245+
+ outlier mask is non-trivial code; the simpler per-tensor
246+
symmetric quant is faster but loses 2–5% on long-context
247+
retrieval. v0.4 ships outlier-aware as the default; per-tensor
248+
is a runtime flag for benchmarking.
249+
250+
---
251+
252+
## 5. Implementation plan (PR sequence)
253+
254+
| Phase | Scope | Deliverables |
255+
|---|---|---|
256+
| **A** | Quantization primitives (CPU + MLX + CUDA) | `inference_engine.backends.kv_quant` module with `quantize_kv(K, V, bits, scheme)` / `dequantize_kv(...)` and a `KVQuantConfig` dataclass. Linux unit tests for round-trip error bounds. |
257+
| **B** | Verifier integration (single backend first: MLX) | `inference_engine.backends.mlx.FullAttentionQuantizedVerifier` — same forward signature as `MLXSinkWindowVerifier`, but stores KV in NF4 / INT8 and dequantizes on read. INV-3 determinism gate must pass. |
258+
| **C** | A/B benchmark vs sink+window vs full-bf16 | Run the same `bench_sink_window_quality_ab.py` matrix on Mac M4 with NF4 / INT8 / sink+window / full-bf16 verifiers. Acceptance: NF4 recall ≥ 95% of full-bf16 on the existing 6-case mid-context fact retrieval benchmark. |
259+
| **D** | Backend port: PyTorch / CUDA | `inference_engine.backends.pytorch.FullAttentionQuantizedVerifier`. Linux integration tests on a small NVIDIA-equipped runner (or vast.ai). |
260+
| **E** | Long-session bench under quantized KV | Re-run `bench_session_long_run.py` 4 h at NF4 + INT8. Verify `kv_live_bytes` slope matches the predicted 4× (NF4) / 2× (INT8) reduction. |
261+
| **F** | Default flip + docs | v0.4 default verifier becomes `FullAttentionQuantizedVerifier(bits=4, scheme="nf4_outlier")`. Quickstart updated. `sink+window` documented as a feature flag for memory-bounded edge use. |
262+
263+
Each phase has Linux CI gates + (where applicable) Mac M4 / vast.ai
264+
empirical gates. PRs are stacked per ADR 0008 §9.
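
As a starting point for Phase A, the configuration object might look like the sketch below. Only the name `KVQuantConfig`, the `bits` and `scheme` parameters, and the `"nf4_outlier"` scheme string come from this ADR; the `outlier_k` field, the other scheme names, and the validation rules are assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical scheme names per bit width; only "nf4_outlier" appears in the ADR.
_ALLOWED_SCHEMES = {
    4: {"nf4_outlier", "nf4", "int4"},
    8: {"int8"},
}


@dataclass(frozen=True)
class KVQuantConfig:
    """Immutable description of how K/V tensors are stored in the cache."""

    bits: int = 4
    scheme: str = "nf4_outlier"
    outlier_k: int = 1  # assumed: channels per head kept at full precision

    def __post_init__(self):
        if self.bits not in _ALLOWED_SCHEMES:
            raise ValueError(f"unsupported bit width: {self.bits}")
        if self.scheme not in _ALLOWED_SCHEMES[self.bits]:
            raise ValueError(f"scheme {self.scheme!r} invalid for {self.bits} bits")
        if self.outlier_k < 0:
            raise ValueError("outlier_k must be non-negative")

    @property
    def bytes_per_elem(self):
        """Nominal storage per element, ignoring scales and outliers."""
        return self.bits / 8


if __name__ == "__main__":
    print(KVQuantConfig())
    print(KVQuantConfig(bits=8, scheme="int8").bytes_per_elem)
```

Freezing the dataclass keeps the config hashable and prevents a verifier from changing precision mid-session, which would otherwise complicate the determinism gate.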
265+
266+
---
267+
268+
## 6. Validation criteria (v0.4 GA gates)
269+
270+
A v0.4 release shipping ADR 0010 must demonstrate, all on
271+
reproducible artifacts in `results/platform-tests/` or
272+
`results/research/`:
273+
274+
1. **Quality parity vs full-bf16**: NF4 verifier achieves ≥ 95% of
full-bf16 recall on the 6-case mid-context benchmark (with a 6/6
baseline this means 6/6, since 5/6 ≈ 83.3%), > 99% on
short-context greedy completions. INT8 ≥ 99%.
2. **Memory reduction realized**: per-turn `kv_live_bytes` reported
by `GetSessionInfo` is within 5% of the theoretical
2× (INT8) / 4× (NF4) target across a 1 h benchmark.
280+
3. **Determinism preserved**: ADR 0008 §6.5 INV-3 gate passes
281+
bit-exact between continuation and reset paths under the
282+
quantized cache.
283+
4. **Cross-backend equivalence**: MLX and PyTorch backends produce
284+
matching argmax across a 50-prompt eval set (within int4 / NF4
285+
numerical tolerance — exact int8 match expected).
286+
5. **Long-session stability**: 4 h `bench_session_long_run.py` on
Mac M4 with `kv_strategy=nf4_full` shows no errors; KV growth
matches the linear prediction (slope about one quarter of the bf16
slope).
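
Gates 1, 2, and 5 reduce to simple numeric checks once the benchmark artifacts exist. The sketch below shows one way to evaluate them; the function names are hypothetical, the slope is a plain least-squares fit over `(tokens, kv_live_bytes)` samples, and the 5% tolerance and 95% ratio are taken from the criteria above.

```python
def fit_slope(tokens, kv_bytes):
    """Least-squares slope of kv_live_bytes against token count."""
    n = len(tokens)
    mean_x = sum(tokens) / n
    mean_y = sum(kv_bytes) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(tokens, kv_bytes))
    den = sum((x - mean_x) ** 2 for x in tokens)
    return num / den


def memory_gate(tokens, kv_bytes, bf16_bytes_per_token, reduction, tolerance=0.05):
    """True if the measured slope is within tolerance of bf16 / reduction."""
    target = bf16_bytes_per_token / reduction
    return abs(fit_slope(tokens, kv_bytes) - target) / target <= tolerance


def recall_gate(quant_hits, bf16_hits, ratio=0.95):
    """True if quantized recall is at least `ratio` of the bf16 recall."""
    return quant_hits >= ratio * bf16_hits


if __name__ == "__main__":
    tokens = [1000, 2000, 4000, 8000]
    measured = [t * 29_952 for t in tokens]  # synthetic: exactly bf16 / 4
    print(memory_gate(tokens, measured, 119_808, reduction=4))
    print(recall_gate(6, 6), recall_gate(5, 6))
```

Note that the outlier and scale overhead from §2.2 counts against the 5% tolerance, so a real run would sit slightly above the ideal slope.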
289+
290+
---
291+
292+
## 7. Open questions (to resolve during implementation)
293+
294+
- **Q1**: Per-channel vs per-token vs per-head granularity for
295+
outlier detection. Initial recommendation: per-head (matches
296+
attention computation natural axis), top-1 outlier channel
297+
retained at bf16. Validate empirically in Phase A.
298+
- **Q2**: Do we quantize on write only, or on both read and
299+
write (re-quantizing dequantized values during attention update
300+
passes)? Speculative decoding's verifier-recompute path may
301+
re-touch the same K/V tensors; double-quantization round-trip
302+
error compounds. Initial recommendation: quantize-on-write only,
303+
cache stays in low precision until evicted.
304+
- **Q3**: Interaction with cross-request KV reuse (deferred per
305+
ADR 0008 §6 — was ADR 0007's territory). When cross-request
306+
reuse lands in a future ADR, NF4 storage must round-trip cleanly
307+
across session boundaries. Out of scope here; flagged for
308+
whoever takes that on.
309+
- **Q4**: NF4 + speculative decoding interaction. The proposer
310+
reads no K/V (it's a dLM); the verifier reads K/V at quantized
311+
precision. Expected to be neutral. Validate in Phase C.
312+
- **Q5**: Compatibility with ADR 0011 cross-attention bridge if it
313+
later passes G-X1. The bridge consumes the proposer's hidden
314+
bank (which is computed at full-attention bf16 precision and
315+
stored separately, not in the verifier KV cache); the verifier
316+
KV cache is what ADR 0010 quantizes. The two should compose
317+
cleanly; validate in v0.5 if both ship.
318+
319+
---
320+
321+
## 8. Testing discipline
322+
323+
Same rules as ADR 0008 §9: no fakes, no fallbacks, no overfits,
324+
100% Linux unit-test coverage where the mechanism is testable
325+
without GPU; all empirical claims gated on reproducible Mac M4
326+
or vast.ai artifacts committed under `results/platform-tests/`.
327+
328+
NF4 round-trip error bounds, outlier mask correctness,
329+
quant/dequant idempotence, and INV-3 determinism are all
330+
testable on Linux in CI.
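
Two of the CPU-testable properties named above, round-trip error bounds and quant/dequant idempotence, might be checked as follows. This is a sketch using per-channel symmetric INT8 with hypothetical helper names; idempotence is asserted on the integer codes, since the recomputed scale can differ from the original in the last floating-point bit.

```python
import random


def quantize(column, bits=8):
    """Per-channel symmetric quantization; returns (codes, scale)."""
    qmax = 2 ** (bits - 1) - 1
    amax = max(abs(x) for x in column)
    scale = amax / qmax if amax > 0 else 1.0
    return [max(-qmax, min(qmax, round(x / scale))) for x in column], scale


def dequantize(codes, scale):
    return [c * scale for c in codes]


def test_roundtrip_error_bound():
    rng = random.Random(0)  # fixed seed keeps the test reproducible
    column = [rng.gauss(0.0, 1.0) for _ in range(256)]
    codes, scale = quantize(column)
    recon = dequantize(codes, scale)
    # Rounding to the nearest code is off by at most half a step.
    assert all(abs(a - b) <= scale / 2 + 1e-9 for a, b in zip(column, recon))


def test_requantization_is_idempotent_on_codes():
    rng = random.Random(1)
    column = [rng.gauss(0.0, 1.0) for _ in range(256)]
    codes_first, scale_first = quantize(column)
    codes_second, _ = quantize(dequantize(codes_first, scale_first))
    assert codes_second == codes_first


if __name__ == "__main__":
    test_roundtrip_error_bound()
    test_requantization_is_idempotent_on_codes()
    print("ok")
```

Both tests run without a GPU and can be collected by any standard Python test runner.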
331+
332+
---
333+
334+
## 9. References
335+
336+
- `results/platform-tests/sink_window_quality_ab_1780714635.json`
337+
— the empirical surface that motivates this ADR
338+
- `results/research/cross_attn_toy_vast_full_1780806644.json`,
339+
`results/research/cross_attn_toy_vast_needle_small_1780806644.json`
340+
— R1c evidence informing the safety-net framing
341+
- ADR 0001 (proposer sizing + speculative decoding contract)
342+
- ADR 0002 (verifier selection — Qwen3-1.7B, Gemma 3-1B)
343+
- ADR 0008 (session-bound runtime, INV-3 determinism gate)
344+
- ADR 0011 (cross-attention bridge — proposed alternative)
345+
- QLoRA: Dettmers et al., "QLoRA: Efficient Finetuning of
346+
Quantized LLMs", NeurIPS 2023 (NF4 quantization scheme)
347+
- AWQ: Lin et al., "AWQ: Activation-aware Weight Quantization for
348+
LLM Compression and Acceleration", MLSys 2024 (outlier handling)
349+
- SmoothQuant: Xiao et al., "SmoothQuant: Accurate and Efficient
350+
Post-Training Quantization for Large Language Models",
351+
ICML 2023 (per-channel scaling)
352+
- KIVI: Liu et al., "KIVI: A Tuning-Free
Asymmetric 2bit Quantization for KV Cache", ICML 2024 (KV cache quantization)
