| 1 | +# ADR 0010 — Full-attention verifier + low-precision (INT8 / NF4) KV cache |
| 2 | + |
| 3 | +- **Status**: Proposed (safety-net for ADR 0011) — 2026-06-07 |
| 4 | +- **Date**: 2026-06-07 |
| 5 | +- **Decision drivers**: |
| 6 | + - The 2026-06-06 `sink+window` quality A/B benchmark |
| 7 | + (`results/platform-tests/sink_window_quality_ab_1780714635.json`) |
| 8 | + showed that v0.3's `SinkWindowVerifier` drops mid-context fact recall |
| 9 | + from 100% to 16.7% (5 of 6 cases lost) against a full-attention baseline. |
| 10 | + The `sink+window` design buys bounded KV by literally evicting K/V |
| 11 | + tensors for tokens outside `(sink ∪ window)`; nothing the proposer |
| 12 | + does at inference time can recover information that was deleted |
| 13 | + from the verifier's cache. |
| 14 | + - ADR 0011 ("cross-attention proposer/verifier coupling") is the |
| 15 | + hypothesis that cross-attention from a full-attention proposer |
| 16 | + hidden bank into a bounded verifier can rescue the lost recall. |
| 17 | + R1c GPU evidence |
| 18 | + (`results/research/cross_attn_toy_vast_full_1780806644.json`, |
| 19 | + `results/research/cross_attn_toy_vast_needle_small_1780806644.json`) |
| 20 | + establishes that the mechanism partially works (16% on a 20-vocab |
| 21 | + needle task, 0% on the 135k-vocab full task) but is far from the |
| 22 | + G-X1 ≥ 80% acceptance criterion. R1d-β will give a more definitive |
| 23 | + answer; this ADR is the safety net **independent of R1d-β's |
| 24 | + outcome**. |
| 25 | + - User-stated v0.4 strategic constraints (recorded 2026-06-06): |
| 26 | + *no deadline, no sunk-cost reasoning, extreme KV efficiency, |
| 27 | + zero intelligence regression*. ADR 0010 takes "zero intelligence |
| 28 | + regression" as a hard constraint and trades on the "extreme KV |
| 29 | + efficiency" axis using a different mechanism than ADR 0011. |
| 30 | +- **Depends on**: ADR 0001 (proposer sizing + speculative decoding |
| 31 | + contract), ADR 0002 (verifier selection — Qwen3-1.7B, Gemma 3-1B |
| 32 | + family). |
| 33 | +- **Relates to**: ADR 0011 (cross-attention bridge). The two are |
| 34 | + *alternative* approaches to the same problem |
| 35 | + ("how do we get extreme KV savings on long-context workloads |
| 36 | + without intelligence regression?"). They are not mutually |
| 37 | + exclusive in code — a future v0.5 could combine bounded |
| 38 | + cross-attention rescue (ADR 0011 if validated) with low-precision |
| 39 | + full attention (this ADR) for compounding savings — but for v0.4 |
| 40 | + GA they are exclusive choices because they share the verifier |
| 41 | + forward path and require different memory layout. |
| 42 | + |
| 43 | +--- |
| 44 | + |
| 45 | +## 1. Context |
| 46 | + |
| 47 | +### 1.1 What `sink+window` actually costs |
| 48 | + |
| 49 | +v0.3's `SinkWindowVerifier` keeps K/V tensors only for |
| 50 | +`{0..sink-1} ∪ {q-window+1..q}` for each query position q. At |
| 51 | +`sink=4, window=64` over a 256–1024 token haystack with the needle |
| 52 | +at a random middle position, the A/B run measured: |
| 53 | + |
| 54 | +| | Full-context Qwen3-1.7B greedy | v0.3 (Qwen3-0.6B dLM proposer + Qwen3-1.7B sink+window verifier) | |
| 55 | +|---|---|---| |
| 56 | +| Mid-context fact recall | 6/6 (100%) | 1/6 (16.7%) | |
| 57 | +| Peak KV bytes (B=1, S=84) | 56,311,808 | 7,798,784 | |
| 58 | + |
| 59 | +Five of the six cases fail under `sink+window`: the needle's K/V |
| 60 | +was evicted before the answer position, and no proposer-side |
| 61 | +mechanism in v0.3 can rescue it. |
| 62 | + |
| 63 | +The strategic question for v0.4 is whether to (a) accept the |
| 64 | +intelligence regression, (b) recover the lost information through |
| 65 | +cross-attention from a full-attention proposer (ADR 0011), or (c) |
| 66 | +keep full attention on the verifier and trade on memory in a |
| 67 | +different dimension (this ADR). |
| 68 | + |
| 69 | +### 1.2 Where ADR 0011's R1c evidence stands |
| 70 | + |
| 71 | +R1c (vast H200, 2 × 16 min, 2 GPU runs): |
| 72 | + |
| 73 | +- 20-vocab diagnostic task: cross-attn bridge reaches 16% recall |
| 74 | + (final), peaks at 25% at step 800. The mechanism injects needle |
| 75 | + information in some fraction of cases — not noise. |
| 76 | +- 135k-vocab full task: 0.00 recall throughout 2000 training steps. |
| 77 | + Loss converges to perplexity ~2.3 yet recall does not rise. |
| 78 | + |
| 79 | +This is consistent with two interpretations: |
| 80 | + |
| 81 | +- **(I-1)** Single-layer cross-attn at depth 20 has too little |
| 82 | + capacity to encode an arbitrary needle into the verifier's |
| 83 | + residual stream as a precise argmax-flipping signal; multi-layer |
| 84 | + / multi-depth bridges can close the gap (R1d-β → R1e). |
| 85 | +- **(I-2)** The full-attention proposer's hidden bank, as a generic |
| 86 | + pretrained representation, is not localizable enough by gradient |
| 87 | + descent to be a usable index — i.e., the §3 hypothesis is wrong |
| 88 | + in shape, and no amount of capacity in the bridge fixes it. |
| 89 | + |
| 90 | +R1d-β (auxiliary retrieval loss + attention-localization metric) is |
| 91 | +designed to distinguish (I-1) from (I-2). ADR 0010 is the v0.4 GA |
| 92 | +plan if R1d-β returns (I-2) or if R1e cannot reach 80% within a |
| 93 | +reasonable compute budget. |
| 94 | + |
| 95 | +### 1.3 The ADR 0010 framing |
| 96 | + |
| 97 | +Keep full attention on the verifier — same intelligence as the |
| 98 | +oracle baseline by construction — but reduce the *bytes per cached |
| 99 | +token* by quantizing K/V to lower precision. The KV cache is the |
| 100 | +dominant memory term for long-context inference (it grows linearly |
| 101 | +with context length and dominates weights once context > a few k |
| 102 | +tokens), so a 2× or 4× per-token compression buys back most of the |
| 103 | +practical memory benefit `sink+window` provided. |
| 104 | + |
| 105 | +### 1.4 Memory math |
| 106 | + |
| 107 | +Per token, per layer KV bytes: |
| 108 | + |
| 109 | +| Precision | bytes/elem | KV bytes/(token, layer) for hidden=1152 (Gemma 3-1B) | for hidden=3584 (Gemma 4-9B class) | |
| 110 | +|---|---|---|---| |
| 111 | +| **bf16** (current) | 2 | 4,608 | 14,336 | |
| 112 | +| **INT8** | 1 | 2,304 (-50%) | 7,168 (-50%) | |
| 113 | +| **INT4 / NF4** | 0.5 | 1,152 (-75%) | 3,584 (-75%) | |
| 114 | + |
| 115 | +For multi-layer aggregate at typical layer counts: |
| 116 | + |
| 117 | +- Gemma 3-1B (26 layers): bf16 ≈ **120 KB/token**, INT8 ≈ 60, NF4 ≈ 30 |
| 118 | +- Gemma 4-9B-class (≈ 42 layers): bf16 ≈ **600 KB/token**, INT8 ≈ 300, NF4 ≈ 150 |
| 119 | + |
| 120 | +For Mac mini 24 GB targeting 64 k-token context on Gemma 4-9B class: |
| 121 | + |
| 122 | +- bf16 KV: 64 k × 600 KB = **~37 GB** → does not fit. v0.3 only fit by trimming the cache. |
| 123 | +- INT8 KV: ~18 GB → fits, leaving ~6 GB for weights/activations. |
| 124 | +- NF4 KV: ~9 GB → fits comfortably; leaves room for KV growth past 100 k tokens. |
| 125 | + |
| 126 | +For comparison, `sink+window` at `sink=4, window=64` caps at |
| 127 | +~68 tokens × 600 KB ≈ **41 MB** regardless of context length. ADR 0010 |
| 128 | +wins on a **different axis from `sink+window`**: not "constant memory", but |
| 129 | +"linear memory at half/quarter the slope, with full intelligence". |
| 130 | + |
| 131 | +ADR 0010 and ADR 0011 are complementary in the long run: combining |
| 132 | +them (if 0011 is validated) is a v0.5+ direction. |
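The figures in §1.4 can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming (as the table does) that K and V each store `hidden` elements per token per layer, with no GQA/MQA head sharing and no scale or outlier overhead. The function names are illustrative and not part of the codebase.

```python
# Bytes per stored element for each cache precision in the §1.4 table.
BYTES_PER_ELEM = {"bf16": 2.0, "int8": 1.0, "nf4": 0.5}


def kv_bytes_per_token(hidden: int, layers: int, precision: str) -> float:
    """Bytes of K+V cache added per token, summed over all layers."""
    # Factor 2 covers the two tensors (K and V).
    return 2 * hidden * BYTES_PER_ELEM[precision] * layers


def kv_cache_gib(hidden: int, layers: int, precision: str, tokens: int) -> float:
    """Total KV cache size in GiB for a given context length."""
    return kv_bytes_per_token(hidden, layers, precision) * tokens / 2**30


if __name__ == "__main__":
    models = [("Gemma 3-1B", 1152, 26), ("9B class", 3584, 42)]
    for name, hidden, layers in models:
        for precision in BYTES_PER_ELEM:
            per_token = kv_bytes_per_token(hidden, layers, precision)
            total = kv_cache_gib(hidden, layers, precision, 64 * 1024)
            print(f"{name:12s} {precision:5s} {per_token:>10,.0f} B/token "
                  f"{total:6.2f} GiB at 64k tokens")
```

The 9B-class bf16 row comes out at 602,112 B/token and 36.75 GiB at 64 k tokens, which matches the rounded "600 KB/token" and "~37 GB" figures above.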
| 133 | + |
| 134 | +--- |
| 135 | + |
| 136 | +## 2. Decisions |
| 137 | + |
| 138 | +### 2.1 Default precision: NF4 (4-bit normal-float) |
| 139 | + |
| 140 | +NF4 (introduced in QLoRA, 2023) is a 4-bit data type whose 16 levels |
| 141 | +are tuned for roughly normal value distributions. This ADR assumes |
| 142 | +K/V projections are close enough to normal (weight decay, layer-norm |
| 143 | +structure) for that to hold. Weight-quantization benchmarks |
| 144 | +(QLoRA paper + follow-ups, AWQ paper) put NF4 within 0.3–0.8% of |
| 145 | +bf16 on MMLU / HellaSwag / ARC at 7B–13B parameter scale, and uniform |
| 146 | +INT4 ~0.5% worse than NF4; Phase C checks that this carries over to KV. |
| 147 | + |
| 148 | +INT8 is the **safe-default fallback** when a backend cannot host |
| 149 | +NF4 efficiently (e.g., MPS without bnb-style kernels). INT8 is |
| 150 | +within 0.05–0.1% of bf16 in the same benchmarks — effectively |
| 151 | +indistinguishable. |
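To make the NF4 idea concrete, the sketch below quantizes one block of values by absmax scaling and nearest-level lookup. It rests on two assumptions: the level table is the NF4 codebook rounded to four decimals (check it against the QLoRA / `bitsandbytes` source before relying on it), and the block handling is simplified (no bit packing, no double quantization of scales). The function names are hypothetical.

```python
# Approximate NF4 levels, rounded to 4 decimals (assumption: verify upstream).
NF4_LEVELS = [
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
]


def nf4_quantize_block(block):
    """Scale a block into [-1, 1] by its absmax, then pick the nearest level."""
    scale = max(abs(x) for x in block) or 1.0  # all-zero block: avoid divide by zero
    codes = [
        min(range(16), key=lambda i: abs(NF4_LEVELS[i] - x / scale))
        for x in block
    ]
    return codes, scale


def nf4_dequantize_block(codes, scale):
    """Map 4-bit codes back to real values using the block scale."""
    return [NF4_LEVELS[c] * scale for c in codes]


if __name__ == "__main__":
    block = [0.0, 2.0, -2.0, 0.5]
    codes, scale = nf4_quantize_block(block)
    print(codes, scale, nf4_dequantize_block(codes, scale))
```

The levels are denser near zero than at the extremes, which is why NF4 is expected to beat uniform INT4 when values cluster around zero.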
| 152 | + |
| 153 | +### 2.2 Calibration: per-channel scales, bf16 outlier channels |
| 154 | + |
| 155 | +KV tensors have outlier channels (well-documented in SmoothQuant, |
| 156 | +AWQ). Two-step quantization: |
| 157 | + |
| 158 | +1. Per-token, per-head **outlier mask**: top-k channels by absolute |
| 159 | + magnitude (k = 1–2) are kept in bf16. |
| 160 | +2. Remaining channels: per-channel symmetric quant for K |
| 161 | + (zero-centered after layer-norm), per-channel asymmetric for V |
| 162 | + (no zero-centering guarantee). |
| 163 | + |
| 164 | +This adds ~3–5% storage overhead (the bf16 outliers + per-channel |
| 165 | +scales) but recovers most of the long-context retrieval quality |
| 166 | +that uniform per-tensor quant loses. |
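The two-step scheme above can be sketched in pure Python on a tiny `[tokens][channels]` tile. This is an illustration under simplifying assumptions, not the planned `kv_quant` implementation: it shows only the symmetric (K) path at 8 bits, treats a single head, and uses hypothetical function names. The asymmetric V path would add a per-channel zero point.

```python
def split_outliers(row, k=1):
    """Pull the k largest-magnitude channels out of one token's row."""
    top = sorted(range(len(row)), key=lambda c: abs(row[c]), reverse=True)[:k]
    kept = {c: row[c] for c in top}  # stored at full precision
    dense = [0.0 if c in kept else x for c, x in enumerate(row)]
    return kept, dense


def quantize_per_channel(rows, bits=8):
    """Symmetric quantization with one scale per channel (column)."""
    qmax = 2 ** (bits - 1) - 1
    n_ch = len(rows[0])
    scales = [(max(abs(r[c]) for r in rows) / qmax) or 1.0 for c in range(n_ch)]
    codes = [[round(r[c] / scales[c]) for c in range(n_ch)] for r in rows]
    return codes, scales


def encode(rows, k=1, bits=8):
    """Step 1: mask outliers per token. Step 2: quantize what remains."""
    pairs = [split_outliers(r, k) for r in rows]
    outliers = [kept for kept, _ in pairs]
    dense = [d for _, d in pairs]
    codes, scales = quantize_per_channel(dense, bits)
    return codes, scales, outliers


def decode(codes, scales, outliers):
    """Dequantize, then write the exact outlier values back in place."""
    rows = [[q * s for q, s in zip(row, scales)] for row in codes]
    for row, kept in zip(rows, outliers):
        for c, x in kept.items():
            row[c] = x
    return rows


if __name__ == "__main__":
    k_tile = [[0.1, -0.2, 50.0, 0.3], [0.2, 40.0, -0.1, 0.1]]
    print(decode(*encode(k_tile)))
```

Without the outlier mask, the 50.0 and 40.0 entries would set the scale for their channels and flatten the small values sharing those channels to zero. Removing them first keeps the per-channel scales tight.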
| 167 | + |
| 168 | +### 2.3 Backends |
| 169 | + |
| 170 | +- **MLX (Apple Silicon)**: quantize the K/V projections with `mx.quantize` |
| 171 | + immediately before they enter the cache and `mx.dequantize` on the read |
| 172 | + side. INT8 fallback uses the same path with a different `bits=` argument; |
| 173 | + MLX 0.31+ supports both. `mx.quantize` is group-wise affine by default, |
| 174 | + so an exact NF4 codebook may need a custom lookup step. |
| 175 | +- **PyTorch / CUDA**: use `bitsandbytes` for NF4 (well-tested on |
| 176 | + CUDA), fall back to INT8 via `torch.quantize_per_channel` for |
| 177 | + hardware without `bnb`. |
| 178 | +- **CPU (test/CI)**: INT8 only; NF4 has no efficient CPU kernel and |
| 179 | + is not a v0.4 GA target. |
| 180 | + |
| 181 | +### 2.4 Sink+window stays as a feature flag, not a default |
| 182 | + |
| 183 | +`SinkWindowVerifier` is preserved in `inference_engine.backends.*` |
| 184 | +but defaults to disabled in v0.4. Workloads that explicitly request |
| 185 | +constant-memory KV (e.g., long-running agent loops on tiny edge |
| 186 | +hardware where even NF4 × full-context is too much) opt in via |
| 187 | +`Verifier(kv_strategy="sink_window", sink=..., window=...)`. |
| 188 | + |
| 189 | +### 2.5 Speculative decoding contract: unchanged |
| 190 | + |
| 191 | +The dLM proposer + AR verifier speculative decoding loop from ADR |
| 192 | +0001 remains exactly as in v0.3. Verification still runs at bf16 |
| 193 | +precision (K/V are dequantized before attention, so logits and |
| 194 | +argmax/softmax stay in bf16); only the *K/V cache storage* is quantized. |
| 195 | +This preserves byte-exact determinism under the ADR 0008 §6.5 INV-3 gate. |
| 196 | + |
| 197 | +--- |
| 198 | + |
| 199 | +## 3. Alternatives considered |
| 200 | + |
| 201 | +| Alternative | Status | Why rejected (or why deferred) | |
| 202 | +|---|---|---| |
| 203 | +| Keep `sink+window` as v0.4 default | Rejected | Empirically loses ≥83% on middle-context recall; conflicts with "zero intelligence regression". | |
| 204 | +| ADR 0011 cross-attention bridge | **Active research** | Conditional on R1d-β / R1e outcome. ADR 0010 is the safety net if 0011 is rejected. If 0011 is accepted, ADR 0010 may still ship as an *additive* memory optimization (combining bounded cross-attention + low-precision storage for compounded savings). | |
| 205 | +| Sliding-window-only (no sink) | Rejected | Same intelligence regression as `sink+window`; worse on early-context anchoring. | |
| 206 | +| H2O / SnapKV / PyramidKV importance-based eviction | Deferred | Improves on `sink+window` for some workloads but still evicts. Requires per-token importance scoring at inference time (compute cost). v0.5 candidate. | |
| 207 | +| Mamba / RWKV / RetNet long-context-native models | Out of scope | Changes the project's model-identity. ADR 0001 commits to Qwen3 / Gemma family. | |
| 208 | +| KV cache *offload* to disk / shared memory | Deferred | Mac mini 24 GB has no fast secondary storage path. Useful for desktops with ample SSD bandwidth — v0.6 candidate. | |
| 209 | + |
| 210 | +--- |
| 211 | + |
| 212 | +## 4. Consequences |
| 213 | + |
| 214 | +### 4.1 What is gained |
| 215 | + |
| 216 | +- **Zero intelligence regression by construction**. Nothing is evicted, |
| 217 | + so token argmax matches the oracle in the limit of lossless dequant; |
| 218 | + the residual is quantization error, which calibrated NF4 / INT8 are |
| 219 | + expected to keep < 1% on standard benchmarks (gated in §6). |
| 220 | +- **2× (INT8) or 4× (NF4) reduction in per-token cache bytes**, |
| 221 | + enough to fit Gemma 4-9B class workloads at ~64–100 k tokens |
| 222 | + on Mac mini 24 GB. |
| 223 | +- **No new training step**. Unlike ADR 0011 (which needs cross- |
| 224 | + attention bridge training, alignment data prep, gate G-X1/2/3 |
| 225 | + empirical validation), ADR 0010 is implementable on top of |
| 226 | + v0.3.0 weights without modifying the proposer or verifier. |
| 227 | +- **Backend-portable**. Apple Silicon and NVIDIA have established |
| 228 | + INT8 and 4-bit kernels; CPU (test/CI) is INT8 only (§2.3). |
| 229 | + |
| 230 | +### 4.2 What is given up |
| 231 | + |
| 232 | +- **Linear memory growth**. KV still grows with context length; on |
| 233 | + pathological multi-hour agent loops with no `clearKvCache` calls |
| 234 | + the cache will eventually exceed any fixed budget. ADR 0010 |
| 235 | + trades an absolute bound (`sink+window`) for a *better slope* on |
| 236 | + a linear curve. Workloads that need an absolute bound must opt |
| 237 | + back into `sink+window` (§2.4). |
| 238 | +- **Compute overhead at the dequant boundary**. Each verifier |
| 239 | + forward pass dequantizes the K/V tensors it reads. Where the |
| 240 | + backend fuses dequant into the attention kernel this should be |
| 241 | + negligible; where it does not, the slowdown is measurable |
| 242 | + (estimated ~5–15% vs bf16). Acceptable for v0.4; measure and |
| 243 | + revisit on a per-backend basis. |
| 244 | +- **Outlier-aware calibration adds complexity**. Per-channel scales |
| 245 | + + outlier mask is non-trivial code; the simpler per-tensor |
| 246 | + symmetric quant is faster but loses 2–5% on long-context |
| 247 | + retrieval. v0.4 ships outlier-aware as the default; per-tensor |
| 248 | + is a runtime flag for benchmarking. |
| 249 | + |
| 250 | +--- |
| 251 | + |
| 252 | +## 5. Implementation plan (PR sequence) |
| 253 | + |
| 254 | +| Phase | Scope | Deliverables | |
| 255 | +|---|---|---| |
| 256 | +| **A** | Quantization primitives (CPU + MLX + CUDA) | `inference_engine.backends.kv_quant` module with `quantize_kv(K, V, bits, scheme)` / `dequantize_kv(...)` and a `KVQuantConfig` dataclass. Linux unit tests for round-trip error bounds. | |
| 257 | +| **B** | Verifier integration (single backend first: MLX) | `inference_engine.backends.mlx.FullAttentionQuantizedVerifier` — same forward signature as `MLXSinkWindowVerifier`, but stores KV in NF4 / INT8 and dequantizes on read. INV-3 determinism gate must pass. | |
| 258 | +| **C** | A/B benchmark vs sink+window vs full-bf16 | Run the same `bench_sink_window_quality_ab.py` matrix on Mac M4 with NF4 / INT8 / sink+window / full-bf16 verifiers. Acceptance: NF4 recall ≥ 95% of full-bf16 on the existing 6-case mid-context fact retrieval benchmark. | |
| 259 | +| **D** | Backend port: PyTorch / CUDA | `inference_engine.backends.pytorch.FullAttentionQuantizedVerifier`. Linux integration tests on a small NVIDIA-equipped runner (or vast.ai). | |
| 260 | +| **E** | Long-session bench under quantized KV | Re-run `bench_session_long_run.py` 4 h at NF4 + INT8. Verify `kv_live_bytes` slope matches the predicted 2× / 4× reduction. | |
| 261 | +| **F** | Default flip + docs | v0.4 default verifier becomes `FullAttentionQuantizedVerifier(bits=4, scheme="nf4_outlier")`. Quickstart updated. `sink+window` documented as a feature flag for memory-bounded edge use. | |
| 262 | + |
| 263 | +Each phase has Linux CI gates + (where applicable) Mac M4 / vast.ai |
| 264 | +empirical gates. PRs are stacked per ADR 0008 §9. |
| 265 | + |
| 266 | +--- |
| 267 | + |
| 268 | +## 6. Validation criteria (v0.4 GA gates) |
| 269 | + |
| 270 | +A v0.4 release shipping ADR 0010 must demonstrate, all on |
| 271 | +reproducible artifacts in `results/platform-tests/` or |
| 272 | +`results/research/`: |
| 273 | + |
| 274 | +1. **Quality parity vs full-bf16**: NF4 verifier achieves ≥ 95% of |
| 275 | + full-bf16 recall on the 6-case mid-context benchmark (in practice |
| 276 | + 6/6, since 5/6 is 83.3%) and > 99% on short-context greedy completions. INT8 ≥ 99%. |
| 277 | +2. **Memory reduction realized**: per-turn `kv_live_bytes` reported |
| 278 | + by `GetSessionInfo` is within 5% of the theoretical |
| 279 | + 2× / 4× target across a 1 h benchmark. |
| 280 | +3. **Determinism preserved**: ADR 0008 §6.5 INV-3 gate passes |
| 281 | + bit-exact between continuation and reset paths under the |
| 282 | + quantized cache. |
| 283 | +4. **Cross-backend equivalence**: MLX and PyTorch backends produce |
| 284 | + matching argmax across a 50-prompt eval set (within int4 / NF4 |
| 285 | + numerical tolerance — exact int8 match expected). |
| 286 | +5. **Long-session stability**: 4 h `bench_session_long_run.py` on |
| 287 | + Mac M4 with `kv_strategy=nf4_full` shows no errors; KV growth |
| 288 | + matches the linear prediction (slope ≈ 1/4 of the bf16 slope). |
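Gate 2 can be expressed as a small check. The sketch below takes one possible reading of "within 5% of the theoretical target": the measured bf16-to-quantized byte ratio must lie within 5% (relative) of 2.0 or 4.0. Both the reading and the function names are assumptions; the real gate would read `kv_live_bytes` from `GetSessionInfo`.

```python
def compression_ratio(bf16_bytes: float, quant_bytes: float) -> float:
    """How many times smaller the quantized cache is than the bf16 cache."""
    return bf16_bytes / quant_bytes


def within_gate(bf16_bytes, quant_bytes, target, tolerance=0.05):
    """True if the measured ratio is within `tolerance` (relative) of `target`."""
    ratio = compression_ratio(bf16_bytes, quant_bytes)
    return abs(ratio - target) / target <= tolerance


if __name__ == "__main__":
    # NF4 with 5% storage overhead (outliers + scales): 250 * 1.05 = 262.5
    print(within_gate(1000.0, 262.5, target=4.0))  # ratio is about 3.81
    # NF4 with 12% overhead would miss the gate.
    print(within_gate(1000.0, 280.0, target=4.0))  # ratio is about 3.57
```

Under this reading, the 3–5% storage overhead estimated in §2.2 uses up most of the 5% tolerance, so the overhead budget and the gate threshold may need to be set together.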
| 289 | + |
| 290 | +--- |
| 291 | + |
| 292 | +## 7. Open questions (to resolve during implementation) |
| 293 | + |
| 294 | +- **Q1**: Per-channel vs per-token vs per-head granularity for |
| 295 | + outlier detection. Initial recommendation: per-head (matches |
| 296 | + attention computation natural axis), top-1 outlier channel |
| 297 | + retained at bf16. Validate empirically in Phase A. |
| 298 | +- **Q2**: Do we quantize on write only, or on both read and |
| 299 | + write (re-quantizing dequantized values during attention update |
| 300 | + passes)? Speculative decoding's verifier-recompute path may |
| 301 | + re-touch the same K/V tensors; double-quantization round-trip |
| 302 | + error compounds. Initial recommendation: quantize-on-write only, |
| 303 | + cache stays in low precision until evicted. |
| 304 | +- **Q3**: Interaction with cross-request KV reuse (deferred per |
| 305 | + ADR 0008 §6 — was ADR 0007's territory). When cross-request |
| 306 | + reuse lands in a future ADR, NF4 storage must round-trip cleanly |
| 307 | + across session boundaries. Out of scope here; flagged for |
| 308 | + whoever takes that on. |
| 309 | +- **Q4**: NF4 + speculative decoding interaction. The proposer |
| 310 | + reads no K/V (it's a dLM); the verifier reads K/V at quantized |
| 311 | + precision. Expected to be neutral. Validate in Phase C. |
| 312 | +- **Q5**: Compatibility with ADR 0011 cross-attention bridge if it |
| 313 | + later passes G-X1. The bridge consumes the proposer's hidden |
| 314 | + bank (which is computed at full-attention bf16 precision and |
| 315 | + stored separately, not in the verifier KV cache); the verifier |
| 316 | + KV cache is what ADR 0010 quantizes. The two should compose |
| 317 | + cleanly; validate in v0.5 if both ship. |
| 318 | + |
| 319 | +--- |
| 320 | + |
| 321 | +## 8. Testing discipline |
| 322 | + |
| 323 | +Same rules as ADR 0008 §9: no fakes, no fallbacks, no overfits, |
| 324 | +100% Linux unit-test coverage where the mechanism is testable |
| 325 | +without GPU; all empirical claims gated on reproducible Mac M4 |
| 326 | +or vast.ai artifacts committed under `results/platform-tests/`. |
| 327 | + |
| 328 | +NF4 round-trip error bounds, outlier mask correctness, |
| 329 | +quant/dequant idempotence, and INV-3 determinism are all |
| 330 | +testable on Linux in CI. |
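As an illustration of what the Linux-side checks might assert, the sketch below tests two of the listed properties on a plain symmetric INT8 quantizer: the round-trip error bound and quant/dequant idempotence. It is a stand-in under stated assumptions. The real tests would target `quantize_kv` / `dequantize_kv`, whose signatures are not reproduced here, and the helper names below are hypothetical.

```python
def quantize(xs, scale, qmax=127):
    """Round each value to the nearest integer level, clamped to [-qmax, qmax]."""
    return [max(-qmax, min(qmax, round(x / scale))) for x in xs]


def dequantize(codes, scale):
    """Map integer codes back to real values."""
    return [q * scale for q in codes]


def check_round_trip(xs, qmax=127):
    """Assert the error bound and idempotence; return the codes for inspection."""
    scale = (max(abs(x) for x in xs) / qmax) or 1.0
    codes = quantize(xs, scale, qmax)
    restored = dequantize(codes, scale)
    # Error bound: rounding to the nearest level moves a value by at most scale / 2.
    assert all(abs(a - b) <= scale / 2 + 1e-12 for a, b in zip(xs, restored))
    # Idempotence: re-quantizing already-quantized values leaves the codes unchanged.
    assert quantize(restored, scale, qmax) == codes
    return codes


if __name__ == "__main__":
    print(check_round_trip([0.5, -1.27, 0.003, 1.0]))
```

The idempotence check is the property that makes the quantize-on-write recommendation in Q2 safe: if a value is ever re-quantized with the same scale, no additional error accumulates.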
| 331 | + |
| 332 | +--- |
| 333 | + |
| 334 | +## 9. References |
| 335 | + |
| 336 | +- `results/platform-tests/sink_window_quality_ab_1780714635.json` |
| 337 | + — the empirical surface that motivates this ADR |
| 338 | +- `results/research/cross_attn_toy_vast_full_1780806644.json`, |
| 339 | + `results/research/cross_attn_toy_vast_needle_small_1780806644.json` |
| 340 | + — R1c evidence informing the safety-net framing |
| 341 | +- ADR 0001 (proposer sizing + speculative decoding contract) |
| 342 | +- ADR 0002 (verifier selection — Qwen3-1.7B, Gemma 3-1B) |
| 343 | +- ADR 0008 (session-bound runtime, INV-3 determinism gate) |
| 344 | +- ADR 0011 (cross-attention bridge — proposed alternative) |
| 345 | +- QLoRA: Dettmers et al., "QLoRA: Efficient Finetuning of |
| 346 | + Quantized LLMs", NeurIPS 2023 (NF4 quantization scheme) |
| 347 | +- AWQ: Lin et al., "AWQ: Activation-aware Weight Quantization for |
| 348 | + LLM Compression and Acceleration", MLSys 2024 (outlier handling) |
| 349 | +- SmoothQuant: Xiao et al., "SmoothQuant: Accurate and Efficient |
| 350 | + Post-Training Quantization for Large Language Models", |
| 351 | + ICML 2023 (per-channel scaling) |
| 352 | +- KIVI: Liu et al., "KIVI: A Tuning-Free Asymmetric 2bit |
| 353 | + Quantization for KV Cache", ICML 2024 (KV cache quantization) |