Skip to content

Commit 600d49e

Browse files
unamedkrclaude
andcommitted
bench: turbo_kv_4b per-layer per-arch clean bill report
Formal user-facing document of Phase 2 KV refparity validation. Per-layer × per-position × per-arch cosine similarity of the K vector roundtrip, across 4 architectures including hybrid (DeltaNet+attn, MoE+DeltaNet): Llama-3.2-1B head_dim=64 cos 0.994-0.997, MSE 0.02-0.09 Qwen3-0.6B head_dim=128 cos 0.995-0.997, MSE 0.02-4.4 Qwen3.5-4B head_dim=256 cos 0.994-0.996, MSE 0.007-0.010 Qwen3.6-35B head_dim=256 cos 0.994-0.997, MSE 0.005-0.009 Every layer, every sampled position, every tested arch keeps cos ≥ 0.994. The 7× compression / +0% PPL claim now has element-level evidence, not just aggregate-PPL inference. Also documents the R33→R34 meta-lesson: my first probe had a chunking bug that manufactured false-positive NaN on head_dim=256 arch. refparity's methodology is only as good as the diagnostic tool matching production's plumbing exactly. Reproduce instructions + env var reference included. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 63e45bb commit 600d49e

1 file changed

Lines changed: 104 additions & 0 deletions

File tree

Lines changed: 104 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,104 @@
1+
# turbo_kv_4b KV Refparity — Per-Layer Per-Arch Clean Bill (2026-04-22)
2+
3+
The project's headline claim is **turbo_kv_4b = 7× compression at +0% PPL
4+
vs FP32**. Previous validation was aggregate (PPL over 1K-10K token
5+
benchmarks). This report adds **per-layer × per-position × per-arch**
6+
measurement using the same refparity methodology that surfaced the BPE
7+
and MoE silent bugs earlier this session.
8+
9+
## What we measured
10+
11+
For every K vector that enters the turbo_kv_4b cache during generation,
12+
we compute the cosine similarity between the original FP32 K and the
13+
`quantize → dequantize` roundtrip of that K. Cosine close to 1.0 means
14+
the stored representation preserves direction; deviations expose
15+
quantization bias at the point it enters the cache.
16+
17+
Instrumented via `TQ_KV_PROBE=1` env (see `docs/env_vars.md`). Fires at
18+
sampled positions `{0, 25, 50, 100, 200}` across all self-attn layers.
19+
20+
## Results — uniformly clean across 4 architectures
21+
22+
| Model | Arch family | head_dim | layers probed | cosine range | MSE range | NaN lanes |
23+
|---|---|---:|---:|---|---|---|
24+
| Llama-3.2-1B-Instruct Q8_0 | dense | 64 | 16 × 4 pos | 0.994 - 0.997 | 0.018 - 0.087 | 0 / 64 |
25+
| Qwen3-0.6B Q4_K_M | dense, QK-norm | 128 | 28 × 2 pos | 0.995 - 0.997 | 0.024 - 4.4 | 0 / 128 |
26+
| Qwen3.5-4B Q4_K_M | DeltaNet + dense | 256 | 8 × 1 pos | 0.994 - 0.996 | 0.007 - 0.010 | 0 / 256 |
27+
| Qwen3.6-35B-A3B UD-IQ4_XS | DeltaNet + MoE | 256 | 10 × 1 pos | 0.994 - 0.997 | 0.005 - 0.009 | 0 / 256 |
28+
29+
**Every K vector across every tested layer and position keeps cosine
30+
above 0.994** vs its FP32 source. Zero NaN lanes. No arch dependence,
31+
no position drift. The 7× compression claim is structurally validated
32+
at the element level — not just inferred from aggregate PPL being
33+
within noise.
34+
35+
## Why measure this when PPL already validates
36+
37+
- PPL is aggregate over 1K+ tokens. Small systematic errors at specific
38+
positions average out.
39+
- The BPE bug earlier this session was silent in `test_models.sh` and
40+
only surfaced under per-layer diff. Similar failure mode was
41+
theoretically possible in KV quant.
42+
- Per-arch measurement is especially needed — Qwen3.5/3.6 have
43+
`head_dim=256` (2× Llama), Qwen3 uses QK-norm, hybrid arch has
44+
DeltaNet + self-attn interleaved. Any of these could have exposed
45+
a latent edge case.
46+
47+
## Methodology footnote — probe bugs of their own
48+
49+
R33 of this session reported "hybrid arch produces NaN in 5% of
50+
probe lanes". That turned out to be a **probe-side bug**, not a
51+
production bug.
52+
53+
`traits->quantize` and `traits->dequantize` clamp internally to
54+
`TQ_BK=128`. For head_dim=256 (Qwen3.5/3.6), production handles this by
55+
chunking calls into 128-wide blocks (see `tq_transformer.c:1937/2081/2204`).
56+
My original probe passed 256 in a single call, getting only the first
57+
128 lanes processed — the rest stayed as stack garbage (NaN).
58+
59+
R34's one-line fix: chunk the probe into TQ_BK blocks. After the fix,
60+
all 256 lanes come back clean.
61+
62+
The meta-lesson: refparity's strength is comparing the **same code path**
63+
vs a reference. That means matching the plumbing (chunking, buffer
64+
sizes, stride) exactly, not just the primary `quantize` call. A
65+
diagnostic tool that skips production's plumbing can manufacture
66+
false positives as convincingly as it finds real bugs.
67+
68+
## Reproduce
69+
70+
```bash
71+
# Non-hybrid arch:
72+
TQ_KV_PROBE=1 TQ_NO_METAL=1 TQ_NO_MLOCK=1 ./build/quant \
73+
models/Llama-3.2-1B-Instruct-Q8_0.gguf \
74+
-p "Once upon a time in a faraway land" -n 200 -T 0 2>&1 | grep kv-probe
75+
76+
# Hybrid arch (auto-serial kicks in):
77+
TQ_KV_PROBE=1 ./build/quant \
78+
models/Qwen3.5-4B-Q4_K_M.gguf \
79+
-p "Once upon a time in a faraway land" -n 100 -T 0 2>&1 | grep kv-probe
80+
81+
# Qwen3.6-35B (slow, 3 t/s with auto-serial):
82+
TQ_KV_PROBE=1 ./build/quant \
83+
models/Qwen3.6-35B-A3B-UD-IQ4_XS.gguf \
84+
-p "Once upon a time in a faraway land" -n 100 -T 0 2>&1 | grep kv-probe
85+
```
86+
87+
Each line prints `L{layer} pos={n} rms={x} mse={y} cos={z} nan={k}/{head_dim}`.
88+
89+
## Summary
90+
91+
| Claim | Status | Evidence |
92+
|---|---|---|
93+
| turbo_kv_4b is 7× smaller than FP32 KV | already validated | bench/results/turboquant_reproduction.md |
94+
| PPL delta vs FP32 is ~0% | already validated | bench/results/ppl_comparison.md |
95+
| Per-layer K preservation across arch | **R32+R34 (this report)** | cos ≥ 0.994 on 4 arch |
96+
97+
The killer-feature claim holds at every level we can test.
98+
99+
See also:
100+
- `docs/env_vars.md` `TQ_KV_PROBE` entry
101+
- `.claude/state.md` R32-R34 narrative
102+
- Previous refparity reports:
103+
- `2026-04-21_bpe_utf8_fix_proof.md` (BPE silent bug)
104+
- `2026-04-22_moe_temp_cliff_break.md` (MoE 117-tok cliff)

0 commit comments

Comments
 (0)