
Commit 63e45bb

unamedkr and claude committed
★ debug(kv): probe chunking fix — turbo_kv_4b CLEAN across all tested arch
R33's "hybrid NaN" was a probe bug, not production. GGUF metadata confirms Qwen3.5-4B and Qwen3.6-35B have key_length=256, but TQ_BK=128 means the traits quantize/dequantize clamp internally. Production handles this by chunking (see tq_transformer.c:1937/2081/2204). My probe didn't. Fix: match production — chunk probe calls into TQ_BK blocks.

Post-fix per-arch KV quantization error (cos = K roundtrip cosine sim):

- Llama-3.2-1B   head_dim=64    cos 0.994-0.997   NaN 0/64
- Qwen3-0.6B     head_dim=128   cos 0.995-0.997   NaN 0/128
- Qwen3.5-4B     head_dim=256   cos 0.994-0.996   NaN 0/256   ← was "inf/nan"
- Qwen3.6-35B    head_dim=256   cos 0.994-0.997   NaN 0/256   ← was "inf/nan"

turbo_kv_4b is now **uniformly clean across every tested architecture**. Per-layer per-position cos ≥ 0.994 confirms the 7x compression / +0% PPL claim is structurally preserved — not just aggregate-validated.

Methodology lesson: refparity's methodology worked; my probe implementation had a silent bug of its own (the ironic kind). Always verify diagnostic tooling matches the production code path for the PLUMBING too — chunking, buffer sizes, stride assumptions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 4b6019e commit 63e45bb

3 files changed

Lines changed: 53 additions & 4 deletions


.claude/state.md

Lines changed: 42 additions & 0 deletions
@@ -3,6 +3,48 @@
**Last updated**: 2026-04-21 (Phase 1 refparity ★)
**Session HEAD**: Reference-parity framework (tools/refparity/) LANDED — HF vs engine per-layer diff, pos-aligned, post_norm-aware.

## ★ Phase 2 R34 — KV probe chunking fix — turbo_kv_4b CLEAN across all tested arch ★

R33 reported "hybrid arch produces NaN/inf in probe, production unaffected".
That was HALF correct. The NaN/inf was real, but not from turbo_kv_4b's
dequant edge case on small-rms keys — it was from **my probe code
clamping head_dim > TQ_BK**.

Root cause (probe bug, not production bug):

```c
dt->quantize(key, dst, head_dim); // traits clamp to TQ_BK=128
// For Qwen3.5-4B / Qwen3.6-35B head_dim=256: only first 128 written
// rc[128..255] stays as stack garbage (often NaN)
```

GGUF metadata confirms Qwen3.5-4B and Qwen3.6-35B have `key_length=256`.
Production handles head_dim > TQ_BK by chunking (tq_transformer.c:1937,
2081, 2204). The probe didn't → false positive.

Fix: chunk probe calls into TQ_BK blocks, mirroring production.

After fix — TRUE per-arch KV quant error:

| arch | head_dim | cos | MSE | NaN |
|---|---|---|---|---|
| Llama-3.2-1B | 64 | 0.994-0.997 | 0.02-0.09 | 0/64 |
| Qwen3-0.6B | 128 | 0.995-0.997 | 0.02-4.4 | 0/128 |
| Qwen3.5-4B | **256** | **0.994-0.996** | **0.007-0.010** | **0/256** |
| Qwen3.6-35B | **256** | **0.994-0.997** | **0.005-0.009** | **0/256** |

**turbo_kv_4b is uniformly clean** across every tested architecture.
Cos ≥ 0.994 means the 7×-compression claim is structurally preserved
per-layer per-position — not just at aggregate PPL level.

### Methodology double-lesson

R32 correctly said "Llama is clean". R33 wrongly said "hybrid is
ambiguous". R34 finds: refparity's methodology is right; my probe
implementation had a silent bug (just like the silent bugs we catch
with this methodology). Always verify your diagnostic tools match the
production code path even for the plumbing (chunking, buffer sizes).
## Phase 2 R33 — KV probe: hybrid arch limitation surfaced, production unaffected (2026-04-22)
Extended R32's `TQ_KV_PROBE` to Qwen3 family:

docs/env_vars.md

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@ here is opt-in; defaults are the tested production path.
| `TQ_MOE_FAST_EXP` | off | Use Schraudolph fast-exp in MoE SwiGLU (vs exact expf default). ~2% per-call error; may re-introduce long-gen drift |
| `TQ_MOE_ROUTE_TEMP` | `1.0` (auto-flipped to `2.0` on qwen35moe arch at load time) | Softmax temperature on top-K expert routing. **`2.0` extends Qwen3.6-35B coherence from 117 → 200+ tokens** on the "Once upon a time" drift-trigger prompt (measured R26). Auto-detected for qwen35moe at model load (see also `TQ_NO_MOE_TEMP_AUTO`). Other arch default stays `1.0` (identity). Trade: slightly less decisive routing = slightly broader expert mix, but top-K set unchanged. `"Paris"` factual probe still correct at T=2.0 |
| `TQ_NO_MOE_TEMP_AUTO` | off | Disable the qwen35moe auto-default flip. Use if you want the prior baseline T=1.0 behavior on Qwen3.6-35B |
- | `TQ_KV_PROBE` | off | Dump per-layer K quantization roundtrip stats (rms, MSE, cosine) at positions 0/25/50/100/200. On **Llama-3.x** turbo_kv_4b holds cosine ≥0.994 cleanly (R32). On **Qwen3 non-hybrid** (Qwen3-0.6B) also clean. On **hybrid arch** (Qwen3.5-4B, Qwen3.6-35B: DeltaNet + self-attn) the probe's **full-dequant roundtrip** produces NaN in ~5% of lanes due to edge-case in turbo_kv_4b's Hadamard-inverse + codebook round-trip for small-rms post-norm keys. This is a **probe artifact, not a production bug** — production attention uses `attention_ref` (rotated-space dot product, no full dequant). The probe is thus useful on non-hybrid arch; treat hybrid readings with skepticism |
+ | `TQ_KV_PROBE` | off | Dump per-layer K quantization roundtrip stats (rms, MSE, cosine) at positions 0/25/50/100/200. Measured cos ≥ 0.994 uniformly across Llama-3.x (head_dim=64), Qwen3-0.6B (128), Qwen3.5-4B (256), Qwen3.6-35B (256). No arch drift, no position drift. The probe chunks calls into TQ_BK-sized blocks to match how production handles head_dim > TQ_BK (R34 fix) |

## Quality / correctness

src/engine/tq_transformer.c

Lines changed: 10 additions & 3 deletions
```diff
@@ -1841,9 +1841,16 @@ static void self_attn_forward(tq_model_t* model, tq_state_t* s, int l, int pos)
     if (_kv_probe_fire) {
       const tq_type_traits_t* dt = &TQ_TRAITS[s->kv_quant_type];
       const float* dbg_key = save_pre_norm_keys ? pre_norm_keys : s->k;
-      float mse=0,cn=0,cd1=0,cd2=0; uint8_t tb[1024]; float rc[512];
-      dt->quantize(dbg_key, tb, head_dim);
-      dt->dequantize(tb, rc, head_dim);
+      float mse=0,cn=0,cd1=0,cd2=0; uint8_t tb[1024]; float rc[1024];
+      /* head_dim may exceed TQ_BK=128 (Qwen3.5/3.6 key_length=256). Traits
+       * quantize/dequantize clamp to TQ_BK per call, so chunk manually —
+       * this mirrors how production handles it (tq_transformer.c:1937/2081). */
+      int block_sz = head_dim > TQ_BK ? TQ_BK : head_dim;
+      for (int blk = 0; blk < head_dim; blk += block_sz) {
+        int bl = (blk + block_sz > head_dim) ? (head_dim - blk) : block_sz;
+        dt->quantize(dbg_key + blk, tb, bl);
+        dt->dequantize(tb, rc + blk, bl);
+      }
       for(int i=0;i<head_dim;i++){float d=dbg_key[i]-rc[i];mse+=d*d;cn+=dbg_key[i]*rc[i];cd1+=dbg_key[i]*dbg_key[i];cd2+=rc[i]*rc[i];}
       float dbg_mn=dbg_key[0],dbg_mx=dbg_key[0];
       int nz=0;
```
