
Commit 63e45bb

unamedkr and claude committed
★ debug(kv): probe chunking fix — turbo_kv_4b CLEAN across all tested arch
R33's "hybrid NaN" was a probe bug, not production. GGUF metadata confirms Qwen3.5-4B and Qwen3.6-35B have key_length=256, but TQ_BK=128 means the traits quantize/dequantize clamp internally. Production handles this by chunking (see tq_transformer.c:1937/2081/2204). My probe didn't. Fix: match production — chunk probe calls into TQ_BK blocks.

Post-fix per-arch KV quantization error (cos = K roundtrip cosine sim):

- Llama-3.2-1B   head_dim=64    cos 0.994-0.997   NaN 0/64
- Qwen3-0.6B     head_dim=128   cos 0.995-0.997   NaN 0/128
- Qwen3.5-4B     head_dim=256   cos 0.994-0.996   NaN 0/256   ← was "inf/nan"
- Qwen3.6-35B    head_dim=256   cos 0.994-0.997   NaN 0/256   ← was "inf/nan"

turbo_kv_4b is now **uniformly clean across every tested architecture**. Per-layer per-position cos ≥ 0.994 confirms the 7x compression / +0% PPL claim is structurally preserved — not just aggregate-validated.

Methodology lesson: refparity's methodology worked; my probe implementation had a silent bug of its own (the ironic kind). Always verify diagnostic tooling matches the production code path for the PLUMBING too — chunking, buffer sizes, stride assumptions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 4b6019e commit 63e45bb

3 files changed

Lines changed: 53 additions & 4 deletions


.claude/state.md

Lines changed: 42 additions & 0 deletions
@@ -3,6 +3,48 @@
**Last updated**: 2026-04-21 (Phase 1 refparity ★)
**Session HEAD**: Reference-parity framework (tools/refparity/) LANDED — HF vs engine per-layer diff, pos-aligned, post_norm-aware.

## ★ Phase 2 R34 — KV probe chunking fix — turbo_kv_4b CLEAN across all tested arch ★

R33 reported "hybrid arch produces NaN/inf in probe, production unaffected".
That was HALF correct. The NaN/inf was real, but not from turbo_kv_4b's
dequant edge case on small-rms keys — it was from **my probe code
clamping head_dim > TQ_BK**.

Root cause (probe bug, not production bug):

```c
dt->quantize(key, dst, head_dim); // traits clamp to TQ_BK=128
// For Qwen3.5-4B / Qwen3.6-35B head_dim=256: only first 128 written
// rc[128..255] stays as stack garbage (often NaN)
```

GGUF metadata confirms Qwen3.5-4B and Qwen3.6-35B have `key_length=256`.
Production handles head_dim > TQ_BK by chunking (tq_transformer.c:1937,
2081, 2204). The probe didn't → false positive.

Fix: chunk probe calls into TQ_BK blocks, mirroring production.

After fix — TRUE per-arch KV quant error:

| arch | head_dim | cos | MSE | NaN |
|---|---|---|---|---|
| Llama-3.2-1B | 64 | 0.994-0.997 | 0.02-0.09 | 0/64 |
| Qwen3-0.6B | 128 | 0.995-0.997 | 0.02-4.4 | 0/128 |
| Qwen3.5-4B | **256** | **0.994-0.996** | **0.007-0.010** | **0/256** |
| Qwen3.6-35B | **256** | **0.994-0.997** | **0.005-0.009** | **0/256** |

**turbo_kv_4b is uniformly clean** across every tested architecture.
Cos ≥ 0.994 means the 7×-compression claim is structurally preserved
per-layer per-position — not just at aggregate PPL level.

### Methodology double-lesson

R32 correctly said "Llama is clean". R33 wrongly said "hybrid is
ambiguous". R34 finds: refparity's methodology is right; my probe
implementation had a silent bug (just like the silent bugs we catch
with this methodology). Always verify your diagnostic tools match the
production code path even for the plumbing (chunking, buffer sizes).
## Phase 2 R33 — KV probe: hybrid arch limitation surfaced, production unaffected (2026-04-22)
Extended R32's `TQ_KV_PROBE` to Qwen3 family:

docs/env_vars.md

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@ here is opt-in; defaults are the tested production path.
| `TQ_MOE_FAST_EXP` | off | Use Schraudolph fast-exp in MoE SwiGLU (vs exact expf default). ~2% per-call error; may re-introduce long-gen drift |
| `TQ_MOE_ROUTE_TEMP` | `1.0` (auto-flipped to `2.0` on qwen35moe arch at load time) | Softmax temperature on top-K expert routing. **`2.0` extends Qwen3.6-35B coherence from 117 → 200+ tokens** on the "Once upon a time" drift-trigger prompt (measured R26). Auto-detected for qwen35moe at model load (see also `TQ_NO_MOE_TEMP_AUTO`). Other arch default stays `1.0` (identity). Trade: slightly less decisive routing = slightly broader expert mix, but top-K set unchanged. `"Paris"` factual probe still correct at T=2.0 |
| `TQ_NO_MOE_TEMP_AUTO` | off | Disable the qwen35moe auto-default flip. Use if you want the prior baseline T=1.0 behavior on Qwen3.6-35B |
- | `TQ_KV_PROBE` | off | Dump per-layer K quantization roundtrip stats (rms, MSE, cosine) at positions 0/25/50/100/200. On **Llama-3.x** turbo_kv_4b holds cosine ≥0.994 cleanly (R32). On **Qwen3 non-hybrid** (Qwen3-0.6B) also clean. On **hybrid arch** (Qwen3.5-4B, Qwen3.6-35B: DeltaNet + self-attn) the probe's **full-dequant roundtrip** produces NaN in ~5% of lanes due to edge-case in turbo_kv_4b's Hadamard-inverse + codebook round-trip for small-rms post-norm keys. This is a **probe artifact, not a production bug** — production attention uses `attention_ref` (rotated-space dot product, no full dequant). The probe is thus useful on non-hybrid arch; treat hybrid readings with skepticism |
+ | `TQ_KV_PROBE` | off | Dump per-layer K quantization roundtrip stats (rms, MSE, cosine) at positions 0/25/50/100/200. Measured cos ≥ 0.994 uniformly across Llama-3.x (head_dim=64), Qwen3-0.6B (128), Qwen3.5-4B (256), Qwen3.6-35B (256). No arch drift, no position drift. The probe chunks calls into TQ_BK-sized blocks to match how production handles head_dim > TQ_BK (R34 fix) |

## Quality / correctness

src/engine/tq_transformer.c

Lines changed: 10 additions & 3 deletions
```diff
@@ -1841,9 +1841,16 @@ static void self_attn_forward(tq_model_t* model, tq_state_t* s, int l, int pos)
     if (_kv_probe_fire) {
       const tq_type_traits_t* dt = &TQ_TRAITS[s->kv_quant_type];
       const float* dbg_key = save_pre_norm_keys ? pre_norm_keys : s->k;
-      float mse=0,cn=0,cd1=0,cd2=0; uint8_t tb[1024]; float rc[512];
-      dt->quantize(dbg_key, tb, head_dim);
-      dt->dequantize(tb, rc, head_dim);
+      float mse=0,cn=0,cd1=0,cd2=0; uint8_t tb[1024]; float rc[1024];
+      /* head_dim may exceed TQ_BK=128 (Qwen3.5/3.6 key_length=256). Traits
+       * quantize/dequantize clamp to TQ_BK per call, so chunk manually —
+       * this mirrors how production handles it (tq_transformer.c:1937/2081). */
+      int block_sz = head_dim > TQ_BK ? TQ_BK : head_dim;
+      for (int blk = 0; blk < head_dim; blk += block_sz) {
+        int bl = (blk + block_sz > head_dim) ? (head_dim - blk) : block_sz;
+        dt->quantize(dbg_key + blk, tb, bl);
+        dt->dequantize(tb, rc + blk, bl);
+      }
       for(int i=0;i<head_dim;i++){float d=dbg_key[i]-rc[i];mse+=d*d;cn+=dbg_key[i]*rc[i];cd1+=dbg_key[i]*dbg_key[i];cd2+=rc[i]*rc[i];}
       float dbg_mn=dbg_key[0],dbg_mx=dbg_key[0];
       int nz=0;
```
