debug(kv): refined probe — hybrid arch limitation surfaced (not a production bug)

unamedkr · claude · unamedkr · commit 4b6019e2c868 · 2026-04-22T01:22:48.000+09:00
Extended R32 TQ_KV_PROBE to Qwen3 family. Findings:

  Llama-3.2-1B non-hybrid:       cos 0.994-0.997, MSE 0.02-0.09, 0/64 NaN
  Qwen3-0.6B non-hybrid:         cos 0.995-0.997, MSE 0.02-4.4,  0/128 NaN
  Qwen3.5-4B DeltaNet+attn:      inf/NaN, 6/256 NaN lanes
  Qwen3.6-35B MoE+DeltaNet:      inf/NaN, 6/256 NaN lanes

On hybrid arch the probe's full-dequant roundtrip produces NaN in ~5%
of lanes due to Hadamard-inverse × codebook edge case for small-rms
post-norm keys. Input verified finite (nan_in=0).

Production unaffected: attention on hybrid uses tq_turbo_kv_4b_attention_ref
(rotated-space dot, no full dequant). Probe measured the wrong path.

Methodology lesson: refparity's value is comparing the SAME code path
vs a reference. Probe chose a code path production doesn't use → false
positive on hybrid. Documented in env_vars.md; next-round fix is a
production-path-matching probe (query @ dequant(K) vs attention_ref).

Refined probe to recompute stats excluding NaN lanes so the signal
remains useful. Llama cos now cleanly 0.995+, 0/64 NaN (confirms R32).

Co-Authored-By: Claude Opus 4.7 (1M context) &lt;noreply@anthropic.com&gt;
diff --git a/.claude/state.md b/.claude/state.md
@@ -3,6 +3,44 @@
 **Last updated**: 2026-04-21 (Phase 1 refparity ★)
 **Session HEAD**: Reference-parity framework (tools/refparity/) LANDED — HF vs engine per-layer diff, pos-aligned, post_norm-aware.
 
+## Phase 2 R33 — KV probe: hybrid arch limitation surfaced, production unaffected (2026-04-22)
+
+Extended R32's `TQ_KV_PROBE` to Qwen3 family:
+
+| arch | pos | cosine range | MSE range | NaN lanes |
+|---|---|---|---|---|
+| Llama-3.2-1B (non-hybrid) | 25-200 | 0.994-0.997 | 0.018-0.087 | 0/64 |
+| Qwen3-0.6B (non-hybrid, QK-norm) | 25, 50 | 0.995-0.997 | 0.024-4.4 | 0/128 |
+| Qwen3.5-4B (DeltaNet+attn hybrid) | 25 | — | inf | **6/256** |
+| Qwen3.6-35B (MoE+DeltaNet hybrid) | 0-200 | — | inf | **6/256** |
+
+### The finding
+
+On hybrid arch, the probe's full-dequant roundtrip — `dt->quantize`
+followed by `dt->dequantize` — produces NaN in ~5% of the 256-element
+lanes. Input keys verified finite (`nan_in=0`); NaN emerges inside the
+Hadamard-inverse × codebook path for small-rms post-norm keys.
+
+### Why production isn't affected
+
+Production attention on hybrid arch uses `tq_turbo_kv_4b_attention_ref`
+(rotated-space dot product vs stored-indices × codebook). Never calls
+the full-dequant roundtrip my probe does. Hybrid model outputs stay
+coherent; the NaN lanes are entirely probe-synthesized.
+
+### Methodology lesson
+
+refparity's strength is comparing the SAME code path against a reference.
+This round's probe compared turbo_kv_4b's `quantize`+`dequantize` against
+FP32 — but production on hybrid arch uses a third path (`attention_ref`)
+that bypasses full dequant. Probe measured the wrong thing.
+
+Next-round fix: add a hybrid-aware probe that measures `query @ dequant(K)`
+vs the production `attention_ref(query, K)` — same semantic, actual path
+used. That's the meaningful KV correctness check.
+
+Documented the probe's hybrid-arch limitation in `docs/env_vars.md`.
+
 ## Phase 2 R32 — KV refparity extension: turbo_kv_4b is clean (2026-04-22)
 
 Applied the refparity methodology (that surfaced the BPE and MoE bugs
diff --git a/docs/env_vars.md b/docs/env_vars.md
@@ -19,7 +19,7 @@ here is opt-in; defaults are the tested production path.
 | `TQ_MOE_FAST_EXP` | off | Use Schraudolph fast-exp in MoE SwiGLU (vs exact expf default). ~2% per-call error; may re-introduce long-gen drift |
 | `TQ_MOE_ROUTE_TEMP` | `1.0` (auto-flipped to `2.0` on qwen35moe arch at load time) | Softmax temperature on top-K expert routing. **`2.0` extends Qwen3.6-35B coherence from 117 → 200+ tokens** on the "Once upon a time" drift-trigger prompt (measured R26). Auto-detected for qwen35moe at model load (see also `TQ_NO_MOE_TEMP_AUTO`). Other arch default stays `1.0` (identity). Trade: slightly less decisive routing = slightly broader expert mix, but top-K set unchanged. `"Paris"` factual probe still correct at T=2.0 |
 | `TQ_NO_MOE_TEMP_AUTO` | off | Disable the qwen35moe auto-default flip. Use if you want the prior baseline T=1.0 behavior on Qwen3.6-35B |
-| `TQ_KV_PROBE` | off | Dump per-layer K quantization roundtrip stats (rms, MSE, cosine) at positions 0/25/50/100/200. Useful to verify KV compression is behaving uniformly across layers and not drifting over position. See the R32 finding — turbo_kv_4b holds cosine ≥0.994 across all layers and positions on Llama-3.2-1B |
+| `TQ_KV_PROBE` | off | Dump per-layer K quantization roundtrip stats (rms, MSE, cosine) at positions 0/25/50/100/200. On **Llama-3.x** turbo_kv_4b holds cosine ≥0.994 cleanly (R32). On **Qwen3 non-hybrid** (Qwen3-0.6B) also clean. On **hybrid arch** (Qwen3.5-4B, Qwen3.6-35B: DeltaNet + self-attn) the probe's **full-dequant roundtrip** produces NaN in ~5% of lanes due to edge-case in turbo_kv_4b's Hadamard-inverse + codebook round-trip for small-rms post-norm keys. This is a **probe artifact, not a production bug** — production attention uses `attention_ref` (rotated-space dot product, no full dequant). The probe is thus useful on non-hybrid arch; treat hybrid readings with skepticism |
 
 ## Quality / correctness
 
diff --git a/src/engine/tq_transformer.c b/src/engine/tq_transformer.c
@@ -1850,9 +1850,28 @@ static void self_attn_forward(tq_model_t* model, tq_state_t* s, int l, int pos)
         for(int i=0;i<head_dim;i++){if(dbg_key[i]<dbg_mn)dbg_mn=dbg_key[i];if(dbg_key[i]>dbg_mx)dbg_mx=dbg_key[i];if(fabsf(dbg_key[i])>sqrtf(cd1/head_dim)*0.5f)nz++;}
         if (getenv("TQ_DEBUG"))
             fprintf(stderr,"[DEBUG] key dist: min=%.2f max=%.2f nonzero(>0.5rms)=%d/%d\n",dbg_mn,dbg_mx,nz,head_dim);
-        fprintf(stderr,"[kv-probe] L%d pos=%d rms=%.4f mse=%.6f cos=%.6f (%s)\n",
-                l, pos, sqrtf(cd1/head_dim), mse/head_dim,
-                cn/(sqrtf(cd1)*sqrtf(cd2)+1e-10f),
+        int nan_out = 0;
+        for (int i = 0; i < head_dim; i++)
+            if (!isfinite(rc[i])) nan_out++;
+        /* Recompute MSE/cosine ignoring NaN lanes so stats are meaningful
+         * on hybrid arch where full-roundtrip turbo_kv_4b produces edge-case
+         * NaN in ~5% of lanes (does not affect production — attention path
+         * uses rotated-space dot via attention_ref, not full dequant). */
+        float mse_f=0, cn_f=0, cd1_f=0, cd2_f=0; int n_finite=0;
+        for (int i = 0; i < head_dim; i++) {
+            if (!isfinite(rc[i])) continue;
+            float d = dbg_key[i] - rc[i];
+            mse_f += d*d;
+            cn_f  += dbg_key[i]*rc[i];
+            cd1_f += dbg_key[i]*dbg_key[i];
+            cd2_f += rc[i]*rc[i];
+            n_finite++;
+        }
+        float cos_f = cn_f/(sqrtf(cd1_f)*sqrtf(cd2_f)+1e-10f);
+        float mse_per_elem_f = n_finite > 0 ? mse_f/n_finite : 0.0f;
+        fprintf(stderr,"[kv-probe] L%d pos=%d rms=%.4f mse=%.6f cos=%.6f nan=%d/%d (%s)\n",
+                l, pos, sqrtf(cd1/head_dim), mse_per_elem_f, cos_f,
+                nan_out, head_dim,
                 save_pre_norm_keys ? "pre-norm" : "post-norm");
     }
     float kv_prescale = 1.0f;