
Commit 4b6019e

unamedkr and claude committed
debug(kv): refined probe — hybrid arch limitation surfaced (not a production bug)
Extended R32 TQ_KV_PROBE to Qwen3 family. Findings:

  Llama-3.2-1B non-hybrid:    cos 0.994-0.997, MSE 0.02-0.09, 0/64 NaN
  Qwen3-0.6B non-hybrid:      cos 0.995-0.997, MSE 0.02-4.4, 0/128 NaN
  Qwen3.5-4B DeltaNet+attn:   inf/NaN, 6/256 NaN lanes
  Qwen3.6-35B MoE+DeltaNet:   inf/NaN, 6/256 NaN lanes

On hybrid arch the probe's full-dequant roundtrip produces NaN in ~5% of
lanes due to Hadamard-inverse × codebook edge case for small-rms post-norm
keys. Input verified finite (nan_in=0).

Production unaffected: attention on hybrid uses
tq_turbo_kv_4b_attention_ref (rotated-space dot, no full dequant). Probe
measured the wrong path.

Methodology lesson: refparity's value is comparing the SAME code path vs a
reference. Probe chose a code path production doesn't use → false positive
on hybrid. Documented in env_vars.md; next-round fix is a
production-path-matching probe (query @ dequant(K) vs attention_ref).

Refined probe to recompute stats excluding NaN lanes so the signal remains
useful. Llama cos now cleanly 0.995+, 0/64 NaN (confirms R32).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 4d378f0 commit 4b6019e

3 files changed

Lines changed: 61 additions & 4 deletions


.claude/state.md

Lines changed: 38 additions & 0 deletions
@@ -3,6 +3,44 @@
 **Last updated**: 2026-04-21 (Phase 1 refparity ★)
 **Session HEAD**: Reference-parity framework (tools/refparity/) LANDED — HF vs engine per-layer diff, pos-aligned, post_norm-aware.
 
+## Phase 2 R33 — KV probe: hybrid arch limitation surfaced, production unaffected (2026-04-22)
+
+Extended R32's `TQ_KV_PROBE` to Qwen3 family:
+
+| arch | pos | cosine range | MSE range | NaN lanes |
+|---|---|---|---|---|
+| Llama-3.2-1B (non-hybrid) | 25-200 | 0.994-0.997 | 0.018-0.087 | 0/64 |
+| Qwen3-0.6B (non-hybrid, QK-norm) | 25, 50 | 0.995-0.997 | 0.024-4.4 | 0/128 |
+| Qwen3.5-4B (DeltaNet+attn hybrid) | 25 || inf | **6/256** |
+| Qwen3.6-35B (MoE+DeltaNet hybrid) | 0-200 || inf | **6/256** |
+
+### The finding
+
+On hybrid arch, the probe's full-dequant roundtrip — `dt->quantize`
+followed by `dt->dequantize` — produces NaN in ~5% of the 256-element
+lanes. Input keys verified finite (`nan_in=0`); NaN emerges inside the
+Hadamard-inverse × codebook path for small-rms post-norm keys.
+
+### Why production isn't affected
+
+Production attention on hybrid arch uses `tq_turbo_kv_4b_attention_ref`
+(rotated-space dot product vs stored-indices × codebook). Never calls
+the full-dequant roundtrip my probe does. Hybrid model outputs stay
+coherent; the NaN lanes are entirely probe-synthesized.
+
+### Methodology lesson
+
+refparity's strength is comparing the SAME code path against a reference.
+This round's probe compared turbo_kv_4b's `quantize`+`dequantize` against
+FP32 — but production on hybrid arch uses a third path (`attention_ref`)
+that bypasses full dequant. Probe measured the wrong thing.
+
+Next-round fix: add a hybrid-aware probe that measures `query @ dequant(K)`
+vs the production `attention_ref(query, K)` — same semantic, actual path
+used. That's the meaningful KV correctness check.
+
+Documented the probe's hybrid-arch limitation in `docs/env_vars.md`.
+
 ## Phase 2 R32 — KV refparity extension: turbo_kv_4b is clean (2026-04-22)
 
 Applied the refparity methodology (that surfaced the BPE and MoE bugs

docs/env_vars.md

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@ here is opt-in; defaults are the tested production path.
 | `TQ_MOE_FAST_EXP` | off | Use Schraudolph fast-exp in MoE SwiGLU (vs exact expf default). ~2% per-call error; may re-introduce long-gen drift |
 | `TQ_MOE_ROUTE_TEMP` | `1.0` (auto-flipped to `2.0` on qwen35moe arch at load time) | Softmax temperature on top-K expert routing. **`2.0` extends Qwen3.6-35B coherence from 117 → 200+ tokens** on the "Once upon a time" drift-trigger prompt (measured R26). Auto-detected for qwen35moe at model load (see also `TQ_NO_MOE_TEMP_AUTO`). Other arch default stays `1.0` (identity). Trade: slightly less decisive routing = slightly broader expert mix, but top-K set unchanged. `"Paris"` factual probe still correct at T=2.0 |
 | `TQ_NO_MOE_TEMP_AUTO` | off | Disable the qwen35moe auto-default flip. Use if you want the prior baseline T=1.0 behavior on Qwen3.6-35B |
-| `TQ_KV_PROBE` | off | Dump per-layer K quantization roundtrip stats (rms, MSE, cosine) at positions 0/25/50/100/200. Useful to verify KV compression is behaving uniformly across layers and not drifting over position. See the R32 finding — turbo_kv_4b holds cosine ≥0.994 across all layers and positions on Llama-3.2-1B |
+| `TQ_KV_PROBE` | off | Dump per-layer K quantization roundtrip stats (rms, MSE, cosine) at positions 0/25/50/100/200. On **Llama-3.x** turbo_kv_4b holds cosine ≥0.994 cleanly (R32). On **Qwen3 non-hybrid** (Qwen3-0.6B) also clean. On **hybrid arch** (Qwen3.5-4B, Qwen3.6-35B: DeltaNet + self-attn) the probe's **full-dequant roundtrip** produces NaN in ~5% of lanes due to edge-case in turbo_kv_4b's Hadamard-inverse + codebook round-trip for small-rms post-norm keys. This is a **probe artifact, not a production bug** — production attention uses `attention_ref` (rotated-space dot product, no full dequant). The probe is thus useful on non-hybrid arch; treat hybrid readings with skepticism |
 
 ## Quality / correctness
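
Since the `TQ_KV_PROBE` row now asks readers to treat hybrid readings with skepticism, it may help to filter probe output mechanically. The following is a minimal sketch that parses one line in the `[kv-probe]` format introduced by this commit and flags readings that either contain non-finite lanes or fall below a cosine floor. The struct, the function names, and the idea of a single cosine floor are assumptions made for illustration; only the line format comes from the diff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical holder for the fields of one [kv-probe] line. */
typedef struct {
    int   layer, pos;
    int   nan_lanes, total_lanes;
    float rms, mse, cosine;
} kv_probe_line;

/* Parse a line of the form
 *   [kv-probe] L<l> pos=<p> rms=<f> mse=<f> cos=<f> nan=<n>/<d> (<mode>)
 * Returns 1 when all seven numeric fields were read, 0 otherwise. */
static int kv_probe_parse(const char *line, kv_probe_line *out)
{
    assert(line != NULL && out != NULL);
    int n = sscanf(line,
                   "[kv-probe] L%d pos=%d rms=%f mse=%f cos=%f nan=%d/%d",
                   &out->layer, &out->pos, &out->rms, &out->mse,
                   &out->cosine, &out->nan_lanes, &out->total_lanes);
    return n == 7;
}

/* A reading is taken at face value only when no lane was non-finite
 * and the cosine meets the caller's floor (for example 0.994f). */
static int kv_probe_trustworthy(const kv_probe_line *p, float cos_floor)
{
    return p->nan_lanes == 0 && p->cosine >= cos_floor;
}
```

Lines printed by the older probe format (without the `nan=` field) fail to parse and return 0, so a filter built on this sketch skips them instead of misreading them.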

src/engine/tq_transformer.c

Lines changed: 22 additions & 3 deletions
@@ -1850,9 +1850,28 @@ static void self_attn_forward(tq_model_t* model, tq_state_t* s, int l, int pos)
     for(int i=0;i<head_dim;i++){if(dbg_key[i]<dbg_mn)dbg_mn=dbg_key[i];if(dbg_key[i]>dbg_mx)dbg_mx=dbg_key[i];if(fabsf(dbg_key[i])>sqrtf(cd1/head_dim)*0.5f)nz++;}
     if (getenv("TQ_DEBUG"))
         fprintf(stderr,"[DEBUG] key dist: min=%.2f max=%.2f nonzero(>0.5rms)=%d/%d\n",dbg_mn,dbg_mx,nz,head_dim);
-    fprintf(stderr,"[kv-probe] L%d pos=%d rms=%.4f mse=%.6f cos=%.6f (%s)\n",
-            l, pos, sqrtf(cd1/head_dim), mse/head_dim,
-            cn/(sqrtf(cd1)*sqrtf(cd2)+1e-10f),
+    int nan_out = 0;
+    for (int i = 0; i < head_dim; i++)
+        if (!isfinite(rc[i])) nan_out++;
+    /* Recompute MSE/cosine ignoring NaN lanes so stats are meaningful
+     * on hybrid arch where full-roundtrip turbo_kv_4b produces edge-case
+     * NaN in ~5% of lanes (does not affect production — attention path
+     * uses rotated-space dot via attention_ref, not full dequant). */
+    float mse_f=0, cn_f=0, cd1_f=0, cd2_f=0; int n_finite=0;
+    for (int i = 0; i < head_dim; i++) {
+        if (!isfinite(rc[i])) continue;
+        float d = dbg_key[i] - rc[i];
+        mse_f += d*d;
+        cn_f += dbg_key[i]*rc[i];
+        cd1_f += dbg_key[i]*dbg_key[i];
+        cd2_f += rc[i]*rc[i];
+        n_finite++;
+    }
+    float cos_f = cn_f/(sqrtf(cd1_f)*sqrtf(cd2_f)+1e-10f);
+    float mse_per_elem_f = n_finite > 0 ? mse_f/n_finite : 0.0f;
+    fprintf(stderr,"[kv-probe] L%d pos=%d rms=%.4f mse=%.6f cos=%.6f nan=%d/%d (%s)\n",
+            l, pos, sqrtf(cd1/head_dim), mse_per_elem_f, cos_f,
+            nan_out, head_dim,
             save_pre_norm_keys ? "pre-norm" : "post-norm");
 }
 float kv_prescale = 1.0f;
