|
| 1 | +# SESSION HANDOVER: Calibration, Encoding & Cronbach Alpha Validation |
| 2 | + |
| 3 | +## HYPOTHESIS |
| 4 | + |
| 5 | +The thinking engine's distance tables are a MEASUREMENT INSTRUMENT. |
| 6 | +Like any instrument, they need validity testing before the measurements |
| 7 | +mean anything. The hypothesis chain: |
| 8 | + |
| 9 | +### H1: BF16 truncation causes systematic rank flips |
| 10 | +``` |
| 11 | +TEST: For 10,000 centroid pairs: |
| 12 | + Compute cos from ONNX f32 weights → rank order R_f32 |
| 13 | + Compute cos from GGUF BF16 weights → rank order R_bf16 |
| 14 | + Count rank disagreements where |cos_f32| < 0.008 (BF16 uncertainty zone) |
| 15 | + |
| 16 | +PREDICT: ~5% of pairs within ±0.008 of bucket boundaries flip rank. |
| 17 | +MEASURE: Spearman ρ(R_f32, R_bf16) falls below 1.0 by an amount consistent with this flip rate. |
| 18 | +VALIDATE: If ρ matches prediction → we understand the truncation physics. |
| 19 | +``` |
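The 0.008 threshold in H1 is close to 2^-7 = 0.0078125, which is the worst-case relative error when a single normal f32 value is truncated to BF16 (7 explicit mantissa bits). That reading of the threshold is an assumption, not something the handover states. The sketch below only shows the per-value truncation error; it assumes round-toward-zero (real converters may round to nearest even), and it does not model how per-weight error propagates into a cosine. The function names are hypothetical.

```rust
/// Truncate an f32 to BF16 precision by zeroing the low 16 bits.
/// Assumption: round-toward-zero; many converters round to nearest even.
fn truncate_to_bf16(x: f32) -> f32 {
    f32::from_bits(x.to_bits() & 0xFFFF_0000)
}

/// Relative error introduced by truncation: |x - bf16(x)| / |x|.
/// For normal (non-subnormal) inputs this stays below 2^-7 = 0.0078125.
fn relative_truncation_error(x: f32) -> f32 {
    if x == 0.0 {
        return 0.0;
    }
    ((x - truncate_to_bf16(x)) / x).abs()
}
```

Feeding the f32 weights through `truncate_to_bf16` before computing cosines would give a synthetic BF16 baseline to compare against the GGUF-derived one.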
| 20 | + |
| 21 | +### H2: γ+φ encoding preserves more rank order than linear CDF |
| 22 | +``` |
| 23 | +TEST: Same 10,000 pairs: |
| 24 | + Encode cos → u8 via linear CDF → rank order R_linear |
| 25 | + Encode cos → u8 via γ+φ redistribution → rank order R_gamma |
| 26 | + |
| 27 | +PREDICT: ρ(R_f32, R_gamma) > ρ(R_f32, R_linear) |
| 28 | + because γ concentrates resolution near the gate decision boundary |
| 29 | + where BF16 truncation causes the most rank flips. |
| 30 | +VALIDATE: If γ+φ > linear → golden ratio redistribution IS calibration. |
| 31 | +``` |
| 32 | + |
| 33 | +### H3: i8 signed preserves more information than u8 unsigned for gate-heavy roles |
| 34 | +``` |
| 35 | +TEST: For ffn_gate role specifically: |
| 36 | + Encode cos → u8[0,255] → Spearman vs f32 → ρ_unsigned |
| 37 | + Encode cos → i8[-128,+127] → Spearman vs f32 → ρ_signed |
| 38 | + |
| 39 | +PREDICT: ρ_signed > ρ_unsigned for gate (68.9% near zero, sign matters) |
| 40 | + ρ_signed ≈ ρ_unsigned for Q (positive-skewed, sign rarely matters) |
| 41 | +VALIDATE: If gate benefits but Q doesn't → sign preservation is role-specific. |
| 42 | +``` |
| 43 | + |
| 44 | +### H4: ICC profile correction brings ALL encoding paths to ρ > 0.998 |
| 45 | +``` |
| 46 | +TEST: After ICC correction on each path: |
| 47 | + ρ(R_f32, R_corrected_linear) > 0.998 |
| 48 | + ρ(R_f32, R_corrected_gamma) > 0.998 |
| 49 | + ρ(R_f32, R_corrected_signed) > 0.998 |
| 50 | + |
| 51 | +PREDICT: ICC correction absorbs the residual error regardless of encoding path. |
| 52 | + The CHEAPEST encoding + ICC ≈ the BEST encoding without ICC. |
| 53 | +VALIDATE: If all paths reach 0.998 after ICC → encoding choice doesn't matter, |
| 54 | + only the ICC quality matters. Simplify to cheapest encoding + good ICC. |
| 55 | +``` |
| 56 | + |
| 57 | +### H5: Multi-lens Cronbach alpha shows internal consistency |
| 58 | +``` |
| 59 | +TEST: For N sentence pairs, compute distances via all 6 lenses. |
| 60 | + Each lens = one "item" in the psychometric instrument. |
| 61 | + Cronbach α = internal consistency of the multi-lens measurement. |
| 62 | + |
| 63 | +PREDICT: α > 0.90 for similar-pair detection (all lenses agree on "similar") |
| 64 | + α < 0.70 for relevance detection (lenses disagree = DIFFERENT information) |
| 65 | +VALIDATE: High α for similarity = lenses are redundant (use one, save compute). |
| 66 | + Low α for relevance = lenses are complementary (use all, superposition helps). |
| 67 | +``` |
| 68 | + |
| 69 | +--- |
| 70 | + |
| 71 | +## TESTING PROTOCOL |
| 72 | + |
| 73 | +### Phase 1: Ground Truth (ONNX f32) |
| 74 | + |
| 75 | +```python |
| 76 | +# Load Jina v5 ONNX via rten |
| 77 | +import rten # or: use burn, or: use candle |
| 78 | + |
| 79 | +model = rten.load("model.onnx", "model.onnx_data") |
| 80 | +tokenizer = load_tokenizer("tokenizer.json") |
| 81 | + |
| 82 | +# Generate ground truth for 1000 sentence pairs |
| 83 | +pairs = load_test_pairs() # diverse: similar, dissimilar, related, unrelated |
| 84 | +f32_cosines = [] |
| 85 | +for a, b in pairs: |
| 86 | + emb_a = model.forward(tokenizer.encode(a)) # f32 embedding |
| 87 | + emb_b = model.forward(tokenizer.encode(b)) |
| 88 | + f32_cosines.append(cosine(emb_a, emb_b)) |
| 89 | + |
| 90 | +# This is the RAW. Everything calibrates against this. |
| 91 | +``` |
| 92 | + |
| 93 | +### Phase 2: BF16 Baseline |
| 94 | + |
| 95 | +```python |
| 96 | +# Stream Jina v5 F16 GGUF |
| 97 | +bf16_weights = stream_gguf("v5-small-text-matching-F16.gguf", "token_embd.weight") |
| 98 | + |
| 99 | +# Same CLAM centroids, but from BF16 |
| 100 | +bf16_centroids = clam_sample(bf16_weights, n=256) |
| 101 | +bf16_table = build_cosine_table(bf16_centroids) # f32 cosine on bf16 inputs |
| 102 | + |
| 103 | +# Map same sentences through BF16-derived codebook |
| 104 | +bf16_distances = [] |
| 105 | +for a, b in pairs: |
| 106 | + ca = codebook_lookup(tokenizer.encode(a), bf16_assignments) |
| 107 | + cb = codebook_lookup(tokenizer.encode(b), bf16_assignments) |
| 108 | + bf16_distances.append(bf16_table[ca][cb]) |
| 109 | +``` |
| 110 | + |
| 111 | +### Phase 3: Encoding Paths |
| 112 | + |
| 113 | +```rust |
| 114 | +// For each encoding path, produce a u8/i8 distance table: |
| 115 | + |
| 116 | +// Path 1: Linear CDF (current HDR encoding) |
| 117 | +let linear_table = hdr_cdf_encode(&raw_cosines, 256); // u8[0,255] |
| 118 | + |
| 119 | +// Path 2: γ+φ redistributed |
| 120 | +let gamma_table = gamma_phi_encode(&raw_cosines, /* gamma = */ 1.50, 256); |
| 121 | + |
| 122 | +// Path 3: Signed i8 |
| 123 | +let signed_table = signed_encode(&raw_cosines, 256); // i8[-128,+127] |
| 124 | + |
| 125 | +// Path 4: γ+φ signed |
| 126 | +let gamma_signed_table = gamma_phi_signed_encode(&raw_cosines, /* gamma = */ 1.50, 256); |
| 127 | + |
| 128 | +// Path 5: highheelbgz spiral |
| 129 | +let spiral_table = spiral_encode(&raw_cosines, /* stride = */ golden_ratio, 256); |
| 130 | +``` |
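The encoders named in Phase 3 live in the repo and are not reproduced here. As a rough illustration of what a CDF-based u8 encoding generally does, here is a minimal sketch that assumes it means rank-based (empirical CDF) quantization. The real `hdr_cdf_encode` may work differently, and `cdf_encode_u8` is a hypothetical name.

```rust
/// Map each value to a u8 code by its empirical-CDF position.
/// code = floor(first_rank_of_tie_group * 256 / n), so equal inputs share
/// a code and the mapping is monotone non-decreasing in the input value.
/// Hypothetical helper; not the repo's hdr_cdf_encode.
fn cdf_encode_u8(values: &[f32]) -> Vec<u8> {
    let n = values.len();
    // Indices of `values` in ascending order of value.
    let mut order: Vec<usize> = (0..n).collect();
    order.sort_by(|&a, &b| values[a].total_cmp(&values[b]));

    let mut codes = vec![0u8; n];
    let mut group_start = 0;
    for pos in 0..n {
        // A new tie group starts whenever the sorted value changes.
        if pos > 0 && values[order[pos]] != values[order[pos - 1]] {
            group_start = pos;
        }
        // group_start < n, so the quotient is at most 255.
        codes[order[pos]] = (group_start * 256 / n) as u8;
    }
    codes
}
```

Because the code depends only on rank, this kind of encoding preserves order between distinct codes but collapses every pair that lands in the same bucket into a tie. Those ties are what the Spearman measurement in Phase 4 penalizes.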
| 131 | + |
| 132 | +### Phase 4: Spearman ρ per path |
| 133 | + |
| 134 | +```rust |
| 135 | +fn evaluate_path(f32_cosines: &[f32], encoded_table: &[u8], assignments: &[u16], pairs: &[(Text, Text)]) -> f32 { |
| 136 | + let encoded_distances: Vec<f32> = pairs.iter() |
| 137 | + .map(|(a, b)| { |
| 138 | + let ca = assignments[tokenize(a)]; |
| 139 | + let cb = assignments[tokenize(b)]; |
| 140 | + encoded_table[ca as usize * N + cb as usize] as f32 |
| 141 | + }) |
| 142 | + .collect(); |
| 143 | + |
| 144 | + spearman(&f32_cosines, &encoded_distances) |
| 145 | +} |
| 146 | + |
| 147 | +// Measure BEFORE ICC: |
| 148 | +let rho_linear = evaluate_path(&f32_cosines, &linear_table, ...); |
| 149 | +let rho_gamma = evaluate_path(&f32_cosines, &gamma_table, ...); |
| 150 | +let rho_signed = evaluate_path(&f32_cosines, &signed_table, ...); |
| 151 | +let rho_spiral = evaluate_path(&f32_cosines, &spiral_table, ...); |
| 152 | + |
| 153 | +// Build ICC profiles: |
| 154 | +let icc_linear = LensProfile::build("jina-v5", "token_embd", Linear, &f32_cosines, &linear_table, N); |
| 155 | +let icc_gamma = LensProfile::build("jina-v5", "token_embd", GammaPhi, &f32_cosines, &gamma_table, N); |
| 156 | + |
| 157 | +// Measure AFTER ICC: |
| 158 | +let rho_linear_corrected = evaluate_corrected(&f32_cosines, &linear_table, &icc_linear, ...); |
| 159 | +let rho_gamma_corrected = evaluate_corrected(&f32_cosines, &gamma_table, &icc_gamma, ...); |
| 160 | +``` |
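Phase 4 calls `spearman(...)` without defining it. Below is a minimal sketch assuming the usual definition: the Pearson correlation of average ranks, with tied values sharing their mean rank. The harness in `calibrate_lenses.rs` may differ, and this version returns `f64` rather than the `f32` used above. The helper names `average_ranks` and `pearson` are hypothetical.

```rust
/// 1-based average ranks; tied values share the mean of their positions.
fn average_ranks(values: &[f32]) -> Vec<f64> {
    let n = values.len();
    let mut order: Vec<usize> = (0..n).collect();
    order.sort_by(|&a, &b| values[a].total_cmp(&values[b]));

    let mut ranks = vec![0.0f64; n];
    let mut i = 0;
    while i < n {
        // Extend j to the end of the run of values equal to values[order[i]].
        let mut j = i;
        while j + 1 < n && values[order[j + 1]] == values[order[i]] {
            j += 1;
        }
        // Sorted positions i..=j are tied; their mean 1-based rank:
        let mean_rank = (i + j) as f64 / 2.0 + 1.0;
        for k in i..=j {
            ranks[order[k]] = mean_rank;
        }
        i = j + 1;
    }
    ranks
}

/// Pearson correlation of two equal-length slices.
/// Returns NaN if either input has zero variance.
fn pearson(x: &[f64], y: &[f64]) -> f64 {
    let n = x.len() as f64;
    let mx = x.iter().sum::<f64>() / n;
    let my = y.iter().sum::<f64>() / n;
    let (mut sxy, mut sxx, mut syy) = (0.0f64, 0.0f64, 0.0f64);
    for (a, b) in x.iter().zip(y) {
        sxy += (a - mx) * (b - my);
        sxx += (a - mx).powi(2);
        syy += (b - my).powi(2);
    }
    sxy / (sxx * syy).sqrt()
}

/// Spearman rho = Pearson correlation of the average ranks.
fn spearman(a: &[f32], b: &[f32]) -> f64 {
    assert_eq!(a.len(), b.len(), "inputs must have equal length");
    pearson(&average_ranks(a), &average_ranks(b))
}
```

Tie handling matters here because a 256-level table maps many distinct f32 cosines to the same code. With average ranks, those collapsed pairs lower ρ instead of being ordered arbitrarily by the sort.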
| 161 | + |
| 162 | +### Phase 5: Cronbach Alpha |
| 163 | + |
| 164 | +```rust |
| 165 | +/// Cronbach's alpha for multi-lens internal consistency. |
| 166 | +/// |
| 167 | +/// items[lens][pair] = distance measured by this lens for this pair. |
| 168 | +/// Higher α = more agreement between lenses. |
| 169 | +fn cronbach_alpha(items: &[Vec<f32>]) -> f32 { |
| 170 | + let k = items.len() as f32; // number of lenses |
| 171 | + let n = items[0].len(); // number of pairs |
| 172 | + |
| 173 | + // Total score variance |
| 174 | + let totals: Vec<f32> = (0..n) |
| 175 | + .map(|pair| items.iter().map(|lens| lens[pair]).sum::<f32>()) |
| 176 | + .collect(); |
| 177 | + let var_total = variance(&totals); |
| 178 | + |
| 179 | + // Sum of item variances |
| 180 | + let var_sum: f32 = items.iter() |
| 181 | + .map(|lens| variance(lens)) |
| 182 | + .sum(); |
| 183 | + |
| 184 | + // α = (k / (k-1)) × (1 - Σvar_item / var_total) |
| 185 | + (k / (k - 1.0)) * (1.0 - var_sum / var_total) |
| 186 | +} |
| 187 | + |
| 188 | +fn variance(data: &[f32]) -> f32 { |
| 189 | + let n = data.len() as f32; |
| 190 | + let mean = data.iter().sum::<f32>() / n; |
| 191 | + data.iter().map(|x| (x - mean).powi(2)).sum::<f32>() / n |
| 192 | +} |
| 193 | + |
| 194 | +// Measure: |
| 195 | +let lens_distances = vec![ |
| 196 | + jina_distances, // lens 0 |
| 197 | + bge_distances, // lens 1 |
| 198 | + reranker_distances, // lens 2 |
| 199 | + reader_distances, // lens 3 |
| 200 | + qwopus_distances, // lens 4 |
| 201 | +]; |
| 202 | + |
| 203 | +let alpha = cronbach_alpha(&lens_distances); |
| 204 | +// α > 0.90: lenses are redundant (one is enough for this task) |
| 205 | +// α 0.70-0.90: lenses agree mostly (superposition adds a little) |
| 206 | +// α < 0.70: lenses see different things (superposition is valuable) |
| 207 | +``` |
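The α formula above divides by `k - 1` and by `var_total`, so it is undefined for a single lens or for totals with no variance. The sketch below is a guarded variant (hypothetical name `cronbach_alpha_checked`, using `f64`) that can be checked against two limiting cases: identical lenses should give α = 1, and uncorrelated lenses should give α = 0. It follows the same formula and the same population variance as the block above and is not the repo's implementation.

```rust
/// Population variance (divides by n), matching the definition used above.
fn variance(data: &[f64]) -> f64 {
    let n = data.len() as f64;
    let mean = data.iter().sum::<f64>() / n;
    data.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n
}

/// Cronbach's alpha with guards for the undefined cases.
/// items[lens][pair] = distance measured by this lens for this pair.
/// Returns None for fewer than 2 lenses, empty or ragged input,
/// or zero variance of the per-pair totals.
fn cronbach_alpha_checked(items: &[Vec<f64>]) -> Option<f64> {
    let k = items.len();
    if k < 2 {
        return None;
    }
    let n = items[0].len();
    if n == 0 || items.iter().any(|lens| lens.len() != n) {
        return None;
    }

    // Per-pair total across all lenses.
    let totals: Vec<f64> = (0..n)
        .map(|pair| items.iter().map(|lens| lens[pair]).sum::<f64>())
        .collect();
    let var_total = variance(&totals);
    if var_total == 0.0 {
        return None;
    }

    // Sum of the individual lens variances.
    let var_sum: f64 = items.iter().map(|lens| variance(lens)).sum();

    let k = k as f64;
    Some((k / (k - 1.0)) * (1.0 - var_sum / var_total))
}
```

Running these two limiting cases before the real lens data is a cheap way to confirm that the harness reports α on the expected scale.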
| 208 | + |
| 209 | +--- |
| 210 | + |
| 211 | +## SYNTHESIS: What the Results Tell Us |
| 212 | + |
| 213 | +### If H1 confirms (BF16 flips ~5% of ranks): |
| 214 | +→ boundary_risk metadata is ESSENTIAL |
| 215 | +→ 95/5 split: fast cascade for safe pairs, LEAF validation for boundary pairs |
| 216 | +→ γ+φ should reduce the 5% by moving boundaries away from BF16 quant steps |
| 217 | + |
| 218 | +### If H2 confirms (γ+φ > linear): |
| 219 | +→ γ+φ becomes the DEFAULT encoding (not optional, mandatory) |
| 220 | +→ Per-role γ offsets are critical (Gate=1.50, Q=0.37) |
| 221 | +→ The golden ratio IS the calibration (not a cosmetic choice) |
| 222 | + |
| 223 | +### If H3 confirms (i8 > u8 for gate, ≈ for Q): |
| 224 | +→ i8 for gate-modulated roles (K, V, Up), u8 for extern roles (Q, Down) |
| 225 | +→ Mixed encoding per role within the same layer |
| 226 | +→ SiLU-ONNX is definitively unnecessary |
| 227 | + |
| 228 | +### If H4 confirms (ICC brings all to 0.998): |
| 229 | +→ Encoding choice is secondary to ICC quality |
| 230 | +→ Cheapest encoding + good ICC = optimal |
| 231 | +→ Focus engineering effort on ICC, not on better encodings |
| 232 | + |
| 233 | +### If H5 shows high Cronbach α for similarity: |
| 234 | +→ Multi-lens superposition is REDUNDANT for similarity tasks |
| 235 | +→ Use ONE lens (cheapest) for similarity, save 5× compute |
| 236 | +→ Reserve multi-lens for tasks where α < 0.70 (complementary lenses) |
| 237 | + |
| 238 | +### If H5 shows low Cronbach α for relevance: |
| 239 | +→ Multi-lens superposition IS VALUABLE for relevance tasks |
| 240 | +→ Embedding (Jina) sees different things than reranker |
| 241 | +→ The DISAGREEMENT between lenses IS information |
| 242 | +→ Keep all lenses, the superposition product captures what no single lens sees |
| 243 | + |
| 244 | +--- |
| 245 | + |
| 246 | +## CODE LOCATIONS |
| 247 | + |
| 248 | +``` |
| 249 | +Contract DTOs: |
| 250 | + crates/lance-graph-contract/src/high_heel.rs |
| 251 | + → LensProfile (ICC profile DTO) |
| 252 | + → LensConfig (6-lane registry) |
| 253 | + → EncodingPath enum |
| 254 | + → LENS_REGISTRY static array |
| 255 | + |
| 256 | +Thinking Engine: |
| 257 | + crates/thinking-engine/src/jina_lens.rs (Jina v3 lens, 250K vocab) |
| 258 | + crates/thinking-engine/src/bge_m3_lens.rs (BGE-M3 lens, 250K vocab) |
| 259 | + crates/thinking-engine/src/reranker_lens.rs (Reranker v3, 151K vocab, NEW) |
| 260 | + crates/thinking-engine/src/silu_correction.rs (may be replaced by i8) |
| 261 | + crates/thinking-engine/src/engine.rs (MatVec cycle) |
| 262 | + |
| 263 | +Calibration: |
| 264 | + crates/thinking-engine/examples/calibrate_lenses.rs (Spearman + ICC harness) |
| 265 | + crates/thinking-engine/examples/hdr_audit.rs (all models compared) |
| 266 | + crates/thinking-engine/examples/silu_crosscheck.rs (u8 vs corrected) |
| 267 | + |
| 268 | +Codec: |
| 269 | + crates/bgz-tensor/src/gamma_phi.rs (γ+φ encode/decode) |
| 270 | + crates/bgz-tensor/src/codebook_calibrated.rs (two-pass build) |
| 271 | + crates/highheelbgz/src/ (spiral stride, golden ratio) |
| 272 | + |
| 273 | +ONNX: |
| 274 | + AdaWorldAPI/rten (ONNX runtime, your fork) |
| 275 | + AdaWorldAPI/rten-ndarray-demo (rten ↔ ndarray bridge) |
| 276 | + |
| 277 | +Ground Truth Models: |
| 278 | + jinaai/jina-embeddings-v5-text-small-text-matching |
| 279 | + → model.onnx + model.onnx_data (2.4 GB, f32 precision) |
| 280 | + → v5-small-text-matching-F16.gguf (1.2 GB, streamable) |
| 281 | + → tokenizer.json (11.4 MB, real BPE) |
| 282 | + |
| 283 | +Data: |
| 284 | + crates/thinking-engine/data/Qwopus3.5-27B-v3-BF16-silu/ |
| 285 | + → 64 layers × 5 role tables (305 binary files) |
| 286 | + → 248K token assignments |
| 287 | + → tokenizer.json (Qwen2 BPE) |
| 288 | + → layer_stats.json |
| 289 | + crates/thinking-engine/data/jina-v3-hdr/ (64 KB table + 488 KB index) |
| 290 | + crates/thinking-engine/data/bge-m3-hdr/ (64 KB table + 488 KB index) |
| 291 | + crates/thinking-engine/data/jina-reranker-v3-BF16-hdr/ (64 KB table + 296 KB index) |
| 292 | + crates/thinking-engine/data/codebooks/ (8 models, 64×64 tables) |
| 293 | +``` |
| 294 | + |
| 295 | +--- |
| 296 | + |
| 297 | +## IMPLEMENTATION ORDER |
| 298 | + |
| 299 | +``` |
| 300 | +1. Download Jina v5 ONNX (2.4 GB) + GGUF (1.2 GB) + tokenizer |
| 301 | +2. Load ONNX via rten → generate f32 ground truth for 1000 pairs |
| 302 | +3. Stream GGUF → CLAM → build 5 encoding variants |
| 303 | +4. Measure Spearman ρ for each (H1-H3) |
| 304 | +5. Build ICC profiles for each |
| 305 | +6. Measure corrected ρ (H4) |
| 306 | +7. Run all 6 lenses on same pairs → Cronbach α (H5) |
| 307 | +8. Synthesize: which encoding × which role × ICC or not |
| 308 | +9. Encode findings as LensProfile metadata in contract |
| 309 | +10. Re-bake tables with winning encoding per role |
| 310 | + |
| 311 | +Estimated: 3-4 hours for complete validation. |
| 312 | +``` |
| 313 | + |
| 314 | +--- |
| 315 | + |
| 316 | +## SESSION CONTEXT (what was built before this) |
| 317 | + |
| 318 | +``` |
| 319 | +This session (session_01ChLvBfpJS8dQhHxRD4pYNp) delivered: |
| 320 | + 67+ commits across lance-graph + ndarray |
| 321 | + 235K LOC Rust across 18 crates |
| 322 | + |
| 323 | +Key deliverables: |
| 324 | + - Qwopus 27B: 64 layers streamed from 53.8 GB BF16 in 116s |
| 325 | + - SiLU gate correction: 86% material (BUT may be replaced by i8 signed) |
| 326 | + - 4096-centroid codebook: 248K tokens |
| 327 | + - Real Qwen BPE tokenizer |
| 328 | + - Living thought loop (tension-driven autoregressive) |
| 329 | + - MoE architecture (4096 experts, top-128) |
| 330 | + - NARS gate modulator (three modes) |
| 331 | + - Jina Reranker lens (wired, 9 tests) |
| 332 | + - LensProfile ICC DTO |
| 333 | + - LensConfig 6-lane registry |
| 334 | + - Calibration harness (Spearman + ICC builder) |
| 335 | + - OSINT pipeline (spider + OCR + NARS expansion) |
| 336 | + - Wikileaks graph (1,872 nodes) |
| 337 | + - SIMD OCR (10× faster than tesseract) |
| 338 | + - Felt OCR (Base17/polar/palette) |
| 339 | + |
| 340 | +Parallel session doing: |
| 341 | + - i8 signed tables (excitation/inhibition) |
| 342 | + - u8 vs i8 dual-path comparison |
| 343 | + - Started from reranker lens (this session wired it) |
| 344 | + |
| 345 | +This calibration session validates ALL of the above. |
| 346 | +Without it, every measurement drifts. GPS without relativity. |
| 347 | +``` |