Commit 87b304e

Merge pull request #114 from AdaWorldAPI/claude/setup-embedding-pipeline-Fa65C
docs: calibration session handover — H1-H5 hypotheses, Cronbach α, full protocol

5 testable hypotheses:
  H1: BF16 truncation flips ~5% of ranks (bucket boundary effect)
  H2: γ+φ encoding preserves more rank order than linear CDF
  H3: i8 signed > u8 unsigned for gate-heavy roles specifically
  H4: ICC profile correction brings ALL encoding paths to ρ > 0.998
  H5: Cronbach α reveals which tasks need multi-lens vs single lens

Testing protocol:
  Phase 1: ONNX f32 ground truth (rten, Jina v5, 1000 pairs)
  Phase 2: BF16 baseline (stream GGUF, same CLAM)
  Phase 3: 5 encoding paths (linear, γ+φ, i8, γ+φ signed, spiral)
  Phase 4: Spearman ρ before/after ICC per path
  Phase 5: Cronbach α across 6 lenses

Synthesis matrix: which encoding × which role × ICC or not.
Estimated: 3-4 hours. Validates everything built in 67+ commits.

https://claude.ai/code/session_01ChLvBfpJS8dQhHxRD4pYNp
2 parents 4e0298e + 7ffa72f commit 87b304e

2 files changed

Lines changed: 450 additions & 0 deletions

Lines changed: 347 additions & 0 deletions
@@ -0,0 +1,347 @@
# SESSION HANDOVER: Calibration, Encoding & Cronbach Alpha Validation

## HYPOTHESIS

The thinking engine's distance tables are a MEASUREMENT INSTRUMENT.
Like any instrument, they need validity testing before the measurements
mean anything. The hypothesis chain:

### H1: BF16 truncation causes systematic rank flips
```
TEST: For 10,000 centroid pairs:
  Compute cos from ONNX f32 weights → rank order R_f32
  Compute cos from GGUF BF16 weights → rank order R_bf16
  Count rank disagreements where |cos_f32| < 0.008 (BF16 uncertainty zone)

PREDICT: ~5% of pairs within ±0.008 of bucket boundaries flip rank.
MEASURE: Spearman ρ(R_f32, R_bf16) falls below 1.0 by an amount consistent with that flip rate.
VALIDATE: If ρ matches prediction → we understand the truncation physics.
```
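The H1 measurement can be prototyped before any model weights are loaded. The following is a minimal sketch, not the harness in `calibrate_lenses.rs`: it truncates the cosine values themselves to BF16 precision (an assumption made for brevity; the real test truncates the weights and then recomputes cosines), and it uses a plain-Python Spearman ρ with average ranks for ties. The names `truncate_to_bf16`, `average_ranks`, `pearson`, and `spearman` are illustrative, and the input cosines are synthetic.

```python
import random
import struct


def truncate_to_bf16(x: float) -> float:
    """Round toward zero to BF16 precision: keep the top 16 bits of the f32 pattern."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def spearman(xs, ys):
    """Spearman rho = Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Synthetic stand-in for 10,000 centroid-pair cosines.
rng = random.Random(0)
cos_f32 = [rng.uniform(-1.0, 1.0) for _ in range(10_000)]
cos_bf16 = [truncate_to_bf16(c) for c in cos_f32]
rho = spearman(cos_f32, cos_bf16)
print(f"rho(f32, bf16-truncated) = {rho:.6f}")
```

Because value-level truncation is monotone, this toy version can only lose rank information through ties; the flips H1 predicts come from truncating the weights before the dot product, which this sketch does not model.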

### H2: γ+φ encoding preserves more rank order than linear CDF
```
TEST: Same 10,000 pairs:
  Encode cos → u8 via linear CDF → rank order R_linear
  Encode cos → u8 via γ+φ redistribution → rank order R_gamma

PREDICT: ρ(R_f32, R_gamma) > ρ(R_f32, R_linear)
  because γ concentrates resolution near the gate decision boundary
  where BF16 truncation causes the most rank flips.
VALIDATE: If γ+φ > linear → golden ratio redistribution IS calibration.
```
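To make "concentrates resolution" concrete, here is a small sketch comparing an equal-width u8 quantizer with a companded one. This is a stand-in, not the codec in `gamma_phi.rs`: it uses a plain sign-preserving power law with exponent 1/γ, it assumes the decision boundary sits at cos = 0, and it omits the φ (golden ratio) part entirely. The function names are hypothetical.

```python
import math


def quantize_uniform(c: float, levels: int = 256) -> int:
    """Map c in [-1, 1] to a bucket index in [0, levels - 1] with equal-width buckets."""
    idx = int((c + 1.0) / 2.0 * levels)
    return min(levels - 1, max(0, idx))


def quantize_companded(c: float, gamma: float = 1.5, levels: int = 256) -> int:
    """Warp c with a sign-preserving power law, then quantize uniformly.

    The exponent 1/gamma is below 1 for gamma > 1, which stretches the region
    around 0 so that more buckets land there (assumed boundary location).
    """
    warped = math.copysign(abs(c) ** (1.0 / gamma), c)
    return quantize_uniform(warped, levels)


# Count the distinct codes available to cosines in a narrow band around zero.
band = [i / 10_000.0 for i in range(-500, 501)]  # -0.05 .. +0.05
uniform_codes = {quantize_uniform(c) for c in band}
companded_codes = {quantize_companded(c) for c in band}
print(len(uniform_codes), len(companded_codes))
```

More distinct codes in the band means fewer ties among near-boundary pairs, which is the mechanism by which a redistributed encoding could score a higher Spearman ρ than a uniform one. Whether the actual γ+φ path behaves this way is what H2 is meant to test.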

### H3: i8 signed preserves more information than u8 unsigned for gate-heavy roles
```
TEST: For ffn_gate role specifically:
  Encode cos → u8[0,255] → Spearman vs f32 → ρ_unsigned
  Encode cos → i8[-128,+127] → Spearman vs f32 → ρ_signed

PREDICT: ρ_signed > ρ_unsigned for gate (68.9% near zero, sign matters)
         ρ_signed ≈ ρ_unsigned for Q (positive-skewed, sign rarely matters)
VALIDATE: If gate benefits but Q doesn't → sign preservation is role-specific.
```

### H4: ICC profile correction brings ALL encoding paths to ρ > 0.998
```
TEST: After ICC correction on each path:
  ρ(R_f32, R_corrected_linear) > 0.998
  ρ(R_f32, R_corrected_gamma) > 0.998
  ρ(R_f32, R_corrected_signed) > 0.998

PREDICT: ICC correction absorbs the residual error regardless of encoding path.
  The CHEAPEST encoding + ICC ≈ the BEST encoding without ICC.
VALIDATE: If all paths reach 0.998 after ICC → encoding choice doesn't matter,
  only the ICC quality matters. Simplify to cheapest encoding + good ICC.
```

### H5: Multi-lens Cronbach alpha shows internal consistency
```
TEST: For N sentence pairs, compute distances via all 6 lenses.
  Each lens = one "item" in the psychometric instrument.
  Cronbach α = internal consistency of the multi-lens measurement.

PREDICT: α > 0.90 for similar-pair detection (all lenses agree on "similar")
         α < 0.70 for relevance detection (lenses disagree = DIFFERENT information)
VALIDATE: High α for similarity = lenses are redundant (use one, save compute).
          Low α for relevance = lenses are complementary (use all, superposition helps).
```

---

## TESTING PROTOCOL

### Phase 1: Ground Truth (ONNX f32)

```python
# Load Jina v5 ONNX via rten
# Pseudocode: load_tokenizer(), load_test_pairs(), and cosine() are placeholder
# helpers, and the rten calls stand in for whichever runtime is used.
import rten  # or: use burn, or: use candle

model = rten.load("model.onnx", "model.onnx_data")
tokenizer = load_tokenizer("tokenizer.json")

# Generate ground truth for 1000 sentence pairs
pairs = load_test_pairs()  # diverse: similar, dissimilar, related, unrelated
f32_cosines = []
for a, b in pairs:
    emb_a = model.forward(tokenizer.encode(a))  # f32 embedding
    emb_b = model.forward(tokenizer.encode(b))
    f32_cosines.append(cosine(emb_a, emb_b))

# This is the RAW. Everything calibrates against this.
```

### Phase 2: BF16 Baseline

```python
# Stream Jina v5 F16 GGUF
bf16_weights = stream_gguf("v5-small-text-matching-F16.gguf", "token_embd.weight")

# Same CLAM centroids, but from BF16
bf16_centroids = clam_sample(bf16_weights, n=256)
bf16_table = build_cosine_table(bf16_centroids)  # f32 cosine on bf16 inputs

# Map same sentences through BF16-derived codebook
# bf16_assignments is not built in this snippet; it is assumed to be the
# token → centroid index map produced alongside bf16_centroids.
bf16_distances = []
for a, b in pairs:
    ca = codebook_lookup(tokenizer.encode(a), bf16_assignments)
    cb = codebook_lookup(tokenizer.encode(b), bf16_assignments)
    bf16_distances.append(bf16_table[ca][cb])
```
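The `build_cosine_table` step above is left undefined in the snippet. A minimal sketch of what it might look like follows, assuming centroids are plain lists of floats and that the table is stored flat in row-major order (the layout Phase 4 indexes with `ca * N + cb`; the Phase 2 snippet indexes it as a 2D table instead). The zero-vector convention is a choice made here, not something the source specifies.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0  # convention chosen for this sketch
    return dot / (nu * nv)


def build_cosine_table(centroids):
    """Return an N*N row-major list of pairwise cosines between centroids."""
    n = len(centroids)
    table = [0.0] * (n * n)
    for i in range(n):
        for j in range(i, n):
            c = cosine(centroids[i], centroids[j])
            table[i * n + j] = c  # upper triangle
            table[j * n + i] = c  # mirror: cosine is symmetric
    return table


# Tiny worked example with three 2-D centroids.
centroids = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
n = len(centroids)
table = build_cosine_table(centroids)
print(table[0 * n + 1], table[0 * n + 2])
```

With 256 centroids the table has 65,536 entries, which matches the "64 KB table" size listed for the u8 HDR tables once each entry is encoded to one byte.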

### Phase 3: Encoding Paths

```rust
// For each encoding path, produce a u8/i8 distance table:

// Path 1: Linear CDF (current HDR encoding)
let linear_table = hdr_cdf_encode(&raw_cosines, 256); // u8[0,255]

// Path 2: γ+φ redistributed (γ = 1.50; Rust has no named arguments)
let gamma_table = gamma_phi_encode(&raw_cosines, 1.50, 256);

// Path 3: Signed i8
let signed_table = signed_encode(&raw_cosines, 256); // i8[-128,+127]

// Path 4: γ+φ signed (γ = 1.50)
let gamma_signed_table = gamma_phi_signed_encode(&raw_cosines, 1.50, 256);

// Path 5: highheelbgz spiral (stride = golden ratio)
let spiral_table = spiral_encode(&raw_cosines, golden_ratio, 256);
```

### Phase 4: Spearman ρ per path

```rust
// N = number of centroids; encoded_table is an N×N table stored row-major.
// `Text` and `tokenize` are placeholders for the engine's own types.
fn evaluate_path(f32_cosines: &[f32], encoded_table: &[u8], assignments: &[u16], pairs: &[(Text, Text)]) -> f32 {
    let encoded_distances: Vec<f32> = pairs.iter()
        .map(|(a, b)| {
            // u16 centroid ids must be widened to usize before indexing
            let ca = assignments[tokenize(a)] as usize;
            let cb = assignments[tokenize(b)] as usize;
            encoded_table[ca * N + cb] as f32
        })
        .collect();

    spearman(f32_cosines, &encoded_distances)
}

// Measure BEFORE ICC:
let rho_linear = evaluate_path(&f32_cosines, &linear_table, ...);
let rho_gamma = evaluate_path(&f32_cosines, &gamma_table, ...);
// signed_table holds i8 values, so this call needs an i8 variant of evaluate_path
let rho_signed = evaluate_path(&f32_cosines, &signed_table, ...);
let rho_spiral = evaluate_path(&f32_cosines, &spiral_table, ...);

// Build ICC profiles:
let icc_linear = LensProfile::build("jina-v5", "token_embd", Linear, &f32_cosines, &linear_table, N);
let icc_gamma = LensProfile::build("jina-v5", "token_embd", GammaPhi, &f32_cosines, &gamma_table, N);

// Measure AFTER ICC:
let rho_linear_corrected = evaluate_corrected(&f32_cosines, &linear_table, &icc_linear, ...);
let rho_gamma_corrected = evaluate_corrected(&f32_cosines, &gamma_table, &icc_gamma, ...);
```

### Phase 5: Cronbach Alpha

```rust
/// Cronbach's alpha for multi-lens internal consistency.
///
/// items[lens][pair] = distance measured by this lens for this pair.
/// Higher α = more agreement between lenses.
fn cronbach_alpha(items: &[Vec<f32>]) -> f32 {
    let k = items.len() as f32; // number of lenses
    let n = items[0].len(); // number of pairs

    // Total score variance
    let totals: Vec<f32> = (0..n)
        .map(|pair| items.iter().map(|lens| lens[pair]).sum::<f32>())
        .collect();
    let var_total = variance(&totals);

    // Sum of item variances
    let var_sum: f32 = items.iter()
        .map(|lens| variance(lens))
        .sum();

    // α = (k / (k-1)) × (1 - Σvar_item / var_total)
    (k / (k - 1.0)) * (1.0 - var_sum / var_total)
}

fn variance(data: &[f32]) -> f32 {
    let n = data.len() as f32;
    let mean = data.iter().sum::<f32>() / n;
    data.iter().map(|x| (x - mean).powi(2)).sum::<f32>() / n
}

// Measure:
let lens_distances = vec![
    jina_distances, // lens 0
    bge_distances, // lens 1
    reranker_distances, // lens 2
    reader_distances, // lens 3
    qwopus_distances, // lens 4
];

let alpha = cronbach_alpha(&lens_distances);
// α > 0.90: lenses are redundant (one is enough for this task)
// α 0.70-0.90: lenses agree mostly (superposition adds a little)
// α < 0.70: lenses see different things (superposition is valuable)
```
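Before running Phase 5 on real lens outputs, the formula can be sanity-checked on two cases with known answers: lenses that report identical distances (α = 1) and two lenses whose distances are uncorrelated (α = 0). The sketch below mirrors the Rust function in Python with the same population variance; the `interpret` helper and its labels are hypothetical and simply encode the three threshold comments above.

```python
def variance(data):
    """Population variance (divide by n), matching the Rust helper."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / n


def cronbach_alpha(items):
    """items[lens][pair] = distance from one lens for one pair."""
    k = len(items)
    n = len(items[0])
    totals = [sum(lens[p] for lens in items) for p in range(n)]
    var_total = variance(totals)
    var_sum = sum(variance(lens) for lens in items)
    return (k / (k - 1)) * (1.0 - var_sum / var_total)


def interpret(alpha):
    """Map alpha to the three bands used in this handover."""
    if alpha > 0.90:
        return "redundant"
    if alpha >= 0.70:
        return "mostly agree"
    return "complementary"


# Case 1: three lenses reporting exactly the same distances.
identical = [[0.1, 0.4, 0.7, 0.9]] * 3
# Case 2: two lenses with zero covariance across the four pairs.
uncorrelated = [[1.0, -1.0, 1.0, -1.0], [1.0, 1.0, -1.0, -1.0]]

alpha_identical = cronbach_alpha(identical)
alpha_uncorrelated = cronbach_alpha(uncorrelated)
print(interpret(alpha_identical), interpret(alpha_uncorrelated))
```

Dividing by n or by n − 1 in `variance` gives the same α, since the factor cancels in the ratio of summed item variance to total variance.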

---

## SYNTHESIS: What the Results Tell Us

### If H1 confirms (BF16 flips ~5% of ranks):
→ boundary_risk metadata is ESSENTIAL
→ 95/5 split: fast cascade for safe pairs, LEAF validation for boundary pairs
→ γ+φ should reduce the 5% by moving boundaries away from BF16 quant steps

### If H2 confirms (γ+φ > linear):
→ γ+φ becomes the DEFAULT encoding (not optional, mandatory)
→ Per-role γ offsets are critical (Gate=1.50, Q=0.37)
→ The golden ratio IS the calibration (not a cosmetic choice)

### If H3 confirms (i8 > u8 for gate, ≈ for Q):
→ i8 for gate-modulated roles (K, V, Up), u8 for extern roles (Q, Down)
→ Mixed encoding per role within the same layer
→ SiLU-ONNX is definitively unnecessary

### If H4 confirms (ICC brings all to 0.998):
→ Encoding choice is secondary to ICC quality
→ Cheapest encoding + good ICC = optimal
→ Focus engineering effort on ICC, not on better encodings

### If H5 shows high Cronbach α for similarity:
→ Multi-lens superposition is REDUNDANT for similarity tasks
→ Use ONE lens (cheapest) for similarity, save 5× compute
→ Reserve multi-lens for tasks where α < 0.70 (complementary lenses)

### If H5 shows low Cronbach α for relevance:
→ Multi-lens superposition IS VALUABLE for relevance tasks
→ Embedding (Jina) sees different things than reranker
→ The DISAGREEMENT between lenses IS information
→ Keep all lenses, the superposition product captures what no single lens sees
244+
---
245+
246+
## CODE LOCATIONS
247+
248+
```
249+
Contract DTOs:
250+
crates/lance-graph-contract/src/high_heel.rs
251+
→ LensProfile (ICC profile DTO)
252+
→ LensConfig (6-lane registry)
253+
→ EncodingPath enum
254+
→ LENS_REGISTRY static array
255+
256+
Thinking Engine:
257+
crates/thinking-engine/src/jina_lens.rs (Jina v3 lens, 250K vocab)
258+
crates/thinking-engine/src/bge_m3_lens.rs (BGE-M3 lens, 250K vocab)
259+
crates/thinking-engine/src/reranker_lens.rs (Reranker v3, 151K vocab, NEW)
260+
crates/thinking-engine/src/silu_correction.rs (may be replaced by i8)
261+
crates/thinking-engine/src/engine.rs (MatVec cycle)
262+
263+
Calibration:
264+
crates/thinking-engine/examples/calibrate_lenses.rs (Spearman + ICC harness)
265+
crates/thinking-engine/examples/hdr_audit.rs (all models compared)
266+
crates/thinking-engine/examples/silu_crosscheck.rs (u8 vs corrected)
267+
268+
Codec:
269+
crates/bgz-tensor/src/gamma_phi.rs (γ+φ encode/decode)
270+
crates/bgz-tensor/src/codebook_calibrated.rs (two-pass build)
271+
crates/highheelbgz/src/ (spiral stride, golden ratio)
272+
273+
ONNX:
274+
AdaWorldAPI/rten (ONNX runtime, your fork)
275+
AdaWorldAPI/rten-ndarray-demo (rten ↔ ndarray bridge)
276+
277+
Ground Truth Models:
278+
jinaai/jina-embeddings-v5-text-small-text-matching
279+
→ model.onnx + model.onnx_data (2.4 GB, f32 precision)
280+
→ v5-small-text-matching-F16.gguf (1.2 GB, streamable)
281+
→ tokenizer.json (11.4 MB, real BPE)
282+
283+
Data:
284+
crates/thinking-engine/data/Qwopus3.5-27B-v3-BF16-silu/
285+
→ 64 layers × 5 role tables (305 binary files)
286+
→ 248K token assignments
287+
→ tokenizer.json (Qwen2 BPE)
288+
→ layer_stats.json
289+
crates/thinking-engine/data/jina-v3-hdr/ (64 KB table + 488 KB index)
290+
crates/thinking-engine/data/bge-m3-hdr/ (64 KB table + 488 KB index)
291+
crates/thinking-engine/data/jina-reranker-v3-BF16-hdr/ (64 KB table + 296 KB index)
292+
crates/thinking-engine/data/codebooks/ (8 models, 64×64 tables)
293+
```

---

## IMPLEMENTATION ORDER

```
1. Download Jina v5 ONNX (2.4 GB) + GGUF (1.2 GB) + tokenizer
2. Load ONNX via rten → generate f32 ground truth for 1000 pairs
3. Stream GGUF → CLAM → build 5 encoding variants
4. Measure Spearman ρ for each (H1-H3)
5. Build ICC profiles for each
6. Measure corrected ρ (H4)
7. Run all 6 lenses on same pairs → Cronbach α (H5)
8. Synthesize: which encoding × which role × ICC or not
9. Encode findings as LensProfile metadata in contract
10. Re-bake tables with winning encoding per role

Estimated: 3-4 hours for complete validation.
```

---

## SESSION CONTEXT (what was built before this)

```
This session (session_01ChLvBfpJS8dQhHxRD4pYNp) delivered:
  67+ commits across lance-graph + ndarray
  235K LOC Rust across 18 crates

Key deliverables:
  - Qwopus 27B: 64 layers streamed from 53.8 GB BF16 in 116s
  - SiLU gate correction: 86% material (BUT may be replaced by i8 signed)
  - 4096-centroid codebook: 248K tokens
  - Real Qwen BPE tokenizer
  - Living thought loop (tension-driven autoregressive)
  - MoE architecture (4096 experts, top-128)
  - NARS gate modulator (three modes)
  - Jina Reranker lens (wired, 9 tests)
  - LensProfile ICC DTO
  - LensConfig 6-lane registry
  - Calibration harness (Spearman + ICC builder)
  - OSINT pipeline (spider + OCR + NARS expansion)
  - Wikileaks graph (1,872 nodes)
  - SIMD OCR (10× faster than tesseract)
  - Felt OCR (Base17/polar/palette)

Parallel session doing:
  - i8 signed tables (excitation/inhibition)
  - u8 vs i8 dual-path comparison
  - Started from reranker lens (this session wired it)

This calibration session validates ALL of the above.
Without it, every measurement drifts. GPS without relativity.
```
