Commit 40e6d92

Merge pull request #115 from AdaWorldAPI/claude/risc-thought-engine-TCZw7
feat: wire candle 0.9 + hf-hub as calibration deps (feature-gated)

Three tools, three roles:
  candle: Training (SiLU) + Forward Pass Ground Truth (Jina/BGE-M3/Qwen)
  ort:    Reranker Ground Truth (only path) + Bulk Calibration speed
  rten:   Medical Imaging Sensor (U-Net/ViT, pure Rust, separate)

New feature flag: --features calibration
  candle-core 0.9, candle-nn 0.9, candle-transformers 0.9, hf-hub 0.4
  All optional; default build unchanged (188 tests pass)

Follows EmbedAnything's proven versions (candle 0.9.2). Jina BERT, ModernBERT, Qwen3 forward pass patterns available from StarlightSearch/EmbedAnything/rust/src/models/ as reference.

ort for Reranker is next (separate dep, only needed for cross-encoder).

https://claude.ai/code/session_019RzHP8tpJu55ESTxhfUy1A
2 parents 87b304e + 282f9bc commit 40e6d92

15 files changed

Lines changed: 3380 additions & 155 deletions
Lines changed: 286 additions & 0 deletions
@@ -0,0 +1,286 @@
# SESSION: Jina v5 ONNX Calibration - Ground Truth Without API Keys

## THE GOAL

Replace synthetic ground truth in `calibrate_lenses.rs` with REAL embeddings.
Jina v5 ONNX + rten = f32 ground truth on CPU. No API key. No network at runtime.

Then: run the 5-path calibration matrix from PR #113 handover.

---

## JINA v5 MODEL CARD

```
Model:       jinaai/jina-embeddings-v5-text-small-text-matching
Base:        Qwen3-0.6B-Base
Params:      677M (0.6B)
Emb dim:     1024
Matryoshka:  32, 64, 128, 256, 512, 768, 1024
Max seq:     32,768
Pooling:     last-token
Tensor:      BF16
Vocab:       Qwen3 BPE (tokenizer.json = 11.4 MB)

Files:
  model.safetensors                  1.19 GB  (safetensors weights)
  v5-small-text-matching-F16.gguf    1.2 GB   (F16 GGUF, streamable)
  onnx/model.onnx                    1.27 MB  (graph only)
  onnx/model.onnx_data               2.38 GB  (weights, external data)
  tokenizer.json                     11.4 MB  (Qwen3 BPE)

v5 Task Variants (all have GGUF + ONNX):
  text-small:        general, retrieval, text-matching, clustering, classification
  text-nano (0.2B):  same 5 tasks (smaller, faster, less accurate)
```

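The Matryoshka dims in the model card mean a shorter embedding can be taken from the front of the 1024D vector. A minimal sketch, assuming the usual convention of keeping the first `dim` components and re-normalizing; `truncate_matryoshka` is a hypothetical helper name, not part of the existing crate:

```rust
/// Truncate an embedding to its first `dim` components and re-normalize
/// the prefix to unit length.
///
/// ASSUMPTION: "prefix + re-normalize" is the common Matryoshka convention;
/// confirm it against the Jina v5 model card before relying on it.
/// Returns `None` if `dim` is zero, exceeds the input length, or the
/// prefix has (near-)zero norm.
pub fn truncate_matryoshka(embedding: &[f32], dim: usize) -> Option<Vec<f32>> {
    if dim == 0 || dim > embedding.len() {
        return None;
    }
    let prefix = &embedding[..dim];
    let norm: f32 = prefix.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm <= 1e-10 {
        return None;
    }
    Some(prefix.iter().map(|x| x / norm).collect())
}
```

Calling `truncate_matryoshka(&emb, 256)` on a 1024D ground-truth vector would give a 256D unit vector, which keeps dot product equal to cosine at the reduced size.
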
## WHY v5 REPLACES v3 AS TRUTH ANCHOR

```
v3: API-only ground truth (needs JINA_API_KEY, network, rate limits)
v5: ONNX available (rten loads it, forward pass on CPU, no API)
    GGUF available (stream F16 → CLAM → bake, same pipeline as v3)
    SAME repo has both → calibration in one session, no external deps

v5 is also newer, scores better on MTEB, supports Matryoshka dims, and is based on Qwen3.
```

---

## IMPLEMENTATION PLAN

### Phase 1: Add rten dependency (10 min)

```toml
# crates/thinking-engine/Cargo.toml
[dependencies]
rten = { version = "0.16", optional = true }
rten-tensor = { version = "0.16", optional = true }

[features]
default = ["tokenizer"]
tokenizer = ["dep:tokenizers"]
onnx-calibration = ["dep:rten", "dep:rten-tensor"]
```

Feature-gated. Default builds don't compile rten, and nothing in them depends on the 2+ GB of ONNX weights.
Only the calibration example enables it.

### Phase 2: Download Jina v5 ONNX + GGUF (15 min, one-time)

```bash
# Create data directory
mkdir -p crates/thinking-engine/data/jina-v5-text-matching

# Download ONNX (2.39 GB total: model graph + external weights)
# NOTE: model.onnx_data is 2.38 GB. Do NOT commit to git.
cd crates/thinking-engine/data/jina-v5-text-matching
wget https://huggingface.co/jinaai/jina-embeddings-v5-text-small-text-matching/resolve/main/onnx/model.onnx
wget https://huggingface.co/jinaai/jina-embeddings-v5-text-small-text-matching/resolve/main/onnx/model.onnx_data

# Download tokenizer (11.4 MB)
wget https://huggingface.co/jinaai/jina-embeddings-v5-text-small-text-matching/resolve/main/tokenizer.json

# GGUF is STREAMED, not downloaded (existing pipeline via HTTP range requests)
# Source: v5-small-text-matching-F16.gguf (1.2 GB)
```

Add to `.gitignore`:
```
crates/thinking-engine/data/jina-v5-text-matching/model.onnx
crates/thinking-engine/data/jina-v5-text-matching/model.onnx_data
```

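Because the ONNX files are gitignored, the calibration example may be run on a checkout where they were never downloaded. A small preflight sketch, assuming the three file names from the download step; `missing_calibration_files` is a hypothetical helper, not an existing function in the crate:

```rust
use std::path::Path;

/// Files the download step is expected to leave in the data directory.
/// ASSUMPTION: names match the wget targets from Phase 2.
const REQUIRED_FILES: [&str; 3] = ["model.onnx", "model.onnx_data", "tokenizer.json"];

/// Return the names of required files that are absent from `dir`,
/// in the order listed in `REQUIRED_FILES`.
pub fn missing_calibration_files(dir: &Path) -> Vec<&'static str> {
    REQUIRED_FILES
        .iter()
        .copied()
        .filter(|name| !dir.join(name).is_file())
        .collect()
}
```

The calibration example could call this first and print the missing names with a pointer to the download commands, instead of failing later inside the model loader.
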
### Phase 3: ONNX inference module (jina_v5_onnx.rs, ~150 lines)

```rust
//! Jina v5 ONNX ground truth via rten.
//!
//! Loads the model, runs a forward pass over pre-tokenized input, and
//! returns an L2-normalized f32 embedding (1024D). This IS the ground truth.
//!
//! NOTE: `node_id`, `run`, and the `try_into` conversion below are written
//! against the rten API as documented for recent releases. Verify the exact
//! signatures against the pinned rten version before relying on this.

use std::error::Error;

use rten::Model;
use rten_tensor::prelude::*;
use rten_tensor::NdTensor;

pub struct JinaV5Onnx {
    model: Model,
}

impl JinaV5Onnx {
    /// Load the model from a file path.
    /// Depending on the rten version, the ONNX graph may first need
    /// converting to `.rten` with `rten-convert`; check the pinned release.
    pub fn load(model_path: &str) -> Result<Self, Box<dyn Error>> {
        let model = Model::load_file(model_path)?;
        Ok(Self { model })
    }

    /// Run inference: token_ids → L2-normalized f32 embedding.
    /// Uses last-token pooling (Jina v5 convention).
    pub fn embed(&self, token_ids: &[i64]) -> Result<Vec<f32>, Box<dyn Error>> {
        let seq_len = token_ids.len();
        if seq_len == 0 {
            // seq_len - 1 below would underflow on an empty sequence.
            return Err("embed: empty token sequence".into());
        }

        // ASSUMPTION: rten integer tensors are i32, so ids are narrowed here.
        let ids = token_ids
            .iter()
            .map(|&t| i32::try_from(t))
            .collect::<Result<Vec<i32>, _>>()?;
        let input_ids = NdTensor::from_data([1, seq_len], ids);
        let attention_mask = NdTensor::from_data([1, seq_len], vec![1i32; seq_len]);

        // Tensor names are NOT guaranteed: inspect the graph for the real ones.
        let input_ids_id = self.model.node_id("input_ids")?;
        let mask_id = self.model.node_id("attention_mask")?;
        let output_id = self.model.node_id("last_hidden_state")?; // or "sentence_embedding"

        let mut result = self.model.run(
            vec![
                (input_ids_id, input_ids.view().into()),
                (mask_id, attention_mask.view().into()),
            ],
            &[output_id],
            None,
        )?;

        // Output is [batch, seq, hidden]. Last-token pooling takes the
        // hidden state at position seq_len - 1.
        let output: NdTensor<f32, 3> = result.remove(0).try_into()?;
        let hidden = output.size(2);
        let embedding: Vec<f32> = (0..hidden)
            .map(|d| output[[0, seq_len - 1, d]])
            .collect();

        // L2 normalize so that a dot product equals cosine similarity.
        let norm: f32 = embedding.iter().map(|x| x * x).sum::<f32>().sqrt();
        Ok(if norm > 1e-10 {
            embedding.iter().map(|x| x / norm).collect()
        } else {
            embedding
        })
    }

    /// Cosine similarity between two tokenized texts (ground truth).
    pub fn cosine(&self, ids_a: &[i64], ids_b: &[i64]) -> Result<f32, Box<dyn Error>> {
        let emb_a = self.embed(ids_a)?;
        let emb_b = self.embed(ids_b)?;
        Ok(emb_a.iter().zip(&emb_b).map(|(a, b)| a * b).sum())
    }
}
```

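The pooling and normalization steps can be checked without loading the 2.4 GB model. A minimal sketch, assuming the hidden states for one sequence are available as a row-major `[seq_len, hidden]` buffer; the function names are illustrative and not part of the module above:

```rust
/// Last-token pooling over a row-major `[seq_len, hidden]` buffer:
/// returns the final row, or `None` if the shape does not match the buffer.
pub fn last_token_pool(hidden_states: &[f32], seq_len: usize, hidden: usize) -> Option<&[f32]> {
    if seq_len == 0 || hidden == 0 || hidden_states.len() != seq_len * hidden {
        return None;
    }
    Some(&hidden_states[(seq_len - 1) * hidden..])
}

/// Scale a vector to unit L2 norm; a (near-)zero vector is returned unchanged.
pub fn l2_normalize(v: &[f32]) -> Vec<f32> {
    let norm: f32 = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 1e-10 {
        v.iter().map(|x| x / norm).collect()
    } else {
        v.to_vec()
    }
}

/// Dot product; equals cosine similarity when both inputs are unit vectors.
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}
```

Keeping these as plain functions over slices would let the pooling logic be unit-tested in the default build, with only the forward pass itself behind the `onnx-calibration` feature.
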
### Phase 4: Build Jina v5 lens (stream_jina_v5.rs example)

```
Stream v5-small-text-matching-F16.gguf via HTTP range requests
  → Extract token_embedding rows (vocab × 1024 × F16)
  → CLAM 256 centroids
  → Build 256×256 distance table
  → CDF-percentile HDR encoding → u8 table
  → Also build i8 signed table (subtract 128)
  → Bake via include_bytes! → jina_v5_lens.rs
```

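The two encoding steps in this pipeline can be sketched as a plain empirical-CDF rank quantizer. This is an illustration under stated assumptions, not the existing HDR pipeline: tie handling, interpolation, and the exact percentile mapping there may differ, and both function names are hypothetical.

```rust
use std::cmp::Ordering;

/// Encode each value as a u8 from its rank in the empirical CDF of `values`.
/// The smallest value maps to 0 and the largest to 255; ties share a code.
/// ASSUMPTION: a simple rank quantizer stands in for "CDF-percentile HDR".
pub fn cdf_percentile_u8(values: &[f32]) -> Vec<u8> {
    let mut sorted: Vec<f32> = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let n = sorted.len();
    values
        .iter()
        .map(|&v| {
            // Count of samples <= v, i.e. n times the empirical CDF at v.
            let rank = sorted.partition_point(|s| s.total_cmp(&v) != Ordering::Greater);
            if n <= 1 {
                0
            } else {
                // Map rank in 1..=n linearly onto 0..=255.
                (((rank - 1) * 255) / (n - 1)) as u8
            }
        })
        .collect()
}

/// Re-center a u8 code to i8 by subtracting 128 (0 → -128, 255 → 127).
pub fn to_signed(code: u8) -> i8 {
    (code as i16 - 128) as i8
}
```

Because the codes come from ranks, the u8 table is roughly uniform over 0..=255 regardless of how skewed the raw distances are, and the signed table is the same ordering shifted to be centered on zero.
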
### Phase 5: Calibration matrix (calibrate_v5.rs example)

```
For each of 16 sentence pairs (path numbers as in the 5-path matrix):
  1. Tokenize with Jina v5 tokenizer (tokenizers crate)
  2. ONNX inference → f32 cosine   = PATH 1 (GROUND TRUTH)
  3. Baked lens distance (u8 HDR)  = PATH 2
  4. γ+φ encoded distance          = PATH 3
  5. Baked lens signed (i8)        = PATH 4
  6. highheelbgz spiral distance   = PATH 5

Compare each path vs ground truth:
  Spearman ρ        (rank preservation)
  Linear ICC        (transfer curve)
  RMSE              (absolute error)
  Effective bits    (information preserved)

Result: per-path calibration scores.
Winner per model × role.
```

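Two of the comparison metrics are simple enough to sketch with the standard library alone. The version below assumes each path and the ground truth are given as equal-length `f64` slices (one value per sentence pair) and uses average ranks for ties; it is one reasonable way to compute them, not necessarily how `calibrate_v5.rs` will.

```rust
/// 1-based ranks; tied values share the mean of the positions they occupy.
fn average_ranks(xs: &[f64]) -> Vec<f64> {
    let mut order: Vec<usize> = (0..xs.len()).collect();
    order.sort_by(|&i, &j| xs[i].total_cmp(&xs[j]));
    let mut ranks = vec![0.0; xs.len()];
    let mut start = 0;
    while start < order.len() {
        let mut end = start;
        while end + 1 < order.len() && xs[order[end + 1]] == xs[order[start]] {
            end += 1;
        }
        let shared = (start + end) as f64 / 2.0 + 1.0;
        for &idx in &order[start..=end] {
            ranks[idx] = shared;
        }
        start = end + 1;
    }
    ranks
}

/// Pearson correlation; `None` if lengths differ, n < 2, or a side is constant.
fn pearson(a: &[f64], b: &[f64]) -> Option<f64> {
    if a.len() != b.len() || a.len() < 2 {
        return None;
    }
    let n = a.len() as f64;
    let (ma, mb) = (a.iter().sum::<f64>() / n, b.iter().sum::<f64>() / n);
    let (mut cov, mut va, mut vb) = (0.0, 0.0, 0.0);
    for (x, y) in a.iter().zip(b) {
        cov += (x - ma) * (y - mb);
        va += (x - ma).powi(2);
        vb += (y - mb).powi(2);
    }
    if va == 0.0 || vb == 0.0 {
        None
    } else {
        Some(cov / (va * vb).sqrt())
    }
}

/// Spearman ρ: Pearson correlation of the average ranks.
pub fn spearman_rho(path: &[f64], truth: &[f64]) -> Option<f64> {
    if path.len() != truth.len() {
        return None;
    }
    pearson(&average_ranks(path), &average_ranks(truth))
}

/// Root-mean-square error between a path's values and the ground truth.
pub fn rmse(path: &[f64], truth: &[f64]) -> Option<f64> {
    if path.len() != truth.len() || path.is_empty() {
        return None;
    }
    let mse = path.iter().zip(truth).map(|(p, t)| (p - t).powi(2)).sum::<f64>()
        / path.len() as f64;
    Some(mse.sqrt())
}
```

One caveat worth keeping in mind: a path that reports distances rather than similarities will give a negative ρ against cosine ground truth, so either flip its sign first or compare paths by |ρ|. RMSE is only meaningful once a path's values have been mapped onto the same scale as the ground truth.
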
### Phase 6: Wire v5 as default truth anchor

```
Update calibrate_lenses.rs:
  Replace synthetic ground truth with JinaV5Onnx::cosine()
  Keep the Jina v3 and Reranker baked lens comparisons
  Add Jina v5 baked lens when available
  Report: v3 vs v5 correlation (should be high)
```

---

## 5-PATH CALIBRATION MATRIX (from PR #113 handover)

```
For EACH model (6 models × 6 roles = 36 cells):

  Path 1: ONNX (rten)       f32 ground truth             ← REFERENCE
  Path 2: GGUF raw u8 CDF   existing HDR pipeline
  Path 3: GGUF γ+φ          golden ratio redistribution
  Path 4: GGUF i8 signed    preserves gate sign          ← OUR NEW PATH
  Path 5: GGUF highheelbgz  spiral + golden stride

ICC profile per path:
  transfer_curve:  f(baked_distance) → ground_truth_similarity
  noise_floor:     minimum detectable difference
  effective_bits:  Shannon entropy of the encoding
  spearman_rho:    rank correlation with ground truth

Per model × role winner:
  i8 expected to win for:       reranker (symmetric), gate roles (sign matters)
  γ+φ expected to win for:      narrow-range roles (gate, up)
  raw CDF expected to win for:  positive-skewed (embedding models)
  hhbgz expected to win for:    ??? (that's why we test)
```

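The `effective_bits` entry in the ICC profile is defined as the Shannon entropy of the encoding. A minimal sketch, assuming the entropy is taken over the histogram of u8 codes in a baked table; whether the real profile uses the full table or only the codes hit by the test pairs is not stated here.

```rust
/// Shannon entropy, in bits, of the empirical distribution of u8 codes.
/// Ranges from 0.0 (a single code) to 8.0 (all 256 codes equally likely).
pub fn effective_bits(codes: &[u8]) -> f64 {
    if codes.is_empty() {
        return 0.0;
    }
    // Histogram of code occurrences.
    let mut hist = [0usize; 256];
    for &c in codes {
        hist[c as usize] += 1;
    }
    let n = codes.len() as f64;
    // H = -Σ p·log2(p), skipping empty bins (0·log 0 is taken as 0).
    hist.iter()
        .filter(|&&count| count > 0)
        .map(|&count| {
            let p = count as f64 / n;
            -p * p.log2()
        })
        .sum()
}
```

For an i8 table the same function applies after mapping each code back with `(c as i16 + 128) as u8`, since shifting the codes does not change their entropy.
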
## MODELS IN THE MATRIX

```
Model                  ONNX available?    GGUF available?  Notes
─────                  ───────────────    ───────────────  ─────
Jina v5 text-matching  ✓ (2.39 GB)        ✓ (1.2 GB F16)   NEW truth anchor
BGE-M3                 ✓ (via ONNX repo)  ✓ (baked)        multilingual
Jina Reranker v3       ? (check)          ✓ (baked)        widest cos range
Reader-LM 1.5B         ? (check)          ✓ (baked)        HTML→text
Qwopus 27B             ✗ (too large)      ✓ (streamed)     dense LLM
Maverick 128E          ✗ (800 GB)         ✓ (18 shards)    real MoE
```

For models without ONNX: use Jina v5 embeddings of the TEXT as cross-model truth.
The thinking-engine distance should correlate with Jina v5 text similarity.

## SIGNED EXPERIMENT INTEGRATION

```
The dual signed experiment (just pushed) showed:
  Jina v3:        88% agreement (narrow cos, weak inhibition)
  Jina Reranker:  50% agreement (wide cos, strong inhibition)
  BGE-M3:         62% agreement (moderate)

The calibration matrix will answer:
  Does the 50% DISAGREEMENT on the reranker mean
    (a) signed is MORE accurate (finds real structure unsigned misses), or
    (b) signed is LESS accurate (inhibition kills valid peaks)?

  Compare both paths against ONNX ground truth.
  If signed ρ > unsigned ρ on reranker: signed wins, drop SiLU-ONNX.
  If unsigned ρ > signed ρ: keep both, SiLU-ONNX still needed.
```

---

## DISK BUDGET

```
ONNX model:   2.39 GB  (downloaded, NOT committed, .gitignore'd)
GGUF F16:     1.2 GB   (streamed via HTTP, never on disk)
tokenizer:    11.4 MB  (downloaded, could commit or gitignore)
Baked table:  64 KB    (committed, include_bytes!)
Baked index:  ~300 KB  (committed, include_bytes!)

Total new disk: ~2.4 GB temporary for calibration, ~364 KB permanent.
```

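The 64 KB figure is consistent with one 256×256 table at one byte per cell. A small arithmetic sketch, assuming a single u8 table is baked and taking the index as the approximate 300 KB stated in the budget; the constant names are illustrative. If the i8 table were baked as a separate file rather than derived by subtracting 128, it would add another 64 KB.

```rust
/// Bytes in one 256×256 table with one byte (u8 or i8) per cell.
pub const TABLE_BYTES: usize = 256 * 256;

/// Approximate baked index size in KB, as stated in the budget (not measured).
pub const INDEX_KB_APPROX: usize = 300;

/// Convert bytes to KB (1 KB = 1024 bytes here), rounding down.
pub const fn kb(bytes: usize) -> usize {
    bytes / 1024
}

/// Permanent (committed) footprint in KB: one table plus the index.
pub const fn permanent_kb() -> usize {
    kb(TABLE_BYTES) + INDEX_KB_APPROX
}
```
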
## WHAT NOT TO DO

- Do NOT commit the ONNX model to git (2.4 GB). gitignore it.
- Do NOT stream ONNX via HTTP range requests. rten needs the full file.
- Do NOT use Q8_0 GGUF. cos[0,0]. F16 required. (Doctrine #9)
- Do NOT run calibration in CI. It needs the ONNX file on disk.
- Do NOT assume the output tensor name. Check the model.onnx graph for the actual names.
- Do NOT skip L2 normalization. Jina v5 uses last-token pooling, and the pooled vector must be normalized before a dot product is read as cosine.
