| 1 | +# SESSION: Jina v5 ONNX Calibration — Ground Truth Without API Keys |
| 2 | + |
| 3 | +## THE GOAL |
| 4 | + |
| 5 | +Replace synthetic ground truth in `calibrate_lenses.rs` with REAL embeddings. |
| 6 | +Jina v5 ONNX + rten = f32 ground truth on CPU. No API key. No network at runtime. |
| 7 | + |
| 8 | +Then: run the 5-path calibration matrix from PR #113 handover. |
| 9 | + |
| 10 | +--- |
| 11 | + |
| 12 | +## JINA v5 MODEL CARD |
| 13 | + |
| 14 | +``` |
| 15 | +Model: jinaai/jina-embeddings-v5-text-small-text-matching |
| 16 | +Base: Qwen3-0.6B-Base |
| 17 | +Params: 677M (0.6B) |
| 18 | +Emb dim: 1024 |
| 19 | +Matryoshka: 32, 64, 128, 256, 512, 768, 1024 |
| 20 | +Max seq: 32,768 |
| 21 | +Pooling: last-token |
| 22 | +Tensor: BF16 |
| 23 | +Vocab: Qwen3 BPE (tokenizer.json = 11.4 MB) |
| 24 | + |
| 25 | +Files: |
| 26 | + model.safetensors 1.19 GB (safetensors weights) |
| 27 | + v5-small-text-matching-F16.gguf 1.2 GB (F16 GGUF, streamable) |
| 28 | + onnx/model.onnx 1.27 MB (graph only) |
| 29 | + onnx/model.onnx_data 2.38 GB (weights, external data) |
| 30 | + tokenizer.json 11.4 MB (Qwen3 BPE) |
| 31 | + |
| 32 | +v5 Task Variants (all have GGUF + ONNX): |
| 33 | + text-small: general, retrieval, text-matching, clustering, classification |
| 34 | + text-nano (0.2B): same 5 tasks (smaller, faster, less accurate) |
| 35 | +``` |
| 36 | + |
| 37 | +## WHY v5 REPLACES v3 AS TRUTH ANCHOR |
| 38 | + |
| 39 | +``` |
| 40 | +v3: API-only ground truth (needs JINA_API_KEY, network, rate limits) |
| 41 | +v5: ONNX available (rten loads it, forward pass on CPU, no API) |
| 42 | + GGUF available (stream F16 → CLAM → bake, same pipeline as v3) |
| 43 | + SAME repo has both → calibration in one session, no external deps |
| 44 | + |
| 45 | +v5 is also newer, has better MTEB scores, supports Matryoshka dims, and is based on Qwen3. |
| 46 | +``` |
| 47 | + |
| 48 | +--- |
| 49 | + |
| 50 | +## IMPLEMENTATION PLAN |
| 51 | + |
| 52 | +### Phase 1: Add rten dependency (10 min) |
| 53 | + |
| 54 | +```toml |
| 55 | +# crates/thinking-engine/Cargo.toml |
| 56 | +[dependencies] |
| 57 | +rten = { version = "0.16", optional = true } |
| 58 | +rten-tensor = { version = "0.16", optional = true } |
| 59 | + |
| 60 | +[features] |
| 61 | +default = ["tokenizer"] |
| 62 | +tokenizer = ["dep:tokenizers"] |
| 63 | +onnx-calibration = ["dep:rten", "dep:rten-tensor"] |
| 64 | +``` |
| 65 | + |
| 66 | +Feature-gated. Default builds don't compile rten or the ONNX code path. |
| 67 | +Only the calibration example enables it, and only that example needs the 2+ GB of ONNX weights on disk. |
| 68 | + |
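A minimal sketch of what the gate could look like on the Rust side, assuming the `onnx-calibration` feature name from the manifest above. The function name `ground_truth_backend` and both returned labels are hypothetical and only illustrate the compile-time switch.

```rust
/// Reports which ground-truth source this build was compiled with.
/// `cfg!` is resolved at compile time from the enabled Cargo features.
pub fn ground_truth_backend() -> &'static str {
    if cfg!(feature = "onnx-calibration") {
        "jina-v5-onnx" // hypothetical label: real embeddings via rten
    } else {
        "synthetic" // hypothetical label: the existing synthetic ground truth
    }
}

// The ONNX module itself would be compiled only when the feature is on, e.g.
// `#[cfg(feature = "onnx-calibration")] mod jina_v5_onnx;`
```

With this layout the calibration example would presumably be run with `cargo run --example calibrate_v5 --features onnx-calibration`, while a plain `cargo build` never touches rten.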
| 69 | +### Phase 2: Download Jina v5 ONNX + GGUF (15 min, one-time) |
| 70 | + |
| 71 | +```bash |
| 72 | +# Create data directory |
| 73 | +mkdir -p crates/thinking-engine/data/jina-v5-text-matching |
| 74 | + |
| 75 | +# Download ONNX (2.39 GB total — model graph + external weights) |
| 76 | +# NOTE: model.onnx_data is 2.38 GB. Do NOT commit to git. |
| 77 | +cd crates/thinking-engine/data/jina-v5-text-matching |
| 78 | +wget https://huggingface.co/jinaai/jina-embeddings-v5-text-small-text-matching/resolve/main/onnx/model.onnx |
| 79 | +wget https://huggingface.co/jinaai/jina-embeddings-v5-text-small-text-matching/resolve/main/onnx/model.onnx_data |
| 80 | + |
| 81 | +# Download tokenizer (11.4 MB) |
| 82 | +wget https://huggingface.co/jinaai/jina-embeddings-v5-text-small-text-matching/resolve/main/tokenizer.json |
| 83 | + |
| 84 | +# GGUF is STREAMED, not downloaded (existing pipeline via HTTP range requests) |
| 85 | +# Source: v5-small-text-matching-F16.gguf (1.2 GB) |
| 86 | +``` |
| 87 | + |
| 88 | +Add to `.gitignore`: |
| 89 | +``` |
| 90 | +crates/thinking-engine/data/jina-v5-text-matching/model.onnx |
| 91 | +crates/thinking-engine/data/jina-v5-text-matching/model.onnx_data |
| 92 | +``` |
| 93 | + |
| 94 | +### Phase 3: ONNX inference module (jina_v5_onnx.rs, ~150 lines) |
| 95 | + |
| 96 | +```rust |
| 97 | +//! Jina v5 ONNX ground truth via rten. |
| 98 | +//! |
| 99 | +//! Loads the ONNX model, tokenizes input, runs forward pass, |
| 100 | +//! returns 1024D f32 embedding. This IS the ground truth. |
| 101 | + |
| 102 | +use rten::Model; |
| 103 | +use rten_tensor::NdTensor; |
| 104 | + |
| 105 | +pub struct JinaV5Onnx { |
| 106 | + model: Model, |
| 107 | +} |
| 108 | + |
| 109 | +impl JinaV5Onnx { |
| 110 | + /// Load from ONNX file path. |
| 111 | + pub fn load(onnx_path: &str) -> Result<Self, Box<dyn std::error::Error>> { |
| 112 | + let model = Model::load_file(onnx_path)?; |
| 113 | + Ok(Self { model }) |
| 114 | + } |
| 115 | + |
| 116 | + /// Run inference: token_ids → 1024D f32 embedding. |
| 117 | + /// Uses last-token pooling (Jina v5 convention). |
| 118 | + pub fn embed(&self, token_ids: &[i64]) -> Vec<f32> { |
| 119 | + let seq_len = token_ids.len(); |
| 120 | + let input_ids = NdTensor::from_data( |
| 121 | + [1, seq_len], |
| 122 | + token_ids.to_vec(), |
| 123 | + ); |
| 124 | + let attention_mask = NdTensor::from_data( |
| 125 | + [1, seq_len], |
| 126 | + vec![1i64; seq_len], |
| 127 | + ); |
| 128 | + |
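| 129 | +        // rten's `run` takes NodeIds (not names) plus optional RunOptions. |
| 130 | +        let id = |name: &str| self.model.node_id(name).expect("node not in graph"); |
| 131 | +        let result = self.model.run( |
| 132 | +            vec![(id("input_ids"), input_ids.into()), (id("attention_mask"), attention_mask.into())], |
| 133 | +            &[id("last_hidden_state")], // or "sentence_embedding" |
| 134 | +            None, |
| 135 | +        ).expect("ONNX forward pass failed"); |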
| 136 | + |
| 137 | + // Extract last-token embedding (1024D) |
| 138 | + let output = result[0].as_float().unwrap(); |
| 139 | + // Last token pooling: take embedding at position seq_len-1 |
| 140 | + let embedding: Vec<f32> = (0..1024) |
| 141 | + .map(|d| output[[0, seq_len - 1, d]]) |
| 142 | + .collect(); |
| 143 | + |
| 144 | + // L2 normalize |
| 145 | + let norm: f32 = embedding.iter().map(|x| x * x).sum::<f32>().sqrt(); |
| 146 | + if norm > 1e-10 { |
| 147 | + embedding.iter().map(|x| x / norm).collect() |
| 148 | + } else { |
| 149 | + embedding |
| 150 | + } |
| 151 | + } |
| 152 | + |
| 153 | + /// Cosine similarity between two texts (ground truth). |
| 154 | + pub fn cosine(&self, ids_a: &[i64], ids_b: &[i64]) -> f32 { |
| 155 | + let emb_a = self.embed(ids_a); |
| 156 | + let emb_b = self.embed(ids_b); |
| 157 | + emb_a.iter().zip(&emb_b).map(|(a, b)| a * b).sum() |
| 158 | + } |
| 159 | +} |
| 160 | +``` |
| 161 | + |
| 162 | +### Phase 4: Build Jina v5 lens (stream_jina_v5.rs example) |
| 163 | + |
| 164 | +``` |
| 165 | +Stream v5-small-text-matching-F16.gguf via HTTP range requests |
| 166 | +→ Extract token_embedding rows (vocab × 1024 × F16) |
| 167 | +→ CLAM 256 centroids |
| 168 | +→ Build 256×256 distance table |
| 169 | +→ CDF-percentile HDR encoding → u8 table |
| 170 | +→ Also build i8 signed table (subtract 128) |
| 171 | +→ Bake via include_bytes! → jina_v5_lens.rs |
| 172 | +``` |
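One possible reading of the "CDF-percentile → u8" and "subtract 128" steps, sketched in plain Rust. This assumes "CDF-percentile" means the empirical CDF of the distance table itself; the function names `cdf_encode_u8` and `to_signed_i8` are hypothetical, and the existing HDR pipeline may differ in detail.

```rust
/// Map each value to its percentile rank within the table, scaled to 0..=255.
/// Assumption: the percentile is taken from the empirical CDF of `table`.
pub fn cdf_encode_u8(table: &[f32]) -> Vec<u8> {
    let mut sorted: Vec<f32> = table.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let n = sorted.len();
    table
        .iter()
        .map(|&v| {
            // Number of entries <= v is the empirical CDF at v (times n).
            let rank = sorted.partition_point(|&s| s <= v);
            // rank is in 1..=n, so the largest value lands on 255.
            ((rank * 255) / n) as u8
        })
        .collect()
}

/// Signed view of the same codes: shift 0..=255 to -128..=127.
pub fn to_signed_i8(codes: &[u8]) -> Vec<i8> {
    codes.iter().map(|&c| (c as i16 - 128) as i8).collect()
}
```

Because the mapping is monotonic, the u8 codes keep the rank order of the raw distances (up to ties created by quantization), which is the property the Spearman comparison in the calibration step depends on.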
| 173 | + |
| 174 | +### Phase 5: Calibration matrix (calibrate_v5.rs example) |
| 175 | + |
| 176 | +``` |
| 177 | +For each of 16 sentence pairs: |
| 178 | + 1. Tokenize with Jina v5 tokenizer (tokenizers crate) |
| 179 | +  2. ONNX inference → f32 cosine = PATH 1 (GROUND TRUTH, reference) |
| 180 | +  3. Baked lens distance (u8 HDR) = PATH 2 |
| 181 | +  4. γ+φ encoded distance = PATH 3 |
| 182 | +  5. Baked lens signed (i8) = PATH 4 |
| 183 | +  6. highheelbgz spiral distance = PATH 5 |
| 184 | + |
| 185 | +Compare each path vs ground truth: |
| 186 | + Spearman ρ (rank preservation) |
| 187 | + Linear ICC (transfer curve) |
| 188 | + RMSE (absolute error) |
| 189 | + Effective bits (information preserved) |
| 190 | + |
| 191 | +Result: per-path calibration scores. |
| 192 | +Winner per model × role. |
| 193 | +``` |
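Two of the comparison metrics above, Spearman ρ and RMSE, can be sketched with the standard library alone. The function names are hypothetical. Note that RMSE is only meaningful once a path's output has been mapped onto the same scale as the ground-truth cosine (for example through the transfer curve), whereas Spearman ρ works directly on raw scores.

```rust
/// Average ranks (1-based); tied values share the mean of their positions.
fn ranks(xs: &[f32]) -> Vec<f64> {
    let mut idx: Vec<usize> = (0..xs.len()).collect();
    idx.sort_by(|&a, &b| xs[a].total_cmp(&xs[b]));
    let mut out = vec![0.0; xs.len()];
    let mut i = 0;
    while i < idx.len() {
        // Extend j over the run of values equal to xs[idx[i]].
        let mut j = i;
        while j + 1 < idx.len() && xs[idx[j + 1]] == xs[idx[i]] {
            j += 1;
        }
        let avg = (i + j) as f64 / 2.0 + 1.0;
        for k in i..=j {
            out[idx[k]] = avg;
        }
        i = j + 1;
    }
    out
}

/// Pearson correlation of two equal-length slices.
fn pearson(a: &[f64], b: &[f64]) -> f64 {
    let n = a.len() as f64;
    let (ma, mb) = (a.iter().sum::<f64>() / n, b.iter().sum::<f64>() / n);
    let cov: f64 = a.iter().zip(b).map(|(x, y)| (x - ma) * (y - mb)).sum();
    let va: f64 = a.iter().map(|x| (x - ma).powi(2)).sum();
    let vb: f64 = b.iter().map(|y| (y - mb).powi(2)).sum();
    cov / (va * vb).sqrt()
}

/// Spearman rho: Pearson correlation of the ranks.
pub fn spearman_rho(path: &[f32], truth: &[f32]) -> f64 {
    assert_eq!(path.len(), truth.len());
    pearson(&ranks(path), &ranks(truth))
}

/// Root-mean-square error between a path's similarities and ground truth.
pub fn rmse(path: &[f32], truth: &[f32]) -> f64 {
    assert_eq!(path.len(), truth.len());
    let sse: f64 = path.iter().zip(truth).map(|(p, t)| ((p - t) as f64).powi(2)).sum();
    (sse / path.len() as f64).sqrt()
}
```

For a distance-valued path (smaller means more similar), a good encoding shows ρ close to -1 against cosine similarity, so the sign convention should be fixed before picking a winner.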
| 194 | + |
| 195 | +### Phase 6: Wire v5 as default truth anchor |
| 196 | + |
| 197 | +``` |
| 198 | +Update calibrate_lenses.rs: |
| 199 | + Replace synthetic ground truth with JinaV5Onnx::cosine() |
| 200 | + Keep the Jina v3 and Reranker baked lens comparisons |
| 201 | + Add Jina v5 baked lens when available |
| 202 | + Report: v3 vs v5 correlation (should be high) |
| 203 | +``` |
| 204 | + |
| 205 | +--- |
| 206 | + |
| 207 | +## 5-PATH CALIBRATION MATRIX (from PR #113 handover) |
| 208 | + |
| 209 | +``` |
| 210 | +For EACH model (6 models × 6 roles = 36 cells): |
| 211 | + |
| 212 | + Path 1: ONNX (rten) f32 ground truth ← REFERENCE |
| 213 | + Path 2: GGUF raw u8 CDF existing HDR pipeline |
| 214 | + Path 3: GGUF γ+φ golden ratio redistribution |
| 215 | + Path 4: GGUF i8 signed preserves gate sign ← OUR NEW PATH |
| 216 | + Path 5: GGUF highheelbgz spiral + golden stride |
| 217 | + |
| 218 | +ICC profile per path: |
| 219 | + transfer_curve: f(baked_distance) → ground_truth_similarity |
| 220 | + noise_floor: minimum detectable difference |
| 221 | + effective_bits: Shannon entropy of the encoding |
| 222 | + spearman_rho: rank correlation with ground truth |
| 223 | + |
| 224 | +Per model × role winner: |
| 225 | + i8 expected to win for: reranker (symmetric), gate roles (sign matters) |
| 226 | + γ+φ expected to win for: narrow-range roles (gate, up) |
| 227 | + raw CDF expected to win for: positive-skewed (embedding models) |
| 228 | + hhbgz expected to win for: ??? (that's why we test) |
| 229 | +``` |
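The `effective_bits` entry can be sketched as follows, assuming it means the Shannon entropy of the histogram of u8 codes in a baked table. The function name mirrors the profile field, but the exact definition used in the PR #113 handover may differ.

```rust
/// Shannon entropy (in bits) of the empirical distribution of u8 codes.
/// 8.0 means all 256 levels are used equally often; 0.0 means a single level.
pub fn effective_bits(codes: &[u8]) -> f64 {
    let mut hist = [0usize; 256];
    for &c in codes {
        hist[c as usize] += 1;
    }
    let n = codes.len() as f64;
    hist.iter()
        .filter(|&&count| count > 0)
        .map(|&count| {
            let p = count as f64 / n;
            -p * p.log2()
        })
        .sum()
}
```

An i8 table can be scored with the same function by shifting its codes back by 128, since entropy depends only on how often each level occurs, not on the level's value.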
| 230 | + |
| 231 | +## MODELS IN THE MATRIX |
| 232 | + |
| 233 | +``` |
| 234 | +Model ONNX available? GGUF available? Notes |
| 235 | +───── ─────────────── ─────────────── ───── |
| 236 | +Jina v5 text-matching ✓ (2.39 GB) ✓ (1.2 GB F16) NEW truth anchor |
| 237 | +BGE-M3 ✓ (via ONNX repo) ✓ (baked) multilingual |
| 238 | +Jina Reranker v3 ? (check) ✓ (baked) widest cos range |
| 239 | +Reader-LM 1.5B ? (check) ✓ (baked) HTML→text |
| 240 | +Qwopus 27B ✗ (too large) ✓ (streamed) dense LLM |
| 241 | +Maverick 128E ✗ (800 GB) ✓ (18 shards) real MoE |
| 242 | +``` |
| 243 | + |
| 244 | +For models without ONNX: use Jina v5 embeddings of the TEXT as cross-model truth. |
| 245 | +The thinking-engine distance should correlate with Jina v5 text similarity. |
| 246 | + |
| 247 | +## SIGNED EXPERIMENT INTEGRATION |
| 248 | + |
| 249 | +``` |
| 250 | +The dual signed experiment (just pushed) showed: |
| 251 | + Jina v3: 88% agreement (narrow cos, weak inhibition) |
| 252 | + Jina Reranker: 50% agreement (wide cos, strong inhibition) |
| 253 | + BGE-M3: 62% agreement (moderate) |
| 254 | + |
| 255 | +The calibration matrix will answer: |
| 256 | + Does the 50% DISAGREEMENT on the reranker mean |
| 257 | + (a) signed is MORE accurate (finds real structure unsigned misses), or |
| 258 | + (b) signed is LESS accurate (inhibition kills valid peaks)? |
| 259 | + |
| 260 | + Compare both paths against ONNX ground truth. |
| 261 | + If signed ρ > unsigned ρ on reranker: signed wins, drop SiLU-ONNX. |
| 262 | + If unsigned ρ > signed ρ: keep both, SiLU-ONNX still needed. |
| 263 | +``` |
| 264 | + |
| 265 | +--- |
| 266 | + |
| 267 | +## DISK BUDGET |
| 268 | + |
| 269 | +``` |
| 270 | +ONNX model: 2.39 GB (downloaded, NOT committed, .gitignore'd) |
| 271 | +GGUF F16: 1.2 GB (streamed via HTTP, never on disk) |
| 272 | +tokenizer: 11.4 MB (downloaded, could commit or gitignore) |
| 273 | +Baked table: 64 KB (committed, include_bytes!) |
| 274 | +Baked index: ~300 KB (committed, include_bytes!) |
| 275 | + |
| 276 | +Total new disk: ~2.4 GB temporary for calibration, ~364 KB permanent (64 KB table + ~300 KB index). |
| 277 | +``` |
| 278 | + |
| 279 | +## WHAT NOT TO DO |
| 280 | + |
| 281 | +- Do NOT commit the ONNX model to git (2.4 GB). gitignore it. |
| 282 | +- Do NOT stream ONNX via HTTP range requests. rten needs the full file. |
| 283 | +- Do NOT use Q8_0 GGUF. cos[0,0]. F16 required. (Doctrine #9) |
| 284 | +- Do NOT run calibration in CI. It needs the ONNX file on disk. |
| 285 | +- Do NOT assume output tensor name. Check model.onnx graph for actual names. |
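| 286 | +- Do NOT skip L2 normalization. Jina v5 uses last-token pooling, and the pooled vector is not guaranteed to be unit-length; `cosine()` is a plain dot product, so it needs normalized inputs. |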
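To illustrate the last point, here is a small standalone sketch (helper names `l2_normalize` and `dot` are hypothetical) showing that a dot product only equals cosine similarity after both vectors are scaled to unit length.

```rust
/// Scale a vector to unit length; near-zero vectors are returned unchanged.
pub fn l2_normalize(v: &[f32]) -> Vec<f32> {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 1e-10 {
        v.iter().map(|x| x / norm).collect()
    } else {
        v.to_vec()
    }
}

/// Plain dot product of two equal-length vectors.
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}
```

For the parallel vectors `[3, 4]` and `[6, 8]` the raw dot product is 50, while the dot product of the normalized vectors is 1 (up to rounding), which is the cosine value the calibration expects.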