Commit 5c0352d

feat: XLM-RoBERTa tokenizer.json for Jina v3 + BGE-M3 + German NER
Downloaded from FacebookAI/xlm-roberta-large-finetuned-conll03-{english,german}:

  data/jina-v3-hdr/tokenizer.json    8.7 MB (XLM-RoBERTa 250K vocab)
  data/bge-m3-hdr/tokenizer.json     8.7 MB (same tokenizer, BGE-M3 = XLM-RoBERTa)
  data/xlm-roberta-de/tokenizer.json 8.7 MB (German NER variant, same vocab)

Calibration now runs with REAL BPE tokenization:

  Jina v3:  LOADED (XLM-RoBERTa 250K)
  Reranker: LOADED (Qwen2 151K, from Qwopus tokenizer on disk)

Results with real BPE: ρ still low (0.07 Jina) because avg pairwise
centroid distance ≠ text similarity. Need full MatVec think cycle
(perturb → think → commit) for meaningful calibration, not just raw
centroid lookup averages. The ONNX ground truth path solves this.

https://claude.ai/code/session_019RzHP8tpJu55ESTxhfUy1A
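
To illustrate the ρ figure quoted in the message above, here is a minimal sketch, assuming ρ denotes Spearman's rank correlation between the engine's similarity scores and a set of reference scores. The commit does not say which correlation is used, and the function names below are hypothetical, not taken from the thinking-engine crate.

```rust
use std::cmp::Ordering;

/// Assign 1-based ranks to `values`, giving tied values the mean of their ranks.
fn average_ranks(values: &[f64]) -> Vec<f64> {
    let mut idx: Vec<usize> = (0..values.len()).collect();
    idx.sort_by(|&a, &b| values[a].partial_cmp(&values[b]).unwrap_or(Ordering::Equal));
    let mut ranks = vec![0.0; values.len()];
    let mut i = 0;
    while i < idx.len() {
        // Extend j to cover the run of values equal to values[idx[i]].
        let mut j = i;
        while j + 1 < idx.len() && values[idx[j + 1]] == values[idx[i]] {
            j += 1;
        }
        // Sorted positions i..=j share the mean of ranks (i + 1)..=(j + 1).
        let avg = (i + j) as f64 / 2.0 + 1.0;
        for k in i..=j {
            ranks[idx[k]] = avg;
        }
        i = j + 1;
    }
    ranks
}

/// Pearson correlation; returns None when either input has zero variance.
fn pearson(x: &[f64], y: &[f64]) -> Option<f64> {
    let n = x.len() as f64;
    let mx = x.iter().sum::<f64>() / n;
    let my = y.iter().sum::<f64>() / n;
    let (mut cov, mut vx, mut vy) = (0.0, 0.0, 0.0);
    for (a, b) in x.iter().zip(y) {
        cov += (a - mx) * (b - my);
        vx += (a - mx).powi(2);
        vy += (b - my).powi(2);
    }
    if vx == 0.0 || vy == 0.0 {
        None
    } else {
        Some(cov / (vx * vy).sqrt())
    }
}

/// Spearman's rho: Pearson correlation computed on the ranks of the inputs.
/// Returns None for mismatched lengths, fewer than two points, or constant input.
pub fn spearman_rho(predicted: &[f64], reference: &[f64]) -> Option<f64> {
    if predicted.len() != reference.len() || predicted.len() < 2 {
        return None;
    }
    pearson(&average_ranks(predicted), &average_ranks(reference))
}
```

Under this reading, a ρ near 0.07 would mean the ordering of the predicted scores is nearly unrelated to the ordering of the reference scores, which fits the message's point that averaged centroid distances do not track text similarity.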
1 parent c4d91f3 commit 5c0352d

4 files changed

Lines changed: 5 additions & 0 deletions

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+*.onnx
+*.onnx_data

crates/thinking-engine/data/bge-m3-hdr/tokenizer.json

Lines changed: 1 addition & 0 deletions
Large diffs are not rendered by default.

crates/thinking-engine/data/jina-v3-hdr/tokenizer.json

Lines changed: 1 addition & 0 deletions
Large diffs are not rendered by default.

crates/thinking-engine/data/xlm-roberta-de/tokenizer.json

Lines changed: 1 addition & 0 deletions
Large diffs are not rendered by default.
