Commit 5c0352d
feat: XLM-RoBERTa tokenizer.json for Jina v3 + BGE-M3 + German NER
Downloaded from FacebookAI/xlm-roberta-large-finetuned-conll03-{english,german}:
data/jina-v3-hdr/tokenizer.json 8.7 MB (XLM-RoBERTa 250K vocab)
data/bge-m3-hdr/tokenizer.json 8.7 MB (same tokenizer, BGE-M3 = XLM-RoBERTa)
data/xlm-roberta-de/tokenizer.json 8.7 MB (German NER variant, same vocab)
Calibration now runs with REAL BPE tokenization:
Jina v3: LOADED (XLM-RoBERTa 250K)
Reranker: LOADED (Qwen2 151K, from Qwopus tokenizer on disk)
Results with real BPE: ρ is still low (0.07 for Jina) because the average
pairwise centroid distance is not a proxy for text similarity. Meaningful
calibration needs the full MatVec think cycle (perturb → think → commit),
not just averages of raw centroid lookups. The ONNX ground truth path solves this.
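For reference, a minimal sketch of how a rank correlation such as the ρ quoted above could be computed between per-pair centroid distances and reference similarity scores. This assumes ρ denotes Spearman's rank correlation, which the commit message does not state. The function names `average_ranks`, `pearson`, and `spearman_rho` are hypothetical and are not taken from the thinking-engine crate.

```rust
/// Assigns 1-based ranks to `values`; tied values share the mean of their ranks.
/// Hypothetical helper, not part of the thinking-engine crate.
fn average_ranks(values: &[f64]) -> Vec<f64> {
    // Indices of `values` in ascending order of value.
    let mut order: Vec<usize> = (0..values.len()).collect();
    order.sort_by(|&a, &b| values[a].total_cmp(&values[b]));

    let mut ranks = vec![0.0; values.len()];
    let mut i = 0;
    while i < order.len() {
        // Extend j to cover the run of values tied with position i.
        let mut j = i;
        while j + 1 < order.len() && values[order[j + 1]] == values[order[i]] {
            j += 1;
        }
        // Sorted positions i..=j are tied; their mean 1-based rank is (i + j) / 2 + 1.
        let mean_rank = (i + j) as f64 / 2.0 + 1.0;
        for &idx in &order[i..=j] {
            ranks[idx] = mean_rank;
        }
        i = j + 1;
    }
    ranks
}

/// Pearson correlation; returns None for mismatched lengths, fewer than two
/// points, or zero variance in either input.
fn pearson(x: &[f64], y: &[f64]) -> Option<f64> {
    if x.len() != y.len() || x.len() < 2 {
        return None;
    }
    let n = x.len() as f64;
    let mean_x = x.iter().sum::<f64>() / n;
    let mean_y = y.iter().sum::<f64>() / n;
    let (mut cov, mut var_x, mut var_y) = (0.0, 0.0, 0.0);
    for (a, b) in x.iter().zip(y) {
        cov += (a - mean_x) * (b - mean_y);
        var_x += (a - mean_x).powi(2);
        var_y += (b - mean_y).powi(2);
    }
    if var_x == 0.0 || var_y == 0.0 {
        return None;
    }
    Some(cov / (var_x * var_y).sqrt())
}

/// Spearman's ρ, computed as the Pearson correlation of the rank vectors.
/// Assumption: this is the ρ the commit message refers to.
fn spearman_rho(x: &[f64], y: &[f64]) -> Option<f64> {
    if x.len() != y.len() {
        return None;
    }
    pearson(&average_ranks(x), &average_ranks(y))
}
```

Calling `spearman_rho(&distances, &similarities)` with one entry per text pair yields a value in [-1, 1]. Since a distance should fall as similarity rises, a well-calibrated distance would be expected to give a strongly negative value against a similarity score, so the sign convention needs checking before comparing against the 0.07 figure.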
https://claude.ai/code/session_019RzHP8tpJu55ESTxhfUy1A1
parent c4d91f3 commit 5c0352d
4 files changed
Lines changed: 5 additions & 0 deletions
File tree
- crates/thinking-engine/data
  - bge-m3-hdr
  - jina-v3-hdr
  - xlm-roberta-de
Large diffs are not rendered by default.