docs(dogfood): backfill jina-base Hit@k for v3.10.1-dev.80 (#1186)

carlos-alm · web-flow · commit 6ff057201734 · 2026-05-21T15:47:11.000-06:00
* docs(dogfood): backfill jina-base Hit@k for v3.10.1-dev.80 Replaces the placeholder jina-base row in §8 with actual recall numbers (Hit@1: 72.9%, Hit@3: 91.3%, Hit@5: 95.0%, misses: 41/1500) and rewrites the assessment to reflect the finding that jina-base (768d) underperforms jina-small (512d) at every rank cutoff on the code-identifier corpus. Reproduced against the dev.80 source commit (1a6ee7b) using the v3.10.1-dev.81 native tarball; the dev.80 tarball had been pruned from GitHub releases but the only commit between dev.80 and dev.81 is a CI-workflow refactor (4d8df7b) that leaves the Rust source unchanged. Re-running minilm and jina-small as controls produced +1-2 pp drift vs published values, attributable to a +2-file / +46-node corpus shift between session-time (612 files / 17,873 nodes) and re-run-time (614 files / 17,919 nodes). The footnote in §8 discloses this so future readers can read the jina-base row with the same tolerance. Closes #1181 * docs: correct embedding-size and miss-rate multipliers in dev.80 report (#1186) * docs: correct corpus-drift range in dev.80 jina-base footnote (#1186) The footnote claimed controls were ~+1-2 pp higher than published values but the only cited example (jina-small +0.4 pp) sat well below 1 pp. Replaces the inflated range with the actual observed deltas (minilm +1.2 pp, jina-small +0.4 pp) and updates the jina-base tolerance to match (~+0.4-1.2 pp instead of +/-1-2 pp).
diff --git a/generated/dogfood/DOGFOOD_REPORT_v3.10.1-dev.80.md b/generated/dogfood/DOGFOOD_REPORT_v3.10.1-dev.80.md
@@ -312,14 +312,17 @@ Observations:
 |-------|------:|------:|------:|-------:|
 | minilm (384d) | 981/1500 (65.4%) | 1291/1500 (86.1%) | 1367/1500 (91.1%) | 63 |
 | jina-small (512d) | 1168/1500 (77.9%) | 1402/1500 (93.5%) | 1445/1500 (96.3%) | 23 |
-| jina-base (768d) | _benchmark still running at report cut_ | | | |
+| jina-base (768d) † | 1094/1500 (72.9%) | 1370/1500 (91.3%) | 1425/1500 (95.0%) | 41 |
+
+† Backfilled in follow-up [#1181](https://github.com/optave/ops-codegraph-tool/issues/1181) after the session. Reproduced against the dev.80 source commit (`1a6ee7b`) with the `v3.10.1-dev.81` native binary — the Rust source is unchanged between dev.80 and dev.81 (only a CI-only commit between them), and the `v3.10.1-dev.80` GitHub release tarball had been pruned by the time the follow-up ran. Re-running `minilm` and `jina-small` as controls on the reproduced corpus produced numbers ~+0.4–1.2 pp higher than the published values (`minilm` Hit@5 92.3% vs 91.1%, a +1.2 pp delta; `jina-small` Hit@5 96.7% vs 96.3%, a +0.4 pp delta), attributable to a +2-file / +46-node corpus drift between session-time (612 files / 17,873 nodes) and re-run-time (614 files / 17,919 nodes). The jina-base row should be read with the same ±0.4–1.2 pp tolerance.
 
 ### Benchmark Assessment
 
 - No regressions vs the v3.10.0 baseline in `generated/benchmarks/BUILD-BENCHMARKS.md`. The corpus shrank (745 → 612 files) due to PR #1134's fixture exclusion, but per-file metrics improved on every engine.
 - Native fast-skip preflight (#1054) is firing as expected: 16 ms no-op rebuild matches WASM's, validating the `detectNoChanges` short-circuit.
 - The 1-file rebuild gap (WASM 45ms vs Native 67ms) is the inverse of full-build performance — WASM's lighter orchestrator setup wins on tiny incremental work.
-- jina-small is the recall sweet spot (96.3% Hit@5 with only 512d vectors) — minilm's 91% Hit@5 leaves embedding misses at 4× the jina-small rate.
+- jina-small remains the recall sweet spot — its 96.3% Hit@5 (512d) actually *beats* jina-base's 95.0% (768d) on this code-identifier corpus despite the larger model and 1.5× larger embeddings. The +1.3 pp Hit@5 gap holds at every rank cutoff (Hit@1: 77.9% vs 72.9%; Hit@3: 93.5% vs 91.3%; misses: 23 vs 41), suggesting the gain from going 512d → 768d is negative for split-identifier queries against a general-text encoder. The code-tuned variants (`jina-code`, `jina-embeddings-v2-base-code`) would likely close the gap — `jina-code` requires `HF_TOKEN` and was not run in this session.
+- minilm's 91.1% Hit@5 still leaves embedding misses at roughly 2.5× the jina-small rate (8.9% vs 3.7% miss rate; 63 vs 23 absolute misses), so the recall floor argument for jina-small over minilm holds. Picking jina-base over jina-small only pays off if you also need its 8192-token context window for long identifiers; otherwise it's strictly worse here.
 
 ---