docs(dogfood): backfill jina-base Hit@k for v3.10.1-dev.80 #1186
Replaces the placeholder jina-base row in §8 with actual recall numbers (Hit@1: 72.9%, Hit@3: 91.3%, Hit@5: 95.0%, misses: 41/1500) and rewrites the assessment to reflect the finding that jina-base (768d) underperforms jina-small (512d) at every rank cutoff on the code-identifier corpus. Reproduced against the dev.80 source commit (1a6ee7b) using the v3.10.1-dev.81 native tarball; the dev.80 tarball had been pruned from GitHub releases, but the only commit between dev.80 and dev.81 is a CI-workflow refactor (4d8df7b) that leaves the Rust source unchanged. Re-running minilm and jina-small as controls produced +0.4 to +1.2 pp drift vs published values (jina-small +0.4 pp, minilm +1.2 pp), attributable to a +2-file / +46-node corpus shift between session-time (612 files / 17,873 nodes) and re-run-time (614 files / 17,919 nodes). The footnote in §8 discloses this so future readers can read the jina-base row with the same tolerance.
Closes #1181
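For readers unfamiliar with the metric, the following is a minimal sketch of how a Hit@k percentage can be derived from per-query ranks. The function name `hit_at_k` and the toy `ranks` list are hypothetical and not taken from the benchmark script, whose logic is not shown in this PR. How the report counts "misses" is deliberately not modelled here.

```python
from typing import Optional, Sequence


def hit_at_k(ranks: Sequence[Optional[int]], k: int) -> float:
    """Percentage of queries whose expected result ranks within the top k.

    `ranks` holds the 1-based rank of the expected result for each query,
    or None when the expected result was not returned at all.
    """
    if not ranks:
        raise ValueError("ranks must not be empty")
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return 100.0 * hits / len(ranks)


# Toy data (hypothetical, not from the report): ten queries.
ranks = [1, 1, 2, 1, 4, 1, 3, None, 1, 7]

for k in (1, 3, 5):
    print(f"Hit@{k}: {hit_at_k(ranks, k):.1f}%")
```

Because Hit@k only counts ranks at or below the cutoff, the value can never decrease as k grows, which matches the ordering of the three columns in the table row.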
Claude finished @carlos-alm's task
PR Review Complete
Reviewing PR #1186: docs(dogfood): backfill jina-base Hit@k for v3.10.1-dev.80
Root Cause Analysis
Problem Being Solved: Issue #1181 requested completion of missing benchmark data in the v3.10.1-dev.80 dogfood report. The original report contained a placeholder for jina-base embedding performance.
Why This Approach: The author ran the missing jina-base benchmark using rigorous reproduction methodology.
Risk Assessment: ✅ LOW RISK - Pure documentation backfill with no code changes.
Backlog Compliance
Review Assessment
✅ Excellent Documentation Quality
✅ Proper Change Scope
✅ Scientific Rigor
The key finding (jina-base performs worse than jina-small) is well-supported.
✅ PR Interaction Handling
Author proactively noted interaction with PR #1180, which modifies the same line. Clear communication about rebase needs.
Minor Observations
Critical Concerns
None. This is a straightforward documentation improvement that completes missing benchmark data with appropriate scientific rigor.
Final Recommendation
The benchmark finding (jina-base underperforming jina-small on code identifiers) is valuable for users choosing embedding models, and the reproduction methodology sets a good standard for benchmark validation.
Greptile Summary
This PR backfills the jina-base (768d) embedding benchmark results that were missing (marked as still running) in the v3.10.1-dev.80 dogfood report. The numbers are added to the Hit@k table with a methodology footnote, and the §8 Benchmark Assessment prose is rewritten to reflect the headline finding that jina-small (512d) outperforms jina-base (768d) on this code-identifier corpus.
Confidence Score: 5/5
Documentation-only change to a single report file; no source, config, or test changes. Safe to merge after addressing the prose inaccuracies.
The change touches only a generated dogfood report. The new table row arithmetic is correct and the methodology footnote is well-explained. The two issues found are in the interpretive prose; both are documentation quality concerns with no impact on any code or system behaviour.
generated/dogfood/DOGFOOD_REPORT_v3.10.1-dev.80.md: lines 324-325 in the Benchmark Assessment section have the prose inaccuracies flagged above.
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["v3.10.1-dev.80 report\n(placeholder in §8 table)"] --> B["Re-run benchmark\n(dev.81 binary, dev.80 source @ 1a6ee7b)"]
B --> C["minilm control\nHit@5: 92.3% (+1.2 pp vs published)"]
B --> D["jina-small control\nHit@5: 96.7% (+0.4 pp vs published)"]
B --> E["jina-base NEW\nHit@5: 95.0%"]
C --> F["Corpus drift: +2 files / +46 nodes\n612→614 files, 17873→17919 nodes"]
D --> F
E --> F
F --> G["Backfill PR #1186 (closes #1181)\nReplace placeholder row\nUpdate §8 Assessment"]
G --> H["Recommendation:\njina-small beats jina-base at every rank\n(Hit@1: +5.0pp, Hit@3: +2.2pp, Hit@5: +1.3pp)"]
Reviews (6): Last reviewed commit: "docs: correct corpus-drift range in dev...."
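The per-rank gaps in the flowchart's final node can be re-derived from the Hit@k figures quoted in this PR. The sketch below is only an arithmetic check; the dictionary names are hypothetical. Note one assumption carried over from the PR text: the jina-small figures are the published session-time values, while the jina-base figures come from the re-run corpus.

```python
# Hit@k percentages as quoted in this PR.
jina_small = {1: 77.9, 3: 93.5, 5: 96.3}  # published (session-time) values
jina_base = {1: 72.9, 3: 91.3, 5: 95.0}   # re-run values from this PR

# Gap in percentage points at each rank cutoff, rounded to one decimal
# place to suppress floating-point noise.
gaps = {k: round(jina_small[k] - jina_base[k], 1) for k in jina_small}

for k, gap in gaps.items():
    print(f"Hit@{k}: jina-small ahead by {gap:+.1f} pp")

# jina-small leads at every cutoff only if all gaps are positive.
print(all(gap > 0 for gap in gaps.values()))
```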
- Native fast-skip preflight (#1054) is firing as expected: 16 ms no-op rebuild matches WASM's, validating the `detectNoChanges` short-circuit.
- The 1-file rebuild gap (WASM 45ms vs Native 67ms) is the inverse of full-build performance — WASM's lighter orchestrator setup wins on tiny incremental work.
- jina-small is the recall sweet spot (96.3% Hit@5 with only 512d vectors) — minilm's 91% Hit@5 leaves embedding misses at 4× the jina-small rate.
- jina-small remains the recall sweet spot — its 96.3% Hit@5 (512d) actually *beats* jina-base's 95.0% (768d) on this code-identifier corpus despite the larger model and 2× larger embeddings. The +1.3 pp Hit@5 gap holds at every rank cutoff (Hit@1: 77.9% vs 72.9%; Hit@3: 93.5% vs 91.3%; misses: 23 vs 41), suggesting the gain from going 512d → 768d is negative for split-identifier queries against a general-text encoder. The code-tuned variants (`jina-code`, `jina-embeddings-v2-base-code`) would likely close the gap — `jina-code` requires `HF_TOKEN` and was not run in this session.
Incorrect embedding size multiplier
The bullet states "2× larger embeddings" when comparing jina-small (512d) to jina-base (768d), but 768 ÷ 512 = 1.5 — the embeddings are 50% larger, not twice the size. This should read "1.5× larger embeddings." The error is small but notable given the rest of the analysis is numerically precise.
Fixed in 4e46b1c — corrected to "1.5× larger embeddings" (768 ÷ 512 = 1.5).
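The corrected multiplier is a one-line calculation. The variable names below are hypothetical and the snippet only restates the arithmetic from the review comment.

```python
small_dim = 512  # jina-small embedding width
base_dim = 768   # jina-base embedding width

ratio = base_dim / small_dim
percent_larger = (ratio - 1) * 100

print(f"jina-base vectors are {ratio}x the width of jina-small "
      f"({percent_larger:.0f}% larger)")
```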
Addressed Greptile feedback in 4e46b1c:
The footnote claimed controls were ~+1-2 pp higher than published values but the only cited example (jina-small +0.4 pp) sat well below 1 pp. Replaces the inflated range with the actual observed deltas (minilm +1.2 pp, jina-small +0.4 pp) and updates the jina-base tolerance to match (~+0.4-1.2 pp instead of +/-1-2 pp).
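To make the corrected tolerance concrete, the sketch below derives the range from the two observed control deltas and shows what it would imply for the jina-base Hit@5 value if jina-base had drifted upward by a similar amount. That last step is an illustrative assumption, not a claim made in the report, and all names are hypothetical.

```python
# Observed Hit@5 drift of the control models (re-run minus published), in pp.
control_deltas = {"minilm": 1.2, "jina-small": 0.4}

low = min(control_deltas.values())
high = max(control_deltas.values())
print(f"Observed control drift: +{low} to +{high} pp")

# Assumption for illustration only: jina-base drifted by an amount inside
# the same range. Its re-run Hit@5 is 95.0%.
jina_base_rerun = 95.0
adjusted_range = (round(jina_base_rerun - high, 1),
                  round(jina_base_rerun - low, 1))
print(f"Implied session-time jina-base Hit@5: "
      f"{adjusted_range[0]}% to {adjusted_range[1]}%")
```

Even at the upper end of that illustrative range, the value stays below the published 96.3% Hit@5 for jina-small, so the ordering of the two models does not depend on the drift correction.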
Addressed Greptile's remaining footnote finding in d203e1e:
Summary
Closes #1181. Replaces the jina-base placeholder in §8 of the v3.10.1-dev.80 dogfood report with actual recall numbers.
Headline finding: jina-base (768d, general-text) actually loses to jina-small (512d) at every rank cutoff on the codegraph code-identifier corpus. §8 Benchmark Assessment is rewritten to make that the recommendation: stick with jina-small unless you need the 8192-token context window for long identifiers. The `jina-code` variant would likely close the gap but requires `HF_TOKEN` and was not run.
Reproduction methodology
- `1a6ee7b`: the exact source state at which dev.80 was tagged.
- `v3.10.1-dev.81` darwin-arm64 tarball. The `v3.10.1-dev.80` GitHub release tarball had already been pruned by the time the follow-up ran. The only commit between dev.80 and dev.81 is `4d8df7b` (CI workflow refactor), so the Rust source is byte-identical and dev.81's native binary is functionally identical to a hypothetical dev.80 native binary.
- `.codegraph/graph.db` rebuilt with `engine: native, incremental: false, exclude: ['tests/benchmarks/resolution/fixtures/**']` to match the build-benchmark methodology that produced the original published numbers.
- Benchmark re-run (`__BENCH_MODEL__=<model>`) for each of the three target models: same `MAX_SYMBOLS=1500`, same `seededShuffle(arr, 42)` as the session-time runs.
Corpus-drift control
To verify reproduction integrity, `minilm` and `jina-small` were re-run as controls. Both produced numbers higher than the published values (Hit@5: minilm +1.2 pp, jina-small +0.4 pp). This is attributable to a +2-file / +46-node corpus shift between session-time (612 files / 17,873 nodes) and re-run-time (614 files / 17,919 nodes). The jina-base row in the table is from the same re-run-time corpus, so it should be read with the same tolerance (roughly +0.4 to +1.2 pp), which is captured in a footnote on the table.
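A fixed seed makes the sampled query set reproducible only when the underlying corpus is identical, which is one plausible reason a small corpus shift can move the numbers. The sketch below illustrates that with Python's `random.Random`; it is not the project's `seededShuffle` implementation, the symbol names are placeholders, and using the node counts as list sizes is an assumption made purely for illustration.

```python
import random
from typing import List, Sequence

MAX_SYMBOLS = 1500  # sample size quoted in the methodology


def seeded_sample(symbols: Sequence[str], seed: int = 42,
                  limit: int = MAX_SYMBOLS) -> List[str]:
    """Shuffle a copy of `symbols` with a fixed seed and keep the first `limit`."""
    shuffled = list(symbols)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:limit]


# Placeholder corpora sized like the session-time and re-run node counts.
session_corpus = [f"sym_{i}" for i in range(17_873)]
rerun_corpus = [f"sym_{i}" for i in range(17_919)]

first = seeded_sample(session_corpus)
second = seeded_sample(session_corpus)
rerun = seeded_sample(rerun_corpus)

# Same seed and same corpus give the same sample.
print("identical on same corpus:", first == second)
# The overlap between the two corpora's samples can be inspected directly.
print("shared symbols across corpora:", len(set(first) & set(rerun)))
```

Running the controls on the re-run corpus, as the PR does, is the practical way to measure how much of a difference such a shift makes.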
Interaction with PR #1180
PR #1180 (still open) edits the same line in §8 to clarify the jina-base placeholder text. Once this PR merges, the line PR #1180 modifies no longer exists. Will leave a comment on #1180 pointing at this PR so the maintainer can decide whether to rebase #1180 (its other change — the extractor-count fix on line 201 — is independent and still relevant).
Test plan
- Documentation-only change (`generated/dogfood/DOGFOOD_REPORT_v3.10.1-dev.80.md`); no source/config changes
- Footnote (`†`) renders correctly under the table in GitHub's markdown preview