docs(dogfood): backfill jina-base Hit@k for v3.10.1-dev.80#1186

Merged
carlos-alm merged 5 commits into main from docs/1181-jina-base-benchmark
May 21, 2026

@carlos-alm

Contributor

Summary

Closes #1181. Replaces the jina-base placeholder in §8 of the v3.10.1-dev.80 dogfood report with actual recall numbers.

| Model | Hit@1 | Hit@3 | Hit@5 | Misses |
|---|---|---|---|---|
| minilm (384d) | 65.4% | 86.1% | 91.1% | 63 |
| jina-small (512d) | 77.9% | 93.5% | 96.3% | 23 |
| jina-base (768d), new | 72.9% | 91.3% | 95.0% | 41 |

Headline finding: jina-base (768d, general-text) actually loses to jina-small (512d) at every rank cutoff on the codegraph code-identifier corpus. §8 Benchmark Assessment is rewritten to make that the recommendation: stick with jina-small unless you need the 8192-token context window for long identifiers. The jina-code variant would likely close the gap but requires HF_TOKEN and was not run.
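For readers unfamiliar with the metric, Hit@k is read here as the share of benchmark queries whose expected symbol lands in the top k results. The sketch below is a generic illustration under that assumption, not the benchmark's actual code; `hit_at_k` and the example ranks are hypothetical.

```python
from typing import Optional, Sequence


def hit_at_k(ranks: Sequence[Optional[int]], k: int) -> float:
    """Fraction of queries whose expected symbol appears in the top k results.

    ranks[i] is the 1-based rank of the expected symbol for query i,
    or None when the symbol was not returned at all.
    """
    if not ranks:
        raise ValueError("need at least one query")
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return hits / len(ranks)


# Hypothetical ranks for eight queries (illustrative only, not benchmark data).
example_ranks = [1, 1, 2, 4, 1, None, 3, 7]

for k in (1, 3, 5):
    print(f"Hit@{k}: {hit_at_k(example_ranks, k):.1%}")
# Hit@1: 37.5%
# Hit@3: 62.5%
# Hit@5: 75.0%
```

Under this definition Hit@k can only grow as k grows, which is why the table reports all three cutoffs rather than a single figure.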

Reproduction methodology

  • Worktree pinned to commit 1a6ee7b — the exact source state at which dev.80 was tagged.
  • Native binary: v3.10.1-dev.81 darwin-arm64 tarball. The v3.10.1-dev.80 GitHub release tarball had already been pruned by the time the follow-up ran. The only commit between dev.80 and dev.81 is 4d8df7b (CI workflow refactor), so the Rust source is byte-identical — dev.81's native binary is functionally identical to a hypothetical dev.80 native binary.
  • .codegraph/graph.db rebuilt with engine: native, incremental: false, exclude: ['tests/benchmarks/resolution/fixtures/**'] to match the build-benchmark methodology that produced the original published numbers.
  • Embedding benchmark invoked via worker mode (__BENCH_MODEL__=<model>) for each of the three target models — same MAX_SYMBOLS=1500, same seededShuffle(arr, 42) as the session-time runs.
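The fixed seed and symbol cap in the last bullet can be pictured as a deterministic shuffle followed by a truncation. This is a minimal Python sketch of that idea, assuming nothing about how the project's `seededShuffle` is implemented; `seeded_sample` and the synthetic corpus are hypothetical stand-ins.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")

MAX_SYMBOLS = 1500  # cap named in the methodology
SEED = 42           # seed named in the methodology


def seeded_sample(symbols: Sequence[T], seed: int = SEED,
                  limit: int = MAX_SYMBOLS) -> List[T]:
    """Shuffle a copy deterministically, then keep the first `limit` items."""
    shuffled = list(symbols)
    random.Random(seed).shuffle(shuffled)  # private RNG, no global state touched
    return shuffled[:limit]


# Synthetic stand-in for a corpus of symbols (names are made up).
corpus = [f"symbol_{i}" for i in range(17_919)]

first = seeded_sample(corpus)
second = seeded_sample(corpus)
print(len(first), first == second)  # 1500 True
```

The same seed over the same input yields the same 1500 symbols on every run. The permutation still depends on the input list, so a corpus that gains or loses a few nodes would generally yield a different sample even with the seed unchanged.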

Corpus-drift control

To verify reproduction integrity, minilm and jina-small were re-run as controls. Both produced numbers ~+1-2 pp higher than the published values:

| Model | Hit@5 published | Hit@5 re-run | Delta |
|---|---|---|---|
| minilm | 91.1% | 92.3% | +1.2 pp |
| jina-small | 96.3% | 96.7% | +0.4 pp |

This drift is attributable to a +2-file / +46-node corpus shift between session-time (612 files / 17,873 nodes) and re-run-time (614 files / 17,919 nodes). The jina-base row in the table comes from the same re-run-time corpus, so it should be read with the same ±1-2 pp tolerance, which is captured in a footnote on the table.
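The control deltas and the corpus shift quoted above can be re-derived from the stated figures. A small sketch, with variable names chosen for illustration:

```python
# Hit@5 percentages as quoted in the control table.
published = {"minilm": 91.1, "jina-small": 96.3}
rerun = {"minilm": 92.3, "jina-small": 96.7}

# Delta in percentage points, rounded to one decimal to hide float noise.
deltas = {model: round(rerun[model] - published[model], 1) for model in published}

# Corpus sizes quoted for session-time and re-run-time.
files_shift = 614 - 612
nodes_shift = 17_919 - 17_873

print(deltas)                    # {'minilm': 1.2, 'jina-small': 0.4}
print(files_shift, nodes_shift)  # 2 46
```

The observed deltas span 0.4 to 1.2 pp, so the "~+1-2 pp" wording is an upper-end approximation of what the two controls show.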

Interaction with PR #1180

PR #1180 (still open) edits the same line in §8 to clarify the jina-base placeholder text. Once this PR merges, the line PR #1180 modifies no longer exists. Will leave a comment on #1180 pointing at this PR so the maintainer can decide whether to rebase #1180 (its other change — the extractor-count fix on line 201 — is independent and still relevant).

Test plan

  • Diff limited to one report file (generated/dogfood/DOGFOOD_REPORT_v3.10.1-dev.80.md); no source/config changes
  • Footnote anchor () renders correctly under the table in GitHub's markdown preview
  • All numbers in the new row + §8 cross-checked against the raw worker output JSON

Replaces the placeholder jina-base row in §8 with actual recall numbers
(Hit@1: 72.9%, Hit@3: 91.3%, Hit@5: 95.0%, misses: 41/1500) and rewrites
the assessment to reflect the finding that jina-base (768d) underperforms
jina-small (512d) at every rank cutoff on the code-identifier corpus.

Reproduced against the dev.80 source commit (1a6ee7b) using the
v3.10.1-dev.81 native tarball; the dev.80 tarball had been pruned from
GitHub releases but the only commit between dev.80 and dev.81 is a
CI-workflow refactor (4d8df7b) that leaves the Rust source unchanged.
Re-running minilm and jina-small as controls produced +1-2 pp drift vs
published values, attributable to a +2-file / +46-node corpus shift
between session-time (612 files / 17,873 nodes) and re-run-time
(614 files / 17,919 nodes). The footnote in §8 discloses this so future
readers can read the jina-base row with the same tolerance.

Closes #1181
@claude

claude Bot commented May 21, 2026


Claude finished @carlos-alm's task


PR Review Complete

Reviewing PR #1186: docs(dogfood): backfill jina-base Hit@k for v3.10.1-dev.80

Todo List:

  • Phase 1: Root Cause Analysis - understand the problem being solved
  • Phase 2: Configuration Change Detection (check for risky changes)
  • Phase 3: Test Integrity Assessment
  • Phase 4: Code Quality Review
  • Phase 5: Backlog Compliance Rating
  • Read and analyze the modified dogfood report file
  • Provide comprehensive review feedback

Root Cause Analysis

Problem Being Solved: Issue #1181 requested completion of missing benchmark data in the v3.10.1-dev.80 dogfood report. The original report contained a placeholder for jina-base embedding performance (_benchmark still running at report cut_).

Why This Approach: The author ran the missing jina-base benchmark using rigorous reproduction methodology:

  • Pinned to exact source commit (1a6ee7b) matching dev.80
  • Used functionally identical binary (dev.81, differing only by CI commit 4d8df7b)
  • Controlled for corpus drift by re-running minilm/jina-small baselines
  • Documented methodology and limitations transparently

Risk Assessment: ✅ LOW RISK - Pure documentation backfill with no code changes.

Backlog Compliance

  • Zero-dep: ✓ — No runtime dependencies added
  • Foundation-aligned: ✓ — Documentation improvement, no principle violations
  • Problem-fit: 3 — Useful completeness but doesn't directly address core "lost AI" problem
  • Breaking: No — Documentation-only change
  • Tier: 1 (zero-dep ✓ + foundation-aligned ✓)

Review Assessment

Excellent Documentation Quality

  • Rigorous methodology: Author properly controlled for corpus drift (+2 files, +46 nodes) by re-running baseline models
  • Transparent limitations: Footnote clearly explains ±1-2pp tolerance due to corpus shift
  • Actionable insights: Benchmark assessment correctly identifies that jina-base (768d) underperforms jina-small (512d) on code identifiers, updating the recommendation accordingly

Proper Change Scope

  • Single file changed: generated/dogfood/DOGFOOD_REPORT_v3.10.1-dev.80.md
  • +5/-2 lines, focused only on the missing benchmark data
  • No code changes, no risky configuration modifications

Scientific Rigor

The key finding (jina-base performs worse than jina-small) is well-supported:

  • Consistent across all rank cutoffs (Hit@1: 72.9% vs 77.9%, Hit@3: 91.3% vs 93.5%, Hit@5: 95.0% vs 96.3%)
  • Methodology accounts for corpus drift by providing baseline controls
  • Explains the counterintuitive result (general-text encoder vs code-tuned variants)

PR Interaction Handling

Author proactively noted interaction with PR #1180, which modifies the same line. Clear communication about rebase needs.

Minor Observations

  • Commit message follows conventional commits format appropriately
  • GitHub issue link resolution works correctly (Closes #1181)
  • Footnote anchor renders properly in GitHub markdown

Critical Concerns

None. This is a straightforward documentation improvement that completes missing benchmark data with appropriate scientific rigor.

Final Recommendation

  • Rating: ⭐⭐⭐⭐☆ (4/5)
  • Action: APPROVE
  • Reasoning: High-quality documentation backfill with rigorous methodology, transparent limitations disclosure, and actionable insights. The only reason it's not 5 stars is that it's purely completing existing work rather than adding new value, but it's executed excellently.

The benchmark finding (jina-base underperforming jina-small on code identifiers) is valuable for users choosing embedding models, and the reproduction methodology sets a good standard for benchmark validation.


@greptile-apps

greptile-apps Bot commented May 21, 2026

Contributor

Greptile Summary

This PR backfills the jina-base (768d) embedding benchmark results that were missing (marked as still running) in the v3.10.1-dev.80 dogfood report. The numbers are added to the Hit@k table with a methodology footnote, and the §8 Benchmark Assessment prose is rewritten to reflect the headline finding that jina-small (512d) outperforms jina-base (768d) on this code-identifier corpus.

  • The new table row and arithmetic are internally consistent (1094/1500 = 72.9%, 1370/1500 = 91.3%, 1425/1500 = 95.0%), and the corpus-drift tolerance footnote is appropriately cautious.
  • Two inaccuracies exist in the new §8 prose: (1) "The +1.3 pp Hit@5 gap holds at every rank cutoff" implies a constant magnitude when the actual gaps are +5.0 pp at Hit@1 and +2.2 pp at Hit@3; (2) the parenthetical (8.9% vs 3.7% miss rate; 63 vs 23 absolute misses) mixes miss rates derived from Hit@5 (which imply 133 and 55 absolute misses at n=1500) with the Misses-column values 63 and 23, which measure a different metric.
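Both bullets can be checked with a few lines of arithmetic. The sketch below uses only numbers quoted in this thread; the variable names are illustrative, and integer tenths of a percent are used for the miss rates to avoid floating-point noise.

```python
N = 1500  # symbols per benchmark run

# Bullet 1: hit counts quoted for jina-base, converted to percentages.
jina_base_hits = {1: 1094, 3: 1370, 5: 1425}
pct = {k: round(100 * hits / N, 1) for k, hits in jina_base_hits.items()}
print(pct)  # {1: 72.9, 3: 91.3, 5: 95.0}

# Bullet 2, part 1: the jina-small lead differs at each rank cutoff.
jina_small = {1: 77.9, 3: 93.5, 5: 96.3}
jina_base = {1: 72.9, 3: 91.3, 5: 95.0}
gaps = {k: round(jina_small[k] - jina_base[k], 1) for k in jina_small}
print(gaps)  # {1: 5.0, 3: 2.2, 5: 1.3}

# Bullet 2, part 2: miss counts implied by Hit@5, in tenths of a percent
# (Hit@5 of 91.1% -> 89 tenths missed, 96.3% -> 37 tenths missed).
miss_tenths = {"minilm": 1000 - 911, "jina-small": 1000 - 963}
implied_misses = {model: tenths * N / 1000 for model, tenths in miss_tenths.items()}
print(implied_misses)  # {'minilm': 133.5, 'jina-small': 55.5}
```

The implied counts (about 133 and 55) do not match the Misses column (63 and 23), which supports the reading that the column tracks a different quantity from 100% minus Hit@5.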

Confidence Score: 5/5

Documentation-only change to a single report file; no source, config, or test changes. Safe to merge after addressing the prose inaccuracies.

The change touches only a generated dogfood report. The new table row arithmetic is correct and the methodology footnote is well-explained. The two issues found are in the interpretive prose — both are documentation quality concerns with no impact on any code or system behaviour.

generated/dogfood/DOGFOOD_REPORT_v3.10.1-dev.80.md — lines 324-325 in the Benchmark Assessment section have the prose inaccuracies flagged above.

Important Files Changed

Filename Overview
generated/dogfood/DOGFOOD_REPORT_v3.10.1-dev.80.md Backfills jina-base (768d) Hit@k numbers in §8, replaces placeholder row, adds footnote explaining corpus-drift methodology, and rewrites the Benchmark Assessment bullets. Two documentation-level inaccuracies in the new text: the "+1.3 pp gap holds at every rank" phrasing misrepresents a direction-only invariant, and the miss-rate/absolute-miss parenthetical mixes two incompatible metrics.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["v3.10.1-dev.80 report\n(placeholder in §8 table)"] --> B["Re-run benchmark\n(dev.81 binary, dev.80 source @ 1a6ee7b)"]
    B --> C["minilm control\nHit@5: 92.3% (+1.2 pp vs published)"]
    B --> D["jina-small control\nHit@5: 96.7% (+0.4 pp vs published)"]
    B --> E["jina-base NEW\nHit@5: 95.0%"]
    C --> F["Corpus drift: +2 files / +46 nodes\n612→614 files, 17873→17919 nodes"]
    D --> F
    E --> F
    F --> G["Backfill PR #1186 (closes #1181)\nReplace placeholder row\nUpdate §8 Assessment"]
    G --> H["Recommendation:\njina-small beats jina-base at every rank\n(Hit@1: +5.0pp, Hit@3: +2.2pp, Hit@5: +1.3pp)"]

Reviews (6): Last reviewed commit: "docs: correct corpus-drift range in dev...."

- Native fast-skip preflight (#1054) is firing as expected: 16 ms no-op rebuild matches WASM's, validating the `detectNoChanges` short-circuit.
- The 1-file rebuild gap (WASM 45ms vs Native 67ms) is the inverse of full-build performance — WASM's lighter orchestrator setup wins on tiny incremental work.
- jina-small is the recall sweet spot (96.3% Hit@5 with only 512d vectors) — minilm's 91% Hit@5 leaves embedding misses at 4× the jina-small rate.
- jina-small remains the recall sweet spot — its 96.3% Hit@5 (512d) actually *beats* jina-base's 95.0% (768d) on this code-identifier corpus despite the larger model and 2× larger embeddings. The +1.3 pp Hit@5 gap holds at every rank cutoff (Hit@1: 77.9% vs 72.9%; Hit@3: 93.5% vs 91.3%; misses: 23 vs 41), suggesting the gain from going 512d → 768d is negative for split-identifier queries against a general-text encoder. The code-tuned variants (`jina-code`, `jina-embeddings-v2-base-code`) would likely close the gap — `jina-code` requires `HF_TOKEN` and was not run in this session.

P1 Incorrect embedding size multiplier

The bullet states "2× larger embeddings" when comparing jina-small (512d) to jina-base (768d), but 768 ÷ 512 = 1.5 — the embeddings are 50% larger, not twice the size. This should read "1.5× larger embeddings." The error is small but notable given the rest of the analysis is numerically precise.

Contributor Author

Fixed in 4e46b1c — corrected to "1.5× larger embeddings" (768 ÷ 512 = 1.5).

@carlos-alm

Contributor Author

Addressed Greptile feedback in 4e46b1c:

  • §8 line 324: corrected "2× larger embeddings" → "1.5× larger embeddings" (768 ÷ 512 = 1.5).
  • §8 line 325: corrected miss-rate multiplier "roughly 4× the jina-small rate" → "roughly 2.5× the jina-small rate" with the underlying numbers spelled out inline (8.9% vs 3.7% miss rate; 63 vs 23 absolute misses).
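Both corrected multipliers follow from figures already quoted in the thread. A quick check, with illustrative variable names:

```python
# Embedding width: jina-base (768d) relative to jina-small (512d).
dim_ratio = 768 / 512

# minilm relative to jina-small, by Hit@5-derived miss rate (percent)...
rate_ratio = 8.9 / 3.7
# ...and by the absolute counts in the Misses column.
count_ratio = 63 / 23

print(f"{dim_ratio:.1f}x {rate_ratio:.2f}x {count_ratio:.2f}x")
# 1.5x 2.41x 2.74x
```

"Roughly 2.5×" sits between the two miss ratios, so it is a fair summary of either one, although the two pairs of figures come from different columns of the table.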

@carlos-alm

Contributor Author

@greptileai

carlos-alm and others added 2 commits May 21, 2026 15:06
The footnote claimed controls were ~+1-2 pp higher than published values
but the only cited example (jina-small +0.4 pp) sat well below 1 pp.
Replaces the inflated range with the actual observed deltas (minilm
+1.2 pp, jina-small +0.4 pp) and updates the jina-base tolerance to
match (~+0.4-1.2 pp instead of +/-1-2 pp).
@carlos-alm

Contributor Author

Addressed Greptile's remaining footnote finding in d203e1e:

  • §8 line 317: corrected drift range from "+1–2 pp higher" to "+0.4–1.2 pp higher" (matching the actual observed deltas: minilm +1.2 pp, jina-small +0.4 pp), and updated the jina-base tolerance accordingly. Also spelled out the minilm delta inline next to the existing jina-small example so the range is verifiable from the footnote itself.

@carlos-alm

Contributor Author

@greptileai

@carlos-alm carlos-alm merged commit 6ff0572 into main May 21, 2026
21 checks passed
@carlos-alm carlos-alm deleted the docs/1181-jina-base-benchmark branch May 21, 2026 21:47
@github-actions github-actions Bot locked and limited conversation to collaborators May 21, 2026

Development

Successfully merging this pull request may close these issues.

follow-up: complete jina-base (768d) embedding Hit@k benchmark for v3.10.1 dogfood report
