docs(dogfood): backfill jina-base Hit@k for v3.10.1-dev.80 by carlos-alm · Pull Request #1186 · optave/ops-codegraph-tool

carlos-alm · 2026-05-21T06:13:48Z

Summary

Closes #1181. Replaces the jina-base placeholder in §8 of the v3.10.1-dev.80 dogfood report with actual recall numbers.

| Model | Hit@1 | Hit@3 | Hit@5 | Misses |
|---|---|---|---|---|
| minilm (384d) | 65.4% | 86.1% | 91.1% | 63 |
| jina-small (512d) | 77.9% | 93.5% | 96.3% | 23 |
| jina-base (768d), new | 72.9% | 91.3% | 95.0% | 41 |

Headline finding: jina-base (768d, general-text) actually loses to jina-small (512d) at every rank cutoff on the codegraph code-identifier corpus. §8 Benchmark Assessment is rewritten to make that the recommendation: stick with jina-small unless you need the 8192-token context window for long identifiers. The jina-code variant would likely close the gap but requires HF_TOKEN and was not run.
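For readers unfamiliar with the metric, Hit@k is the share of queries whose expected symbol appears within the top k results. A minimal Python sketch, assuming each query yields the 1-based rank of its expected symbol (or `None` when it is not returned); the function name and input shape are illustrative and are not taken from the codegraph benchmark script:

```python
from typing import Optional, Sequence


def hit_at_k(ranks: Sequence[Optional[int]], k: int) -> float:
    """Fraction of queries whose expected symbol is ranked within the top k.

    `ranks` holds the 1-based rank of the expected symbol per query,
    or None when the symbol was not returned at all (hypothetical shape).
    """
    if not ranks:
        raise ValueError("ranks must not be empty")
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return hits / len(ranks)


# Toy data: five queries, one of which never finds its target.
example_ranks = [1, 2, 1, 6, None]
for k in (1, 3, 5):
    print(f"Hit@{k}: {hit_at_k(example_ranks, k):.1%}")
```

Because the hit set can only grow as k increases, Hit@1 ≤ Hit@3 ≤ Hit@5 for any single model, which is the pattern every row of the table follows.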

Reproduction methodology

- Worktree pinned to commit 1a6ee7b, the exact source state at which dev.80 was tagged.
- Native binary: v3.10.1-dev.81 darwin-arm64 tarball. The v3.10.1-dev.80 GitHub release tarball had already been pruned by the time the follow-up ran. The only commit between dev.80 and dev.81 is 4d8df7b (CI workflow refactor), so the Rust source is byte-identical and dev.81's native binary is functionally identical to a hypothetical dev.80 native binary.
- `.codegraph/graph.db` rebuilt with `engine: native`, `incremental: false`, `exclude: ['tests/benchmarks/resolution/fixtures/**']` to match the build-benchmark methodology that produced the original published numbers.
- Embedding benchmark invoked via worker mode (`__BENCH_MODEL__=<model>`) for each of the three target models, with the same `MAX_SYMBOLS=1500` and the same `seededShuffle(arr, 42)` as the session-time runs.
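The fixed seed and the symbol cap are what make separate runs comparable. A rough Python analogue of that sampling step is sketched below; `seeded_shuffle` and `sample_symbols` are hypothetical stand-ins for the project's `seededShuffle(arr, 42)` and `MAX_SYMBOLS=1500`, and Python's generator will not reproduce the ordering of the original helper:

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")

MAX_SYMBOLS = 1500  # cap quoted in the PR description
SEED = 42           # seed quoted in the PR description


def seeded_shuffle(items: Sequence[T], seed: int) -> List[T]:
    """Return a shuffled copy; the same seed always gives the same order."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def sample_symbols(symbols: Sequence[T], seed: int = SEED, limit: int = MAX_SYMBOLS) -> List[T]:
    """Deterministically pick at most `limit` symbols for the benchmark."""
    return seeded_shuffle(symbols, seed)[:limit]


# Two runs over the same corpus select the same symbols in the same order.
corpus = [f"symbol_{i}" for i in range(5000)]
assert sample_symbols(corpus) == sample_symbols(corpus)
print(len(sample_symbols(corpus)))
```

In a scheme like this the sample depends on the full input list, so even a small change in the corpus can change which symbols are drawn. That is one plausible way a modest corpus shift could move the measured numbers, though the sketch does not establish it for the real benchmark.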

Corpus-drift control

To verify reproduction integrity, minilm and jina-small were re-run as controls. Both produced Hit@5 numbers +0.4 to +1.2 pp higher than the published values:

| Model | Hit@5 published | Hit@5 re-run | Delta |
|---|---|---|---|
| minilm | 91.1% | 92.3% | +1.2 pp |
| jina-small | 96.3% | 96.7% | +0.4 pp |

The drift is attributable to a +2-file / +46-node corpus shift between session-time (612 files / 17,873 nodes) and re-run-time (614 files / 17,919 nodes). The jina-base row in the table is from the same re-run-time corpus, so it should be read with the same tolerance of roughly +0.4 to +1.2 pp, which is captured in a footnote on the table.
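The control deltas and the corpus shift can be recomputed from the figures quoted in this description. A short check, using only those published numbers (the variable names are illustrative):

```python
# Hit@5 values quoted in the corpus-drift table (percent).
published = {"minilm": 91.1, "jina-small": 96.3}
rerun = {"minilm": 92.3, "jina-small": 96.7}

# Corpus sizes quoted in the PR description: (files, nodes).
session_corpus = (612, 17_873)
rerun_corpus = (614, 17_919)

# Round to one decimal to match the precision of the reported percentages.
deltas = {model: round(rerun[model] - published[model], 1) for model in published}
file_shift = rerun_corpus[0] - session_corpus[0]
node_shift = rerun_corpus[1] - session_corpus[1]

for model, delta in deltas.items():
    print(f"{model}: {delta:+.1f} pp")
print(f"corpus shift: +{file_shift} files / +{node_shift} nodes")
```

The observed control deltas span +0.4 to +1.2 pp, which is the range a reader should keep in mind when comparing the jina-base row against the session-time rows.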

Interaction with PR #1180

PR #1180 (still open) edits the same line in §8 to clarify the jina-base placeholder text. Once this PR merges, the line PR #1180 modifies no longer exists. Will leave a comment on #1180 pointing at this PR so the maintainer can decide whether to rebase #1180 (its other change — the extractor-count fix on line 201 — is independent and still relevant).

Test plan

- Diff limited to one report file (`generated/dogfood/DOGFOOD_REPORT_v3.10.1-dev.80.md`); no source/config changes
- Footnote anchor (†) renders correctly under the table in GitHub's markdown preview
- All numbers in the new row + §8 cross-checked against the raw worker output JSON

Replaces the placeholder jina-base row in §8 with actual recall numbers (Hit@1: 72.9%, Hit@3: 91.3%, Hit@5: 95.0%, misses: 41/1500) and rewrites the assessment to reflect the finding that jina-base (768d) underperforms jina-small (512d) at every rank cutoff on the code-identifier corpus. Reproduced against the dev.80 source commit (1a6ee7b) using the v3.10.1-dev.81 native tarball; the dev.80 tarball had been pruned from GitHub releases but the only commit between dev.80 and dev.81 is a CI-workflow refactor (4d8df7b) that leaves the Rust source unchanged. Re-running minilm and jina-small as controls produced +1-2 pp drift vs published values, attributable to a +2-file / +46-node corpus shift between session-time (612 files / 17,873 nodes) and re-run-time (614 files / 17,919 nodes). The footnote in §8 discloses this so future readers can read the jina-base row with the same tolerance. Closes #1181

claude · 2026-05-21T06:14:02Z

Claude finished @carlos-alm's task.

PR Review Complete

Reviewing PR #1186: docs(dogfood): backfill jina-base Hit@k for v3.10.1-dev.80

Todo List:

Phase 1: Root Cause Analysis - understand the problem being solved
Phase 2: Configuration Change Detection (check for risky changes)
Phase 3: Test Integrity Assessment
Phase 4: Code Quality Review
Phase 5: Backlog Compliance Rating
Read and analyze the modified dogfood report file
Provide comprehensive review feedback

Root Cause Analysis

Problem Being Solved: Issue #1181 requested completion of missing benchmark data in the v3.10.1-dev.80 dogfood report. The original report contained a placeholder for jina-base embedding performance (_benchmark still running at report cut_).

Why This Approach: The author ran the missing jina-base benchmark using rigorous reproduction methodology:

Pinned to exact source commit (1a6ee7b) matching dev.80
Used functionally identical binary (dev.81, differing only by CI commit 4d8df7b)
Controlled for corpus drift by re-running minilm/jina-small baselines
Documented methodology and limitations transparently

Risk Assessment: ✅ LOW RISK - Pure documentation backfill with no code changes.

Backlog Compliance

Zero-dep: ✓ — No runtime dependencies added
Foundation-aligned: ✓ — Documentation improvement, no principle violations
Problem-fit: 3 — Useful completeness but doesn't directly address core "lost AI" problem
Breaking: No — Documentation-only change
Tier: 1 (zero-dep ✓ + foundation-aligned ✓)

Review Assessment

✅ Excellent Documentation Quality

Rigorous methodology: Author properly controlled for corpus drift (+2 files, +46 nodes) by re-running baseline models
Transparent limitations: Footnote clearly explains ±1-2pp tolerance due to corpus shift
Actionable insights: Benchmark assessment correctly identifies that jina-base (768d) underperforms jina-small (512d) on code identifiers, updating the recommendation accordingly

✅ Proper Change Scope

Single file changed: generated/dogfood/DOGFOOD_REPORT_v3.10.1-dev.80.md
+5/-2 lines, focused only on the missing benchmark data
No code changes, no risky configuration modifications

✅ Scientific Rigor

The key finding (jina-base performs worse than jina-small) is well-supported:

Consistent across all rank cutoffs (Hit@1: 72.9% vs 77.9%, Hit@3: 91.3% vs 93.5%, Hit@5: 95.0% vs 96.3%)
Methodology accounts for corpus drift by providing baseline controls
Explains the counterintuitive result (general-text encoder vs code-tuned variants)

✅ PR Interaction Handling

Author proactively noted interaction with PR #1180, which modifies the same line. Clear communication about rebase needs.

Minor Observations

Commit message follows conventional commits format appropriately
GitHub issue link resolution works correctly (Closes #1181)
Footnote anchor renders properly in GitHub markdown

Critical Concerns

None. This is a straightforward documentation improvement that completes missing benchmark data with appropriate scientific rigor.

Final Recommendation

Rating: ⭐⭐⭐⭐☆ (4/5)
Action: APPROVE
Reasoning: High-quality documentation backfill with rigorous methodology, transparent limitations disclosure, and actionable insights. The only reason it's not 5 stars is that it's purely completing existing work rather than adding new value, but it's executed excellently.

The benchmark finding (jina-base underperforming jina-small on code identifiers) is valuable for users choosing embedding models, and the reproduction methodology sets a good standard for benchmark validation.

greptile-apps · 2026-05-21T06:17:00Z

Greptile Summary

This PR backfills the jina-base (768d) embedding benchmark results that were missing (marked as still running) in the v3.10.1-dev.80 dogfood report. The numbers are added to the Hit@k table with a methodology footnote, and the §8 Benchmark Assessment prose is rewritten to reflect the headline finding that jina-small (512d) outperforms jina-base (768d) on this code-identifier corpus.

The new table row and arithmetic are internally consistent (1094/1500 = 72.9%, 1370/1500 = 91.3%, 1425/1500 = 95.0%), and the corpus-drift tolerance footnote is appropriately cautious.
Two inaccuracies exist in the new §8 prose: (1) "The +1.3 pp Hit@5 gap holds at every rank cutoff" implies a constant magnitude when the actual gaps are +5.0 pp at Hit@1 and +2.2 pp at Hit@3; (2) the parenthetical (8.9% vs 3.7% miss rate; 63 vs 23 absolute misses) mixes miss rates derived from Hit@5 (which imply 133 and 55 absolute misses at n=1500) with the Misses-column values 63 and 23, which measure a different metric.
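Both review points can be checked with a few lines of arithmetic. The sketch below uses only numbers quoted in this thread and assumes n = 1500 symbols per run, as stated in the reproduction methodology:

```python
N = 1500  # symbols per run (MAX_SYMBOLS)

# Hit counts quoted in the review for the jina-base row.
jina_base_hits = {1: 1094, 3: 1370, 5: 1425}
for k, hits in jina_base_hits.items():
    print(f"Hit@{k}: {hits / N:.1%}")

# Published Hit@k percentages for the two Jina models.
jina_small = {1: 77.9, 3: 93.5, 5: 96.3}
jina_base = {1: 72.9, 3: 91.3, 5: 95.0}
gaps = {k: round(jina_small[k] - jina_base[k], 1) for k in jina_small}
print(gaps)

# Misses implied by Hit@5 alone, for comparison with the Misses column.
hit5 = {"minilm": 91.1, "jina-small": 96.3}
implied_misses = {m: (100 - h) / 100 * N for m, h in hit5.items()}
print(implied_misses)
```

The gap shrinks from 5.0 pp at Hit@1 to 1.3 pp at Hit@5, so only its direction is constant across cutoffs. The Hit@5-implied misses come to about 133.5 and 55.5 (quoted in the review as 133 and 55), well above the 63 and 23 in the Misses column, which supports the point that the column measures something other than Hit@5 misses.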

Confidence Score: 5/5

Documentation-only change to a single report file; no source, config, or test changes. Safe to merge after addressing the prose inaccuracies.

The change touches only a generated dogfood report. The new table row arithmetic is correct and the methodology footnote is well-explained. The two issues found are in the interpretive prose — both are documentation quality concerns with no impact on any code or system behaviour.

generated/dogfood/DOGFOOD_REPORT_v3.10.1-dev.80.md — lines 324-325 in the Benchmark Assessment section have the prose inaccuracies flagged above.

Important Files Changed

| Filename | Overview |
|---|---|
| generated/dogfood/DOGFOOD_REPORT_v3.10.1-dev.80.md | Backfills jina-base (768d) Hit@k numbers in §8, replaces placeholder row, adds footnote explaining corpus-drift methodology, and rewrites the Benchmark Assessment bullets. Two documentation-level inaccuracies in the new text: the "+1.3 pp gap holds at every rank" phrasing misrepresents a direction-only invariant, and the miss-rate/absolute-miss parenthetical mixes two incompatible metrics. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["v3.10.1-dev.80 report\n(placeholder in §8 table)"] --> B["Re-run benchmark\n(dev.81 binary, dev.80 source @ 1a6ee7b)"]
    B --> C["minilm control\nHit@5: 92.3% (+1.2 pp vs published)"]
    B --> D["jina-small control\nHit@5: 96.7% (+0.4 pp vs published)"]
    B --> E["jina-base NEW\nHit@5: 95.0%"]
    C --> F["Corpus drift: +2 files / +46 nodes\n612→614 files, 17873→17919 nodes"]
    D --> F
    E --> F
    F --> G["Backfill PR #1186 (closes #1181)\nReplace placeholder row\nUpdate §8 Assessment"]
    G --> H["Recommendation:\njina-small beats jina-base at every rank\n(Hit@1: +5.0pp, Hit@3: +2.2pp, Hit@5: +1.3pp)"]

Reviews (6): Last reviewed commit: "docs: correct corpus-drift range in dev...."

greptile-apps · 2026-05-21T06:17:03Z

 - Native fast-skip preflight (#1054) is firing as expected: 16 ms no-op rebuild matches WASM's, validating the `detectNoChanges` short-circuit.
 - The 1-file rebuild gap (WASM 45ms vs Native 67ms) is the inverse of full-build performance — WASM's lighter orchestrator setup wins on tiny incremental work.
- jina-small is the recall sweet spot (96.3% Hit@5 with only 512d vectors) — minilm's 91% Hit@5 leaves embedding misses at 4× the jina-small rate.
+- jina-small remains the recall sweet spot — its 96.3% Hit@5 (512d) actually *beats* jina-base's 95.0% (768d) on this code-identifier corpus despite the larger model and 2× larger embeddings. The +1.3 pp Hit@5 gap holds at every rank cutoff (Hit@1: 77.9% vs 72.9%; Hit@3: 93.5% vs 91.3%; misses: 23 vs 41), suggesting the gain from going 512d → 768d is negative for split-identifier queries against a general-text encoder. The code-tuned variants (`jina-code`, `jina-embeddings-v2-base-code`) would likely close the gap — `jina-code` requires `HF_TOKEN` and was not run in this session.


Incorrect embedding size multiplier

The bullet states "2× larger embeddings" when comparing jina-small (512d) to jina-base (768d), but 768 ÷ 512 = 1.5 — the embeddings are 50% larger, not twice the size. This should read "1.5× larger embeddings." The error is small but notable given the rest of the analysis is numerically precise.

Fixed in 4e46b1c — corrected to "1.5× larger embeddings" (768 ÷ 512 = 1.5).

carlos-alm · 2026-05-21T16:54:49Z

Addressed Greptile feedback in 4e46b1c:

§8 line 324: corrected "2× larger embeddings" → "1.5× larger embeddings" (768 ÷ 512 = 1.5).
§8 line 325: corrected miss-rate multiplier "roughly 4× the jina-small rate" → "roughly 2.5× the jina-small rate" with the underlying numbers spelled out inline (8.9% vs 3.7% miss rate; 63 vs 23 absolute misses).
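Both corrected multipliers follow from simple ratios. A quick check using only figures quoted in this thread; note that the two miss ratios come from different metrics and bracket the "roughly 2.5×" wording instead of matching it exactly:

```python
# Embedding widths compared in the report.
size_ratio = 768 / 512
print(f"embedding size ratio: {size_ratio:.1f}x")

# Miss-rate ratio derived from Hit@5 (minilm 91.1%, jina-small 96.3%).
rate_ratio = (100 - 91.1) / (100 - 96.3)

# Ratio of the Misses column values (63 vs 23), a different metric.
count_ratio = 63 / 23

print(f"miss-rate ratio: {rate_ratio:.2f}x, miss-count ratio: {count_ratio:.2f}x")
```

Neither ratio is close to the original "4×" claim, and the embedding ratio is exactly 1.5, which matches the first correction.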

carlos-alm · 2026-05-21T16:55:01Z

@greptileai

The footnote claimed controls were ~+1-2 pp higher than published values but the only cited example (jina-small +0.4 pp) sat well below 1 pp. Replaces the inflated range with the actual observed deltas (minilm +1.2 pp, jina-small +0.4 pp) and updates the jina-base tolerance to match (~+0.4-1.2 pp instead of +/-1-2 pp).

carlos-alm · 2026-05-21T21:08:11Z

Addressed Greptile's remaining footnote finding in d203e1e:

§8 line 317: corrected drift range from "~+1–2 pp higher" to "~+0.4–1.2 pp higher" (matching the actual observed deltas: minilm +1.2 pp, jina-small +0.4 pp), and updated the jina-base tolerance accordingly. Also spelled out the minilm delta inline next to the existing jina-small example so the range is verifiable from the footnote itself.

carlos-alm · 2026-05-21T21:08:22Z

@greptileai

carlos-alm mentioned this pull request on May 21, 2026 in "docs: dogfood report for v3.10.1-dev.80" #1180 (merged, 2 tasks)

greptile-apps Bot reviewed May 21, 2026

carlos-alm and others added 2 commits May 21, 2026 10:50

Merge branch 'main' into docs/1181-jina-base-benchmark

80bf1c9

docs: correct embedding-size and miss-rate multipliers in dev.80 repo…

4e46b1c

…rt (#1186)

carlos-alm and others added 2 commits May 21, 2026 15:06

Merge branch 'main' into docs/1181-jina-base-benchmark

e9a6316

carlos-alm merged commit 6ff0572 into main May 21, 2026
21 checks passed

carlos-alm deleted the docs/1181-jina-base-benchmark branch May 21, 2026 21:47

github-actions Bot locked and limited conversation to collaborators May 21, 2026

carlos-alm merged 5 commits into main from docs/1181-jina-base-benchmark
