fix(paper_reviewer): include real paper body + bibliography in prompt; normalize score #197
Merged
Reviewers were issuing "no LaTeX source" / "no bibliography" verdicts on
arXiv-intake papers because they literally never saw the paper content:
* _concat_tex sorted .tex files alphabetically with a 60KB budget. For
a typical arXiv tarball (extra_pkgs.tex ≈ 3KB sorts first; main.tex
≈ 250KB sorts later), the budget got consumed by package
declarations and main.tex was always skipped. The reviewer's prompt
contained 3KB of \usepackage lines and a "(truncated; remaining
files: 2)" footer — no abstract, no methods, no results.
* state/citations/<PROJ>.yaml is never populated for arXiv-intake
papers, so the bibliography section was always "(no citations
recorded)" — even when paper/source/ref.bib was right there with
100+ entries.
* One specialist per project (~1/13) failed pydantic validation
because the LLM picked "accept" verdict but wrote score=0.0 (or
"minor_revision" with score=0.5). The score is purely derived from
the verdict — normalize on parse instead of losing a substantive
review to a numeric formatting slip.
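A minimal sketch of normalizing on parse, assuming the parsed response is a plain dict with verdict and score keys. The function name normalize_score and the table VERDICT_SCORES are hypothetical; only the accept -> 0.5 pairing is stated in this PR, and the other values are placeholders rather than the validator's real scale.

```python
from typing import Any, Dict, Mapping

# Only accept -> 0.5 comes from the PR description. The remaining values are
# placeholders for illustration, not the project's actual scale.
VERDICT_SCORES: Dict[str, float] = {
    "accept": 0.5,
    "minor_revision": 0.0,           # placeholder
    "major_revision_writing": -0.5,  # placeholder
    "major_revision_sci": -0.5,      # placeholder
}


def normalize_score(
    payload: Mapping[str, Any],
    verdict_scores: Mapping[str, float] = VERDICT_SCORES,
) -> Dict[str, Any]:
    """Return a copy of payload whose score is derived from its verdict.

    Unknown or non-string verdicts are left untouched, so a later
    validation step can still reject them.
    """
    fixed = dict(payload)
    verdict = fixed.get("verdict")
    if isinstance(verdict, str) and verdict in verdict_scores:
        fixed["score"] = verdict_scores[verdict]
    return fixed
```

Running this before validation means a mismatched number can no longer invalidate an otherwise usable review, while an unrecognized verdict still fails as before.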
Fixes:
1. _concat_tex now promotes the entry-point file (containing
\documentclass) to the front of the ordering, truncates IT to fit
budget if necessary (vs. silently skipping it), and the default
budget grew from 60KB → 180KB (~45K tokens, leaves room for the
response in a 128K context).
2. _summarize_bibfile fallback: when state/citations is empty, inline
paper/source/*.bib (capped at 30KB) so the reviewer can see what's
cited and judge the reference set.
3. handle_response normalizes score from verdict before validation.
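The ordering change in fix 1 can be sketched roughly as follows. This is an illustration under assumptions, not the merged code: it takes an in-memory mapping of filename to contents instead of reading paper/source/, counts the budget in characters rather than bytes, and the name concat_tex and the per-file header lines are made up for the example.

```python
from typing import List, Mapping

ENTRY_MARKER = "\\documentclass"


def concat_tex(files: Mapping[str, str], budget: int = 180_000) -> str:
    """Join .tex sources into one prompt section, entry point first.

    The budget is counted in characters of file content; the short
    per-file header lines are not charged against it.
    """
    names = sorted(files)
    # Promote the first file that contains \documentclass to the front.
    entry = next((n for n in names if ENTRY_MARKER in files[n]), None)
    if entry is not None:
        names.remove(entry)
        names.insert(0, entry)

    parts: List[str] = []
    skipped = 0
    remaining = budget
    for name in names:
        body = files[name]
        if len(body) <= remaining:
            parts.append(f"% --- {name} ---\n{body}")
            remaining -= len(body)
        elif name == entry and remaining > 0:
            # Truncate the entry point to fit instead of skipping it.
            parts.append(f"% --- {name} (truncated) ---\n{body[:remaining]}")
            remaining = 0
        else:
            skipped += 1
    if skipped:
        parts.append(f"(truncated; remaining files: {skipped})")
    return "\n".join(parts)
```

With a small budget, the entry point still leads the prompt and only the auxiliary files are dropped, which is the reverse of the alphabetical behavior described above.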
Verified against 8 previously-failing projects (PROJ-564, 565, 566,
568, 570, 571, 576, 578). All 8 now produce substantive 13-specialist
reviews instead of crashing or emitting boilerplate "no source provided"
verdicts. Aggregate verdicts:
* accept : PROJ-564, 565, 566, 576
* minor_revision : PROJ-568, 570, 571
* major_revision_sci: PROJ-578 (correctly flagged "GPT-5.4 /
Claude Sonnet 4.5 / Gemini-3.1-Pro" as unverifiable model names)
Reviews now reference specific Algorithms, Tables, Figures, and
hyperparameters by name — the LLM is reading and reasoning about the
actual paper, not the package preamble.
Adds 9 new unit tests (17 total in test_paper_reviewer_arxiv_intake).
Full unit suite (395 tests) passes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jeremymanning added a commit that referenced this pull request on May 17, 2026
Picks up the 8 previously-failing arxiv-intake papers (PROJ-564, 565, 566, 568, 570, 571, 576, 578) — all now have substantive 13-specialist reviews after PR #197 fixed the LaTeX-prompt truncation bug. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
* _concat_tex sorted alphabetically with a 60KB budget, so the prompt always contained extra_pkgs.tex (~3KB) + "(truncated; remaining files: 2)", and main.tex (~250KB) was never inlined. Reviewers correctly issued major_revision_writing verdicts citing "no LaTeX source", but they were judging the truncation, not the paper.
* state/citations/<PROJ>.yaml is never populated for intake projects, so the bibliography section was always "(no citations recorded)" even when paper/source/ref.bib had 100+ entries right there.
* One specialist per project (~1/13) failed pydantic validation because the LLM picked an accept verdict but wrote score: 0.0 (the validator requires score: 0.5 for accept).
Fixes
* _concat_tex rewritten: promote the entry-point file (\documentclass) to the front; truncate IT if needed instead of skipping; default budget bumped to 180KB.
* _summarize_bibfile fallback: when state/citations is empty, inline paper/source/*.bib (capped at 30KB) so the reviewer can judge the reference set.
* handle_response normalizes score from verdict before validation, since losing a substantive review to a numeric-formatting slip wasted Dartmouth calls.
Verification
Manually re-ran 8 previously-failing arxiv-intake projects (PROJ-564, 565, 566, 568, 570, 571, 576, 578). All 8 now produce substantive 13-specialist reviews instead of crashing or emitting boilerplate verdicts.
Reviews now reference Algorithms, Tables, Figures, and hyperparameters by name. The LLM is reading and reasoning about the actual paper, not the package preamble.
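For reference, the _summarize_bibfile fallback listed under Fixes might look roughly like the sketch below. It is a guess at the shape, not the merged implementation: the function name bibliography_section, its parameters, and the "% filename" header are hypothetical, and the 30KB cap is counted in characters here.

```python
from pathlib import Path
from typing import List, Sequence


def bibliography_section(
    citations: Sequence[str],
    source_dir: Path,
    cap: int = 30_000,
) -> str:
    """Build the bibliography part of the reviewer prompt.

    Recorded citations win; if there are none, inline the raw .bib files
    found in source_dir, capped at `cap` characters in total.
    """
    if citations:
        return "\n".join(citations)

    chunks: List[str] = []
    remaining = cap
    for bib in sorted(source_dir.glob("*.bib")):
        if remaining <= 0:
            break
        text = bib.read_text(encoding="utf-8", errors="replace")[:remaining]
        chunks.append(f"% {bib.name}\n{text}")
        remaining -= len(text)
    # Keep the original placeholder only when there is truly nothing to show.
    return "\n".join(chunks) if chunks else "(no citations recorded)"
```

The placeholder string survives only for projects that have neither recorded citations nor any .bib file in the source directory.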
Test plan
* test_paper_reviewer_arxiv_intake.py passes (8 new tests for truncation + bib + score-normalization)
🤖 Generated with Claude Code