fix(paper_reviewer): include real paper body + bibliography in prompt; normalize score by jeremymanning · Pull Request #197 · ContextLab/llmXive

jeremymanning · 2026-05-17T15:23:31Z

Summary

Reviewers on arXiv-intake papers were never seeing the paper content. _concat_tex sorted alphabetically with a 60KB budget, so the prompt always contained extra_pkgs.tex (~3KB) + (truncated; remaining files: 2) — and main.tex (~250KB) was never inlined. Reviewers correctly issued major_revision_writing verdicts citing "no LaTeX source", but they were judging the truncation, not the paper.
state/citations/<PROJ>.yaml is never populated for intake projects, so the bibliography section was always "(no citations recorded)" even when paper/source/ref.bib had 100+ entries right there.
~1/13 specialists per project failed pydantic validation because the LLM picked an accept verdict but wrote score: 0.0 (the validator requires score: 0.5 for accept).

Fixes

_concat_tex rewritten: the entry-point file (the one containing \documentclass) is promoted to the front of the ordering and truncated to fit the budget if needed instead of being skipped; the default budget is raised from 60KB to 180KB.
_summarize_bibfile fallback: when state/citations is empty, inline paper/source/*.bib (capped 30KB) so the reviewer can judge the reference set.
handle_response normalizes score from verdict before validation — losing a substantive review to a numeric-formatting slip wasted Dartmouth calls.
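
The two prompt-assembly fixes above can be sketched roughly as follows. This is an illustrative approximation, not the code merged in this PR: the names `concat_tex` and `bibliography_section`, the in-memory `dict` inputs, the `% --- name ---` header format, and measuring the budget in characters are all assumptions made to keep the example self-contained. Only the `\documentclass` marker, the 180KB and 30KB limits, and the two placeholder strings come from the description.

```python
from __future__ import annotations

ENTRY_MARKER = r"\documentclass"  # marks the entry-point file


def concat_tex(files: dict[str, str], budget: int = 180_000) -> str:
    # Hypothetical stand-in for _concat_tex: maps file name -> contents.
    # Entry-point files sort first; everything else stays alphabetical.
    ordered = sorted(files, key=lambda n: (ENTRY_MARKER not in files[n], n))

    parts: list[str] = []
    remaining = budget
    for i, name in enumerate(ordered):
        if remaining <= 0:
            # Budget exhausted: report how many files were left out.
            parts.append(f"(truncated; remaining files: {len(ordered) - i})")
            break
        body = files[name]
        chunk = body[:remaining]  # truncate instead of skipping the file
        remaining -= len(chunk)
        suffix = "\n(truncated)" if len(chunk) < len(body) else ""
        parts.append(f"% --- {name} ---\n{chunk}{suffix}")
    return "\n\n".join(parts)


def bibliography_section(
    recorded: list[str], bib_files: dict[str, str], cap: int = 30_000
) -> str:
    # Hypothetical stand-in for the _summarize_bibfile fallback.
    if recorded:
        return "\n".join(recorded)
    # No recorded citations: fall back to the raw .bib sources, capped.
    raw = "\n\n".join(bib_files[name] for name in sorted(bib_files))
    if not raw.strip():
        return "(no citations recorded)"
    return raw[:cap]
```

With a tarball shaped like the one in the Summary (a small `extra_pkgs.tex` and a 250KB `main.tex`), this ordering puts the start of `main.tex` at the top of the prompt and leaves the package file in the "remaining files" footer, the reverse of the old behaviour.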

Verification

Manually re-ran 8 previously-failing arxiv-intake projects (PROJ-564, 565, 566, 568, 570, 571, 576, 578). All 8 now produce substantive 13-specialist reviews instead of crashing or boilerplate:

| Project | Verdict | Highlight |
| --- | --- | --- |
| PROJ-564 | accept | "Global Skip Connections... resolves tripartite trade-off between compression, fidelity, diffusability" |
| PROJ-565 | accept | "Unified benchmark suite with 2,388 instances and 2,251 preference pairs" |
| PROJ-566 | accept | "Strong systems contribution with validated scaling axes" |
| PROJ-568 | minor_revision | substantive |
| PROJ-570 | minor_revision | "source file contamination detected" |
| PROJ-571 | minor_revision | "missing hyperparameter value for $\beta_k$"; references Eq. 12 and Algorithm 1 |
| PROJ-576 | accept | "Strong efficiency / quality trade-off for minute-scale generation" |
| PROJ-578 | major_revision_science | correctly flagged "GPT-5.4 / Claude Sonnet 4.5 / Gemini-3.1-Pro" as unverifiable model names |

Reviews now reference Algorithms, Tables, Figures, and hyperparameters by name. The LLM is reading and reasoning about the actual paper, not the package preamble.

Test plan

17 unit tests in test_paper_reviewer_arxiv_intake.py pass (8 new tests for truncation + bib + score-normalization)
Full unit suite (395 tests) passes
Verified manually against all 8 previously-failing arxiv-intake projects on disk
Next paper-review cron tick (every 16h) will confirm the fix sticks under real CI

🤖 Generated with Claude Code

…; normalize score

Reviewers were issuing "no LaTeX source" / "no bibliography" verdicts on arXiv-intake papers because they literally never saw the paper content:

* _concat_tex sorted .tex files alphabetically with a 60KB budget. For a typical arXiv tarball (extra_pkgs.tex ≈ 3KB sorts first; main.tex ≈ 250KB sorts later), the budget got consumed by package declarations and main.tex was always skipped. The reviewer's prompt contained 3KB of \usepackage lines and a "(truncated; remaining files: 2)" footer: no abstract, no methods, no results.
* state/citations/<PROJ>.yaml is never populated for arXiv-intake papers, so the bibliography section was always "(no citations recorded)", even when paper/source/ref.bib was right there with 100+ entries.
* One specialist per project (~1/13) failed pydantic validation because the LLM picked "accept" verdict but wrote score=0.0 (or "minor_revision" with score=0.5). The score is purely derived from the verdict, so normalize on parse instead of losing a substantive review to a numeric formatting slip.

Fixes:

1. _concat_tex now promotes the entry-point file (containing \documentclass) to the front of the ordering, truncates IT to fit budget if necessary (vs. silently skipping it), and the default budget grew from 60KB → 180KB (~45K tokens, leaves room for the response in a 128K context).
2. _summarize_bibfile fallback: when state/citations is empty, inline paper/source/*.bib (capped at 30KB) so the reviewer can see what's cited and judge the reference set.
3. handle_response normalizes score from verdict before validation.

Verified against 8 previously-failing projects (PROJ-564, 565, 566, 568, 570, 571, 576, 578). All 8 now produce substantive 13-specialist reviews instead of crashing or emitting boilerplate "no source provided" verdicts.

Aggregate verdicts:

* accept: PROJ-564, 565, 566, 576
* minor_revision: PROJ-568, 570, 571
* major_revision_sci: PROJ-578 (correctly flagged "GPT-5.4 / Claude Sonnet 4.5 / Gemini-3.1-Pro" as unverifiable model names)

Reviews now reference specific Algorithms, Tables, Figures, and hyperparameters by name: the LLM is reading and reasoning about the actual paper, not the package preamble.

Adds 9 new unit tests (17 total in test_paper_reviewer_arxiv_intake). Full unit suite (395 tests) passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
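
Since the commit message states that the score is purely derived from the verdict, the normalization step might look something like the sketch below. It is a guess at the shape, not the merged `handle_response` code: `normalize_score` and `VERDICT_SCORES` are hypothetical names, the review is modelled as a plain `dict`, and the pydantic validator itself is omitted. Only the `accept` → 0.5 pairing is stated in this PR; the 0.0 values for the other verdicts are an assumption, chosen to be consistent with `minor_revision` plus score=0.5 being rejected.

```python
from __future__ import annotations

# Assumed verdict -> score table. Only "accept": 0.5 is confirmed by the PR
# text; the 0.0 entries are placeholders for illustration.
VERDICT_SCORES: dict[str, float] = {
    "accept": 0.5,
    "minor_revision": 0.0,
    "major_revision_writing": 0.0,
    "major_revision_science": 0.0,
}


def normalize_score(review: dict) -> dict:
    # Hypothetical helper: overwrite the LLM-supplied score with the value
    # implied by the verdict, before the review reaches validation.
    verdict = review.get("verdict")
    if verdict not in VERDICT_SCORES:
        # Unknown or missing verdict: leave the payload untouched so the
        # validator can still reject it.
        return review
    # Return a copy so the raw LLM response is not mutated.
    return {**review, "score": VERDICT_SCORES[verdict]}
```

Running the normalization before validation means a review with a consistent verdict but a mismatched number is repaired rather than discarded, while a payload with an unrecognized verdict still fails as before.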

Picks up the 8 previously-failing arxiv-intake papers (PROJ-564, 565, 566, 568, 570, 571, 576, 578); all now have substantive 13-specialist reviews after PR #197 fixed the LaTeX-prompt truncation bug.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jeremymanning merged commit b57889b into main May 17, 2026
5 of 6 checks passed

jeremymanning deleted the fix/paper-reviewer-prompt-truncation branch May 17, 2026 16:27
