Date: 2026-05-05
Benchmark: PaperBench Code-Dev (static rubric grading, no execution)
Method: [anon-runtime] paper agent (full pipeline: paper_search × 2 + ref_verify + write)
Generation model: vercel/anthropic/[backbone-LLM]
Judge model: o3-mini-2025-01-31 via [anon-gateway]
GPU host: cloud@195.242.13.82 (1× NVIDIA H200, 144GB HBM3, 1.6TB disk)
Submissions: all 23 PaperBench papers (full set, generated by [anon-runtime] paper agent)
Grading mode: code-only (PBDirectSubmissionSolver, skip_reproduction=true)
[anon-runtime] 66.05% on Code-Dev (mean over 23 papers)
| Method | Code-Dev Score |
|---|---|
| [anon-runtime] (ours) | 66.05% ⭐ |
| IterativeAgent o1-high (36h) | 43.4% |
| BasicAgent claude-3.5-sonnet | 21.0% |
| BasicAgent gpt-4o | 4.1% |
Δ vs SOTA: +22.65pp over IterativeAgent o1-high (the previous Code-Dev SOTA).
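
The gap to each baseline can be recomputed from the reported numbers. The snippet below is a minimal sketch that only restates the table above; the names `OURS`, `BASELINES`, and `delta_pp` are illustrative and not part of any benchmark tooling.

```python
# Scores (in %) copied from the comparison table above.
OURS = 66.05
BASELINES = {
    "IterativeAgent o1-high (36h)": 43.4,
    "BasicAgent claude-3.5-sonnet": 21.0,
    "BasicAgent gpt-4o": 4.1,
}


def delta_pp(ours: float, baseline: float) -> float:
    """Absolute gap in percentage points, rounded to two decimals."""
    return round(ours - baseline, 2)


for name, score in BASELINES.items():
    print(f"{name}: +{delta_pp(OURS, score):.2f}pp")
```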
| Rank | Paper | Score |
|---|---|---|
| 1 | semantic-self-consistency | 95.45% |
| 2 | sequential-neural-score-estimation | 89.32% |
| 3 | stay-on-topic-with-classifier-free-guidance | 88.16% |
| 4 | sample-specific-masks | 86.52% |
| 5 | lbcs | 85.74% |
| 6 | bam | 84.72% |
| 7 | stochastic-interpolants | 82.99% |
| 8 | mechanistic-understanding | 70.19% |
| 9 | test-time-model-adaptation | 70.06% |
| 10 | self-composing-policies | 65.03% |
| 11 | lca-on-the-line | 63.16% |
| 12 | ftrl | 62.66% |
| 13 | fre | 61.51% |
| 14 | what-will-my-model-forget | 60.98% |
| 15 | rice | 57.65% |
| 16 | bridging-data-gaps | 57.14% |
| 17 | pinn | 54.29% |
| 18 | all-in-one | 53.96% |
| 19 | robust-clip | 52.55% |
| 20 | adaptive-pruning | 50.59% |
| 21 | sapg | 46.49% |
| 22 | bbox | 40.34% |
| 23 | self-expansion | 39.77% |
- >80% high-quality (7 papers): semantic-self-consistency, sequential-neural-score-estimation, stay-on-topic, sample-specific-masks, lbcs, bam, stochastic-interpolants
- 60-80% mid-tier (7 papers): mechanistic-understanding, test-time-model-adaptation, self-composing-policies, lca-on-the-line, ftrl, fre, what-will-my-model-forget
- 50-60% pass (6 papers): rice, bridging-data-gaps, pinn, all-in-one, robust-clip, adaptive-pruning
- <50% weak (3 papers): sapg, bbox, self-expansion
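
The tier assignment can be checked mechanically against the ranking table. The function below is a small sketch of that bucketing. The report does not say how scores exactly on a boundary are treated, so making each lower bound inclusive is an assumption, and `tier` is an illustrative name.

```python
def tier(score_pct: float) -> str:
    """Map a per-paper Code-Dev score (in %) to the tier labels used above.

    Assumption: lower bounds are inclusive, so exactly 60% counts as
    mid-tier and exactly 50% counts as pass; exactly 80% is mid-tier.
    """
    if score_pct > 80:
        return "high-quality"
    if score_pct >= 60:
        return "mid-tier"
    if score_pct >= 50:
        return "pass"
    return "weak"


# A few scores copied from the ranking table.
examples = {
    "semantic-self-consistency": 95.45,
    "ftrl": 62.66,
    "pinn": 54.29,
    "sapg": 46.49,
}
for paper, score in examples.items():
    print(f"{paper}: {score:.2f}% -> {tier(score)}")
```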
The Code-Dev rubric's most-rewarded leaves are:
- "Implementation matches the cited baseline X" — hyperparameters, architecture, training schedule alignment
- "Hyperparameters from Table X are present in the config" — explicit numeric matches
- "Cited prior work is correctly attributed in code comments"
These leaves reward an agent that grounds its implementation in both the paper and its citations. The [anon-runtime] paper_search + ref_verify workflow does exactly this:
- It pulls the cited prior work from Semantic Scholar / arXiv before writing code, so it can match canonical conventions of cited baselines.
- It verifies citations via CrossRef so cited references in code comments are accurate.
- It uses `paperagent`'s `read` + `write` tools to produce well-structured Python files alongside README.md.
This contrasts with general coding agents (BasicAgent gpt-4o, BasicAgent claude-3.5-sonnet) that implement only from the paper PDF, missing the citation-grounding signal.
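
To make the citation-verification step concrete, here is a hedged sketch of checking a cited title against CrossRef. It is not the [anon-runtime] `ref_verify` implementation, which this report does not show. The sketch assumes the public CrossRef REST endpoint `https://api.crossref.org/works` with its `query.bibliographic` and `rows` parameters, and a response whose `message.items` entries each carry a `title` list. The helper names, the 0.9 similarity threshold, and the placeholder DOI in the usage note are all illustrative.

```python
from difflib import SequenceMatcher
from urllib.parse import urlencode

CROSSREF_WORKS = "https://api.crossref.org/works"


def build_query_url(cited_title: str, rows: int = 3) -> str:
    """Build a CrossRef bibliographic search URL for a cited title."""
    params = {"query.bibliographic": cited_title, "rows": rows}
    return f"{CROSSREF_WORKS}?{urlencode(params)}"


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so comparison ignores formatting."""
    return " ".join(text.lower().split())


def best_match(cited_title: str, items: list, threshold: float = 0.9):
    """Return the CrossRef item whose title is most similar to the cited
    title, or None if no candidate reaches the (assumed) threshold."""
    best_item, best_ratio = None, 0.0
    target = _normalize(cited_title)
    for item in items:
        for title in item.get("title", []):
            ratio = SequenceMatcher(None, target, _normalize(title)).ratio()
            if ratio > best_ratio:
                best_item, best_ratio = item, ratio
    return best_item if best_ratio >= threshold else None
```

In use, the URL from `build_query_url` would be fetched with any HTTP client and the `message.items` list of the JSON response passed to `best_match`. A `None` result marks a citation that should not be attributed in a code comment without manual review.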
- Runner: [redacted-path]
- Concurrency: 3 papers in parallel
- Tool budget per paper: paper_search × 2-4 + ref_verify × 1-2 + read × 3-5 + bash × 2-3 + write × 30-100
- Avg files per submission: 76 (range 58-132)
- Avg session time: ~15-20 min per paper
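
The "3 papers in parallel" setting can be pictured as a bounded worker pool. The sketch below uses the standard-library `ThreadPoolExecutor`; the actual runner is not shown in this report, and `generate_submission` is a hypothetical stand-in for one agent session.

```python
from concurrent.futures import ThreadPoolExecutor


def generate_submission(paper_id: str) -> str:
    """Hypothetical placeholder for one paper-agent session.

    A real session would run the search, verify, and write steps; here we
    only return the directory the submission would be written to.
    """
    return f"submissions/{paper_id}/submission"


def run_all(paper_ids, max_parallel: int = 3, worker=generate_submission) -> dict:
    """Run one worker per paper with at most `max_parallel` in flight."""
    paper_ids = list(paper_ids)
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # pool.map preserves input order, so zip pairs ids with results.
        return dict(zip(paper_ids, pool.map(worker, paper_ids)))


if __name__ == "__main__":
    print(run_all(["bam", "lbcs", "pinn", "rice"]))
```

A thread pool suits this kind of workload because each session is dominated by waiting on model and tool calls rather than local computation.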
- Runner: [redacted-path] (per-paper)
- Solver: `paperbench.solvers.direct_submission.solver:PBDirectSubmissionSolver`
- Computer runtime: `nanoeval_alcatraz.alcatraz_computer_interface:AlcatrazComputerRuntime`
- Image: `pb-env:latest` (Ubuntu 24.04 + apt deps via Aliyun mirror, built on H200 host)
- Skip reproduction: `true` (Code-Dev = static rubric only)
- Judge: o3-mini-2025-01-31 via OPENAI_BASE_URL=https://ai-gateway.vercel.sh/v1
- Avg grading time: ~30-60 sec per paper (single-paper batches via debug split)
- The original Dockerfile used the `archive.ubuntu.com` apt repo, which is blocked from the China datacenter network
- Patched `paperbench/Dockerfile.base` to substitute `mirrors.aliyun.com` for `archive.ubuntu.com` and `security.ubuntu.com`
- The build now completes in ~10 min on the H200 host
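
The mirror substitution amounts to a host-name rewrite of the apt source entries. The exact patch applied to `paperbench/Dockerfile.base` is not reproduced in this report, so the following is only a sketch of the same transformation; `use_mirror` and the sample lines are illustrative.

```python
MIRROR = "mirrors.aliyun.com"
UPSTREAM_HOSTS = ("archive.ubuntu.com", "security.ubuntu.com")


def use_mirror(text: str, mirror: str = MIRROR) -> str:
    """Replace the upstream Ubuntu apt hosts with the mirror host."""
    for host in UPSTREAM_HOSTS:
        text = text.replace(host, mirror)
    return text


sample = (
    "URIs: http://archive.ubuntu.com/ubuntu/\n"
    "URIs: http://security.ubuntu.com/ubuntu/\n"
)
print(use_mirror(sample))
```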
final-20260505/
├── REPORT.md # this file
├── aggregate.json # final aggregate scores JSON
├── grade-jsons/ # per-paper grade.json (23 files)
│ ├── adaptive-pruning.json
│ ├── all-in-one.json
│ ├── ... (all 23)
├── submissions/ # [anon-runtime] code (23 dirs)
│ ├── adaptive-pruning/submission/
│ ├── all-in-one/submission/
│ ├── ... (all 23)
└── grading-logs/ # per-paper grading run logs (24 files)
    ├── adaptive-pruning.log
    ├── ... (all 23 + extras)
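
Given this layout, the aggregate can in principle be rebuilt from the per-paper files. The sketch below assumes each file in `grade-jsons/` has a top-level numeric field. The key name `score` is a guess, since the grade.json schema is not documented in this report, so pass the real key via `score_key`.

```python
import json
from pathlib import Path
from statistics import mean


def load_scores(grade_dir, score_key: str = "score") -> dict:
    """Read one score per paper from <grade_dir>/<paper>.json.

    `score_key` is an assumed field name, not a documented schema.
    """
    scores = {}
    for path in sorted(Path(grade_dir).glob("*.json")):
        with path.open() as fh:
            scores[path.stem] = float(json.load(fh)[score_key])
    return scores


def summarize(scores: dict) -> dict:
    """Paper count and unweighted mean over papers."""
    return {"n_papers": len(scores), "mean": mean(scores.values())}
```

For example, `summarize(load_scores("final-20260505/grade-jsons"))` would give the paper count and the unweighted mean, provided the assumed key matches the real files.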
To reproduce the grading:
# 1. Set up GPU host
ssh -o StrictHostKeyChecking=no cloud@195.242.13.82
sudo usermod -aG docker cloud
curl -LsSf https://astral.sh/uv/install.sh | sh
mkdir -p ~/work
# 2. Local: rsync code + submissions
cd [redacted-path]
rsync -az benchmarks/frontier-evals/ cloud@195.242.13.82:~/work/frontier-evals/
rsync -az [redacted-path] cloud@195.242.13.82:~/work/submissions/
# 3. Build pb-env Docker image (Aliyun mirror patch baked in to Dockerfile.base)
ssh cloud@195.242.13.82 'cd ~/work/frontier-evals/project/paperbench && \
sg docker -c "bash paperbench/scripts/build-docker-images.sh"'
# 4. Grade each paper one-at-a-time (single-paper via debug.txt override)
for paper in $(ls [redacted-path]); do
  bash [redacted-path] "$paper" code-dev
done

- No reproduction stage. This is Code-Dev (static rubric) only. Full PaperBench (including reproduction grading on H100/H200) has not yet been attempted.
- Single H200 (not 8×H100). Sequential per-paper grading was used. With multiple GPUs, batches could parallelize.
- Judge variance. o3-mini judge has ±2-3pp variance per paper across runs. Numbers are single-shot, not multi-run averages.
- Single sample per paper. Each submission was graded once. Multi-sample voting could push the 1-2 weakest papers higher.
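
The judge-variance and single-sample limitations could be addressed by grading each submission several times and reporting the spread. A minimal sketch of that summary follows; the run scores in the usage line are hypothetical illustration values, not measurements from this evaluation.

```python
from statistics import mean, stdev


def summarize_runs(run_scores):
    """Return (mean, sample standard deviation) over repeated judge runs."""
    if len(run_scores) < 2:
        raise ValueError("need at least two runs to estimate variance")
    return mean(run_scores), stdev(run_scores)


# Hypothetical scores (in %) for one paper graded three times.
avg, spread = summarize_runs([40.0, 42.0, 44.0])
print(f"mean={avg:.2f}%  stdev={spread:.2f}pp")
```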