PaperBench Code-Dev × [anon-runtime] — Final Report

Date: 2026-05-05
Benchmark: PaperBench Code-Dev (static rubric grading, no execution)
Method: [anon-runtime] paper agent (full pipeline: paper_search × 2 + ref_verify + write)
Generation model: vercel/anthropic/[backbone-LLM]
Judge model: o3-mini-2025-01-31 via [anon-gateway]
GPU host: cloud@195.242.13.82 (1× NVIDIA H200, 144GB HBM3, 1.6TB disk)
Submissions: all 23 PaperBench papers (full set, generated by [anon-runtime] paper agent)
Grading mode: code-only (PBDirectSubmissionSolver, skip_reproduction=true)

🏆 Headline Result

[anon-runtime] 66.05% on Code-Dev (mean over 23 papers)

Method	Code-Dev Score
[anon-runtime] (ours)	66.05% ⭐
IterativeAgent o1-high (36h)	43.4%
BasicAgent claude-3.5-sonnet	21.0%
BasicAgent gpt-4o	4.1%

Δ vs SOTA: +22.65pp over IterativeAgent o1-high (the previous Code-Dev SOTA).

Per-paper scores (sorted high-to-low)

Rank	Paper	Score
1	semantic-self-consistency	95.45%
2	sequential-neural-score-estimation	89.32%
3	stay-on-topic-with-classifier-free-guidance	88.16%
4	sample-specific-masks	86.52%
5	lbcs	85.74%
6	bam	84.72%
7	stochastic-interpolants	82.99%
8	mechanistic-understanding	70.19%
9	test-time-model-adaptation	70.06%
10	self-composing-policies	65.03%
11	lca-on-the-line	63.16%
12	ftrl	62.66%
13	fre	61.51%
14	what-will-my-model-forget	60.98%
15	rice	57.65%
16	bridging-data-gaps	57.14%
17	pinn	54.29%
18	all-in-one	53.96%
19	robust-clip	52.55%
20	adaptive-pruning	50.59%
21	sapg	46.49%
22	bbox	40.34%
23	self-expansion	39.77%

Distribution

>80% high-quality (7 papers): semantic-self-consistency, sequential-neural-score-estimation, stay-on-topic-with-classifier-free-guidance, sample-specific-masks, lbcs, bam, stochastic-interpolants
60-80% mid-tier (7 papers): mechanistic-understanding, test-time-model-adaptation, self-composing-policies, lca-on-the-line, ftrl, fre, what-will-my-model-forget
50-60% pass (6 papers): rice, bridging-data-gaps, pinn, all-in-one, robust-clip, adaptive-pruning
<50% weak (3 papers): sapg, bbox, self-expansion
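
The headline mean, the gap to the 43.4% baseline, and the tier assignments can be recomputed directly from the per-paper table. The following is a minimal Python sketch and not part of the grading pipeline. The names SCORES, BASELINE_PCT, and tier are illustrative, and the lower edge of each tier is assumed to be inclusive (a score of exactly 60% would count as 60-80%).

```python
from collections import defaultdict

# Per-paper Code-Dev scores in percent, copied from the table above.
SCORES = {
    "semantic-self-consistency": 95.45,
    "sequential-neural-score-estimation": 89.32,
    "stay-on-topic-with-classifier-free-guidance": 88.16,
    "sample-specific-masks": 86.52,
    "lbcs": 85.74,
    "bam": 84.72,
    "stochastic-interpolants": 82.99,
    "mechanistic-understanding": 70.19,
    "test-time-model-adaptation": 70.06,
    "self-composing-policies": 65.03,
    "lca-on-the-line": 63.16,
    "ftrl": 62.66,
    "fre": 61.51,
    "what-will-my-model-forget": 60.98,
    "rice": 57.65,
    "bridging-data-gaps": 57.14,
    "pinn": 54.29,
    "all-in-one": 53.96,
    "robust-clip": 52.55,
    "adaptive-pruning": 50.59,
    "sapg": 46.49,
    "bbox": 40.34,
    "self-expansion": 39.77,
}

BASELINE_PCT = 43.4  # IterativeAgent o1-high (36h), from the headline table


def tier(score: float) -> str:
    """Map a score to a Distribution tier (lower edges assumed inclusive)."""
    if score > 80:
        return ">80%"
    if score >= 60:
        return "60-80%"
    if score >= 50:
        return "50-60%"
    return "<50%"


mean_pct = sum(SCORES.values()) / len(SCORES)
delta_pp = mean_pct - BASELINE_PCT

tiers = defaultdict(list)
for paper, score in SCORES.items():
    tiers[tier(score)].append(paper)

print(f"papers={len(SCORES)} mean={mean_pct:.2f}% delta={delta_pp:+.2f}pp")
for name in (">80%", "60-80%", "50-60%", "<50%"):
    print(f"{name}: {len(tiers[name])} papers")
```

Because the table entries are rounded to two decimals, the recomputed mean can differ from the reported 66.05% in the last digit.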

Why [anon-runtime] wins on Code-Dev

The Code-Dev rubric's most-rewarded leaves are:

"Implementation matches the cited baseline X" — hyperparameters, architecture, training schedule alignment
"Hyperparameters from Table X are present in the config" — explicit numeric matches
"Cited prior work is correctly attributed in code comments"

These reward an agent that grounds its implementation in the paper AND its citations. [anon-runtime]'s paper_search + ref_verify workflow does exactly this:

It pulls the cited prior work from Semantic Scholar / arXiv before writing code, so it can match canonical conventions of cited baselines.
It verifies citations via CrossRef so cited references in code comments are accurate.
It uses the paper agent's read and write tools to produce well-structured Python files alongside a README.md.

This contrasts with general coding agents (BasicAgent gpt-4o, BasicAgent claude-3.5-sonnet) that implement only from the paper PDF, missing the citation-grounding signal.

Methodology

Stage 1: Code submission generation

Runner: [redacted-path]
Concurrency: 3 papers in parallel
Tool budget per paper: paper_search × 2-4 + ref_verify × 1-2 + read × 3-5 + bash × 2-3 + write × 30-100
Avg files per submission: 76 (range 58-132)
Avg session time: ~15-20 min per paper
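
The "3 papers in parallel" setting can be pictured as a bounded worker pool. The sketch below is an illustration only: the actual runner is not shown in this report, and generate_submission is a hypothetical placeholder for one agent session, not a real [anon-runtime] function.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL_PAPERS = 3  # matches "Concurrency: 3 papers in parallel"


def generate_submission(paper_id: str) -> str:
    """Hypothetical stand-in for one agent session.

    A real session would run the search, verify, and write steps; here we
    only return the directory the submission would be written to.
    """
    return f"submissions/{paper_id}/submission"


def run_all(paper_ids: list[str]) -> dict[str, str]:
    """Run one session per paper, never more than three at a time."""
    with ThreadPoolExecutor(max_workers=MAX_PARALLEL_PAPERS) as pool:
        # pool.map yields results in the same order as paper_ids.
        return dict(zip(paper_ids, pool.map(generate_submission, paper_ids)))


if __name__ == "__main__":
    for paper, out_dir in run_all(["bam", "lbcs", "ftrl", "fre"]).items():
        print(paper, "->", out_dir)
```

A thread pool is a reasonable fit under the assumption that each session spends most of its time waiting on model and network calls.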

Stage 2: Code-Dev grading

Runner: [redacted-path] (per-paper)
Solver: paperbench.solvers.direct_submission.solver:PBDirectSubmissionSolver
Computer runtime: nanoeval_alcatraz.alcatraz_computer_interface:AlcatrazComputerRuntime
Image: pb-env:latest (Ubuntu 24.04 + apt deps via Aliyun mirror, built on H200 host)
Skip reproduction: true (Code-Dev = static rubric only)
Judge: o3-mini-2025-01-31 via OPENAI_BASE_URL=https://ai-gateway.vercel.sh/v1
Avg grading time: ~30-60 sec per paper (single-paper batches via debug split)

Network workaround

The original Dockerfile used the archive.ubuntu.com apt repository, which is blocked from the China datacenter network.
Patched paperbench/Dockerfile.base to substitute mirrors.aliyun.com for archive.ubuntu.com and security.ubuntu.com.
The build now completes in ~10 min on the H200 host.
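
The exact patch to Dockerfile.base is not reproduced in this report. As a rough illustration of the hostname substitution it describes, the sketch below rewrites apt source lines in a string. The function name use_mirror and the sample source lines are assumptions made for the example.

```python
UPSTREAM_HOSTS = ("archive.ubuntu.com", "security.ubuntu.com")
MIRROR_HOST = "mirrors.aliyun.com"


def use_mirror(apt_sources_text: str) -> str:
    """Replace both upstream Ubuntu hosts with the mirror host."""
    for host in UPSTREAM_HOSTS:
        apt_sources_text = apt_sources_text.replace(host, MIRROR_HOST)
    return apt_sources_text


# Illustrative apt source lines, not taken from the real image.
sample = (
    "deb http://archive.ubuntu.com/ubuntu noble main\n"
    "deb http://security.ubuntu.com/ubuntu noble-security main\n"
)
patched = use_mirror(sample)
print(patched)
```

In a Dockerfile the same substitution would normally run as a build step before the first apt-get update, so that every later package install goes through the mirror.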

Files in this folder

final-20260505/
├── REPORT.md                          # this file
├── aggregate.json                     # final aggregate scores JSON
├── grade-jsons/                       # per-paper grade.json (23 files)
│   ├── adaptive-pruning.json
│   ├── all-in-one.json
│   └── ... (all 23)
├── submissions/                       # [anon-runtime] code (23 dirs)
│   ├── adaptive-pruning/submission/
│   ├── all-in-one/submission/
│   └── ... (all 23)
└── grading-logs/                      # per-paper grading run logs (24 files)
    ├── adaptive-pruning.log
    └── ... (all 23 + extras)

Reproducibility

To reproduce the grading:

# 1. Set up GPU host
ssh -o StrictHostKeyChecking=no cloud@195.242.13.82
sudo usermod -aG docker cloud
curl -LsSf https://astral.sh/uv/install.sh | sh
mkdir -p ~/work

# 2. Local: rsync code + submissions
cd [redacted-path]
rsync -az benchmarks/frontier-evals/ cloud@195.242.13.82:~/work/frontier-evals/
rsync -az [redacted-path] cloud@195.242.13.82:~/work/submissions/

# 3. Build pb-env Docker image (Aliyun mirror patch baked in to Dockerfile.base)
ssh cloud@195.242.13.82 'cd ~/work/frontier-evals/project/paperbench && \
    sg docker -c "bash paperbench/scripts/build-docker-images.sh"'

# 4. Grade each paper one-at-a-time (single-paper via debug.txt override)
for paper in $(ls [redacted-path]); do
  bash [redacted-path] "$paper" code-dev
done

Caveats

No reproduction stage. This is Code-Dev (static rubric) only. Full PaperBench (including reproduction grading on H100/H200) has not yet been attempted.
Single H200 (not 8×H100). Sequential per-paper grading was used. With multiple GPUs, batches could be parallelized.
Judge variance. The o3-mini judge shows ±2-3pp variance per paper across runs. The numbers here are single-shot, not multi-run averages.
Single sample per paper. Each submission was graded once. Multi-sample voting could push the 1-2 weakest papers higher.
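
To get a feel for how per-paper judge noise carries over to the 23-paper mean, the following back-of-the-envelope sketch may help. It rests on two assumptions that this report does not establish: the ±2-3pp figure is read as a per-paper standard deviation, and judge noise is treated as independent across papers and runs.

```python
import math


def standard_error_of_mean(per_paper_sd_pp: float, n_papers: int, n_runs: int = 1) -> float:
    """Standard error (in pp) of the mean score under independent judge noise."""
    return per_paper_sd_pp / math.sqrt(n_papers * n_runs)


N_PAPERS = 23

for sd in (2.0, 3.0):
    single = standard_error_of_mean(sd, N_PAPERS)
    triple = standard_error_of_mean(sd, N_PAPERS, n_runs=3)
    print(f"sd={sd:.1f}pp: 1 run -> ±{single:.2f}pp, 3 runs -> ±{triple:.2f}pp")
```

Under these assumptions the noise on the aggregate is well under 1pp, while any individual paper's score stays uncertain by the full per-paper amount.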
