PaperBench Code-Dev × [anon-runtime] — Final Report

  • Date: 2026-05-05
  • Benchmark: PaperBench Code-Dev (static rubric grading, no execution)
  • Method: [anon-runtime] paper agent (full pipeline: paper_search × 2 + ref_verify + write)
  • Generation model: vercel/anthropic/[backbone-LLM]
  • Judge model: o3-mini-2025-01-31 via [anon-gateway]
  • GPU host: cloud@195.242.13.82 (1× NVIDIA H200, 144GB HBM3, 1.6TB disk)
  • Submissions: all 23 PaperBench papers (full set, generated by [anon-runtime] paper agent)
  • Grading mode: code-only (PBDirectSubmissionSolver, skip_reproduction=true)


🏆 Headline Result

[anon-runtime] 66.05% on Code-Dev (mean over 23 papers)

| Method | Code-Dev Score |
|---|---|
| [anon-runtime] (ours) | 66.05% |
| IterativeAgent o1-high (36h) | 43.4% |
| BasicAgent claude-3.5-sonnet | 21.0% |
| BasicAgent gpt-4o | 4.1% |

Δ vs SOTA: +22.65pp over IterativeAgent o1-high (the previous Code-Dev SOTA).


Per-paper scores (sorted high-to-low)

| Rank | Paper | Score |
|---|---|---|
| 1 | semantic-self-consistency | 95.45% |
| 2 | sequential-neural-score-estimation | 89.32% |
| 3 | stay-on-topic-with-classifier-free-guidance | 88.16% |
| 4 | sample-specific-masks | 86.52% |
| 5 | lbcs | 85.74% |
| 6 | bam | 84.72% |
| 7 | stochastic-interpolants | 82.99% |
| 8 | mechanistic-understanding | 70.19% |
| 9 | test-time-model-adaptation | 70.06% |
| 10 | self-composing-policies | 65.03% |
| 11 | lca-on-the-line | 63.16% |
| 12 | ftrl | 62.66% |
| 13 | fre | 61.51% |
| 14 | what-will-my-model-forget | 60.98% |
| 15 | rice | 57.65% |
| 16 | bridging-data-gaps | 57.14% |
| 17 | pinn | 54.29% |
| 18 | all-in-one | 53.96% |
| 19 | robust-clip | 52.55% |
| 20 | adaptive-pruning | 50.59% |
| 21 | sapg | 46.49% |
| 22 | bbox | 40.34% |
| 23 | self-expansion | 39.77% |

Distribution

  • >80% high-quality (7 papers): semantic-self-consistency, sequential-neural-score-estimation, stay-on-topic, sample-specific-masks, lbcs, bam, stochastic-interpolants
  • 60-80% mid-tier (7 papers): mechanistic-understanding, test-time-model-adaptation, self-composing-policies, lca-on-the-line, ftrl, fre, what-will-my-model-forget
  • 50-60% pass (6 papers): rice, bridging-data-gaps, pinn, all-in-one, robust-clip, adaptive-pruning
  • <50% weak (3 papers): sapg, bbox, self-expansion
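
The headline mean, the gap to the previous best result, and the tier counts can all be recomputed from the per-paper table. The sketch below hard-codes the published scores; the `tier` and `summarize` helpers are illustrative names, not part of the grading pipeline. Because the table values are rounded to two decimals, the recomputed mean agrees with the reported 66.05% only to within rounding. With the thresholds applied strictly, adaptive-pruning (50.59%) falls in the 50-60% band.

```python
from statistics import mean

# Per-paper Code-Dev scores (percent), copied from the per-paper table.
SCORES = {
    "semantic-self-consistency": 95.45,
    "sequential-neural-score-estimation": 89.32,
    "stay-on-topic-with-classifier-free-guidance": 88.16,
    "sample-specific-masks": 86.52,
    "lbcs": 85.74,
    "bam": 84.72,
    "stochastic-interpolants": 82.99,
    "mechanistic-understanding": 70.19,
    "test-time-model-adaptation": 70.06,
    "self-composing-policies": 65.03,
    "lca-on-the-line": 63.16,
    "ftrl": 62.66,
    "fre": 61.51,
    "what-will-my-model-forget": 60.98,
    "rice": 57.65,
    "bridging-data-gaps": 57.14,
    "pinn": 54.29,
    "all-in-one": 53.96,
    "robust-clip": 52.55,
    "adaptive-pruning": 50.59,
    "sapg": 46.49,
    "bbox": 40.34,
    "self-expansion": 39.77,
}

# IterativeAgent o1-high (36h), taken from the headline table.
BASELINE_SOTA = 43.4


def tier(score: float) -> str:
    """Map a score to the tier labels used in the Distribution list."""
    if score > 80:
        return ">80% high-quality"
    if score >= 60:
        return "60-80% mid-tier"
    if score >= 50:
        return "50-60% pass"
    return "<50% weak"


def summarize(scores: dict) -> dict:
    """Return the mean score, the delta vs. the baseline, and papers per tier."""
    tiers = {}
    for paper, score in scores.items():
        tiers.setdefault(tier(score), []).append(paper)
    avg = mean(scores.values())
    return {"mean": avg, "delta_pp": avg - BASELINE_SOTA, "tiers": tiers}


if __name__ == "__main__":
    summary = summarize(SCORES)
    print(f"mean = {summary['mean']:.2f}%  delta = {summary['delta_pp']:+.2f}pp")
    for name, papers in summary["tiers"].items():
        print(f"{name}: {len(papers)} papers")
```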

Why [anon-runtime] wins on Code-Dev

The Code-Dev rubric's most-rewarded leaves are:

  1. "Implementation matches the cited baseline X" — hyperparameters, architecture, training schedule alignment
  2. "Hyperparameters from Table X are present in the config" — explicit numeric matches
  3. "Cited prior work is correctly attributed in code comments"

These reward an agent that grounds its implementation in both the paper and its citations. The [anon-runtime] paper_search + ref_verify workflow does exactly this:

  • It pulls the cited prior work from Semantic Scholar / arXiv before writing code, so it can match canonical conventions of cited baselines.
  • It verifies citations via CrossRef so cited references in code comments are accurate.
  • It uses the paper agent's read and write tools to produce well-structured Python files alongside a README.md.

This contrasts with general coding agents (BasicAgent gpt-4o, BasicAgent claude-3.5-sonnet), which implement from the paper PDF alone and so miss the citation-grounding signal.
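
To make the citation check concrete, here is a minimal sketch of a title lookup against the public Crossref REST API. It is not the actual ref_verify implementation, which this report does not show. The helper names (`build_query_url`, `best_match`, `verify`) and the 0.9 similarity threshold are assumptions chosen for illustration. Only the `/works` endpoint with its `query.bibliographic` and `rows` parameters, and the `message.items[].title` / `DOI` response fields, come from Crossref itself.

```python
import json
from difflib import SequenceMatcher
from urllib.parse import urlencode
from urllib.request import urlopen

CROSSREF_WORKS = "https://api.crossref.org/works"


def build_query_url(citation: str, rows: int = 3) -> str:
    """Build a Crossref bibliographic search URL for a free-text citation."""
    params = urlencode({"query.bibliographic": citation, "rows": rows})
    return f"{CROSSREF_WORKS}?{params}"


def best_match(cited_title: str, response: dict, threshold: float = 0.9):
    """Return (doi, title, similarity) for the closest hit, or None.

    The threshold is an illustrative assumption, not a tuned value.
    """
    best = None
    for item in response.get("message", {}).get("items", []):
        titles = item.get("title") or []  # Crossref returns titles as a list
        if not titles:
            continue
        sim = SequenceMatcher(None, cited_title.lower(), titles[0].lower()).ratio()
        if best is None or sim > best[2]:
            best = (item.get("DOI"), titles[0], sim)
    if best is not None and best[2] >= threshold:
        return best
    return None


def verify(cited_title: str, timeout: float = 10.0):
    """Query Crossref over the network and return the best match, if any."""
    with urlopen(build_query_url(cited_title), timeout=timeout) as resp:
        return best_match(cited_title, json.load(resp))
```

Keeping `best_match` free of network access means the matching logic can be exercised against a saved response, while `verify` stays a thin wrapper around the HTTP call.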


Methodology

Stage 1: Code submission generation

  • Runner: [redacted-path]
  • Concurrency: 3 papers in parallel
  • Tool budget per paper: paper_search × 2-4 + ref_verify × 1-2 + read × 3-5 + bash × 2-3 + write × 30-100
  • Avg files per submission: 76 (range 58-132)
  • Avg session time: ~15-20 min per paper
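
The "3 papers in parallel" setting corresponds to a bounded worker pool. The sketch below shows one way to express that limit. `generate_all` and the `generate_one` callback are hypothetical names, and the real runner (at the redacted path) may be structured quite differently.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_PARALLEL_PAPERS = 3  # matches "Concurrency: 3 papers in parallel"


def generate_all(papers, generate_one, max_workers=MAX_PARALLEL_PAPERS):
    """Run generate_one(paper) for every paper, at most max_workers at a time.

    Returns a dict mapping each paper to its result, or to the exception
    it raised, so one failed paper does not abort the rest of the batch.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(generate_one, paper): paper for paper in papers}
        for future in as_completed(futures):
            paper = futures[future]
            try:
                results[paper] = future.result()
            except Exception as exc:  # record the failure and keep going
                results[paper] = exc
    return results
```

Threads are a reasonable fit here on the assumption that each session is dominated by waiting on model and tool calls rather than local CPU work.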

Stage 2: Code-Dev grading

  • Runner: [redacted-path] (per-paper)
  • Solver: paperbench.solvers.direct_submission.solver:PBDirectSubmissionSolver
  • Computer runtime: nanoeval_alcatraz.alcatraz_computer_interface:AlcatrazComputerRuntime
  • Image: pb-env:latest (Ubuntu 24.04 + apt deps via Aliyun mirror, built on H200 host)
  • Skip reproduction: true (Code-Dev = static rubric only)
  • Judge: o3-mini-2025-01-31 via OPENAI_BASE_URL=https://ai-gateway.vercel.sh/v1
  • Avg grading time: ~30-60 sec per paper (single-paper batches via debug split)

Network workaround

  • The original Dockerfile used the archive.ubuntu.com apt repository, which is blocked from the China datacenter network
  • Patched paperbench/Dockerfile.base to substitute mirrors.aliyun.com for archive.ubuntu.com and security.ubuntu.com
  • Build now completes in ~10 min on H200 host
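
The mirror patch amounts to a host-name substitution in the apt source definitions. The actual change lives in paperbench/Dockerfile.base and is not reproduced here; the sketch below only illustrates the rewrite on a sample sources file. `use_mirror` is a hypothetical helper, and the sample text is illustrative rather than copied from the image.

```python
UBUNTU_HOSTS = ("archive.ubuntu.com", "security.ubuntu.com")
MIRROR_HOST = "mirrors.aliyun.com"


def use_mirror(apt_sources: str, mirror: str = MIRROR_HOST) -> str:
    """Replace the default Ubuntu apt hosts with a mirror host."""
    for host in UBUNTU_HOSTS:
        apt_sources = apt_sources.replace(host, mirror)
    return apt_sources


if __name__ == "__main__":
    # Illustrative deb822-style entries; not taken from the real image.
    sample = (
        "Types: deb\n"
        "URIs: http://archive.ubuntu.com/ubuntu/\n"
        "Suites: noble noble-updates\n"
        "\n"
        "Types: deb\n"
        "URIs: http://security.ubuntu.com/ubuntu/\n"
        "Suites: noble-security\n"
    )
    print(use_mirror(sample))
```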

Files in this folder

final-20260505/
├── REPORT.md                          # this file
├── aggregate.json                     # final aggregate scores JSON
├── grade-jsons/                       # per-paper grade.json (23 files)
│   ├── adaptive-pruning.json
│   ├── all-in-one.json
│   ├── ... (all 23)
├── submissions/                       # [anon-runtime] code (23 dirs)
│   ├── adaptive-pruning/submission/
│   ├── all-in-one/submission/
│   ├── ... (all 23)
└── grading-logs/                      # per-paper grading run logs (24 files)
    ├── adaptive-pruning.log
    ├── ... (all 23 + extras)

Reproducibility

To reproduce the grading:

# 1. Set up GPU host
ssh -o StrictHostKeyChecking=no cloud@195.242.13.82
sudo usermod -aG docker cloud
curl -LsSf https://astral.sh/uv/install.sh | sh
mkdir -p ~/work

# 2. Local: rsync code + submissions
cd [redacted-path]
rsync -az benchmarks/frontier-evals/ cloud@195.242.13.82:~/work/frontier-evals/
rsync -az [redacted-path] cloud@195.242.13.82:~/work/submissions/

# 3. Build pb-env Docker image (Aliyun mirror patch baked in to Dockerfile.base)
ssh cloud@195.242.13.82 'cd ~/work/frontier-evals/project/paperbench && \
    sg docker -c "bash paperbench/scripts/build-docker-images.sh"'

# 4. Grade each paper one-at-a-time (single-paper via debug.txt override)
for paper in $(ls [redacted-path]); do
  bash [redacted-path] "$paper" code-dev
done

Caveats

  1. No reproduction stage. This is Code-Dev (static rubric) only. Full PaperBench (including reproduction grading on H100/H200) has not yet been attempted.
  2. Single H200 (not 8×H100). Sequential per-paper grading was used. With multiple GPUs, batches could be parallelized.
  3. Judge variance. The o3-mini judge shows roughly ±2-3pp variance per paper across runs. The numbers here are single-shot, not multi-run averages.
  4. Single sample per paper. Each submission was graded once. Multi-sample voting could push the 1-2 weakest papers higher.
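
If multi-run grading is added later, the judge noise can be summarized per paper with a mean and a sample standard deviation. The helpers below are a minimal sketch with hypothetical names. The default `noise_pp=3.0` simply reuses the upper end of the ±2-3pp figure quoted above and is an assumption, not a measured bound.

```python
from statistics import mean, stdev


def summarize_runs(run_scores):
    """Return (mean, sample standard deviation) for repeated grades of one paper."""
    if len(run_scores) < 2:
        raise ValueError("need at least two grading runs to estimate spread")
    return mean(run_scores), stdev(run_scores)


def is_within_noise(score_a, score_b, noise_pp=3.0):
    """True if two single-shot scores differ by no more than the assumed judge noise."""
    return abs(score_a - score_b) <= noise_pp
```

By this measure, adjacent ranks in the per-paper table that differ by less than about 3pp should not be read as a meaningful ordering.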