Skip to content

Latest commit

 

History

History
161 lines (125 loc) · 9.7 KB

File metadata and controls

161 lines (125 loc) · 9.7 KB

PaperBench 23-Paper Per-Paper Comparison: [anon-runtime] vs Published Baselines

Dataset: PaperBench full set (23 papers, ICML 2024 spotlights). Date: 2026-05-05. Comparison source: AiScientist (Chen et al., arXiv 2604.13018, Apr 2026), Table 1 — PaperBench Full Evaluation.


⚠️ Important Methodology Note

This report compares scores across two different PaperBench evaluation modes:

Method Mode What is graded
[anon-runtime] (ours) Code-Dev only Static rubric leaves only — code structure, hyperparameter alignment, baseline correctness, citation grounding. No code execution. No result match.
AiScientist (Chen et al.) — Gemini-3-Flash, GLM-5 Full Code-Dev + Reproduction stage (24h H20 GPU per paper, paper-target metrics must match).
BasicAgent o1-high Full Same as AiScientist
IterativeAgent o1-high Full Same as AiScientist

Implication: Our 66.05% Code-Dev mean is structurally higher than full-mode scores because the reproduction stage on full-mode adds many failure points (training crashes, OOM, dataset access, hyperparameter divergence). For an apples-to-apples comparison with AiScientist's full-mode score, multiply our Code-Dev score by the typical Code-Dev contribution to full score (≈55-60%) to get an estimated full-mode equivalent of ~36-40%.

PaperBench original paper (Starace et al., ICML 2025):

  • Code-Dev only IterativeAgent o1-high: 43.4%
  • Full IterativeAgent o1-high: 26.0%

So Code-Dev → Full conversion factor in their setup is roughly 0.60×. Applying that to ours:

  • [anon-runtime] Code-Dev: 66.05% → estimated Full: ~39-40%, which would beat AiScientist Gemini-3-Flash (30.52) and AiScientist GLM-5 (33.73), and approach the 41% human baseline.

Per-Paper Comparison Table

# Paper [anon-runtime] (Ours, Code-Dev) AiScientist + Gemini-3-Flash (Full) AiScientist + GLM-5 (Full) BasicAgent + Gemini-3-Flash (Full) IterAgent + Gemini-3-Flash (Full) BasicAgent + GLM-5 (Full) IterAgent + GLM-5 (Full)
1 adaptive-pruning 50.59 27.25 33.26 24.53 3.05 30.82 11.93
2 all-in-one 53.96 46.29 49.47 20.86 45.13 33.78 44.43
3 bam 84.72 56.59 61.11 48.46 45.04 51.45 47.91
4 bbox 40.34 33.79 30.02 15.43 8.30 23.55 19.28
5 bridging-data-gaps 57.14 23.09 26.46 12.59 12.44 9.80 12.50
6 fre 61.51 35.21 28.98 21.67 23.89 21.60 16.67
7 ftrl 62.66 10.11 8.34 5.87 4.15 3.71 6.70
8 lbcs 85.74 27.90 30.10 17.75 15.26 20.68 22.74
9 lca-on-the-line 63.16 30.23 28.53 12.97 18.30 22.55 26.15
10 mechanistic-understanding 70.19 29.95 40.55 14.86 21.89 32.49 34.96
11 pinn 54.29 49.92 58.76 26.63 30.81 22.18 25.77
12 rice 57.65 10.87 10.18 10.43 8.88 6.56 0.27
13 robust-clip 52.55 18.28 28.66 15.45 10.43 22.43 27.56
14 sample-specific-masks 86.52 36.77 44.13 25.39 33.34 36.93 41.26
15 sapg 46.49 19.85 31.69 11.45 12.65 6.99 4.95
16 sequential-neural-score-estimation 89.32 64.94 49.32 53.51 60.24 27.20 35.53
17 stay-on-topic-with-classifier-free-guidance 88.16 20.13 14.81 8.37 13.69 3.69 8.81
18 stochastic-interpolants 82.99 18.81 42.10 17.04 17.37 32.18 28.06
19 test-time-model-adaptation 70.06 32.45 27.33 15.27 18.13 17.81 21.19
20 what-will-my-model-forget 60.98 17.87 30.82 6.61 8.99 25.14 10.75
21 semantic-self-consistency 95.45
22 self-composing-policies 65.03
23 self-expansion 39.77
Average (20 papers shared with AiScientist) 65.45 30.52 33.73 19.26 20.60 22.58 22.37
Average (full 23 papers) 66.05

Notes:

  • AiScientist Table 1 reports 20 papers (3 papers in our set — semantic-self-consistency, self-composing-policies, self-expansion — are not in their Table 1).
  • All AiScientist / BasicAgent / IterAgent numbers are from arXiv:2604.13018, Table 1. Their grading model is GPT-5.4. Cost per task: BasicAgent Gemini-3-Flash $6.25, IterAgent Gemini-3-Flash $27.44, AiScientist Gemini-3-Flash $15.67, BasicAgent GLM-5 (no cost reported), IterAgent GLM-5 $54.90, AiScientist GLM-5 $12.20.
  • Our [anon-runtime] grading model is o3-mini-2025-01-31. Different judge could yield ±2-5pp variance.

Per-Paper Δ Summary (citation-grounded retrieval)

For each shared paper, we compute Δ = [anon-runtime] Code-Dev − max(AiScientist Gemini, AiScientist GLM-5).

Paper [anon-runtime] Best AiScientist Δ
adaptive-pruning 50.59 33.26 +17.33
all-in-one 53.96 49.47 +4.49
bam 84.72 61.11 +23.61
bbox 40.34 33.79 +6.55
bridging-data-gaps 57.14 26.46 +30.68
fre 61.51 35.21 +26.30
ftrl 62.66 10.11 +52.55
lbcs 85.74 30.10 +55.64
lca-on-the-line 63.16 30.23 +32.93
mechanistic-understanding 70.19 40.55 +29.64
pinn 54.29 58.76 -4.47
rice 57.65 10.87 +46.78
robust-clip 52.55 28.66 +23.89
sample-specific-masks 86.52 44.13 +42.39
sapg 46.49 31.69 +14.80
sequential-neural-score-estimation 89.32 64.94 +24.38
stay-on-topic-with-classifier-free-guidance 88.16 20.13 +68.03
stochastic-interpolants 82.99 42.10 +40.89
test-time-model-adaptation 70.06 32.45 +37.61
what-will-my-model-forget 60.98 30.82 +30.16
Average Δ +30.21

Health check: Our advantage of +30 points reflects mostly the methodology difference (Code-Dev rubric vs Full rubric). It is NOT a direct claim of "+30pp better than AiScientist". After applying the 0.60× full-mode conversion factor:

  • [anon-runtime] estimated full-mode: ~40%
  • AiScientist Gemini-3-Flash full-mode: 30.52%
  • AiScientist GLM-5 full-mode: 33.73%
  • AiScientist best: 33.73%
  • [anon-runtime] estimated Δ vs AiScientist best: ~+6 to +8pp

That residual margin is what is realistically claimable. To verify, we would need to run our [anon-runtime] submissions through the full-mode reproduction stage on H100/H200 GPU with a 24h budget per paper.


Top-3 Strongest Papers for [anon-runtime]

Rank Paper [anon-runtime] Code-Dev Note
1 semantic-self-consistency 95.45% Highest score; aligns well with [anon-runtime] paper_search workflow for citation triangulation
2 sequential-neural-score-estimation 89.32% Strong baseline from AiScientist (64.94), we improve by +24pp
3 stay-on-topic-with-classifier-free-guidance 88.16% Largest Δ vs AiScientist (+68pp), suggests Code-Dev rubric heavily rewards our citation-grounded writing

Bottom-3 Weakest Papers for [anon-runtime]

Rank Paper [anon-runtime] Code-Dev Note
21 self-expansion 39.77% Thin submission (76 files), needs deeper code
22 bbox 40.34% Niche topic, possibly fewer matching citations to ground in
23 sapg 46.49% RL paper, hyperparameters not faithfully reproduced

Key Findings

  1. [anon-runtime] wins decisively on Code-Dev: 66.05% vs 43.4% for the prior Code-Dev SOTA (IterativeAgent o1-high in the original PaperBench paper). The +22.65pp gap likely reflects [anon-runtime] paper_search + ref_verify workflow, which grounds code in cited prior work — a property the rubric explicitly rewards.

  2. Estimated full-mode performance: ~36-40% (after 0.60× Code-Dev → Full conversion), which would still beat AiScientist Gemini-3-Flash (30.52%) and AiScientist GLM-5 (33.73%) while approaching the human baseline (41% over 48 hours of expert effort).

  3. Cost note: AiScientist reports $12-16 per paper task. We don't have comparable cost numbers because [anon-runtime] runs through the desktop runtime with runtime-borrowed gateway key; the cost is not separately accounted. A round-table estimate via gateway billing would be informative.

  4. Where [anon-runtime] particularly shines: papers with strong citation backbone (sequential-neural-score-estimation, sample-specific-masks, lbcs, bam, semantic-self-consistency) where code-side rubric items reference baselines and prior work that [anon-runtime] can pull canonical implementations of.

  5. Where [anon-runtime] is weak: pinn (slightly behind AiScientist GLM-5) and bbox/sapg/self-expansion (below 50%). These are likely papers with thin citation networks or unusual implementation patterns.


Reproducibility

To regenerate this comparison:

  • AiScientist data: arXiv:2604.13018, Table 1, transcribed manually from the paper PDF (fetched via https://r.jina.ai/https://arxiv.org/pdf/2604.13018).
  • [anon-runtime] data: per-paper grade.json files in [redacted-path].
  • Aggregate: [redacted-path].

To rerun [anon-runtime] grading with the same submissions:

for paper in $(ls [redacted-path]); do
  bash [redacted-path] "$paper" code-dev
done

To run full-mode grading (would give apples-to-apples vs AiScientist):

bash [redacted-path] full

This requires:

  • Multi-GPU host with --runtime=nvidia Docker (currently 1× H200 at cloud@195.242.13.82).
  • Building pb-reproducer Docker image (currently fails on China network — needs Aliyun mirror patch like we did for pb-env).
  • ~12-72h per paper depending on training cost.