Dataset: PaperBench full set (23 papers, ICML 2024 spotlights). Date: 2026-05-05. Comparison source: AiScientist (Chen et al., arXiv 2604.13018, Apr 2026), Table 1 — PaperBench Full Evaluation.
This report compares scores across two different PaperBench evaluation modes:
| Method | Mode | What is graded |
|---|---|---|
| [anon-runtime] (ours) | Code-Dev only | Static rubric leaves only — code structure, hyperparameter alignment, baseline correctness, citation grounding. No code execution. No result match. |
| AiScientist (Chen et al.) — Gemini-3-Flash, GLM-5 | Full | Code-Dev + Reproduction stage (24h H20 GPU per paper, paper-target metrics must match). |
| BasicAgent o1-high | Full | Same as AiScientist |
| IterativeAgent o1-high | Full | Same as AiScientist |
Implication: Our 66.05% Code-Dev mean is structurally higher than full-mode scores because the reproduction stage in full mode adds many failure points (training crashes, OOM, dataset access, hyperparameter divergence). For a rough like-for-like comparison with AiScientist's full-mode scores, multiply our Code-Dev score by an assumed Full-to-Code-Dev ratio of ≈0.55-0.60, which gives an estimated full-mode equivalent of ~36-40%.
PaperBench original paper (Starace et al., ICML 2025):
- Code-Dev only IterativeAgent o1-high: 43.4%
- Full IterativeAgent o1-high: 26.0%
So the Code-Dev → Full conversion factor in their setup is roughly 0.60× (26.0 / 43.4 ≈ 0.60). Applying that to ours:
- [anon-runtime] Code-Dev: 66.05% → estimated Full: ~39-40%, which would beat AiScientist Gemini-3-Flash (30.52) and AiScientist GLM-5 (33.73), and approach the 41% human baseline. This assumes the ratio observed for one agent transfers to our system, which has not been verified.
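The arithmetic behind this projection can be sketched as follows. It uses only figures quoted in this report; the function and variable names are illustrative, and treating a ratio measured on IterativeAgent o1-high as transferable to another system is an assumption, not an established result.

```python
# Code-Dev -> Full projection using only figures quoted in this report.
# Assumption: the ratio observed for IterativeAgent o1-high transfers to our system.

ITER_CODE_DEV = 43.4   # IterativeAgent o1-high, Code-Dev only (Starace et al.)
ITER_FULL = 26.0       # IterativeAgent o1-high, Full
OURS_CODE_DEV = 66.05  # [anon-runtime] mean over 23 papers, Code-Dev only


def conversion_factor(code_dev: float, full: float) -> float:
    """Ratio of an agent's Full score to its Code-Dev score."""
    return full / code_dev


def project_full(code_dev: float, factor: float) -> float:
    """Projected Full-mode score under a fixed multiplicative factor."""
    return code_dev * factor


factor = conversion_factor(ITER_CODE_DEV, ITER_FULL)
print(f"observed conversion factor: {factor:.3f}")
print(f"projected Full at observed factor: {project_full(OURS_CODE_DEV, factor):.2f}")

# Lower and upper ends of the 0.55-0.60 range used in the text.
for f in (0.55, 0.60):
    print(f"projected Full at {f:.2f}x: {project_full(OURS_CODE_DEV, f):.2f}")
```

A single multiplicative factor is the simplest possible model; per-paper ratios may vary widely, so the projection is best read as a range rather than a point estimate.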
| # | Paper | [anon-runtime] (Ours, Code-Dev) | AiScientist + Gemini-3-Flash (Full) | AiScientist + GLM-5 (Full) | BasicAgent + Gemini-3-Flash (Full) | IterAgent + Gemini-3-Flash (Full) | BasicAgent + GLM-5 (Full) | IterAgent + GLM-5 (Full) |
|---|---|---|---|---|---|---|---|---|
| 1 | adaptive-pruning | 50.59 | 27.25 | 33.26 | 24.53 | 3.05 | 30.82 | 11.93 |
| 2 | all-in-one | 53.96 | 46.29 | 49.47 | 20.86 | 45.13 | 33.78 | 44.43 |
| 3 | bam | 84.72 | 56.59 | 61.11 | 48.46 | 45.04 | 51.45 | 47.91 |
| 4 | bbox | 40.34 | 33.79 | 30.02 | 15.43 | 8.30 | 23.55 | 19.28 |
| 5 | bridging-data-gaps | 57.14 | 23.09 | 26.46 | 12.59 | 12.44 | 9.80 | 12.50 |
| 6 | fre | 61.51 | 35.21 | 28.98 | 21.67 | 23.89 | 21.60 | 16.67 |
| 7 | ftrl | 62.66 | 10.11 | 8.34 | 5.87 | 4.15 | 3.71 | 6.70 |
| 8 | lbcs | 85.74 | 27.90 | 30.10 | 17.75 | 15.26 | 20.68 | 22.74 |
| 9 | lca-on-the-line | 63.16 | 30.23 | 28.53 | 12.97 | 18.30 | 22.55 | 26.15 |
| 10 | mechanistic-understanding | 70.19 | 29.95 | 40.55 | 14.86 | 21.89 | 32.49 | 34.96 |
| 11 | pinn | 54.29 | 49.92 | 58.76 | 26.63 | 30.81 | 22.18 | 25.77 |
| 12 | rice | 57.65 | 10.87 | 10.18 | 10.43 | 8.88 | 6.56 | 0.27 |
| 13 | robust-clip | 52.55 | 18.28 | 28.66 | 15.45 | 10.43 | 22.43 | 27.56 |
| 14 | sample-specific-masks | 86.52 | 36.77 | 44.13 | 25.39 | 33.34 | 36.93 | 41.26 |
| 15 | sapg | 46.49 | 19.85 | 31.69 | 11.45 | 12.65 | 6.99 | 4.95 |
| 16 | sequential-neural-score-estimation | 89.32 | 64.94 | 49.32 | 53.51 | 60.24 | 27.20 | 35.53 |
| 17 | stay-on-topic-with-classifier-free-guidance | 88.16 | 20.13 | 14.81 | 8.37 | 13.69 | 3.69 | 8.81 |
| 18 | stochastic-interpolants | 82.99 | 18.81 | 42.10 | 17.04 | 17.37 | 32.18 | 28.06 |
| 19 | test-time-model-adaptation | 70.06 | 32.45 | 27.33 | 15.27 | 18.13 | 17.81 | 21.19 |
| 20 | what-will-my-model-forget | 60.98 | 17.87 | 30.82 | 6.61 | 8.99 | 25.14 | 10.75 |
| 21 | semantic-self-consistency | 95.45 | — | — | — | — | — | — |
| 22 | self-composing-policies | 65.03 | — | — | — | — | — | — |
| 23 | self-expansion | 39.77 | — | — | — | — | — | — |
| Average (20 papers shared with AiScientist) | 65.95 | 30.52 | 33.73 | 19.26 | 20.60 | 22.58 | 22.37 |
| Average (full 23 papers) | 66.05 | — | — | — | — | — | — |
Notes:
- AiScientist Table 1 reports 20 papers; 3 papers in our set (`semantic-self-consistency`, `self-composing-policies`, `self-expansion`) are not in their Table 1.
- All AiScientist / BasicAgent / IterAgent numbers are from arXiv:2604.13018, Table 1. Their grading model is GPT-5.4. Cost per task: BasicAgent Gemini-3-Flash $6.25, IterAgent Gemini-3-Flash $27.44, AiScientist Gemini-3-Flash $15.67, BasicAgent GLM-5 (no cost reported), IterAgent GLM-5 $54.90, AiScientist GLM-5 $12.20.
- Our [anon-runtime] grading model is `o3-mini-2025-01-31`. A different judge could shift scores by roughly ±2-5pp.
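The two [anon-runtime] averages can be recomputed from the per-paper column as a consistency check. The sketch below hard-codes the scores transcribed from the table; the dictionary names are illustrative.

```python
from statistics import mean

# [anon-runtime] Code-Dev scores for the 20 papers shared with AiScientist,
# transcribed from the per-paper table.
SHARED = {
    "adaptive-pruning": 50.59, "all-in-one": 53.96, "bam": 84.72,
    "bbox": 40.34, "bridging-data-gaps": 57.14, "fre": 61.51,
    "ftrl": 62.66, "lbcs": 85.74, "lca-on-the-line": 63.16,
    "mechanistic-understanding": 70.19, "pinn": 54.29, "rice": 57.65,
    "robust-clip": 52.55, "sample-specific-masks": 86.52, "sapg": 46.49,
    "sequential-neural-score-estimation": 89.32,
    "stay-on-topic-with-classifier-free-guidance": 88.16,
    "stochastic-interpolants": 82.99, "test-time-model-adaptation": 70.06,
    "what-will-my-model-forget": 60.98,
}

# The 3 papers that do not appear in AiScientist's Table 1.
EXTRA = {
    "semantic-self-consistency": 95.45,
    "self-composing-policies": 65.03,
    "self-expansion": 39.77,
}

shared_mean = mean(SHARED.values())
full_mean = mean({**SHARED, **EXTRA}.values())

print(f"mean over {len(SHARED)} shared papers: {shared_mean:.2f}")
print(f"mean over {len(SHARED) + len(EXTRA)} papers: {full_mean:.2f}")
```

Because the per-paper values are already rounded to two decimals, a recomputed mean can differ from a mean taken over the unrounded grades in the last digit.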
For each shared paper, we compute Δ = [anon-runtime] Code-Dev − max(AiScientist Gemini, AiScientist GLM-5).
| Paper | [anon-runtime] | Best AiScientist | Δ |
|---|---|---|---|
| adaptive-pruning | 50.59 | 33.26 | +17.33 |
| all-in-one | 53.96 | 49.47 | +4.49 |
| bam | 84.72 | 61.11 | +23.61 |
| bbox | 40.34 | 33.79 | +6.55 |
| bridging-data-gaps | 57.14 | 26.46 | +30.68 |
| fre | 61.51 | 35.21 | +26.30 |
| ftrl | 62.66 | 10.11 | +52.55 |
| lbcs | 85.74 | 30.10 | +55.64 |
| lca-on-the-line | 63.16 | 30.23 | +32.93 |
| mechanistic-understanding | 70.19 | 40.55 | +29.64 |
| pinn | 54.29 | 58.76 | -4.47 |
| rice | 57.65 | 10.87 | +46.78 |
| robust-clip | 52.55 | 28.66 | +23.89 |
| sample-specific-masks | 86.52 | 44.13 | +42.39 |
| sapg | 46.49 | 31.69 | +14.80 |
| sequential-neural-score-estimation | 89.32 | 64.94 | +24.38 |
| stay-on-topic-with-classifier-free-guidance | 88.16 | 20.13 | +68.03 |
| stochastic-interpolants | 82.99 | 42.10 | +40.89 |
| test-time-model-adaptation | 70.06 | 32.45 | +37.61 |
| what-will-my-model-forget | 60.98 | 30.82 | +30.16 |
| Average Δ | | | +30.21 |
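The Δ column follows directly from the definition Δ = [anon-runtime] Code-Dev − max(AiScientist Gemini, AiScientist GLM-5). A minimal sketch that recomputes it from the transcribed scores is shown here; the names `ROWS` and `delta` are illustrative.

```python
from statistics import mean

# paper -> (ours Code-Dev, AiScientist Gemini-3-Flash Full, AiScientist GLM-5 Full)
ROWS = {
    "adaptive-pruning": (50.59, 27.25, 33.26),
    "all-in-one": (53.96, 46.29, 49.47),
    "bam": (84.72, 56.59, 61.11),
    "bbox": (40.34, 33.79, 30.02),
    "bridging-data-gaps": (57.14, 23.09, 26.46),
    "fre": (61.51, 35.21, 28.98),
    "ftrl": (62.66, 10.11, 8.34),
    "lbcs": (85.74, 27.90, 30.10),
    "lca-on-the-line": (63.16, 30.23, 28.53),
    "mechanistic-understanding": (70.19, 29.95, 40.55),
    "pinn": (54.29, 49.92, 58.76),
    "rice": (57.65, 10.87, 10.18),
    "robust-clip": (52.55, 18.28, 28.66),
    "sample-specific-masks": (86.52, 36.77, 44.13),
    "sapg": (46.49, 19.85, 31.69),
    "sequential-neural-score-estimation": (89.32, 64.94, 49.32),
    "stay-on-topic-with-classifier-free-guidance": (88.16, 20.13, 14.81),
    "stochastic-interpolants": (82.99, 18.81, 42.10),
    "test-time-model-adaptation": (70.06, 32.45, 27.33),
    "what-will-my-model-forget": (60.98, 17.87, 30.82),
}


def delta(ours: float, gemini: float, glm5: float) -> float:
    """Our Code-Dev score minus the better of the two AiScientist Full scores."""
    return ours - max(gemini, glm5)


# Round to two decimals to match the precision of the table.
deltas = {paper: round(delta(*row), 2) for paper, row in ROWS.items()}
avg_delta = mean(deltas.values())
behind = [paper for paper, d in deltas.items() if d < 0]

for paper, d in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{paper:45s} {d:+.2f}")
print(f"average delta: {avg_delta:+.2f}")
print(f"papers where we trail: {behind}")
```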
Health check: Our advantage of +30 points reflects mostly the methodology difference (Code-Dev rubric vs Full rubric). It is NOT a direct claim of "+30pp better than AiScientist". After applying the 0.60× full-mode conversion factor:
- [anon-runtime] estimated full-mode: ~40%
- AiScientist Gemini-3-Flash full-mode: 30.52%
- AiScientist GLM-5 full-mode: 33.73%
- AiScientist best: 33.73%
- [anon-runtime] estimated Δ vs AiScientist best: ~+6pp at 0.60× (~+3pp at 0.55×)
That residual margin is what is realistically claimable. To verify it, we would need to run our [anon-runtime] submissions through the full-mode reproduction stage on an H100/H200 GPU with a 24h budget per paper.
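How sensitive the residual margin is to the conversion factor can be sketched as below. The scores are those quoted in this report; `projected_margin` and `break_even_factor` are illustrative helpers, and the multiplicative model itself is an assumption.

```python
# Sensitivity of the projected full-mode margin to the assumed conversion factor.

OURS_CODE_DEV = 66.05  # [anon-runtime] Code-Dev mean
AISCIENTIST_FULL = {"Gemini-3-Flash": 30.52, "GLM-5": 33.73}  # full-mode means


def projected_margin(factor: float) -> float:
    """Projected Full score minus the best AiScientist Full score."""
    best = max(AISCIENTIST_FULL.values())
    return OURS_CODE_DEV * factor - best


def break_even_factor() -> float:
    """Conversion factor at which the projected margin drops to zero."""
    return max(AISCIENTIST_FULL.values()) / OURS_CODE_DEV


for f in (0.55, 0.60):
    print(f"margin at {f:.2f}x: {projected_margin(f):+.2f}pp")
print(f"break-even factor: {break_even_factor():.3f}")
```

Under this model the projected lead disappears if the true Code-Dev → Full ratio for our submissions falls below the break-even factor, which is one more reason to treat the margin as provisional until a full-mode run exists.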
Top 3 papers by [anon-runtime] Code-Dev score:
| Rank | Paper | [anon-runtime] Code-Dev | Note |
|---|---|---|---|
| 1 | semantic-self-consistency | 95.45% | Highest score; aligns well with [anon-runtime] paper_search workflow for citation triangulation |
| 2 | sequential-neural-score-estimation | 89.32% | Strong baseline from AiScientist (64.94), we improve by +24pp |
| 3 | stay-on-topic-with-classifier-free-guidance | 88.16% | Largest Δ vs AiScientist (+68pp), suggests Code-Dev rubric heavily rewards our citation-grounded writing |
Bottom 3 papers by [anon-runtime] Code-Dev score:
| Rank | Paper | [anon-runtime] Code-Dev | Note |
|---|---|---|---|
| 21 | sapg | 46.49% | RL paper, hyperparameters not faithfully reproduced |
| 22 | bbox | 40.34% | Niche topic, possibly fewer matching citations to ground in |
| 23 | self-expansion | 39.77% | Thin submission (76 files), needs deeper code |
- [anon-runtime] wins decisively on Code-Dev: 66.05% vs 43.4% for the prior Code-Dev SOTA (IterativeAgent o1-high in the original PaperBench paper). The +22.65pp gap likely reflects the [anon-runtime] `paper_search` + `ref_verify` workflow, which grounds code in cited prior work, a property the rubric explicitly rewards.
- Estimated full-mode performance: ~36-40% (after a 0.55-0.60× Code-Dev → Full conversion), which would still beat AiScientist Gemini-3-Flash (30.52%) and AiScientist GLM-5 (33.73%) while approaching the human baseline (41% over 48 hours of expert effort).
- Cost note: AiScientist reports $12-16 per paper task. We don't have comparable cost numbers because [anon-runtime] runs through the desktop runtime with a runtime-borrowed gateway key, so the cost is not separately accounted. A rough estimate via gateway billing would be informative.
- Where [anon-runtime] particularly shines: papers with a strong citation backbone (sequential-neural-score-estimation, sample-specific-masks, lbcs, bam, semantic-self-consistency), where code-side rubric items reference baselines and prior work for which [anon-runtime] can pull canonical implementations.
- Where [anon-runtime] is weak: pinn (slightly behind AiScientist GLM-5) and bbox/sapg/self-expansion (below 50%). These are likely papers with thin citation networks or unusual implementation patterns.
To regenerate this comparison:
- AiScientist data: arXiv:2604.13018, Table 1, transcribed manually from the paper PDF (fetched via `https://r.jina.ai/https://arxiv.org/pdf/2604.13018`).
- [anon-runtime] data: per-paper grade.json files in `[redacted-path]`.
- Aggregate: `[redacted-path]`.
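The aggregation step over the per-paper grade files could look roughly like the sketch below. The schema is not documented here, so this assumes a layout of `<root>/<paper>/grade.json` and a top-level numeric field named `score` stored as a fraction in [0, 1]; both the layout and the field name are hypothetical and would need to be adjusted to the real files.

```python
import json
from pathlib import Path
from statistics import mean


def load_scores(root: Path, field: str = "score", scale: float = 100.0) -> dict[str, float]:
    """Read one grade.json per paper directory and return paper -> score in percent.

    Assumes <root>/<paper>/grade.json with a numeric `field` (hypothetical schema).
    """
    scores: dict[str, float] = {}
    for grade_file in sorted(root.glob("*/grade.json")):
        with grade_file.open(encoding="utf-8") as fh:
            data = json.load(fh)
        scores[grade_file.parent.name] = scale * float(data[field])
    return scores


def mean_score(scores: dict[str, float]) -> float:
    """Unweighted mean over papers, matching the per-paper averaging used in the tables."""
    return mean(scores.values())
```

Calling `load_scores(Path(results_dir))` on the directory that holds the grade files and passing the result to `mean_score` would give the unweighted per-paper average; `results_dir` is a placeholder for that location.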
To rerun [anon-runtime] grading with the same submissions:
```bash
for paper in $(ls [redacted-path]); do
  bash [redacted-path] "$paper" code-dev
done
```

To run full-mode grading (would give apples-to-apples vs AiScientist):

```bash
bash [redacted-path] full
```

This requires:
- Multi-GPU host with `--runtime=nvidia` Docker (currently 1× H200 at `cloud@195.242.13.82`).
- Building the `pb-reproducer` Docker image (currently fails on China network; needs an Aliyun mirror patch like the one we did for `pb-env`).
- ~12-72h per paper depending on training cost.