PaperBench 23-Paper Per-Paper Comparison: [anon-runtime] vs Published Baselines

Dataset: PaperBench full set (23 papers, ICML 2024 spotlights). Date: 2026-05-05. Comparison source: AiScientist (Chen et al., arXiv 2604.13018, Apr 2026), Table 1 — PaperBench Full Evaluation.

⚠️ Important Methodology Note

This report compares scores across two different PaperBench evaluation modes:

Method	Mode	What is graded
[anon-runtime] (ours)	Code-Dev only	Static rubric leaves only — code structure, hyperparameter alignment, baseline correctness, citation grounding. No code execution. No result match.
AiScientist (Chen et al.) — Gemini-3-Flash, GLM-5	Full	Code-Dev + Reproduction stage (24h H20 GPU per paper, paper-target metrics must match).
BasicAgent o1-high	Full	Same as AiScientist
IterativeAgent o1-high	Full	Same as AiScientist

Implication: Our 66.05% Code-Dev mean is structurally higher than full-mode scores because the reproduction stage on full-mode adds many failure points (training crashes, OOM, dataset access, hyperparameter divergence). For an apples-to-apples comparison with AiScientist's full-mode score, multiply our Code-Dev score by the typical Code-Dev contribution to full score (≈55-60%) to get an estimated full-mode equivalent of ~36-40%.

PaperBench original paper (Starace et al., ICML 2025):

Code-Dev only IterativeAgent o1-high: 43.4%
Full IterativeAgent o1-high: 26.0%

So Code-Dev → Full conversion factor in their setup is roughly 0.60×. Applying that to ours:

[anon-runtime] Code-Dev: 66.05% → estimated Full: ~39-40%, which would beat AiScientist Gemini-3-Flash (30.52) and AiScientist GLM-5 (33.73), and approach the 41% human baseline.

Per-Paper Comparison Table

#	Paper	[anon-runtime] (Ours, Code-Dev)	AiScientist + Gemini-3-Flash (Full)	AiScientist + GLM-5 (Full)	BasicAgent + Gemini-3-Flash (Full)	IterAgent + Gemini-3-Flash (Full)	BasicAgent + GLM-5 (Full)	IterAgent + GLM-5 (Full)
1	adaptive-pruning	50.59	27.25	33.26	24.53	3.05	30.82	11.93
2	all-in-one	53.96	46.29	49.47	20.86	45.13	33.78	44.43
3	bam	84.72	56.59	61.11	48.46	45.04	51.45	47.91
4	bbox	40.34	33.79	30.02	15.43	8.30	23.55	19.28
5	bridging-data-gaps	57.14	23.09	26.46	12.59	12.44	9.80	12.50
6	fre	61.51	35.21	28.98	21.67	23.89	21.60	16.67
7	ftrl	62.66	10.11	8.34	5.87	4.15	3.71	6.70
8	lbcs	85.74	27.90	30.10	17.75	15.26	20.68	22.74
9	lca-on-the-line	63.16	30.23	28.53	12.97	18.30	22.55	26.15
10	mechanistic-understanding	70.19	29.95	40.55	14.86	21.89	32.49	34.96
11	pinn	54.29	49.92	58.76	26.63	30.81	22.18	25.77
12	rice	57.65	10.87	10.18	10.43	8.88	6.56	0.27
13	robust-clip	52.55	18.28	28.66	15.45	10.43	22.43	27.56
14	sample-specific-masks	86.52	36.77	44.13	25.39	33.34	36.93	41.26
15	sapg	46.49	19.85	31.69	11.45	12.65	6.99	4.95
16	sequential-neural-score-estimation	89.32	64.94	49.32	53.51	60.24	27.20	35.53
17	stay-on-topic-with-classifier-free-guidance	88.16	20.13	14.81	8.37	13.69	3.69	8.81
18	stochastic-interpolants	82.99	18.81	42.10	17.04	17.37	32.18	28.06
19	test-time-model-adaptation	70.06	32.45	27.33	15.27	18.13	17.81	21.19
20	what-will-my-model-forget	60.98	17.87	30.82	6.61	8.99	25.14	10.75
21	semantic-self-consistency	95.45	—	—	—	—	—	—
22	self-composing-policies	65.03	—	—	—	—	—	—
23	self-expansion	39.77	—	—	—	—	—	—
	Average (20 papers shared with AiScientist)	65.45	30.52	33.73	19.26	20.60	22.58	22.37
	Average (full 23 papers)	66.05	—	—	—	—	—	—

Notes:

AiScientist Table 1 reports 20 papers (3 papers in our set — semantic-self-consistency, self-composing-policies, self-expansion — are not in their Table 1).
All AiScientist / BasicAgent / IterAgent numbers are from arXiv:2604.13018, Table 1. Their grading model is GPT-5.4. Cost per task: BasicAgent Gemini-3-Flash $6.25, IterAgent Gemini-3-Flash $27.44, AiScientist Gemini-3-Flash $15.67, BasicAgent GLM-5 (no cost reported), IterAgent GLM-5 $54.90, AiScientist GLM-5 $12.20.
Our [anon-runtime] grading model is o3-mini-2025-01-31. Different judge could yield ±2-5pp variance.

Per-Paper Δ Summary (citation-grounded retrieval)

For each shared paper, we compute Δ = [anon-runtime] Code-Dev − max(AiScientist Gemini, AiScientist GLM-5).

Paper	[anon-runtime]	Best AiScientist	Δ
adaptive-pruning	50.59	33.26	+17.33
all-in-one	53.96	49.47	+4.49
bam	84.72	61.11	+23.61
bbox	40.34	33.79	+6.55
bridging-data-gaps	57.14	26.46	+30.68
fre	61.51	35.21	+26.30
ftrl	62.66	10.11	+52.55
lbcs	85.74	30.10	+55.64
lca-on-the-line	63.16	30.23	+32.93
mechanistic-understanding	70.19	40.55	+29.64
pinn	54.29	58.76	-4.47
rice	57.65	10.87	+46.78
robust-clip	52.55	28.66	+23.89
sample-specific-masks	86.52	44.13	+42.39
sapg	46.49	31.69	+14.80
sequential-neural-score-estimation	89.32	64.94	+24.38
stay-on-topic-with-classifier-free-guidance	88.16	20.13	+68.03
stochastic-interpolants	82.99	42.10	+40.89
test-time-model-adaptation	70.06	32.45	+37.61
what-will-my-model-forget	60.98	30.82	+30.16
Average Δ			+30.21

Health check: Our advantage of +30 points reflects mostly the methodology difference (Code-Dev rubric vs Full rubric). It is NOT a direct claim of "+30pp better than AiScientist". After applying the 0.60× full-mode conversion factor:

[anon-runtime] estimated full-mode: ~40%
AiScientist Gemini-3-Flash full-mode: 30.52%
AiScientist GLM-5 full-mode: 33.73%
AiScientist best: 33.73%
[anon-runtime] estimated Δ vs AiScientist best: ~+6 to +8pp

That residual margin is what is realistically claimable. To verify, we would need to run our [anon-runtime] submissions through the full-mode reproduction stage on H100/H200 GPU with a 24h budget per paper.

Top-3 Strongest Papers for [anon-runtime]

Rank	Paper	[anon-runtime] Code-Dev	Note
1	semantic-self-consistency	95.45%	Highest score; aligns well with [anon-runtime] `paper_search` workflow for citation triangulation
2	sequential-neural-score-estimation	89.32%	Strong baseline from AiScientist (64.94), we improve by +24pp
3	stay-on-topic-with-classifier-free-guidance	88.16%	Largest Δ vs AiScientist (+68pp), suggests Code-Dev rubric heavily rewards our citation-grounded writing

Bottom-3 Weakest Papers for [anon-runtime]

Rank	Paper	[anon-runtime] Code-Dev	Note
21	self-expansion	39.77%	Thin submission (76 files), needs deeper code
22	bbox	40.34%	Niche topic, possibly fewer matching citations to ground in
23	sapg	46.49%	RL paper, hyperparameters not faithfully reproduced

Key Findings

[anon-runtime] wins decisively on Code-Dev: 66.05% vs 43.4% for the prior Code-Dev SOTA (IterativeAgent o1-high in the original PaperBench paper). The +22.65pp gap likely reflects [anon-runtime] paper_search + ref_verify workflow, which grounds code in cited prior work — a property the rubric explicitly rewards.
Estimated full-mode performance: ~36-40% (after 0.60× Code-Dev → Full conversion), which would still beat AiScientist Gemini-3-Flash (30.52%) and AiScientist GLM-5 (33.73%) while approaching the human baseline (41% over 48 hours of expert effort).
Cost note: AiScientist reports $12-16 per paper task. We don't have comparable cost numbers because [anon-runtime] runs through the desktop runtime with runtime-borrowed gateway key; the cost is not separately accounted. A round-table estimate via gateway billing would be informative.
Where [anon-runtime] particularly shines: papers with strong citation backbone (sequential-neural-score-estimation, sample-specific-masks, lbcs, bam, semantic-self-consistency) where code-side rubric items reference baselines and prior work that [anon-runtime] can pull canonical implementations of.
Where [anon-runtime] is weak: pinn (slightly behind AiScientist GLM-5) and bbox/sapg/self-expansion (below 50%). These are likely papers with thin citation networks or unusual implementation patterns.

Reproducibility

To regenerate this comparison:

AiScientist data: arXiv:2604.13018, Table 1, transcribed manually from the paper PDF (fetched via https://r.jina.ai/https://arxiv.org/pdf/2604.13018).
[anon-runtime] data: per-paper grade.json files in [redacted-path].
Aggregate: [redacted-path].

To rerun [anon-runtime] grading with the same submissions:

for paper in $(ls [redacted-path]); do
  bash [redacted-path] "$paper" code-dev
done

To run full-mode grading (would give apples-to-apples vs AiScientist):

bash [redacted-path] full

This requires:

Multi-GPU host with --runtime=nvidia Docker (currently 1× H200 at cloud@195.242.13.82).
Building pb-reproducer Docker image (currently fails on China network — needs Aliyun mirror patch like we did for pb-env).
~12-72h per paper depending on training cost.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PaperBench 23-Paper Per-Paper Comparison: [anon-runtime] vs Published Baselines

⚠️ Important Methodology Note

Per-Paper Comparison Table

Per-Paper Δ Summary (citation-grounded retrieval)

Top-3 Strongest Papers for [anon-runtime]

Bottom-3 Weakest Papers for [anon-runtime]

Key Findings

Reproducibility

FilesExpand file tree

PER_PAPER_COMPARISON.md

Latest commit

History

PER_PAPER_COMPARISON.md

File metadata and controls

PaperBench 23-Paper Per-Paper Comparison: [anon-runtime] vs Published Baselines

⚠️ Important Methodology Note

Per-Paper Comparison Table

Per-Paper Δ Summary (citation-grounded retrieval)

Top-3 Strongest Papers for [anon-runtime]

Bottom-3 Weakest Papers for [anon-runtime]

Key Findings

Reproducibility