Public-ready cleanup of the quality-proxy and perturbation benchmarks by dangng2004 · Pull Request #105 · ChicagoHAI/OpenAIReview

dangng2004 · 2026-06-28T22:13:37Z

Public-ready cleanup of both benchmark studies, so each is documented to match the paper and runs from a fresh clone.

Quality-proxy study (`benchmarks/conference_study/`)

- Full 197-paper set regenerates deterministically via select_papers.py.
- Model roster fixed to the paper's six backbones; pinned in baseline.yaml + new frontier.yaml. Canonical manifest renamed combined.json -> full.json.
- coarse resolves its venv via COARSE_VENV_PYTHON (no hardcoded path) with setup docs.
- generate_report.py is the single results entry point: discovers (method, model) cells from a run, merges multiple result dirs, defaults to the roster + the paper's progressive_original variant (flags --all-models, --consolidated), and prints the pairwise-accuracy tables with 95% CIs via ci_auc. Removes the gitignored hand-maintained table config.
- Removed the duplicate benchmarks/README.md.
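For illustration, the COARSE_VENV_PYTHON lookup mentioned above could look roughly like the sketch below. Only the environment variable name comes from this PR; the function name `resolve_coarse_python` and the error messages are hypothetical and may not match the repo's actual code.

```python
import os
from pathlib import Path


def resolve_coarse_python(env=None):
    """Return the interpreter path named by COARSE_VENV_PYTHON.

    Hypothetical helper: the real lookup in the repo may differ.
    """
    env = os.environ if env is None else env
    raw = env.get("COARSE_VENV_PYTHON")
    if not raw:
        # Fail loudly instead of falling back to a hardcoded local path.
        raise RuntimeError(
            "COARSE_VENV_PYTHON is not set; point it at the coarse venv's python"
        )
    path = Path(raw).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"COARSE_VENV_PYTHON does not exist: {path}")
    return path
```

Passing `env` explicitly keeps the helper easy to test without touching the real process environment.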

Perturbation benchmark (`benchmarks/perturbation/`)

- Both READMEs rewritten around the per-stage flow (perturb_automated.py for generation, then run_benchmark.py --stages prepare/review/score/report).
- Committed a canonical configs/default.yaml with the paper's model roster.
- No code changes here. The scorer bug fix and the gate/threshold scoring config belong to #87 (Fix substring-match scorer bug; add gate/threshold flags) and land separately once that merges. The scoring docs describe the current behavior until then.

Rename the "outcomes/conference" study to the quality-proxy study to match the paper, and rewrite both READMEs (top-level + conference_study) around the paper's four-proxy design, the real run flow, and reproducibility.

Reproducibility:
- The full 197-paper set regenerates deterministically via select_papers.py (default output moved from manifests/v1 to manifests/canonical/, written as full.json). The 74-paper frontier subset has no regeneration script and is not shipped; the README points readers to the authors for its manifest.
- Fix the model roster everywhere to the paper's six backbones (four efficient + two frontier), dropping exploratory models. Pin models explicitly in baseline.yaml and a new frontier.yaml.
- run_study.py and download_papers.py default to the canonical manifest.

Usability:
- coarse resolves its venv via COARSE_VENV_PYTHON instead of a hardcoded local path, with setup docs.
- generate_report.py becomes the single results entry point: it discovers the (method, model) cells from a run's results and prints the paper's pairwise-accuracy tables with 95% CIs via ci_auc, so no hand-maintained table config is needed.
- README leads with a cost preview and a smoke-test path; estimate_cost.py carries a TODO for the grok rate.

Remove the duplicate benchmarks/README.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
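As a rough illustration of what a pairwise-accuracy figure with a 95% CI means: the sketch below scores every (higher-quality, lower-quality) pair and brackets the result with a percentile bootstrap. This is not the repo's ci_auc, whose internals are not shown in this PR; the bootstrap is an assumed stand-in, and the function names and defaults are hypothetical.

```python
import random


def pairwise_accuracy(pos_scores, neg_scores):
    """Fraction of (pos, neg) pairs ranked correctly; ties count as half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def bootstrap_ci(pos_scores, neg_scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for pairwise accuracy (assumed method)."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    stats = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping group sizes fixed.
        rp = [rng.choice(pos_scores) for _ in pos_scores]
        rn = [rng.choice(neg_scores) for _ in neg_scores]
        stats.append(pairwise_accuracy(rp, rn))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi
```

A pairwise accuracy of 0.5 corresponds to chance-level ranking, so an interval that excludes 0.5 is the usual sign that a proxy carries signal.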

Rewrite both READMEs around the per-stage run flow (perturb_automated.py for generation, then run_benchmark.py --stages prepare/review/score/report), drop the "(v2)" title, and replace the old error taxonomy with the paper's four categories (Surface / Claim / Reasoning / Experimental).

Commit a canonical configs/default.yaml (configs/ was gitignored entirely) with the paper's model roster. Docs and config only; the scorer bug fix and gate/threshold scoring config land separately with #87.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…e verify

Make the perturbation benchmark reproducible from a fresh clone:
- Add a Dataset section pointing to the released perturbed papers (zip), and make the released set the recommended reproduction path.
- Add eight per-domain configs (full_*.yaml) so `--configs configs/full_*.yaml` reproduces the full 74-paper benchmark in one command; whitelist them in .gitignore.
- Fix input_dir to the <domain>/all level (the layout discover_units expects), in default.yaml and the README example; the old results/perturbations path resolved nowhere.
- Document the real generation chain (perturb_automated -> verify_existing -> reinject_existing) with correct, repo-root-aware commands.
- Demote the LLM verify step to the optional regeneration path (docs only), with a note that the released set was verifier-filtered and manually audited.
- gitignore the repo-root results/ that run_benchmark.py writes to.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
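The input_dir fix above suggests a cheap pre-flight check before a full run. The sketch below is an assumption, not part of this PR: it takes an already-parsed mapping from config name to input_dir (YAML loading is left out) and flags any entry that does not resolve to an existing directory named `all`. The helper name and the example domain are hypothetical.

```python
from pathlib import Path


def find_bad_input_dirs(input_dirs, repo_root):
    """Return config names whose input_dir is not an existing <domain>/all dir.

    input_dirs: mapping of config file name -> input_dir relative to repo_root.
    """
    bad = []
    for config_name, rel_dir in sorted(input_dirs.items()):
        path = Path(repo_root) / rel_dir
        # The PR states that input_dir must sit at the <domain>/all level.
        if path.name != "all" or not path.is_dir():
            bad.append(config_name)
    return bad
```

Run against the old `results/perturbations` value, a check like this would have reported the config immediately instead of letting the run find no inputs.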

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Replace the contact-the-authors step for the conference (quality-proxy) study with a Google Drive download. The zip bundles manifests/canonical/full.json and frontier.json (~274K); unzip inside benchmarks/conference_study/. full.json still regenerates via select_papers.py.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
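How select_papers.py achieves deterministic regeneration is not shown in this PR. One common pattern, sketched here purely as an assumption, is to sort the candidate pool, sample with a fixed seed, and serialize with stable key order so the manifest is byte-identical across runs. The function names, the `papers` key, and the seed value are all hypothetical.

```python
import json
import random


def select_papers(candidate_ids, k, seed=0):
    """Pick k paper ids reproducibly, independent of input order."""
    ordered = sorted(set(candidate_ids))  # canonical order before sampling
    rng = random.Random(seed)             # local RNG, no global state
    return sorted(rng.sample(ordered, k))


def to_manifest_json(paper_ids):
    """Serialize with sorted keys so repeated runs produce identical text."""
    return json.dumps({"papers": paper_ids}, indent=2, sort_keys=True)
```

Sorting before sampling matters because directory listings and API results can come back in a different order on different machines.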

dangng2004 force-pushed the benchmark-cleanup branch 3 times, most recently from 4b3bdc3 to 65150ab · June 28, 2026 22:52

dangng2004 force-pushed the benchmark-cleanup branch from 65150ab to 40df22e · June 28, 2026 22:57

dangng2004 changed the title ~~Clean up the quality-proxy benchmark for public release~~ Public-ready cleanup of the quality-proxy and perturbation benchmarks Jun 29, 2026

dangng2004 force-pushed the benchmark-cleanup branch from e61b155 to 40df22e · June 29, 2026 00:20

dangng2004 and others added 4 commits June 28, 2026 19:24

Add Google Drive link for the released perturbation set

4c5c178

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

dangng2004 requested a review from chenhaot June 29, 2026 06:11

dangng2004 marked this pull request as ready for review June 29, 2026 06:11

dangng2004 wants to merge 5 commits into main from benchmark-cleanup
