Public-ready cleanup of the quality-proxy and perturbation benchmarks#105

Open
dangng2004 wants to merge 5 commits into main from benchmark-cleanup

@dangng2004 dangng2004 commented Jun 28, 2026

Public-ready cleanup of both benchmark studies, so each is documented to match the paper and runs from a fresh clone.

Quality-proxy study (benchmarks/conference_study/)

  • Full 197-paper set regenerates deterministically via select_papers.py.
  • Model roster fixed to the paper's six backbones; pinned in baseline.yaml + new frontier.yaml. Canonical manifest renamed combined.json -> full.json.
  • coarse resolves its venv via COARSE_VENV_PYTHON (no hardcoded path) with setup docs.
  • generate_report.py is the single results entry point: discovers (method, model) cells from a run, merges multiple result dirs, defaults to the roster + the paper's progressive_original variant (flags --all-models, --consolidated), and prints the pairwise-accuracy tables with 95% CIs via ci_auc. Removes the gitignored hand-maintained table config.
  • Removed the duplicate benchmarks/README.md.
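As a rough illustration of the COARSE_VENV_PYTHON change, an environment-variable lookup of this kind might look like the sketch below. The helper name `resolve_coarse_python` and its error messages are hypothetical and not taken from the repository; only the variable name comes from this PR.

```python
import os
import sys
from pathlib import Path


def resolve_coarse_python(env=None):
    """Return the interpreter path named by COARSE_VENV_PYTHON.

    Hypothetical helper for illustration only; the repository's actual
    lookup may differ in name, fallback behavior, and messages.
    """
    env = os.environ if env is None else env
    raw = env.get("COARSE_VENV_PYTHON", "").strip()
    if not raw:
        # Fail loudly instead of falling back to a machine-specific path.
        raise RuntimeError(
            "COARSE_VENV_PYTHON is not set; point it at the Python "
            "interpreter inside the coarse virtual environment."
        )
    path = Path(raw).expanduser()
    if not path.is_file():
        raise FileNotFoundError(
            f"COARSE_VENV_PYTHON points to a missing file: {path}"
        )
    return path


if __name__ == "__main__":
    # Demo: treat the current interpreter as if it were the coarse venv.
    print(resolve_coarse_python({"COARSE_VENV_PYTHON": sys.executable}))
```

Failing early when the variable is unset keeps a fresh clone from silently depending on one contributor's local path.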

Perturbation benchmark (benchmarks/perturbation/)

  • Both READMEs rewritten around the per-stage flow (perturb_automated.py for generation, then run_benchmark.py --stages prepare/review/score/report).
  • Committed a canonical configs/default.yaml with the paper's model roster.
  • No code changes here. The scorer bug fix and the gate/threshold scoring config belong to #87 (Fix substring-match scorer bug; add gate/threshold flags) and land separately once that merges. Until then, the scoring docs describe the current behavior.

@dangng2004 dangng2004 force-pushed the benchmark-cleanup branch 3 times, most recently from 4b3bdc3 to 65150ab on June 28, 2026 22:52
Rename the "outcomes/conference" study to the quality-proxy study to match the
paper, and rewrite both READMEs (top-level + conference_study) around the
paper's four-proxy design, the real run flow, and reproducibility.

Reproducibility:
- The full 197-paper set regenerates deterministically via select_papers.py
  (default output moved from manifests/v1 to manifests/canonical/, written as
  full.json). The 74-paper frontier subset has no regeneration script and is
  not shipped; the README points readers to the authors for its manifest.
- Fix the model roster everywhere to the paper's six backbones (four efficient
  + two frontier), dropping exploratory models. Pin models explicitly in
  baseline.yaml and a new frontier.yaml.
- run_study.py and download_papers.py default to the canonical manifest.
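For readers unfamiliar with what "regenerates deterministically" usually requires, here is a minimal sketch of the general technique: sort the candidate pool, sample with a privately seeded RNG, and serialize with stable key order. The function names `pick_subset` and `dump_manifest`, the seed, and the manifest shape are all assumptions for illustration; this is not the logic of select_papers.py.

```python
import json
import random


def pick_subset(candidates, k, seed=0):
    """Pick k ids reproducibly (hypothetical helper, not the repo's code)."""
    # Sort and de-duplicate so input order cannot change the result.
    pool = sorted(set(candidates))
    if k > len(pool):
        raise ValueError(f"asked for {k} items but only {len(pool)} available")
    # A private RNG avoids interference from global random state.
    rng = random.Random(seed)
    return sorted(rng.sample(pool, k))


def dump_manifest(paper_ids):
    """Serialize with stable key order so reruns are byte-identical."""
    return json.dumps({"papers": list(paper_ids)}, indent=2, sort_keys=True) + "\n"


if __name__ == "__main__":
    ids = [f"paper-{i:03d}" for i in range(20)]
    print(dump_manifest(pick_subset(ids, 5, seed=1)))
```

With this pattern, two runs over the same candidate set produce the same manifest text, which makes a regenerated file easy to diff against a committed one.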

Usability:
- coarse resolves its venv via COARSE_VENV_PYTHON instead of a hardcoded local
  path, with setup docs.
- generate_report.py becomes the single results entry point: it discovers the
  (method, model) cells from a run's results and prints the paper's
  pairwise-accuracy tables with 95% CIs via ci_auc, so no hand-maintained table
  config is needed.
- README leads with a cost preview and a smoke-test path; estimate_cost.py
  carries a TODO for the grok rate.
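To make the reported metric concrete, the sketch below computes a pairwise accuracy and a percentile-bootstrap 95% interval. It is an assumption-laden illustration: the tie handling (score ties count as half), the paper-level resampling, and the function names are hypothetical, and the repository's ci_auc may compute its intervals differently.

```python
import random
from itertools import combinations


def pairwise_accuracy(scores, targets):
    """Share of pairs with different targets that the scores order the same way.

    Score ties count as half a correct pair (an assumption of this sketch).
    """
    agree, total = 0.0, 0
    for i, j in combinations(range(len(scores)), 2):
        if targets[i] == targets[j]:
            continue  # pair carries no ordering information
        total += 1
        ds = scores[i] - scores[j]
        dt = targets[i] - targets[j]
        if ds == 0:
            agree += 0.5
        elif (ds > 0) == (dt > 0):
            agree += 1.0
    return agree / total if total else float("nan")


def bootstrap_ci(scores, targets, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling papers with replacement."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        acc = pairwise_accuracy([scores[i] for i in idx],
                                [targets[i] for i in idx])
        if acc == acc:  # drop NaN resamples that contain no usable pair
            stats.append(acc)
    if not stats:
        raise ValueError("no resample contained a comparable pair")
    stats.sort()
    lo = stats[int((alpha / 2) * (len(stats) - 1))]
    hi = stats[int((1 - alpha / 2) * (len(stats) - 1))]
    return lo, hi
```

For example, `bootstrap_ci(model_scores, proxy_values)` returns a `(lower, upper)` pair that can sit next to the point estimate from `pairwise_accuracy` in a results table.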

Remove the duplicate benchmarks/README.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@dangng2004 dangng2004 changed the title from "Clean up the quality-proxy benchmark for public release" to "Public-ready cleanup of the quality-proxy and perturbation benchmarks" on Jun 29, 2026
dangng2004 and others added 4 commits June 28, 2026 19:24
Rewrite both READMEs around the per-stage run flow (perturb_automated.py for
generation, then run_benchmark.py --stages prepare/review/score/report), drop
the "(v2)" title, and replace the old error taxonomy with the paper's four
categories (Surface / Claim / Reasoning / Experimental). Commit a canonical
configs/default.yaml (configs/ was gitignored entirely) with the paper's model
roster.

Docs and config only; the scorer bug fix and gate/threshold scoring config land
separately with #87.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e verify

Make the perturbation benchmark reproducible from a fresh clone:

- Add a Dataset section pointing to the released perturbed papers (zip), and
  make the released set the recommended reproduction path.
- Add eight per-domain configs (full_*.yaml) so `--configs configs/full_*.yaml`
  reproduces the full 74-paper benchmark in one command; whitelist them in
  .gitignore.
- Fix input_dir to the <domain>/all level (the layout discover_units expects),
  in default.yaml and the README example; the old results/perturbations path
  resolved nowhere.
- Document the real generation chain (perturb_automated -> verify_existing ->
  reinject_existing) with correct, repo-root-aware commands.
- Demote the LLM verify step to the optional regeneration path (docs only), with
  a note that the released set was verifier-filtered and manually audited.
- gitignore the repo-root results/ that run_benchmark.py writes to.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the contact-the-authors step for the conference (quality-proxy) study
with a Google Drive download. The zip bundles manifests/canonical/full.json and
frontier.json (~274K); unzip inside benchmarks/conference_study/. full.json still
regenerates via select_papers.py.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@dangng2004 dangng2004 requested a review from chenhaot June 29, 2026 06:11
@dangng2004 dangng2004 marked this pull request as ready for review June 29, 2026 06:11