Public-ready cleanup of the quality-proxy and perturbation benchmarks #105
Open
dangng2004 wants to merge 5 commits into
Rename the "outcomes/conference" study to the quality-proxy study to match the paper, and rewrite both READMEs (top-level + conference_study) around the paper's four-proxy design, the real run flow, and reproducibility.

Reproducibility:
- The full 197-paper set regenerates deterministically via select_papers.py (default output moved from manifests/v1 to manifests/canonical/, written as full.json). The 74-paper frontier subset has no regeneration script and is not shipped; the README points readers to the authors for its manifest.
- Fix the model roster everywhere to the paper's six backbones (four efficient + two frontier), dropping exploratory models. Pin models explicitly in baseline.yaml and a new frontier.yaml.
- run_study.py and download_papers.py default to the canonical manifest.

Usability:
- coarse resolves its venv via COARSE_VENV_PYTHON instead of a hardcoded local path, with setup docs.
- generate_report.py becomes the single results entry point: it discovers the (method, model) cells from a run's results and prints the paper's pairwise-accuracy tables with 95% CIs via ci_auc, so no hand-maintained table config is needed.
- README leads with a cost preview and a smoke-test path; estimate_cost.py carries a TODO for the grok rate.

Remove the duplicate benchmarks/README.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
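The COARSE_VENV_PYTHON lookup described in the commit above could look roughly like the sketch below. This is not the code from the PR: the helper name `resolve_coarse_python` and its error handling are assumptions made for illustration. Only the environment variable name comes from the commit message.

```python
import os
import sys
from pathlib import Path


def resolve_coarse_python(env=None):
    """Return the interpreter path named by COARSE_VENV_PYTHON.

    Hypothetical helper: the function name and the error behaviour are
    assumptions, not taken from the repository.
    """
    env = os.environ if env is None else env
    raw = env.get("COARSE_VENV_PYTHON", "").strip()
    if not raw:
        # Fail loudly instead of falling back to a machine-specific path.
        raise RuntimeError(
            "COARSE_VENV_PYTHON is not set; point it at the python "
            "executable inside the coarse virtual environment"
        )
    path = Path(raw).expanduser()
    if not path.is_file():
        raise RuntimeError(f"COARSE_VENV_PYTHON points to a missing file: {path}")
    return path


if __name__ == "__main__":
    # Demo only: use the current interpreter as a stand-in for the venv python.
    print(resolve_coarse_python({"COARSE_VENV_PYTHON": sys.executable}))
```

Raising when the variable is unset, rather than guessing a default, keeps the failure visible on a fresh clone, which seems to be the intent of dropping the hardcoded local path.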
Rewrite both READMEs around the per-stage run flow (perturb_automated.py for generation, then run_benchmark.py --stages prepare/review/score/report), drop the "(v2)" title, and replace the old error taxonomy with the paper's four categories (Surface / Claim / Reasoning / Experimental).

Commit a canonical configs/default.yaml (configs/ was gitignored entirely) with the paper's model roster.

Docs and config only; the scorer bug fix and gate/threshold scoring config land separately with #87.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e verify

Make the perturbation benchmark reproducible from a fresh clone:
- Add a Dataset section pointing to the released perturbed papers (zip), and make the released set the recommended reproduction path.
- Add eight per-domain configs (full_*.yaml) so `--configs configs/full_*.yaml` reproduces the full 74-paper benchmark in one command; whitelist them in .gitignore.
- Fix input_dir to the <domain>/all level (the layout discover_units expects), in default.yaml and the README example; the old results/perturbations path resolved nowhere.
- Document the real generation chain (perturb_automated -> verify_existing -> reinject_existing) with correct, repo-root-aware commands.
- Demote the LLM verify step to the optional regeneration path (docs only), with a note that the released set was verifier-filtered and manually audited.
- gitignore the repo-root results/ that run_benchmark.py writes to.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the contact-the-authors step for the conference (quality-proxy) study with a Google Drive download. The zip bundles manifests/canonical/full.json and frontier.json (~274K); unzip inside benchmarks/conference_study/. full.json still regenerates via select_papers.py.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
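A quick post-unzip sanity check might look like the sketch below. It assumes both manifests land under manifests/canonical/ relative to benchmarks/conference_study/. The commit message states this for full.json but is ambiguous for frontier.json, so treat that path as a guess. The helper name `missing_manifests` is hypothetical.

```python
from pathlib import Path

# Assumed locations relative to benchmarks/conference_study/.
# The frontier.json location is an assumption; the commit message
# only gives the directory explicitly for full.json.
EXPECTED_MANIFESTS = (
    "manifests/canonical/full.json",
    "manifests/canonical/frontier.json",
)


def missing_manifests(study_dir):
    """Return the expected manifest paths that are absent under study_dir."""
    root = Path(study_dir)
    return [rel for rel in EXPECTED_MANIFESTS if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_manifests("benchmarks/conference_study")
    if missing:
        print("Missing manifests:", ", ".join(missing))
    else:
        print("Both manifests found.")
```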
Public-ready cleanup of both benchmark studies, so each is documented to match the paper and runs from a fresh clone.
Quality-proxy study (`benchmarks/conference_study/`)
- The full 197-paper set regenerates deterministically via `select_papers.py`.
- Models pinned explicitly in `baseline.yaml` + new `frontier.yaml`. Canonical manifest renamed `combined.json` -> `full.json`.
- `coarse` resolves its venv via `COARSE_VENV_PYTHON` (no hardcoded path) with setup docs.
- `generate_report.py` is the single results entry point: discovers `(method, model)` cells from a run, merges multiple result dirs, defaults to the roster + the paper's `progressive_original` variant (flags `--all-models`, `--consolidated`), and prints the pairwise-accuracy tables with 95% CIs via `ci_auc`. Removes the gitignored hand-maintained table config.
- Removes the duplicate `benchmarks/README.md`.

Perturbation benchmark (`benchmarks/perturbation/`)
- READMEs rewritten around the per-stage run flow (`perturb_automated.py` for generation, then `run_benchmark.py --stages prepare/review/score/report`).
- Commits a canonical `configs/default.yaml` with the paper's model roster.
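
For readers unfamiliar with the metric behind the pairwise-accuracy tables that `generate_report.py` prints, here is a minimal sketch of one common definition with a percentile-bootstrap 95% CI. It is not the repository's `ci_auc`, whose signature and method are not shown in this PR. The sketch assumes pairwise accuracy means the fraction of (higher-quality, lower-quality) paper pairs that a reviewer scores in the right order, with ties counted as half. The function names are hypothetical.

```python
import random


def pairwise_accuracy(pos_scores, neg_scores):
    """Fraction of (pos, neg) pairs with pos scored higher; ties count 0.5."""
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one score in each group")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def bootstrap_ci(pos_scores, neg_scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI, resampling each group with replacement."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    stats = []
    for _ in range(n_boot):
        pos = rng.choices(pos_scores, k=len(pos_scores))
        neg = rng.choices(neg_scores, k=len(neg_scores))
        stats.append(pairwise_accuracy(pos, neg))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


if __name__ == "__main__":
    # Toy scores, made up for illustration only.
    accepted = [6.5, 7.0, 5.5, 8.0]
    rejected = [5.0, 6.0, 4.5, 5.5]
    print(pairwise_accuracy(accepted, rejected), bootstrap_ci(accepted, rejected))
```

Under this definition the statistic equals the AUC of the reviewer score as a classifier between the two groups, which is presumably why the repository's CI helper is named `ci_auc`.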