Experiment manifests fix configuration (PF commit, agent version, model, policy, budgets, seed) so that when solve rates differ between baseline and PF-guarded runs, you can attribute the difference to the run mode rather than drift.
Each experiment lives under experiments/<experiment_id>/ with a manifest.json containing:
| Field | Description |
|---|---|
| `experiment_id` | Short identifier (e.g. exp-step2-lite-smoke). |
| `pf_commit` | Git SHA of the Provability-Fabric repo at run time. Fill before or after the run so results are tied to a known PF version. |
| `agent_framework` | openhands or swe-agent. |
| `agent_commit` | Git SHA or tag of the agent framework (e.g. OpenHands) when installed from source. |
| `agent_image_tag` | Docker image tag if the agent is run from a container; alternative to agent_commit. |
| `model` | `{ "id": "gpt-4o-mini", "provider": "openai" }` – exact model ID and provider. |
| `dataset` | e.g. swebench_lite. |
| `dataset_split` | e.g. test. |
| `policy_pack` | Policy pack name (e.g. swebench_safe_v1) when running PF-guarded. |
| `budgets` | max_steps, max_tool_calls, timeout_sec – must match policy and runner flags. |
| `model_params` | temperature, top_p (or null) – fixed for reproducibility. |
| `seed` | Fixed integer for any RNG (e.g. OpenHands, sampling). |
| `run_modes` | List of modes for this experiment: ["baseline", "pf_guarded"]. |
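As a rough illustration of the fields above, the following Python sketch builds a manifest dictionary and reports which keys are missing. The `REQUIRED_FIELDS` set, the `missing_manifest_fields` helper, and every concrete value (budget numbers, placeholder SHAs) are assumptions made for this example and are not taken from the repository's tooling. It treats `agent_commit` and `agent_image_tag` as alternatives, as the table describes.

```python
# Assumed set of required keys, read off the field table; the real
# tooling may enforce a different set.
REQUIRED_FIELDS = {
    "experiment_id", "pf_commit", "agent_framework", "model", "dataset",
    "dataset_split", "policy_pack", "budgets", "model_params", "seed",
    "run_modes",
}


def missing_manifest_fields(manifest: dict) -> list[str]:
    """Return the names of missing fields (hypothetical helper)."""
    missing = sorted(REQUIRED_FIELDS - set(manifest))
    # agent_commit and agent_image_tag are alternatives: one is enough.
    if not (manifest.get("agent_commit") or manifest.get("agent_image_tag")):
        missing.append("agent_commit|agent_image_tag")
    return missing


example_manifest = {
    "experiment_id": "exp-step2-lite-smoke",
    "pf_commit": "<git sha>",            # placeholder, filled at run time
    "agent_framework": "openhands",
    "agent_commit": "<git sha or tag>",  # placeholder
    "model": {"id": "gpt-4o-mini", "provider": "openai"},
    "dataset": "swebench_lite",
    "dataset_split": "test",
    "policy_pack": "swebench_safe_v1",
    # Budget numbers are illustrative only.
    "budgets": {"max_steps": 50, "max_tool_calls": 100, "timeout_sec": 1800},
    "model_params": {"temperature": 0.0, "top_p": None},
    "seed": 42,
    "run_modes": ["baseline", "pf_guarded"],
}

print(missing_manifest_fields(example_manifest))
```

An empty list means every assumed field is present; a baseline-only experiment might reasonably relax the `policy_pack` requirement.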
Step 2 (real run + harness) cannot be completed on native Windows: OpenHands uses fcntl and the SWE-bench harness uses resource (both are Unix-only Python modules), and the harness requires Docker. Move the entire real run + harness loop to WSL (Ubuntu recommended) or Linux.
Before proceeding to runs, verify the minimal environment: see experiments/exp-step2-lite-smoke/env-checklist.md and run python3 experiments/scripts/check_wsl_env.py from the repository root (minimal Debian/GCP images often lack a python shim; after source .venv-wsl/bin/activate, python works). If any check fails, do not proceed.
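The kind of check described above can be approximated in a few lines of Python. This is a minimal sketch and not the contents of check_wsl_env.py; the function name `check_minimal_env` and the exact set of checks are assumptions based on the constraints listed in this section (the Unix-only `fcntl` and `resource` modules, and a `docker` executable on PATH).

```python
import importlib.util
import shutil
import sys


def check_minimal_env() -> dict[str, bool]:
    """Hypothetical pre-run check; not the repository's check_wsl_env.py."""
    return {
        # fcntl and resource are Unix-only standard-library modules.
        "fcntl": importlib.util.find_spec("fcntl") is not None,
        "resource": importlib.util.find_spec("resource") is not None,
        # The SWE-bench harness needs a docker executable on PATH.
        "docker": shutil.which("docker") is not None,
        "not_native_windows": sys.platform != "win32",
    }


results = check_minimal_env()
for name, ok in results.items():
    print(f"{name}: {'ok' if ok else 'MISSING'}")
# Mirror the rule in the text: any failed check means do not proceed.
print("proceed" if all(results.values()) else "do not proceed")
```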
One-shot orchestration (WSL/Linux): From repo root, run bash experiments/scripts/run-baseline-pf-cycle.sh to execute the full "first valid baseline + PF pair" workflow (env check, manifest fill, baseline run, PF run, validations, harness, replay sample (run_replay_sample.py writes replay_summary.json), compare with gates, delta lists, case extraction, bucketing). Use --update-run-ids to update run-ids.md in experiments/exp-step2-lite-smoke/ when compare passes (and to run export_publish_artifacts, write publish/PUBLISH.md, publish/GOLDEN.ok, publish/RESULTS.md, publish/VERIFY.md, and append to experiments/scale-results-ledger.jsonl); use --triage to run list_delta_cases and extract_case_bundle. Requires pf CLI and OpenHands; see experiments/exp-step2-lite-smoke/commands.md for manual steps and "Without pf CLI" alternatives.
Golden cycle: One trusted reference run; see experiments/exp-step2-lite-smoke/golden-cycle.md for the canonical command, required artifacts, and acceptance checks. When gates fail, follow experiments/regression-loop.md.
Verification: experiments/VERIFICATION_PLAN.md maps each plan requirement to implementation. Machine verifier: experiments/scripts/verify_publish_bundle.py (no network/Docker); fixture in experiments/fixtures/verify_publish_bundle/; CI: .github/workflows/verify-publish-bundle.yaml. Local tests (no WSL): python experiments/scripts/run_verification_tests.py.
Makefile shortcuts (repo root): make swebench-step2 runs the canonical script (WSL/Linux only). make swebench-compare, make swebench-triage, and make swebench-regressions require BASELINE_RUN_DIR and PF_RUN_DIR (from run-ids.md). make swebench-compare passes the same golden compare gates as run-baseline-pf-cycle.sh Phase 4.3: --require-harness, --require-compliance, --require-patch-apply, and --require-priced-models. swebench-regressions runs list_delta_cases, extract_case_bundle, and bucket_pf_failures_from_cases for the baseline_solved_pf_failed slice. See experiments/exp-step2-lite-smoke/commands.md "Canonical entrypoint and Makefile".
Order of operations for a full SWE-bench + PF cycle (WSL/Linux):
- Env check: `python3 experiments/scripts/check_wsl_env.py` (see experiments/exp-step2-lite-smoke/env-checklist.md).
- Preflight (optional): `pf bench swebench run --dataset Lite --split test --max_instances 5 --preflight` to materialize workspaces and see repo-size hints (see bench/swebench/README.md for runner options).
- Baseline run: See experiments/exp-step2-lite-smoke/commands.md Case 1.1 (`pf bench swebench run ... --engine openhands`).
- PF run: See experiments/exp-step2-lite-smoke/commands.md Case 1.2 (`pf bench swebench run ... --guarded --policy swebench_safe_v1`).
- Validations: `validate_predictions` (baseline + PF), `check_no_stub`, `validate_pf_run`; see commands.md.
- Harness: Run SWE-bench evaluation on both prediction sets; see commands.md Case 1.3.
- Compare: `compare_runs.py` with `--require-harness --require-compliance --require-patch-apply --require-priced-models` for golden parity; see commands.md Case 1.3/1.4. Output `compare.json` includes `meta` (generated time, run dir paths); use `jq '.meta, .patch_apply'`, not `jq '.summary'` (that key does not exist). After overnight runs, re-run compare with the correct `--baseline-run-dir` / `--pf-run-dir`. `run_health_snapshot.py` summarizes one `runs/<run_id>`; `smoke_direct_agent_one.sh` is a one-instance `direct_agent` check (see bench/swebench/README.md and exp-step2-lite-smoke/env-checklist.md).
- Update run IDs and export (on green): `python experiments/scripts/update_run_ids_if_green.py ...` (or use `run-baseline-pf-cycle.sh --update-run-ids`). Writes run-ids.md, runs export_publish_artifacts, writes publish/GOLDEN.ok, RESULTS.md, and VERIFY.md, and appends to experiments/scale-results-ledger.jsonl (see experiments/scale-results-ledger.md and experiments/scripts/append_scale_results_ledger.py).
Runner options (dataset, instance filters, timeouts): bench/swebench/README.md. If baseline or PF runs produce empty patches and the trajectory has only MessageEvent (no ActionEvent), next steps are on the OpenHands side: experiments/exp-step2-lite-smoke/openhands-headless-troubleshooting.md (version check, minimal headless test, model override, GUI comparison). Automated verification of provider routing, eval cleanup scoping, compare strict gates, and timeout defaults: docs/internal/swebench-stabilization-regression-matrix.md (pytest command list and WSL smoke checklist).
- Fill manifest: Set `pf_commit` (and optionally `agent_commit`) before or after the run. From repo root:

      python experiments/scripts/fill_manifest_from_run.py experiments/exp-step2-lite-smoke/manifest.json

  To also copy the filled manifest into a run directory (store alongside results):

      python experiments/scripts/fill_manifest_from_run.py experiments/exp-step2-lite-smoke/manifest.json runs/<run_id>

  The script sets `pf_commit` (from `git rev-parse HEAD`) and `created_at`; if `OPENHANDS_COMMIT` or `AGENT_COMMIT` is set, it fills `agent_commit`.
- Run baseline and PF-guarded: Exact commands for exp-step2-lite-smoke (including `--openhands-max-iterations`, `--openhands-timeout`, instance list) are in experiments/exp-step2-lite-smoke/commands.md (Case 1.1 and 1.2). Record each run_id.
- Store manifest with results: Run the fill script with the run_dir argument so `runs/<run_id>/experiment_manifest.json` is written and the run is tied to the exact config.
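The fill step described in this list can be pictured with the following Python sketch. It is not the source of fill_manifest_from_run.py: the function names are hypothetical, and the precedence of `OPENHANDS_COMMIT` over `AGENT_COMMIT` when both are set is an assumption, since the text does not say which one wins.

```python
import subprocess
from collections.abc import Mapping
from datetime import datetime, timezone


def current_git_sha() -> str:
    """HEAD SHA of the repository in the current working directory."""
    return subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()


def fill_manifest(manifest: dict, pf_commit: str,
                  env: Mapping[str, str], now: datetime) -> dict:
    """Return a copy with pf_commit, created_at and, if known, agent_commit."""
    filled = dict(manifest)
    filled["pf_commit"] = pf_commit
    filled["created_at"] = now.isoformat()
    # Assumption: OPENHANDS_COMMIT takes precedence if both are set.
    agent_commit = env.get("OPENHANDS_COMMIT") or env.get("AGENT_COMMIT")
    if agent_commit:
        filled["agent_commit"] = agent_commit
    return filled


# Demo with fixed inputs so the result is reproducible. In a real run you
# would pass current_git_sha(), os.environ and datetime.now(timezone.utc).
demo = fill_manifest(
    {"experiment_id": "exp-step2-lite-smoke"},
    pf_commit="<git sha>",
    env={"AGENT_COMMIT": "<agent sha>"},
    now=datetime(2024, 1, 1, tzinfo=timezone.utc),
)
print(demo)
```

Keeping the function free of I/O makes it easy to test; writing the result to `runs/<run_id>/experiment_manifest.json` would be a separate step.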
Filling pf_commit and agent_commit (or agent_image_tag) before comparing runs ensures that a solve-rate difference of 2–4 percentage points is not due to commit or image drift.
Reproducibility: Dataset and OpenHands version drift can also affect solve rates. The runner writes runs/<run_id>/env.json (openhands_version, datasets_version, swebench_version, pip_freeze_hash); compare reports env_drift when baseline and PF envs differ. For golden and release runs, pin datasets, swebench, and openhands versions (e.g. in requirements or CI); see bench/swebench/README.md "Reproducibility".
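To make the idea of environment drift concrete, here is a small Python sketch that lists the keys on which two env.json dictionaries disagree. The function name `env_drift_keys`, the returned shape, and the version strings are illustrative assumptions; the actual env_drift field in compare.json may be structured differently.

```python
def env_drift_keys(baseline_env: dict, pf_env: dict) -> dict:
    """Map each differing key to its (baseline, pf) values."""
    keys = sorted(set(baseline_env) | set(pf_env))
    return {
        key: (baseline_env.get(key), pf_env.get(key))
        for key in keys
        if baseline_env.get(key) != pf_env.get(key)
    }


# Placeholder values; only the key names come from the text above.
baseline_env = {
    "openhands_version": "A",
    "datasets_version": "B",
    "swebench_version": "C",
    "pip_freeze_hash": "hash-1",
}
pf_env = dict(baseline_env, datasets_version="B2", pip_freeze_hash="hash-2")

print(env_drift_keys(baseline_env, pf_env))
```

An empty result means both runs recorded identical environments for the compared keys.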
The directory experiments/exp-step2-lite-stress-large-repos/ contains a manifest and instance_ids.txt for 5–10 known-heavy repos (django, astropy, sympy, scikit-learn, matplotlib, sphinx). Use it for manual or scheduled runs to validate that improvements reduce empty-patch rates; compare empty_patch_reasons_topN in compare.json across runs. Not gated in CI. A scheduled GitHub Action (.github/workflows/bench-swebench-stress-scheduled.yaml) runs weekly (Sunday 03:00 UTC) and on workflow_dispatch; it runs the stress experiment, harness, and compare, then experiments/scripts/summarize_stress_run.py to write stress_summary.json (schema_version, pf_commit, agent_commit, dataset_id, dataset_version, harness_id; timeout_rate_*, wall_clock_s_median/p95, guard_overhead_s_median, empty_patch_reasons_topN, solve rates). The workflow uploads compare.json, stress_summary.json, compare.csv and uploads stress_summary.json as the named artifact stress-summary. A Stress regression alerts step runs check_stress_alerts.py; thresholds are in experiments/config/stress_alerts.yaml (optional; script uses built-in defaults if missing). Ledger row shape is defined in experiments/schemas/scale_results_ledger_row.schema.json (append_scale_results_ledger validates each row before append). See experiments/config/README.md. Schema: experiments/schemas/stress_summary.schema.json. See that directory's README.
experiments/exp-step2-lite-medium-50/ and experiments/exp-step2-lite-fullish-200/ provide 50- and 200-instance slices with the same manifest discipline, harness, and compare gates as smoke. Generate instance_ids.txt with sample_lite_instance_ids.py --count 50 --seed 42 (or 200). See each directory's README for setup and acceptance gates by scale.
The file experiments/exp-step2-lite-smoke/instance_ids.txt contains 20 instance IDs from SWE-bench Lite (test split), one per line. This list is stable for the whole iteration loop (baseline vs PF-guarded, reruns). Do not change it during an experiment iteration.
- Sampling: Random sample with fixed seed (42, matching the manifest) and round-robin across repos so multiple repos are represented. No cherry-picking.
- Regenerate: From repo root, with `datasets` installed (`pip install datasets`):

      python experiments/scripts/sample_lite_instance_ids.py --count 20 --seed 42

  Output is written to `experiments/exp-step2-lite-smoke/instance_ids.txt`.
- Broken-harness exclusion: The script supports `--exclude-file <path>` (one instance_id per line). The repo does not currently ship a list of known broken-harness IDs; if you add one, pass it when regenerating.
- Runner: Pass IDs as comma-separated, e.g. `--instance_ids $(cat experiments/exp-step2-lite-smoke/instance_ids.txt | tr '\n' ',')` (strip trailing comma if needed), or build the list in your experiment script from the file lines.
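The sampling and runner bullets above can be sketched in Python as follows. This is one plausible reading of "seeded random sample with round-robin across repos", not the code of sample_lite_instance_ids.py; the helper names and the synthetic instance IDs are assumptions. The second helper builds the comma-separated value without a trailing comma, which avoids the cleanup needed with `tr`.

```python
import random
from collections import defaultdict


def sample_round_robin(instances: list[tuple[str, str]],
                       count: int, seed: int) -> list[str]:
    """Pick up to `count` IDs, cycling across repos after a seeded shuffle.

    `instances` holds (instance_id, repo) pairs.
    """
    rng = random.Random(seed)
    by_repo: dict[str, list[str]] = defaultdict(list)
    for instance_id, repo in sorted(instances):  # sort for determinism
        by_repo[repo].append(instance_id)
    for ids in by_repo.values():
        rng.shuffle(ids)
    repos = sorted(by_repo)
    rng.shuffle(repos)
    picked: list[str] = []
    # Take one ID per repo per pass until enough IDs are collected.
    while len(picked) < count and any(by_repo[repo] for repo in repos):
        for repo in repos:
            if by_repo[repo] and len(picked) < count:
                picked.append(by_repo[repo].pop())
    return picked


def to_instance_ids_arg(lines: list[str]) -> str:
    """Join non-empty lines with commas (no trailing comma)."""
    return ",".join(line.strip() for line in lines if line.strip())


# Synthetic pool: three made-up repos with five made-up IDs each.
pool = [(f"{repo}-{i}", repo) for repo in ("alpha", "beta", "gamma") for i in range(5)]
sample = sample_round_robin(pool, count=6, seed=42)
print(to_instance_ids_arg(sample))
```

Because the generator is seeded and the input is sorted first, repeated calls with the same pool and seed return the same list.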
Use the same SWE-bench harness for both baseline and PF predictions. Step-by-step commands (including wrapper and direct swebench.harness.run_evaluation) are in experiments/exp-step2-lite-smoke/commands.md (Case 1.3). Requires swebench and Docker.
Collect results (pass/fail per instance and failure buckets):

    python experiments/scripts/collect_eval_results.py \
      runs/exp-step2-lite-smoke/baseline/eval \
      runs/exp-step2-lite-smoke/pf/eval

Use --json for machine-readable summary; --csv out.csv for per-instance (instance_id, run_label, status).
compare_runs.py aggregates baseline/pf eval, run summaries, cost reports, policy compliance, and patch_apply_check into compare.json and compare.csv. Exact command and options: experiments/exp-step2-lite-smoke/commands.md (Case 1.3 and 1.4). Run compare after the harness (Case 1.3) so that solve rates are populated. Use --require-harness, --require-compliance, --require-patch-apply, and --require-priced-models for the same gates as the canonical cycle (patch_apply.applies_false must be 0 for Step 2 parity).
Canonical way to update run-ids.md: run python experiments/scripts/update_run_ids_if_green.py --experiment-dir experiments/exp-step2-lite-smoke --baseline-run-dir <path> --pf-run-dir <path> (or with --instance-ids-file; add --allow-empty-patch when some instances have empty patches). The cycle script (run-baseline-pf-cycle.sh --update-run-ids) invokes this with --allow-empty-patch so runs with some empty-patch instances can still update run-ids.md. The script runs all gates (validate_predictions, check_no_stub, validate_pf_run, compare_runs with --require-harness --require-compliance --require-patch-apply --require-priced-models) and only then writes or updates experiments/exp-step2-lite-smoke/run-ids.md.
When gates pass, the script also runs export_publish_artifacts.py (which produces the bundle shape defined in experiments/scripts/publish_bundle.py) and uses experiments/scripts/publish_docs.py to write publish/PUBLISH.md, publish/GOLDEN.ok (run IDs, pf_commit, timestamp, parity_gate_passed), publish/RESULTS.md, and publish/VERIFY.md so every green run produces a ready-to-share artifact folder. The publish bundle shape and GOLDEN.ok keys are defined in experiments/scripts/publish_bundle.py and used by both verify_publish_bundle.py and export_publish_artifacts.py. The verifier (verify_publish_bundle.py) checks the bundle against those constants and uses compare_gates.py for the compare.json gates.
Scale Results Ledger: experiments/scripts/append_scale_results_ledger.py appends one row per run to experiments/scale-results-ledger.jsonl; each row is validated against experiments/schemas/scale_results_ledger_row.schema.json before append. See experiments/SCALE_RESULTS_LEDGER.md. The update_run_ids_if_green script invokes it on green.
Compare outputs (default under --experiment-dir, overridable with --out):
- compare.json: solve rates, delta, cost_per_solved (tokens, wall_clock_s, tool_calls), PF violation rates, replay_success_rate (when replay bundles exist), patch_apply aggregation (total, applies_true, applies_false, errors_topN), violation_reasons_top10, empty_patch_reasons_topN (counts per reason code: agent_no_changes, patch_too_large, diff_timeout, apply_check_failed, workspace_missing_or_failed, guard_denial_prevented_writes), and when both run dirs exist env_drift (differing keys between baseline and PF run `env.json`). When eval_metadata.json exists in eval dirs, compare also emits reproducibility fields: dataset_name, split, datasets_version, swebench_version, harness_dataset_id, openhands_version (from run dir env.json). When replay_summary.json exists in the output dir (written by run_replay_sample.py), compare merges replay (sample_size, success_rate, mismatch_count, replay_fail_reasons_topN). run_replay_sample.py also writes replay/instance_results.jsonl (per instance: patch hashes, match, failure_reason); it replays all PF-resolved zero-violation instances when count ≤20, otherwise a deterministic seeded sample (e.g. 25–50). When PF run dir exists, compare adds policy (reason_codes_topN, denied_commands_topN, commands_seen_topN) and denial recovery (denials_total_pf, episodes_aborted_after_denial_pf, recovered_after_denial_pf_rate). The report shape is defined by experiments/schemas/compare_report.schema.json; if the `jsonschema` package is installed, compare_runs validates the written report and exits with error on mismatch.
- compare.csv: per-instance rows (including baseline_patch_applies, pf_patch_applies) plus `_summary` row; pivot-friendly.
- harness_eval (compare.json): Per-instance test runtime (seconds) from SWE-bench `run_instance.log` (`Test runtime: N seconds`). Each side has `n_parsed` (instances with a parseable log) and `n_instances_in_report` (total in the harness run report). When the harness fails for some instances (e.g. Docker/network errors such as accept4 failed 110), `n_parsed` can be lower than `n_instances_in_report`; re-run the harness or fix the environment (Phase 1.1, run_swebench_eval) if the gap is large.
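The relationship between n_parsed and n_instances_in_report can be illustrated with a short Python sketch. The regular expression is inferred from the quoted "Test runtime: N seconds" line, and the function names, sample log text, and returned dictionary shape are assumptions for this example rather than the implementation in compare_runs.py.

```python
import re
from typing import Optional

# Pattern assumed from the "Test runtime: N seconds" line in run_instance.log.
RUNTIME_RE = re.compile(r"Test runtime:\s*([0-9]+(?:\.[0-9]+)?)\s*seconds")


def parse_test_runtime(log_text: str) -> Optional[float]:
    """Return the runtime in seconds, or None if the log has no such line."""
    match = RUNTIME_RE.search(log_text)
    return float(match.group(1)) if match else None


def summarize_runtimes(logs: dict[str, str], n_instances_in_report: int) -> dict:
    """Summarize one side (baseline or PF) of a harness run."""
    runtimes = {}
    for instance_id, text in logs.items():
        seconds = parse_test_runtime(text)
        if seconds is not None:
            runtimes[instance_id] = seconds
    return {
        "n_parsed": len(runtimes),
        "n_instances_in_report": n_instances_in_report,
        "runtimes_s": runtimes,
    }


# Made-up logs: one parseable, one from an instance whose harness run failed.
logs = {
    "instance-a": "setup done\nTest runtime: 12.5 seconds\n",
    "instance-b": "accept4 failed 110\n",
}
print(summarize_runtimes(logs, n_instances_in_report=2))
```

A gap between the two counts, as in this example, points at instances whose harness run did not reach the test phase.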
JSON schemas: experiments/schemas/compare_report.schema.json (compare output); experiments/schemas/harness_report_min.schema.json (minimum harness report: resolved_ids or resolved_instances, total_instances).
Acceptance gate: compare.json must have numeric baseline.solve_rate and pf.solve_rate (not null). If they are null, run the harness first and ensure reports exist under baseline/eval and pf/eval, then run compare again.
Acceptance gates by scale: The parity gate is pf.solve_rate >= baseline.solve_rate - 0.01 (an absolute drop of at most 0.01, i.e. one percentage point). For larger slices (e.g. 50, 200 instances), define once and document: (1) Max acceptable solve-rate drop (absolute, e.g. 0.01 for smoke, 0.02 for medium, 0.03 for fullish). (2) Max acceptable increase in timeouts or empty-patch rate (e.g. stress_summary.json timeout_rate_pf and empty_patch_reasons_topN must not regress beyond a chosen threshold). Put these in the experiment README (e.g. experiments/exp-step2-lite-medium-50/README.md) and in CI or release checklists so regressions are caught.
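The two gates above (numeric solve rates, then parity within a per-scale tolerance) reduce to a few comparisons. The sketch below assumes compare.json exposes `baseline.solve_rate` and `pf.solve_rate` as described; the `check_parity` helper and the `MAX_DROP_BY_SCALE` table are hypothetical, with the tolerances copied from the examples in the text.

```python
# Example tolerances from the text; each experiment should document its own.
MAX_DROP_BY_SCALE = {"smoke": 0.01, "medium": 0.02, "fullish": 0.03}


def _is_number(value) -> bool:
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def check_parity(compare: dict, max_drop: float = 0.01) -> list[str]:
    """Return gate-failure messages; an empty list means the gates pass."""
    baseline = compare.get("baseline", {}).get("solve_rate")
    pf = compare.get("pf", {}).get("solve_rate")
    if not (_is_number(baseline) and _is_number(pf)):
        return ["solve rates must be numeric: run the harness, then compare again"]
    if pf < baseline - max_drop:
        return [f"parity gate failed: pf {pf} < baseline {baseline} - {max_drop}"]
    return []


report = {"baseline": {"solve_rate": 0.40}, "pf": {"solve_rate": 0.38}}
for scale, max_drop in MAX_DROP_BY_SCALE.items():
    print(scale, check_parity(report, max_drop) or "pass")
```

With exact boundary values, floating-point rounding can tip the comparison either way, so a real gate may want a small epsilon or integer counts of solved instances.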
Omit --baseline-run-dir or --pf-run-dir when only one run exists; metrics are filled where possible.
Run/eval binding (--require-harness): When --require-harness is set, compare_runs enforces that the eval dirs correspond to the run that produced the predictions: it reads run_status.json from the predictions directory (e.g. experiment_dir/baseline/run_status.json), asserts that the provided run dir name equals that run_id, and that eval_metadata.json in each eval dir (written by run_swebench_eval.py after the harness) has the same run_id. Stale-eval check: If the predictions file timestamp is newer than the eval report timestamp, compare fails (re-run harness before compare). If the predictions file has a sidecar predictions.sha256, compare verifies it matches the hash recorded in eval_metadata so that eval was run on the same predictions file. Budget drift: When both baseline and PF run dirs contain experiment_manifest.json, compare asserts that timeout_sec, max_steps, max_tool_calls, model, and model_params match; on mismatch it sets report["budget_drift"] and exits with error. A wrong pairing (e.g. baseline eval with PF run dir) fails with a clear run_id or predictions_sha256 mismatch error.
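The binding rules above can be summarized in a Python sketch. It is a simplification, not the logic of compare_runs.py: the function names are hypothetical, the `predictions_sha256` key in eval_metadata is inferred from the error message mentioned in the text, and the hash is computed directly from the predictions bytes instead of being read from the predictions.sha256 sidecar.

```python
import hashlib


def binding_errors(run_dir_name: str, run_status: dict,
                   eval_metadata: dict, predictions: bytes) -> list[str]:
    """Check that an eval dir belongs to the run that produced the predictions."""
    errors = []
    run_id = run_status.get("run_id")
    if run_id != run_dir_name:
        errors.append("run_id mismatch: run dir name vs run_status.json")
    if eval_metadata.get("run_id") != run_id:
        errors.append("run_id mismatch: eval_metadata.json vs run_status.json")
    # Assumed key name; only checked when the eval recorded a hash.
    recorded = eval_metadata.get("predictions_sha256")
    if recorded is not None and recorded != hashlib.sha256(predictions).hexdigest():
        errors.append("predictions_sha256 mismatch")
    return errors


def eval_is_stale(predictions_mtime: float, eval_report_mtime: float) -> bool:
    """Predictions newer than the eval report: re-run the harness first."""
    return predictions_mtime > eval_report_mtime


predictions = b'{"instance_id": "example", "model_patch": "diff"}\n'
good_meta = {"run_id": "run-1", "predictions_sha256": hashlib.sha256(predictions).hexdigest()}
print(binding_errors("run-1", {"run_id": "run-1"}, good_meta, predictions))
print(binding_errors("run-2", {"run_id": "run-1"}, good_meta, predictions + b"x"))
```

The second call simulates a wrong pairing and reports both a run_id mismatch and a hash mismatch.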
After a run that writes predictions.jsonl (e.g. baseline or PF-guarded), run:

    python experiments/scripts/validate_predictions.py runs/exp-step2-lite-smoke/baseline/predictions.jsonl -n 20 --instance-ids-file experiments/exp-step2-lite-smoke/instance_ids.txt

Add --allow-empty-patch when some instances produced no patch (e.g. OpenHands failed to produce a diff); the cycle script uses this for Phase 2.2/3.2 and for Phase 5a (update_run_ids_if_green). Checks: JSONL line count equals 20; every instance_id appears exactly once and is in the instance list; each model_patch is a non-empty diff (unless --allow-empty-patch); predictions.pfmeta.jsonl (if present) has the same line count and matching instance_id order. If a run_status.json exists in the same directory as the predictions file (written by the runner) and its status is not complete, validation fails unless --allow-partial is set (use that only when intentionally validating a partial or failed run).
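A minimal Python sketch of the core checks may help when reading validation failures. It covers only the line count, uniqueness, membership, and non-empty patch rules; it does not cover predictions.pfmeta.jsonl or run_status.json, and it only tests that `model_patch` is non-empty, not that it is a well-formed diff. The `check_predictions` name and the sample records are hypothetical; this is not validate_predictions.py.

```python
import json


def check_predictions(lines: list[str], expected_ids: list[str],
                      allow_empty_patch: bool = False) -> list[str]:
    """Return validation errors for predictions.jsonl content."""
    errors = []
    records = [json.loads(line) for line in lines if line.strip()]
    if len(records) != len(expected_ids):
        errors.append(f"expected {len(expected_ids)} lines, found {len(records)}")
    seen = set()
    for record in records:
        instance_id = record.get("instance_id")
        if instance_id in seen:
            errors.append(f"duplicate instance_id: {instance_id}")
        seen.add(instance_id)
        if instance_id not in expected_ids:
            errors.append(f"unexpected instance_id: {instance_id}")
        if not allow_empty_patch and not (record.get("model_patch") or "").strip():
            errors.append(f"empty model_patch: {instance_id}")
    for instance_id in sorted(set(expected_ids) - seen):
        errors.append(f"missing instance_id: {instance_id}")
    return errors


# Made-up records: the second one has an empty patch.
sample_lines = [
    json.dumps({"instance_id": "id-1", "model_patch": "diff --git a/x b/x\n"}),
    json.dumps({"instance_id": "id-2", "model_patch": ""}),
]
print(check_predictions(sample_lines, ["id-1", "id-2"]))
print(check_predictions(sample_lines, ["id-1", "id-2"], allow_empty_patch=True))
```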
Stub check (mandatory for evaluation): If any model.patch under the run dirs contains .swebench_stub, the run is invalid. After baseline and PF runs, run:

    python experiments/scripts/check_no_stub.py runs/exp-step2-lite-smoke/baseline runs/exp-step2-lite-smoke/pf

If this exits non-zero, do not use the run for evaluation; the pipeline must use real OpenHands (no stub fallback).
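The stub rule is simple enough to sketch in Python: search every model.patch under the given run directories for the marker string. The helper name `find_stub_patches` is hypothetical and the traversal with `rglob` is an assumption about the directory layout; this is not the source of check_no_stub.py.

```python
from pathlib import Path

STUB_MARKER = ".swebench_stub"


def find_stub_patches(run_dirs: list[Path]) -> list[Path]:
    """Return every model.patch under run_dirs that contains the stub marker."""
    hits = []
    for run_dir in run_dirs:
        for patch in sorted(run_dir.rglob("model.patch")):
            # errors="replace" keeps the scan going on odd encodings.
            if STUB_MARKER in patch.read_text(errors="replace"):
                hits.append(patch)
    return hits


def runs_are_valid(run_dirs: list[Path]) -> bool:
    """A run set is usable for evaluation only if no stub patch is found."""
    return not find_stub_patches(run_dirs)
```

A wrapper script would typically print the offending paths and exit non-zero when `runs_are_valid` returns False.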
- publish_docs.py – Builds PUBLISH.md, RESULTS.md, and VERIFY.md content (build_publish_md, build_results_md, build_verify_md). Used by update_run_ids_if_green.py; content is testable without I/O.
- publish_bundle.py – Single definition of a valid publish bundle: PUBLISH_BUNDLE_REQUIRED_FILES, PUBLISH_BUNDLE_REQUIRED_DIRS, GOLDEN_OK_REQUIRED_KEYS, EXPORT_PRODUCES_FILES/DIRS. Used by verify_publish_bundle.py and export_publish_artifacts.py.
- compare_gates.py – check_compare_gates(compare) returns list of gate-failure messages (solve rates numeric, patch_apply.applies_false == 0, replay.success_rate present, policy section). Used by verify_publish_bundle.py.
- check_stress_alerts.py – Compares stress_summary.json and compare.json to regression thresholds. Loads thresholds from experiments/config/stress_alerts.yaml (optional; see experiments/config/README.md); use --config to override.
- Schemas: compare_report.schema.json, stress_summary.schema.json, scale_results_ledger_row.schema.json (ledger rows validated before append). Fixtures: experiments/fixtures/verify_publish_bundle/ for the verifier; experiments/fixtures/README.md.
- exp-step2-lite-smoke – First-pass SWE-bench Lite: baseline vs PF-guarded with fixed seed and budgets; policy pack `swebench_safe_v1`. Uses the deterministic slice above (20 instance_ids).
The runner runs git apply --check on the produced patch to ensure it applies cleanly at base_commit. The check must run in a clean tree at HEAD (same as base_commit). Previously it ran in the workspace that still had the agent's uncommitted changes, so the patch failed to apply. Resetting the main workspace before the check would fix apply-check but would break replay capture, which reads final file contents from the repo. The fix: run_patch_apply_check uses a temporary git worktree at HEAD: we run git apply --check there and then remove the worktree. The main workspace is never mutated, so apply-check passes when the patch is valid and replay bundle capture (which runs afterward) still sees the agent's edits. The cycle script uses --allow-empty-patch in validate_predictions so validation passes when some instances have empty patches (common with OpenHands).
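The temporary-worktree approach described above can be sketched with standard git commands driven from Python. This is an illustration of the technique under stated assumptions, not the body of run_patch_apply_check: the function name, the temporary paths, and the choice of `--detach` and `--force` are made for this example.

```python
import subprocess
import tempfile
from pathlib import Path


def patch_applies_at_head(repo: Path, patch_text: str) -> bool:
    """Check a patch against a clean checkout of HEAD.

    The main working tree of `repo` is never modified, so uncommitted
    agent edits stay available for later steps such as replay capture.
    """
    with tempfile.TemporaryDirectory() as tmp:
        worktree = Path(tmp) / "apply-check"
        patch_file = Path(tmp) / "candidate.patch"
        patch_file.write_text(patch_text)
        # Detached worktree at HEAD: a clean tree next to the dirty workspace.
        subprocess.run(
            ["git", "-C", str(repo), "worktree", "add", "--detach", str(worktree), "HEAD"],
            check=True, capture_output=True,
        )
        try:
            result = subprocess.run(
                ["git", "-C", str(worktree), "apply", "--check", str(patch_file)],
                capture_output=True,
            )
            return result.returncode == 0
        finally:
            # Always remove the worktree so the repo's metadata stays tidy.
            subprocess.run(
                ["git", "-C", str(repo), "worktree", "remove", "--force", str(worktree)],
                check=True, capture_output=True,
            )
```

Running the same check inside the dirty workspace would fail for a valid patch, because the agent's edits are already present there; the separate worktree avoids that without resetting anything.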
See docs/internal/pf-solve-rate-debugging.md for the full workflow: identify baseline-solved / PF-failed from compare.csv, extract artifacts (extract script), categorize into five buckets (categorize script), apply fixes in order (agent recovery, allowlist, budget, patch extraction), rerun and re-compare. Run/eval/compare commands for this experiment: experiments/exp-step2-lite-smoke/commands.md.