
Commit de913d9

juaristi22 and claude committed
Wire up the paper-reported benchmark suite: tier manifests, orchestrator, fixes
This commit lands the production scaffolding called for by paper-l0/BENCHMARK_PLAN.md so the L0 / GREG / IPF benchmark can be launched end-to-end against the saved calibration package without further code changes.

Tier manifests (paper-l0/benchmarking/manifests/tier*.json)

All ten paper-reported manifests, each reading from one shared calibration_package.pkl and differing only in target_filters:

- tier1_mixed.json: L0 + GREG on the full filtered slice (count + dollar combined) over a five-state, ten-district geography subset plus national targets.
- tier1_ipf.json: L0 + IPF on the same slice; IPF retains the authored closed subset and the matched runs use --train-on ipf_retained_authored.
- tier2_scaling_250 ... tier2_scaling_largest.json: Scaling ladder (250 / 500 / 1k / 2.5k / 5k / 10k / largest coherent pre-production subset) over the same package, growing geography coverage to grow the target set instead of the unit universe.
- tier3_production.json: Least-filtered view; failures are reportable results (status=failed with notes) rather than missing rows.

All tier manifests opt into method_options.ipf.return_na = true so non-convergence surfaces NaN and the runner converts that into a visible runtime error instead of writing NaN-laden weights to disk.

Geographic ID format is the convention the saved package actually uses: state FIPS without leading zeros for single-digit states (e.g. CA = "6", CA-01 = "601"). The two pre-existing example manifests are updated to match.

Orchestrator (paper-l0/benchmarking/run_benchmark_suite.py)

Exports each manifest, runs every method declared in it, and (when IPF is in the manifest) schedules matched-input rows for the other methods via --train-on ipf_retained_authored --score-on ipf_retained_authored. Aggregates per-manifest summaries into one tier_<n>_summary.csv per tier plus a unified suite_summary.csv.

Failures (export-time IPFConversionError, runner non-zero exits, missing summary files) are recorded as status=failed rows with the captured reason in notes; the orchestrator never aborts the suite, which is load-bearing for Tier 3's "completed-or-failed" reporting story.

Tier 2 rungs are discovered and ordered by target_filters.max_targets (smallest first; uncapped rungs sort last) so the summary reads top-to-bottom in increasing target count regardless of filename sort order.

Both run_benchmark_suite.py and benchmark_cli.py prepend the in-tree repo root to sys.path before importing policyengine_us_data. This avoids being shadowed by an editable install of the same package name pointing at a sibling repo, which previously caused fit_l0_weights() to lose the seed parameter at script-invocation time only (not when imported from another module).

IPF runner: returnNA=FALSE design task closed

paper-l0/benchmarking/runners/ipf_runner.R now accepts an optional 11th positional argument return_na (default TRUE for backwards compatibility). When return_na is TRUE and surveysd::ipf returns any NaN weights -- the silent-non-convergence path called out in the BENCHMARK_PLAN's "Immediate next design tasks" -- the runner exits with a clear error rather than writing NaN weights. The Python CLI plumbs method_options.ipf.return_na through to the runner so manifests control the behavior declaratively.

Modal-artifact ingest (paper-l0/benchmarking/patch_package_paths.py)

The Modal-built calibration_package.pkl records absolute paths from the build container (/pipeline/artifacts/<run_id>/...) for metadata.dataset_path and metadata.db_path. The new helper rewrites those paths in-place to point at local copies of the dataset H5 and policy_data.db so the rest of the benchmark scaffold (and especially benchmark_export.build_ipf_inputs, which has hard existence checks on both paths) runs unchanged on a local checkout.

Documentation (paper-l0/benchmarking/README.md)

New "Paper-reported tiers" section: which manifests belong to which tier, what each method does at each rung, the run_benchmark_suite.py entry point with example invocations, and the per-tier summary CSV layout.

Verification done before committing

Three end-to-end smoke runs were exercised against the production calibration package (17,736 targets x 5,159,570 cloned units, n_clones=430) to verify the full pipeline:

1. count-only smoke (29 targets, 100 epochs L0): all five rows green -- L0 + GREG full-info, IPF on retained-authored subset, plus matched L0 and GREG runs trained and scored on the IPF subset.
2. mixed-target smoke (count + dollar combined): L0 and GREG both completed end-to-end; this exercises the Tier 1 / Tier 2 mixed-target path that count_like_only=true cannot reach.
3. forced non-convergence (max_iter=5, epsP=1e-20): surveysd::ipf returned NaN weights, the runner stopped with the new "did not converge" message, and the orchestrator recorded status=failed with the full R traceback in notes.

The smoke manifests, scratch run dirs, and download scratch were removed from the repo before committing. The plans (paper-l0/BENCHMARK_PLAN.md and outline.md) are intentionally not included in this commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 420deb4 commit de913d9

17 files changed

Lines changed: 1234 additions & 70 deletions

paper-l0/benchmarking/README.md

Lines changed: 62 additions & 0 deletions
@@ -347,3 +347,65 @@ make benchmarking-export MANIFEST=paper-l0/benchmarking/manifests/greg_demo_smal
 make benchmarking-run-greg RUN_DIR=paper-l0/benchmarking/runs/greg_demo_small
 make benchmarking-run-l0 RUN_DIR=paper-l0/benchmarking/runs/greg_demo_small
 ```
+
+## Paper-reported tiers
+
+The manifests under `manifests/tier*.json` are the paper-reported benchmark
+configurations from `paper-l0/BENCHMARK_PLAN.md`. They all read from the same
+saved calibration package and differ only in `target_filters` — the unit
+universe, clone count, source dataset, and initial calibration package are
+fixed.
+
+| File | Tier | Methods | Scope |
+| --- | --- | --- | --- |
+| `tier1_mixed.json` | 1 | L0, GREG | Full filtered slice (count + dollar) over a 5-state, 10-district subset plus national targets |
+| `tier1_ipf.json` | 1 | L0, IPF | Same slice; IPF retains the authored closed subset |
+| `tier2_scaling_250.json` ... `tier2_scaling_10000.json` | 2 | L0, GREG, IPF | Scaling ladder by `max_targets`, expanding geography coverage to grow the target set |
+| `tier2_scaling_largest.json` | 2 | L0, GREG, IPF | Largest coherent pre-production subset (no `max_targets` cap) |
+| `tier3_production.json` | 3 | L0, GREG, IPF | Least-filtered view; failures are reportable results |
+
+All Tier 2 / Tier 3 manifests set `method_options.ipf.return_na = true` so
+non-convergence surfaces NaN weights, which `ipf_runner.R` converts into a
+visible runtime error rather than a silent fitted-weight column. A bounded
+GREG variant is intentionally out of scope for the current benchmark.
+
+### One-shot orchestration
+
+`run_benchmark_suite.py` exports each manifest, runs every method declared in
+it, schedules matched IPF / L0 / GREG comparisons (`--train-on
+ipf_retained_authored --score-on ipf_retained_authored`) when IPF is in play,
+and aggregates per-tier summary tables.
+
+```bash
+# All three tiers end-to-end (requires built calibration_package.pkl).
+python paper-l0/benchmarking/run_benchmark_suite.py \
+  --runs-dir paper-l0/benchmarking/runs
+
+# A single tier.
+python paper-l0/benchmarking/run_benchmark_suite.py \
+  --tier tier_1 \
+  --runs-dir paper-l0/benchmarking/runs
+
+# A single rung (re-run after a CI failure).
+python paper-l0/benchmarking/run_benchmark_suite.py \
+  --manifest paper-l0/benchmarking/manifests/tier2_scaling_2500.json \
+  --runs-dir paper-l0/benchmarking/runs
+```
+
+Outputs in `--runs-dir`:
+
+- `tier_1_summary.csv`, `tier_2_summary.csv`, `tier_3_summary.csv` — one row
+  per method per manifest, with status (`completed` / `failed`), runtime,
+  target / unit counts, and the standard error metrics from
+  `compute_common_metrics`. Matched IPF / L0 / GREG rows are tagged with
+  `training_target_set = ipf_retained_authored`.
+- `suite_summary.csv` — concatenated view across all tiers.
+- `<manifest>/inputs/`, `<manifest>/outputs/` — the per-manifest bundle that
+  `benchmark_cli.py export` and `run` produce, including
+  `ipf_conversion_diagnostics.json` whenever IPF was in scope.
+
+Failures (export-time `IPFConversionError`, runner non-zero exit, missing
+output files) appear as `status = failed` rows with the captured reason in
+`notes`. The orchestrator never aborts the suite — Tier 3 explicitly relies
+on this so a GREG out-of-memory or IPF non-convergence is a reportable result
+rather than a missing row.
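The "never aborts the suite" behavior described in the README text above can be illustrated with a minimal sketch. This is not the code in `run_benchmark_suite.py`, which is not shown in this diff; the `SummaryRow` dataclass and `run_step` helper are hypothetical names, and only the `status` / `notes` column names come from the README.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class SummaryRow:
    """One summary row per method per manifest (hypothetical layout)."""
    manifest: str
    method: str
    status: str  # "completed" or "failed"
    notes: str = ""


def run_step(manifest: str, method: str, cmd: list) -> SummaryRow:
    """Run one command and always return a row, even when the command fails."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True)
    except OSError as exc:
        # The executable could not be launched at all.
        return SummaryRow(manifest, method, "failed", f"could not launch: {exc}")
    if proc.returncode != 0:
        # Keep the captured reason so the failure is a reportable result.
        reason = (proc.stderr or proc.stdout).strip()
        return SummaryRow(manifest, method, "failed", reason)
    return SummaryRow(manifest, method, "completed")


if __name__ == "__main__":
    steps = [
        ("demo", "l0", [sys.executable, "-c", "print('ok')"]),
        ("demo", "ipf", [sys.executable, "-c", "import sys; sys.exit('did not converge')"]),
    ]
    for row in (run_step(*step) for step in steps):
        print(row)
```

Because every step yields a row, a later aggregation pass can write the per-tier CSVs without special-casing missing results.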

paper-l0/benchmarking/benchmark_cli.py

Lines changed: 11 additions & 0 deletions
@@ -7,6 +7,15 @@
 import time
 from pathlib import Path
 
+# This script can be invoked directly (`python paper-l0/benchmarking/benchmark_cli.py
+# ...`), in which case Python sets sys.path[0] to the script's directory and a sibling
+# editable-installed `policyengine_us_data` package can shadow the in-tree copy. Pin
+# the in-tree repo root ahead of site-packages so `fit_l0_weights` resolves to the
+# version in this repo, not whichever editable install pip found first.
+_REPO_ROOT = Path(__file__).resolve().parents[2]
+if str(_REPO_ROOT) not in sys.path:
+    sys.path.insert(0, str(_REPO_ROOT))
+
 import numpy as np
 import pandas as pd
 
@@ -216,6 +225,7 @@ def _run_ipf(run_dir: Path):
     temp_unit_csv = outputs / "_ipf_unit_metadata.csv"
     unit_with_weights.to_csv(temp_unit_csv, index=False)
 
+    return_na_flag = "true" if bool(options.get("return_na", False)) else "false"
     cmd = [
         "Rscript",
         str(RUNNERS_DIR / "ipf_runner.R"),
@@ -229,6 +239,7 @@ def _run_ipf(run_dir: Path):
         str(float(options.get("epsH", 1e-2))),
         household_id_col,
         weight_col,
+        return_na_flag,
     ]
     proc, elapsed = _run_subprocess(cmd)
     if proc.returncode != 0:
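One way to confirm that the path pinning in the first hunk has the intended effect is to ask the import system where a package would be loaded from. The sketch below is an illustration only: `pin_root` and `resolved_origin` are hypothetical helpers, not functions in `benchmark_cli.py`. It mirrors the committed guard, which inserts the root only when it is not already on `sys.path`.

```python
import importlib.util
import sys
from pathlib import Path
from typing import Optional


def pin_root(root: Path) -> None:
    """Put `root` at the front of sys.path unless it is already listed."""
    if str(root) not in sys.path:
        sys.path.insert(0, str(root))


def resolved_origin(package: str) -> Optional[str]:
    """Return the file a top-level package would be imported from, or None."""
    spec = importlib.util.find_spec(package)
    return None if spec is None else spec.origin


if __name__ == "__main__":
    # With the repo root pinned, the reported origin for the in-tree package
    # should sit under that root rather than under a sibling checkout.
    print(resolved_origin("json"))
```

Calling `resolved_origin("policyengine_us_data")` after the pin, and checking that the result lies under the repo root, would be a quick diagnostic for the shadowing problem the comment describes.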
Lines changed: 36 additions & 36 deletions
@@ -1,55 +1,55 @@
 {
-  "name": "greg_demo_small",
-  "tier": "tier_a",
   "description": "Example GREG benchmark manifest for a reduced package and a coherent geography subset.",
-  "package_path": "policyengine_us_data/storage/calibration/calibration_package.pkl",
+  "external_inputs": {},
+  "method_options": {
+    "greg": {
+      "epsilon": 1e-07,
+      "maxit": 200
+    },
+    "l0": {
+      "beta": 0.65,
+      "device": "cpu",
+      "epochs": 1000,
+      "lambda_l0": 1e-08,
+      "lambda_l2": 1e-12,
+      "learning_rate": 0.15,
+      "seed": 42
+    }
+  },
   "methods": [
     "l0",
     "greg"
   ],
+  "name": "greg_demo_small",
+  "package_path": "policyengine_us_data/storage/calibration/calibration_package.pkl",
   "target_filters": {
+    "count_like_only": false,
+    "district_ids": [
+      "601",
+      "602",
+      "1201",
+      "1202",
+      "3601",
+      "3602",
+      "4801",
+      "4802",
+      "5301",
+      "5302"
+    ],
     "include_geo_levels": [
       "national",
       "state",
       "district"
     ],
     "include_national": true,
+    "max_targets": 1000,
     "state_ids": [
-      "06",
+      "6",
       "12",
       "36",
       "48",
       "53"
-    ],
-    "district_ids": [
-      "0601",
-      "0602",
-      "1201",
-      "1202",
-      "3601",
-      "3602",
-      "4801",
-      "4802",
-      "5301",
-      "5302"
-    ],
-    "count_like_only": false,
-    "max_targets": 1000
+    ]
   },
-  "external_inputs": {},
-  "method_options": {
-    "l0": {
-      "lambda_l0": 1e-08,
-      "epochs": 1000,
-      "device": "cpu",
-      "beta": 0.65,
-      "lambda_l2": 1e-12,
-      "learning_rate": 0.15,
-      "seed": 42
-    },
-    "greg": {
-      "maxit": 200,
-      "epsilon": 1e-07
-    }
-  }
-}
+  "tier": "tier_a"
+}
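The only value changes in this manifest are the geography IDs ("06" becomes "6", "0601" becomes "601"), matching the unpadded FIPS convention the commit message describes. A minimal sketch of that normalization, assuming IDs are purely numeric strings; `strip_fips_padding` is a hypothetical helper and not part of the repository:

```python
def strip_fips_padding(geo_id: str) -> str:
    """Drop leading zeros from a numeric geography ID.

    Assumes the ID is a string of digits, e.g. a state FIPS code ("06")
    or a state + district code ("0601").
    """
    if not geo_id.isdigit():
        raise ValueError(f"expected a numeric geography ID, got {geo_id!r}")
    return str(int(geo_id))


if __name__ == "__main__":
    for old in ["06", "12", "0601", "0602", "1201", "5302"]:
        print(old, "->", strip_fips_padding(old))
```

IDs for two-digit states such as "12" or "1201" pass through unchanged, which is why only the California entries differ in the hunk above.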
Lines changed: 32 additions & 32 deletions
@@ -1,43 +1,43 @@
 {
-  "name": "ipf_demo_small",
-  "tier": "tier_1",
   "description": "Example IPF benchmark. Uses the same target_filters path as GREG and L0; the IPF converter internally keeps only count-style targets whose constraints resolve through the declared bucket schemas and form a closed categorical margin system. Binary subset families survive only when authored parent totals exist on the exact reduced key; open subset families are dropped.",
-  "package_path": "policyengine_us_data/storage/calibration/calibration_package.pkl",
+  "external_inputs": {},
+  "method_options": {
+    "ipf": {
+      "bound": 10.0,
+      "epsH": 0.01,
+      "epsP": 0.0001,
+      "household_id_col": "household_id",
+      "max_iter": 5000,
+      "weight_col": "base_weight"
+    },
+    "l0": {
+      "beta": 0.65,
+      "device": "cpu",
+      "epochs": 1000,
+      "lambda_l0": 1e-08,
+      "lambda_l2": 1e-12,
+      "learning_rate": 0.15,
+      "seed": 42
+    }
+  },
   "methods": [
     "l0",
     "ipf"
   ],
+  "name": "ipf_demo_small",
+  "package_path": "policyengine_us_data/storage/calibration/calibration_package.pkl",
   "target_filters": {
+    "district_ids": [
+      "601",
+      "602",
+      "603",
+      "604",
+      "605"
+    ],
     "include_geo_levels": [
       "district"
     ],
-    "include_national": false,
-    "district_ids": [
-      "0601",
-      "0602",
-      "0603",
-      "0604",
-      "0605"
-    ]
+    "include_national": false
   },
-  "external_inputs": {},
-  "method_options": {
-    "l0": {
-      "lambda_l0": 1e-08,
-      "epochs": 1000,
-      "device": "cpu",
-      "beta": 0.65,
-      "lambda_l2": 1e-12,
-      "learning_rate": 0.15,
-      "seed": 42
-    },
-    "ipf": {
-      "max_iter": 5000,
-      "bound": 10.0,
-      "epsP": 1e-04,
-      "epsH": 1e-02,
-      "household_id_col": "household_id",
-      "weight_col": "base_weight"
-    }
-  }
-}
+  "tier": "tier_1"
+}
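Most of this hunk is key reordering and number re-spelling (`1e-04` becomes `0.0001`) rather than a change in meaning. The sketch below checks that claim for the `ipf` options by parsing both spellings and comparing the results. It assumes, without confirmation from the commit, that the new layout came from a sorted-key JSON dump; `canonical` is a hypothetical helper.

```python
import json

# The "ipf" option block as written before and after this commit.
old_ipf = json.loads(
    '{"max_iter": 5000, "bound": 10.0, "epsP": 1e-04, "epsH": 1e-02,'
    ' "household_id_col": "household_id", "weight_col": "base_weight"}'
)
new_ipf = json.loads(
    '{"bound": 10.0, "epsH": 0.01, "epsP": 0.0001,'
    ' "household_id_col": "household_id", "max_iter": 5000,'
    ' "weight_col": "base_weight"}'
)


def canonical(obj) -> str:
    """Serialize with sorted keys so key order cannot cause a spurious diff."""
    return json.dumps(obj, indent=2, sort_keys=True)


if __name__ == "__main__":
    print(old_ipf == new_ipf)
    print(canonical(old_ipf))
```

Comparing parsed objects, or their canonical dumps, separates cosmetic churn from the real change in this file, which is the switch to unpadded district IDs.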
Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,59 @@
+{
+  "description": "Tier 1 IPF tractable comparison. Same five-state, ten-district geography slice as tier1_mixed.json. The IPF converter retains the authored subset that forms one coherent closed categorical system; the rest is dropped with explicit diagnostics. L0 and GREG can be matched against IPF's retained subset via --train-on ipf_retained_authored --score-on ipf_retained_authored.",
+  "external_inputs": {},
+  "method_options": {
+    "ipf": {
+      "bound": 10.0,
+      "epsH": 0.01,
+      "epsP": 0.0001,
+      "household_id_col": "household_id",
+      "max_iter": 5000,
+      "return_na": true,
+      "weight_col": "base_weight"
+    },
+    "l0": {
+      "beta": 0.65,
+      "device": "cpu",
+      "epochs": 5000,
+      "lambda_l0": 1e-08,
+      "lambda_l2": 1e-12,
+      "learning_rate": 0.15,
+      "seed": 42
+    }
+  },
+  "methods": [
+    "l0",
+    "ipf"
+  ],
+  "name": "tier1_ipf",
+  "package_path": "policyengine_us_data/storage/calibration/calibration_package.pkl",
+  "target_filters": {
+    "count_like_only": false,
+    "district_ids": [
+      "601",
+      "602",
+      "1201",
+      "1202",
+      "3601",
+      "3602",
+      "4801",
+      "4802",
+      "5301",
+      "5302"
+    ],
+    "include_geo_levels": [
+      "national",
+      "state",
+      "district"
+    ],
+    "include_national": true,
+    "state_ids": [
+      "6",
+      "12",
+      "36",
+      "48",
+      "53"
+    ]
+  },
+  "tier": "tier_1"
+}
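This manifest sets `return_na: true`, which the commit message says makes the R runner stop with a "did not converge" error when any fitted weight is NaN. The actual check lives in `ipf_runner.R` and is not shown in this diff. The following is a Python analogue of the same guard, with hypothetical names (`NonConvergenceError`, `reject_nan_weights`), for readers who want to apply the idea on the Python side.

```python
import math


class NonConvergenceError(RuntimeError):
    """Raised when fitted weights contain NaN (hypothetical error type)."""


def reject_nan_weights(weights, method: str = "ipf") -> list:
    """Return the weights as a list, or raise if any of them is NaN."""
    values = list(weights)
    n_bad = sum(1 for w in values if math.isnan(w))
    if n_bad:
        raise NonConvergenceError(
            f"{method} did not converge: {n_bad} of {len(values)} weights are NaN"
        )
    return values


if __name__ == "__main__":
    print(reject_nan_weights([1.0, 2.5, 0.75]))
    try:
        reject_nan_weights([1.0, float("nan")])
    except NonConvergenceError as exc:
        print("failed:", exc)
```

Raising before anything is written keeps NaN weights off disk, which is the outcome the commit message describes for the runner.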
Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
+{
+  "description": "Tier 1 mixed-target tractable comparison. L0 and GREG fit the full filtered slice (count-like and dollar targets combined) for a coherent five-state, ten-district geography subset plus national targets. Same filtered slice is the input to the matched run reported in tier1_ipf.json.",
+  "external_inputs": {},
+  "method_options": {
+    "greg": {
+      "epsilon": 1e-07,
+      "maxit": 200
+    },
+    "l0": {
+      "beta": 0.65,
+      "device": "cpu",
+      "epochs": 5000,
+      "lambda_l0": 1e-08,
+      "lambda_l2": 1e-12,
+      "learning_rate": 0.15,
+      "seed": 42
+    }
+  },
+  "methods": [
+    "l0",
+    "greg"
+  ],
+  "name": "tier1_mixed",
+  "package_path": "policyengine_us_data/storage/calibration/calibration_package.pkl",
+  "target_filters": {
+    "count_like_only": false,
+    "district_ids": [
+      "601",
+      "602",
+      "1201",
+      "1202",
+      "3601",
+      "3602",
+      "4801",
+      "4802",
+      "5301",
+      "5302"
+    ],
+    "include_geo_levels": [
+      "national",
+      "state",
+      "district"
+    ],
+    "include_national": true,
+    "state_ids": [
+      "6",
+      "12",
+      "36",
+      "48",
+      "53"
+    ]
+  },
+  "tier": "tier_1"
+}
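Unlike the Tier 2 rungs, the Tier 1 manifests above carry no `max_targets` cap. The commit message states that Tier 2 rungs are ordered by `target_filters.max_targets`, smallest first, with uncapped rungs last. A minimal sketch of such a sort key follows; `rung_sort_key` is a hypothetical name, the orchestrator's real implementation is not part of the visible diff, and the manifest dictionaries are trimmed stand-ins.

```python
import math


def rung_sort_key(manifest: dict) -> float:
    """Sort by max_targets; manifests without a cap sort after all capped ones."""
    cap = manifest.get("target_filters", {}).get("max_targets")
    return math.inf if cap is None else float(cap)


# Trimmed stand-ins for Tier 2 manifests, listed in filename sort order.
rungs = [
    {"name": "tier2_scaling_10000", "target_filters": {"max_targets": 10000}},
    {"name": "tier2_scaling_250", "target_filters": {"max_targets": 250}},
    {"name": "tier2_scaling_2500", "target_filters": {"max_targets": 2500}},
    {"name": "tier2_scaling_largest", "target_filters": {}},
]

ordered = [m["name"] for m in sorted(rungs, key=rung_sort_key)]

if __name__ == "__main__":
    print(ordered)
```

A plain filename sort would put `tier2_scaling_10000` ahead of `tier2_scaling_250`, which is the ordering problem the numeric key avoids.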
