Commit de913d9
Wire up the paper-reported benchmark suite: tier manifests, orchestrator, fixes
This commit lands the production scaffolding called for by paper-l0/BENCHMARK_PLAN.md
so the L0 / GREG / IPF benchmark can be launched end-to-end against the saved
calibration package without further code changes.
Tier manifests (paper-l0/benchmarking/manifests/tier*.json)
All ten paper-reported manifests, each reading from one shared
calibration_package.pkl and differing only in target_filters:
  tier1_mixed.json            L0 + GREG on the full filtered slice (count +
                              dollar combined) over a five-state, ten-district
                              geography subset plus national targets.
  tier1_ipf.json              L0 + IPF on the same slice; IPF retains the
                              authored closed subset and the matched runs use
                              --train-on ipf_retained_authored.
  tier2_scaling_250 ...
  tier2_scaling_largest.json  Scaling ladder (250 / 500 / 1k / 2.5k / 5k / 10k
                              / largest coherent pre-production subset) over
                              the same package, growing geography coverage to
                              grow the target set instead of the unit universe.
  tier3_production.json       Least-filtered view; failures are reportable
                              results (status=failed with notes) rather than
                              missing rows.
All tier manifests set method_options.ipf.return_na = true, so IPF
non-convergence surfaces as NaN weights and the runner turns that into a
visible runtime error instead of writing NaN-laden weights to disk.
Geographic IDs follow the convention the saved package actually uses: state
FIPS without leading zeros for single-digit states (e.g. CA = "6", CA-01 =
"601"). The two pre-existing example manifests are updated to match.
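The ID convention above can be sketched as two small helpers. This is an illustration only: the function names are hypothetical, and the two-digit zero-padded district suffix is inferred from the single CA-01 = "601" example rather than confirmed against the package.

```python
from typing import Union


def state_geo_id(state_fips: Union[str, int]) -> str:
    """State FIPS with any leading zero dropped, e.g. "06" -> "6"."""
    return str(int(state_fips))


def district_geo_id(state_fips: Union[str, int], district: int) -> str:
    """State ID followed by a two-digit district number (assumed padding)."""
    return f"{state_geo_id(state_fips)}{int(district):02d}"


print(state_geo_id("06"))         # California
print(district_geo_id("06", 1))   # CA-01
```

Going through int() keeps the helpers indifferent to whether the caller passes "06", "6", or 6.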
Orchestrator (paper-l0/benchmarking/run_benchmark_suite.py)
Exports each manifest, runs every method declared in it, and (when IPF is in
the manifest) schedules matched-input rows for the other methods via
--train-on ipf_retained_authored --score-on ipf_retained_authored. Aggregates
per-manifest summaries into one tier_<n>_summary.csv per tier plus a unified
suite_summary.csv. Failures (export-time IPFConversionError, runner non-zero
exits, missing summary files) are recorded as status=failed rows with the
captured reason in notes; the orchestrator never aborts the suite, which is
load-bearing for Tier 3's "completed-or-failed" reporting story.
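The never-abort behavior can be pictured as a wrapper that always returns a summary row. The sketch below is not the orchestrator's code: run_step, the row keys other than status and notes, and the "completed" status value are assumptions made for illustration.

```python
import subprocess
import sys
from pathlib import Path


def run_step(label: str, cmd: list, summary_path: Path) -> dict:
    """Run one benchmark step and always return a summary row."""
    row = {"label": label, "status": "completed", "notes": ""}
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True)
    except OSError as exc:
        # The command could not be launched at all (e.g. missing executable).
        row.update(status="failed", notes=f"could not launch: {exc}")
        return row
    if proc.returncode != 0:
        # Non-zero exit: keep the captured stderr as the failure reason.
        reason = proc.stderr.strip() or f"exit code {proc.returncode}"
        row.update(status="failed", notes=reason)
    elif not summary_path.exists():
        # The step exited cleanly but did not produce its summary file.
        row.update(status="failed", notes=f"missing summary file: {summary_path}")
    return row


if __name__ == "__main__":
    failing = [sys.executable, "-c", "import sys; sys.exit('did not converge')"]
    print(run_step("ipf", failing, Path("summary.csv")))
```

Because every failure path ends in a returned row rather than a raised exception, the caller can keep looping over manifests and still write a complete summary CSV.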
Tier 2 rungs are discovered and ordered by target_filters.max_targets
(smallest first; uncapped rungs sort last) so the summary reads top-to-bottom
in increasing target count regardless of filename sort order.
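A minimal sketch of that ordering rule, assuming each manifest is loaded as a dict and that an uncapped rung simply omits max_targets (or sets it to null). The rung names are placeholders, not the real filenames.

```python
import math


def rung_sort_key(manifest: dict) -> float:
    """Sort by target_filters.max_targets; uncapped rungs sort last."""
    cap = manifest.get("target_filters", {}).get("max_targets")
    return math.inf if cap is None else float(cap)


# Placeholder rungs, deliberately listed out of order.
rungs = [
    {"name": "uncapped", "target_filters": {}},
    {"name": "capped_1000", "target_filters": {"max_targets": 1000}},
    {"name": "capped_250", "target_filters": {"max_targets": 250}},
]
ordered = sorted(rungs, key=rung_sort_key)
print([r["name"] for r in ordered])
```

Sorting on the numeric cap avoids the filename trap where "10k" would sort before "250" lexicographically.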
Both run_benchmark_suite.py and benchmark_cli.py prepend the in-tree repo
root to sys.path before importing policyengine_us_data. This avoids being
shadowed by an editable install of the same package name pointing at a
sibling repo, which previously caused fit_l0_weights() to lose the seed
parameter at script-invocation time only (not when imported from another
module).
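The path fix amounts to putting the checkout ahead of site-packages before the first import of the package. A hedged sketch follows; prepend_repo_root is a hypothetical name, and how the scripts locate the repo root is not stated in this message.

```python
import sys
from pathlib import Path


def prepend_repo_root(repo_root: Path) -> None:
    """Put repo_root first on sys.path so in-tree packages are found first."""
    root = str(repo_root.resolve())
    # Drop any existing entry to avoid duplicates, then insert at the front.
    if root in sys.path:
        sys.path.remove(root)
    sys.path.insert(0, root)


# Hypothetical usage at the top of a script, before importing the package:
#     prepend_repo_root(Path(__file__).resolve().parents[2])  # depth is assumed
#     import policyengine_us_data
```

This only helps if it runs before the package is first imported; a module already cached in sys.modules is not re-resolved.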
IPF runner: returnNA=FALSE design task closed
paper-l0/benchmarking/runners/ipf_runner.R now accepts an optional 11th
positional argument return_na (default TRUE for backwards compatibility).
When return_na is TRUE and surveysd::ipf returns any NaN weights -- the
silent-non-convergence path called out in the BENCHMARK_PLAN's "Immediate
next design tasks" -- the runner exits with a clear error rather than
writing NaN weights. The Python CLI plumbs method_options.ipf.return_na
through to the runner so manifests control the behavior declaratively.
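On the Python side, the plumbing might look roughly like the helper below, which reads the manifest option and renders it as an R-style logical for the runner's command line. The helper name is hypothetical, and how the CLI actually encodes the argument or what the other ten positional arguments are is not specified here.

```python
def ipf_return_na_flag(manifest: dict, default: bool = True) -> str:
    """Render method_options.ipf.return_na as "TRUE" or "FALSE"."""
    ipf_options = manifest.get("method_options", {}).get("ipf", {})
    value = ipf_options.get("return_na", default)
    if not isinstance(value, bool):
        # Reject strings like "true" so a typo cannot silently flip behavior.
        raise TypeError(f"return_na must be a JSON boolean, got {value!r}")
    return "TRUE" if value else "FALSE"


manifest = {"method_options": {"ipf": {"return_na": True}}}
print(ipf_return_na_flag(manifest))
```

The default of True mirrors the runner's stated default, so a manifest that omits the option behaves the same as before.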
Modal-artifact ingest (paper-l0/benchmarking/patch_package_paths.py)
The Modal-built calibration_package.pkl records absolute paths from the
build container (/pipeline/artifacts/<run_id>/...) for metadata.dataset_path
and metadata.db_path. The new helper rewrites those paths in-place to point
at local copies of the dataset H5 and policy_data.db so the rest of the
benchmark scaffold (and especially benchmark_export.build_ipf_inputs, which
has hard existence checks on both paths) runs unchanged on a local checkout.
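The rewrite step can be sketched as below. This assumes the pickled package is a dict with a "metadata" dict holding dataset_path and db_path; the real object layout and the helper's actual signature may differ.

```python
import pickle
from pathlib import Path


def patch_package_paths(package_path: Path, dataset_path: Path, db_path: Path) -> dict:
    """Point the package's recorded paths at local copies, in place."""
    # Fail early if either local copy is missing, before touching the package.
    for local in (dataset_path, db_path):
        if not local.exists():
            raise FileNotFoundError(local)

    with package_path.open("rb") as fh:
        package = pickle.load(fh)

    metadata = package["metadata"]  # assumed layout
    metadata["dataset_path"] = str(dataset_path.resolve())
    metadata["db_path"] = str(db_path.resolve())

    with package_path.open("wb") as fh:
        pickle.dump(package, fh)
    return metadata
```

Checking that the local files exist first means a bad argument leaves the package untouched rather than pointing it at paths that will fail later in the export.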
Documentation (paper-l0/benchmarking/README.md)
New "Paper-reported tiers" section: which manifests belong to which tier,
what each method does at each rung, the run_benchmark_suite.py entry point
with example invocations, and the per-tier summary CSV layout.
Verification done before committing
Three end-to-end smoke runs were exercised against the production
calibration package (17,736 targets x 5,159,570 cloned units, n_clones=430)
to verify the full pipeline:
1. count-only smoke (29 targets, 100 epochs L0): all five rows green --
L0 + GREG full-info, IPF on retained-authored subset, plus matched L0
and GREG runs trained and scored on the IPF subset.
2. mixed-target smoke (count + dollar combined): L0 and GREG both
completed end-to-end; this exercises the Tier 1 / Tier 2 mixed-target
path that count_like_only=true cannot reach.
3. forced non-convergence (max_iter=5, epsP=1e-20): surveysd::ipf returned
NaN weights, the runner stopped with the new "did not converge"
message, and the orchestrator recorded status=failed with the full R
traceback in notes.
The smoke manifests, scratch run dirs, and download scratch were removed
from the repo before committing. The plans (paper-l0/BENCHMARK_PLAN.md and
outline.md) are intentionally not included in this commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 420deb4, commit de913d9
17 files changed
Lines changed: 1234 additions & 70 deletions
File tree
- paper-l0/benchmarking
- manifests
- runners
Lines changed: 36 additions & 36 deletions
Lines changed: 32 additions & 32 deletions