Skip to content
This repository was archived by the owner on Jul 2, 2026. It is now read-only.

Commit 4991b1e

Browse files
authored
Merge pull request #611 from PolicyEngine/fix-would-file-blend-and-entity-weights
Harden Modal pipeline: pre-baked images, auto-trigger on merge, at-large CD fix
2 parents 5c3801b + ae1846b commit 4991b1e

86 files changed

Lines changed: 5001 additions & 1375 deletions

File tree

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

.github/workflows/pipeline.yaml

Lines changed: 58 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,58 @@
1+
name: Run Pipeline
2+
3+
on:
4+
push:
5+
branches: [main]
6+
workflow_dispatch:
7+
inputs:
8+
gpu:
9+
description: "GPU type for regional calibration"
10+
default: "T4"
11+
type: string
12+
epochs:
13+
description: "Epochs for regional calibration"
14+
default: "1000"
15+
type: string
16+
national_epochs:
17+
description: "Epochs for national calibration"
18+
default: "4000"
19+
type: string
20+
num_workers:
21+
description: "Number of parallel H5 workers"
22+
default: "8"
23+
type: string
24+
skip_national:
25+
description: "Skip national calibration/H5"
26+
default: false
27+
type: boolean
28+
29+
jobs:
30+
pipeline:
31+
runs-on: ubuntu-latest
32+
timeout-minutes: 10
33+
steps:
34+
- uses: actions/checkout@v4
35+
36+
- uses: actions/setup-python@v5
37+
with:
38+
python-version: "3.13"
39+
40+
- name: Install Modal
41+
run: pip install modal
42+
43+
- name: Launch pipeline on Modal
44+
env:
45+
MODAL_TOKEN_ID: ${{ secrets.MODAL_TOKEN_ID }}
46+
MODAL_TOKEN_SECRET: ${{ secrets.MODAL_TOKEN_SECRET }}
47+
run: |
48+
ARGS="--action run --branch main"
49+
if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then
50+
ARGS="$ARGS --gpu ${{ inputs.gpu }}"
51+
ARGS="$ARGS --epochs ${{ inputs.epochs }}"
52+
ARGS="$ARGS --national-epochs ${{ inputs.national_epochs }}"
53+
ARGS="$ARGS --num-workers ${{ inputs.num_workers }}"
54+
if [ "${{ inputs.skip_national }}" = "true" ]; then
55+
ARGS="$ARGS --skip-national"
56+
fi
57+
fi
58+
modal run --detach modal_app/pipeline.py::main $ARGS

Makefile

Lines changed: 50 additions & 44 deletions
Original file line numberDiff line numberDiff line change
@@ -1,11 +1,12 @@
1-
.PHONY: all format test install download upload docker documentation data validate-data calibrate calibrate-build publish-local-area upload-calibration upload-dataset upload-database push-to-modal build-matrices calibrate-modal calibrate-modal-national calibrate-both stage-h5s stage-national-h5 stage-all-h5s pipeline validate-staging validate-staging-full upload-validation check-staging check-sanity clean build paper clean-paper presentations database database-refresh promote-database promote-dataset promote build-h5s validate-local
1+
.PHONY: all format test install download upload docker documentation data validate-data calibrate calibrate-build publish-local-area upload-calibration upload-dataset upload-database push-to-modal build-data-modal build-matrices calibrate-modal calibrate-modal-national calibrate-both stage-h5s stage-national-h5 stage-all-h5s pipeline validate-staging validate-staging-full upload-validation check-staging check-sanity clean build paper clean-paper presentations database database-refresh promote-database promote-dataset promote build-h5s validate-local
22

3-
GPU ?= A100-80GB
4-
EPOCHS ?= 200
3+
GPU ?= T4
4+
EPOCHS ?= 1000
55
NATIONAL_GPU ?= T4
6-
NATIONAL_EPOCHS ?= 200
6+
NATIONAL_EPOCHS ?= 4000
77
BRANCH ?= $(shell git rev-parse --abbrev-ref HEAD)
88
NUM_WORKERS ?= 8
9+
N_CLONES ?= 430
910
VERSION ?=
1011

1112
HF_CLONE_DIR ?= $(HOME)/huggingface/policyengine-us-data
@@ -87,9 +88,11 @@ promote-database:
8788
@echo "Copied DB and raw_inputs to HF clone. Now cd to HF repo, commit, and push."
8889

8990
promote-dataset:
90-
cp policyengine_us_data/storage/source_imputed_stratified_extended_cps_2024.h5 \
91-
$(HF_CLONE_DIR)/calibration/source_imputed_stratified_extended_cps.h5
92-
@echo "Copied dataset to HF clone. Now cd to HF repo, commit, and push."
91+
python -c "from policyengine_us_data.utils.huggingface import upload; \
92+
upload('policyengine_us_data/storage/source_imputed_stratified_extended_cps_2024.h5', \
93+
'policyengine/policyengine-us-data', \
94+
'calibration/source_imputed_stratified_extended_cps.h5')"
95+
@echo "Dataset promoted to HF."
9396

9497
data: download
9598
python policyengine_us_data/utils/uprating.py
@@ -155,68 +158,66 @@ upload-database:
155158
@echo "Database uploaded to HF."
156159

157160
push-to-modal:
158-
modal volume put local-area-staging \
161+
modal volume put pipeline-artifacts \
159162
policyengine_us_data/storage/calibration/calibration_weights.npy \
160-
calibration_inputs/calibration/calibration_weights.npy --force
161-
modal volume put local-area-staging \
162-
policyengine_us_data/storage/calibration/stacked_blocks.npy \
163-
calibration_inputs/calibration/stacked_blocks.npy --force
164-
modal volume put local-area-staging \
165-
policyengine_us_data/storage/calibration/stacked_takeup.npz \
166-
calibration_inputs/calibration/stacked_takeup.npz --force
167-
modal volume put local-area-staging \
163+
artifacts/calibration_weights.npy --force
164+
modal volume put pipeline-artifacts \
168165
policyengine_us_data/storage/calibration/policy_data.db \
169-
calibration_inputs/calibration/policy_data.db --force
170-
modal volume put local-area-staging \
171-
policyengine_us_data/storage/calibration/geo_labels.json \
172-
calibration_inputs/calibration/geo_labels.json --force
173-
modal volume put local-area-staging \
166+
artifacts/policy_data.db --force
167+
modal volume put pipeline-artifacts \
174168
policyengine_us_data/storage/source_imputed_stratified_extended_cps_2024.h5 \
175-
calibration_inputs/calibration/source_imputed_stratified_extended_cps.h5 --force
176-
@echo "All calibration inputs pushed to Modal volume."
169+
artifacts/source_imputed_stratified_extended_cps.h5 --force
170+
@echo "All pipeline artifacts pushed to Modal volume."
177171

178172
build-matrices:
179-
modal run modal_app/remote_calibration_runner.py::build_package \
180-
--branch $(BRANCH)
173+
modal run --detach modal_app/remote_calibration_runner.py::build_package \
174+
--branch $(BRANCH) --county-level --n-clones $(N_CLONES)
181175

182176
calibrate-modal:
183-
modal run modal_app/remote_calibration_runner.py::main \
177+
modal run --detach modal_app/remote_calibration_runner.py::main \
184178
--branch $(BRANCH) --gpu $(GPU) --epochs $(EPOCHS) \
179+
--beta 0.65 --lambda-l0 1e-7 --lambda-l2 1e-8 --log-freq 500 \
180+
--target-config policyengine_us_data/calibration/target_config.yaml \
185181
--push-results
186182

187183
calibrate-modal-national:
188-
modal run modal_app/remote_calibration_runner.py::main \
184+
modal run --detach modal_app/remote_calibration_runner.py::main \
189185
--branch $(BRANCH) --gpu $(NATIONAL_GPU) \
190186
--epochs $(NATIONAL_EPOCHS) \
187+
--beta 0.65 --lambda-l0 1e-4 --lambda-l2 1e-12 --log-freq 500 \
188+
--target-config policyengine_us_data/calibration/target_config.yaml \
191189
--push-results --national
192190

193191
calibrate-both:
194192
$(MAKE) calibrate-modal & $(MAKE) calibrate-modal-national & wait
195193

196194
stage-h5s:
197-
modal run modal_app/local_area.py::main \
198-
--branch $(BRANCH) --num-workers $(NUM_WORKERS) \
199-
$(if $(SKIP_DOWNLOAD),--skip-download)
195+
modal run --detach modal_app/local_area.py::main \
196+
--branch $(BRANCH) --num-workers $(NUM_WORKERS) --n-clones $(N_CLONES)
200197

201198
stage-national-h5:
202-
modal run modal_app/local_area.py::main_national \
203-
--branch $(BRANCH)
199+
modal run --detach modal_app/local_area.py::main_national \
200+
--branch $(BRANCH) --n-clones $(N_CLONES)
204201

205202
stage-all-h5s:
206203
$(MAKE) stage-h5s & $(MAKE) stage-national-h5 & wait
207204

208205
promote:
206+
@echo "This will run the full Modal promote pipeline (local_area.py::main_promote)."
207+
@read -p "Are you sure? [y/N] " confirm && [ "$$confirm" = "y" ] || (echo "Aborted."; exit 1)
209208
$(eval VERSION := $(or $(VERSION),$(shell python -c "import tomllib; print(tomllib.load(open('pyproject.toml','rb'))['project']['version'])")))
210-
modal run modal_app/local_area.py::main_promote \
209+
modal run --detach modal_app/local_area.py::main_promote \
211210
--branch $(BRANCH) --version $(VERSION)
212211

213212
validate-staging:
214213
python -m policyengine_us_data.calibration.validate_staging \
215-
--area-type states --output validation_results.csv
214+
--area-type states --output validation_results.csv \
215+
$(if $(RUN_ID),--run-id $(RUN_ID))
216216

217217
validate-staging-full:
218218
python -m policyengine_us_data.calibration.validate_staging \
219-
--area-type states,districts --output validation_results.csv
219+
--area-type states,districts --output validation_results.csv \
220+
$(if $(RUN_ID),--run-id $(RUN_ID))
220221

221222
upload-validation:
222223
python -c "from policyengine_us_data.utils.huggingface import upload; \
@@ -225,18 +226,23 @@ upload-validation:
225226
'calibration/logs/validation_results.csv')"
226227

227228
check-staging:
228-
python -m policyengine_us_data.calibration.check_staging_sums
229+
python -m policyengine_us_data.calibration.check_staging_sums \
230+
$(if $(RUN_ID),--run-id $(RUN_ID))
229231

230232
check-sanity:
231233
python -m policyengine_us_data.calibration.validate_staging \
232-
--sanity-only --area-type states --areas NC
233-
234-
pipeline: data upload-dataset build-matrices calibrate-both stage-all-h5s
235-
@echo ""
236-
@echo "========================================"
237-
@echo "Pipeline complete. H5s are in HF staging."
238-
@echo "Run 'Promote Local Area H5 Files' workflow in GitHub to publish."
239-
@echo "========================================"
234+
--sanity-only --area-type states --areas NC \
235+
$(if $(RUN_ID),--run-id $(RUN_ID))
236+
237+
build-data-modal:
238+
modal run --detach modal_app/data_build.py::main --branch $(BRANCH) --upload --skip-tests
239+
240+
pipeline:
241+
modal run --detach modal_app.pipeline::main \
242+
--action run --branch $(BRANCH) --gpu $(GPU) \
243+
--epochs $(EPOCHS) --national-gpu $(NATIONAL_GPU) \
244+
--national-epochs $(NATIONAL_EPOCHS) \
245+
--num-workers $(NUM_WORKERS) --n-clones $(N_CLONES)
240246

241247
clean:
242248
rm -f policyengine_us_data/storage/*.h5

docs/local_area_calibration_setup.ipynb

Lines changed: 17 additions & 18 deletions
Original file line numberDiff line numberDiff line change
@@ -9,7 +9,7 @@
99
"\n",
1010
"This notebook demonstrates the clone-based calibration pipeline: how raw CPS records become a calibration matrix and, ultimately, CD-level stacked datasets.\n",
1111
"\n",
12-
"The paradigm shift from the old approach: instead of replicating every household into every congressional district, we **clone** each record N times and assign each clone a **random census block** drawn from a population-weighted distribution. Each clone inherits a state, CD, and block \u2014 and gets re-simulated under the rules of its assigned state.\n",
12+
"The paradigm shift from the old approach: instead of replicating every household into every congressional district, we **clone** each record N times and assign each clone a **random census block** drawn from a population-weighted distribution. Each clone inherits a state, CD, and block and gets re-simulated under the rules of its assigned state.\n",
1313
"\n",
1414
"We follow one household (`record_idx=8629`, household_id 128694, SNAP \\$18,396) through the entire pipeline:\n",
1515
"1. Clone and assign geography\n",
@@ -19,7 +19,7 @@
1919
"5. Build the calibration matrix\n",
2020
"6. Create stacked datasets from calibrated weights\n",
2121
"\n",
22-
"**Companion notebook:** [calibration_internals.ipynb](calibration_internals.ipynb) covers the *finished* matrix \u2014 row/column anatomy, target groups, sparsity. This notebook covers the *process* that creates it and what happens after (stacked datasets).\n",
22+
"**Companion notebook:** [calibration_internals.ipynb](calibration_internals.ipynb) covers the *finished* matrix row/column anatomy, target groups, sparsity. This notebook covers the *process* that creates it and what happens after (stacked datasets).\n",
2323
"\n",
2424
"**Requirements:** `policy_data.db`, `block_cd_distributions.csv.gz`, and the stratified CPS h5 file in `STORAGE_FOLDER`."
2525
]
@@ -56,7 +56,6 @@
5656
"from policyengine_us_data.storage import STORAGE_FOLDER\n",
5757
"from policyengine_us_data.calibration.clone_and_assign import (\n",
5858
" assign_random_geography,\n",
59-
" GeographyAssignment,\n",
6059
" load_global_block_distribution,\n",
6160
")\n",
6261
"from policyengine_us_data.calibration.unified_matrix_builder import (\n",
@@ -303,13 +302,13 @@
303302
"id": "cell-9",
304303
"metadata": {},
305304
"source": [
306-
"## Section 3: Inside `_simulate_clone` \u2014 State-Swap\n",
305+
"## Section 3: Inside `_simulate_clone` State-Swap\n",
307306
"\n",
308307
"For each clone, `_simulate_clone` does four things:\n",
309308
"1. Creates a **fresh** `Microsimulation` from the base dataset\n",
310309
"2. Overwrites `state_fips` with the clone's assigned states\n",
311310
"3. Optionally calls a `sim_modifier` (e.g., takeup re-randomization)\n",
312-
"4. **Clears cached formulas** via `get_calculated_variables` \u2014 preserving survey inputs and IDs while forcing recalculation of state-dependent variables like SNAP\n",
311+
"4. **Clears cached formulas** via `get_calculated_variables` preserving survey inputs and IDs while forcing recalculation of state-dependent variables like SNAP\n",
313312
"\n",
314313
"Let's reproduce this manually for clone 0."
315314
]
@@ -476,7 +475,7 @@
476475
"\n",
477476
"When assembling the calibration matrix, each target row only \"sees\" columns (clones) whose geography matches the target's geography. This is implemented via `state_to_cols` and `cd_to_cols` dictionaries built from the `GeographyAssignment`.\n",
478477
"\n",
479-
"This is step 3 of `build_matrix` \u2014 reproduced here for transparency."
478+
"This is step 3 of `build_matrix` reproduced here for transparency."
480479
]
481480
},
482481
{
@@ -585,7 +584,7 @@
585584
"source": [
586585
"## Section 5: Takeup Re-randomization\n",
587586
"\n",
588-
"The base CPS has fixed takeup decisions (e.g., \"this household takes up SNAP\"). But when we clone a household into different census blocks, each block should have independently drawn takeup \u2014 otherwise every clone of a SNAP-participating household would still participate, regardless of geography.\n",
587+
"The base CPS has fixed takeup decisions (e.g., \"this household takes up SNAP\"). But when we clone a household into different census blocks, each block should have independently drawn takeup otherwise every clone of a SNAP-participating household would still participate, regardless of geography.\n",
589588
"\n",
590589
"`rerandomize_takeup` solves this: for each census block, it uses `seeded_rng(variable_name, salt=block_geoid)` to draw new takeup booleans. The seed is deterministic per (variable, block) pair, so results are reproducible."
591590
]
@@ -763,7 +762,7 @@
763762
"id": "cell-22",
764763
"metadata": {},
765764
"source": [
766-
"In the full pipeline, `rerandomize_takeup` is passed to `build_matrix` as a `sim_modifier` callback. For each clone, after `state_fips` is set but before formula caches are cleared, the callback draws new takeup booleans per census block. This means the same household in block A might take up SNAP while in block B it doesn't \u2014 matching the statistical reality that takeup varies by geography."
765+
"In the full pipeline, `rerandomize_takeup` is passed to `build_matrix` as a `sim_modifier` callback. For each clone, after `state_fips` is set but before formula caches are cleared, the callback draws new takeup booleans per census block. This means the same household in block A might take up SNAP while in block B it doesn't matching the statistical reality that takeup varies by geography."
767766
]
768767
},
769768
{
@@ -871,9 +870,9 @@
871870
"source": [
872871
"## Section 7: From Weights to Datasets\n",
873872
"\n",
874-
"`create_sparse_cd_stacked_dataset` takes calibrated weights and builds an h5 file with only the non-zero-weight households, reindexed per CD. Internally it does its own state-swap simulation \u2014 loading the base dataset, assigning `state_fips` for the target CD's state, and recalculating benefits from scratch. This means SNAP values in the output reflect the destination state's rules (e.g., a $70 SNAP household from ME may get $0 under AK rules).\n",
873+
"`create_sparse_cd_stacked_dataset` takes calibrated weights and builds an h5 file with only the non-zero-weight households, reindexed per CD. Internally it does its own state-swap simulation loading the base dataset, assigning `state_fips` for the target CD's state, and recalculating benefits from scratch. This means SNAP values in the output reflect the destination state's rules (e.g., a $70 SNAP household from ME may get $0 under AK rules).\n",
875874
"\n",
876-
"**Format gap:** The calibration produces weights in clone layout `(n_records * n_clones,)` where each clone maps to one specific CD via the `GeographyAssignment`. The stacked dataset builder expects CD layout `(n_cds * n_households,)` where every CD has a weight slot for every household. Converting between these \u2014 accumulating clone weights into their assigned CDs \u2014 is a separate step not yet implemented. The demo below constructs artificial CD-layout weights directly to show how the builder works."
875+
"**Format gap:** The calibration produces weights in clone layout `(n_records * n_clones,)` where each clone maps to one specific CD via the `GeographyAssignment`. The stacked dataset builder expects CD layout `(n_cds * n_households,)` where every CD has a weight slot for every household. Converting between these accumulating clone weights into their assigned CDs is a separate step not yet implemented. The demo below constructs artificial CD-layout weights directly to show how the builder works."
877876
]
878877
},
879878
{
@@ -1012,9 +1011,9 @@
10121011
"\n",
10131012
"Overflow check:\n",
10141013
" Max person ID after reindexing: 5,025,365\n",
1015-
" Max person ID \u00d7 100: 502,536,500\n",
1014+
" Max person ID × 100: 502,536,500\n",
10161015
" int32 max: 2,147,483,647\n",
1017-
" \u2713 No overflow risk!\n",
1016+
" No overflow risk!\n",
10181017
"\n",
10191018
"Creating Dataset from combined DataFrame...\n",
10201019
"Building simulation from Dataset...\n",
@@ -1134,12 +1133,12 @@
11341133
"\n",
11351134
"The clone-based calibration pipeline has six stages:\n",
11361135
"\n",
1137-
"1. **Clone + assign geography** \u2014 `assign_random_geography()` creates N copies of each CPS record, each with a population-weighted random census block.\n",
1138-
"2. **Simulate** \u2014 `_simulate_clone()` sets each clone's `state_fips` and recalculates state-dependent benefits.\n",
1139-
"3. **Geographic masking** \u2014 `state_to_cols` / `cd_to_cols` restrict each target row to geographically relevant columns.\n",
1140-
"4. **Re-randomize takeup** \u2014 `rerandomize_takeup()` draws new takeup per census block, breaking the fixed-takeup assumption.\n",
1141-
"5. **Build matrix** \u2014 `UnifiedMatrixBuilder.build_matrix()` assembles the sparse CSR matrix from all clones.\n",
1142-
"6. **Stacked datasets** \u2014 `create_sparse_cd_stacked_dataset()` converts calibrated weights into CD-level h5 files.\n",
1136+
"1. **Clone + assign geography** `assign_random_geography()` creates N copies of each CPS record, each with a population-weighted random census block.\n",
1137+
"2. **Simulate** `_simulate_clone()` sets each clone's `state_fips` and recalculates state-dependent benefits.\n",
1138+
"3. **Geographic masking** `state_to_cols` / `cd_to_cols` restrict each target row to geographically relevant columns.\n",
1139+
"4. **Re-randomize takeup** `rerandomize_takeup()` draws new takeup per census block, breaking the fixed-takeup assumption.\n",
1140+
"5. **Build matrix** `UnifiedMatrixBuilder.build_matrix()` assembles the sparse CSR matrix from all clones.\n",
1141+
"6. **Stacked datasets** `create_sparse_cd_stacked_dataset()` converts calibrated weights into CD-level h5 files.\n",
11431142
"\n",
11441143
"For matrix diagnostics (row/column anatomy, target groups, sparsity analysis), see [calibration_internals.ipynb](calibration_internals.ipynb)."
11451144
]

0 commit comments

Comments
 (0)