|
9 | 9 | "\n", |
10 | 10 | "This notebook demonstrates the clone-based calibration pipeline: how raw CPS records become a calibration matrix and, ultimately, CD-level stacked datasets.\n", |
11 | 11 | "\n", |
12 | | - "The paradigm shift from the old approach: instead of replicating every household into every congressional district, we **clone** each record N times and assign each clone a **random census block** drawn from a population-weighted distribution. Each clone inherits a state, CD, and block \u2014 and gets re-simulated under the rules of its assigned state.\n", |
| 12 | + "The paradigm shift from the old approach: instead of replicating every household into every congressional district, we **clone** each record N times and assign each clone a **random census block** drawn from a population-weighted distribution. Each clone inherits a state, CD, and block — and gets re-simulated under the rules of its assigned state.\n", |
13 | 13 | "\n", |
14 | 14 | "We follow one household (`record_idx=8629`, household_id 128694, SNAP \\$18,396) through the entire pipeline:\n", |
15 | 15 | "1. Clone and assign geography\n", |
|
19 | 19 | "5. Build the calibration matrix\n", |
20 | 20 | "6. Create stacked datasets from calibrated weights\n", |
21 | 21 | "\n", |
22 | | - "**Companion notebook:** [calibration_internals.ipynb](calibration_internals.ipynb) covers the *finished* matrix \u2014 row/column anatomy, target groups, sparsity. This notebook covers the *process* that creates it and what happens after (stacked datasets).\n", |
| 22 | + "**Companion notebook:** [calibration_internals.ipynb](calibration_internals.ipynb) covers the *finished* matrix — row/column anatomy, target groups, sparsity. This notebook covers the *process* that creates it and what happens after (stacked datasets).\n", |
23 | 23 | "\n", |
24 | 24 | "**Requirements:** `policy_data.db`, `block_cd_distributions.csv.gz`, and the stratified CPS h5 file in `STORAGE_FOLDER`." |
25 | 25 | ] |
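The population-weighted block draw described in the intro cell can be sketched as follows. This is a minimal illustration, not the code behind `assign_random_geography`: the helper `demo_assign_blocks`, the block GEOIDs, and the populations are all made up for the example. The only fact relied on is that the first two digits of a 15-digit census block GEOID are the state FIPS code.

```python
import numpy as np

def demo_assign_blocks(n_records, n_clones, block_ids, block_pop, seed=0):
    """Hypothetical helper: draw one census block per clone, weighted by population."""
    rng = np.random.default_rng(seed)
    probs = block_pop / block_pop.sum()
    # One draw per (clone, record) pair, returned as a flat array.
    return rng.choice(block_ids, size=n_records * n_clones, p=probs)

# Made-up blocks and populations, for illustration only.
block_ids = np.array(["230010001001000", "020130001001000", "060370001001000"])
block_pop = np.array([120.0, 30.0, 850.0])

blocks = demo_assign_blocks(n_records=4, n_clones=3,
                            block_ids=block_ids, block_pop=block_pop)
# The state FIPS is the first two characters of the block GEOID.
states = np.array([b[:2] for b in blocks])
```

Deriving the congressional district needs a block-to-CD lookup, which this sketch leaves out.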
|
56 | 56 | "from policyengine_us_data.storage import STORAGE_FOLDER\n", |
57 | 57 | "from policyengine_us_data.calibration.clone_and_assign import (\n", |
58 | 58 | " assign_random_geography,\n", |
59 | | - " GeographyAssignment,\n", |
60 | 59 | " load_global_block_distribution,\n", |
61 | 60 | ")\n", |
62 | 61 | "from policyengine_us_data.calibration.unified_matrix_builder import (\n", |
|
303 | 302 | "id": "cell-9", |
304 | 303 | "metadata": {}, |
305 | 304 | "source": [ |
306 | | - "## Section 3: Inside `_simulate_clone` \u2014 State-Swap\n", |
| 305 | + "## Section 3: Inside `_simulate_clone` — State-Swap\n", |
307 | 306 | "\n", |
308 | 307 | "For each clone, `_simulate_clone` does four things:\n", |
309 | 308 | "1. Creates a **fresh** `Microsimulation` from the base dataset\n", |
310 | 309 | "2. Overwrites `state_fips` with the clone's assigned states\n", |
311 | 310 | "3. Optionally calls a `sim_modifier` (e.g., takeup re-randomization)\n", |
312 | | - "4. **Clears cached formulas** via `get_calculated_variables` \u2014 preserving survey inputs and IDs while forcing recalculation of state-dependent variables like SNAP\n", |
| 311 | + "4. **Clears cached formulas** via `get_calculated_variables` — preserving survey inputs and IDs while forcing recalculation of state-dependent variables like SNAP\n", |
313 | 312 | "\n", |
314 | 313 | "Let's reproduce this manually for clone 0." |
315 | 314 | ] |
|
476 | 475 | "\n", |
477 | 476 | "When assembling the calibration matrix, each target row only \"sees\" columns (clones) whose geography matches the target's geography. This is implemented via `state_to_cols` and `cd_to_cols` dictionaries built from the `GeographyAssignment`.\n", |
478 | 477 | "\n", |
479 | | - "This is step 3 of `build_matrix` \u2014 reproduced here for transparency." |
| 478 | + "This is step 3 of `build_matrix` — reproduced here for transparency." |
480 | 479 | ] |
481 | 480 | }, |
482 | 481 | { |
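As a rough sketch of the masking idea in this cell: given one geography code per matrix column, a dictionary from code to column indices lets a target row pick out only its own columns. The helper `build_geo_index` and the sample arrays are hypothetical; the actual `build_matrix` step may build `state_to_cols` and `cd_to_cols` differently.

```python
from collections import defaultdict

import numpy as np

def build_geo_index(codes):
    """Map each distinct geography code to the column indices assigned to it."""
    index = defaultdict(list)
    for col, code in enumerate(codes):
        index[int(code)].append(col)
    return {code: np.array(cols) for code, cols in index.items()}

# Hypothetical per-column assignments (one entry per clone column).
state_fips = np.array([23, 2, 23, 6, 2])
cd_geoid = np.array([2301, 200, 2302, 601, 200])

state_to_cols = build_geo_index(state_fips)
cd_to_cols = build_geo_index(cd_geoid)

# A state-level target for FIPS 23 would only touch these columns.
maine_cols = state_to_cols[23]
```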
|
585 | 584 | "source": [ |
586 | 585 | "## Section 5: Takeup Re-randomization\n", |
587 | 586 | "\n", |
588 | | - "The base CPS has fixed takeup decisions (e.g., \"this household takes up SNAP\"). But when we clone a household into different census blocks, each block should have independently drawn takeup \u2014 otherwise every clone of a SNAP-participating household would still participate, regardless of geography.\n", |
| 587 | + "The base CPS has fixed takeup decisions (e.g., \"this household takes up SNAP\"). But when we clone a household into different census blocks, each block should have independently drawn takeup — otherwise every clone of a SNAP-participating household would still participate, regardless of geography.\n", |
589 | 588 | "\n", |
590 | 589 | "`rerandomize_takeup` solves this: for each census block, it uses `seeded_rng(variable_name, salt=block_geoid)` to draw new takeup booleans. The seed is deterministic per (variable, block) pair, so results are reproducible." |
591 | 590 | ] |
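To illustrate why a seed keyed on the (variable, block) pair gives reproducible draws, here is a small stand-in. `demo_seeded_rng` is a hypothetical substitute for `seeded_rng`, whose real implementation is not shown in this diff; the block GEOIDs and the 0.8 takeup rate are also made up.

```python
import hashlib

import numpy as np

def demo_seeded_rng(variable_name, salt):
    """Hypothetical stand-in: derive a reproducible generator from (variable, salt)."""
    digest = hashlib.sha256(f"{variable_name}:{salt}".encode()).digest()
    seed = int.from_bytes(digest[:8], "little")
    return np.random.default_rng(seed)

def draw_takeup(variable_name, block_geoid, n_units, rate):
    """Draw one takeup boolean per unit in a block at the given rate."""
    rng = demo_seeded_rng(variable_name, salt=block_geoid)
    return rng.random(n_units) < rate

# Same (variable, block) pair twice, then a different block.
a = draw_takeup("takes_up_snap_if_eligible", "230010001001000", 5, 0.8)
b = draw_takeup("takes_up_snap_if_eligible", "230010001001000", 5, 0.8)
c = draw_takeup("takes_up_snap_if_eligible", "020130001001000", 5, 0.8)
```

Here `a` and `b` are identical because they share a seed, while `c` comes from an independent stream for a different block.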
|
763 | 762 | "id": "cell-22", |
764 | 763 | "metadata": {}, |
765 | 764 | "source": [ |
766 | | - "In the full pipeline, `rerandomize_takeup` is passed to `build_matrix` as a `sim_modifier` callback. For each clone, after `state_fips` is set but before formula caches are cleared, the callback draws new takeup booleans per census block. This means the same household in block A might take up SNAP while in block B it doesn't \u2014 matching the statistical reality that takeup varies by geography." |
| 765 | + "In the full pipeline, `rerandomize_takeup` is passed to `build_matrix` as a `sim_modifier` callback. For each clone, after `state_fips` is set but before formula caches are cleared, the callback draws new takeup booleans per census block. This means the same household in block A might take up SNAP while in block B it doesn't — matching the statistical reality that takeup varies by geography." |
767 | 766 | ] |
768 | 767 | }, |
769 | 768 | { |
|
871 | 870 | "source": [ |
872 | 871 | "## Section 7: From Weights to Datasets\n", |
873 | 872 | "\n", |
874 | | - "`create_sparse_cd_stacked_dataset` takes calibrated weights and builds an h5 file with only the non-zero-weight households, reindexed per CD. Internally it does its own state-swap simulation \u2014 loading the base dataset, assigning `state_fips` for the target CD's state, and recalculating benefits from scratch. This means SNAP values in the output reflect the destination state's rules (e.g., a $70 SNAP household from ME may get $0 under AK rules).\n", |
| 873 | + "`create_sparse_cd_stacked_dataset` takes calibrated weights and builds an h5 file with only the non-zero-weight households, reindexed per CD. Internally it does its own state-swap simulation — loading the base dataset, assigning `state_fips` for the target CD's state, and recalculating benefits from scratch. This means SNAP values in the output reflect the destination state's rules (e.g., a $70 SNAP household from ME may get $0 under AK rules).\n", |
875 | 874 | "\n", |
876 | | - "**Format gap:** The calibration produces weights in clone layout `(n_records * n_clones,)` where each clone maps to one specific CD via the `GeographyAssignment`. The stacked dataset builder expects CD layout `(n_cds * n_households,)` where every CD has a weight slot for every household. Converting between these \u2014 accumulating clone weights into their assigned CDs \u2014 is a separate step not yet implemented. The demo below constructs artificial CD-layout weights directly to show how the builder works." |
| 875 | + "**Format gap:** The calibration produces weights in clone layout `(n_records * n_clones,)` where each clone maps to one specific CD via the `GeographyAssignment`. The stacked dataset builder expects CD layout `(n_cds * n_households,)` where every CD has a weight slot for every household. Converting between these — accumulating clone weights into their assigned CDs — is a separate step not yet implemented. The demo below constructs artificial CD-layout weights directly to show how the builder works." |
877 | 876 | ] |
878 | 877 | }, |
879 | 878 | { |
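One possible shape for the missing conversion is sketched below. It is not part of the pipeline: the function `clone_to_cd_layout`, the clone-major ordering of the clone layout, and the CD-major ordering of the CD layout are all assumptions made for the example.

```python
import numpy as np

def clone_to_cd_layout(clone_weights, clone_cd_idx, n_records, n_cds):
    """Accumulate clone-layout weights into a (n_cds * n_records,) vector.

    Assumes column k * n_records + r holds clone k of record r, and that the
    CD layout stores CD c, record r at position c * n_records + r.
    """
    n_clones = clone_weights.size // n_records
    record_idx = np.tile(np.arange(n_records), n_clones)
    cd_weights = np.zeros(n_cds * n_records)
    # np.add.at sums correctly when several clones of a record share a CD.
    np.add.at(cd_weights, clone_cd_idx * n_records + record_idx, clone_weights)
    return cd_weights

# Two records, three clones, two CDs (toy numbers).
weights = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
cd_idx = np.array([0, 1, 0, 1, 1, 1])
cd_layout = clone_to_cd_layout(weights, cd_idx, n_records=2, n_cds=2)
```

In this toy case record 0 has two clones in CD 0 (weights 1 and 3), so its CD 0 slot holds 4, and total weight is preserved.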
|
1012 | 1011 | "\n", |
1013 | 1012 | "Overflow check:\n", |
1014 | 1013 | " Max person ID after reindexing: 5,025,365\n", |
1015 | | - " Max person ID \u00d7 100: 502,536,500\n", |
| 1014 | + " Max person ID × 100: 502,536,500\n", |
1016 | 1015 | " int32 max: 2,147,483,647\n", |
1017 | | - " \u2713 No overflow risk!\n", |
| 1016 | + " ✓ No overflow risk!\n", |
1018 | 1017 | "\n", |
1019 | 1018 | "Creating Dataset from combined DataFrame...\n", |
1020 | 1019 | "Building simulation from Dataset...\n", |
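The overflow check printed in this output can be reproduced with a few lines. The helper name `fits_int32` is hypothetical, and the multiplier of 100 is taken from the "× 100" line in the log.

```python
import numpy as np

def fits_int32(max_id, multiplier=100):
    """Return True if max_id * multiplier still fits in a signed 32-bit integer."""
    return max_id * multiplier <= np.iinfo(np.int32).max

max_person_id = 5_025_365  # value reported in the output above
scaled = max_person_id * 100
ok = fits_int32(max_person_id)
```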
|
1134 | 1133 | "\n", |
1135 | 1134 | "The clone-based calibration pipeline has six stages:\n", |
1136 | 1135 | "\n", |
1137 | | - "1. **Clone + assign geography** \u2014 `assign_random_geography()` creates N copies of each CPS record, each with a population-weighted random census block.\n", |
1138 | | - "2. **Simulate** \u2014 `_simulate_clone()` sets each clone's `state_fips` and recalculates state-dependent benefits.\n", |
1139 | | - "3. **Geographic masking** \u2014 `state_to_cols` / `cd_to_cols` restrict each target row to geographically relevant columns.\n", |
1140 | | - "4. **Re-randomize takeup** \u2014 `rerandomize_takeup()` draws new takeup per census block, breaking the fixed-takeup assumption.\n", |
1141 | | - "5. **Build matrix** \u2014 `UnifiedMatrixBuilder.build_matrix()` assembles the sparse CSR matrix from all clones.\n", |
1142 | | - "6. **Stacked datasets** \u2014 `create_sparse_cd_stacked_dataset()` converts calibrated weights into CD-level h5 files.\n", |
| 1136 | + "1. **Clone + assign geography** — `assign_random_geography()` creates N copies of each CPS record, each with a population-weighted random census block.\n", |
| 1137 | + "2. **Simulate** — `_simulate_clone()` sets each clone's `state_fips` and recalculates state-dependent benefits.\n", |
| 1138 | + "3. **Geographic masking** — `state_to_cols` / `cd_to_cols` restrict each target row to geographically relevant columns.\n", |
| 1139 | + "4. **Re-randomize takeup** — `rerandomize_takeup()` draws new takeup per census block, breaking the fixed-takeup assumption.\n", |
| 1140 | + "5. **Build matrix** — `UnifiedMatrixBuilder.build_matrix()` assembles the sparse CSR matrix from all clones.\n", |
| 1141 | + "6. **Stacked datasets** — `create_sparse_cd_stacked_dataset()` converts calibrated weights into CD-level h5 files.\n", |
1143 | 1142 | "\n", |
1144 | 1143 | "For matrix diagnostics (row/column anatomy, target groups, sparsity analysis), see [calibration_internals.ipynb](calibration_internals.ipynb)." |
1145 | 1144 | ] |
|