diff --git a/docs/local_area_calibration_setup.ipynb b/docs/local_area_calibration_setup.ipynb
Lines changed: 17 additions & 18 deletions
diff --git a/modal_app/data_build.py b/modal_app/data_build.py
Lines changed: 18 additions & 10 deletions
diff --git a/modal_app/pipeline.py b/modal_app/pipeline.py
Lines changed: 11 additions & 13 deletions
diff --git a/modal_app/worker_script.py b/modal_app/worker_script.py
Lines changed: 0 additions & 1 deletion
diff --git a/paper/scripts/calculate_target_performance.py b/paper/scripts/calculate_target_performance.py
Lines changed: 1 addition & 1 deletion
diff --git a/paper/scripts/generate_all_tables.py b/paper/scripts/generate_all_tables.py
Lines changed: 0 additions & 2 deletions
diff --git a/paper/scripts/generate_validation_metrics.py b/paper/scripts/generate_validation_metrics.py
Lines changed: 0 additions & 1 deletion
diff --git a/paper/scripts/markdown_to_latex.py b/paper/scripts/markdown_to_latex.py
Lines changed: 0 additions & 1 deletion
diff --git a/policyengine_us_data/calibration/calibration_utils.py b/policyengine_us_data/calibration/calibration_utils.py
Lines changed: 0 additions & 1 deletion
diff --git a/policyengine_us_data/calibration/clone_and_assign.py b/policyengine_us_data/calibration/clone_and_assign.py
Lines changed: 6 additions & 0 deletions
@@ -9,7 +9,7 @@
     "\n",
     "This notebook demonstrates the clone-based calibration pipeline: how raw CPS records become a calibration matrix and, ultimately, CD-level stacked datasets.\n",
     "\n",
-    "The paradigm shift from the old approach: instead of replicating every household into every congressional district, we **clone** each record N times and assign each clone a **random census block** drawn from a population-weighted distribution. Each clone inherits a state, CD, and block \u2014 and gets re-simulated under the rules of its assigned state.\n",
+    "The paradigm shift from the old approach: instead of replicating every household into every congressional district, we **clone** each record N times and assign each clone a **random census block** drawn from a population-weighted distribution. Each clone inherits a state, CD, and block — and gets re-simulated under the rules of its assigned state.\n",
     "\n",
     "We follow one household (`record_idx=8629`, household_id 128694, SNAP \\$18,396) through the entire pipeline:\n",
     "1. Clone and assign geography\n",
@@ -19,7 +19,7 @@
     "5. Build the calibration matrix\n",
     "6. Create stacked datasets from calibrated weights\n",
     "\n",
-    "**Companion notebook:** [calibration_internals.ipynb](calibration_internals.ipynb) covers the *finished* matrix \u2014 row/column anatomy, target groups, sparsity. This notebook covers the *process* that creates it and what happens after (stacked datasets).\n",
+    "**Companion notebook:** [calibration_internals.ipynb](calibration_internals.ipynb) covers the *finished* matrix — row/column anatomy, target groups, sparsity. This notebook covers the *process* that creates it and what happens after (stacked datasets).\n",
     "\n",
     "**Requirements:** `policy_data.db`, `block_cd_distributions.csv.gz`, and the stratified CPS h5 file in `STORAGE_FOLDER`."
    ]
@@ -56,7 +56,6 @@
     "from policyengine_us_data.storage import STORAGE_FOLDER\n",
     "from policyengine_us_data.calibration.clone_and_assign import (\n",
     "    assign_random_geography,\n",
-    "    GeographyAssignment,\n",
     "    load_global_block_distribution,\n",
     ")\n",
     "from policyengine_us_data.calibration.unified_matrix_builder import (\n",
@@ -303,13 +302,13 @@
    "id": "cell-9",
    "metadata": {},
    "source": [
-    "## Section 3: Inside `_simulate_clone` \u2014 State-Swap\n",
+    "## Section 3: Inside `_simulate_clone` — State-Swap\n",
     "\n",
     "For each clone, `_simulate_clone` does four things:\n",
     "1. Creates a **fresh** `Microsimulation` from the base dataset\n",
     "2. Overwrites `state_fips` with the clone's assigned states\n",
     "3. Optionally calls a `sim_modifier` (e.g., takeup re-randomization)\n",
-    "4. **Clears cached formulas** via `get_calculated_variables` \u2014 preserving survey inputs and IDs while forcing recalculation of state-dependent variables like SNAP\n",
+    "4. **Clears cached formulas** via `get_calculated_variables` — preserving survey inputs and IDs while forcing recalculation of state-dependent variables like SNAP\n",
     "\n",
     "Let's reproduce this manually for clone 0."
    ]
@@ -476,7 +475,7 @@
     "\n",
     "When assembling the calibration matrix, each target row only \"sees\" columns (clones) whose geography matches the target's geography. This is implemented via `state_to_cols` and `cd_to_cols` dictionaries built from the `GeographyAssignment`.\n",
     "\n",
-    "This is step 3 of `build_matrix` \u2014 reproduced here for transparency."
+    "This is step 3 of `build_matrix` — reproduced here for transparency."
    ]
   },
   {
@@ -585,7 +584,7 @@
    "source": [
     "## Section 5: Takeup Re-randomization\n",
     "\n",
-    "The base CPS has fixed takeup decisions (e.g., \"this household takes up SNAP\"). But when we clone a household into different census blocks, each block should have independently drawn takeup \u2014 otherwise every clone of a SNAP-participating household would still participate, regardless of geography.\n",
+    "The base CPS has fixed takeup decisions (e.g., \"this household takes up SNAP\"). But when we clone a household into different census blocks, each block should have independently drawn takeup — otherwise every clone of a SNAP-participating household would still participate, regardless of geography.\n",
     "\n",
     "`rerandomize_takeup` solves this: for each census block, it uses `seeded_rng(variable_name, salt=block_geoid)` to draw new takeup booleans. The seed is deterministic per (variable, block) pair, so results are reproducible."
    ]
@@ -763,7 +762,7 @@
    "id": "cell-22",
    "metadata": {},
    "source": [
-    "In the full pipeline, `rerandomize_takeup` is passed to `build_matrix` as a `sim_modifier` callback. For each clone, after `state_fips` is set but before formula caches are cleared, the callback draws new takeup booleans per census block. This means the same household in block A might take up SNAP while in block B it doesn't \u2014 matching the statistical reality that takeup varies by geography."
+    "In the full pipeline, `rerandomize_takeup` is passed to `build_matrix` as a `sim_modifier` callback. For each clone, after `state_fips` is set but before formula caches are cleared, the callback draws new takeup booleans per census block. This means the same household in block A might take up SNAP while in block B it doesn't — matching the statistical reality that takeup varies by geography."
    ]
   },
   {
@@ -871,9 +870,9 @@
    "source": [
     "## Section 7: From Weights to Datasets\n",
     "\n",
-    "`create_sparse_cd_stacked_dataset` takes calibrated weights and builds an h5 file with only the non-zero-weight households, reindexed per CD. Internally it does its own state-swap simulation \u2014 loading the base dataset, assigning `state_fips` for the target CD's state, and recalculating benefits from scratch. This means SNAP values in the output reflect the destination state's rules (e.g., a $70 SNAP household from ME may get $0 under AK rules).\n",
+    "`create_sparse_cd_stacked_dataset` takes calibrated weights and builds an h5 file with only the non-zero-weight households, reindexed per CD. Internally it does its own state-swap simulation — loading the base dataset, assigning `state_fips` for the target CD's state, and recalculating benefits from scratch. This means SNAP values in the output reflect the destination state's rules (e.g., a $70 SNAP household from ME may get $0 under AK rules).\n",
     "\n",
-    "**Format gap:** The calibration produces weights in clone layout `(n_records * n_clones,)` where each clone maps to one specific CD via the `GeographyAssignment`. The stacked dataset builder expects CD layout `(n_cds * n_households,)` where every CD has a weight slot for every household. Converting between these \u2014 accumulating clone weights into their assigned CDs \u2014 is a separate step not yet implemented. The demo below constructs artificial CD-layout weights directly to show how the builder works."
+    "**Format gap:** The calibration produces weights in clone layout `(n_records * n_clones,)` where each clone maps to one specific CD via the `GeographyAssignment`. The stacked dataset builder expects CD layout `(n_cds * n_households,)` where every CD has a weight slot for every household. Converting between these — accumulating clone weights into their assigned CDs — is a separate step not yet implemented. The demo below constructs artificial CD-layout weights directly to show how the builder works."
    ]
   },
   {
@@ -1012,9 +1011,9 @@
       "\n",
       "Overflow check:\n",
       "  Max person ID after reindexing: 5,025,365\n",
-      "  Max person ID \u00d7 100: 502,536,500\n",
+      "  Max person ID × 100: 502,536,500\n",
       "  int32 max: 2,147,483,647\n",
-      "  \u2713 No overflow risk!\n",
+      "  ✓ No overflow risk!\n",
       "\n",
       "Creating Dataset from combined DataFrame...\n",
       "Building simulation from Dataset...\n",
@@ -1134,12 +1133,12 @@
     "\n",
     "The clone-based calibration pipeline has six stages:\n",
     "\n",
-    "1. **Clone + assign geography** \u2014 `assign_random_geography()` creates N copies of each CPS record, each with a population-weighted random census block.\n",
-    "2. **Simulate** \u2014 `_simulate_clone()` sets each clone's `state_fips` and recalculates state-dependent benefits.\n",
-    "3. **Geographic masking** \u2014 `state_to_cols` / `cd_to_cols` restrict each target row to geographically relevant columns.\n",
-    "4. **Re-randomize takeup** \u2014 `rerandomize_takeup()` draws new takeup per census block, breaking the fixed-takeup assumption.\n",
-    "5. **Build matrix** \u2014 `UnifiedMatrixBuilder.build_matrix()` assembles the sparse CSR matrix from all clones.\n",
-    "6. **Stacked datasets** \u2014 `create_sparse_cd_stacked_dataset()` converts calibrated weights into CD-level h5 files.\n",
+    "1. **Clone + assign geography** — `assign_random_geography()` creates N copies of each CPS record, each with a population-weighted random census block.\n",
+    "2. **Simulate** — `_simulate_clone()` sets each clone's `state_fips` and recalculates state-dependent benefits.\n",
+    "3. **Geographic masking** — `state_to_cols` / `cd_to_cols` restrict each target row to geographically relevant columns.\n",
+    "4. **Re-randomize takeup** — `rerandomize_takeup()` draws new takeup per census block, breaking the fixed-takeup assumption.\n",
+    "5. **Build matrix** — `UnifiedMatrixBuilder.build_matrix()` assembles the sparse CSR matrix from all clones.\n",
+    "6. **Stacked datasets** — `create_sparse_cd_stacked_dataset()` converts calibrated weights into CD-level h5 files.\n",
     "\n",
     "For matrix diagnostics (row/column anatomy, target groups, sparsity analysis), see [calibration_internals.ipynb](calibration_internals.ipynb)."
    ]
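
Note on Section 5 of the notebook: the per-block takeup draw it describes can be illustrated with a small standalone sketch. This is not the repository's `seeded_rng` or `rerandomize_takeup`; `demo_seeded_rng` and `draw_takeup` are hypothetical stand-ins, and the hashing scheme, variable name, and block GEOID below are assumptions chosen only to show why a seed derived from the (variable, block) pair makes the draws reproducible.

```python
import hashlib
import random


def demo_seeded_rng(variable_name: str, salt: str) -> random.Random:
    """Hypothetical stand-in for seeded_rng: stable seed from (variable, salt)."""
    digest = hashlib.sha256(f"{variable_name}:{salt}".encode()).digest()
    # Use the first 8 bytes of the digest as an integer seed.
    return random.Random(int.from_bytes(digest[:8], "big"))


def draw_takeup(variable_name: str, block_geoid: str, n: int, rate: float) -> list[bool]:
    """Draw n takeup booleans for one census block at the given takeup rate."""
    rng = demo_seeded_rng(variable_name, block_geoid)
    return [rng.random() < rate for _ in range(n)]


# Illustrative inputs only: the variable name and block GEOID are made up.
first = draw_takeup("takes_up_snap_if_eligible", "010010201001000", n=5, rate=0.8)
second = draw_takeup("takes_up_snap_if_eligible", "010010201001000", n=5, rate=0.8)
```

Repeating the call with the same variable and block yields the same booleans, which is the reproducibility property the notebook text relies on.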
 
@@ -591,18 +591,26 @@ def build_datasets(
 
     # Copy pipeline artifacts to shared volume before tests so that a test
     # failure does not block downstream calibration steps.
-    # Files selected:
-    #   - source_imputed H5: main dataset for calibration and local area builds
-    #   - policy_data.db: calibration target database
-    #   - calibration_weights.npy: pre-existing weights for re-runs (if present)
-    #   - build_log.txt: persistent build log with provenance
     print("Copying pipeline artifacts to shared volume...")
     artifacts_dir = Path(PIPELINE_MOUNT) / "artifacts"
     artifacts_dir.mkdir(parents=True, exist_ok=True)
-    shutil.copy2(
-        "policyengine_us_data/storage/source_imputed_stratified_extended_cps_2024.h5",
-        artifacts_dir / "source_imputed_stratified_extended_cps.h5",
-    )
+
+    # Copy all intermediate H5 datasets for lineage tracing
+    for output in SCRIPT_OUTPUTS.values():
+        paths = output if isinstance(output, list) else [output]
+        for p in paths:
+            src = Path(p)
+            if src.suffix == ".h5" and src.exists():
+                shutil.copy2(src, artifacts_dir / src.name)
+                print(
+                    f"  Copied {src.name} ({src.stat().st_size / 1024 / 1024:.1f} MB)"
+                )
+
+    # Yearless alias for pipeline consumers (remote_calibration_runner, local_area)
+    si = artifacts_dir / "source_imputed_stratified_extended_cps_2024.h5"
+    if si.exists():
+        shutil.copy2(si, artifacts_dir / "source_imputed_stratified_extended_cps.h5")
+
     shutil.copy2(
         "policyengine_us_data/storage/calibration/policy_data.db",
         artifacts_dir / "policy_data.db",
@@ -613,7 +621,7 @@ def build_datasets(
             cal_weights,
             artifacts_dir / "calibration_weights.npy",
         )
-        print("Copied existing calibration_weights.npy to pipeline volume")
+        print("  Copied calibration_weights.npy")
     shutil.copy2(log_path, artifacts_dir / "build_log.txt")
     log_file.close()
     pipeline_volume.commit()
 
@@ -305,21 +305,19 @@ def stage_base_datasets(
     """
     artifacts = Path(ARTIFACTS_DIR)
 
-    source_imputed = artifacts / "source_imputed_stratified_extended_cps.h5"
-    policy_db = artifacts / "policy_data.db"
-
     files_with_paths = []
-    if source_imputed.exists():
-        files_with_paths.append(
-            (
-                str(source_imputed),
-                "calibration/source_imputed_stratified_extended_cps.h5",
-            )
-        )
-        print(f"  source_imputed: {source_imputed.stat().st_size:,} bytes")
-    else:
-        print("  WARNING: source_imputed not found, skipping")
 
+    # Stage all intermediate H5 datasets for lineage tracing
+    # source_imputed* goes to calibration/ (promote expects that path)
+    for h5_file in sorted(artifacts.glob("*.h5")):
+        if h5_file.name.startswith("source_imputed"):
+            repo_path = f"calibration/{h5_file.name}"
+        else:
+            repo_path = f"datasets/{h5_file.name}"
+        files_with_paths.append((str(h5_file), repo_path))
+        print(f"  {h5_file.name} -> {repo_path}: {h5_file.stat().st_size:,} bytes")
+
+    policy_db = artifacts / "policy_data.db"
     if policy_db.exists():
         files_with_paths.append((str(policy_db), "calibration/policy_data.db"))
         print(f"  policy_data.db: {policy_db.stat().st_size:,} bytes")
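
Note on the hunk above: the staging loop routes each H5 file by name prefix, sending `source_imputed*` files to `calibration/` and everything else to `datasets/`. A sketch of that rule as a pure function, with the hypothetical name `repo_path_for` and an illustrative second file name:

```python
def repo_path_for(h5_name: str) -> str:
    """Return the staging path for an H5 artifact, keyed on its name prefix."""
    if h5_name.startswith("source_imputed"):
        # Kept under calibration/ because the promote step expects that path.
        return f"calibration/{h5_name}"
    return f"datasets/{h5_name}"


# The second file name is illustrative, not taken from the repository.
staged = [
    repo_path_for("source_imputed_stratified_extended_cps.h5"),
    repo_path_for("example_intermediate.h5"),
]
```

If the artifacts directory holds both the year-suffixed source-imputed file and its yearless alias, both match the prefix and both land under `calibration/`.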
 
@@ -250,7 +250,6 @@ def main():
         from policyengine_us_data.calibration.validate_staging import (
             _query_all_active_targets,
             _batch_stratum_constraints,
-            CSV_COLUMNS,
         )
         from policyengine_us_data.calibration.unified_calibration import (
             load_target_config,
 
@@ -9,7 +9,7 @@
 import numpy as np
 from pathlib import Path
 import json
-from typing import Dict, List, Tuple
+from typing import Dict, List
 
 
 def calculate_target_achievement(
 
@@ -6,9 +6,7 @@
 """
 
 import pandas as pd
-import numpy as np
 from pathlib import Path
-import os
 
 
 def format_number(value, decimals=3):
 
@@ -7,7 +7,6 @@
 """
 
 import pandas as pd
-import numpy as np
 from policyengine_us import Microsimulation
 from policyengine_us_data.datasets.cps.enhanced_cps import EnhancedCPS
 from policyengine_us_data.datasets.cps.cps import CPS
 
@@ -6,7 +6,6 @@
 """
 
 import re
-import os
 from pathlib import Path
 
 
 
@@ -491,7 +491,6 @@ def get_cd_index_mapping(db_uri: str = None):
         tuple: (cd_to_index dict, index_to_cd dict, cds_ordered list)
     """
     from sqlalchemy import create_engine, text
-    from pathlib import Path
     from policyengine_us_data.storage import STORAGE_FOLDER
 
     if db_uri is None:
 
@@ -51,6 +51,12 @@ def load_global_block_distribution():
 
     df = pd.read_csv(csv_path, dtype={"block_geoid": str})
 
+    # Normalize at-large districts: Census uses 00 (and 98 for DC) → 01
+    district_num = df["cd_geoid"] % 100
+    state_fips_col = df["cd_geoid"] // 100
+    at_large = (district_num == 0) | ((state_fips_col == 11) & (district_num == 98))
+    df.loc[at_large, "cd_geoid"] = state_fips_col[at_large] * 100 + 1
+
     block_geoids = df["block_geoid"].values
     cd_geoids = np.array(df["cd_geoid"].astype(str).tolist())
     state_fips = np.array([int(b[:2]) for b in block_geoids])
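
Note on the hunk above: the normalization splits the integer `cd_geoid` into state FIPS (`// 100`) and district number (`% 100`), then rewrites district `00` (and `98` for DC, state FIPS 11) to `01`. A scalar sketch of the same arithmetic, using the hypothetical helper name `normalize_at_large` rather than the vectorized pandas code in the diff:

```python
def normalize_at_large(cd_geoid: int) -> int:
    """Map at-large district codes (00, or 98 for DC) to district 01."""
    state_fips, district = divmod(cd_geoid, 100)
    if district == 0 or (state_fips == 11 and district == 98):
        return state_fips * 100 + 1
    return cd_geoid


examples = {geoid: normalize_at_large(geoid) for geoid in (200, 1198, 601, 4898)}
```

Here `200` (state 02, district 00) becomes `201`, `1198` (DC) becomes `1101`, and `601` is left alone. District `98` is rewritten only for state FIPS 11, so `4898` is also unchanged.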