Commit c8cac6a

review and update internals docs
1 parent 11d0315 commit c8cac6a

8 files changed

Lines changed: 1002 additions & 736 deletions

docs/internals/README.md

Lines changed: 46 additions & 4 deletions
@@ -6,11 +6,11 @@ Internal notebooks for the policyengine-us-data calibration pipeline. Not publis

 ## Notebooks

-| Notebook | Stages | Inputs & runtime |
+| Notebook | Stages | Required files / inputs |
 |---|---|---|
-| [`data_build_internals.ipynb`](data_build_internals.ipynb) | Stage 1: build_datasets | <1 min — small inputs (20 records, 3 clones); donor QRF cells need ACS/SIPP/SCF files |
-| [`calibration_package_internals.ipynb`](calibration_package_internals.ipynb) | Stage 2: build_package | <1 min — Part 1 uses a toy sparse matrix; Parts 2–5 use static excerpts or fast toy demos |
-| [`local_dataset_assembly_internals.ipynb`](local_dataset_assembly_internals.ipynb) | Stages 3–4: fit_weights, publish_and_stage | <2 min — L0 toy runs in <30s; diagnostic cells need a completed run's CSV output |
+| [`data_build_internals.ipynb`](data_build_internals.ipynb) | Stage 1: build_datasets | donor QRF cells need ACS/SIPP/SCF files |
+| [`calibration_package_internals.ipynb`](calibration_package_internals.ipynb) | Stage 2: build_package | Part 1 uses a toy sparse matrix; Parts 2–5 use static excerpts or toy demos |
+| [`local_dataset_assembly_internals.ipynb`](local_dataset_assembly_internals.ipynb) | Stages 3–4: fit_weights, publish_and_stage | L0 toy run; diagnostic cells need a completed run's CSV output |

 ### Which notebook to open

@@ -187,3 +187,45 @@ modal run modal_app/pipeline.py::main \
 ```

 Promote moves staged H5s to their production paths on HuggingFace. It does not re-run any computation. After promotion, the run's `status` in `meta.json` changes to `"promoted"`.
+
+---
+
+## File reference
+
+> **Note:** This reference reflects the codebase as of the time of writing. File responsibilities may shift as the pipeline evolves — use this as a starting point, then read the file to confirm.
+
+### `policyengine_us_data/calibration/`
+
+| File | Purpose |
+|---|---|
+| `unified_calibration.py` | Main calibration entry point: clones CPS, assigns geography, builds matrix, runs L0 optimizer, saves weights. Start here for the end-to-end flow. |
+| `unified_matrix_builder.py` | Builds the sparse calibration matrix. Per-state simulation, clone loop, domain constraints, takeup re-randomization, COO assembly. |
+| `clone_and_assign.py` | Clones CPS records N times, assigns each clone a random census block with no-CD-collision constraint and AGI-conditional routing. |
+| `block_assignment.py` | Per-CD block assignment and geographic variable derivation (county, tract, CBSA, SLDU, SLDL, place, PUMA, VTD, ZCTA) from block GEOIDs. |
+| `county_assignment.py` | Legacy/fallback: assigns counties within CDs using P(county \| CD). Only called by `block_assignment.py::_generate_fallback_blocks()` when a CD is missing from the pre-computed block distribution (primarily in tests). Not used in production pipeline runs. |
+| `puf_impute.py` | PUF cloning: doubles the dataset, imputes 70+ tax variables via sequential QRF, reconciles Social Security sub-components. |
+| `source_impute.py` | Re-imputes housing and asset variables from ACS, SIPP, and SCF donor surveys using QRF. |
+| `create_source_imputed_cps.py` | Standalone script that runs `source_impute.py` on the stratified extended CPS to produce the dataset used by calibration. |
+| `create_stratified_cps.py` | Creates a stratified CPS sample preserving all high-income households while maintaining low-income diversity. |
+| `publish_local_area.py` | Builds per-area H5 files (states, districts, cities) from calibrated weights. Weight expansion, entity cloning, geography override, SPM recalculation, takeup draws. |
+| `calibration_utils.py` | Shared utilities: state mappings, SPM threshold calculation, geographic adjustment factors, target group functions, initial weight computation. |
+| `target_config.yaml` | Include rules that gate which DB targets enter calibration (applied post-matrix-build). The training config. |
+| `target_config_full.yaml` | Broader include rules used for validation — includes targets not in the training set for holdout evaluation. |
+| `validate_staging.py` | Validates built H5 files by running `sim.calculate()` and comparing weighted aggregates against DB targets. Produces `validation_results.csv`. |
+| `validate_national_h5.py` | Validates the national `US.h5` against known national totals and runs structural sanity checks. |
+| `validate_package.py` | Validates a calibration package (matrix + targets) before uploading to Modal — checks structure, achievability, and provenance. |
+| `sanity_checks.py` | Structural integrity checks on H5 files: weights, monetary variable ranges, takeup booleans, entity ID consistency. |
+| `check_staging_sums.py` | Standalone CLI utility (not part of the automated pipeline): sums key variables across all 51 state H5 files and compares to national references. Run manually via `make check-staging` or `python -m ...`. |
+| `promote_local_h5s.py` | Standalone CLI utility (not part of the automated pipeline): promotes locally-built H5 files to production via HuggingFace staging and GCS upload. Used for manual local builds outside Modal. |
+
+### `modal_app/`
+
+| File | Purpose |
+|---|---|
+| `pipeline.py` | End-to-end pipeline orchestrator: chains dataset build → matrix build → weight fitting → H5 publish → promote. Manages run IDs, resume, and diagnostics upload. |
+| `data_build.py` | Modal app for Stage 1: parallel dataset building (CPS extraction, PUF cloning, source imputation) with checkpoint persistence. |
+| `remote_calibration_runner.py` | Modal app for Stages 2–3: builds calibration package and/or runs L0 optimizer on GPU. Supports `build_package` and `fit_from_package` workflows. |
+| `local_area.py` | Modal app for Stage 4: parallel H5 building with distributed worker coordination, LPT scheduling, and validation aggregation. |
+| `worker_script.py` | Subprocess worker called by `local_area.py` to build individual H5 files. Runs in a separate process to avoid import conflicts. |
+| `images.py` | Defines pre-baked Modal container images with source code, dependencies, and Git metadata for reproducibility. |
+| `resilience.py` | Retry and resume utilities for Modal workflows (exponential backoff, idempotent step execution). |
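The `resilience.py` row mentions retry with exponential backoff. For readers unfamiliar with the pattern, here is a minimal generic sketch. It is not the code in `resilience.py`: the function name `retry_with_backoff` and every parameter are hypothetical, chosen only for illustration. Read the actual file to see how the repository implements it.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn until it succeeds, doubling the wait after each failure.

    Hypothetical helper for illustration only; not the API of resilience.py.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            # Out of attempts: surface the last error to the caller.
            if attempt == max_attempts - 1:
                raise
            # Wait base_delay, 2x, 4x, ... capped at max_delay.
            sleep(min(max_delay, base_delay * 2**attempt))
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter is a design choice that lets tests substitute a recorder for `time.sleep`, so the backoff schedule can be checked without real waiting. Production variants often add random jitter to the delay, which this sketch omits.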
