docs/internals/README.md
+46 -4 (46 additions & 4 deletions)
@@ -6,11 +6,11 @@ Internal notebooks for the policyengine-us-data calibration pipeline. Not publis
 
 ## Notebooks
 
-| Notebook | Stages |Inputs & runtime|
+| Notebook | Stages |Required files / inputs|
 |---|---|---|
-|[`data_build_internals.ipynb`](data_build_internals.ipynb)| Stage 1: build_datasets |<1 min: small inputs (20 records, 3 clones); donor QRF cells need ACS/SIPP/SCF files |
-|[`calibration_package_internals.ipynb`](calibration_package_internals.ipynb)| Stage 2: build_package |<1 min: Part 1 uses a toy sparse matrix; Parts 2–5 use static excerpts or fast toy demos |
-|[`local_dataset_assembly_internals.ipynb`](local_dataset_assembly_internals.ipynb)| Stages 3–4: fit_weights, publish_and_stage |<2 min: L0 toy runs in <30s; diagnostic cells need a completed run's CSV output |
+|[`calibration_package_internals.ipynb`](calibration_package_internals.ipynb)| Stage 2: build_package | Part 1 uses a toy sparse matrix; Parts 2–5 use static excerpts or toy demos |
+|[`local_dataset_assembly_internals.ipynb`](local_dataset_assembly_internals.ipynb)| Stages 3–4: fit_weights, publish_and_stage | L0 toy run; diagnostic cells need a completed run's CSV output |
 
 ### Which notebook to open
 
@@ -187,3 +187,45 @@ modal run modal_app/pipeline.py::main \
 ```
 
 Promote moves staged H5s to their production paths on HuggingFace. It does not re-run any computation. After promotion, the run's `status` in `meta.json` changes to `"promoted"`.
+
+---
+
+## File reference
+
+> **Note:** This reference reflects the codebase at the time of writing. File responsibilities may shift as the pipeline evolves, so use it as a starting point and read the file to confirm.
+
+### `policyengine_us_data/calibration/`
+
+| File | Purpose |
+|---|---|
+|`unified_calibration.py`| Main calibration entry point: clones CPS, assigns geography, builds the matrix, runs the L0 optimizer, saves weights. Start here for the end-to-end flow. |
+|`clone_and_assign.py`| Clones CPS records N times, assigns each clone a random census block with a no-CD-collision constraint and AGI-conditional routing. |
+|`county_assignment.py`| Legacy/fallback: assigns counties within CDs using P(county \| CD). Only called by `block_assignment.py::_generate_fallback_blocks()` when a CD is missing from the pre-computed block distribution (primarily in tests). Not used in production pipeline runs. |
+|`puf_impute.py`| PUF cloning: doubles the dataset, imputes 70+ tax variables via sequential QRF, reconciles Social Security sub-components. |
+|`source_impute.py`| Re-imputes housing and asset variables from ACS, SIPP, and SCF donor surveys using QRF. |
+|`create_source_imputed_cps.py`| Standalone script that runs `source_impute.py` on the stratified extended CPS to produce the dataset used by calibration. |
+|`create_stratified_cps.py`| Creates a stratified CPS sample preserving all high-income households while maintaining low-income diversity. |
+|`calibration_utils.py`| Shared utilities: state mappings, SPM threshold calculation, geographic adjustment factors, target group functions, initial weight computation. |
+|`target_config.yaml`| The training config: include rules that gate which DB targets enter calibration (applied after the matrix is built). |
+|`target_config_full.yaml`| The validation config: broader include rules that add targets outside the training set for holdout evaluation. |
+|`validate_staging.py`| Validates built H5 files by running `sim.calculate()` and comparing weighted aggregates against DB targets. Produces `validation_results.csv`. |
+|`validate_national_h5.py`| Validates the national `US.h5` against known national totals and runs structural sanity checks. |
+|`validate_package.py`| Validates a calibration package (matrix + targets) before it is uploaded to Modal: checks structure, achievability, and provenance. |
+|`sanity_checks.py`| Structural integrity checks on H5 files: weights, monetary variable ranges, takeup booleans, entity ID consistency. |
+|`check_staging_sums.py`| Standalone CLI utility (not part of the automated pipeline): sums key variables across all 51 state H5 files and compares to national references. Run manually via `make check-staging` or `python -m ...`. |
+|`promote_local_h5s.py`| Standalone CLI utility (not part of the automated pipeline): promotes locally-built H5 files to production via HuggingFace staging and GCS upload. Used for manual local builds outside Modal. |
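
The `check_staging_sums.py` row describes summing key variables across the state files and comparing the totals to national references. The sketch below illustrates that kind of check on plain dictionaries. It is not the script's actual code: the function name `compare_state_sums`, the 5% tolerance, the variable names, and all numbers are made up for illustration, and it skips reading H5 files entirely.

```python
from typing import Dict, List, Tuple

# One result row: (variable, summed_total, reference, relative_diff, within_tolerance)
Row = Tuple[str, float, float, float, bool]


def compare_state_sums(
    state_totals: Dict[str, Dict[str, float]],
    national_reference: Dict[str, float],
    tolerance: float = 0.05,  # hypothetical threshold, not taken from the repo
) -> List[Row]:
    """Sum each variable across states and compare it to a national reference."""
    rows: List[Row] = []
    for variable, reference in national_reference.items():
        # A state that lacks the variable contributes zero to the total.
        summed = sum(totals.get(variable, 0.0) for totals in state_totals.values())
        if reference != 0:
            rel_diff = (summed - reference) / reference
        elif summed == 0:
            rel_diff = 0.0
        else:
            # A nonzero total against a zero reference can never be within tolerance.
            rel_diff = float("inf")
        rows.append((variable, summed, reference, rel_diff, abs(rel_diff) <= tolerance))
    return rows


# Toy inputs (made-up numbers, two states instead of 51).
state_totals = {
    "CA": {"employment_income": 600.0, "snap": 30.0},
    "NY": {"employment_income": 400.0, "snap": 25.0},
}
national_reference = {"employment_income": 1000.0, "snap": 50.0}

results = compare_state_sums(state_totals, national_reference)
for variable, summed, reference, rel_diff, ok in results:
    status = "OK" if ok else "CHECK"
    print(f"{variable}: summed={summed:.1f} reference={reference:.1f} "
          f"rel_diff={rel_diff:+.1%} [{status}]")
```

A relative difference is used rather than an absolute one so that variables on very different scales can share one threshold. The zero-reference branch keeps the check from raising `ZeroDivisionError`.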