
Commit c1740cc

Merge pull request #700 from PolicyEngine/fix/build-database-from-source
Always build policy_data.db from source. Both Maria and I worked on this and Anthony gave me his blessing via Slack. Given this is my last week and things need to move, I am bypassing the reviewer check mark and merging.
2 parents a821f4a + e99a256 commit c1740cc

12 files changed

Lines changed: 65 additions & 127 deletions


Makefile

Lines changed: 12 additions & 26 deletions
@@ -1,8 +1,10 @@
-.PHONY: all format test test-unit test-integration install download upload docker documentation data validate-data calibrate calibrate-build publish-local-area upload-calibration upload-dataset upload-database push-to-modal build-data-modal build-matrices calibrate-modal calibrate-modal-national calibrate-both stage-h5s stage-national-h5 stage-all-h5s pipeline validate-staging validate-staging-full upload-validation check-staging check-sanity clean build paper clean-paper presentations database database-refresh promote-database promote-dataset promote build-h5s validate-local refresh-soi-targets push-pr-branch
+.PHONY: all format test test-unit test-integration install download upload docker documentation data validate-data calibrate calibrate-build publish-local-area upload-calibration upload-dataset push-to-modal build-data-modal build-matrices calibrate-modal calibrate-modal-national calibrate-both stage-h5s stage-national-h5 stage-all-h5s pipeline validate-staging validate-staging-full upload-validation check-staging check-sanity clean build paper clean-paper presentations database database-refresh promote-dataset promote build-h5s validate-local refresh-soi-targets push-pr-branch
 
 SOI_SOURCE_YEAR ?= 2021
 SOI_TARGET_YEAR ?= 2023
 
+YEAR ?= 2024
+
 GPU ?= T4
 EPOCHS ?= 1000
 NATIONAL_GPU ?= T4
@@ -75,38 +77,29 @@ documentation-dev:
 database:
 	rm -f policyengine_us_data/storage/calibration/policy_data.db
 	python policyengine_us_data/db/create_database_tables.py
-	python policyengine_us_data/db/create_initial_strata.py
-	python policyengine_us_data/db/etl_national_targets.py
-	python policyengine_us_data/db/etl_age.py
-	python policyengine_us_data/db/etl_medicaid.py
-	python policyengine_us_data/db/etl_snap.py
-	python policyengine_us_data/db/etl_state_income_tax.py
-	python policyengine_us_data/db/etl_irs_soi.py
-	python policyengine_us_data/db/etl_pregnancy.py
+	python policyengine_us_data/db/create_initial_strata.py --year $(YEAR)
+	python policyengine_us_data/db/etl_national_targets.py --year $(YEAR)
+	python policyengine_us_data/db/etl_age.py --year $(YEAR)
+	python policyengine_us_data/db/etl_medicaid.py --year $(YEAR)
+	python policyengine_us_data/db/etl_snap.py --year $(YEAR)
+	python policyengine_us_data/db/etl_state_income_tax.py --year $(YEAR)
+	python policyengine_us_data/db/etl_irs_soi.py --year $(YEAR)
+	python policyengine_us_data/db/etl_pregnancy.py --year $(YEAR)
 	python policyengine_us_data/db/validate_database.py
 
 database-refresh:
 	rm -f policyengine_us_data/storage/calibration/policy_data.db
 	rm -rf policyengine_us_data/storage/calibration/raw_inputs/
 	$(MAKE) database
 
-promote-database:
-	sqlite3 policyengine_us_data/storage/calibration/policy_data.db "PRAGMA wal_checkpoint(TRUNCATE);"
-	cp policyengine_us_data/storage/calibration/policy_data.db \
-	$(HF_CLONE_DIR)/calibration/policy_data.db
-	rm -rf $(HF_CLONE_DIR)/calibration/raw_inputs
-	cp -r policyengine_us_data/storage/calibration/raw_inputs \
-	$(HF_CLONE_DIR)/calibration/raw_inputs
-	@echo "Copied DB and raw_inputs to HF clone. Now cd to HF repo, commit, and push."
-
 promote-dataset:
 	python -c "from policyengine_us_data.utils.huggingface import upload; \
 	upload('policyengine_us_data/storage/source_imputed_stratified_extended_cps_2024.h5', \
 	'policyengine/policyengine-us-data', \
 	'calibration/source_imputed_stratified_extended_cps.h5')"
 	@echo "Dataset promoted to HF."
 
-data: download
+data: download database
 	python policyengine_us_data/utils/uprating.py
 	python policyengine_us_data/datasets/acs/acs.py
 	python policyengine_us_data/datasets/cps/cps.py
@@ -174,13 +167,6 @@ upload-dataset:
 	'calibration/source_imputed_stratified_extended_cps.h5')"
 	@echo "Dataset uploaded to HF."
 
-upload-database:
-	python -c "from policyengine_us_data.utils.huggingface import upload; \
-	upload('policyengine_us_data/storage/calibration/policy_data.db', \
-	'policyengine/policyengine-us-data', \
-	'calibration/policy_data.db')"
-	@echo "Database uploaded to HF."
-
 push-to-modal:
 	modal volume put pipeline-artifacts \
 	policyengine_us_data/storage/calibration/calibration_weights.npy \
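
The `database` recipe now passes a single `YEAR` value to every ETL script. As a rough illustration of that sequencing, here is a minimal Python sketch that builds the same command lines. The script names and their order are taken from the recipe; `build_commands` is a hypothetical helper that does not exist in the repository, and the initial `rm -f` step is left out.

```python
from pathlib import PurePosixPath

DB_DIR = PurePosixPath("policyengine_us_data/db")

# Scripts that take --year, in the order the Makefile recipe runs them.
YEAR_SCRIPTS = [
    "create_initial_strata.py",
    "etl_national_targets.py",
    "etl_age.py",
    "etl_medicaid.py",
    "etl_snap.py",
    "etl_state_income_tax.py",
    "etl_irs_soi.py",
    "etl_pregnancy.py",
]


def build_commands(year: int = 2024) -> list[list[str]]:
    """Return argv lists mirroring the `database` recipe (hypothetical helper)."""
    # Table creation runs first and takes no --year flag.
    commands = [["python", str(DB_DIR / "create_database_tables.py")]]
    for script in YEAR_SCRIPTS:
        commands.append(["python", str(DB_DIR / script), "--year", str(year)])
    # Validation runs last and also takes no --year flag.
    commands.append(["python", str(DB_DIR / "validate_database.py")])
    return commands


if __name__ == "__main__":
    for argv in build_commands(2023):
        print(" ".join(argv))
```

Because the Makefile declares `YEAR ?= 2024`, a command-line assignment such as `make database YEAR=2023` takes precedence over the default.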
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+Build policy_data.db from source instead of downloading from HuggingFace, replace H5 dataset dependency with a --year CLI flag for all database ETL scripts, fix Modal data build ordering (CPS before PUF), and add missing heapq import in local area builder.

modal_app/data_build.py

Lines changed: 8 additions & 1 deletion
@@ -435,7 +435,14 @@ def build_datasets(
         env=env,
         log_file=log_file,
     )
-    # Checkpoint policy_data.db immediately after download so it survives
+    # Build policy_data.db from source
+    subprocess.run(
+        ["uv", "run", "make", "database"],
+        check=True,
+        cwd="/root/policyengine-us-data",
+        env=env,
+    )
+    # Checkpoint policy_data.db immediately after build so it survives
     # test failures and can be restored on retries.
     save_checkpoint(
         branch,
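
The new build step relies on `check=True`, which makes `subprocess.run` raise `subprocess.CalledProcessError` on a non-zero exit, so the checkpoint that follows is only reached after a successful build. A minimal sketch of that control flow, assuming a zero-argument `save_checkpoint` callable (the real function takes `branch` and further arguments not shown in the diff) and using trivial Python subprocesses as stand-ins for `uv run make database`:

```python
import subprocess
import sys


def build_then_checkpoint(build_cmd, save_checkpoint):
    """Run the build command; checkpoint only if it exits with status 0."""
    # check=True raises subprocess.CalledProcessError on a non-zero exit,
    # so save_checkpoint is never called after a failed build.
    subprocess.run(build_cmd, check=True)
    save_checkpoint()


if __name__ == "__main__":
    saved = []
    ok_cmd = [sys.executable, "-c", "pass"]  # stand-in for a passing build
    bad_cmd = [sys.executable, "-c", "raise SystemExit(2)"]  # failing build

    build_then_checkpoint(ok_cmd, lambda: saved.append("ok"))
    try:
        build_then_checkpoint(bad_cmd, lambda: saved.append("bad"))
    except subprocess.CalledProcessError as err:
        print(f"build failed with exit code {err.returncode}")
    print(saved)
```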

modal_app/images.py

Lines changed: 1 addition & 1 deletion
@@ -52,7 +52,7 @@ def _base_image(extras: list[str] | None = None):
     extra_flags = " ".join(f"--extra {e}" for e in (extras or []))
     return (
         modal.Image.debian_slim(python_version="3.14")
-        .apt_install("git")
+        .apt_install("git", "make")
         .pip_install("uv>=0.8")
         .add_local_dir(
             str(REPO_ROOT),

modal_app/local_area.py

Lines changed: 1 addition & 0 deletions
@@ -11,6 +11,7 @@
     modal run modal_app/local_area.py --branch=main --num-workers=8
 """
 
+import heapq
 import json
 import os
 import subprocess

modal_app/pipeline.py

Lines changed: 8 additions & 7 deletions
@@ -880,11 +880,12 @@ def run_pipeline(
         print("\n[Step 3/5] Fit weights (skipped - completed)")
 
     # ── Step 4: Build H5s + stage + diagnostics (parallel) ──
-    # Per plan: all four tasks run in parallel:
     # 4a. coordinate_publish (regional H5s)
     # 4b. coordinate_national_publish (national H5)
     # 4c. stage_base_datasets (datasets → HF staging)
-    # 4d. upload_run_diagnostics (diagnostics → HF)
+    # 4d. upload_run_diagnostics (calibration diagnostics → HF)
+    # 4e. _write_validation_diagnostics (after H5 builds)
+    # 4f. upload_run_diagnostics (validation diagnostics → HF)
     if not _step_completed(meta, "publish_and_stage"):
         print(
             "\n[Step 4/5] Building H5s, staging datasets, "
@@ -918,16 +919,12 @@ def run_pipeline(
             f" → coordinate_national_publish fc: {national_h5_handle.object_id}"
         )
 
-        # While H5 builds run, stage base datasets
-        # and upload diagnostics in this container
+        # While H5 builds run, stage base datasets in this container
         pipeline_volume.reload()
 
         print(" Staging base datasets to HF...")
         stage_base_datasets(run_id, version, branch)
 
-        print(" Uploading run diagnostics...")
-        upload_run_diagnostics(run_id, branch)
-
         # Now wait for H5 builds to finish
         print(" Waiting for regional H5 build...")
         regional_h5_result = regional_h5_handle.get()
@@ -964,6 +961,10 @@ def run_pipeline(
             vol=pipeline_volume,
         )
 
+        # Upload validation diagnostics (written after H5 builds)
+        print(" Uploading validation diagnostics...")
+        upload_run_diagnostics(run_id, branch)
+
         _record_step(
             meta,
             "publish_and_stage",
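
The reordering in Step 4 amounts to: start both H5 builds, stage base datasets while they run, wait for the builds, and only then upload diagnostics, because the validation diagnostics are written after the H5 builds. The sketch below illustrates that ordering with `concurrent.futures` standing in for Modal function handles. Every function in it is a hypothetical stand-in that only records the order of events.

```python
from concurrent.futures import ThreadPoolExecutor


def run_step_4(log):
    """Record the order of Step 4 events (stand-ins, not the Modal calls)."""

    def build(name):
        return f"{name} built"

    with ThreadPoolExecutor(max_workers=2) as pool:
        # Start both H5 builds without blocking.
        regional = pool.submit(build, "regional H5")
        national = pool.submit(build, "national H5")

        # Work that does not depend on the builds runs in the meantime.
        log.append("stage base datasets")

        # Block until each build has finished.
        log.append(regional.result())
        log.append(national.result())

    # Validation diagnostics are produced after the builds, so the upload
    # comes last.
    log.append("write validation diagnostics")
    log.append("upload diagnostics")
    return log


if __name__ == "__main__":
    for event in run_step_4([]):
        print(event)
```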

policyengine_us_data/db/DATABASE_GUIDE.md

Lines changed: 1 addition & 15 deletions
@@ -15,7 +15,6 @@ cd ~/devl/sep/policyengine-us-data
 
 make database # Build (uses cached downloads if available)
 make database-refresh # Force re-download all sources and rebuild
-make promote-database # Copy DB + raw inputs to HuggingFace clone
 ```
 
 ### Pipeline Stages
@@ -44,19 +43,6 @@ Set `PE_REFRESH_RAW=1` to force re-download:
 PE_REFRESH_RAW=1 make database
 ```
 
-### Promotion to HuggingFace
-
-After building and validating:
-```bash
-make promote-database
-cd ~/devl/huggingface/policyengine-us-data
-git add calibration/policy_data.db calibration/raw_inputs/
-git commit -m "Update policy_data.db - <description>"
-git push
-```
-
-This copies both the database and the raw inputs that built it, preserving provenance in the HF repo's git history.
-
 ### Recovery
 
 If a step fails mid-pipeline, delete the database and re-run. With cached downloads this takes ~10-15 minutes:
@@ -286,4 +272,4 @@ ORDER BY geographic_id;
 
 `policyengine_us_data/storage/calibration/policy_data.db`
 
-Downloaded from HuggingFace by `download_private_prerequisites.py` and `download_calibration_inputs()` in `utils/huggingface.py`.
+Built from source via `make database`. See [Building the Database](#building-the-database) above.

policyengine_us_data/db/etl_national_targets.py

Lines changed: 13 additions & 15 deletions
@@ -13,20 +13,19 @@
     RETIREMENT_CONTRIBUTION_TARGETS,
 )
 from policyengine_us_data.utils.db import (
-    DEFAULT_DATASET,
+    DEFAULT_YEAR,
     etl_argparser,
 )
 
 
-def extract_national_targets(dataset: str = DEFAULT_DATASET):
+def extract_national_targets(year: int = DEFAULT_YEAR):
     """
     Extract national calibration targets from various sources.
 
     Parameters
     ----------
-    dataset : str
-        Path to the calibration dataset (local path or HuggingFace URL).
-        The time period is derived from the dataset's default_calculation_period.
+    year : int
+        Target year for calibration data.
 
     Returns
     -------
@@ -38,15 +37,14 @@ def extract_national_targets(dataset: str = DEFAULT_DATASET):
         - conditional_count_targets: Enrollment counts requiring constraints
         - cbo_targets: List of CBO projection targets
         - treasury_targets: List of Treasury/JCT targets
-        - time_period: The year derived from the dataset
+        - time_period: The target year
     """
-    from policyengine_us import Microsimulation
+    from policyengine_us import CountryTaxBenefitSystem
 
-    print(f"Loading dataset: {dataset}")
-    sim = Microsimulation(dataset=dataset)
+    time_period = year
+    print(f"Using time_period: {time_period}")
 
-    time_period = int(sim.default_calculation_period)
-    print(f"Derived time_period from dataset: {time_period}")
+    tax_benefit_system = CountryTaxBenefitSystem()
 
     # Hardcoded dollar targets are specific to 2024 and should be
     # labeled as such. Only CBO/Treasury parameter lookups use the
@@ -400,7 +398,7 @@ def extract_national_targets(dataset: str = DEFAULT_DATASET):
     for variable_name in cbo_vars:
         param_name = cbo_param_name_map.get(variable_name, variable_name)
         try:
-            value = sim.tax_benefit_system.parameters(
+            value = tax_benefit_system.parameters(
                 time_period
             ).calibration.gov.cbo._children[param_name]
             cbo_targets.append(
@@ -420,7 +418,7 @@ def extract_national_targets(dataset: str = DEFAULT_DATASET):
 
     # Treasury/JCT targets (EITC) - use time_period derived from dataset
     try:
-        eitc_value = sim.tax_benefit_system.parameters.calibration.gov.treasury.tax_expenditures.eitc(
+        eitc_value = tax_benefit_system.parameters.calibration.gov.treasury.tax_expenditures.eitc(
             time_period
         )
         treasury_targets = [
@@ -883,11 +881,11 @@ def load_national_targets(
 
 def main():
     """Main ETL pipeline for national targets."""
-    args, _ = etl_argparser("ETL for national calibration targets")
+    _, year = etl_argparser("ETL for national calibration targets")
 
     # Extract
     print("Extracting national targets...")
-    raw_targets = extract_national_targets(dataset=args.dataset)
+    raw_targets = extract_national_targets(year=year)
     time_period = raw_targets["time_period"]
     print(f"Using time_period={time_period} for CBO/Treasury targets")
 

Lines changed: 1 addition & 18 deletions
@@ -1,9 +1,5 @@
-import os
-
 from pathlib import Path
-from policyengine_us_data.db.create_database_tables import (
-    refresh_views_for_db_path,
-)
+
 from policyengine_us_data.utils.huggingface import download
 
 FOLDER = Path(__file__).parent
@@ -26,16 +22,3 @@
     local_folder=FOLDER,
     version=None,
 )
-if os.environ.get("SKIP_POLICY_DB_DOWNLOAD"):
-    print(
-        "SKIP_POLICY_DB_DOWNLOAD set — skipping "
-        "policy_data.db download from HuggingFace"
-    )
-else:
-    download(
-        repo="policyengine/policyengine-us-data",
-        repo_filename="calibration/policy_data.db",
-        local_folder=FOLDER,
-        version=None,
-    )
-refresh_views_for_db_path(FOLDER / "calibration" / "policy_data.db")

policyengine_us_data/utils/db.py

Lines changed: 9 additions & 27 deletions
@@ -1,5 +1,4 @@
 import argparse
-from pathlib import Path
 from typing import Dict, List, Optional, Tuple
 
 from sqlmodel import Session, select
@@ -9,54 +8,37 @@
     Stratum,
     StratumConstraint,
 )
-from policyengine_us_data.storage import STORAGE_FOLDER
 
-DEFAULT_DATASET = str(STORAGE_FOLDER / "source_imputed_stratified_extended_cps_2024.h5")
+DEFAULT_YEAR = 2024
 
 
 def etl_argparser(
     description: str,
     extra_args_fn=None,
 ) -> Tuple[argparse.Namespace, int]:
-    """Shared argument parsing and dataset-year derivation for ETL scripts.
+    """Shared argument parsing for ETL scripts.
 
     Args:
         description: Description for the argparse help text.
         extra_args_fn: Optional callable that receives the parser to add
             extra arguments before parsing.
 
     Returns:
-        (args, year) where *year* is derived from the dataset's
-        ``default_calculation_period``.
+        (args, year) tuple.
     """
     parser = argparse.ArgumentParser(description=description)
     parser.add_argument(
-        "--dataset",
-        default=DEFAULT_DATASET,
-        help=(
-            "Source dataset (local path or HuggingFace URL). "
-            "The year is derived from the dataset's "
-            "default_calculation_period. Default: %(default)s"
-        ),
+        "--year",
+        type=int,
+        default=DEFAULT_YEAR,
+        help="Target year for calibration data. Default: %(default)s",
     )
     if extra_args_fn is not None:
         extra_args_fn(parser)
 
     args = parser.parse_args()
-
-    if not args.dataset.startswith("hf://") and not Path(args.dataset).exists():
-        raise FileNotFoundError(
-            f"Dataset not found: {args.dataset}\n"
-            f"Either build it locally (`make data`) or pass a "
-            f"HuggingFace URL via --dataset hf://policyengine/..."
-        )
-
-    from policyengine_us import Microsimulation
-
-    print(f"Loading dataset: {args.dataset}")
-    sim = Microsimulation(dataset=args.dataset)
-    year = int(sim.default_calculation_period)
-    print(f"Derived year from dataset: {year}")
+    year = args.year
+    print(f"Using year: {year}")
 
     return args, year
 
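
After this change `etl_argparser` no longer loads a dataset to find the year; it just parses `--year`. The following standalone sketch shows how the simplified helper and its `extra_args_fn` hook could be used together. It follows the post-change function body from the diff, with two assumptions added for illustration only: an `argv` parameter so the parser can be driven without touching `sys.argv`, and a hypothetical `add_state_arg` callback.

```python
import argparse
from typing import Callable, Optional, Sequence, Tuple

DEFAULT_YEAR = 2024


def etl_argparser(
    description: str,
    extra_args_fn: Optional[Callable[[argparse.ArgumentParser], None]] = None,
    argv: Optional[Sequence[str]] = None,  # assumption: not in the original
) -> Tuple[argparse.Namespace, int]:
    """Parse shared ETL arguments and return (args, year)."""
    parser = argparse.ArgumentParser(description=description)
    parser.add_argument(
        "--year",
        type=int,
        default=DEFAULT_YEAR,
        help="Target year for calibration data. Default: %(default)s",
    )
    if extra_args_fn is not None:
        # Let the calling script register its own flags before parsing.
        extra_args_fn(parser)

    # argv=None makes argparse fall back to sys.argv[1:].
    args = parser.parse_args(argv)
    return args, args.year


def add_state_arg(parser: argparse.ArgumentParser) -> None:
    """Hypothetical script-specific flag, for illustration only."""
    parser.add_argument("--state", default="CA")


if __name__ == "__main__":
    args, year = etl_argparser(
        "demo ETL", add_state_arg, argv=["--year", "2023", "--state", "NY"]
    )
    print(year, args.state)
```

Taking the year from the command line means the ETL scripts can run before any H5 dataset has been built, which is what allows `data: download database` to place the database build ahead of the dataset steps.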
