Skip to content

Commit 73428fe

Browse files
baogorekclaude
andauthored
Address PR #505 review feedback (#519)
* Address PR #505 review feedback - Extract shared etl_argparser() into utils/db.py to eliminate repeated boilerplate across 7 ETL scripts - Label hardcoded dollar targets with HARDCODED_YEAR = 2024 instead of dynamic time_period; add warnings.warn when dataset year differs - Delete dead get_pseudo_input_variables() and update callers - Switch DEFAULT_DATASET to local storage path for local-first workflow - Add promote-dataset Makefile target and HF_CLONE_DIR variable - Add SOI Congress-session constants with RuntimeError guard for future tax-year bumps - Update Makefile comments for stratified CPS parameters Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Add helpful error when local dataset file is missing On a fresh checkout without `make data`, the local DEFAULT_DATASET won't exist. Give a clear FileNotFoundError suggesting `make data` or `--dataset hf://...` instead of a cryptic load failure. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 86d01ce commit 73428fe

15 files changed

Lines changed: 194 additions & 279 deletions

File tree

Makefile

Lines changed: 15 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,6 @@
1-
.PHONY: all format test install download upload docker documentation data publish-local-area clean build paper clean-paper presentations database database-refresh promote-database
1+
.PHONY: all format test install download upload docker documentation data publish-local-area clean build paper clean-paper presentations database database-refresh promote-database promote-dataset
2+
3+
HF_CLONE_DIR = $(HOME)/devl/huggingface/policyengine-us-data
24

35
all: data test
46

@@ -72,12 +74,17 @@ database-refresh:
7274

7375
promote-database:
7476
cp policyengine_us_data/storage/calibration/policy_data.db \
75-
$(HOME)/devl/huggingface/policyengine-us-data/calibration/policy_data.db
76-
rm -rf $(HOME)/devl/huggingface/policyengine-us-data/calibration/raw_inputs
77+
$(HF_CLONE_DIR)/calibration/policy_data.db
78+
rm -rf $(HF_CLONE_DIR)/calibration/raw_inputs
7779
cp -r policyengine_us_data/storage/calibration/raw_inputs \
78-
$(HOME)/devl/huggingface/policyengine-us-data/calibration/raw_inputs
80+
$(HF_CLONE_DIR)/calibration/raw_inputs
7981
@echo "Copied DB and raw_inputs to HF clone. Now cd to HF repo, commit, and push."
8082

83+
promote-dataset:
84+
cp policyengine_us_data/storage/stratified_extended_cps_2024.h5 \
85+
$(HF_CLONE_DIR)/calibration/stratified_extended_cps.h5
86+
@echo "Copied dataset to HF clone. Now cd to HF repo, commit, and push."
87+
8188
data: download
8289
python policyengine_us_data/utils/uprating.py
8390
python policyengine_us_data/datasets/acs/acs.py
@@ -87,6 +94,10 @@ data: download
8794
python policyengine_us_data/datasets/cps/extended_cps.py
8895
python policyengine_us_data/datasets/cps/enhanced_cps.py
8996
python policyengine_us_data/datasets/cps/small_enhanced_cps.py
97+
# 12000: number of households our GPUs can handle (found via trial and error).
98+
# --top=99.5: include only top 0.5% (vs default 1%) to preserve
99+
# representation of lower-income households.
100+
# --seed=3526: reproducible stratified sampling.
90101
python policyengine_us_data/datasets/cps/local_area_calibration/create_stratified_cps.py 12000 --top=99.5 --seed=3526
91102

92103
publish-local-area:

changelog_entry.yaml

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,10 @@
1+
- bump: patch
2+
changes:
3+
changed:
4+
- Switch DEFAULT_DATASET to local storage path for database ETL scripts
5+
- Extract shared etl_argparser() to reduce boilerplate across 7 ETL scripts
6+
- Delete dead get_pseudo_input_variables() function
7+
added:
8+
- promote-dataset Makefile target
9+
- Year-mismatch warning in national targets ETL
10+
- Congress-session constants and warning in SOI district puller

policyengine_us_data/datasets/cps/local_area_calibration/calibration_utils.py

Lines changed: 0 additions & 24 deletions
Original file line numberDiff line numberDiff line change
@@ -248,30 +248,6 @@ def get_calculated_variables(sim) -> List[str]:
248248
return result
249249

250250

251-
def get_pseudo_input_variables(sim) -> set:
252-
"""
253-
Identify pseudo-input variables that should NOT be saved to H5 files.
254-
255-
NOTE: This function currently returns an empty set. The original logic
256-
excluded variables with 'adds' or 'subtracts' attributes, but analysis
257-
showed that in CPS data, these variables contain authoritative stored
258-
data that does NOT match their component variables:
259-
260-
- pre_tax_contributions: components are all 0, aggregate has imputed values
261-
- tax_exempt_pension_income: aggregate has 135M, components only 20M
262-
- taxable_pension_income: aggregate has 82M, components only 29M
263-
- interest_deduction: aggregate has 41M, components are 0
264-
265-
The 'adds' attribute defines how to CALCULATE these values, but in CPS
266-
data the stored values are the authoritative source. Excluding them and
267-
recalculating from components produces incorrect results.
268-
269-
For geo-stacking, entity ID reindexing preserves within-entity
270-
relationships, so aggregation within a person or tax_unit remains valid.
271-
"""
272-
return set()
273-
274-
275251
def apply_op(values: np.ndarray, op: str, val: str) -> np.ndarray:
276252
"""Apply constraint operation to values array."""
277253
try:

policyengine_us_data/datasets/cps/local_area_calibration/create_stratified_cps.py

Lines changed: 0 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -13,9 +13,6 @@
1313
from policyengine_us import Microsimulation
1414
from policyengine_core.data.dataset import Dataset
1515
from policyengine_core.enums import Enum
16-
from policyengine_us_data.datasets.cps.local_area_calibration.calibration_utils import (
17-
get_pseudo_input_variables,
18-
)
1916

2017

2118
def create_stratified_cps_dataset(
@@ -225,16 +222,6 @@ def create_stratified_cps_dataset(
225222

226223
# Only save input variables (not calculated/derived variables)
227224
input_vars = set(sim.input_variables)
228-
229-
# Filter out pseudo-inputs: variables with adds/subtracts that aggregate
230-
# formula-based components. These have stale values that corrupt calculations.
231-
pseudo_inputs = get_pseudo_input_variables(sim)
232-
if pseudo_inputs:
233-
print(f"Excluding {len(pseudo_inputs)} pseudo-input variables:")
234-
for var in sorted(pseudo_inputs):
235-
print(f" - {var}")
236-
input_vars = input_vars - pseudo_inputs
237-
238225
print(f"Found {len(input_vars)} input variables to save")
239226

240227
for variable in stratified_sim.tax_benefit_system.variables:

policyengine_us_data/datasets/cps/local_area_calibration/stacked_dataset_builder.py

Lines changed: 1 addition & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -14,7 +14,6 @@
1414
from policyengine_us_data.datasets.cps.local_area_calibration.calibration_utils import (
1515
get_all_cds_from_database,
1616
get_calculated_variables,
17-
get_pseudo_input_variables,
1817
STATE_CODES,
1918
STATE_FIPS_TO_NAME,
2019
STATE_FIPS_TO_CODE,
@@ -624,17 +623,7 @@ def create_sparse_cd_stacked_dataset(
624623

625624
# Only save input variables (not calculated/derived variables)
626625
# Calculated variables like state_name, state_code will be recalculated on load
627-
input_vars = set(base_sim.input_variables)
628-
629-
# Filter out pseudo-inputs: variables with adds/subtracts that aggregate
630-
# formula-based components. These have stale values that corrupt calculations.
631-
pseudo_inputs = get_pseudo_input_variables(base_sim)
632-
if pseudo_inputs:
633-
print(f"Excluding {len(pseudo_inputs)} pseudo-input variables:")
634-
for var in sorted(pseudo_inputs):
635-
print(f" - {var}")
636-
637-
vars_to_save = input_vars - pseudo_inputs
626+
vars_to_save = set(base_sim.input_variables)
638627
print(f"Found {len(vars_to_save)} input variables to save")
639628

640629
# congressional_district_geoid isn't in the original microdata and has no formula,

policyengine_us_data/db/create_initial_strata.py

Lines changed: 2 additions & 24 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,3 @@
1-
import argparse
21
import logging
32
from typing import Dict
43

@@ -7,12 +6,11 @@
76
from sqlmodel import Session, create_engine
87

98
from policyengine_us_data.storage import STORAGE_FOLDER
10-
11-
DEFAULT_DATASET = "hf://policyengine/policyengine-us-data/calibration/stratified_extended_cps.h5"
129
from policyengine_us_data.db.create_database_tables import (
1310
Stratum,
1411
StratumConstraint,
1512
)
13+
from policyengine_us_data.utils.db import etl_argparser
1614
from policyengine_us_data.utils.raw_cache import (
1715
is_cached,
1816
save_json,
@@ -71,27 +69,7 @@ def fetch_congressional_districts(year):
7169

7270

7371
def main():
74-
parser = argparse.ArgumentParser(
75-
description="Create initial geographic strata for calibration"
76-
)
77-
parser.add_argument(
78-
"--dataset",
79-
default=DEFAULT_DATASET,
80-
help=(
81-
"Source dataset (local path or HuggingFace URL). "
82-
"The year for Census API calls is derived from the dataset's "
83-
"default_calculation_period. Default: %(default)s"
84-
),
85-
)
86-
args = parser.parse_args()
87-
88-
# Derive year from dataset
89-
from policyengine_us import Microsimulation
90-
91-
print(f"Loading dataset: {args.dataset}")
92-
sim = Microsimulation(dataset=args.dataset)
93-
year = int(sim.default_calculation_period)
94-
print(f"Derived year from dataset: {year}")
72+
_, year = etl_argparser("Create initial geographic strata for calibration")
9573

9674
# State FIPS to name/abbreviation mapping
9775
STATE_NAMES = {

policyengine_us_data/db/etl_age.py

Lines changed: 6 additions & 27 deletions
Original file line numberDiff line numberDiff line change
@@ -1,21 +1,20 @@
1-
import argparse
2-
31
import pandas as pd
42
import numpy as np
53
from sqlmodel import Session, create_engine, select
64

75
from policyengine_us_data.storage import STORAGE_FOLDER
8-
9-
DEFAULT_DATASET = "hf://policyengine/policyengine-us-data/calibration/stratified_extended_cps.h5"
10-
116
from policyengine_us_data.db.create_database_tables import (
127
Stratum,
138
StratumConstraint,
149
Target,
1510
SourceType,
1611
)
1712
from policyengine_us_data.utils.census import get_census_docs, pull_acs_table
18-
from policyengine_us_data.utils.db import parse_ucgid, get_geographic_strata
13+
from policyengine_us_data.utils.db import (
14+
parse_ucgid,
15+
get_geographic_strata,
16+
etl_argparser,
17+
)
1918
from policyengine_us_data.utils.db_metadata import (
2019
get_or_create_source,
2120
get_or_create_variable_group,
@@ -287,27 +286,7 @@ def load_age_data(df_long, geo, year):
287286

288287

289288
def main():
290-
parser = argparse.ArgumentParser(
291-
description="ETL for age calibration targets"
292-
)
293-
parser.add_argument(
294-
"--dataset",
295-
default=DEFAULT_DATASET,
296-
help=(
297-
"Source dataset (local path or HuggingFace URL). "
298-
"The year for Census API calls is derived from the dataset's "
299-
"default_calculation_period. Default: %(default)s"
300-
),
301-
)
302-
args = parser.parse_args()
303-
304-
# Derive year from dataset
305-
from policyengine_us import Microsimulation
306-
307-
print(f"Loading dataset: {args.dataset}")
308-
sim = Microsimulation(dataset=args.dataset)
309-
year = int(sim.default_calculation_period)
310-
print(f"Derived year from dataset: {year}")
289+
_, year = etl_argparser("ETL for age calibration targets")
311290

312291
# --- ETL: Extract, Transform, Load ----
313292

policyengine_us_data/db/etl_irs_soi.py

Lines changed: 27 additions & 46 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,3 @@
1-
import argparse
21
import logging
32
from typing import Optional
43

@@ -8,19 +7,6 @@
87
from sqlmodel import Session, create_engine, select
98

109
from policyengine_us_data.storage import STORAGE_FOLDER
11-
12-
DEFAULT_DATASET = "hf://policyengine/policyengine-us-data/calibration/stratified_extended_cps.h5"
13-
14-
# IRS SOI data is typically available ~2 years after the tax year
15-
IRS_SOI_LAG_YEARS = 2
16-
from policyengine_us_data.utils.raw_cache import (
17-
is_cached,
18-
cache_path,
19-
save_bytes,
20-
)
21-
22-
logger = logging.getLogger(__name__)
23-
2410
from policyengine_us_data.db.create_database_tables import (
2511
Stratum,
2612
StratumConstraint,
@@ -34,6 +20,7 @@
3420
get_stratum_parent,
3521
parse_ucgid,
3622
get_geographic_strata,
23+
etl_argparser,
3724
)
3825
from policyengine_us_data.utils.db_metadata import (
3926
get_or_create_source,
@@ -44,6 +31,17 @@
4431
from policyengine_us_data.storage.calibration_targets.make_district_mapping import (
4532
get_district_mapping,
4633
)
34+
from policyengine_us_data.utils.raw_cache import (
35+
is_cached,
36+
cache_path,
37+
save_bytes,
38+
)
39+
40+
logger = logging.getLogger(__name__)
41+
42+
43+
# IRS SOI data is typically available ~2 years after the tax year
44+
IRS_SOI_LAG_YEARS = 2
4745

4846
"""See the 22incddocguide.docx manual from the IRS SOI"""
4947
# Language in the doc: '$10,000 under $25,000' means >= $10,000 and < $25,000
@@ -1236,40 +1234,23 @@ def load_soi_data(long_dfs, year):
12361234

12371235

12381236
def main():
1239-
parser = argparse.ArgumentParser(
1240-
description="ETL for IRS SOI calibration targets"
1241-
)
1242-
parser.add_argument(
1243-
"--dataset",
1244-
default=DEFAULT_DATASET,
1245-
help=(
1246-
"Source dataset (local path or HuggingFace URL). "
1247-
"The year for IRS SOI data is derived from the dataset's "
1248-
"default_calculation_period minus IRS_SOI_LAG_YEARS. "
1249-
"Default: %(default)s"
1250-
),
1251-
)
1252-
parser.add_argument(
1253-
"--lag",
1254-
type=int,
1255-
default=IRS_SOI_LAG_YEARS,
1256-
help=(
1257-
"Years to subtract from dataset year for IRS SOI data "
1258-
"(default: %(default)s, since IRS data is ~2 years behind)"
1259-
),
1260-
)
1261-
args = parser.parse_args()
1262-
1263-
# Derive year from dataset with lag applied
1264-
from policyengine_us import Microsimulation
1237+
def add_lag_arg(parser):
1238+
parser.add_argument(
1239+
"--lag",
1240+
type=int,
1241+
default=IRS_SOI_LAG_YEARS,
1242+
help=(
1243+
"Years to subtract from dataset year for IRS SOI data "
1244+
"(default: %(default)s, since IRS data is ~2 years behind)"
1245+
),
1246+
)
12651247

1266-
print(f"Loading dataset: {args.dataset}")
1267-
sim = Microsimulation(dataset=args.dataset)
1268-
dataset_year = int(sim.default_calculation_period)
1269-
year = dataset_year - args.lag
1270-
print(
1271-
f"Dataset year: {dataset_year}, IRS SOI year: {year} (lag={args.lag})"
1248+
args, dataset_year = etl_argparser(
1249+
"ETL for IRS SOI calibration targets",
1250+
extra_args_fn=add_lag_arg,
12721251
)
1252+
year = dataset_year - args.lag
1253+
print(f"IRS SOI year: {year} (lag={args.lag})")
12731254

12741255
# Extract -----------------------
12751256
raw_df = extract_soi_data()

0 commit comments

Comments
 (0)