Commit 79c36cd

baogorek and claude authored
Fix stale calibration targets by deriving time_period from dataset (#505)
* Fix stale calibration targets by deriving time_period from dataset

  - Remove hardcoded CBO_YEAR and TREASURY_YEAR constants
  - Add --dataset CLI argument to etl_national_targets.py
  - Derive time_period from sim.default_calculation_period
  - Default to HuggingFace production dataset

  The dataset itself is now the single source of truth for the calibration year, preventing future drift when updating to new base years.

  Closes #503

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Use income_tax_positive for CBO calibration in loss.py

  The CBO income_tax parameter represents positive-only receipts (refundable credit payments in excess of liability are classified as outlays, not negative receipts). Using income_tax_positive matches this definition.

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add --dataset argument to all database ETL scripts

  All ETL scripts now derive their target year from the dataset's default_calculation_period instead of hardcoding years. This ensures all calibration targets stay synchronized when updating to a new base year annually.

  Updated scripts:
  - create_initial_strata.py
  - etl_age.py
  - etl_irs_soi.py (with configurable --lag for IRS data delay)
  - etl_medicaid.py
  - etl_snap.py
  - etl_state_income_tax.py

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add 119th Congress district code support for 2024 ACS data

  - Update parse_ucgid to recognize both 5001800US (118th) and 5001900US (119th Congress)
  - Expand Puerto Rico and territory filters to handle both Congress code formats
  - Update TERRITORY_UCGIDS and NON_VOTING_GEO_IDS with 119th Congress codes

  This ensures consistent redistricting alignment: 2024 ACS data uses 119th Congress codes natively, and IRS SOI data is converted via the 116th→119th mapping matrix.

  Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

* Remove seed-related changes to reduce PR scope

  Revert deterministic hash-based medicaid/SSI seed logic in cps.py, update Makefile seed to 3526.

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Upgrade policyengine-us to 1.550.1 in uv.lock

  Needed for income_tax_positive variable used in loss.py.

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Cherry-pick ACA PTC targets from PR #508 and update changelog

  Adds aca_ptc ingestion from IRS SOI data (code 85530) to etl_irs_soi.py and updates DATABASE_GUIDE.md to reflect stratum_group_id 119.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Split local area publish into build+stage and promote phases

  Prevents silent no-op promotes by detecting when HF commits don't change HEAD. Adds separate promote workflow for manual gate before pushing staging files to production. Also bumps calibration epochs from 200 to 250.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
1 parent 5953a08 commit 79c36cd
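
The 119th Congress bullet in the commit message describes accepting two congressional-district UCGID prefixes. The `parse_ucgid` change itself is not shown on this page, so the following is only a minimal sketch of the idea. The function name `parse_district_ucgid`, the returned dictionary keys, and the assumption that the prefix is followed by a 2-digit state FIPS code and a 2-digit district number are all hypothetical.

```python
import re
from typing import Optional

# Prefixes named in the commit message: 118th and 119th Congress.
CD_PREFIXES = ("5001800US", "5001900US")


def parse_district_ucgid(ucgid: str) -> Optional[dict]:
    """Hypothetical helper: split a district UCGID into state and district.

    Assumes the prefix is followed by exactly four digits
    (2-digit state FIPS + 2-digit district number).
    """
    for prefix in CD_PREFIXES:
        if ucgid.startswith(prefix):
            rest = ucgid[len(prefix):]
            if re.fullmatch(r"\d{4}", rest):
                return {
                    "state_fips": int(rest[:2]),
                    "district": int(rest[2:]),
                }
    # Not a recognized congressional-district code.
    return None
```

Keeping the prefixes in one tuple means a later Congress code could be supported by adding a single entry, rather than touching every filter that checks the prefix.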

21 files changed

Lines changed: 513 additions & 123 deletions
Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
+name: Promote Local Area H5 Files
+
+on:
+  workflow_dispatch:
+    inputs:
+      version:
+        description: 'Version to promote (e.g. 1.23.0)'
+        required: true
+        type: string
+      branch:
+        description: 'Branch to use for repo setup'
+        required: false
+        default: 'main'
+        type: string
+
+jobs:
+  promote-local-area:
+    runs-on: ubuntu-latest
+    permissions:
+      contents: read
+    env:
+      HUGGING_FACE_TOKEN: ${{ secrets.HUGGING_FACE_TOKEN }}
+      MODAL_TOKEN_ID: ${{ secrets.MODAL_TOKEN_ID }}
+      MODAL_TOKEN_SECRET: ${{ secrets.MODAL_TOKEN_SECRET }}
+
+    steps:
+      - name: Checkout repo
+        uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: '3.13'
+
+      - name: Install Modal CLI
+        run: pip install modal
+
+      - name: Promote staged files to production
+        run: |
+          VERSION="${{ github.event.inputs.version }}"
+          BRANCH="${{ github.event.inputs.branch }}"
+          echo "Promoting version ${VERSION} from branch ${BRANCH}"
+          modal run modal_app/local_area.py::main_promote --version="${VERSION}" --branch="${BRANCH}"

.github/workflows/local_area_publish.yaml

Lines changed: 11 additions & 1 deletion
@@ -49,7 +49,7 @@ jobs:
       - name: Install Modal CLI
         run: pip install modal
 
-      - name: Run local area publishing on Modal
+      - name: Run local area build and stage on Modal
         run: |
           NUM_WORKERS="${{ github.event.inputs.num_workers || '8' }}"
           SKIP_UPLOAD="${{ github.event.inputs.skip_upload || 'false' }}"
@@ -63,3 +63,13 @@ jobs:
 
           echo "Running: $CMD"
           $CMD
+
+      - name: Post-build summary
+        if: success()
+        run: |
+          echo "## Build + Stage Complete" >> $GITHUB_STEP_SUMMARY
+          echo "" >> $GITHUB_STEP_SUMMARY
+          echo "Files have been uploaded to GCS and staged on HuggingFace." >> $GITHUB_STEP_SUMMARY
+          echo "" >> $GITHUB_STEP_SUMMARY
+          echo "### Next step: Promote to production" >> $GITHUB_STEP_SUMMARY
+          echo "Trigger the **Promote Local Area H5 Files** workflow with the version from the build output." >> $GITHUB_STEP_SUMMARY

Makefile

Lines changed: 1 addition & 1 deletion
@@ -87,7 +87,7 @@ data: download
 	python policyengine_us_data/datasets/cps/extended_cps.py
 	python policyengine_us_data/datasets/cps/enhanced_cps.py
 	python policyengine_us_data/datasets/cps/small_enhanced_cps.py
-	python policyengine_us_data/datasets/cps/local_area_calibration/create_stratified_cps.py 10500
+	python policyengine_us_data/datasets/cps/local_area_calibration/create_stratified_cps.py 12000 --top=99.5 --seed=3526
 
 publish-local-area:
 	python policyengine_us_data/datasets/cps/local_area_calibration/publish_local_area.py

changelog_entry.yaml

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+- date: 2026-02-02
+  type: fixed
+  description: Fix stale calibration targets by deriving time_period from dataset across all ETL scripts, using income_tax_positive for CBO calibration, and adding 119th Congress district code support for consistent redistricting alignment
+- date: 2026-02-07
+  type: added
+  description: Add ACA Premium Tax Credit targets from IRS SOI data (cherry-picked from PR #508)

modal_app/local_area.py

Lines changed: 97 additions & 17 deletions
@@ -259,14 +259,11 @@ def validate_staging(branch: str, version: str) -> Dict:
     memory=8192,
     timeout=14400,
 )
-def atomic_upload(branch: str, version: str, manifest: Dict) -> str:
+def upload_to_staging(branch: str, version: str, manifest: Dict) -> str:
     """
-    Upload files using staging approach for atomic deployment.
+    Upload files to GCS (production) and HuggingFace (staging only).
 
-    1. Upload to GCS (direct, overwrites existing)
-    2. Upload to HuggingFace staging/ folder
-    3. Atomically promote staging/ to production paths
-    4. Clean up staging/
+    Promote must be run separately via promote_publish.
     """
     setup_gcp_credentials()
     setup_repo(branch)
@@ -286,8 +283,6 @@ def atomic_upload(branch: str, version: str, manifest: Dict) -> str:
 from policyengine_us_data.utils.data_upload import (
     upload_local_area_file,
     upload_to_staging_hf,
-    promote_staging_to_production_hf,
-    cleanup_staging_hf,
 )
 
 manifest = json.loads('''{manifest_json}''')
@@ -306,11 +301,9 @@ def atomic_upload(branch: str, version: str, manifest: Dict) -> str:
 print(f"Verified {{verification['verified']}} files")
 
 files_with_paths = []
-rel_paths = []
 for rel_path in manifest["files"].keys():
     local_path = version_dir / rel_path
     files_with_paths.append((local_path, rel_path))
-    rel_paths.append(rel_path)
 
 # Upload to GCS (direct to production paths)
 print(f"Uploading {{len(files_with_paths)}} files to GCS...")
@@ -331,12 +324,73 @@ def atomic_upload(branch: str, version: str, manifest: Dict) -> str:
 hf_count = upload_to_staging_hf(files_with_paths, version)
 print(f"Uploaded {{hf_count}} files to HuggingFace staging/")
 
-# Atomically promote staging to production
-print("Promoting staging/ to production (atomic commit)...")
+print(f"Staged version {{version}} for promotion")
+""",
+        ],
+        text=True,
+        env=os.environ.copy(),
+    )
+
+    if result.returncode != 0:
+        raise RuntimeError(f"Upload failed: {result.stderr}")
+
+    return (
+        f"Staged version {version} with {len(manifest['files'])} files. "
+        f"Run promote workflow to publish to HuggingFace production."
+    )
+
+
+@app.function(
+    image=image,
+    secrets=[hf_secret],
+    volumes={VOLUME_MOUNT: staging_volume},
+    memory=4096,
+    timeout=3600,
+)
+def promote_publish(branch: str = "main", version: str = "") -> str:
+    """
+    Promote staged files from HF staging/ to production paths, then cleanup.
+
+    Reads the manifest from the Modal staging volume to determine which
+    files to promote.
+    """
+    setup_repo(branch)
+
+    staging_dir = Path(VOLUME_MOUNT)
+    staging_volume.reload()
+
+    manifest_path = staging_dir / version / "manifest.json"
+    if not manifest_path.exists():
+        raise RuntimeError(
+            f"No manifest found at {manifest_path}. "
+            f"Run build+stage workflow first."
+        )
+
+    with open(manifest_path) as f:
+        manifest = json.load(f)
+
+    rel_paths_json = json.dumps(list(manifest["files"].keys()))
+
+    result = subprocess.run(
+        [
+            "uv",
+            "run",
+            "python",
+            "-c",
+            f"""
+import json
+from policyengine_us_data.utils.data_upload import (
+    promote_staging_to_production_hf,
+    cleanup_staging_hf,
+)
+
+rel_paths = json.loads('''{rel_paths_json}''')
+version = "{version}"
+
+print(f"Promoting {{len(rel_paths)}} files from staging/ to production...")
 promoted = promote_staging_to_production_hf(rel_paths, version)
 print(f"Promoted {{promoted}} files to production")
 
-# Clean up staging
 print("Cleaning up staging/...")
 cleaned = cleanup_staging_hf(rel_paths, version)
 print(f"Cleaned up {{cleaned}} files from staging/")
@@ -349,9 +403,9 @@ def atomic_upload(branch: str, version: str, manifest: Dict) -> str:
     )
 
     if result.returncode != 0:
-        raise RuntimeError(f"Upload failed: {result.stderr}")
+        raise RuntimeError(f"Promote failed: {result.stderr}")
 
-    return f"Successfully published version {version} with {len(manifest['files'])} files"
+    return f"Successfully promoted version {version} with {len(manifest['files'])} files"
 
 
 @app.function(
@@ -544,10 +598,24 @@ def coordinate_publish(
             f"WARNING: Expected {expected_total} files, found {actual_total}"
         )
 
-    print("\nStarting atomic upload...")
-    result = atomic_upload.remote(
+    print("\nStarting upload to staging...")
+    result = upload_to_staging.remote(
         branch=branch, version=version, manifest=manifest
     )
+    print(result)
+
+    print("\n" + "=" * 60)
+    print("BUILD + STAGE COMPLETE")
+    print("=" * 60)
+    print(
+        f"To promote to HuggingFace production, run the "
+        f"'Promote Local Area H5 Files' workflow with version={version}"
+    )
+    print(
+        "Or run manually: modal run modal_app/local_area.py::main_promote "
+        f"--version={version}"
+    )
+    print("=" * 60)
 
     return result
 
@@ -565,3 +633,15 @@ def main(
         skip_upload=skip_upload,
     )
     print(result)
+
+
+@app.local_entrypoint()
+def main_promote(
+    version: str = "",
+    branch: str = "main",
+):
+    """Promote staged files to HuggingFace production."""
+    if not version:
+        raise ValueError("--version is required")
+    result = promote_publish.remote(branch=branch, version=version)
+    print(result)
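
The commit message says the split "prevents silent no-op promotes by detecting when HF commits don't change HEAD", but the detection code lives in `data_upload.py`, which is not part of the diff shown here. The sketch below illustrates one way such a check could work. The helper name `commit_and_verify` and its two injected callables are hypothetical and stand in for whatever the real upload utilities use to read the repository HEAD and create the commit.

```python
from typing import Callable


def commit_and_verify(
    get_head_sha: Callable[[], str],
    commit: Callable[[], None],
) -> str:
    """Hypothetical guard: run a commit and fail loudly if HEAD is unchanged.

    get_head_sha returns the current HEAD revision of the target repo;
    commit performs the promote operation. Both are supplied by the caller.
    """
    before = get_head_sha()
    commit()
    after = get_head_sha()
    if after == before:
        # The commit call returned without error but produced no new revision.
        raise RuntimeError(
            "Promote was a no-op: repository HEAD did not change"
        )
    return after
```

Injecting the two callables keeps the guard independent of any particular hosting client, which also makes it easy to exercise with a fake repository in tests.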

policyengine_us_data/datasets/cps/enhanced_cps.py

Lines changed: 1 addition & 1 deletion
@@ -196,7 +196,7 @@ def generate(self):
                 loss_matrix_clean,
                 targets_array_clean,
                 log_path="calibration_log.csv",
-                epochs=200,
+                epochs=250,
                 seed=1456,
             )
             data["household_weight"][year] = optimised_weights

policyengine_us_data/datasets/cps/local_area_calibration/calibration_utils.py

Lines changed: 17 additions & 32 deletions
@@ -252,39 +252,24 @@ def get_pseudo_input_variables(sim) -> set:
     """
     Identify pseudo-input variables that should NOT be saved to H5 files.
 
-    A pseudo-input is a variable that:
-    - Appears in sim.input_variables (has stored values)
-    - Has 'adds' or 'subtracts' attribute
-    - At least one component has a formula (is calculated)
-
-    These variables have stale pre-computed values that corrupt calculations
-    when reloaded, because the stored value overrides the formula.
+    NOTE: This function currently returns an empty set. The original logic
+    excluded variables with 'adds' or 'subtracts' attributes, but analysis
+    showed that in CPS data, these variables contain authoritative stored
+    data that does NOT match their component variables:
+
+    - pre_tax_contributions: components are all 0, aggregate has imputed values
+    - tax_exempt_pension_income: aggregate has 135M, components only 20M
+    - taxable_pension_income: aggregate has 82M, components only 29M
+    - interest_deduction: aggregate has 41M, components are 0
+
+    The 'adds' attribute defines how to CALCULATE these values, but in CPS
+    data the stored values are the authoritative source. Excluding them and
+    recalculating from components produces incorrect results.
+
+    For geo-stacking, entity ID reindexing preserves within-entity
+    relationships, so aggregation within a person or tax_unit remains valid.
     """
-    tbs = sim.tax_benefit_system
-    pseudo_inputs = set()
-
-    for var_name in sim.input_variables:
-        var = tbs.variables.get(var_name)
-        if not var:
-            continue
-
-        adds = getattr(var, "adds", None)
-        if adds and isinstance(adds, list):
-            for component in adds:
-                comp_var = tbs.variables.get(component)
-                if comp_var and len(getattr(comp_var, "formulas", {})) > 0:
-                    pseudo_inputs.add(var_name)
-                    break
-
-        subtracts = getattr(var, "subtracts", None)
-        if subtracts and isinstance(subtracts, list):
-            for component in subtracts:
-                comp_var = tbs.variables.get(component)
-                if comp_var and len(getattr(comp_var, "formulas", {})) > 0:
-                    pseudo_inputs.add(var_name)
-                    break
-
-    return pseudo_inputs
+    return set()
 
 
 def apply_op(values: np.ndarray, op: str, val: str) -> np.ndarray:
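
The new docstring argues that a stored aggregate cannot be dropped when it disagrees with the sum of its components. A small check along those lines might look like the sketch below. The helper `component_sum_matches` and its tolerance are hypothetical and are not part of the repository; the example figures in the usage note are the ones quoted in the docstring.

```python
from typing import Iterable


def component_sum_matches(
    stored_total: float,
    component_totals: Iterable[float],
    rel_tol: float = 0.01,
) -> bool:
    """Hypothetical check: does the stored aggregate equal the component sum?

    Returns True only when recomputing from components would reproduce the
    stored value to within rel_tol (relative to the stored value).
    """
    recomputed = sum(component_totals)
    if stored_total == 0:
        return recomputed == 0
    return abs(stored_total - recomputed) / abs(stored_total) <= rel_tol
```

With the docstring's figures for tax_exempt_pension_income (aggregate 135M, components 20M), `component_sum_matches(135e6, [20e6])` is false, which is the situation in which discarding the stored value and recalculating would lose information.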

policyengine_us_data/db/DATABASE_GUIDE.md

Lines changed: 1 addition & 1 deletion
@@ -110,7 +110,7 @@ The `stratum_group_id` field categorizes strata:
 | 5 | Medicaid | Medicaid enrollment strata |
 | 6 | EITC | EITC recipients by qualifying children |
 | 7 | State Income Tax | State-level income tax collections (Census STC) |
-| 100-118 | IRS Conditional | Each IRS variable paired with conditional count constraints |
+| 100-119 | IRS Conditional | Each IRS variable paired with conditional count constraints (includes ACA PTC at 119) |
 
 ### Conditional Strata (IRS SOI)
 

policyengine_us_data/db/create_initial_strata.py

Lines changed: 26 additions & 2 deletions
@@ -1,3 +1,4 @@
+import argparse
 import logging
 from typing import Dict
 
@@ -6,6 +7,8 @@
 from sqlmodel import Session, create_engine
 
 from policyengine_us_data.storage import STORAGE_FOLDER
+
+DEFAULT_DATASET = "hf://policyengine/policyengine-us-data/calibration/stratified_extended_cps.h5"
 from policyengine_us_data.db.create_database_tables import (
     Stratum,
     StratumConstraint,
@@ -68,6 +71,28 @@ def fetch_congressional_districts(year):
 
 
 def main():
+    parser = argparse.ArgumentParser(
+        description="Create initial geographic strata for calibration"
+    )
+    parser.add_argument(
+        "--dataset",
+        default=DEFAULT_DATASET,
+        help=(
+            "Source dataset (local path or HuggingFace URL). "
+            "The year for Census API calls is derived from the dataset's "
+            "default_calculation_period. Default: %(default)s"
+        ),
+    )
+    args = parser.parse_args()
+
+    # Derive year from dataset
+    from policyengine_us import Microsimulation
+
+    print(f"Loading dataset: {args.dataset}")
+    sim = Microsimulation(dataset=args.dataset)
+    year = int(sim.default_calculation_period)
+    print(f"Derived year from dataset: {year}")
+
     # State FIPS to name/abbreviation mapping
     STATE_NAMES = {
         1: "Alabama (AL)",
@@ -123,8 +148,7 @@ def main():
         56: "Wyoming (WY)",
     }
 
-    # Fetch congressional district data for year 2023
-    year = 2023
+    # Fetch congressional district data
     cd_df = fetch_congressional_districts(year)
 
     DATABASE_URL = (
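
The commit message notes that etl_irs_soi.py takes a configurable `--lag` for the IRS data delay, on top of the same `--dataset` argument shown above. That script's diff is not on this page, so the following is a sketch only. The default lag of 2, the function names, and the rule "target year = dataset year minus lag" are assumptions made for illustration; the default dataset URL is the one from the diff above.

```python
import argparse

# Default dataset URL as it appears in create_initial_strata.py.
DEFAULT_DATASET = (
    "hf://policyengine/policyengine-us-data/"
    "calibration/stratified_extended_cps.h5"
)
DEFAULT_LAG = 2  # assumption: illustrative default, not taken from the source


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser combining --dataset with a --lag offset."""
    parser = argparse.ArgumentParser(
        description="Derive an ETL target year from a dataset"
    )
    parser.add_argument("--dataset", default=DEFAULT_DATASET)
    parser.add_argument(
        "--lag",
        type=int,
        default=DEFAULT_LAG,
        help="Years by which the source data trails the dataset year",
    )
    return parser


def target_year(default_calculation_period, lag: int = 0) -> int:
    """Assumed rule: subtract the lag from the dataset's calculation period."""
    return int(default_calculation_period) - lag
```

In use, the period would come from `sim.default_calculation_period` as in the diff above, for example `target_year(sim.default_calculation_period, args.lag)`. Keeping the subtraction in a small pure function means it can be tested without loading a dataset.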
