Commit 46b3609

MaxGhenis and claude authored
Add state income tax calibration targets from Census STC (#497)
Adds an ETL pipeline for state-level individual income tax collections from the Census Bureau's Annual Survey of State Government Tax Collections (STC), using FY2023 data for all 50 states + DC ($531B total).

Recreated from PR #493 rebased onto main.

Closes #492

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
1 parent f780aac commit 46b3609
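
The new `etl_state_income_tax.py` script itself is not shown in this view, so the following is only a minimal sketch of the general shape such a hardcoded ETL step could take, not the commit's implementation. The `Target` dataclass, the `build_targets` helper, the `period` field, and the dollar values are all hypothetical placeholders; only the variable name `state_income_tax` and stratum group id 7 come from the diff below.

```python
from dataclasses import dataclass

# Placeholder values for illustration only; these are NOT Census STC figures.
HYPOTHETICAL_COLLECTIONS_USD = {
    "CA": 100.0,
    "NY": 50.0,
    "TX": 0.0,  # Texas levies no individual income tax
}

# stratum_group_id for state income tax, as documented in this commit.
STATE_INCOME_TAX_GROUP_ID = 7


@dataclass(frozen=True)
class Target:
    """Hypothetical record for one state-level calibration target."""

    state: str
    variable: str
    value: float
    stratum_group_id: int
    period: int


def build_targets(collections, period):
    """Turn a state -> dollars mapping into one target record per state."""
    return [
        Target(
            state=state,
            variable="state_income_tax",
            value=float(value),
            stratum_group_id=STATE_INCOME_TAX_GROUP_ID,
            period=period,
        )
        for state, value in sorted(collections.items())
    ]


# Assumption: targets are keyed to the fiscal year of the source data.
targets = build_targets(HYPOTHETICAL_COLLECTIONS_USD, period=2023)
total = sum(t.value for t in targets)
```

Summing the per-state values, as in the last line, is one way a script like this could sanity-check its hardcoded table against a published national total.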

5 files changed

Lines changed: 399 additions & 3 deletions

Makefile

Lines changed: 1 addition & 0 deletions
@@ -61,6 +61,7 @@ database:
 	python policyengine_us_data/db/etl_age.py
 	python policyengine_us_data/db/etl_medicaid.py
 	python policyengine_us_data/db/etl_snap.py
+	python policyengine_us_data/db/etl_state_income_tax.py
 	python policyengine_us_data/db/etl_irs_soi.py
 	python policyengine_us_data/db/validate_database.py
 

changelog_entry.yaml

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+- bump: minor
+  changes:
+    added:
+    - Add state income tax calibration targets from Census STC FY2023 data

policyengine_us_data/datasets/cps/local_area_calibration/fit_calibration_weights.py

Lines changed: 2 additions & 1 deletion
@@ -105,10 +105,11 @@
 targets_df, X_sparse, household_id_mapping = builder.build_matrix(
     sim,
     target_filter={
-        "stratum_group_ids": [4],
+        "stratum_group_ids": [4, 7], # 4=SNAP households, 7=state income tax
         "variables": [
             "health_insurance_premiums_without_medicare_part_b",
             "snap",
+            "state_income_tax", # Census STC state income tax collections
         ],
     },
 )

policyengine_us_data/db/DATABASE_GUIDE.md

Lines changed: 5 additions & 2 deletions
@@ -30,8 +30,9 @@ make promote-database # Copy DB + raw inputs to HuggingFace clone
 | 4 | `etl_age.py` | Census ACS 1-year | Age distribution: 18 bins x 488 geographies |
 | 5 | `etl_medicaid.py` | Census ACS + CMS | Medicaid enrollment (admin state-level, survey district-level) |
 | 6 | `etl_snap.py` | USDA FNS + Census ACS | SNAP participation (admin state-level, survey district-level) |
-| 7 | `etl_irs_soi.py` | IRS | Tax variables, EITC by child count, AGI brackets, conditional strata |
-| 8 | `validate_database.py` | No | Checks all target variables exist in policyengine-us |
+| 7 | `etl_state_income_tax.py` | No | State income tax collections (Census STC FY2023, hardcoded) |
+| 8 | `etl_irs_soi.py` | IRS | Tax variables, EITC by child count, AGI brackets, conditional strata |
+| 9 | `validate_database.py` | No | Checks all target variables exist in policyengine-us |
 
 ### Raw Input Caching
 
@@ -108,6 +109,7 @@ The `stratum_group_id` field categorizes strata:
 | 4 | SNAP | SNAP recipient strata |
 | 5 | Medicaid | Medicaid enrollment strata |
 | 6 | EITC | EITC recipients by qualifying children |
+| 7 | State Income Tax | State-level income tax collections (Census STC) |
 | 100-118 | IRS Conditional | Each IRS variable paired with conditional count constraints |
 
 ### Conditional Strata (IRS SOI)
@@ -216,6 +218,7 @@ SELECT
         WHEN 4 THEN 'SNAP'
         WHEN 5 THEN 'Medicaid'
         WHEN 6 THEN 'EITC'
+        WHEN 7 THEN 'State Income Tax'
     END AS group_name,
     COUNT(*) AS stratum_count
 FROM strata
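
For readers who prefer to check the grouping outside SQL, the `CASE` mapping in the hunk above can be mirrored in Python. This is a rough sketch under stated assumptions: `GROUP_NAMES` covers only the ids visible in this hunk (4 through 7), `count_by_group` is a hypothetical helper rather than part of the repository, and the "Unknown" fallback label is an illustrative choice that the guide's query does not show.

```python
from collections import Counter

# Only the ids visible in the hunk above; the guide's full query lists more.
GROUP_NAMES = {
    4: "SNAP",
    5: "Medicaid",
    6: "EITC",
    7: "State Income Tax",  # added in this commit
}


def count_by_group(stratum_group_ids):
    """Count strata per group label, like COUNT(*) grouped by the CASE label."""
    counts = Counter(stratum_group_ids)
    return {
        # Fallback label is an assumption for ids this sketch does not map.
        GROUP_NAMES.get(group_id, f"Unknown ({group_id})"): n
        for group_id, n in sorted(counts.items())
    }


# Hypothetical input: one stratum_group_id per row of the strata table.
example_counts = count_by_group([4, 7, 7, 5, 7])
```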
