Commit 97735f0

Fix EITC calibration bugs: 2-kid bucket, surviving-spouse mask, validation reference, stale SOI targets (#767)
1. EITC by-kids 2-kid bucket

The EITC-by-kids calibration in `build_loss_matrix` treated IRS SOI rows 2 and 3 as nested (`>=`) when the source data is actually exclusive per IRS Pub 1304 Table 2.5 (0, 1, 2, "3 or more"). The 2-kid bucket double-counted with the 3+ bucket, leaving the calibrator no pressure on exact-2-kid EITC recipients and causing a 27% aggregate undercount ($49B vs $67B Treasury target for TY2024). Changing `<2` to `<3` makes rows 0/1/2 exact-match and leaves row 3 as `>=` (correct, since EITC caps qualifying children at 3).

2. SURVIVING_SPOUSE mask

The SOI loop masked the IRS "Married Filing Jointly/Surviving Spouse" row to only PE's JOINT filing status, orphaning ~1.58M surviving-spouse tax units (0.8% of total) from joint-filer AGI-band targets. Fix matches both JOINT and SURVIVING_SPOUSE via `np.isin`, consistent with `soi.py:302-303` and `ctc_diagnostics.py:19-20`.

3. EITC validation reference

`validate_national_h5.py` used `~$60B` as the sanity reference for EITC, but the 2024 Treasury Tax Expenditure target is $67B. Tightened so an underconverged calibration is less likely to silently pass validation.

4. Stale SOI filer-count targets

SOI_FILER_COUNTS_2015 was a hardcoded dict of 7 TY2015 AGI-band counts, uprated only by population growth. Replaced with a dynamic read from `soi_targets.csv` at the latest SOI year ≤ calibration year (currently TY2023, 19 granular bands). Population-only uprating missed dramatic 2015→2023 distributional shifts: +64% at the $100K+ band and −27% at $0–5K.

Fixes #766, #769.
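The double-counting described in item 1 can be reproduced on toy data. The sketch below is illustrative only: the array values and the helper `child_bucket_mask` (with its `exact_below` parameter) are hypothetical and not part of the repository. It mirrors the `if`/`else` in `build_loss_matrix`, first with the old threshold of 2 and then with the new threshold of 3.

```python
import numpy as np

# Hypothetical toy data: EITC-qualifying children per tax unit.
eitc_child_count = np.array([0, 1, 2, 2, 3, 3, 1, 0])


def child_bucket_mask(child_count, bucket, exact_below):
    """Mask for one SOI row (hypothetical helper).

    Buckets below `exact_below` match exactly; buckets at or above it
    use >=, mirroring the if/else branch shown in the diff.
    """
    if bucket < exact_below:
        return child_count == bucket
    return child_count >= bucket


# Old behaviour (threshold 2): rows 2 and 3 both use >=, so they overlap.
old = {b: child_bucket_mask(eitc_child_count, b, exact_below=2) for b in range(4)}
# New behaviour (threshold 3): rows 0, 1, 2 are exact; only row 3 is open-ended.
new = {b: child_bucket_mask(eitc_child_count, b, exact_below=3) for b in range(4)}

old_total = sum(int(m.sum()) for m in old.values())
new_total = sum(int(m.sum()) for m in new.values())

# With the old threshold, the 3-kid units are counted in both row 2 and row 3.
overlap_old = int((old[2] & old[3]).sum())
overlap_new = int((new[2] & new[3]).sum())
```

With the new threshold the four masks partition the tax units, so each unit contributes to exactly one target row.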
1 parent 0745788 commit 97735f0

3 files changed

Lines changed: 35 additions & 24 deletions

changelog.d/767.fixed.md

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+Fix EITC-by-kids calibration: 2-kid bucket was using `>=2` against an exclusive IRS target, causing a 27% EITC aggregate undercount since October 2024. See #766.
+
+Fix SOI filing-status mask to include SURVIVING_SPOUSE alongside JOINT when matching IRS "Married Filing Jointly/Surviving Spouse" rows, so ~1.58M surviving-spouse tax units are constrained by joint-filer AGI-band targets.
+
+Tighten the EITC validation reference in `validate_national_h5.py` from ~$60B to ~$67B (2024 Treasury Tax Expenditure estimate) so underconverged calibrations no longer pass sanity checks.
+
+Replace hardcoded SOI filer-count targets from TY2015 (uprated only by population growth) with dynamic reads from `soi_targets.csv` at the latest SOI year ≤ calibration year. Uses 19 granular AGI bands instead of 7 coarse bands, correcting dramatic distributional shifts (TY2015→TY2023 showed +64% at $100K+ and −27% at $0–5K AGI that population-only uprating missed). See #769.
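The filing-status fix in the second changelog entry can be illustrated with a small array. This is a minimal sketch on made-up data: the status strings follow the diff, but the array contents and the variable names `joint_only`, `joint_or_surviving`, and `orphaned` are assumptions for illustration.

```python
import numpy as np

# Hypothetical filing statuses for five tax units.
filing_status = np.array(
    ["SINGLE", "JOINT", "SURVIVING_SPOUSE", "HEAD_OF_HOUSEHOLD", "JOINT"]
)

# Old mask: only JOINT matches the combined IRS row.
joint_only = filing_status == "JOINT"

# New mask: both statuses covered by the IRS
# "Married Filing Jointly/Surviving Spouse" row are matched.
joint_or_surviving = np.isin(filing_status, ["JOINT", "SURVIVING_SPOUSE"])

# Units that the old mask left out of the joint-filer targets.
orphaned = joint_or_surviving & ~joint_only

# The diff applies the mask multiplicatively to a running float mask.
mask = np.ones(len(filing_status))
mask *= joint_or_surviving
```

`np.isin` returns a boolean array of the same shape as its first argument, so it drops into the existing `mask *= ...` pattern without other changes.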

policyengine_us_data/calibration/validate_national_h5.py

Lines changed: 1 addition & 1 deletion
@@ -56,7 +56,7 @@
     "social_security": (1_200_000_000_000, "~$1.2T"),
     "snap": (110_000_000_000, "~$110B"),
     "ssi": (60_000_000_000, "~$60B"),
-    "eitc": (60_000_000_000, "~$60B"),
+    "eitc": (67_000_000_000, "~$67B"),
     "income_tax_before_credits": (4_000_000_000_000, "~$4T"),
 }
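As a worked check of the figures quoted in the commit message, the shortfall of the $49B calibrated EITC aggregate can be computed against both the old and new references. How `validate_national_h5.py` actually compares a simulated aggregate with its reference is not shown in this diff, so the helper `relative_shortfall` below is a hypothetical illustration, not the script's logic.

```python
def relative_shortfall(actual, reference):
    """Fraction by which `actual` falls short of `reference` (hypothetical helper)."""
    return (reference - actual) / reference


actual_eitc = 49e9      # calibrated aggregate quoted in the commit message
old_reference = 60e9    # previous sanity reference
new_reference = 67e9    # 2024 Treasury Tax Expenditure target

old_gap = relative_shortfall(actual_eitc, old_reference)
new_gap = relative_shortfall(actual_eitc, new_reference)

print(f"shortfall vs old reference: {old_gap:.1%}")
print(f"shortfall vs new reference: {new_gap:.1%}")
```

Against the $67B target the gap rounds to the 27% quoted above, while the old $60B reference made the same aggregate look noticeably closer to target.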

policyengine_us_data/utils/loss.py

Lines changed: 27 additions & 23 deletions
@@ -428,7 +428,7 @@ def build_loss_matrix(dataset: type, time_period):
         if row["Filing status"] == "Single":
             mask *= df["filing_status"].values == "SINGLE"
         elif row["Filing status"] == "Married Filing Jointly/Surviving Spouse":
-            mask *= df["filing_status"].values == "JOINT"
+            mask *= np.isin(df["filing_status"].values, ["JOINT", "SURVIVING_SPOUSE"])
         elif row["Filing status"] == "Head of Household":
             mask *= df["filing_status"].values == "HEAD_OF_HOUSEHOLD"
         elif row["Filing status"] == "Married Filing Separately":
@@ -599,7 +599,10 @@ def build_loss_matrix(dataset: type, time_period):
         )
         eitc_eligible_children = sim.calculate("eitc_child_count").values
         eitc = sim.calculate("eitc").values
-        if row["count_children"] < 2:
+        # IRS Pub 1304 Table 2.5 reports EITC returns by exclusive
+        # qualifying-child categories: 0, 1, 2, and "3 or more". Row 3
+        # represents 3+ since EITC caps qualifying children at 3.
+        if row["count_children"] < 3:
             meets_child_criteria = eitc_eligible_children == row["count_children"]
         else:
             meets_child_criteria = eitc_eligible_children >= row["count_children"]
@@ -627,38 +630,39 @@ def build_loss_matrix(dataset: type, time_period):
         time_period,
     )
 
-    # Tax filer counts by AGI band (SOI Table 1.1)
-    # This calibrates total filers (not just taxable returns) including
-    # low-AGI filers who are important for income distribution accuracy
-    SOI_FILER_COUNTS_2015 = {
-        # (agi_lower, agi_upper): total_returns
-        (-np.inf, 0): 2_072_066,
-        (0, 5_000): 10_134_703,
-        (5_000, 10_000): 11_398_595,
-        (10_000, 25_000): 23_447_927,
-        (25_000, 50_000): 23_727_745,
-        (50_000, 100_000): 32_801_908,
-        (100_000, np.inf): 25_120_985,
-    }
+    # Tax filer counts by AGI band (SOI Table 1.1). Calibrates total
+    # filers (not just taxable returns), with granular bands sourced
+    # from the latest SOI year <= calibration year to avoid hardcoding
+    # stale 2015 values.
+    soi_all = pd.read_csv(CALIBRATION_FOLDER / "soi_targets.csv")
+    soi_count_rows = soi_all[
+        (soi_all["Variable"] == "count")
+        & (soi_all["Filing status"] == "All")
+        & (~soi_all["Full population"])
+        & (~soi_all["Taxable only"])
+        & (soi_all["Year"] <= time_period)
+    ]
+    soi_latest_year = int(soi_count_rows["Year"].max())
+    soi_filer_bands = (
+        soi_count_rows[soi_count_rows["Year"] == soi_latest_year]
+        .sort_values("AGI lower bound")
+        .reset_index(drop=True)
+    )
 
-    # Get AGI and filer status at tax unit level, mapped to household
     agi_tu = sim.calculate("adjusted_gross_income").values
     is_filer_tu = sim.calculate("tax_unit_is_filer").values > 0
 
-    for (
-        agi_lower,
-        agi_upper,
-    ), filer_count_2015 in SOI_FILER_COUNTS_2015.items():
+    for _, row in soi_filer_bands.iterrows():
+        agi_lower = row["AGI lower bound"]
+        agi_upper = row["AGI upper bound"]
         in_band = (agi_tu >= agi_lower) & (agi_tu < agi_upper)
         label = f"nation/soi/filer_count/agi_{fmt(agi_lower)}_{fmt(agi_upper)}"
         loss_matrix[label] = sim.map_result(
             (is_filer_tu & in_band).astype(float),
             "tax_unit",
             "household",
         )
-        # Uprate from 2015 to current year using population growth
-        uprated_target = filer_count_2015 * population_uprating
-        targets_array.append(uprated_target)
+        targets_array.append(row["Value"])
 
     # Hard-coded totals
     for variable_name, target in HARD_CODED_TOTALS.items():
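The "latest SOI year ≤ calibration year" selection in the last hunk can be exercised on a toy table. This is a minimal sketch under stated assumptions: the column names follow the diff, but the rows, the values, and the wrapper function `latest_filer_bands` are hypothetical, and the explicit empty-result check is an addition that does not appear in the commit.

```python
import pandas as pd

# Hypothetical stand-in for soi_targets.csv; all values are made up.
soi_all = pd.DataFrame(
    {
        "Year": [2021, 2021, 2023, 2023, 2025],
        "Variable": ["count"] * 5,
        "Filing status": ["All"] * 5,
        "Full population": [False] * 5,
        "Taxable only": [False] * 5,
        "AGI lower bound": [0.0, 5_000.0, 5_000.0, 0.0, 0.0],
        "AGI upper bound": [5_000.0, float("inf"), float("inf"), 5_000.0, float("inf")],
        "Value": [10.0, 90.0, 95.0, 8.0, 100.0],
    }
)


def latest_filer_bands(soi, time_period):
    """Select filer-count bands from the latest SOI year <= time_period."""
    rows = soi[
        (soi["Variable"] == "count")
        & (soi["Filing status"] == "All")
        & (~soi["Full population"])
        & (~soi["Taxable only"])
        & (soi["Year"] <= time_period)
    ]
    # Assumption: fail with a clear message if no SOI year qualifies.
    if rows.empty:
        raise ValueError(f"no SOI count rows at or before {time_period}")
    latest_year = int(rows["Year"].max())
    bands = (
        rows[rows["Year"] == latest_year]
        .sort_values("AGI lower bound")
        .reset_index(drop=True)
    )
    return latest_year, bands


# Calibrating for 2024 skips the 2025 row and picks the 2023 bands.
year, bands = latest_filer_bands(soi_all, 2024)
```

Because the filter is applied before taking the maximum year, rows dated after the calibration year never influence which year's bands become targets.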
