enhanced_cps_2024 overshoots CBO income_tax target by ~1.86x across 2024-2026: loss weighting drowns out aggregate targets #1107

The Enhanced CPS, when its `income_tax_positive` is summed and compared against the CBO target it is meant to calibrate to (`calibration.gov.cbo.income_tax`), overshoots by 1.78× to 1.86× in every year tested:

| Year | CBO target | `enhanced_cps_2024` | Ratio | `cps_2024` (no PUF clone / reweight) |
|---:|---:|---:|---:|---:|
| 2024 | $2,426B | $4,503B | 1.86× | $1,905B (0.79×) |
| 2025 | $2,656B | $4,719B | 1.78× | $1,992B (0.75×) |
| 2026 | $2,751B | $5,101B | 1.85× | $2,134B (0.78×) |

## Repro

```python
from policyengine_us import CountryTaxBenefitSystem, Microsimulation

# CBO calibration targets live in the parameter tree.
p = CountryTaxBenefitSystem().parameters

for year in [2024, 2025, 2026]:
    sim = Microsimulation(
        dataset="hf://policyengine/policyengine-us-data/enhanced_cps_2024.h5"
    )
    # Weighted national total of positive income tax, in $B.
    actual = sim.calc("income_tax_positive", period=year).sum() / 1e9
    # CBO target for the same year, in $B.
    target = p.calibration.gov.cbo.income_tax(f"{year}-01-01") / 1e9
    print(f"{year}: ${actual:.0f}B (target ${target:.0f}B, ratio {actual/target:.2f}x)")
```

The HF file was uploaded 2026-05-20 (`Promote 493 files from staging to production for candidate 1.115.4-patch`), so this is not a stale-data problem: the latest production build misses by the same margin.

## Root cause

From reading `policyengine_us_data/datasets/cps/enhanced_cps.py` and `policyengine_us_data/utils/loss.py`:

The `reweight()` function in `enhanced_cps.py` minimises:

```python
nation_normalisation_factor = is_national * (1 / is_national.sum())
# ...
rel_error = (((estimate - targets_array) + 1) / (targets_array + 1)) ** 2
rel_error_normalized = inv_mean_normalisation * rel_error * normalisation_factor
return rel_error_normalized.mean()
```

Every national target gets the same `1 / N_national` weight. With roughly 500 national targets in the loss matrix (mostly IRS SOI by AGI band × filing status × variable, plus the handful of CBO aggregates from `CBO_PROGRAMS`, EITC by child count, and Census age populations), the single CBO `income_tax_positive` target gets ~0.2% of the national loss weight. A 1.86× miss is a relative error of 0.86, so it contributes only `(0.86)² / 500 ≈ 0.0015` to total loss, too little for Adam at `lr=0.2` to keep pushing on.

So the optimizer converges to a local minimum where most of the small SOI cells fit reasonably well, and the big aggregate sits 86% off. Moving weights to fit the aggregate would push many small-cell targets out of fit, which is a net loss increase under the current uniform target weighting.
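
The arithmetic above can be checked with a small NumPy sketch. It copies the squared-relative-error form from the excerpt, uses the 2024 figures from the table for the aggregate, and fills the other 499 slots with hypothetical \$1B cells (placeholder values, not real SOI targets). `uniform_loss` is an illustrative helper, not a function in `loss.py`.

```python
import numpy as np


def uniform_loss(estimate, targets):
    """Equal-weight mean of squared relative errors (same form as the excerpt)."""
    rel_error = (((estimate - targets) + 1) / (targets + 1)) ** 2
    return rel_error.mean()


N_TARGETS = 500  # rough national target count quoted above

# Hypothetical placeholder cells of $1B each; index 0 holds the CBO aggregate.
targets = np.full(N_TARGETS, 1e9)
targets[0] = 2_426e9  # 2024 CBO target from the table

# Case A: every small cell fits exactly, the aggregate sits at the observed value.
aggregate_off = targets.copy()
aggregate_off[0] = 4_503e9  # 2024 enhanced_cps_2024 total from the table
loss_aggregate_off = uniform_loss(aggregate_off, targets)

# Case B: the aggregate fits exactly, every small cell is off by 5%.
cells_off = targets * 1.05
cells_off[0] = targets[0]
loss_cells_off = uniform_loss(cells_off, targets)

print(f"aggregate 1.86x off, cells exact: {loss_aggregate_off:.4f}")
print(f"aggregate exact, cells 5% off:    {loss_cells_off:.4f}")
```

Under these assumptions, a uniform 5% miss on the small cells costs more than the entire 1.86× aggregate miss, which is the trade-off the optimizer faces.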

## What's *not* the problem

- The CBO target value is correct — matches CBO's published projection for federal individual income tax receipts.
- The simulator-side variable is correct since #519's *Use `income_tax_positive` for CBO calibration in loss.py* (2026-02-02).
- The HF dataset isn't stale (uploaded 2026-05-20 from `1.115.4-patch`).
- `cps_2024` (no reweight) undershoots by 21% to 25%, consistent with CPS top-coding very high incomes. That's the data-limit floor; the reweight is supposed to lift it but instead overshoots by 78% to 86%.

## Suggested fixes (any of which should help; first is cheapest)

1. **Importance weights per target.** Scale each target's loss contribution by something like `log10(target_magnitude)` or `sqrt(target_magnitude)` so large aggregates aren't drowned out. Uniform weighting currently treats a \$2.75T aggregate and a \$1M SOI bucket as equally important.
2. **Two-stage fit.** Optimize first against only the CBO aggregate targets (`income_tax_positive`, `snap`, `social_security`, `ssi`, `unemployment_compensation`), then add the SOI cells with the aggregate-fit weights as a warm start.
3. **Hard aggregate-consistency constraint.** Add `sum_over_AGI_bands(income_tax) == CBO_total` as an equality (Lagrangian) rather than letting it compete with bucket targets.
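
As a rough sketch of option 1, the snippet below compares how much of the total loss weight the aggregate would receive under uniform, `log10`, and `sqrt` importance weights. It assumes one \$2.75T aggregate and 499 cells of \$1M each, matching the magnitudes mentioned above; the cell count and the uniform cell size are hypothetical placeholders, and `weight_share` is an illustrative helper rather than anything in `loss.py`.

```python
import numpy as np


def weight_share(targets, transform, index=0):
    """Fraction of total importance weight assigned to targets[index]."""
    weights = transform(np.abs(targets))
    return weights[index] / weights.sum()


# Hypothetical target vector: one $2.75T aggregate plus 499 cells of $1M.
targets = np.full(500, 1e6)
targets[0] = 2.75e12

shares = {
    "uniform": weight_share(targets, np.ones_like),
    "log10": weight_share(targets, np.log10),
    "sqrt": weight_share(targets, np.sqrt),
}

for name, share in shares.items():
    print(f"{name:>7}: aggregate gets {share:.1%} of the weight")
```

With these placeholder magnitudes, `log10` only about doubles the aggregate's share (roughly 0.2% to 0.4%), while `sqrt` lets it dominate (roughly 77%), so the choice of transform would likely need tuning against the real target vector.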

A quick check on whether the other CBO aggregates (`snap`, `social_security`, `ssi`, `unemployment_compensation`) are also missing would help decide between (1) — a structural under-weighting that hits all aggregates — and (3) — possibly an income-tax-specific interaction with the cap-gains/dividends per-AGI-bracket targets added recently (#868).
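
One way to structure that check is sketched below. `aggregate_ratios` is a hypothetical helper that takes the simulated-total and target lookups as callables, so it runs without PolicyEngine installed; the demo feeds it only the 2024 income tax figures from the table. The mapping from variable names to CBO parameter names is an assumption for every program except income tax.

```python
def aggregate_ratios(simulated_total, target_total, programs, year):
    """Return {variable: simulated / target} for each program in one year.

    simulated_total(variable, year) and target_total(parameter, year) are
    injected callables, so the helper itself needs no PolicyEngine import.
    """
    return {
        variable: simulated_total(variable, year) / target_total(parameter, year)
        for variable, parameter in programs.items()
    }


# Simulated variable -> CBO parameter name. Only the income tax pairing is
# confirmed by the repro; the other names are assumed to match one-to-one.
PROGRAMS = {
    "income_tax_positive": "income_tax",
    "snap": "snap",
    "social_security": "social_security",
    "ssi": "ssi",
    "unemployment_compensation": "unemployment_compensation",
}

# Stand-in lookups holding only the 2024 income tax figures from the table.
table_simulated = {("income_tax_positive", 2024): 4_503e9}
table_targets = {("income_tax", 2024): 2_426e9}

ratios = aggregate_ratios(
    lambda variable, year: table_simulated[(variable, year)],
    lambda parameter, year: table_targets[(parameter, year)],
    {"income_tax_positive": PROGRAMS["income_tax_positive"]},
    2024,
)
print({name: round(ratio, 2) for name, ratio in ratios.items()})

# With policyengine_us installed, the callables could be wired to the repro's
# objects (untested here; parameter paths for the other programs are assumed):
#   simulated_total = lambda v, y: sim.calc(v, period=y).sum()
#   target_total = lambda name, y: getattr(p.calibration.gov.cbo, name)(f"{y}-01-01")
```

If the other ratios come out near 1.0 while income tax stays near 1.86, that would point toward an income-tax-specific interaction rather than structural under-weighting.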

Discovered while building https://github.com/PolicyEngine/bottom-50-tax-analysis. That repo currently defaults to `cps_2024` because of this regression; happy to switch back to `enhanced_cps_2024` once the calibration is fixed.
