Commit a83a93a

Add Forbes-backed PUF top tail
1 parent 4f7154a commit a83a93a

8 files changed

Lines changed: 10692 additions & 17 deletions

.gitignore

Lines changed: 4 additions & 0 deletions
@@ -40,6 +40,10 @@ docs/.ipynb_checkpoints/
 ## Calibration run outputs (weights, diagnostics, packages, config)
 policyengine_us_data/storage/calibration/
 
+## Cached rich-list backbone snapshots
+policyengine_us_data/storage/forbes_us_top_400_*.json
+policyengine_us_data/storage/forbes_us_top_400_*.json.lock
+
 ## Batch processing checkpoints
 completed_*.txt
 

changelog.d/825.added.md

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+Add a reproducible Forbes-backed PUF top-tail synthesis path.

policyengine_us_data/datasets/puf/README.md

Lines changed: 45 additions & 0 deletions
@@ -5,6 +5,51 @@ Public Use File into PolicyEngine's US microdata pipeline
 (`irs_puf.py`, `puf.py`, `disaggregate_puf.py`, `uprate_puf.py`, and
 the supporting aggregate-record utilities).
 
+The `$100M+` aggregate record (`RECID 999999`) now has an optional
+Forbes-backed synthesis path. It pulls a public US rich-list backbone
+from the `rtb-api` project, using the canonical `2024-09-01` snapshot
+(the valuation date Forbes uses for the 2024 Forbes 400 list) from a
+pinned `rtb-api` commit. The normalized top-400 snapshot and normalized
+top-tail SCF donor pool are committed in this package so default builds
+are reproducible offline; explicit refresh runs can still write local
+cache files under [`policyengine_us_data/storage`](../../storage). The
+builder then creates the top tail in two stages:
+
+- `Forbes -> SCF`: selected Forbes units are expanded into replicate
+  draws, and the same top-tail SCF donor model is used both to decide
+  which Forbes units enter the `$100M+` bucket and to draw each unit's
+  joint wealth-to-income regime.
+- `SCF -> PUF`: those SCF draws are matched to top-tail PUF donors to
+  fill tax-return detail that SCF does not directly observe.
+
+The builder creates a staged artifact with the source Forbes snapshot,
+selected Forbes units, SCF draws, PUF priors, calibrated synthetic rows,
+and diagnostics. Only the final synthetic rows are upserted into the PUF
+aggregate-record replacement path. If the Forbes snapshot or SCF donor
+pool cannot be loaded in the production disaggregation entry point, the
+code falls back to the existing donor-based disaggregation path.
+
+For PR review or local validation, build the staged artifact and inspect
+the deterministic diagnostics before running the full data pipeline:
+
+```python
+from policyengine_us_data.datasets.puf.forbes_backbone import (
+    build_forbes_top_tail_artifact,
+    build_forbes_top_tail_diagnostic_tables,
+    format_forbes_top_tail_diagnostics,
+)
+
+artifact = build_forbes_top_tail_artifact(...)
+tables = build_forbes_top_tail_diagnostic_tables(artifact, row, amount_columns)
+print(format_forbes_top_tail_diagnostics(tables))
+```
+
+The diagnostic bundle includes a one-row summary, exact calibration
+errors by PUF amount column, component composition comparing SCF priors,
+PUF priors, calibrated synthetic totals, and target totals, selected
+Forbes units, and SCF draw composition. The formatted summary includes
+ASCII bar visuals so it can be pasted directly into a PR or CI log.
+
 The PUF is an IRS SOI Division sample of individual income-tax returns,
 stripped of direct identifiers, with top-coded amounts and
 disclosure-avoidance perturbations applied. PolicyEngine uses the 2015
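
Review note on the README change above: it says the formatted diagnostics include ASCII bar visuals for per-column calibration errors. The following is a minimal sketch of that idea only. It is not the package's `format_forbes_top_tail_diagnostics`; the helper names `ascii_bar` and `format_errors` are hypothetical, and the column labels and numbers in the usage lines are made up for illustration.

```python
from __future__ import annotations


def ascii_bar(value: float, max_abs: float, width: int = 20) -> str:
    """Render |value| as '#' characters, scaled so max_abs fills `width`."""
    if max_abs <= 0:
        return ""
    filled = round(min(abs(value), max_abs) / max_abs * width)
    return "#" * filled


def format_errors(errors: dict[str, float], width: int = 20) -> str:
    """Format one line per column: label, signed error, and a scaled bar."""
    max_abs = max((abs(v) for v in errors.values()), default=0.0)
    label_width = max((len(name) for name in errors), default=0)
    lines = []
    for name, err in errors.items():
        bar = ascii_bar(err, max_abs, width)
        lines.append(f"{name:<{label_width}} {err:>+10.2f} |{bar}")
    return "\n".join(lines)


# Hypothetical usage: illustrative labels and made-up error values.
example_errors = {"column_a": 0.25, "column_b": -1.0, "column_c": 0.0}
print(format_errors(example_errors, width=8))
```

Plain `#` bars keep the output safe to paste into a PR description or a CI log, which is the use case the README names.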

policyengine_us_data/datasets/puf/disaggregate_puf.py

Lines changed: 42 additions & 9 deletions
@@ -13,9 +13,14 @@
 
 from __future__ import annotations
 
+import logging
+
 import numpy as np
 import pandas as pd
 from . import aggregate_record_utils as utils
+from .forbes_backbone import build_forbes_top_tail_bucket
+
+logger = logging.getLogger(__name__)
 
 AGGREGATE_RECIDS = utils.AGGREGATE_RECIDS
 SYNTHETIC_RECID_START = utils.SYNTHETIC_RECID_START
@@ -45,6 +50,7 @@
 def disaggregate_aggregate_records(
     puf: pd.DataFrame,
     seed: int = 42,
+    use_forbes_top_tail: bool = True,
 ) -> pd.DataFrame:
     """Replace the four IRS aggregate rows with calibrated synthetic donors."""
 
@@ -63,15 +69,42 @@ def disaggregate_aggregate_records(
     next_recid = SYNTHETIC_RECID_START
 
     for recid in AGGREGATE_RECIDS:
-        synthetic = utils._disaggregate_bucket(
-            recid=recid,
-            row=agg_rows.loc[recid],
-            regular=regular,
-            amount_columns=amount_columns,
-            donor_scores=donor_scores,
-            next_recid=next_recid,
-            rng=rng,
-        )
+        if use_forbes_top_tail and recid == 999999:
+            try:
+                synthetic = build_forbes_top_tail_bucket(
+                    row=agg_rows.loc[recid],
+                    regular=regular,
+                    amount_columns=amount_columns,
+                    donor_scores=donor_scores,
+                    next_recid=next_recid,
+                    rng=rng,
+                )
+            except Exception as exc:
+                logger.warning(
+                    "Forbes top-tail synthesis failed; falling back to legacy donors: %s",
+                    exc,
+                )
+                synthetic = None
+            if synthetic is None:
+                synthetic = utils._disaggregate_bucket(
+                    recid=recid,
+                    row=agg_rows.loc[recid],
+                    regular=regular,
+                    amount_columns=amount_columns,
+                    donor_scores=donor_scores,
+                    next_recid=next_recid,
+                    rng=rng,
+                )
+        else:
+            synthetic = utils._disaggregate_bucket(
+                recid=recid,
+                row=agg_rows.loc[recid],
+                regular=regular,
+                amount_columns=amount_columns,
+                donor_scores=donor_scores,
+                next_recid=next_recid,
+                rng=rng,
+            )
         next_recid += len(synthetic)
         all_synthetic.append(synthetic[puf.columns])
