@@ -5,6 +5,51 @@ Public Use File into PolicyEngine's US microdata pipeline
(`irs_puf.py`, `puf.py`, `disaggregate_puf.py`, `uprate_puf.py`, and
the supporting aggregate-record utilities).

+ The `$100M+` aggregate record (`RECID 999999`) now has an optional
+ Forbes-backed synthesis path. It pulls a public US rich-list backbone
+ from the `rtb-api` project, using the canonical `2024-09-01` snapshot
+ (the valuation date Forbes uses for the 2024 Forbes 400 list) from a
+ pinned `rtb-api` commit. The normalized top-400 snapshot and normalized
+ top-tail SCF donor pool are committed in this package so default builds
+ are reproducible offline; explicit refresh runs can still write local
+ cache files under [`policyengine_us_data/storage`](../../storage). The
+ builder then creates the top tail in two stages:
+
+ - `Forbes -> SCF`: selected Forbes units are expanded into replicate
+   draws, and the same top-tail SCF donor model is used both to decide
+   which Forbes units enter the `$100M+` bucket and to draw each unit's
+   joint wealth-to-income regime.
+ - `SCF -> PUF`: those SCF draws are matched to top-tail PUF donors to
+   fill tax-return detail that SCF does not directly observe.
+
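The two stages listed above can be pictured as a pair of donor draws. What follows is a minimal sketch under stated assumptions, not the package's implementation: the function names `draw_scf_regimes` and `match_puf_donors`, the log-wealth proximity weighting, and the nearest-income matching rule are all hypothetical, the toy numbers stand in for the real Forbes, SCF, and PUF inputs, and the step that decides which Forbes units enter the `$100M+` bucket is left out.

```python
import math
import random


def draw_scf_regimes(forbes_wealth, scf_donors, n_replicates, rng):
    """Stage 1 (hypothetical): replicate SCF donor draws per Forbes unit.

    `scf_donors` is a list of (wealth, income) pairs. Donors closer in log
    wealth are more likely to be drawn; each draw inherits the donor's
    income-to-wealth ratio, scaled to the Forbes unit's wealth.
    """
    draws = []
    for unit, wealth in enumerate(forbes_wealth):
        weights = [
            1.0 / (1.0 + abs(math.log(donor_wealth) - math.log(wealth)))
            for donor_wealth, _ in scf_donors
        ]
        donors = rng.choices(scf_donors, weights=weights, k=n_replicates)
        for donor_wealth, donor_income in donors:
            income = wealth * donor_income / donor_wealth
            draws.append({"unit": unit, "wealth": wealth, "income": income})
    return draws


def match_puf_donors(draws, puf_incomes):
    """Stage 2 (hypothetical): index of the PUF donor nearest in income."""
    return [
        min(range(len(puf_incomes)), key=lambda i: abs(puf_incomes[i] - draw["income"]))
        for draw in draws
    ]


# Toy inputs, made up for illustration only.
rng = random.Random(0)
draws = draw_scf_regimes(
    forbes_wealth=[2.0e9, 5.0e9],
    scf_donors=[(1.5e8, 6.0e6), (4.0e8, 1.0e7), (9.0e8, 4.5e7)],
    n_replicates=3,
    rng=rng,
)
puf_index = match_puf_donors(draws, puf_incomes=[5.0e7, 1.2e8, 3.0e8])
```

Each entry of `puf_index` points at the toy PUF donor whose tax-return detail would be borrowed for the corresponding SCF draw. A real matcher would likely condition on more than one variable.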
+ The builder creates a staged artifact with the source Forbes snapshot,
+ selected Forbes units, SCF draws, PUF priors, calibrated synthetic rows,
+ and diagnostics. Only the final synthetic rows are upserted into the PUF
+ aggregate-record replacement path. If the Forbes snapshot or SCF donor
+ pool cannot be loaded in the production disaggregation entry point, the
+ code falls back to the existing donor-based disaggregation path.
+
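The fallback described above amounts to a guarded load. A minimal sketch of that control flow follows. The three callables (`load_forbes_inputs`, `synthesize_from_forbes`, `disaggregate_with_donors`) are hypothetical placeholders rather than functions from this package, and the choice of exception types is an assumption.

```python
import logging

logger = logging.getLogger(__name__)


def disaggregate_aggregate_record(
    load_forbes_inputs, synthesize_from_forbes, disaggregate_with_donors
):
    """Use the Forbes-backed path when its inputs load, else the donor path.

    All three arguments are hypothetical callables standing in for the
    real loaders and builders. Returns (rows, path_name).
    """
    try:
        snapshot, donor_pool = load_forbes_inputs()
    except (OSError, ValueError) as error:
        # Assumption: a missing or malformed snapshot or donor pool
        # surfaces as OSError or ValueError.
        logger.warning("Forbes inputs unavailable (%s); using donor path", error)
        return disaggregate_with_donors(), "donor"
    return synthesize_from_forbes(snapshot, donor_pool), "forbes"


def missing_inputs():
    """Simulate a build where the committed snapshot cannot be read."""
    raise FileNotFoundError("forbes snapshot not found")


rows, path = disaggregate_aggregate_record(
    missing_inputs,
    lambda snapshot, donor_pool: ["synthetic row"],
    lambda: ["donor row"],
)
```

Returning the path name alongside the rows is a design choice of this sketch: it makes it easy for a test or CI log to confirm which branch actually ran.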
+ For PR review or local validation, build the staged artifact and inspect
+ the deterministic diagnostics before running the full data pipeline:
+
+ ```python
+ from policyengine_us_data.datasets.puf.forbes_backbone import (
+     build_forbes_top_tail_artifact,
+     build_forbes_top_tail_diagnostic_tables,
+     format_forbes_top_tail_diagnostics,
+ )
+
+ # Builder arguments are elided here; pass the inputs your build uses.
+ artifact = build_forbes_top_tail_artifact(...)
+ # `row` and `amount_columns` are not defined in this snippet and must
+ # be supplied by the caller.
+ tables = build_forbes_top_tail_diagnostic_tables(artifact, row, amount_columns)
+ print(format_forbes_top_tail_diagnostics(tables))
+ ```
+
+ The diagnostic bundle includes a one-row summary; exact calibration
+ errors by PUF amount column; a component composition comparing SCF
+ priors, PUF priors, calibrated synthetic totals, and target totals; the
+ selected Forbes units; and the SCF draw composition. The formatted
+ summary includes ASCII bar visuals so it can be pasted directly into a
+ PR or CI log.
+
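To give a feel for the kind of ASCII bar output mentioned above, here is a small sketch. `format_error_bars` is a hypothetical helper and is not `format_forbes_top_tail_diagnostics`; the layout, the column labels, and the error values are all made up for the example.

```python
def format_error_bars(errors, width=20, fill="#"):
    """Render {label: relative_error} as aligned ASCII bars.

    Bar length is proportional to |error| relative to the largest
    absolute error in the input.
    """
    if not errors:
        return ""
    # Fall back to 1.0 so an all-zero input does not divide by zero.
    largest = max(abs(value) for value in errors.values()) or 1.0
    label_width = max(len(label) for label in errors)
    lines = []
    for label, value in errors.items():
        bar = fill * round(width * abs(value) / largest)
        lines.append(f"{label:<{label_width}} | {bar:<{width}} | {value:+.2%}")
    return "\n".join(lines)


# Made-up relative calibration errors keyed by made-up column labels.
example = {"wages": 0.0, "capital_gains": 0.004, "dividends": -0.002}
print(format_error_bars(example))
```

Plain-text bars like these survive copy and paste into a PR comment or CI log without any rendering support, which is the property the formatted summary relies on.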
The PUF is an IRS SOI Division sample of individual income-tax returns,
stripped of direct identifiers, with top-coded amounts and
disclosure-avoidance perturbations applied. PolicyEngine uses the 2015