PolicyEngine
diff --git a/.gitignore b/.gitignore
Lines changed: 4 additions & 0 deletions
diff --git a/changelog.d/825.added.md b/changelog.d/825.added.md
Lines changed: 1 addition & 0 deletions
diff --git a/policyengine_us_data/datasets/puf/README.md b/policyengine_us_data/datasets/puf/README.md
Lines changed: 45 additions & 0 deletions
diff --git a/policyengine_us_data/datasets/puf/disaggregate_puf.py b/policyengine_us_data/datasets/puf/disaggregate_puf.py
Lines changed: 42 additions & 9 deletions
@@ -40,6 +40,10 @@ docs/.ipynb_checkpoints/
 ## Calibration run outputs (weights, diagnostics, packages, config)
 policyengine_us_data/storage/calibration/
 
+## Cached rich-list backbone snapshots
+policyengine_us_data/storage/forbes_us_top_400_*.json
+policyengine_us_data/storage/forbes_us_top_400_*.json.lock
+
 ## Batch processing checkpoints
 completed_*.txt
 
 
@@ -0,0 +1 @@
+Add a reproducible Forbes-backed PUF top-tail synthesis path.
@@ -5,6 +5,51 @@ Public Use File into PolicyEngine's US microdata pipeline
 (`irs_puf.py`, `puf.py`, `disaggregate_puf.py`, `uprate_puf.py`, and
 the supporting aggregate-record utilities).
 
+The `$100M+` aggregate record (`RECID 999999`) now has an optional
+Forbes-backed synthesis path. It pulls a public US rich-list backbone
+from the `rtb-api` project, using the canonical `2024-09-01` snapshot
+(the valuation date Forbes uses for the 2024 Forbes 400 list) from a
+pinned `rtb-api` commit. The normalized top-400 snapshot and normalized
+top-tail SCF donor pool are committed in this package so default builds
+are reproducible offline; explicit refresh runs can still write local
+cache files under [`policyengine_us_data/storage`](../../storage). The
+builder then creates the top tail in two stages:
+
+- `Forbes -> SCF`: selected Forbes units are expanded into replicate
+  draws, and the same top-tail SCF donor model is used both to decide
+  which Forbes units enter the `$100M+` bucket and to draw each unit's
+  joint wealth-to-income regime.
+- `SCF -> PUF`: those SCF draws are matched to top-tail PUF donors to
+  fill tax-return detail that SCF does not directly observe.
+
+The builder creates a staged artifact with the source Forbes snapshot,
+selected Forbes units, SCF draws, PUF priors, calibrated synthetic rows,
+and diagnostics. Only the final synthetic rows are upserted into the PUF
+aggregate-record replacement path. If the Forbes snapshot or SCF donor
+pool cannot be loaded in the production disaggregation entry point, the
+code falls back to the existing donor-based disaggregation path.
+
+For PR review or local validation, build the staged artifact and inspect
+the deterministic diagnostics before running the full data pipeline:
+
+```python
+from policyengine_us_data.datasets.puf.forbes_backbone import (
+    build_forbes_top_tail_artifact,
+    build_forbes_top_tail_diagnostic_tables,
+    format_forbes_top_tail_diagnostics,
+)
+
+artifact = build_forbes_top_tail_artifact(...)
+tables = build_forbes_top_tail_diagnostic_tables(artifact, row, amount_columns)
+print(format_forbes_top_tail_diagnostics(tables))
+```
+
+The diagnostic bundle includes a one-row summary; exact calibration
+errors by PUF amount column; component composition comparing SCF priors,
+PUF priors, calibrated synthetic totals, and target totals; selected
+Forbes units; and SCF draw composition. The formatted summary includes
+ASCII bar visuals so it can be pasted directly into a PR or CI log.
+
 The PUF is an IRS SOI Division sample of individual income-tax returns,
 stripped of direct identifiers, with top-coded amounts and
 disclosure-avoidance perturbations applied. PolicyEngine uses the 2015
 
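Review note on the README hunk above: the README says the formatted diagnostics include ASCII bar visuals suitable for a PR or CI log. The diff does not show how `format_forbes_top_tail_diagnostics` renders them, so the following is only a minimal sketch of the general idea. The helpers `ascii_bar` and `format_bars` and the sample totals are hypothetical and are not part of `forbes_backbone`.

```python
from __future__ import annotations


def ascii_bar(value: float, max_value: float, width: int = 30) -> str:
    """Return a run of '#' whose length is proportional to value / max_value."""
    if max_value <= 0:
        return ""
    # Clamp to [0, max_value] so negative or oversized values cannot break the layout.
    clamped = min(max(value, 0.0), max_value)
    return "#" * round(width * clamped / max_value)


def format_bars(totals: dict[str, float], width: int = 30) -> str:
    """Render one labelled bar per entry, scaled to the largest value."""
    if not totals:
        return ""
    max_value = max(totals.values())
    label_width = max(len(name) for name in totals)
    lines = []
    for name, value in totals.items():
        bar = ascii_bar(value, max_value, width)
        lines.append(f"{name.ljust(label_width)} | {bar} {value:,.0f}")
    return "\n".join(lines)


# Hypothetical totals, for illustration only (not real diagnostics).
sample = {"scf_prior": 40.0, "puf_prior": 60.0, "calibrated": 80.0, "target": 80.0}
print(format_bars(sample, width=20))
```

Plain `#` characters and fixed-width labels keep the output readable in any monospace log viewer, which is presumably the motivation for ASCII output in the first place.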
@@ -13,9 +13,14 @@
 
 from __future__ import annotations
 
+import logging
+
 import numpy as np
 import pandas as pd
 from . import aggregate_record_utils as utils
+from .forbes_backbone import build_forbes_top_tail_bucket
+
+logger = logging.getLogger(__name__)
 
 AGGREGATE_RECIDS = utils.AGGREGATE_RECIDS
 SYNTHETIC_RECID_START = utils.SYNTHETIC_RECID_START
@@ -45,6 +50,7 @@
 def disaggregate_aggregate_records(
     puf: pd.DataFrame,
     seed: int = 42,
+    use_forbes_top_tail: bool = True,
 ) -> pd.DataFrame:
     """Replace the four IRS aggregate rows with calibrated synthetic donors."""
 
@@ -63,15 +69,42 @@ def disaggregate_aggregate_records(
     next_recid = SYNTHETIC_RECID_START
 
     for recid in AGGREGATE_RECIDS:
-        synthetic = utils._disaggregate_bucket(
-            recid=recid,
-            row=agg_rows.loc[recid],
-            regular=regular,
-            amount_columns=amount_columns,
-            donor_scores=donor_scores,
-            next_recid=next_recid,
-            rng=rng,
-        )
+        if use_forbes_top_tail and recid == 999999:
+            try:
+                synthetic = build_forbes_top_tail_bucket(
+                    row=agg_rows.loc[recid],
+                    regular=regular,
+                    amount_columns=amount_columns,
+                    donor_scores=donor_scores,
+                    next_recid=next_recid,
+                    rng=rng,
+                )
+            except Exception as exc:
+                logger.warning(
+                    "Forbes top-tail synthesis failed; falling back to legacy donors: %s",
+                    exc,
+                )
+                synthetic = None
+            if synthetic is None:
+                synthetic = utils._disaggregate_bucket(
+                    recid=recid,
+                    row=agg_rows.loc[recid],
+                    regular=regular,
+                    amount_columns=amount_columns,
+                    donor_scores=donor_scores,
+                    next_recid=next_recid,
+                    rng=rng,
+                )
+        else:
+            synthetic = utils._disaggregate_bucket(
+                recid=recid,
+                row=agg_rows.loc[recid],
+                regular=regular,
+                amount_columns=amount_columns,
+                donor_scores=donor_scores,
+                next_recid=next_recid,
+                rng=rng,
+            )
         next_recid += len(synthetic)
         all_synthetic.append(synthetic[puf.columns])
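Review note on the hunk above: the legacy `utils._disaggregate_bucket(...)` call now appears twice, once as the fallback inside the Forbes branch and once in the `else` branch. One way to express the same control flow with a single fallback call is a small try-then-fallback helper. The sketch below is an illustration under stated assumptions, not the PR's code: `try_with_fallback`, `failing_builder`, and `legacy_builder` are hypothetical stand-ins, and the real builders take the keyword arguments shown in the diff.

```python
import logging
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


def try_with_fallback(
    primary: Callable[[], Optional[T]],
    fallback: Callable[[], T],
    label: str,
) -> T:
    """Run `primary`; use `fallback` if it raises or returns None."""
    try:
        result = primary()
    except Exception as exc:  # broad catch, mirroring the behaviour in the diff
        logger.warning("%s failed; falling back to legacy donors: %s", label, exc)
        result = None
    # Compare against None explicitly: a DataFrame has no unambiguous truth value.
    return fallback() if result is None else result


# Hypothetical stand-ins for the two builders in the diff.
def failing_builder():
    raise FileNotFoundError("snapshot missing")


def legacy_builder():
    return ["legacy-row"]


rows = try_with_fallback(failing_builder, legacy_builder, "Forbes top-tail synthesis")
print(rows)
```

In the loop from the diff, the `primary` callable would only be attempted when `use_forbes_top_tail and recid == 999999`; every other bucket would call the legacy path directly, so the behaviour stays the same while the duplicated argument list disappears.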