Add liquid asset imputation from SIPP (#511)

MaxGhenis · claude · juaristi22 · commit e66441b91301 · 2026-02-12T14:11:04.000+05:30
* Add liquid asset imputation from SIPP

Imputes three asset categories from SIPP 2023 using QRF:
- bank_account_assets (TVAL_BANK): checking, savings, money market
- stock_assets (TVAL_STMF): stocks and mutual funds
- bond_assets (TVAL_BOND): bonds and government securities

This enables modeling of SSI and other means-tested programs that
have asset tests. PolicyEngine-US defines which assets count for
each program (e.g., ssi_countable_resources = bank + stocks + bonds).

Tests verify imputed totals match Fed data (~$15-20T in liquid assets)
and distribution is realistic (~20% have <$2k).
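
The two checks described above (a weighted total and a weighted share below $2k) can be sketched as follows. This is a minimal illustration on toy data, not the contents of test_sipp_assets.py; the helper names `weighted_total` and `weighted_share_below` are hypothetical, and the arrays stand in for the imputed columns and household weights.

```python
import numpy as np


def weighted_total(values, weights):
    """Population total: sum of each record's value times its survey weight."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(values * weights))


def weighted_share_below(values, weights, threshold):
    """Weighted share of records whose value is strictly below threshold."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(weights[values < threshold].sum() / weights.sum())


# Toy stand-ins for the three imputed columns and the survey weights.
bank = np.array([500.0, 1_000.0, 20_000.0, 0.0])
stocks = np.array([0.0, 0.0, 50_000.0, 0.0])
bonds = np.array([0.0, 500.0, 5_000.0, 0.0])
weights = np.array([1_000.0, 2_000.0, 1_500.0, 500.0])

# Liquid assets as the sum of the three categories, as in the commit message.
liquid = bank + stocks + bonds

total = weighted_total(liquid, weights)
share = weighted_share_below(liquid, weights, 2_000)
```

A real test would compare `total` against the external benchmark range and `share` against a tolerance band rather than exact values.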

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Remove SSI resource test placeholder

The random pass rate assignment for meets_ssi_resource_test is no longer
needed now that liquid assets are imputed from SIPP. The SSI resource
test will be calculated from actual imputed assets in policyengine-us.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add SSI takeup rate and draw

- Add ssi.yaml parameter with 50% takeup rate (Urban Institute estimate)
- Add takes_up_ssi_if_eligible draw in CPS processing
- Remove old ssi_pass_rate.yaml (replaced by proper takeup)
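
A takeup draw of this kind can be sketched as below. The repository's own `seeded_rng` helper is not shown in this diff, so `seeded_rng_sketch` is a hypothetical stand-in that derives a stable seed from the variable name; the 0.50 rate mirrors the value in ssi.yaml.

```python
import hashlib

import numpy as np


def seeded_rng_sketch(name: str) -> np.random.Generator:
    """Hypothetical stand-in for seeded_rng: stable generator per variable name."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    return np.random.default_rng(seed)


def draw_takeup(name: str, n_persons: int, rate: float) -> np.ndarray:
    """Return one boolean per person: True where the uniform draw is below rate."""
    rng = seeded_rng_sketch(name)
    return rng.random(n_persons) < rate


# Two calls with the same name reproduce the same draws.
a = draw_takeup("takes_up_ssi_if_eligible", 10_000, 0.50)
b = draw_takeup("takes_up_ssi_if_eligible", 10_000, 0.50)
```

Seeding by variable name keeps each person's draw stable across dataset rebuilds as long as the row order and count do not change.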

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix asset imputation by adding is_female, is_married to calculation

The Microsimulation DataFrame needs these columns explicitly calculated.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix is_married entity mismatch by using raw CPS data

is_married in policyengine-us is defined at Family entity level, but
imputation models need person-level marital status. Get it directly
from raw CPS A_MARITL variable instead of calculate_dataframe.
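
The person-level flag can be sketched as follows, assuming the raw person table and the calculated DataFrame share the same row order (the positional assignment in the diff relies on this). The frames here are toy data, and the code list `[1, 2]` is taken from the diff rather than verified against the CPS codebook.

```python
import pandas as pd

# Codes treated as "married" in this commit's diff.
MARRIED_CODES = [1, 2]

# Toy stand-ins: raw person records and the calculated person-level frame.
raw_person = pd.DataFrame({"A_MARITL": [1, 2, 3, 4, 7]})
cps = pd.DataFrame({"age": [45, 39, 52, 61, 23]})

# Positional assignment is only valid when both frames have one row per
# person in the same order, so check the lengths before assigning.
assert len(raw_person) == len(cps)

is_married = raw_person["A_MARITL"].isin(MARRIED_CODES).to_numpy()
cps["is_married"] = is_married
```

Using `.to_numpy()` (or `.values`, as in the diff) drops the index, so pandas assigns by position instead of aligning on index labels.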

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Drop temporary imputation columns before saving

is_married, is_under_18, is_under_6 are only needed for imputation
models. is_married in policyengine-us is Family-level, so we can't
save a person-level version to the dataset.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix test to use ssi takeup rate instead of ssi_pass_rate

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Skip asset tests if policyengine-us variables unavailable

The bank_account_assets, stock_assets, and bond_assets variables were
added to policyengine-us but aren't yet on PyPI. Add skip condition
so tests pass until the next policyengine-us release.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
diff --git a/.gitignore b/.gitignore
@@ -35,3 +35,4 @@ completed_*.txt
 
 ## Test fixtures
 !policyengine_us_data/tests/test_local_area_calibration/test_fixture_50hh.h5
+oregon_ctc_analysis.py
diff --git a/changelog_entry.yaml b/changelog_entry.yaml
@@ -0,0 +1,8 @@
+- bump: minor
+  changes:
+    added:
+      - Add liquid asset imputation from SIPP (bank accounts, stocks, bonds) for SSI and means-tested program modeling
+      - Add SSI takeup rate parameter and takes_up_ssi_if_eligible draw
+    removed:
+      - Remove random SSI resource test placeholder (now calculated from imputed assets in policyengine-us)
+      - Remove ssi_pass_rate parameter (replaced by ssi takeup rate)
diff --git a/policyengine_us_data/datasets/cps/cps.py b/policyengine_us_data/datasets/cps/cps.py
@@ -211,7 +211,7 @@ def add_takeup(self):
     early_head_start_rate = load_take_up_rate(
         "early_head_start", self.time_period
     )
-    ssi_pass_rate = load_take_up_rate("ssi_pass_rate", self.time_period)
+    ssi_rate = load_take_up_rate("ssi", self.time_period)
 
     # EITC: varies by number of children
     eitc_child_count = baseline.calculate("eitc_child_count").values
@@ -264,9 +264,9 @@ def add_takeup(self):
         rng.random(n_persons) < early_head_start_rate
     )
 
-    # SSI resource test
-    rng = seeded_rng("meets_ssi_resource_test")
-    data["meets_ssi_resource_test"] = rng.random(n_persons) < ssi_pass_rate
+    # SSI
+    rng = seeded_rng("takes_up_ssi_if_eligible")
+    data["takes_up_ssi_if_eligible"] = rng.random(n_persons) < ssi_rate
 
     # WIC: resolve draws to bools using category-specific rates
     wic_categories = baseline.calculate("wic_category_str").values
@@ -1761,11 +1761,20 @@ def add_tips(self, cps: h5py.File):
             "employment_income",
             "age",
             "household_weight",
+            "is_female",
         ],
         2025,
     )
     cps = pd.DataFrame(cps)
 
+    # Get is_married from raw CPS data (A_MARITL codes: 1,2 = married)
+    # Note: is_married in policyengine-us is Family-level, but we need
+    # person-level for imputation models
+    raw_data = self.raw_cps(require=True).load()
+    raw_person = raw_data["person"]
+    cps["is_married"] = raw_person.A_MARITL.isin([1, 2]).values
+    raw_data.close()
+
     cps["is_under_18"] = cps.age < 18
     cps["is_under_6"] = cps.age < 6
     cps["count_under_18"] = (
@@ -1793,6 +1802,27 @@ def add_tips(self, cps: h5py.File):
         mean_quantile=0.5,
     ).tip_income.values
 
+    # Impute liquid assets from SIPP (bank accounts, stocks, bonds)
+
+    from policyengine_us_data.datasets.sipp import get_asset_model
+
+    asset_model = get_asset_model()
+
+    asset_predictions = asset_model.predict(
+        X_test=cps,
+        mean_quantile=0.5,
+    )
+    cps["bank_account_assets"] = asset_predictions.bank_account_assets.values
+    cps["stock_assets"] = asset_predictions.stock_assets.values
+    cps["bond_assets"] = asset_predictions.bond_assets.values
+
+    # Drop temporary columns used only for imputation
+    # is_married is person-level here but policyengine-us defines it at Family
+    # level, so we must not save it
+    cps = cps.drop(
+        columns=["is_married", "is_under_18", "is_under_6"], errors="ignore"
+    )
+
     self.save_dataset(cps)
 
 
diff --git a/policyengine_us_data/datasets/sipp/__init__.py b/policyengine_us_data/datasets/sipp/__init__.py
@@ -1 +1,6 @@
-from .sipp import train_tip_model, get_tip_model
+from .sipp import (
+    train_tip_model,
+    get_tip_model,
+    train_asset_model,
+    get_asset_model,
+)
diff --git a/policyengine_us_data/datasets/sipp/sipp.py b/policyengine_us_data/datasets/sipp/sipp.py
@@ -136,3 +136,135 @@ def get_tip_model() -> QRF:
             model = pickle.load(f)
 
     return model
+
+
+# Asset imputation from SIPP 2023
+# Imputes asset categories separately for policy flexibility
+
+ASSET_COLUMNS = [
+    "SSUID",
+    "PNUM",
+    "MONTHCODE",
+    "SPANEL",
+    "SWAVE",
+    "WPFINWGT",
+    "TAGE",
+    "ESEX",
+    "EMS",
+    "TPTOTINC",
+    # Asset values (person-level sums from SIPP)
+    "TVAL_BANK",  # Checking, savings, money market
+    "TVAL_STMF",  # Stocks and mutual funds
+    "TVAL_BOND",  # Bonds and government securities
+    # SSI receipt (for validation)
+    "RSSI_YRYN",  # Received SSI in at least one month
+]
+
+
+def train_asset_model():
+    """Train QRF model for liquid asset categories using SIPP 2023 data.
+
+    Imputes three asset categories separately:
+    - bank_account_assets: checking, savings, money market (TVAL_BANK)
+    - stock_assets: stocks and mutual funds (TVAL_STMF)
+    - bond_assets: bonds and government securities (TVAL_BOND)
+
+    Policy models can then define countable resources based on rules.
+    """
+    hf_hub_download(
+        repo_id="PolicyEngine/policyengine-us-data",
+        filename="pu2023.csv",
+        repo_type="model",
+        local_dir=STORAGE_FOLDER,
+    )
+
+    df = pd.read_csv(
+        STORAGE_FOLDER / "pu2023.csv",
+        delimiter="|",
+        usecols=ASSET_COLUMNS,
+    )
+
+    # Filter to December (end of year values) to get annual snapshot
+    df = df[df.MONTHCODE == 12]
+
+    # Rename SIPP variables to policy-neutral names
+    df["bank_account_assets"] = df["TVAL_BANK"].fillna(0)
+    df["stock_assets"] = df["TVAL_STMF"].fillna(0)
+    df["bond_assets"] = df["TVAL_BOND"].fillna(0)
+
+    # Prepare predictors
+    df["age"] = df.TAGE
+    df["is_female"] = df.ESEX == 2
+    df["is_married"] = df.EMS == 1
+    df["employment_income"] = df.TPTOTINC * 12
+    df["household_weight"] = df.WPFINWGT
+    df["household_id"] = df.SSUID
+
+    # Calculate household-level counts
+    df["is_under_18"] = df.TAGE < 18
+    df["count_under_18"] = (
+        df.groupby("SSUID")["is_under_18"].sum().loc[df.SSUID.values].values
+    )
+
+    sipp = df[
+        [
+            "household_id",
+            "employment_income",
+            "bank_account_assets",
+            "stock_assets",
+            "bond_assets",
+            "age",
+            "is_female",
+            "is_married",
+            "count_under_18",
+            "household_weight",
+        ]
+    ]
+
+    sipp = sipp[~sipp.isna().any(axis=1)]
+
+    # Weight-proportional resample (with replacement, at most 20,000 rows)
+    # for training efficiency
+    sipp = sipp.loc[
+        np.random.choice(
+            sipp.index,
+            size=min(20_000, len(sipp)),
+            replace=True,
+            p=sipp.household_weight / sipp.household_weight.sum(),
+        )
+    ]
+
+    model = QRF()
+
+    model = model.fit(
+        X_train=sipp,
+        predictors=[
+            "employment_income",
+            "age",
+            "is_female",
+            "is_married",
+            "count_under_18",
+        ],
+        imputed_variables=[
+            "bank_account_assets",
+            "stock_assets",
+            "bond_assets",
+        ],
+    )
+
+    return model
+
+
+def get_asset_model() -> QRF:
+    """Get or train the liquid asset imputation model."""
+    model_path = STORAGE_FOLDER / "liquid_assets.pkl"
+
+    if not model_path.exists():
+        model = train_asset_model()
+
+        with open(model_path, "wb") as f:
+            pickle.dump(model, f)
+    else:
+        with open(model_path, "rb") as f:
+            model = pickle.load(f)
+
+    return model
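
Two data-preparation steps in `train_asset_model` can be illustrated on toy data: the per-household count of children and the weight-proportional resample. This is a sketch under assumptions, not the committed code: it uses `groupby(...).transform("sum")` as an equivalent alternative to the `.sum().loc[...]` lookup in the diff, and a seeded `np.random.default_rng` in place of the global `np.random.choice` so the result is reproducible. Column names and values are illustrative.

```python
import numpy as np
import pandas as pd

# Toy person-level records: three households, one with zero weight.
df = pd.DataFrame(
    {
        "household_id": [1, 1, 1, 2, 2, 3],
        "age": [40, 10, 5, 70, 68, 30],
        "household_weight": [100.0, 100.0, 100.0, 300.0, 300.0, 0.0],
    }
)

# Household-level count of members under 18, broadcast back to each person.
df["is_under_18"] = df["age"] < 18
df["count_under_18"] = df.groupby("household_id")["is_under_18"].transform("sum")

# Weight-proportional resample with replacement: rows with larger weights
# appear more often, and zero-weight rows are never drawn.
rng = np.random.default_rng(0)
probabilities = (df["household_weight"] / df["household_weight"].sum()).to_numpy()
sampled_index = rng.choice(
    df.index.to_numpy(), size=1_000, replace=True, p=probabilities
)
sample = df.loc[sampled_index]
```

Because the draw is proportional to the weights, the resampled rows can be treated as roughly self-weighting when fitting a model that does not accept sample weights.
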
diff --git a/policyengine_us_data/parameters/take_up/ssi.yaml b/policyengine_us_data/parameters/take_up/ssi.yaml
@@ -0,0 +1,9 @@
+description: Percentage of eligible SSI recipients who claim SSI.
+metadata:
+  label: SSI takeup rate
+  unit: /1
+  reference:
+    - title: Urban Institute - SSI Participation Rates for Adults 65+
+      href: https://www.urban.org/research/publication/estimation-national-state-and-substate-program-participation-rates-adults-65
+values:
+  2018-01-01: 0.50
diff --git a/policyengine_us_data/parameters/take_up/ssi_pass_rate.yaml b/policyengine_us_data/parameters/take_up/ssi_pass_rate.yaml
diff --git a/policyengine_us_data/tests/test_datasets/test_sipp_assets.py b/policyengine_us_data/tests/test_datasets/test_sipp_assets.py
diff --git a/policyengine_us_data/tests/test_stochastic_variables.py b/policyengine_us_data/tests/test_stochastic_variables.py
