1 change: 1 addition & 0 deletions changelog.d/1079.added
@@ -0,0 +1 @@
Added ACS VALP-backed primary_residence_value imputation to CPS and source-imputed outputs.
3 changes: 2 additions & 1 deletion docs/appendix.md
@@ -152,7 +152,8 @@ within the same record.
- auto_loan_balance
- auto_loan_interest

#### Variables Imputed from American Community Survey (2 variables)
#### Variables Imputed from American Community Survey (3 variables)

- rent
- real_estate_taxes
- primary_residence_value
8 changes: 5 additions & 3 deletions docs/data.md
@@ -9,7 +9,7 @@ sources.
| ------------------- | ----------------------- | ---------------------------------------------------------------------- |
| CPS ASEC | 2024 (income year 2023) | Base microdata; pipeline ages values to target policy year |
| IRS PUF | 2015 | Pipeline ages values to target policy year using income growth indices |
| ACS | 2022 | Provides rent and real estate tax imputation targets |
| ACS | 2022 | Provides rent, real estate tax, and primary residence value targets |
| SCF | 2022 | Provides wealth and debt variable imputation targets |
| SIPP | 2023 | Provides tip income and asset imputation targets |
| Calibration targets | Primarily 2023–2024 | Varies by source; see calibration data sources below |
@@ -93,8 +93,10 @@ proper matching.

The ACS provides housing and geographic data that supplements the CPS housing information. For
homeowners, we impute property taxes based on state of residence, household income, and demographic
characteristics. We also impute rent values for specific tenure types where CPS data is incomplete,
along with additional housing characteristics not captured in the CPS. These imputations use
characteristics. We also impute owner-occupied primary residence market value from ACS property
value records, with non-owner households set to zero. Rent values are imputed for specific tenure
types where CPS data is incomplete, along with additional housing characteristics not captured in
the CPS. These imputations use
Quantile Regression Forests to preserve distributional characteristics while accounting for
household heterogeneity.

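A minimal sketch of the donor-side preparation this paragraph describes, assuming toy household and person frames. The column names `VALP`, `TEN`, and `SPORDER` follow the ACS fields used in `acs.py` in this diff; the helper name `attach_primary_residence_value` and the sample values are hypothetical, and the sketch uses `Series.where` rather than the multiplication in the diff so that a missing `VALP` on a renter household becomes zero.

```python
import pandas as pd


def attach_primary_residence_value(
    person: pd.DataFrame, household: pd.DataFrame
) -> pd.Series:
    """Broadcast household VALP to person rows, kept only on owner-occupying heads."""
    lookup = household.set_index("household_id")[["VALP", "TEN"]]
    # One row per person, in person order (household rows repeat as needed).
    joined = lookup.loc[person["household_id"]].reset_index(drop=True)
    is_head = person["SPORDER"].reset_index(drop=True) == 1
    # TEN codes 1 and 2 are the owner-occupied categories, as in the diff.
    owner_occupied = joined["TEN"].astype(int).isin([1, 2])
    # Everyone who is not an owner-occupying head gets zero.
    return joined["VALP"].where(is_head & owner_occupied, 0.0)


# Hypothetical toy data: household 10 owns with a mortgage, household 20 rents.
household = pd.DataFrame(
    {"household_id": [10, 20], "VALP": [275_000, None], "TEN": [1, 3]}
)
person = pd.DataFrame({"household_id": [10, 10, 20], "SPORDER": [1, 2, 1]})
values = attach_primary_residence_value(person, household)
```

Only the head of the owner household carries the value, which matches the shape of the `primary_residence_value` array in the tiny ACS test fixture later in this diff.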
10 changes: 5 additions & 5 deletions docs/pipeline_map.yaml
@@ -205,7 +205,7 @@ stages:
- id: in_acs
label: ACS 2022
node_type: artifact
description: Training data for rent QRF
description: Training data for housing QRF
- id: in_sipp
label: SIPP 2023
node_type: artifact
@@ -648,12 +648,12 @@ stages:
stability: moving
- id: 1f_source_imputation
label: 1f
title: 'Substage 1f: Source Imputation (ACS + SIPP + SCF)'
title: 'Substage 1f: Source Imputation (ACS + SIPP + ORG + SCF)'
canonical_stage_id: 1_build_datasets
legacy_stage_id: '4'
manifest_step_ids:
- 01_build_datasets
description: Impute wealth/assets from external surveys onto stratified CPS via QRF
description: Impute housing, wealth/assets, and labor-market variables from external surveys onto stratified CPS via QRF
country: us
extra_nodes:
- id: in_strat_s4
@@ -663,7 +663,7 @@ stages:
- id: in_acs_s4
label: ACS_2022
node_type: artifact
description: American Community Survey - has state_fips predictor
description: American Community Survey - has state_fips predictor and housing targets
- id: in_sipp_s4
label: SIPP 2023
node_type: external
@@ -679,7 +679,7 @@ stages:
- id: out_imputed
label: source_imputed_stratified_extended_cps.h5
node_type: artifact
description: Enriched with ACS/SIPP/SCF vars - uploaded to HuggingFace
description: Enriched with ACS/SIPP/ORG/SCF vars - uploaded to HuggingFace
- id: util_clone_assign
label: clone_and_assign.py
node_type: utility
36 changes: 29 additions & 7 deletions policyengine_us_data/calibration/source_impute.py
@@ -6,7 +6,8 @@
financial predictors.

Sources and variables:
ACS -> rent, real_estate_taxes (with state predictor)
ACS -> rent, real_estate_taxes, primary_residence_value
(with state predictor)
SIPP -> tip_income, bank_account_assets, stock_assets,
bond_assets, household_vehicles_owned,
household_vehicles_value (no state predictor)
@@ -29,6 +30,7 @@
import logging
from typing import Dict, Optional

import h5py
import numpy as np
import pandas as pd
from policyengine_us_data.datasets.cps.tipped_occupation import (
@@ -72,6 +74,12 @@
ACS_IMPUTED_VARIABLES = [
"rent",
"real_estate_taxes",
"primary_residence_value",
]

ACS_CALCULATED_IMPUTED_VARIABLES = [
"rent",
"real_estate_taxes",
]

SIPP_IMPUTED_VARIABLES = [
@@ -150,6 +158,7 @@
"RENTED": 2,
"NONE": 0,
}
OWNER_TENURE_CODE = 1

SIPP_JOB_OCCUPATION_COLUMNS = [f"TJB{i}_OCC" for i in range(1, 8)]

@@ -321,7 +330,7 @@ def _person_state_fips(
id="acs_qrf",
label="ACS QRF Imputation",
node_type="library",
description="Impute rent and real estate tax variables from ACS donor data.",
description="Impute housing value, rent, and real estate tax variables from ACS donor data.",
source_file="policyengine_us_data/calibration/source_impute.py",
status="current",
stability="moving",
@@ -337,7 +346,7 @@ def _impute_acs(
time_period: int,
dataset_path: Optional[str] = None,
) -> Dict[str, Dict[int, np.ndarray]]:
"""Impute rent and real_estate_taxes from ACS with state.
"""Impute rent, real_estate_taxes, and primary_residence_value from ACS.

Args:
data: CPS data dict.
@@ -357,19 +366,28 @@ def _impute_acs(
predictors = ACS_PREDICTORS + ["state_fips"]

acs_df = acs.calculate_dataframe(
ACS_PREDICTORS + ACS_IMPUTED_VARIABLES, map_to="person"
ACS_PREDICTORS + ACS_CALCULATED_IMPUTED_VARIABLES,
map_to="person",
)
acs_df["state_fips"] = acs.calculate("state_fips", map_to="person").values.astype(
np.float32
)
with h5py.File(ACS_2022.file_path, "r") as acs_h5:
acs_df["primary_residence_value"] = np.asarray(
acs_h5["primary_residence_value"],
dtype=np.float32,
)

train_df = acs_df[acs_df.is_household_head].sample(10_000, random_state=42)
train_df = _encode_tenure_type(train_df)
del acs

if dataset_path is not None:
cps_sim = Microsimulation(dataset=dataset_path)
cps_df = cps_sim.calculate_dataframe(ACS_PREDICTORS, map_to="person")
cps_df = cps_sim.calculate_dataframe(
ACS_PREDICTORS,
map_to="person",
)
del cps_sim
else:
cps_df = pd.DataFrame()
@@ -402,18 +420,22 @@ def _impute_acs(
imputed_variables=ACS_IMPUTED_VARIABLES,
)
predictions = fitted.predict(X_test=cps_heads)
owner_head_mask = cps_heads["tenure_type"].to_numpy() == OWNER_TENURE_CODE

n_persons = len(data["person_id"][time_period])
for var in ACS_IMPUTED_VARIABLES:
values = np.zeros(n_persons, dtype=np.float32)
values[mask] = predictions[var].values
predicted_values = predictions[var].values
if var == "primary_residence_value":
predicted_values = np.where(owner_head_mask, predicted_values, 0)
values[mask] = predicted_values
data[var] = {time_period: values}
data["pre_subsidy_rent"] = {time_period: data["rent"][time_period].copy()}

del fitted, predictions
gc.collect()

logger.info("ACS imputation complete: rent, real_estate_taxes")
logger.info("ACS imputation complete: %s", ", ".join(ACS_IMPUTED_VARIABLES))
return data


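The recipient-side step in `_impute_acs` can be sketched in isolation as follows. This is an illustration under stated assumptions, not the module's code: `scatter_head_predictions` is a hypothetical helper, the arrays are made-up toy inputs, and only `OWNER_TENURE_CODE = 1` is taken from the diff.

```python
import numpy as np

OWNER_TENURE_CODE = 1  # mirrors the constant added in this diff


def scatter_head_predictions(predicted, head_mask, owner_head_mask, n_persons):
    """Write head-level predictions into a person-level array, zeroing non-owners."""
    values = np.zeros(n_persons, dtype=np.float32)
    # owner_head_mask is aligned with the head rows, not with all persons.
    values[head_mask] = np.where(owner_head_mask, predicted, 0)
    return values


# Hypothetical toy inputs: four persons, three of them household heads.
head_mask = np.array([True, False, True, True])
tenure_codes = np.array([1, 2, 0])  # encoded tenure for the three heads
predicted = np.array([300_000.0, 150_000.0, 90_000.0])

owner_head_mask = tenure_codes == OWNER_TENURE_CODE
values = scatter_head_predictions(
    predicted, head_mask, owner_head_mask, head_mask.size
)
```

The point of the sketch is the alignment: the owner mask has one entry per household head, so it must be applied to the predictions before they are scattered into the person-length array.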
7 changes: 5 additions & 2 deletions policyengine_us_data/datasets/acs/acs.py
@@ -71,17 +71,20 @@ def add_person_variables(
acs["self_employment_income"] = person.SEMP
acs["social_security"] = person.SSP
acs["taxable_private_pension_income"] = person.RETP
person[["rent", "real_estate_taxes"]] = (
person[["rent", "real_estate_taxes", "primary_residence_value", "TEN"]] = (
household.set_index("household_id")
.loc[person["household_id"]][["RNTP", "TAXAMT"]]
.loc[person["household_id"]][["RNTP", "TAXAMT", "VALP", "TEN"]]
.values
)
acs["is_household_head"] = person.SPORDER == 1
factor = person.SPORDER == 1
owner_occupied = person.TEN.astype(int).isin([1, 2])
person.rent *= factor * 12
person.real_estate_taxes *= factor
person.primary_residence_value *= factor * owner_occupied
acs["rent"] = person.rent
acs["real_estate_taxes"] = person.real_estate_taxes
acs["primary_residence_value"] = person.primary_residence_value
acs["tenure_type"] = (
household.TEN.astype(int)
.map(
1 change: 1 addition & 0 deletions policyengine_us_data/datasets/acs/census_acs.py
@@ -57,6 +57,7 @@
"RMSP", # Number of rooms
"RNTP", # Monthly rent
"TEN", # Tenure
"VALP", # Property value
"VEH", # Number of vehicles
"FINCP", # Total income
"GRNTP", # Gross rent
40 changes: 32 additions & 8 deletions policyengine_us_data/datasets/cps/cps.py
@@ -73,6 +73,10 @@
from policyengine_us_data.pipeline_metadata import pipeline_node
from policyengine_us_data.pipeline_schema import PipelineNode

ACS_CALCULATED_IMPUTED_VARIABLES = ["rent", "real_estate_taxes"]
ACS_IMPUTED_VARIABLES = [*ACS_CALCULATED_IMPUTED_VARIABLES, "primary_residence_value"]
OWNER_TENURE_TYPES = {"OWNED_WITH_MORTGAGE", "OWNED_OUTRIGHT"}

CURRENT_HEALTH_COVERAGE_REPORTED_VAR_MAP = {
"reported_has_direct_purchase_health_coverage_at_interview": "NOW_DIR",
"reported_has_marketplace_health_coverage_at_interview": "NOW_MRK",
@@ -339,9 +343,9 @@ def downsample(self, frac: float) -> None:
@pipeline_node(
PipelineNode(
id="add_rent",
label="Rent Imputation",
label="ACS Housing Imputation",
node_type="library",
description="Impute rent and real estate taxes using ACS donor data.",
description="Impute housing values, rent, and real estate taxes using ACS donor data.",
source_file="policyengine_us_data/datasets/cps/cps.py",
status="legacy",
stability="moving",
@@ -398,8 +402,10 @@ def add_rent(self, cps: h5py.File, person: DataFrame, household: DataFrame):
"state_code_str",
"household_size",
]
IMPUTATIONS = ["rent", "real_estate_taxes"]
train_df = acs.calculate_dataframe(PREDICTORS + IMPUTATIONS, map_to="person")
train_df = acs.calculate_dataframe(
PREDICTORS + ACS_CALCULATED_IMPUTED_VARIABLES,
map_to="person",
)
# TODO(PolicyEngine/policyengine-core#482): policyengine-core 3.24.0+
# silently drops user-supplied ETERNITY inputs on dataset reload because
# _user_input_keys records the user-supplied period instead of the
@@ -413,26 +419,34 @@ def add_rent(self, cps: h5py.File, person: DataFrame, household: DataFrame):
train_df["is_household_head"] = np.asarray(
acs_h5["is_household_head"], dtype=bool
)
train_df["primary_residence_value"] = np.asarray(
acs_h5["primary_residence_value"],
dtype=float,
)
train_df.tenure_type = train_df.tenure_type.map(
{
"OWNED_OUTRIGHT": "OWNED_WITH_MORTGAGE",
},
na_action="ignore",
).fillna(train_df.tenure_type)
train_df = train_df[train_df.is_household_head].sample(10_000)
inference_df = cps_sim.calculate_dataframe(PREDICTORS, map_to="person")
inference_df = cps_sim.calculate_dataframe(
PREDICTORS,
map_to="person",
)
inference_df["is_household_head"] = np.asarray(cps["is_household_head"], dtype=bool)
mask = inference_df.is_household_head.values
inference_df = inference_df[mask]
owner_head_mask = inference_df.tenure_type.astype(str).isin(OWNER_TENURE_TYPES)

qrf = QRF()
logging.info("Training imputation model for rent and real estate taxes.")
logging.info("Training imputation model for ACS housing variables.")
fitted_model = qrf.fit(
X_train=train_df,
predictors=PREDICTORS,
imputed_variables=IMPUTATIONS,
imputed_variables=ACS_IMPUTED_VARIABLES,
)
logging.info("Imputing rent and real estate taxes.")
logging.info("Imputing ACS housing variables.")
imputed_values = fitted_model.predict(X_test=inference_df)
logging.info("Imputation complete.")
# ``cps["age"]`` has an integer dtype, so ``np.zeros_like(cps["age"])``
@@ -444,6 +458,16 @@ def add_rent(self, cps: h5py.File, person: DataFrame, household: DataFrame):
cps["pre_subsidy_rent"] = cps["rent"]
cps["real_estate_taxes"] = np.zeros(len(cps["age"]), dtype=float)
cps["real_estate_taxes"][mask] = imputed_values["real_estate_taxes"]
primary_residence_values = np.asarray(
imputed_values["primary_residence_value"],
dtype=float,
)
cps["primary_residence_value"] = np.zeros(len(cps["age"]), dtype=float)
cps["primary_residence_value"][mask] = np.where(
owner_head_mask,
primary_residence_values,
0,
)


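One way to sanity-check the output of either imputation path is to confirm that `primary_residence_value` is nonzero only on owner-occupying household heads. The sketch below assumes a person-level tenure array, which is an illustration choice (the fixtures in this diff store `tenure_type` at household level); `residence_value_violations` is a hypothetical helper, and `OWNER_TENURE_TYPES` is copied from the diff.

```python
import numpy as np

OWNER_TENURE_TYPES = {"OWNED_WITH_MORTGAGE", "OWNED_OUTRIGHT"}  # as in cps.py


def residence_value_violations(value, is_head, tenure_type):
    """Return person indices with a nonzero value on a non-head or non-owner."""
    is_owner = np.isin(tenure_type, list(OWNER_TENURE_TYPES))
    allowed = is_head & is_owner
    return np.flatnonzero((value != 0) & ~allowed)


# Hypothetical toy arrays: the last person is a renting head with a stray value.
value = np.array([275_000.0, 0.0, 0.0, 50_000.0])
is_head = np.array([True, False, True, True])
tenure_type = np.array(
    ["OWNED_WITH_MORTGAGE", "OWNED_WITH_MORTGAGE", "RENTED", "RENTED"]
)
violations = residence_value_violations(value, is_head, tenure_type)
```

An empty result means the owner-only zeroing held; any returned index points at a person row worth inspecting.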
TEMPORARY_TAKEUP_SOURCE_ANCHORS = ("snap_reported", "ssi_reported")
2 changes: 2 additions & 0 deletions tests/integration/support/tiny_stage_1.py
@@ -39,6 +39,7 @@
"is_household_head",
"rent",
"real_estate_taxes",
"primary_residence_value",
)

ACS_HOUSEHOLD_ARRAYS = (
@@ -229,6 +230,7 @@ def write_tiny_acs(path: Path) -> None:
"is_household_head": np.array([True, False, True], dtype=np.bool_),
"rent": np.array([0, 0, 14_400], dtype=np.float32),
"real_estate_taxes": np.array([2_400, 0, 0], dtype=np.float32),
"primary_residence_value": np.array([275_000, 0, 0], dtype=np.float32),
"tenure_type": np.array([b"OWNED_WITH_MORTGAGE", b"RENTED"]),
"household_vehicles_owned": np.array([2, 1], dtype=np.int16),
"state_fips": np.array([37, 37], dtype=np.int16),
3 changes: 3 additions & 0 deletions tests/integration/support/tiny_stage_2.py
@@ -51,6 +51,7 @@
"non_qualified_dividend_income",
"rent",
"real_estate_taxes",
"primary_residence_value",
"deductible_mortgage_interest",
"is_tax_unit_head",
"is_tax_unit_spouse",
@@ -160,6 +161,7 @@ def write_tiny_cps(
"non_qualified_dividend_income": np.array([10, 5, 0], dtype=np.float32),
"rent": acs["rent"][:],
"real_estate_taxes": acs["real_estate_taxes"][:],
"primary_residence_value": acs["primary_residence_value"][:],
"deductible_mortgage_interest": np.array([1_800, 0, 0], dtype=np.float32),
"is_tax_unit_head": np.array([True, False, True], dtype=np.bool_),
"is_tax_unit_spouse": np.array([False, True, False], dtype=np.bool_),
@@ -239,6 +241,7 @@ def write_tiny_puf(
),
"rent": np.zeros(person_count, dtype=np.float32),
"real_estate_taxes": raw["E18500"].to_numpy(dtype=np.float32),
"primary_residence_value": np.zeros(person_count, dtype=np.float32),
"deductible_mortgage_interest": raw["E19200"].to_numpy(dtype=np.float32),
"is_tax_unit_head": np.ones(person_count, dtype=np.bool_),
"is_tax_unit_spouse": np.zeros(person_count, dtype=np.bool_),