Commit 9980b41

Refine SSI disability training filters and targets (#1131)
* Refine SIPP SSI disability training filters
* Fix SSI disability filter PR checks
* Remove SSI fiscal-year calibration variable
* Use SSA actual SSI payments target
* Use 2024 SIPP source for imputations
* Format SIPP dataset code
* Update stage validation SSI target
1 parent 382323e commit 9980b41

19 files changed

Lines changed: 337 additions & 354 deletions

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -2,9 +2,12 @@
 **/__pycache__
 **/.DS_STORE
 **/*.h5
+**/*.h5.lock
 **/*.npy
 **/*.csv
 **/*.csv.gz
+**/pu*_csv.zip
+**/*.clone_diagnostics.json
 **/_build
 **/*.pkl
 **/*.db

changelog.d/1131.fixed

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+Refine the SIPP SSI disability training candidate screen to use SGA and approximate SSI countable income, and remove the manual cache-version suffix.
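
The changelog entry names two ingredients of the candidate screen, SGA and approximate SSI countable income, but this excerpt does not show how the commit combines them. The sketch below is an illustration only. It assumes the standard federal monthly exclusions ($20 general, $65 earned, then half of remaining earnings), the 2024 non-blind SGA amount of $1,550, and the 2024 individual federal benefit rate of $943. The function names and the way the two tests are combined are hypothetical, not taken from the repository.

```python
# Assumed 2024 monthly federal amounts; not read from the commit.
GENERAL_EXCLUSION = 20.0       # applied to unearned income first
EARNED_EXCLUSION = 65.0        # applied to earnings
SGA_NON_BLIND_2024 = 1_550.0   # substantial gainful activity threshold
FBR_INDIVIDUAL_2024 = 943.0    # federal benefit rate for an individual


def approx_countable_income(earned: float, unearned: float) -> float:
    """Approximate monthly SSI countable income (hypothetical helper)."""
    # Any part of the $20 exclusion not used on unearned income
    # carries over to earnings.
    unused_general = max(GENERAL_EXCLUSION - unearned, 0.0)
    countable_unearned = max(unearned - GENERAL_EXCLUSION, 0.0)
    remaining_earned = max(earned - unused_general - EARNED_EXCLUSION, 0.0)
    # Only half of the remaining earnings count.
    return countable_unearned + remaining_earned / 2.0


def passes_candidate_screen(earned: float, unearned: float) -> bool:
    """Hypothetical screen: earnings at or below SGA and countable income below the FBR."""
    below_sga = earned <= SGA_NON_BLIND_2024
    income_ok = approx_countable_income(earned, unearned) < FBR_INDIVIDUAL_2024
    return below_sga and income_ok
```

Under these assumptions, a person with $1,000 in monthly earnings and no unearned income has $457.50 of countable income and passes, while a person earning $2,000 fails on the SGA test alone.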

policyengine_us_data/calibration/chunked_matrix_assembler.py

Lines changed: 6 additions & 2 deletions
@@ -380,7 +380,9 @@ def run_single_chunk(self, chunk_id: int) -> ChunkResult:
                 continue
             try:
                 hh_vars[variable] = chunk_sim.calculate(
-                    variable, state.time_period, map_to="household"
+                    variable,
+                    state.time_period,
+                    map_to="household",
                 ).values.astype(np.float32)
             except Exception as exc:
                 logger.warning(
@@ -394,7 +396,9 @@ def run_single_chunk(self, chunk_id: int) -> ChunkResult:
                 continue
             try:
                 target_entity_vars[variable] = chunk_sim.calculate(
-                    variable, state.time_period, map_to=entity_key
+                    variable,
+                    state.time_period,
+                    map_to=entity_key,
                 ).values.astype(np.float32)
             except Exception as exc:
                 logger.warning(

policyengine_us_data/calibration/sanity_checks.py

Lines changed: 4 additions & 3 deletions
@@ -36,9 +36,7 @@
     "income_tax_before_credits",
 ]
 
-COMPUTED_KEY_MONETARY_VARS = [
-    "ssi_federal_fiscal_year_outlays",
-]
+COMPUTED_KEY_MONETARY_VARS = []
 
 TAKEUP_VARS = [
     "takes_up_snap_if_eligible",
@@ -665,6 +663,9 @@ def _append_finite_check(var: str, vals) -> None:
 
 
 def _computed_key_monetary_values(h5_path: str, period: int) -> dict[str, np.ndarray]:
+    if not COMPUTED_KEY_MONETARY_VARS:
+        return {}
+
     try:
         from policyengine_us import Microsimulation
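
The guard added to `_computed_key_monetary_values` returns before the `policyengine_us` import is attempted whenever the variable list is empty. A minimal sketch of that pattern follows. `COMPUTED_VARS`, `computed_values`, and `load_simulation` are stand-in names for illustration, and the loader is passed in as a callable so the example runs without the real simulation package.

```python
COMPUTED_VARS: list[str] = []  # stand-in for the now-empty module-level list


def computed_values(load_simulation, variables=None) -> dict:
    """Return computed values, skipping the costly loader when nothing is requested."""
    variables = COMPUTED_VARS if variables is None else variables
    if not variables:
        # Nothing to compute: return early without touching the loader.
        return {}
    calculate = load_simulation()
    return {name: calculate(name) for name in variables}


load_calls = []


def fake_loader():
    """Hypothetical loader that records each call and returns a toy calculator."""
    load_calls.append(1)
    return lambda name: name.upper()


empty_result = computed_values(fake_loader)
filled_result = computed_values(fake_loader, ["ssi"])
```

With the empty default list the loader is never invoked, which is the behavior the diff's early return is there to provide.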

policyengine_us_data/calibration/source_impute.py

Lines changed: 21 additions & 16 deletions
@@ -51,6 +51,7 @@
     SSI_DISABILITY_EXPORT_VARIABLES,
     VEHICLE_MODEL_PREDICTORS,
     build_vehicle_training_frame,
+    ensure_sipp_file,
     get_ssi_disability_model,
     predict_ssi_disability_criteria,
     preserve_under_65_ssi_disability_criteria,
@@ -663,16 +664,26 @@ def _impute_sipp(
     Returns:
         Updated data dict.
     """
-    from huggingface_hub import hf_hub_download
-    from policyengine_us_data.storage import STORAGE_FOLDER
-
-    hf_hub_download(
-        repo_id="PolicyEngine/policyengine-us-data",
-        filename="pu2023_slim.csv",
-        repo_type="model",
-        local_dir=STORAGE_FOLDER,
+    tip_cols = (
+        [
+            "SSUID",
+            "MONTHCODE",
+            "WPFINWGT",
+            "TAGE",
+            "TPTOTINC",
+        ]
+        + SIPP_JOB_OCCUPATION_COLUMNS
+        + SIPP_TIP_AMOUNT_COLUMNS
+        + [
+            SIPP_TIP_AMOUNT_TO_ALLOCATION_COLUMN[column]
+            for column in SIPP_TIP_AMOUNT_COLUMNS
+        ]
+    )
+    sipp_df = pd.read_csv(
+        ensure_sipp_file(),
+        delimiter="|",
+        usecols=tip_cols,
     )
-    sipp_df = pd.read_csv(STORAGE_FOLDER / "pu2023_slim.csv")
 
     tip_amount_columns = [
         column for column in SIPP_TIP_AMOUNT_COLUMNS if column in sipp_df
@@ -788,12 +799,6 @@ def _impute_sipp(
 
     # Asset imputation
     try:
-        hf_hub_download(
-            repo_id="PolicyEngine/policyengine-us-data",
-            filename="pu2023.csv",
-            repo_type="model",
-            local_dir=STORAGE_FOLDER,
-        )
         asset_cols = (
             [
                 "SSUID",
@@ -817,7 +822,7 @@ def _impute_sipp(
             + SIPP_ASSET_ALLOCATION_COLUMNS
         )
         asset_df = pd.read_csv(
-            STORAGE_FOLDER / "pu2023.csv",
+            ensure_sipp_file(),
             delimiter="|",
             usecols=asset_cols,
         )
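
The tip hunk now builds an explicit column list and reads only those columns from the pipe-delimited file, instead of depending on a pre-slimmed CSV. The sketch below shows the same idea with the standard library's `csv` module so that it runs without pandas; the diff itself uses `pd.read_csv(..., delimiter="|", usecols=...)`. The `TIP_AMT_*` and `TIP_FLAG_*` names are placeholders, since the contents of the real column constants are not shown in this diff, and the December filter follows the SIPP README in this commit rather than this hunk.

```python
import csv
import io

# Placeholder names; the real lists live in constants not shown in this diff.
AMOUNT_COLUMNS = ["TIP_AMT_1", "TIP_AMT_2"]
AMOUNT_TO_ALLOCATION = {"TIP_AMT_1": "TIP_FLAG_1", "TIP_AMT_2": "TIP_FLAG_2"}


def build_columns() -> list[str]:
    """Base identifiers, then amount columns, then their allocation flags."""
    base = ["SSUID", "MONTHCODE", "WPFINWGT", "TAGE", "TPTOTINC"]
    flags = [AMOUNT_TO_ALLOCATION[column] for column in AMOUNT_COLUMNS]
    return base + AMOUNT_COLUMNS + flags


def read_selected(handle, columns, month=12) -> list[dict]:
    """Keep only the requested columns for rows in the given month."""
    reader = csv.DictReader(handle, delimiter="|")
    rows = []
    for record in reader:
        if int(record["MONTHCODE"]) != month:
            continue  # one row per person-year: December only
        rows.append({name: record[name] for name in columns})
    return rows


SAMPLE = (
    "SSUID|MONTHCODE|WPFINWGT|TAGE|TPTOTINC|TIP_AMT_1|TIP_AMT_2|TIP_FLAG_1|TIP_FLAG_2|UNUSED\n"
    "1|11|100.0|40|3000|50|0|0|0|x\n"
    "1|12|100.0|40|3100|60|0|0|0|y\n"
)
rows = read_selected(io.StringIO(SAMPLE), build_columns())
```

Selecting columns at read time keeps memory use down on the full person-month file, which is presumably why the slim variant is no longer needed.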

policyengine_us_data/calibration/target_config.yaml

Lines changed: 1 addition & 1 deletion
@@ -205,7 +205,7 @@ include:
     geo_level: national
   - variable: social_security_survivors
     geo_level: national
-  - variable: ssi_federal_fiscal_year_outlays
+  - variable: ssi
     geo_level: national
   - variable: person_count
     geo_level: national

policyengine_us_data/datasets/sipp/README.md

Lines changed: 8 additions & 7 deletions
@@ -18,8 +18,8 @@ SIPP panel wave. These are the canonical reference for every variable
 name, value code, and weighting construct used by the code in this
 folder:
 
-- [SIPP 2023 public-use data dictionary (PDF)](https://www2.census.gov/programs-surveys/sipp/tech-documentation/data-dictionaries/2023/2023_SIPP_Data_Dictionary.pdf)
-- [SIPP 2023 users' guide (PDF, Aug 2026 revision)](https://www2.census.gov/programs-surveys/sipp/tech-documentation/methodology/2023_SIPP_Users_Guide_AUG26.pdf)
+- [SIPP 2024 public-use data dictionary (PDF)](https://www2.census.gov/programs-surveys/sipp/tech-documentation/data-dictionaries/2024/2024_SIPP_Data_Dictionary.pdf)
+- [SIPP 2024 users' guide (PDF)](https://www2.census.gov/programs-surveys/sipp/tech-documentation/methodology/2024_SIPP_Users_Guide.pdf)
 
 See also:
 
@@ -30,15 +30,16 @@ See also:
 ## Data products in this folder
 
 - `sipp.py` — trains and caches QRF imputation models (`get_tip_model`,
-  `get_asset_model`, `get_vehicle_model`) from SIPP 2023 person-month
+  `get_asset_model`, `get_vehicle_model`) from SIPP 2024 person-month
   data. The training frame is filtered to `MONTHCODE == 12` (December)
   so every row represents one person-year rather than twelve annualized
   months.
 
-The raw SIPP CSVs (`pu2023.csv` and the slim variant `pu2023_slim.csv`)
-are mirrored on the `PolicyEngine/policyengine-us-data` HuggingFace model
-repo and downloaded on demand when a training run is needed. They are
-not vendored in this Git repository.
+The raw SIPP CSV (`pu2024.csv`) is downloaded on demand when a training
+run is needed. The downloader first checks the
+`PolicyEngine/policyengine-us-data` HuggingFace model repo for a cached
+copy, then falls back to Census's public `pu2024_csv.zip` archive. The raw
+file is not vendored in this Git repository.
 
 ## Licensing
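
The README change describes a two-step lookup: try the HuggingFace mirror first, then fall back to the Census archive. The body of `ensure_sipp_file` is not part of this excerpt, so the sketch below is only a generic version of that fallback pattern. `ensure_file` and both fetcher functions are hypothetical, and no network access is performed.

```python
import tempfile
from pathlib import Path


def ensure_file(target: Path, sources) -> Path:
    """Return `target`, fetching it from the first source that succeeds.

    `sources` is an ordered list of (name, fetch) pairs, where `fetch`
    takes the destination path and writes the file there.
    """
    if target.exists():
        return target  # already on disk, nothing to download
    errors = []
    for name, fetch in sources:
        try:
            fetch(target)
        except Exception as exc:
            errors.append(f"{name}: {exc}")
            continue  # try the next source
        if target.exists():
            return target
        errors.append(f"{name}: produced no file")
    raise FileNotFoundError("; ".join(errors))


def mirror_miss(path: Path) -> None:
    """Hypothetical primary source that has no cached copy."""
    raise LookupError("not cached")


def census_archive(path: Path) -> None:
    """Hypothetical fallback that writes a stub pipe-delimited file."""
    path.write_text("SSUID|MONTHCODE\n")


with tempfile.TemporaryDirectory() as tmp:
    found = ensure_file(
        Path(tmp) / "pu2024.csv",
        [("mirror", mirror_miss), ("census", census_archive)],
    )
    content = found.read_text()
```

Collecting the per-source errors before raising keeps the final message useful when both the mirror and the archive are unavailable.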
