Merged
3 changes: 3 additions & 0 deletions AGENTS.md
@@ -23,6 +23,9 @@ read `docs/engineering/skills/pipeline_operations.md`.
When adding, changing, or reviewing calibration target definitions, read
`docs/engineering/skills/calibration_targets.md`.

When adding, changing, or reviewing donor-survey imputations, read
`docs/engineering/skills/imputation.md`.

## Calibration targets

Manually sourced national or local-file calibration targets must be registered
1 change: 1 addition & 0 deletions changelog.d/1103.changed
@@ -0,0 +1 @@
Use target-specific source-quality filters for ACS and SIPP imputations.
2 changes: 2 additions & 0 deletions docs/engineering/skills/README.md
@@ -14,6 +14,8 @@ Current skills:
  notes.
- `github-prs.md`: same-repository PR workflow, PR head verification, and title
  conventions.
- `imputation.md`: donor-survey imputation provenance rules, including
  target-level exclusion of allocated source values.
- `pipeline_docs.md`: decorator-backed pipeline map maintenance and generated
  pydoc-style artifacts.
- `pipeline_operations.md`: model-neutral workflow for diagnosing deployed Modal
36 changes: 36 additions & 0 deletions docs/engineering/skills/imputation.md
@@ -0,0 +1,36 @@
# Imputation

Use this guide when adding, changing, or reviewing donor-survey imputations.

## Source Provenance

Do not train an imputation target on donor rows whose source value for that
target is itself allocated, hot-decked, edited, or imputed by the source survey.
Wire source-survey allocation or quality flags into the training frame whenever
the donor file exposes them.

Apply this rule at the target-variable level, not the donor-row level. A donor
row with observed tip income but allocated bank-account assets can train
`tip_income`; the same row must be excluded from the `bank_account_assets`
training target. Use `policyengine_us_data.utils.source_quality` to build
target masks, then pass them to `microimpute` through `target_filters` or
`row_filter` so the filtering logic lives in the imputation library rather than
in one-off model wrappers.
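
The target-level rule can be sketched in plain Python. This is an illustration only: it does not use the real `policyengine_us_data.utils.source_quality` or `microimpute` APIs, and the `_allocated` flag suffix, the helper names, and the sample rows are all hypothetical.

```python
from typing import Dict, List

# Hypothetical donor rows. Each target carries a value plus an allocation
# flag; True means the source survey allocated or imputed that value.
donor_rows: List[dict] = [
    {"id": 1, "tip_income": 1200.0, "tip_income_allocated": False,
     "bank_account_assets": 5000.0, "bank_account_assets_allocated": True},
    {"id": 2, "tip_income": 300.0, "tip_income_allocated": False,
     "bank_account_assets": 800.0, "bank_account_assets_allocated": False},
    {"id": 3, "tip_income": 50.0, "tip_income_allocated": True,
     "bank_account_assets": 2500.0, "bank_account_assets_allocated": False},
]


def build_target_masks(
    rows: List[dict], targets: List[str], flag_suffix: str = "_allocated"
) -> Dict[str, List[bool]]:
    """Return one keep-mask per target: True where the value was observed."""
    return {t: [not row[t + flag_suffix] for row in rows] for t in targets}


def training_rows_for(
    rows: List[dict], target: str, masks: Dict[str, List[bool]]
) -> List[dict]:
    """Select the donor rows allowed to train a single target."""
    return [row for row, keep in zip(rows, masks[target]) if keep]


targets = ["tip_income", "bank_account_assets"]
masks = build_target_masks(donor_rows, targets)

tip_ids = [r["id"] for r in training_rows_for(donor_rows, "tip_income", masks)]
bank_ids = [
    r["id"] for r in training_rows_for(donor_rows, "bank_account_assets", masks)
]
```

In this sketch row 1 trains `tip_income` but not `bank_account_assets`, and row 3 does the reverse. A row-level filter would have kept only row 2 for both targets.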

Do not drop final CPS, ECPS, or calibration records solely because a donor
survey target was excluded from training. The exclusion applies to donor
training rows only; recipient datasets should remain complete.

When a donor source lacks target-level quality flags, document that limitation
near the imputation code and keep the training surface structured so flags can
be added later.

## Tests

Add focused regression tests when adding a donor imputation or a source-quality
flag:

- allocation flags are read from the donor source,
- allocated source values are excluded for the affected target,
- unrelated observed targets from the same row can still train, and
- legacy and current imputation surfaces use the same target provenance rule.
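
A minimal sketch of what such regression tests might look like, written as plain `test_*` functions that pytest would collect. The `observed_mask` helper and the `_allocated` flag naming are hypothetical stand-ins, not the project's actual source-quality utilities.

```python
from typing import Dict, List


def observed_mask(rows: List[Dict], target: str) -> List[bool]:
    """Hypothetical helper: True where the donor value for `target` was observed.

    Raises KeyError when the allocation flag is missing, so a dropped flag
    column fails loudly instead of silently training on allocated values.
    """
    return [not row[f"{target}_allocated"] for row in rows]


# One donor row: observed tip income, allocated bank-account assets.
DONOR = [
    {"tip_income": 900.0, "tip_income_allocated": False,
     "bank_account_assets": 4000.0, "bank_account_assets_allocated": True},
]


def test_allocated_value_excluded_for_affected_target():
    assert observed_mask(DONOR, "bank_account_assets") == [False]


def test_unrelated_observed_target_still_trains():
    assert observed_mask(DONOR, "tip_income") == [True]


def test_missing_flag_fails_loudly():
    rows = [{"tip_income": 900.0}]  # no allocation flag present
    try:
        observed_mask(rows, "tip_income")
    except KeyError:
        return
    raise AssertionError("expected KeyError for a missing allocation flag")
```

The third test reflects one possible design choice: treating a missing flag as an error rather than as "observed" keeps the flag wiring from regressing unnoticed.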
23 changes: 23 additions & 0 deletions policyengine_us_data/calibration/puf_impute.py
@@ -325,6 +325,29 @@ def _qrf_ss_shares(
    for sub in shares:
        shares[sub] = np.where(total > 0, shares[sub] / total, 0.0)

    if (
        "age" in data
        and "social_security_retirement" in shares
        and "social_security_disability" in shares
    ):
        # Preserve QRF survivor/dependent predictions, but anchor the
        # retirement-vs-disability split to the same age rule as the fallback.
        age = data["age"][time_period][:n_cps][puf_has_ss]
        is_old = age >= MINIMUM_RETIREMENT_AGE
        retirement_or_disability = (
            shares["social_security_retirement"] + shares["social_security_disability"]
        )
        shares["social_security_retirement"] = np.where(
            is_old,
            retirement_or_disability,
            0.0,
        )
        shares["social_security_disability"] = np.where(
            is_old,
            0.0,
            retirement_or_disability,
        )

    del fitted, predictions
    gc.collect()
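
The effect of the age rule in this hunk can be seen on a single record. The following is a scalar sketch, not the vectorized `np.where` code above. The value 62 for `MINIMUM_RETIREMENT_AGE`, the `social_security_survivors` key, and the sample shares are assumptions made only for illustration.

```python
from typing import Dict

# Assumption: the real MINIMUM_RETIREMENT_AGE constant is defined elsewhere
# in the package; 62 is used here only for illustration.
MINIMUM_RETIREMENT_AGE = 62


def anchor_split(shares: Dict[str, float], age: int) -> Dict[str, float]:
    """Apply the retirement-vs-disability age rule to one record's shares."""
    combined = (
        shares["social_security_retirement"]
        + shares["social_security_disability"]
    )
    is_old = age >= MINIMUM_RETIREMENT_AGE
    out = dict(shares)  # other sub-components pass through unchanged
    out["social_security_retirement"] = combined if is_old else 0.0
    out["social_security_disability"] = 0.0 if is_old else combined
    return out


# Hypothetical normalized predictions for one record.
predicted = {
    "social_security_retirement": 0.25,
    "social_security_disability": 0.5,
    "social_security_survivors": 0.25,
}

young = anchor_split(predicted, age=45)  # combined 0.75 goes to disability
old = anchor_split(predicted, age=70)    # combined 0.75 goes to retirement
```

Because the combined retirement-plus-disability share is reassigned rather than rescaled, the shares still sum to the same total and the remaining predicted component is left as it was.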
