Add second-stage QRF imputation of FRS-only variables on SPI-donor rows

MaxGhenis · claude · MaxGhenis · commit c8865476ed30 · 2026-04-18T07:37:29.000-04:00
The enhanced-FRS pipeline's zero-weight SPI-donor subsample has its
income columns rewritten by a SPI-trained first-stage QRF, but every
other FRS column (benefit `_reported` values, pension contributions,
savings income, council tax benefit) stays as whatever middle-income FRS
donor was sampled. After calibration upweight, this cascades into false
benefit aggregates, distorted allowances, and housing-cost mismatches;
the tracking issue attributes about £4-6bn of benefit-aggregate drift to
this failure mode (most visibly the "£1M earners with zero everything
else" pattern described in #1621).

Adds a second-stage QRF (`frs_only.py`) that trains on the original
full-FRS build with predictors = [demographics + first-stage income
outputs] and outputs = a curated list of FRS-only variables, then
predicts for every SPI-donor row. High-earner predictions collapse UC /
HB / WTC receipt toward zero, pension contributions rescale, and savings
interest correlates with imputed income. Mirrors the CPS-only stage-2
QRF introduced in policyengine-us-data#589.
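
A toy sketch of why conditioning on imputed income collapses benefit receipt for high earners. It does not use the project's QRF: a nearest-neighbour donor draw stands in for it, and the function name `impute_from_similar_donors` and all the data are hypothetical.

```python
import numpy as np


def impute_from_similar_donors(train_x, train_y, target_x, k=5, seed=0):
    """For each target row, copy the output of one of its k nearest training rows.

    Stand-in for the QRF: a crude conditional draw, not a quantile forest.
    """
    rng = np.random.default_rng(seed)
    train_x = np.asarray(train_x, dtype=float)
    train_y = np.asarray(train_y, dtype=float)
    target_x = np.asarray(target_x, dtype=float)

    # Standardise predictors so no single column dominates the distance.
    mean = train_x.mean(axis=0)
    std = train_x.std(axis=0)
    std[std == 0] = 1.0  # constant columns carry no information
    train_z = (train_x - mean) / std
    target_z = (target_x - mean) / std

    out = np.empty(len(target_z))
    for i, row in enumerate(target_z):
        distances = np.linalg.norm(train_z - row, axis=1)
        nearest = np.argsort(distances)[:k]
        out[i] = train_y[rng.choice(nearest)]
    return out


# Toy "FRS": predictors are (age, employment_income); the output is a
# UC-like amount that only low earners report.
income = np.array(
    [8_000, 10_000, 12_000, 15_000, 90_000,
     120_000, 150_000, 200_000, 250_000, 300_000],
    dtype=float,
)
age = np.full_like(income, 40.0)
train_x = np.column_stack([age, income])
train_y = np.where(income < 20_000, 120.0, 0.0)

# Toy SPI-donor rows after stage 1: very high imputed incomes.
high_earners = np.array([[40.0, 1_000_000.0], [40.0, 2_000_000.0]])
imputed_high = impute_from_similar_donors(train_x, train_y, high_earners, k=3)

# A low-income row keeps a positive amount, since its neighbours report one.
low_earner = np.array([[40.0, 9_000.0]])
imputed_low = impute_from_similar_donors(train_x, train_y, low_earner, k=3)
```

Because the nearest training rows to a £1M or £2M income are the highest earners in the toy data, none of whom report the benefit, the imputed amount is zero regardless of what the original donor row carried.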

Unit tests cover: that outputs are non-negative, that non-target columns
are untouched, that missing train/target columns are skipped without
raising, and that predictions track the training-data income → receipt gradient.
The real full-FRS retrain runs in CI via the integration data-build.
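
A minimal sketch of the write-back invariants those tests describe, assuming predictions arrive as a pandas DataFrame. The helper name `write_back` and the toy frames are hypothetical, not part of the PR.

```python
import numpy as np
import pandas as pd


def write_back(target, predictions, wanted):
    """Copy cleaned predictions into a copy of `target`.

    Only columns present in both frames are written; NaNs become zero
    and negative values are clamped to zero.
    """
    result = target.copy()
    usable = [
        c for c in wanted if c in predictions.columns and c in result.columns
    ]
    for column in usable:
        values = predictions[column].fillna(0.0).to_numpy(dtype=float)
        result[column] = np.maximum(values, 0.0)
    return result, usable


target = pd.DataFrame(
    {"universal_credit_reported": [120.0, 80.0], "age": [40, 55]}
)
predictions = pd.DataFrame({"universal_credit_reported": [-3.0, np.nan]})

result, usable = write_back(
    target,
    predictions,
    ["universal_credit_reported", "pension_credit_reported"],
)
```

Here `pension_credit_reported` is requested but absent from both frames, so it is skipped, and `age` is never touched because it is not in the requested list.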

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/changelog.d/add-second-stage-qrf-frs-vars.changed.md b/changelog.d/add-second-stage-qrf-frs-vars.changed.md
@@ -0,0 +1 @@
+Add second-stage QRF imputation of FRS-only variables on SPI-donor rows. After the first-stage SPI-trained QRF overwrites income components on the zero-weight subsample, a new second-stage QRF trained on the full FRS rewrites benefit `_reported` columns, pension contributions, and savings-income so they correlate with the freshly-imputed incomes instead of staying as whatever middle-income FRS donor was sampled. Mirrors the `policyengine-us-data#589` pattern. Prevents synthetic £2 M earners from carrying a middle-income donor's UC / housing-benefit receipt into calibration, which was blowing up benefit aggregates under upweight.
diff --git a/policyengine_uk_data/datasets/imputations/__init__.py b/policyengine_uk_data/datasets/imputations/__init__.py
@@ -2,6 +2,7 @@
 from .vat import *
 from .wealth import *
 from .income import *
+from .frs_only import impute_frs_only_variables
 from .capital_gains import *
 from .services import impute_services
 from .salary_sacrifice import impute_salary_sacrifice
diff --git a/policyengine_uk_data/datasets/imputations/frs_only.py b/policyengine_uk_data/datasets/imputations/frs_only.py
@@ -0,0 +1,210 @@
+"""Second-stage QRF imputation of FRS-only variables on SPI-donor rows.
+
+The enhanced-FRS pipeline in :mod:`income` creates a zero-weight subsample
+of the FRS that will be upweighted during calibration to fit SPI-derived
+high-income targets. The first-stage QRF (trained on SPI) replaces only
+the six core income components (plus ``gift_aid`` and
+``charitable_investment_gifts``) on those rows. Every other FRS column —
+benefit ``_reported`` values, pension contributions, savings, rent,
+mortgage, council tax — stays at whatever the middle-income FRS donor
+whose row was sampled happened to report.
+
+That produces implausible joint distributions on the synthetic
+high-income side. A row with imputed £2 M self-employment income carries
+its donor's £120 UC ``_reported`` value, its donor's tiny pension
+contribution, and its donor's typical rent. Under calibration upweight
+these cascade into false benefit aggregates, depressed allowances, and
+distorted housing-cost totals.
+
+This second-stage QRF trains on the original FRS with predictors =
+[demographics + first-stage income outputs] and outputs = a curated list
+of FRS-only variables. For each SPI-donor row, it substitutes the
+predicted value drawn from FRS respondents with similar demographics and
+post-stage-1 incomes. Benefit ``_reported`` amounts for high earners
+naturally collapse to zero (no high-earner FRS respondent reports UC),
+pension contributions rescale, and savings interest / rent correlate
+with income instead of with the random FRS donor's draw.
+
+Mirrors the US ``_impute_cps_only_variables`` approach introduced in
+``policyengine-us-data#589`` but targets UK-specific FRS variables.
+"""
+
+from __future__ import annotations
+
+import logging
+
+import numpy as np
+import pandas as pd
+from policyengine_uk.data import UKSingleYearDataset
+
+logger = logging.getLogger(__name__)
+
+
+STAGE2_DEMOGRAPHIC_PREDICTORS = [
+    "age",
+    "gender",
+    "region",
+]
+
+# Predictors drawn from the first-stage QRF output columns. They are the
+# same six income components that the first stage imputes from SPI.
+STAGE2_INCOME_PREDICTORS = [
+    "employment_income",
+    "self_employment_income",
+    "savings_interest_income",
+    "dividend_income",
+    "private_pension_income",
+    "property_income",
+]
+
+# FRS-only variables the second stage replaces on SPI-donor rows. Kept
+# conservative: benefit ``_reported`` columns and pension contributions
+# are the leading sources of cross-income inconsistency, and are
+# well-populated in the base FRS build so training is stable.
+FRS_ONLY_PERSON_VARIABLES = [
+    # Pension contributions
+    "employee_pension_contributions",
+    "employer_pension_contributions",
+    "personal_pension_contributions",
+    "pension_contributions_via_salary_sacrifice",
+    # Savings-related
+    "tax_free_savings_income",
+    # Benefit `_reported` columns
+    "universal_credit_reported",
+    "pension_credit_reported",
+    "child_benefit_reported",
+    "housing_benefit_reported",
+    "income_support_reported",
+    "working_tax_credit_reported",
+    "child_tax_credit_reported",
+    "attendance_allowance_reported",
+    "state_pension_reported",
+    "dla_sc_reported",
+    "dla_m_reported",
+    "pip_m_reported",
+    "pip_dl_reported",
+    "sda_reported",
+    "carers_allowance_reported",
+    "iidb_reported",
+    "afcs_reported",
+    "bsp_reported",
+    "incapacity_benefit_reported",
+    "maternity_allowance_reported",
+    "winter_fuel_allowance_reported",
+    "council_tax_benefit_reported",
+    "jsa_contrib_reported",
+    "jsa_income_reported",
+    "esa_contrib_reported",
+    "esa_income_reported",
+]
+
+
+def _one_hot_encode(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
+    """Return ``df`` with object-typed ``columns`` one-hot encoded.
+
+    QRF predictors must be numeric. Uses ``pandas.get_dummies`` so
+    identical category sets are produced from the same input data.
+    """
+    return pd.get_dummies(df, columns=columns, drop_first=False, dtype=float)
+
+
+def _align_columns(
+    train_df: pd.DataFrame, test_df: pd.DataFrame
+) -> tuple[pd.DataFrame, pd.DataFrame]:
+    """Ensure train/test share the same columns in the same order.
+
+    After independent ``get_dummies`` calls on train and test, one-hot
+    expansions can diverge if a category appears in one set and not the
+    other. Reindex both to the union of columns, filling missing cells
+    with zero.
+    """
+    columns = sorted(set(train_df.columns) | set(test_df.columns))
+    return (
+        train_df.reindex(columns=columns, fill_value=0.0),
+        test_df.reindex(columns=columns, fill_value=0.0),
+    )
+
+
+def impute_frs_only_variables(
+    train_dataset: UKSingleYearDataset,
+    target_dataset: UKSingleYearDataset,
+) -> UKSingleYearDataset:
+    """Impute FRS-only person variables onto ``target_dataset``.
+
+    ``train_dataset`` must be a full FRS build (before income
+    imputation) so the training rows preserve the original co-occurrence
+    of income and every FRS-only variable. ``target_dataset`` is the
+    SPI-donor subsample after the first-stage QRF has overwritten its
+    income columns.
+
+    A single multi-output QRF is fitted on the training data and used
+    to predict values for every row of ``target_dataset``; predictions
+    replace the existing (donor-leaked) values in
+    ``FRS_ONLY_PERSON_VARIABLES`` only. Variables absent from either
+    frame are skipped with a logged warning.
+    """
+    from policyengine_uk_data.utils.qrf import QRF
+
+    target_dataset = target_dataset.copy()
+
+    train_person = train_dataset.person
+    target_person = target_dataset.person
+
+    # Use only variables present in both frames.
+    outputs = [
+        v
+        for v in FRS_ONLY_PERSON_VARIABLES
+        if v in train_person.columns and v in target_person.columns
+    ]
+    missing = set(FRS_ONLY_PERSON_VARIABLES) - set(outputs)
+    if missing:
+        logger.warning(
+            "Stage-2 FRS-only imputation: %d variables absent from "
+            "train/target frames, skipped: %s",
+            len(missing),
+            sorted(missing),
+        )
+    if not outputs:
+        logger.warning(
+            "Stage-2 FRS-only imputation: no output variables available; "
+            "returning target_dataset unchanged."
+        )
+        return target_dataset
+
+    predictors = STAGE2_DEMOGRAPHIC_PREDICTORS + STAGE2_INCOME_PREDICTORS
+
+    train_inputs_raw = train_person[predictors].copy()
+    target_inputs_raw = target_person[predictors].copy()
+
+    train_inputs = _one_hot_encode(train_inputs_raw, columns=["gender", "region"])
+    target_inputs = _one_hot_encode(target_inputs_raw, columns=["gender", "region"])
+    train_inputs, target_inputs = _align_columns(train_inputs, target_inputs)
+
+    # Replace NaNs in outputs with 0 so the QRF trains on clean targets;
+    # FRS-only variables are almost all zero-heavy "amount if eligible"
+    # columns that default to zero when unreported.
+    train_outputs = train_person[outputs].fillna(0).astype(float)
+
+    logger.info(
+        "Stage-2 FRS-only imputation: %d outputs, training on %d FRS "
+        "persons, predicting for %d SPI-donor persons",
+        len(outputs),
+        len(train_inputs),
+        len(target_inputs),
+    )
+
+    model = QRF()
+    model.fit(train_inputs, train_outputs)
+    predictions = model.predict(target_inputs)
+
+    # The QRF occasionally returns NaN for extreme predictor combos;
+    # fill with zero (the population-typical value for these variables).
+    predictions = predictions.fillna(0.0)
+
+    for column in outputs:
+        # Clamp negative predictions to zero: these columns represent received
+        # amounts or contributions and are non-negative by construction.
+        values = np.maximum(predictions[column].values, 0.0)
+        target_dataset.person[column] = values
+
+    return target_dataset
diff --git a/policyengine_uk_data/datasets/imputations/income.py b/policyengine_uk_data/datasets/imputations/income.py
@@ -256,6 +256,21 @@ def impute_income(dataset: UKSingleYearDataset) -> UKSingleYearDataset:
         IMPUTATIONS,
     )
 
+    # Second-stage QRF: rewrite FRS-only variables (benefit `_reported`
+    # columns, pension contributions, savings, etc.) on the SPI-donor rows
+    # so they correlate with the freshly-imputed incomes instead of staying
+    # as whatever middle-income FRS donor was sampled. Without this the
+    # £2M imputed earners keep their donor's £120 UC receipt, blowing up
+    # benefit aggregates under calibration upweight.
+    from policyengine_uk_data.datasets.imputations.frs_only import (
+        impute_frs_only_variables,
+    )
+
+    zero_weight_copy = impute_frs_only_variables(
+        train_dataset=dataset,
+        target_dataset=zero_weight_copy,
+    )
+
     dataset = impute_over_incomes(
         dataset,
         model,
diff --git a/policyengine_uk_data/tests/test_frs_only_imputation.py b/policyengine_uk_data/tests/test_frs_only_imputation.py
diff --git a/uv.lock b/uv.lock
