Commit ee20d6a

Merge main into fix/population-rescale-217; resolve test_population + uv.lock
Two conflicts from main picking up subsequent work:

- `policyengine_uk_data/tests/test_population.py`: main adopted a 4% tolerance in #366 with the explanatory comment about the April 2026 data-pipeline improvements (#359/#362/#363) pulling the baseline overshoot from ~6.5% down to ~1.6–3.3%. Kept main's version since 4% is the CI-stable value Max validated against the post-April pipeline; the branch's 3% was marginal before those improvements landed.
- `uv.lock`: both sides had stale self-version entries (1.52.0 vs 1.52.1). Regenerated via `uv lock --upgrade-package policyengine-uk-data` so it matches the current `pyproject.toml` version (1.53.1).

The asymmetric-loss change (`log((1+x)/(1+y))**2`) and the population-target loss-weight experiment from earlier commits remain on this branch as separate, still-open proposals. Nikhil's review asked for a before/after sweep across all targets before either is merged.

No functional change in this commit beyond the conflict resolution.
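
The squared log-ratio expression quoted in the commit message can be evaluated directly. The following is a minimal sketch, assuming `x` is a calibrated estimate and `y` is its target (the message does not say which is which); the function name `log_ratio_loss` and the numbers are hypothetical and not taken from the repository.

```python
import math


def log_ratio_loss(estimate: float, target: float) -> float:
    """Squared log-ratio loss: log((1 + x) / (1 + y)) ** 2."""
    # log1p(x) - log1p(y) equals log((1 + x) / (1 + y)) for x, y > -1.
    return (math.log1p(estimate) - math.log1p(target)) ** 2


# Hypothetical values: a target of 100 missed by 10 in either direction.
over = log_ratio_loss(110.0, 100.0)
under = log_ratio_loss(90.0, 100.0)
print(f"overshoot by 10:  {over:.5f}")
print(f"undershoot by 10: {under:.5f}")
```

Under this expression an undershoot costs somewhat more than an overshoot of the same absolute size, which is one sense in which the loss is asymmetric. Whether that is the property the branch relies on is not stated in the commit.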
2 parents 53a866b + 163c432 commit ee20d6a

16 files changed

Lines changed: 765 additions & 19 deletions

CHANGELOG.md

Lines changed: 24 additions & 0 deletions
@@ -1,3 +1,27 @@
+## [1.53.1] - 2026-04-20
+
+No significant changes.
+
+
+## [1.53.0] - 2026-04-19
+
+### Added
+
+- Tightened `test_population` tolerance from 7% to 3% now that the stage-2 QRF (#362), TFC target refresh (#363), and reported-anchor takeup (#359) pulled the weighted UK population overshoot from ~6.5% down to ~1.6%. Added four regression tests in `test_population_fidelity.py` (weighted-total match, household-count range, non-inflation guard, country-sum consistency) extracted from the earlier #310 draft so any future calibration drift back toward the pre-April-2026 overshoot trips CI.
+
+
+## [1.52.2] - 2026-04-18
+
+### Changed
+
+- Add second-stage QRF imputation of FRS-only variables on SPI-donor rows. After the first-stage SPI-trained QRF overwrites income components on the zero-weight subsample, a new second-stage QRF trained on the full FRS rewrites benefit `_reported` columns, pension contributions, and savings-income so they correlate with the freshly-imputed incomes instead of staying as whatever middle-income FRS donor was sampled. Mirrors the `policyengine-us-data#589` pattern. Prevents synthetic £2 M earners from carrying a middle-income donor's UC / housing-benefit receipt into calibration, which was blowing up benefit aggregates under upweight.
+- Anchor stochastic takeup assignment for Universal Credit, Pension Credit, and Child Benefit to the FRS-reported receipt columns, matching the `policyengine-us-data` pattern. Respondents who report positive receipt in the FRS benefits table now receive `would_claim_* = True` with certainty, and non-reporters are filled probabilistically to hit the aggregate target rate. Removes a source of calibration noise where respondents who clearly took up a benefit could be randomly assigned `would_claim = False`.
+
+### Fixed
+
+- Refresh Tax-Free Childcare calibration targets and take-up rate using HMRC's June 2025 release (covering 2024-25 outturn: £632 m spending, 985 k children reached). The prior target set was calibrated against the September 2024 release and undershot current TFC spending by roughly a third. Bumps the default TFC take-up rate from 0.586 to 0.88 on 2024-04-06 to close most of the gap pending a full recalibration run.
+
+
 ## [1.52.1] - 2026-04-18
 
 ### Fixed

changelog.d/368.md

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+- Set Marriage Allowance take-up rate to 0.5 (HMRC outturn ~2.1m claimants of ~4.2m eligible couples) instead of the placeholder 1.0, so microsimulation no longer overstates Marriage Allowance cost by ~£500m/year.

policyengine_uk_data/datasets/childcare/takeup_rate.py

Lines changed: 14 additions & 4 deletions
@@ -3,23 +3,33 @@
 from policyengine_uk import Microsimulation
 
 # 🎯 Calibration targets
+#
+# TFC targets refreshed from HMRC "Tax-Free Childcare statistics: June 2025"
+# (published 27 Aug 2025, covering 2024-25 outturn):
+# - spending: £632.2 m (Table 1, annual government top-up)
+# - caseload: 985 thousand children received TFC in 2024-25 (annual unique)
+# The prior 0.6 / 660 targets were calibrated against the Sep 2024 release
+# (2023-24 outturn) and have since been overtaken by the TFC account
+# expansion and the Sep 2025 "30 free hours for under-5s" boost in uptake.
+#
+# Other programme targets kept at their prior DfE values.
 targets = {
     "spending": {
-        "tfc": 0.6,
+        "tfc": 0.63,
         "extended": 2.5,
         "targeted": 0.6,
         "universal": 1.7,
     },
     "caseload": {
-        "tfc": 660,
+        "tfc": 985,
         "extended": 740,
         "targeted": 130,
         "universal": 490,
     },
 }
 
-# Here is the link to the UK government’s aggregate data for Tax-Free Childcare:
-# https://www.gov.uk/government/statistics/tax-free-childcare-statistics-september-2024
+# UK government aggregate Tax-Free Childcare statistics:
+# https://www.gov.uk/government/statistics/tax-free-childcare-statistics-june-2025
 
 # This is the Department for Education (DfE) data for the other childcare programmes:
 # https://skillsfunding.service.gov.uk/view-latest-funding/national-funding-allocations/DSG/2024-to-2025

policyengine_uk_data/datasets/frs.py

Lines changed: 29 additions & 8 deletions
@@ -1217,24 +1217,45 @@ def determine_education_level(fted_val, typeed2_val, age_val):
     scp_under_6_rate = load_take_up_rate("scp_under_6", year)
     scp_6_plus_rate = load_take_up_rate("scp_6_plus", year)
 
-    # Generate take-up decisions by comparing random draws to take-up rates
+    # Generate take-up decisions by comparing random draws to take-up rates,
+    # anchored to reported receipts where the FRS captures them. Respondents
+    # who report positive receipt of a benefit are assigned takeup=True with
+    # certainty; the remaining non-reporters are filled probabilistically to
+    # hit the aggregate target rate. See policyengine_uk_data/utils/takeup.py.
+    from policyengine_uk_data.utils.takeup import (
+        assign_takeup_with_reported_anchors,
+    )
+
+    def _reported_benunit_mask(person_column: str) -> np.ndarray:
+        reporter_benunits = set(
+            pe_person.loc[pe_person[person_column] > 0, "person_benunit_id"].values
+        )
+        return pe_benunit["benunit_id"].isin(reporter_benunits).values
+
     # Person-level
     pe_person["would_claim_marriage_allowance"] = (
         generator.random(len(pe_person)) < marriage_allowance_rate
     )
 
-    # Benefit unit-level
-    pe_benunit["would_claim_child_benefit"] = (
-        generator.random(len(pe_benunit)) < child_benefit_rate
+    # Benefit unit-level — anchor on any adult in the benefit unit having
+    # reported positive receipt in the FRS benefits table.
+    pe_benunit["would_claim_child_benefit"] = assign_takeup_with_reported_anchors(
+        generator.random(len(pe_benunit)),
+        child_benefit_rate,
+        reported_mask=_reported_benunit_mask("child_benefit_reported"),
     )
     pe_benunit["child_benefit_opts_out"] = (
         generator.random(len(pe_benunit)) < child_benefit_opts_out_rate
     )
-    pe_benunit["would_claim_pc"] = (
-        generator.random(len(pe_benunit)) < pension_credit_rate
+    pe_benunit["would_claim_pc"] = assign_takeup_with_reported_anchors(
+        generator.random(len(pe_benunit)),
+        pension_credit_rate,
+        reported_mask=_reported_benunit_mask("pension_credit_reported"),
     )
-    pe_benunit["would_claim_uc"] = (
-        generator.random(len(pe_benunit)) < universal_credit_rate
+    pe_benunit["would_claim_uc"] = assign_takeup_with_reported_anchors(
+        generator.random(len(pe_benunit)),
+        universal_credit_rate,
+        reported_mask=_reported_benunit_mask("universal_credit_reported"),
     )
     pe_benunit["would_claim_tfc"] = generator.random(len(pe_benunit)) < tfc_rate
     pe_benunit["would_claim_extended_childcare"] = (
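
The hunk above calls `assign_takeup_with_reported_anchors`, but the body of that helper in `policyengine_uk_data/utils/takeup.py` is not part of this page. The following is only a sketch of how such anchoring could work, assuming an unweighted target rate: reporters are set to `True`, and non-reporters are filled at whatever residual rate brings the overall share up to the target. The name `anchored_takeup_sketch` and the residual-rate formula are assumptions, not the repository's implementation.

```python
import numpy as np


def anchored_takeup_sketch(draws, target_rate, reported_mask):
    """Hypothetical anchored take-up: reporters always claim."""
    draws = np.asarray(draws, dtype=float)
    reported_mask = np.asarray(reported_mask, dtype=bool)
    n_total = draws.size
    n_reported = int(reported_mask.sum())
    n_other = n_total - n_reported
    if n_other == 0:
        return reported_mask.copy()
    # Share of non-reporters who must claim so that the overall
    # (unweighted) rate lands on target_rate; clipped to [0, 1] in case
    # reporters alone already exceed the target.
    residual_rate = np.clip(
        (target_rate * n_total - n_reported) / n_other, 0.0, 1.0
    )
    return reported_mask | (draws < residual_rate)


rng = np.random.default_rng(0)
reported = np.zeros(10_000, dtype=bool)
reported[:2_000] = True  # hypothetical: 20% report positive receipt
takeup = anchored_takeup_sketch(rng.random(10_000), 0.6, reported)
print(takeup[reported].all(), round(float(takeup.mean()), 3))
```

In this sketch every reporter ends up with `True`, and the overall rate stays close to the target because only the non-reporters are drawn at random.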

policyengine_uk_data/datasets/imputations/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -2,6 +2,7 @@
 from .vat import *
 from .wealth import *
 from .income import *
+from .frs_only import impute_frs_only_variables
 from .capital_gains import *
 from .services import impute_services
 from .salary_sacrifice import impute_salary_sacrifice

policyengine_uk_data/datasets/imputations/frs_only.py

Lines changed: 240 additions & 0 deletions
@@ -0,0 +1,240 @@
+"""Second-stage QRF imputation of FRS-only variables on SPI-donor rows.
+
+The enhanced-FRS pipeline in :mod:`income` creates a zero-weight subsample
+of the FRS that will be upweighted during calibration to fit SPI-derived
+high-income targets. The first-stage QRF (trained on SPI) replaces only
+the six core income components (plus ``gift_aid`` and
+``charitable_investment_gifts``) on those rows. Every other FRS column —
+benefit ``_reported`` values, pension contributions, savings, rent,
+mortgage, council tax — stays at whatever the middle-income FRS donor
+whose row was sampled happened to report.
+
+That produces implausible joint distributions on the synthetic
+high-income side. A row with imputed £2 M self-employment income carries
+its donor's £120 UC ``_reported`` value, its donor's tiny pension
+contribution, and its donor's typical rent. Under calibration upweight
+these cascade into false benefit aggregates, depressed allowances, and
+distorted housing-cost totals.
+
+This second-stage QRF trains on the original FRS with predictors =
+[demographics + first-stage income outputs] and outputs = a curated list
+of FRS-only variables. For each SPI-donor row, it substitutes the
+predicted value drawn from FRS respondents with similar demographics and
+post-stage-1 incomes. Benefit ``_reported`` flags for high earners
+naturally collapse to zero (no high-earner FRS respondent reports UC),
+pension contributions rescale, and savings interest / rent correlate
+with income instead of with the random FRS donor's draw.
+
+Mirrors the US ``_impute_cps_only_variables`` approach introduced in
+``policyengine-us-data#589`` but targets UK-specific FRS variables.
+"""
+
+from __future__ import annotations
+
+import logging
+
+import numpy as np
+import pandas as pd
+from policyengine_uk.data import UKSingleYearDataset
+
+logger = logging.getLogger(__name__)
+
+
+STAGE2_DEMOGRAPHIC_PREDICTORS = [
+    "age",
+    "gender",
+    "region",
+]
+
+# Predictors drawn from the first-stage QRF output columns. They are the
+# same six income components that the first stage imputes from SPI.
+STAGE2_INCOME_PREDICTORS = [
+    "employment_income",
+    "self_employment_income",
+    "savings_interest_income",
+    "dividend_income",
+    "private_pension_income",
+    "property_income",
+]
+
+# FRS-only variables the second stage replaces on SPI-donor rows. Kept
+# conservative: benefit ``_reported`` columns and pension contributions
+# are the leading sources of cross-income inconsistency, and are
+# well-populated in the base FRS build so training is stable.
+FRS_ONLY_PERSON_VARIABLES = [
+    # Pension contributions
+    "employee_pension_contributions",
+    "employer_pension_contributions",
+    "personal_pension_contributions",
+    "pension_contributions_via_salary_sacrifice",
+    # Savings-related
+    "tax_free_savings_income",
+    # Benefit `_reported` columns
+    "universal_credit_reported",
+    "pension_credit_reported",
+    "child_benefit_reported",
+    "housing_benefit_reported",
+    "income_support_reported",
+    "working_tax_credit_reported",
+    "child_tax_credit_reported",
+    "attendance_allowance_reported",
+    "state_pension_reported",
+    "dla_sc_reported",
+    "dla_m_reported",
+    "pip_m_reported",
+    "pip_dl_reported",
+    "sda_reported",
+    "carers_allowance_reported",
+    "iidb_reported",
+    "afcs_reported",
+    "bsp_reported",
+    "incapacity_benefit_reported",
+    "maternity_allowance_reported",
+    "winter_fuel_allowance_reported",
+    "council_tax_benefit_reported",
+    "jsa_contrib_reported",
+    "jsa_income_reported",
+    "esa_contrib_reported",
+    "esa_income_reported",
+]
+
+
+def _one_hot_encode(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
+    """Return ``df`` with object-typed ``columns`` one-hot encoded.
+
+    QRF predictors must be numeric. Uses ``pandas.get_dummies`` so
+    identical category sets are produced from the same input data.
+    """
+    return pd.get_dummies(df, columns=columns, drop_first=False, dtype=float)
+
+
+def _align_columns(
+    train_df: pd.DataFrame, test_df: pd.DataFrame
+) -> tuple[pd.DataFrame, pd.DataFrame]:
+    """Ensure train/test share the same columns in the same order.
+
+    After independent ``get_dummies`` calls on train and test one-hot
+    expansions can diverge if a category appears in one set and not the
+    other. Reindex both to the union of columns, filling missing cells
+    with zero.
+    """
+    columns = sorted(set(train_df.columns) | set(test_df.columns))
+    return (
+        train_df.reindex(columns=columns, fill_value=0.0),
+        test_df.reindex(columns=columns, fill_value=0.0),
+    )
+
+
+def _build_predictor_frame(dataset: UKSingleYearDataset) -> pd.DataFrame:
+    """Return a person-indexed DataFrame of stage-2 predictor columns.
+
+    ``region`` lives on the household frame in the enhanced-FRS build,
+    so it is joined onto each person row via ``person_household_id``.
+    Remaining predictors (age, gender, the six income components) are
+    read directly from the person frame. If the person frame already
+    carries ``region`` (as in some test fixtures and the standalone SPI
+    build) that value wins and no join is performed.
+    """
+    person = dataset.person
+    predictors = STAGE2_DEMOGRAPHIC_PREDICTORS + STAGE2_INCOME_PREDICTORS
+
+    if "region" in person.columns:
+        frame = person[predictors].copy()
+    elif (
+        "region" in dataset.household.columns
+        and "person_household_id" in person.columns
+    ):
+        hh_region = dataset.household.set_index("household_id")["region"]
+        person_region = person["person_household_id"].map(hh_region)
+        frame = person[[c for c in predictors if c != "region"]].copy()
+        frame["region"] = person_region.values
+        frame = frame[predictors]
+    else:
+        raise KeyError(
+            "Stage-2 imputation needs 'region' either on the person frame "
+            "or on the household frame with a 'person_household_id' join key."
+        )
+    return frame
+
+
+def impute_frs_only_variables(
+    train_dataset: UKSingleYearDataset,
+    target_dataset: UKSingleYearDataset,
+) -> UKSingleYearDataset:
+    """Impute FRS-only person variables onto ``target_dataset``.
+
+    ``train_dataset`` must be a full FRS build (before income
+    imputation) so the training rows preserve the original co-occurrence
+    of income and every FRS-only variable. ``target_dataset`` is the
+    SPI-donor subsample after the first-stage QRF has overwritten its
+    income columns.
+
+    A single multi-output QRF is fitted on the training data and used
+    to predict values for every row of ``target_dataset``; predictions
+    replace the existing (donor-leaked) values in
+    ``FRS_ONLY_PERSON_VARIABLES`` only. Variables absent from either
+    frame are skipped silently.
+    """
+    from policyengine_uk_data.utils.qrf import QRF
+
+    target_dataset = target_dataset.copy()
+
+    train_person = train_dataset.person
+    target_person = target_dataset.person
+
+    # Use only variables present in both frames.
+    outputs = [
+        v
+        for v in FRS_ONLY_PERSON_VARIABLES
+        if v in train_person.columns and v in target_person.columns
+    ]
+    missing = set(FRS_ONLY_PERSON_VARIABLES) - set(outputs)
+    if missing:
+        logger.warning(
+            "Stage-2 FRS-only imputation: %d variables absent from "
+            "train/target frames, skipped: %s",
+            len(missing),
+            sorted(missing),
+        )
+    if not outputs:
+        logger.warning(
+            "Stage-2 FRS-only imputation: no output variables available; "
+            "returning target_dataset unchanged."
+        )
+        return target_dataset
+
+    train_inputs_raw = _build_predictor_frame(train_dataset)
+    target_inputs_raw = _build_predictor_frame(target_dataset)
+
+    train_inputs = _one_hot_encode(train_inputs_raw, columns=["gender", "region"])
+    target_inputs = _one_hot_encode(target_inputs_raw, columns=["gender", "region"])
+    train_inputs, target_inputs = _align_columns(train_inputs, target_inputs)
+
+    # Replace NaNs in outputs with 0 so the QRF trains on clean targets;
+    # FRS-only variables are almost all zero-heavy "amount if eligible"
+    # columns that default to zero when unreported.
+    train_outputs = train_person[outputs].fillna(0).astype(float)
+
+    logger.info(
+        "Stage-2 FRS-only imputation: %d outputs, training on %d FRS "
+        "persons, predicting for %d SPI-donor persons",
+        len(outputs),
+        len(train_inputs),
+        len(target_inputs),
+    )
+
+    model = QRF()
+    model.fit(train_inputs, train_outputs)
+    predictions = model.predict(target_inputs)
+
+    # The QRF occasionally returns NaN for extreme predictor combos;
+    # clamp to zero (the population-typical value for these variables).
+    predictions = predictions.fillna(0.0)
+
+    for column in outputs:
+        # Clamp negative predictions — these columns represent receipted
+        # amounts or contributions and are non-negative by construction.
+        values = np.maximum(predictions[column].values, 0.0)
+        target_dataset.person[column] = values
+
+    return target_dataset
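
The one-hot encoding and column alignment described in the `_one_hot_encode` and `_align_columns` docstrings can be tried in isolation. The sketch below re-implements both steps as standalone functions on toy frames; the region labels and ages are hypothetical, and the short names `one_hot` and `align` are used only to keep the example independent of the module.

```python
import pandas as pd


def one_hot(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    # Same call shape as _one_hot_encode in the diff above.
    return pd.get_dummies(df, columns=columns, drop_first=False, dtype=float)


def align(train_df: pd.DataFrame, test_df: pd.DataFrame):
    # Reindex both frames to the sorted union of their columns, so a
    # category missing from one side becomes an all-zero column there.
    columns = sorted(set(train_df.columns) | set(test_df.columns))
    return (
        train_df.reindex(columns=columns, fill_value=0.0),
        test_df.reindex(columns=columns, fill_value=0.0),
    )


# Toy data: the target frame never sees WALES or SCOTLAND.
train = pd.DataFrame(
    {"age": [30, 45, 60], "region": ["LONDON", "WALES", "SCOTLAND"]}
)
target = pd.DataFrame({"age": [52, 38], "region": ["LONDON", "LONDON"]})

train_x, target_x = align(one_hot(train, ["region"]), one_hot(target, ["region"]))
print(list(target_x.columns))
print(target_x["region_WALES"].tolist())
```

Without the alignment step the target frame would have only `age` and `region_LONDON`, and a model fitted on the four training columns could not be applied to it.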

policyengine_uk_data/datasets/imputations/income.py

Lines changed: 15 additions & 0 deletions
@@ -256,6 +256,21 @@ def impute_income(dataset: UKSingleYearDataset) -> UKSingleYearDataset:
         IMPUTATIONS,
     )
 
+    # Second-stage QRF: rewrite FRS-only variables (benefit `_reported`
+    # columns, pension contributions, savings, etc.) on the SPI-donor rows
+    # so they correlate with the freshly-imputed incomes instead of staying
+    # as whatever middle-income FRS donor was sampled. Without this the
+    # £2M imputed earners keep their donor's £120 UC receipt, blowing up
+    # benefit aggregates under calibration upweight.
+    from policyengine_uk_data.datasets.imputations.frs_only import (
+        impute_frs_only_variables,
+    )
+
+    zero_weight_copy = impute_frs_only_variables(
+        train_dataset=dataset,
+        target_dataset=zero_weight_copy,
+    )
+
     dataset = impute_over_incomes(
         dataset,
         model,
