Summary
The released UK enhanced FRS includes SPI-imputed synthetic rows, but final calibrated household weights leave them with only about 0.35% of total household weight. This looks like the UK analogue of PolicyEngine/policyengine-us-data#1139: the synthetic donor rows are present, but calibration gives them too little mass to materially improve SPI-heavy income variables.
Artifact checked
- HuggingFace cache repo: policyengine/policyengine-uk-data-private
- Ref: 1.55.5
- Snapshot: 664497078e8615d4491309459f6422e6a35d423d
- File: enhanced_frs_2023_24.h5
Result
household_rows=53,508
total_household_weight=31,282,963.49
zero_weight_rows=0
FRS-derived rows: 33,508 rows, 31,171,999.95 weight, 99.645%
SPI synthetic rows: 20,000 rows, 110,963.54 weight, 0.355%
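This classification is inferred from the build order rather than read from an explicit flag: impute_income() stacks the original FRS with 10k zero-weight SPI-imputed households, then impute_capital_gains() duplicates the combined dataset for the capital-gains split. The duplication doubles the 10k SPI-imputed households to the 20,000 SPI synthetic rows counted above.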
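The row-order inference can be sketched as follows. This is a minimal illustration, not the script used for the numbers above: the function name source_weight_shares is hypothetical, and it assumes the capital-gains duplication concatenates whole copies of the stacked dataset (FRS rows, then SPI rows, repeated). If the clones are interleaved differently, the mask would need to change.

```python
import numpy as np


def source_weight_shares(weights, n_frs, n_spi, n_copies=2):
    """Split household weight by inferred row source.

    Assumes (not confirmed by the source) that rows are laid out as
    n_copies consecutive blocks of [n_frs FRS rows, n_spi SPI rows].
    """
    weights = np.asarray(weights, dtype=float)
    block = n_frs + n_spi
    if weights.size != block * n_copies:
        raise ValueError("row count does not match the assumed layout")

    # Position of each row inside its block; SPI rows sit at the block tail.
    position = np.arange(weights.size) % block
    is_spi = position >= n_frs

    total = weights.sum()
    return {
        "frs_rows": int((~is_spi).sum()),
        "spi_rows": int(is_spi.sum()),
        "frs_share": weights[~is_spi].sum() / total,
        "spi_share": weights[is_spi].sum() / total,
    }


# Toy example: 3 FRS rows and 2 SPI rows, duplicated once.
toy = source_weight_shares([10, 10, 10, 1, 1, 10, 10, 10, 1, 1], n_frs=3, n_spi=2)
```

For the released file, the matching call would pass the household weight array with n_frs=16,754 and n_spi=10,000, which is what the reported 33,508 and 20,000 row counts imply under this layout.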
Why this matters
The SPI synthetic rows are meant to provide support for income components and reliefs that are under-covered in FRS, especially high-income tails. With only ~0.35% final household-weight share, the rows are technically active but likely too weak to close SPI-derived targets.
Proposed fix direction
- Give zero-weight SPI synthetic households meaningful positive prior mass before calibration, rather than random near-zero dust.
- Add explicit row-source flags, e.g. household_is_spi_synthetic and household_is_capital_gains_clone, so diagnostics do not need row-order inference.
- Log source weight shares and calibration loss/target diagnostics during calibration.
- Keep the existing second-stage FRS-only imputation on SPI rows (impute_frs_only_variables) so benefit, pension, savings, and other FRS-only features are regenerated conditional on the SPI-imputed incomes.
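One way the first bullet could look in practice is sketched below. It is an illustration under stated assumptions, not the code on the branch: the function name assign_spi_prior_mass, the uniform spread across SPI rows, and the 10% prior share are all hypothetical choices, and the boolean mask stands in for a household_is_spi_synthetic flag.

```python
import numpy as np


def assign_spi_prior_mass(weights, is_spi, prior_share):
    """Return pre-calibration weights where SPI rows hold prior_share of the total.

    FRS weights are left untouched; SPI rows share their mass uniformly.
    The uniform spread and the choice of prior_share are assumptions.
    """
    if not 0.0 < prior_share < 1.0:
        raise ValueError("prior_share must be strictly between 0 and 1")

    weights = np.asarray(weights, dtype=float).copy()
    is_spi = np.asarray(is_spi, dtype=bool)
    if not is_spi.any():
        raise ValueError("no SPI synthetic rows flagged")

    frs_total = weights[~is_spi].sum()
    # Solve spi_total / (frs_total + spi_total) = prior_share for spi_total.
    spi_total = frs_total * prior_share / (1.0 - prior_share)
    weights[is_spi] = spi_total / is_spi.sum()
    return weights


# Toy example: two FRS rows and two zero-weight SPI rows, 10% prior share.
flags = np.array([False, False, True, True])
prior = assign_spi_prior_mass([90.0, 90.0, 0.0, 0.0], flags, prior_share=0.10)
spi_share = prior[flags].sum() / prior.sum()
```

Logging spi_share before and after calibration would cover the diagnostics bullet: a large drop between the two values would show the optimiser pushing mass back off the SPI rows.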
A branch implementing this is being tested locally and in CI: codex/spi-prior-target-diagnostics.
Related: PolicyEngine/policyengine-us-data#1139 and PolicyEngine/policyengine-us-data#1140.