SPI synthetic rows receive only 0.35% of final enhanced FRS weight #420

@MaxGhenis

Summary

The released UK enhanced FRS includes SPI-imputed synthetic rows, but final calibrated household weights leave them with only about 0.35% of total household weight. This looks like the UK analogue of PolicyEngine/policyengine-us-data#1139: the synthetic donor rows are present, but calibration gives them too little mass to materially improve SPI-heavy income variables.

Artifact checked

  • HuggingFace cache repo: policyengine/policyengine-uk-data-private
  • Ref: 1.55.5
  • Snapshot: 664497078e8615d4491309459f6422e6a35d423d
  • File: enhanced_frs_2023_24.h5

Result

household_rows=53,508
total_household_weight=31,282,963.49
zero_weight_rows=0

FRS-derived rows: 33,508 rows, 31,171,999.95 weight, 99.645%
SPI synthetic rows: 20,000 rows, 110,963.54 weight, 0.355%

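This classification is inferred from the build order rather than read from an explicit flag: impute_income() stacks the original FRS households with 10k zero-weight SPI-imputed households, then impute_capital_gains() duplicates the combined dataset for the capital-gains split. The row counts are consistent with that order: (16,754 FRS + 10,000 SPI) × 2 = 53,508.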
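The row-order inference can be sketched as follows. This is a minimal illustration, not the script used for the numbers above: it assumes the duplication step appends a full copy of the stacked dataset, so each copy is laid out as FRS rows followed by SPI rows. The function name `source_weight_shares` and the toy weights are hypothetical.

```python
import numpy as np


def source_weight_shares(weights, n_frs, n_spi, n_copies=2):
    """Split household weight by inferred row source.

    Assumption (not confirmed by the source): the dataset is n_copies
    back-to-back blocks, each ordered as [FRS rows, SPI rows].
    """
    weights = np.asarray(weights, dtype=float)
    block = n_frs + n_spi
    if weights.size != block * n_copies:
        raise ValueError("row count does not match the assumed layout")

    # Position of each row inside its block; SPI rows sit after the FRS rows.
    position = np.arange(weights.size) % block
    is_spi = position >= n_frs

    total = weights.sum()
    spi_weight = weights[is_spi].sum()
    return {
        "is_spi": is_spi,
        "frs_weight": total - spi_weight,
        "spi_weight": spi_weight,
        "spi_share": spi_weight / total,
    }


# Toy data: 3 FRS rows and 2 SPI rows, duplicated once.
toy_weights = [10.0, 10.0, 10.0, 0.1, 0.1] * 2
result = source_weight_shares(toy_weights, n_frs=3, n_spi=2)
print(f"SPI share of weight: {result['spi_share']:.3%}")
```

If the duplication step interleaves rows or reorders them, the mask would be wrong, which is one reason explicit row-source flags are preferable to this kind of inference.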

Why this matters

The SPI synthetic rows are meant to provide support for income components and reliefs that are under-covered in FRS, especially high-income tails. With only ~0.35% of final household weight, the rows are technically active but likely carry too little weight to close the gap to SPI-derived targets.

Proposed fix direction

  • Give zero-weight SPI synthetic households meaningful positive prior mass before calibration, rather than random near-zero dust.
  • Add explicit row-source flags, e.g. household_is_spi_synthetic and household_is_capital_gains_clone, so diagnostics do not need row-order inference.
  • Log source weight shares and calibration loss/target diagnostics during calibration.
  • Keep the existing second-stage FRS-only imputation on SPI rows (impute_frs_only_variables) so benefit, pension, savings, and other FRS-only features are regenerated conditional on the SPI-imputed incomes.
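The first three points can be sketched together as below. This is an illustrative sketch only, not the code on the branch: the function names, the even split of prior mass across SPI rows, and the 5% share are all assumptions chosen for the example, and the boolean mask stands in for a flag such as household_is_spi_synthetic.

```python
import numpy as np


def with_spi_prior(weights, is_spi, spi_prior_share):
    """Return initial weights in which SPI rows hold spi_prior_share of the total.

    Hypothetical scheme: FRS weights are scaled down proportionally and the
    freed mass is spread evenly over SPI rows, so the total is preserved.
    """
    weights = np.asarray(weights, dtype=float)
    is_spi = np.asarray(is_spi, dtype=bool)
    if not 0.0 < spi_prior_share < 1.0:
        raise ValueError("spi_prior_share must be strictly between 0 and 1")
    if not is_spi.any() or is_spi.all():
        raise ValueError("need at least one SPI row and one FRS row")

    frs_total = weights[~is_spi].sum()
    out = np.empty_like(weights)
    out[~is_spi] = weights[~is_spi] * (1.0 - spi_prior_share)
    out[is_spi] = frs_total * spi_prior_share / is_spi.sum()
    return out


def source_shares(weights, is_spi):
    """Weight share by row source, suitable for logging during calibration."""
    weights = np.asarray(weights, dtype=float)
    is_spi = np.asarray(is_spi, dtype=bool)
    spi_share = weights[is_spi].sum() / weights.sum()
    return {"frs": 1.0 - spi_share, "spi_synthetic": spi_share}


# Toy data: three FRS households and two zero-weight SPI households.
initial = np.array([100.0, 200.0, 300.0, 0.0, 0.0])
spi_flag = np.array([False, False, False, True, True])

prior = with_spi_prior(initial, spi_flag, spi_prior_share=0.05)
print("prior weights:", prior)
print("source shares:", source_shares(prior, spi_flag))
```

Calling `source_shares` before and after calibration would show directly whether the optimiser is pushing the SPI rows back towards zero. The right prior share is an empirical question and the 5% here is a placeholder.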

A branch implementing this is being tested locally and in CI: codex/spi-prior-target-diagnostics.

Related: PolicyEngine/policyengine-us-data#1139 and PolicyEngine/policyengine-us-data#1140.
