
Impute salary sacrifice participation to full population #219

Description

@MaxGhenis

Problem

The current dataset uses only the raw FRS SPNAMT (salary sacrifice pension amount) field, which has non-zero values for just 371 of ~36,000 persons. Weighted, this comes to £5bn, against an HMRC target of ~£24bn.

PR #216 attempted to add calibration targets for salary sacrifice, but hitting them requires ~5x weight scaling, which inflates the population from 68M to 74M (6% over target, against a 2% tolerance).
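For reference, the ~5x figure follows directly from the two totals quoted above. A quick check, using only numbers from this issue (the variable names are illustrative):

```python
# Figures quoted in this issue; variable names are illustrative only.
nonzero_obs = 371          # persons with non-zero SPNAMT
total_persons = 36_000     # approximate FRS person count
raw_weighted_bn = 5.0      # weighted SPNAMT total, in £bn
hmrc_target_bn = 24.0      # approximate HMRC target, in £bn

unweighted_share = nonzero_obs / total_persons
required_scale = hmrc_target_bn / raw_weighted_bn

print(f"Unweighted share with non-zero SPNAMT: {unweighted_share:.1%}")
print(f"Uniform weight scale needed to hit the target: {required_scale:.1f}x")
```

With so few contributing records, reweighting alone has to push a very large multiplier onto a small group, which is what distorts the population total.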

Root Cause

Salary sacrifice (SS) is NOT imputed, unlike consumption, wealth, VAT, services, income, and capital gains, which all have imputation steps in create_datasets.py. The raw FRS severely under-reports SS participation.

Proposed Solution

Implement ML-based imputation for salary sacrifice participation, similar to how other variables are imputed.

Key Finding: We CAN Distinguish Non-Response from Zero

The FRS SALSAC variable is a routing question that asks "Does your employer offer a salary sacrifice scheme for pension contributions?":

| SALSAC value | Meaning | Count |
| --- | --- | --- |
| '1' | Yes, participates in SS | 224 jobs |
| '2' | No, doesn't participate | 3,803 jobs |
| ' ' (blank) | Skipped / not asked | 13,265 jobs |

This provides:

  • Training data: 4,027 observations with definite Yes/No responses
  • Imputation candidates: 13,265 observations where the question was skipped
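A minimal sketch of that split, assuming job records are available as dictionaries with a raw `SALSAC` string. The record layout, the `job_id` field, and the function name are hypothetical, not the actual FRS loader in frs.py:

```python
# SALSAC codes as listed in the table above; everything else is hypothetical.
YES, NO = "1", "2"

def split_by_salsac(jobs):
    """Split job records into labelled training rows and imputation candidates."""
    training, candidates = [], []
    for job in jobs:
        code = job["SALSAC"]
        if code in (YES, NO):
            # Definite answer: usable as a binary training label.
            training.append({**job, "participates": code == YES})
        elif code.strip() == "":
            # Blank: the question was skipped, so participation is unknown.
            candidates.append(job)
        else:
            raise ValueError(f"Unexpected SALSAC code: {code!r}")
    return training, candidates

# Toy records for illustration only.
jobs = [
    {"job_id": 1, "SALSAC": "1"},
    {"job_id": 2, "SALSAC": "2"},
    {"job_id": 3, "SALSAC": " "},
    {"job_id": 4, "SALSAC": "2"},
]
training, candidates = split_by_salsac(jobs)
print(len(training), "training rows;", len(candidates), "imputation candidates")
```

Raising on unexpected codes is a deliberate choice here: it avoids silently treating an unknown value as either a "No" or a skip.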

External Validation Target

Per HMRC surveys, approximately 30% of private sector employees use salary sacrifice for pension contributions. This can be used to validate imputation results.
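One way to run that check is to compare the weighted participation share after imputation with the ~30% figure. The sketch below assumes the records have already been filtered to private sector employees; the field names and the 5 percentage point tolerance are placeholders, not agreed values:

```python
def weighted_participation_rate(records):
    """Weighted share of records flagged as SS participants."""
    total_weight = sum(r["weight"] for r in records)
    if total_weight == 0:
        raise ValueError("Total weight is zero; cannot compute a rate.")
    participant_weight = sum(r["weight"] for r in records if r["participates"])
    return participant_weight / total_weight

def within_tolerance(rate, target=0.30, tolerance=0.05):
    """True if the rate is within `tolerance` of `target` (both hypothetical defaults)."""
    return abs(rate - target) <= tolerance

# Toy records for illustration only.
records = [
    {"weight": 1500, "participates": True},
    {"weight": 1500, "participates": True},
    {"weight": 3000, "participates": False},
    {"weight": 4000, "participates": False},
]
rate = weighted_participation_rate(records)
print(f"Weighted participation: {rate:.1%}, within tolerance: {within_tolerance(rate)}")
```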

HMRC Table 6.2 Targets (2023-24)

  • Total SS pension contributions: ~£24bn
  • Income tax (IT) relief from SS: ~£7.2bn
    • Basic rate: £1.6bn
    • Higher rate: £4.4bn
    • Additional rate: £1.2bn
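As a consistency check on these targets, the three band figures should sum to the relief total. The ratio of relief to contributions is a derived number, not an HMRC statistic:

```python
import math

# Targets quoted above, in £bn.
total_contributions_bn = 24.0
total_relief_bn = 7.2
relief_by_band_bn = {"basic": 1.6, "higher": 4.4, "additional": 1.2}

# The band breakdown should add up to the quoted total (allowing for rounding).
band_sum = sum(relief_by_band_bn.values())
assert math.isclose(band_sum, total_relief_bn, abs_tol=0.05)

# Derived quantities, for orientation only.
implied_average_relief_rate = total_relief_bn / total_contributions_bn
band_shares = {band: value / total_relief_bn for band, value in relief_by_band_bn.items()}

print(f"Implied average relief rate: {implied_average_relief_rate:.0%}")
for band, share in band_shares.items():
    print(f"{band}: {share:.1%} of IT relief")
```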

Implementation Steps

  1. Create imputation model using SALSAC='1'/'2' as training labels
  2. Predict SS participation probability for SALSAC=' ' (skipped) observations
  3. Impute SS amounts based on participation probability and employee characteristics
  4. Validate against HMRC 30% participation rate target
  5. Remove the salary sacrifice calibration targets from the loss function (or reduce their weight significantly)
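Steps 1 and 2 can be sketched as follows. This is a stand-in, not the proposed implementation: it uses a small pure-Python logistic regression so that the example is self-contained, whereas the real model would presumably follow whatever the existing imputations use. The feature choice (earnings in £10k, private sector flag), the toy data, and all function names are assumptions. Amount imputation (step 3) is not covered.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(w, x):
    """Participation probability for feature vector x; w[0] is the bias."""
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))

def fit_logistic(rows, labels, epochs=2000, lr=0.1):
    """Fit logistic regression by batch gradient descent on the log loss."""
    w = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(rows, labels):
            err = predict_proba(w, x) - y
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        w = [wi - lr * g / len(rows) for wi, g in zip(w, grad)]
    return w

def impute_participation(w, candidates, seed=0):
    """Draw a Bernoulli participation flag per candidate from its probability."""
    rng = random.Random(seed)
    return [rng.random() < predict_proba(w, x) for x in candidates]

# Toy training data: SALSAC='1' -> 1, SALSAC='2' -> 0 (values are made up).
train_x = [[2.0, 0], [2.5, 1], [3.0, 0], [3.5, 0],
           [4.0, 1], [5.0, 1], [6.0, 1], [7.0, 1]]
train_y = [0, 0, 0, 0, 1, 1, 1, 1]
weights = fit_logistic(train_x, train_y)

# Toy candidates standing in for SALSAC=' ' observations.
skipped_x = [[2.0, 0], [4.5, 1], [7.0, 1]]
probabilities = [predict_proba(weights, x) for x in skipped_x]
imputed = impute_participation(weights, skipped_x, seed=42)
print([round(p, 3) for p in probabilities], imputed)
```

Drawing a flag from each probability, rather than thresholding at 0.5, keeps the imputed participation rate close to the mean predicted probability, which matters when validating against an aggregate rate.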

Related Issues/PRs

Files to Modify

  • policyengine_uk_data/datasets/frs.py - Add SALSAC extraction
  • policyengine_uk_data/datasets/create_datasets.py - Add SS imputation step
  • New file: policyengine_uk_data/datasets/imputations/salary_sacrifice.py
