LA calibration: unify missing-source handling across loss.py blocks #381

Description

@vahid-ahmadi

Background

After PR #374 (commit 96f5707), the council-tax blocks in policyengine_uk_data/datasets/local_areas/local_authorities/loss.py use NaN-masking for cells where no direct source is available:

  • voa/council_tax/{A..H}: np.where(has_count, direct, np.nan)
  • housing/council_tax_net: np.where(has_ct_net, direct, np.nan)

The calibrator (utils/calibrate.py) was updated to mask NaN cells out of the loss, so missing-source LAs simply don't contribute to training on those targets.
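
For readers unfamiliar with the pattern, here is a minimal NumPy sketch of what "masking NaN cells out of the loss" means. It is not the code in utils/calibrate.py: the function name `masked_relative_loss` and the choice of a mean squared relative error are assumptions made purely for illustration.

```python
import numpy as np


def masked_relative_loss(estimate: np.ndarray, target: np.ndarray) -> float:
    """Mean squared relative error over cells whose target is not NaN.

    Hypothetical helper for illustration; assumes observed targets are non-zero.
    """
    observed = ~np.isnan(target)  # True where a direct source exists
    if not observed.any():
        return 0.0  # nothing to train on for this target
    rel_err = (estimate[observed] - target[observed]) / target[observed]
    return float(np.mean(rel_err**2))


# Three LAs for one target; the third LA has no direct source.
target = np.array([100.0, 200.0, np.nan])
estimate = np.array([110.0, 180.0, 999.0])

# Only the first two cells contribute; the estimate for the third LA is ignored.
loss = masked_relative_loss(estimate, target)
```

Under this scheme the estimate for a missing-source LA can take any value without affecting the loss for that target.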

Inconsistency

The other LA-level blocks in the same file still use the national-share fallback pattern:

# tenure/*  (English Housing Survey — England-only)
y[f"tenure/{tenure_key}"] = np.where(
    has_tenure, targets.values, national * la_household_share
)

# rent/private_rent  (VOA private rents — England + Wales)
y["rent/private_rent"] = np.where(
    has_rent, target.values, national_rent * la_household_share
)

# ons/equiv_net_income_*  (ONS small-area income — England + Wales)
y["ons/equiv_net_income_bhc"] = np.where(
    has_ons_data, target.values, national_bhc * la_household_share
)

For LAs with missing source data (Wales / Scotland / NI for tenure; Scotland / NI for rent and ONS income), these blocks fabricate a target value as a household-share-weighted slice of the national total (national * la_household_share), rather than masking the cell out.

This means the LA reweighter currently follows two coexisting policies:

  • Council tax: only train on directly observed cells.
  • Tenure / rent / ONS income: train on observed cells plus fabricated national-share fallbacks.

Question

Is the council-tax NaN-masking approach the new standard for all LA blocks, or is the national-share fallback intentional for the older blocks?

If the new standard, the existing tenure / rent / ONS-income blocks should be migrated to NaN-masking too (same shape: np.where(has_data, direct, np.nan)).
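
To make the difference between the two patterns concrete, here is a toy comparison. All values are made up, and the variable names (`national_total`, `fallback_target`, `masked_target`) are illustrative stand-ins rather than names taken from loss.py.

```python
import numpy as np

# Hypothetical toy inputs: four LAs, the last two lack a direct source.
has_data = np.array([True, True, False, False])
direct = np.array([1200.0, 800.0, 0.0, 0.0])
national_total = 4000.0
la_household_share = np.array([0.30, 0.20, 0.35, 0.15])

# Current tenure / rent / ONS-income pattern: fill gaps with a national-share slice.
fallback_target = np.where(has_data, direct, national_total * la_household_share)

# Council-tax pattern after PR #374: leave gaps as NaN so the calibrator skips them.
masked_target = np.where(has_data, direct, np.nan)

# Number of cells that would actually enter the loss under each policy.
cells_trained_fallback = int(np.sum(~np.isnan(fallback_target)))
cells_trained_masked = int(np.sum(~np.isnan(masked_target)))
```

With the fallback, all four cells enter the loss, two of them carrying fabricated values; with NaN-masking only the two observed cells do.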

If the older approach is intentional for those targets (e.g., because a national-share fallback is acceptable for tenure mix, where patterns vary less across countries than council tax band distributions), the comment on the council-tax block should say so explicitly, and ideally the reasoning should be captured in a short policy note alongside loss.py.

Proposed actions

Two tractable paths:

  1. Unify on NaN-masking. Migrate the tenure / rent / ONS-income blocks to np.where(has_data, direct, np.nan) and rely on the calibrator's NaN-masking. Pros: consistent, no fabricated targets anywhere. Cons: behaviour change for all callers; needs a calibration-quality check before/after to confirm residuals don't degrade.

  2. Document the asymmetry. Add a short comment at the top of loss.py (or in utils/calibrate.py) explaining when each pattern applies and why. Pros: tiny, no behaviour change. Cons: leaves the inconsistency in place for the next PR adding an LA target.
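
For option 1, the before/after calibration-quality check could look roughly like the sketch below. It compares error on directly observed cells only, since those are the only cells with a ground truth. The helper `observed_relative_error` and the two estimate arrays are hypothetical; a real check would use the reweighter's outputs from the two runs.

```python
import numpy as np


def observed_relative_error(estimate, direct, has_data):
    """Mean absolute relative error on directly observed cells only.

    Hypothetical helper; assumes observed values are non-zero.
    """
    est = np.asarray(estimate, dtype=float)[has_data]
    obs = np.asarray(direct, dtype=float)[has_data]
    return float(np.mean(np.abs(est - obs) / np.abs(obs)))


# Toy data: three LAs, the third has no direct source.
has_data = np.array([True, True, False])
direct = np.array([100.0, 200.0, 0.0])

# Made-up calibrated estimates from two runs of the reweighter.
estimate_before = np.array([120.0, 180.0, 150.0])  # trained with fallback targets
estimate_after = np.array([105.0, 190.0, 90.0])    # trained with NaN-masked targets

err_before = observed_relative_error(estimate_before, direct, has_data)
err_after = observed_relative_error(estimate_after, direct, has_data)

# Flag the migration if fit on observed cells got worse.
degraded = err_after > err_before
```

Running the same comparison per target block (tenure, rent, ONS income) would show whether dropping the fallback cells changes the fit where real data exists.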

Either path is fine. The current state (no comment, no consistency) forces the next PR author to guess which pattern to follow.

Related

cc @MaxGhenis @vahid-ahmadi
