This repository was archived by the owner on Jun 19, 2026. It is now read-only.

Commit 8c3c587

vahid-ahmadi and claude committed
Thread time_period through local-area loss matrices (#345 step 4)
Fixes two concrete bugs that would have prevented calibrating the same base dataset at a year other than its stored `time_period`:

- `local_authorities/loss.py` read household weights at a hard-coded 2025 when computing the national-total fallbacks used for LAs missing ONS data. Now uses the explicit `time_period` argument.
- `constituencies/loss.py` passed `dataset.time_period` to `get_national_income_projections` and `sim.default_calculation_period` even when the caller supplied a different `time_period`. Same fix.

Also extracts the year-resolution logic from `build_loss_matrix._resolve_value` into a documented public function `resolve_target_value`, names the three-year tolerance as a constant, and adds 12 unit tests covering the fallback policy (exact match, nearest past year, tolerance limit, no backwards extrapolation, VOA population scaling).

Ships `docs/targets_coverage.md` documenting year coverage across every target category and where the real gaps are (DWP 2026+, local-area CSV refreshes). No new data sourced in this PR; sourcing is deferred.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 04121ee commit 8c3c587

6 files changed

Lines changed: 247 additions & 11 deletions


changelog.d/345.md

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-Add panel ID contract, `create_yearly_snapshots` helper and `age_dataset` demographic ageing module as the first three steps towards per-year snapshots (#345).
+Add panel ID contract, `create_yearly_snapshots` helper, `age_dataset` demographic ageing module and year-aware loss matrices with a documented `resolve_target_value` fallback policy as the first four steps towards per-year snapshots (#345).

docs/targets_coverage.md

Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
+# Calibration target coverage by year
+
+Per-year calibration (step 4 of [#345](https://github.com/PolicyEngine/policyengine-uk-data/issues/345)) depends on target values being available for every year we want to calibrate against. This document is the snapshot of where coverage is today so it is easy to see where future data-sourcing work is needed.
+
+The pipeline resolves a requested year against each target's available years using `resolve_target_value` in `policyengine_uk_data/targets/build_loss_matrix.py`. The policy is: exact match → nearest past year within three years → `None`. There is no backwards extrapolation and no forwards extrapolation beyond three years — if the tolerance is exceeded the target is silently skipped.
+
+## National and country-level targets
+
+| Category | Source | Year range in repo | Year-keyed? | Notes |
+| --- | --- | --- | --- | --- |
+| Population by sex × age band | ONS mid-year population estimates | 2022-2029 | Yes (`ons_demographics.py`) | Downloaded as multi-year at registry build time |
+| Regional population by age | ONS subnational estimates | 2018-2029 | Yes (`demographics.csv`) | Multi-year CSV keyed by `year` column |
+| UK total population | ONS | 2018-2029 | Yes (`demographics.csv`) | Used by `resolve_target_value` for VOA scaling |
+| Scotland demographics (children, babies, 3+ child households) | ONS / Scottish Government | 2025 + some 2029 | Partial | Missing 2026-2028 |
+| Income by band (SPI) — national | HMRC SPI | 2022-2029 | Yes (`incomes_projection.csv`) | Projected from 2021 SPI via microsimulation |
+| Income tax, NICs, VAT, CGT, SDLT, fuel duty totals | OBR Economic and Fiscal Outlook | 2024-2030 | Yes (`obr.py`) | Live download, multi-year |
+| Council tax totals | OBR | 2024-2030 | Yes | OBR line items |
+| Council tax band counts | VOA | 2024 | No | Population-scaled by `resolve_target_value` for adjacent years |
+| Housing totals (mortgage, private rent, social rent) | ONS / EHS | 2025 | No | Single-year only — needs 2026+ refresh |
+| Tenure totals | ONS / EHS | 2025 | No | Single-year only |
+| Savings interest | ONS | 2025 | No | Single-year only |
+| Land values (household, corporate, total) | ONS National Balance Sheet | 2025 | No | Single-year only |
+| Regional household land values | MHCLG | 2025 | No | Single-year only |
+| DWP benefit caseloads (UC, ESA, PIP, JSA, benefit cap, UC by children / family type) | DWP Stat-Xplore / benefit statistics | 2025 (a few 2026) | Mostly no | **Primary gap**: 2026+ needs DWP forecasts or policy extrapolation |
+| Salary sacrifice (IT relief, contributions, NI relief) | HMRC / OBR | 2025 | No | OBR has 2024-2030 on some items |
+| Salary sacrifice headcount | OBR | 2024-2030 | Yes | Multi-year |
+| UC jobseeker splits, UC outside cap | OBR | 2024-2030 | Yes | Multi-year |
+| Two-child limit | DWP | 2025 | No | Single-year only |
+| Student loan plan borrower counts | SLC | 2025 | No | Single-year only |
+| Student loan repayment | SLC | 2025 | No | Single-year only |
+| NTS vehicle ownership | DfT National Travel Survey | 2024 | No | Single-year only |
+| TV licence | OBR | 2024 + 3% pa extrapolation | Implicit | Hard-coded extrapolation in `obr.py` |
+
+## Constituency- and LA-level targets
+
+| Category | Source | Year | Notes |
+| --- | --- | --- | --- |
+| Age bands per constituency / LA | ONS subnational population estimates | Snapshot (no year column) | `age.csv` files under `datasets/local_areas/*/targets/`; need annual refresh |
+| Income by area (employment, self-employment; count + amount) | HMRC SPI table 3.15 | Snapshot | `spi_by_constituency.csv`, `spi_by_la.csv`; HMRC publishes annually |
+| UC household counts by area | DWP Stat-Xplore | November 2023 | Scaled to 2025 national totals via `_scaled_uc_children_by_country` |
+| UC households by number of children (area level) | DWP Stat-Xplore | November 2023 base + 2025 scaling | In `local_uc.py` |
+| ONS small-area income estimates (LA only) | ONS | FYE 2020 + uprating | Uprated per-year via `get_ons_income_uprating_factors(year)` |
+| Tenure by LA | English Housing Survey 2023 | Snapshot | `la_tenure.xlsx` |
+| Private rent median by LA | VOA / ONS | Snapshot | `la_private_rents_median.xlsx` |
+
+## Known gaps — what blocks full per-year calibration
+
+1. **DWP benefit caseloads for 2026+**. The DWP statistical releases publish mostly current-year snapshots; forecasts are internal. Getting these requires coordination with the policy team or an agreed extrapolation policy from 2025 onwards.
+2. **Local-area CSVs (age, SPI, UC)**. Single-year snapshots stored as CSVs without a `year` column. For panel calibration these need an annual refresh process and a filename convention that includes the source year (e.g. `spi_by_constituency_2024.csv`).
+3. **Small-scale single-year sources**. NTS vehicles (2024), housing totals, SLC student loans, land values — each individually small but collectively relevant.
+
+## Related code
+
+- `policyengine_uk_data/targets/build_loss_matrix.py`: `resolve_target_value` and the national loss matrix builder.
+- `policyengine_uk_data/datasets/local_areas/constituencies/loss.py` — constituency loss matrix; now honours `time_period`.
+- `policyengine_uk_data/datasets/local_areas/local_authorities/loss.py` — LA loss matrix; now honours `time_period` (previously read weights at hard-coded 2025).
+- `policyengine_uk_data/targets/sources/` — individual target modules. The multi-year ones (`obr.py`, `ons_demographics.py`) are the template for converting the rest.
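
The fallback policy stated in `docs/targets_coverage.md` (exact match, then nearest past year within three years, then `None`) can be sketched in a few lines. This is a minimal illustration only: it works on a plain `dict` rather than the repository's `Target` class, the names `resolve_documented_policy` and `TOLERANCE_YEARS` are hypothetical, and VOA population scaling is left out.

```python
from typing import Optional

# Hypothetical constant mirroring the three-year limit described in the doc.
TOLERANCE_YEARS = 3


def resolve_documented_policy(
    values: dict[int, float],
    year: int,
    tolerance: int = TOLERANCE_YEARS,
) -> Optional[float]:
    """Exact match, else nearest past year within tolerance, else None."""
    if year in values:
        return values[year]
    # Only years strictly before the request are candidates, so a value is
    # never carried backwards in time.
    past_years = [y for y in values if y < year]
    if not past_years:
        return None
    nearest_past = max(past_years)
    if year - nearest_past > tolerance:
        return None
    return values[nearest_past]


single_year = {2024: 100.0}
print(resolve_documented_policy(single_year, 2024))  # exact match
print(resolve_documented_policy(single_year, 2026))  # carried forward
print(resolve_documented_policy(single_year, 2028))  # beyond tolerance
print(resolve_documented_policy(single_year, 2022))  # no backwards lookup
```

Under this reading a single-year 2024 source serves requests for 2024 to 2027 and is skipped outside that window.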

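The filename convention proposed under "Known gaps" (a source year embedded in names such as `spi_by_constituency_2024.csv`) could be read back with a small helper. The sketch below is an assumption about how such a convention might be parsed; `source_year_from_filename` is a hypothetical name and nothing like it appears in this commit.

```python
import re
from typing import Optional

# A trailing four-digit year before the .csv extension, e.g. "_2024.csv".
_YEAR_SUFFIX = re.compile(r"_(\d{4})\.csv$")


def source_year_from_filename(filename: str) -> Optional[int]:
    """Return the source year encoded in the filename, or None if absent."""
    match = _YEAR_SUFFIX.search(filename)
    return int(match.group(1)) if match else None


print(source_year_from_filename("spi_by_constituency_2024.csv"))
print(source_year_from_filename("age.csv"))
```

Files without a year suffix return `None`, so the current snapshot CSVs would remain distinguishable from refreshed, year-stamped ones.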
policyengine_uk_data/datasets/local_areas/constituencies/loss.py

Lines changed: 6 additions & 2 deletions
@@ -46,14 +46,18 @@ def create_constituency_target_matrix(
         time_period = dataset.time_period
 
     sim = Microsimulation(dataset=dataset, reform=reform)
-    sim.default_calculation_period = dataset.time_period
+    # Honour the explicit ``time_period`` argument so that calibrating the
+    # same base dataset to a different year (needed for panel output in
+    # #345) reads variables at the requested year rather than the dataset's
+    # stored year.
+    sim.default_calculation_period = time_period
 
     matrix = pd.DataFrame()
     y = pd.DataFrame()
 
     # ── Income targets ─────────────────────────────────────────────
     incomes = get_constituency_income_targets()
-    national_incomes = get_national_income_projections(int(dataset.time_period))
+    national_incomes = get_national_income_projections(int(time_period))
 
     for income_variable in INCOME_VARIABLES:
         income_values = sim.calculate(income_variable).values
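
The pattern behind this fix is to resolve the calibration year once and pass that single value to every downstream lookup. The sketch below illustrates it with stand-ins: `FakeDataset` and `effective_period` are hypothetical, and the assumption that the real function falls back to `dataset.time_period` when no year is supplied is inferred from the context line at the top of the hunk, not shown in full there.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeDataset:
    """Hypothetical stand-in for a dataset with a stored year."""

    time_period: int


def effective_period(dataset: FakeDataset, time_period: Optional[int] = None) -> int:
    """Prefer the caller's year; fall back to the dataset's stored year."""
    return int(dataset.time_period if time_period is None else time_period)


dataset = FakeDataset(time_period=2023)

# Every consumer receives the same resolved year, so the simulation period
# and the national projections cannot drift apart.
year = effective_period(dataset, time_period=2026)
sim_period = year
projection_year = year
print(sim_period, projection_year)
```

Reading `dataset.time_period` again further down, as the removed lines did, silently discards the caller's year even though the argument was accepted.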

policyengine_uk_data/datasets/local_areas/local_authorities/loss.py

Lines changed: 7 additions & 2 deletions
@@ -51,15 +51,20 @@ def create_local_authority_target_matrix(
     la_codes = pd.read_csv(STORAGE_FOLDER / "local_authorities_2021.csv")
 
     sim = Microsimulation(dataset=dataset, reform=reform)
-    original_weights = sim.calculate("household_weight", 2025).values
+    # Read the uncalibrated weights at the requested calibration year rather
+    # than a hard-coded 2025. The downstream national totals on lines
+    # ~160-170 are used as fall-back per-LA targets when ONS data is
+    # missing, and they must be expressed at the same year as the targets
+    # we are calibrating against.
+    original_weights = sim.calculate("household_weight", int(time_period)).values
     sim.default_calculation_period = time_period
 
     matrix = pd.DataFrame()
     y = pd.DataFrame()
 
     # ── Income targets ─────────────────────────────────────────────
     incomes = get_la_income_targets()
-    national_incomes = get_national_income_projections(int(dataset.time_period))
+    national_incomes = get_national_income_projections(int(time_period))
 
     for income_variable in INCOME_VARIABLES:
         income_values = sim.calculate(income_variable).values
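
To see why the year used for the weights matters, consider a weighted national total that serves as a fallback target. The numbers and names below (`weights_by_year`, `is_renter`, `fallback_total`) are invented for illustration and assume only that household weights can differ between years.

```python
# Made-up weights for two households in two different years.
weights_by_year = {
    2025: [1000.0, 2000.0],
    2027: [1100.0, 2100.0],
}
# Made-up indicator: household 0 rents, household 1 does not.
is_renter = [1.0, 0.0]


def national_total(values: list[float], weights: list[float]) -> float:
    """Weighted sum over households."""
    return sum(v * w for v, w in zip(values, weights))


def fallback_total(year: int) -> float:
    """National total computed with the weights of the requested year."""
    return national_total(is_renter, weights_by_year[year])


# Calibrating for 2027 with weights read at 2025 would understate the
# fallback target in this toy example.
print(fallback_total(2025), fallback_total(2027))
```

With a hard-coded 2025, a 2027 calibration would mix 2025 fallback totals with 2027 targets, which is the mismatch the new comment in the hunk describes.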

policyengine_uk_data/targets/build_loss_matrix.py

Lines changed: 46 additions & 6 deletions
@@ -122,24 +122,59 @@ def create_target_matrix(
     return df, pd.Series(target_values, index=target_names)
 
 
-def _resolve_value(target: Target, year: int) -> float | None:
-    """Get the target value for a year, falling back to nearest year.
+YEAR_FALLBACK_TOLERANCE = 3
+"""Maximum allowed gap between the requested year and the nearest available
+target year before ``resolve_target_value`` gives up and returns ``None``.
+
+Three years matches the typical publication lag on HMRC/DWP/ONS series and
+the length of the OBR's rolling forecast window. It is a deliberately
+generous ceiling — the goal is to keep the calibration functional when a
+dataset runs slightly ahead of the latest published target, not to make up
+numbers several years out."""
+
+
+def resolve_target_value(
+    target: Target,
+    year: int,
+    *,
+    tolerance: int = YEAR_FALLBACK_TOLERANCE,
+) -> float | None:
+    """Return the calibration value for ``target`` at ``year``.
+
+    Policy, in order:
+
+    1. **Exact year available** → return it.
+    2. **Requested year is in the past relative to all available years**
+       (e.g. ask 2022, data starts 2024) → return ``None``. We do not
+       extrapolate backwards, because doing so would quietly misreport
+       historical reality.
+    3. **Nearest past year is within ``tolerance`` years** → use that value.
+       For VOA-sourced targets only, the value is scaled by the UK
+       population ratio between the two years so that count-type targets
+       track demographic growth.
+    4. **Nearest past year is further than ``tolerance``** → return
+       ``None`` rather than make up a number.
 
-    VOA council tax targets are population-uprated when extrapolating
-    from their base year (2024).
+    Args:
+        target: a ``Target`` whose ``values`` map maps year → scalar.
+        year: the calendar year being requested.
+        tolerance: maximum number of years the fallback may look back.
+            Defaults to ``YEAR_FALLBACK_TOLERANCE``.
+
+    Returns:
+        The target value, or ``None`` if no usable year is available.
     """
     if year in target.values:
         return target.values[year]
     available = sorted(target.values.keys())
     if not available:
         return None
     closest = min(available, key=lambda y: abs(y - year))
-    if abs(closest - year) > 3:
+    if abs(closest - year) > tolerance:
         return None
     if closest > year:
         return None
     base_value = target.values[closest]
-    # VOA council tax counts scale with population
     if target.source == "voa" and year != closest:
         from policyengine_uk_data.targets.sources.local_age import (
             get_uk_total_population,
@@ -152,6 +187,11 @@ def _resolve_value(target: Target, year: int) -> float | None:
     return base_value
 
 
+# Kept as a private alias so existing call sites (this module only) do not
+# need to be rewritten. New code should prefer ``resolve_target_value``.
+_resolve_value = resolve_target_value
+
+
 class _SimContext:
     """Holds the simulation and lazily computed intermediate arrays."""
 
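
The control flow visible in the hunk above can be replayed on a plain `dict` to check edge cases. The sketch below copies only the lines shown in the diff for non-VOA targets; `resolve_as_in_hunk` is a hypothetical name and the `Target` class is replaced by a dictionary. It suggests that the nearest year is picked in either direction before the past-only check, so a closer future year can mask a past year that is still within tolerance.

```python
from typing import Optional


def resolve_as_in_hunk(
    values: dict[int, float], year: int, tolerance: int = 3
) -> Optional[float]:
    """Replay of the visible non-VOA branch, on a dict instead of a Target."""
    if year in values:
        return values[year]
    available = sorted(values)
    if not available:
        return None
    # Nearest year by absolute distance, past or future.
    closest = min(available, key=lambda y: abs(y - year))
    if abs(closest - year) > tolerance:
        return None
    if closest > year:
        return None
    return values[closest]


# 2026 is one year away and 2023 is two years away, so 2026 is "closest",
# and the future-year check then rejects it.
print(resolve_as_in_hunk({2023: 7.0, 2026: 9.0}, 2025))

# On a tie, min() keeps the first item of the sorted list, i.e. the past year.
print(resolve_as_in_hunk({2024: 1.0, 2026: 2.0}, 2025))
```

None of the 12 tests in this commit mixes past and future years around the requested year, so this case is not pinned down either way by the test suite.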

Lines changed: 130 additions & 0 deletions
@@ -0,0 +1,130 @@
+"""Tests for the year-resolution policy in build_loss_matrix (#345, step 4)."""
+
+import pytest
+
+from policyengine_uk_data.targets.build_loss_matrix import (
+    YEAR_FALLBACK_TOLERANCE,
+    resolve_target_value,
+)
+from policyengine_uk_data.targets.schema import (
+    GeographicLevel,
+    Target,
+    Unit,
+)
+
+
+def _target(values: dict[int, float], *, source: str = "ons") -> Target:
+    """Minimal Target instance for year-resolution tests."""
+    return Target(
+        name=f"{source}/test",
+        variable="test_variable",
+        source=source,
+        unit=Unit.COUNT,
+        geographic_level=GeographicLevel.NATIONAL,
+        values=values,
+    )
+
+
+def test_exact_year_match_returns_that_value():
+    t = _target({2024: 10.0, 2025: 20.0, 2026: 30.0})
+    assert resolve_target_value(t, 2025) == 20.0
+
+
+def test_exact_match_preferred_over_fallback():
+    t = _target({2024: 10.0, 2025: 20.0})
+    # Even though 2025 is also available and adjacent, the exact match
+    # for 2024 must win outright.
+    assert resolve_target_value(t, 2024) == 10.0
+
+
+def test_falls_back_to_nearest_past_year_within_tolerance():
+    t = _target({2023: 7.0})
+    # 2024, 2025 and 2026 all fall back to 2023 within tolerance.
+    assert resolve_target_value(t, 2024) == 7.0
+    assert resolve_target_value(t, 2025) == 7.0
+    assert resolve_target_value(t, 2026) == 7.0
+
+
+def test_returns_none_when_only_future_years_available():
+    """Extrapolating backwards would misreport historical reality."""
+    t = _target({2025: 50.0, 2026: 60.0})
+    assert resolve_target_value(t, 2023) is None
+    assert resolve_target_value(t, 2024) is None
+
+
+def test_returns_none_when_fallback_exceeds_tolerance():
+    t = _target({2020: 5.0})
+    # 2024 is four years away — beyond the three-year default.
+    assert resolve_target_value(t, 2024) is None
+
+
+def test_custom_tolerance_is_honoured():
+    t = _target({2020: 5.0})
+    # Explicitly widen the tolerance.
+    assert resolve_target_value(t, 2024, tolerance=4) == 5.0
+    # Or tighten it.
+    assert resolve_target_value(t, 2022, tolerance=1) is None
+
+
+def test_empty_values_returns_none():
+    t = _target({})
+    assert resolve_target_value(t, 2025) is None
+
+
+def test_default_tolerance_is_three_years():
+    """Lock the public tolerance constant so a change is deliberate."""
+    assert YEAR_FALLBACK_TOLERANCE == 3
+
+
+def test_non_voa_target_does_not_get_population_scaled():
+    """Only VOA council-tax counts should track population growth."""
+    t = _target({2024: 100.0}, source="dwp")
+    # 2025 is within tolerance but DWP data must not be rescaled.
+    assert resolve_target_value(t, 2025) == 100.0
+
+
+def test_voa_target_scales_with_population_when_extrapolating(monkeypatch):
+    """VOA counts must move roughly in line with population."""
+    t = _target({2024: 100.0}, source="voa")
+
+    fake_pop = {2024: 67_000_000.0, 2025: 68_000_000.0}
+
+    def fake_total_population(year):
+        return fake_pop[year]
+
+    monkeypatch.setattr(
+        "policyengine_uk_data.targets.sources.local_age.get_uk_total_population",
+        fake_total_population,
+    )
+
+    resolved = resolve_target_value(t, 2025)
+    expected = 100.0 * 68_000_000.0 / 67_000_000.0
+    assert resolved == pytest.approx(expected)
+
+
+def test_voa_target_returns_base_when_year_matches_exactly(monkeypatch):
+    """Population scaling only kicks in when we actually extrapolate."""
+    t = _target({2025: 123.0}, source="voa")
+
+    # If the scaler is called, blow up — it must not be touched here.
+    def explode(year):
+        raise AssertionError("Population scaler called on exact match")
+
+    monkeypatch.setattr(
+        "policyengine_uk_data.targets.sources.local_age.get_uk_total_population",
+        explode,
+    )
+
+    assert resolve_target_value(t, 2025) == 123.0
+
+
+def test_voa_guards_against_zero_population_base(monkeypatch):
+    """If the population lookup returns zero, fall back to the raw value."""
+    t = _target({2024: 100.0}, source="voa")
+
+    monkeypatch.setattr(
+        "policyengine_uk_data.targets.sources.local_age.get_uk_total_population",
+        lambda year: 0.0,
+    )
+    # Division by zero must be avoided; raw value returned as-is.
+    assert resolve_target_value(t, 2025) == 100.0
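
The three VOA tests above imply a scaling rule of the form value × population(year) / population(base year), with a guard for a zero base population. The scaling code itself sits outside the visible hunks, so the sketch below is inferred from the tests only; `scale_by_population` and its signature are hypothetical.

```python
import math
from typing import Callable


def scale_by_population(
    base_value: float,
    base_year: int,
    year: int,
    population: Callable[[int], float],
) -> float:
    """Scale a count by the population ratio between two years."""
    base_pop = population(base_year)
    if base_pop == 0:
        # Avoid dividing by zero: return the unscaled value instead.
        return base_value
    return base_value * population(year) / base_pop


# Same made-up populations as in the test above.
fake_pop = {2024: 67_000_000.0, 2025: 68_000_000.0}

scaled = scale_by_population(100.0, 2024, 2025, fake_pop.__getitem__)
print(math.isclose(scaled, 100.0 * 68_000_000.0 / 67_000_000.0))

# A population lookup that always returns zero leaves the value unchanged.
print(scale_by_population(100.0, 2024, 2025, lambda y: 0.0))
```

Passing the population lookup in as an argument keeps the sketch testable without `monkeypatch`; the commit's tests patch `get_uk_total_population` at its import path because the real function imports it inside the VOA branch.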
