
Commit dfe5e6c

Authored by vahid-ahmadi, claude, and MaxGhenis
Calibrate combined self-employed NICs and skip inert Class 3 target (#379)
* Calibrate combined self-employed NICs target and skip inert Class 3

Two NI calibration gaps surfaced during issue audit (#88) and bug report #378:

1. Recent OBR EFOs (e.g. March 2026) publish a single combined "Class 4 and Class 2 Self employed NICs" line instead of two separate rows. The parser's Class 2 / Class 4 candidate labels no longer matched, so neither target was registered and self-employed NICs were silently uncalibrated against OBR.
2. ni_class_3 is an input variable in PolicyEngine UK with no formula and no dataset path that populates it. The matrix column is therefore a flat zero, calibration cannot move it, and the diagnostic that "the target is included" is misleading.

This commit:

- Adds an obr/ni_self_employed target whose values come from the combined EFO line and whose matrix column is computed via a new custom_compute that sums ni_class_2 + ni_class_4 at the household level. Smoke-build on enhanced_frs_2023_24.h5 with year=2025: 6,848 non-zero households, target £2.90bn.
- Keeps the legacy Class 2 / Class 4 candidate labels around so older or future EFOs that revert to separate rows still produce individual targets.
- Removes the ni_class_3 entry from _parse_nics with a comment pointing at #378 and the conditions for restoring it (a Class 3 imputation that addresses #88 in full).

Tests cover both layers:

- test_obr_nics.py: parser handles the combined EFO layout, the legacy separate layout, and intentionally drops Class 3 in either format.
- test_obr_nic_signal.py: the registered targets are present in the registry, the combined target carries a custom_compute callable, ni_class_3 is absent, and (gated on enhanced_frs) each underlying PE-UK NI variable produces non-zero variation while ni_class_3 returns a uniform zero, the very property that makes it inert as a calibration target.

Closes #378. Partial close of #88; Class 3 imputation remains a separate follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Use direct self-employed NICs target variable

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Max Ghenis <mghenis@gmail.com>
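
The label-matching behaviour described in the commit message can be sketched as below. This is a minimal illustration, assuming the parser looks up each target's candidate labels in the sheet and registers a target only when one of them matches. `NIC_ROWS`, `find_row`, `registered_targets`, and the two sample layouts are hypothetical names and data, not code from obr.py.

```python
from typing import Dict, List, Optional

# Hypothetical mapping: target name -> (candidate row labels, PE-UK variable).
# The shape mirrors the entries shown in the obr.py diff; the contents are trimmed.
NIC_ROWS = {
    "ni_self_employed": (
        [
            "Class 4 and Class 2 Self employed NICs",
            "Class 2 and Class 4 Self employed NICs",
        ],
        "ni_self_employed",
    ),
    "ni_class_2": (["Class 2 NICs"], "ni_class_2"),
    "ni_class_4": (["Class 4 NICs"], "ni_class_4"),
}


def find_row(sheet_labels: Dict[str, int], candidates: List[str]) -> Optional[int]:
    """Return the row number of the first candidate label present in the sheet."""
    for label in candidates:
        if label in sheet_labels:
            return sheet_labels[label]
    return None


def registered_targets(sheet_labels: Dict[str, int]) -> List[str]:
    """List the target names whose candidate labels match a row in the sheet."""
    found = []
    for name, (candidates, _variable) in NIC_ROWS.items():
        if find_row(sheet_labels, candidates) is not None:
            found.append(f"obr/{name}")
    return found


# Illustrative layouts (row numbers are made up).
combined_layout = {
    "Class 1 Employer NICs": 10,
    "Class 4 and Class 2 Self employed NICs": 11,
}
legacy_layout = {"Class 2 NICs": 11, "Class 4 NICs": 12}

print(registered_targets(combined_layout))
print(registered_targets(legacy_layout))
```

Under these assumptions the combined layout yields only the combined target, while a layout with separate rows still yields the individual Class 2 and Class 4 targets.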
1 parent 0fe94f8 commit dfe5e6c

4 files changed

Lines changed: 281 additions & 91 deletions


changelog.d/378.md

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+- Calibrate against self-employed NICs (combined OBR Class 2 + Class 4 line) and skip the inert Class 3 target so calibration diagnostics are no longer misleading (#378, partial close of #88).

policyengine_uk_data/targets/sources/obr.py

Lines changed: 36 additions & 20 deletions
@@ -249,6 +249,21 @@ def _parse_nics(wb: openpyxl.Workbook) -> list[Target]:
             ["Class 1 Employer NICs"],
             "ni_employer",
         ),
+        # Self-employed NICs.
+        #
+        # Recent OBR EFOs (e.g. March 2026) publish a single combined
+        # "Class 4 and Class 2 Self employed NICs" line rather than two
+        # separate rows. Earlier EFOs split them. We list the combined
+        # label first; if a future EFO reverts to separate rows, the
+        # legacy ``ni_class_2`` / ``ni_class_4`` entries below will pick
+        # them up instead.
+        "ni_self_employed": (
+            [
+                "Class 4 and Class 2 Self employed NICs",
+                "Class 2 and Class 4 Self employed NICs",
+            ],
+            "ni_self_employed",
+        ),
         "ni_class_2": (
             [
                 "Class 2 NICs",
@@ -257,14 +272,14 @@ def _parse_nics(wb: openpyxl.Workbook) -> list[Target]:
             ],
             "ni_class_2",
         ),
-        "ni_class_3": (
-            [
-                "Class 3 NICs",
-                "Class 3 Voluntary NICs",
-                "Class 3 voluntary NICs",
-            ],
-            "ni_class_3",
-        ),
+        # Class 3 NICs are voluntary contributions to fill state-pension
+        # record gaps. The PE-UK variable ni_class_3 is an input with no
+        # formula and no dataset path populates it, so a calibration target
+        # would fall through to a flat-zero matrix column with no signal —
+        # see issue #378. Class 3 is also small (~£50m vs ~£150bn total
+        # NICs, ~0.03%), so the calibrator cannot meaningfully match it
+        # without a record-level imputation. Skipped here pending an
+        # imputation that addresses #88; restore the row when one lands.
         "ni_class_4": (
             [
                 "Class 4 NICs",
@@ -296,18 +311,19 @@ def _parse_nics(wb: openpyxl.Workbook) -> list[Target]:
             continue

         values = _read_row_values(ws, row_num, cols)
-        if values:
-            targets.append(
-                Target(
-                    name=f"obr/{name}",
-                    variable=variable,
-                    source="obr",
-                    unit=Unit.GBP,
-                    values=values,
-                    reference_url=ref,
-                    forecast_vintage=vintage,
-                )
-            )
+        if not values:
+            continue
+
+        kwargs: dict = {
+            "name": f"obr/{name}",
+            "variable": variable,
+            "source": "obr",
+            "unit": Unit.GBP,
+            "values": values,
+            "reference_url": ref,
+            "forecast_vintage": vintage,
+        }
+        targets.append(Target(**kwargs))

     return targets
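
The comment in the diff argues that a flat-zero matrix column gives the calibrator nothing to work with. A toy reweighting problem makes that concrete. This is a minimal sketch, assuming a squared-error loss on the weighted total; the calibrator's actual loss is not shown in this commit and may differ, and every name and number below is illustrative.

```python
def weighted_total(weights, column):
    """Weighted sum of one calibration matrix column."""
    return sum(w * x for w, x in zip(weights, column))


def loss_gradient(weights, column, target):
    """Gradient of (weighted_total - target) ** 2 with respect to each weight."""
    residual = weighted_total(weights, column) - target
    return [2.0 * residual * x for x in column]


# Three hypothetical households with illustrative weights.
weights = [100.0, 250.0, 175.0]

# A column with signal: one household pays self-employed NICs.
self_employed_column = [0.0, 1200.0, 0.0]

# A flat-zero column, as described for ni_class_3.
class_3_column = [0.0, 0.0, 0.0]

grad_active = loss_gradient(weights, self_employed_column, target=400_000.0)
grad_inert = loss_gradient(weights, class_3_column, target=50_000_000.0)

print(grad_active)
print(grad_inert)
```

Every gradient entry for the zero column is zero whatever the target is, so no change to the weights can reduce the error on that target. The column with a non-zero household produces a non-zero gradient for that household's weight.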

Lines changed: 116 additions & 0 deletions
@@ -0,0 +1,116 @@
+"""Regression tests for OBR NIC calibration signal.
+
+Issue #378 (closing) and partial close of #88: every active OBR NIC
+target must produce a non-trivial calibration matrix column.
+
+Active targets (against the March 2026 EFO format):
+
+- ``obr/ni_employee`` — Class 1 employee, formula-derived in PE-UK.
+- ``obr/ni_employer`` — Class 1 employer, formula-derived in PE-UK.
+- ``obr/ni_self_employed`` — combined Class 2 + Class 4, aligned to the
+  PE-UK ``ni_self_employed`` variable.
+
+Class 3 is intentionally absent because no dataset populates
+``ni_class_3`` — the matrix column would be a flat zero.
+
+Two layers:
+
+1. Registry layer (no Microsimulation) — assert the three expected NIC
+   targets are present in the OBR target set and Class 3 is absent.
+2. Signal layer (gated on the enhanced FRS fixture) — assert each
+   active NIC variable produces non-zero variation across households,
+   and that Class 3 would not. Catches future regressions where someone
+   re-adds the target without an accompanying imputation.
+"""
+
+from __future__ import annotations
+
+import pytest
+
+
+_ACTIVE_TOPLINE_TARGET_NAMES = (
+    "obr/ni_employee",
+    "obr/ni_employer",
+    "obr/ni_self_employed",
+)
+_PE_UK_NIC_VARIABLES_WITH_SIGNAL = (
+    "ni_employee",
+    "ni_employer",
+    "ni_self_employed",
+    "ni_class_2",
+    "ni_class_4",
+)
+
+
+# ── Layer 1: registry contract ──────────────────────────────────────
+
+
+def test_obr_nic_target_registry_includes_active_classes():
+    """The OBR target source must emit the three top-line NIC class targets."""
+    from policyengine_uk_data.targets import get_all_targets
+
+    expected = set(_ACTIVE_TOPLINE_TARGET_NAMES)
+    actual = {t.name for t in get_all_targets() if t.name in expected}
+    assert actual == expected, f"Missing OBR NIC targets: {expected - actual}"
+
+
+def test_obr_ni_self_employed_target_uses_direct_pe_variable():
+    """The combined self-employed target must map directly to the PE-UK
+    ``ni_self_employed`` variable so target lineage stays explicit."""
+    from policyengine_uk_data.targets import get_all_targets
+
+    target = next(
+        (t for t in get_all_targets() if t.name == "obr/ni_self_employed"),
+        None,
+    )
+    assert target is not None, "obr/ni_self_employed not registered"
+    assert target.variable == "ni_self_employed"
+    assert target.custom_compute is None
+
+
+def test_obr_ni_class_3_target_is_intentionally_absent():
+    """Class 3 must not appear in the registered targets (#378)."""
+    from policyengine_uk_data.targets import get_all_targets
+
+    obr_targets = [t for t in get_all_targets() if t.source == "obr"]
+    assert "obr/ni_class_3" not in {t.name for t in obr_targets}
+    assert "ni_class_3" not in {t.variable for t in obr_targets}
+
+
+# ── Layer 2: simulator signal ───────────────────────────────────────
+
+
+@pytest.mark.parametrize("variable", _PE_UK_NIC_VARIABLES_WITH_SIGNAL)
+def test_active_nic_variable_has_nonzero_variation(enhanced_frs, variable):
+    """Each active NIC variable must produce variation across households,
+    otherwise the calibration matrix column is a flat constant and the
+    optimiser cannot match its target."""
+    from policyengine_uk import Microsimulation
+
+    sim = Microsimulation(dataset=enhanced_frs)
+    sim.default_calculation_period = enhanced_frs.time_period
+    values = sim.calculate(variable).values
+
+    nonzero = int((values != 0).sum())
+    assert nonzero > 0, f"{variable}: all values are zero — calibration would be inert"
+    assert float(values.var()) > 0.0, (
+        f"{variable}: zero variance — calibration cannot move it"
+    )
+
+
+def test_ni_class_3_simulator_returns_uniform_zero(enhanced_frs):
+    """Direct evidence for why Class 3 is excluded: the simulator produces
+    a flat-zero vector, so any calibration target on it is inert. If this
+    ever stops being true (e.g. policyengine-uk adds a formula or this
+    repo adds an imputation), the Class 3 target should be re-enabled in
+    obr.py and the corresponding skip removed."""
+    from policyengine_uk import Microsimulation
+
+    sim = Microsimulation(dataset=enhanced_frs)
+    sim.default_calculation_period = enhanced_frs.time_period
+    values = sim.calculate("ni_class_3").values
+
+    assert (values == 0).all(), (
+        f"ni_class_3 has {(values != 0).sum()} non-zero entries — "
+        "if intended, restore the target in obr.py and update this test."
+    )