Commit a2eb7b4

Merge main into panel branch; resolve calibrate.py
Conflict was in `policyengine_uk_data/utils/calibrate.py`: main added `load_weights` (#351 defensive h5-weight loader) at the same file position where this branch added `compute_log_weight_smoothness_penalty` (#346 step 5). Both functions are independent and both stay. All 150 tests in the panel-pipeline suite + the calibrate smoothness tests pass after the merge, including `load_weights` consumers pulled in from main (test_calibrate_save, test_la_land_value_targets).
2 parents 154732d + 163c432 commit a2eb7b4

35 files changed

Lines changed: 2387 additions & 246 deletions

.github/CONTRIBUTING.md

Lines changed: 43 additions & 4 deletions
@@ -1,7 +1,46 @@
-## Updating data
+# Contributing to policyengine-uk-data
 
-If your changes present a non-bugfix change to one or more datasets which are cloud-hosted (FRS and EFRS), then please change both the filename and URL (in both the class definition file and in `storage/upload_completed_datasets.py`). This enables us to store historical versions of datasets separately and reproducibly.
+See the [shared PolicyEngine contribution guide](https://github.com/PolicyEngine/.github/blob/main/CONTRIBUTING.md) for cross-repo conventions (towncrier changelog fragments, `uv run`, PR description format, anti-patterns). This file covers policyengine-uk-data specifics.
 
-## Updating the versioning
+## Commands
 
-Please add to `changelog.yaml` and then run `make changelog` before committing the results ONCE in this PR.
+```bash
+make install # install deps (uv)
+make format # format (required)
+make download # download raw FRS + SPI inputs from HF (needs HUGGING_FACE_TOKEN)
+make data # full dataset build (impute, calibrate, upload)
+make test # test suite
+uv run pytest policyengine_uk_data/tests/path/to/test.py -v
+```
+
+Python 3.13+. Default branch: `main`. Raw FRS / SPI microdata live on HuggingFace; set `HUGGING_FACE_TOKEN` before running anything that touches the dataset build.
+
+## What lives here
+
+This repo builds the `.h5` files that feed `policyengine-uk`:
+
+- `datasets/frs.py` — raw FRS → PolicyEngine variable mapping
+- `datasets/imputations/` — QRF / other imputations layered on top (income, wealth, consumption, etc.)
+- `datasets/local_areas/` — constituency and local-authority calibration
+- `targets/` — calibration target sources (OBR, DWP, HMRC, ONS, SLC, etc.)
+- `utils/calibrate.py` — the reweighting optimiser
+- `storage/` — raw inputs, intermediate artefacts, published outputs
+
+## Data-protection rules — no exceptions
+
+The enhanced FRS dataset is licensed under strict UK Data Service terms. Violating them risks losing access, which would end PolicyEngine UK.
+
+- **Never upload data to any public location.** The HuggingFace repo `policyengine/policyengine-uk-data-private` is private and authenticated.
+- **Never modify `upload_completed_datasets.py` or `utils/data_upload.py`** to change upload destinations without explicit confirmation from the data controller (currently Nikhil Woodruff).
+- **Never print, log, or output individual-level records.** Aggregates (sums, means, counts, weighted totals) are fine; individual rows are not.
+- **If you see a private/public repo split, assume it is intentional** — ask why before changing it.
+
+## Updating datasets
+
+If your change is a non-bugfix update to a cloud-hosted dataset (FRS, enhanced FRS), bump both the filename and URL in the class definition and in `storage/upload_completed_datasets.py`. That lets us store historical dataset versions separately and reproducibly.
+
+## Repo-specific anti-patterns
+
+- **Don't** hardcode dataset years in variable transforms; use `dataset.time_period` and the uprating pipeline.
+- **Don't** commit large binary artefacts — use HuggingFace storage.
+- **Don't** skip `make test` when touching the imputation or calibration pipeline; full CI rebuilds the dataset and takes ~25 minutes.
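The data-protection rules in the new CONTRIBUTING.md allow aggregates but forbid printing individual rows. A minimal sketch of what aggregate-only reporting could look like follows; `summarise_weighted` and the sample numbers are hypothetical and are not part of this repository.

```python
def summarise_weighted(values, weights):
    """Return aggregate statistics only; never echo individual rows."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    weighted_total = sum(v * w for v, w in zip(values, weights))
    population = sum(weights)
    return {
        "records": len(values),
        "weighted_total": weighted_total,
        "weighted_mean": weighted_total / population if population else 0.0,
    }


# Synthetic numbers for illustration only, not survey data.
summary = summarise_weighted([100.0, 250.0, 0.0], [1_000.0, 500.0, 1_500.0])
print(summary)
```

Logging the returned dictionary rather than the input lists keeps debug output on the permitted side of the rule.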

CHANGELOG.md

Lines changed: 73 additions & 0 deletions
@@ -1,3 +1,76 @@
+## [1.53.1] - 2026-04-20
+
+No significant changes.
+
+
+## [1.53.0] - 2026-04-19
+
+### Added
+
+- Tightened `test_population` tolerance from 7% to 3% now that the stage-2 QRF (#362), TFC target refresh (#363), and reported-anchor takeup (#359) pulled the weighted UK population overshoot from ~6.5% down to ~1.6%. Added four regression tests in `test_population_fidelity.py` (weighted-total match, household-count range, non-inflation guard, country-sum consistency) extracted from the earlier #310 draft so any future calibration drift back toward the pre-April-2026 overshoot trips CI.
+
+
+## [1.52.2] - 2026-04-18
+
+### Changed
+
+- Add second-stage QRF imputation of FRS-only variables on SPI-donor rows. After the first-stage SPI-trained QRF overwrites income components on the zero-weight subsample, a new second-stage QRF trained on the full FRS rewrites benefit `_reported` columns, pension contributions, and savings-income so they correlate with the freshly-imputed incomes instead of staying as whatever middle-income FRS donor was sampled. Mirrors the `policyengine-us-data#589` pattern. Prevents synthetic £2 M earners from carrying a middle-income donor's UC / housing-benefit receipt into calibration, which was blowing up benefit aggregates under upweight.
+- Anchor stochastic takeup assignment for Universal Credit, Pension Credit, and Child Benefit to the FRS-reported receipt columns, matching the `policyengine-us-data` pattern. Respondents who report positive receipt in the FRS benefits table now receive `would_claim_* = True` with certainty, and non-reporters are filled probabilistically to hit the aggregate target rate. Removes a source of calibration noise where respondents who clearly took up a benefit could be randomly assigned `would_claim = False`.
+
+### Fixed
+
+- Refresh Tax-Free Childcare calibration targets and take-up rate using HMRC's June 2025 release (covering 2024-25 outturn: £632 m spending, 985 k children reached). The prior target set was calibrated against the September 2024 release and undershot current TFC spending by roughly a third. Bumps the default TFC take-up rate from 0.586 to 0.88 on 2024-04-06 to close most of the gap pending a full recalibration run.
+
+
+## [1.52.1] - 2026-04-18
+
+### Fixed
+
+- Update the `Raise VAT standard rate by 2pp` reform-impact test expectation from 25.0 bn to 43.0 bn — the enhanced FRS's total consumption aggregate has grown to a UK-realistic ~£1.6 T (matching ONS 2025 total consumer expenditure), so a 2pp rise on the current `microdata_vat_coverage = 0.38`-scaled base produces ~£43 bn, not the £25 bn calibrated against an older smaller dataset. Also clamps raw electricity/gas consumption in `impute_energy_splits` to be non-negative (a handful of LCFS bill-variable inconsistencies produced small negatives), fixing `test_non_negative_energy`. Follow-up: revisit `microdata_vat_coverage` itself now that the underlying base is fuller (#364).
+
+
+## [1.52.0] - 2026-04-17
+
+### Changed
+
+- Point CONTRIBUTING.md at the shared PolicyEngine contribution guide (https://github.com/PolicyEngine/.github) and trim the per-repo file to commands, repo-specific conventions, and anti-patterns. Removes the stale `changelog_entry.yaml` / `make changelog` instructions.
+
+### Removed
+
+- Remove `policyengine_uk_data/tests/test_changelog_encoding.py`. It validated UTF-8 and YAML structure of the deprecated `changelog_entry.yaml`, which was retired when the repo migrated to towncrier `changelog.d/` fragments. All three tests now unconditionally `pytest.skip` because the file no longer exists, and any fragment-format validation is already handled by the `Check changelog fragment` CI step.
+
+
+## [1.51.1] - 2026-04-17
+
+### Fixed
+
+- Guard the rent/mortgage rescaling in `impute_over_incomes` against `ZeroDivisionError` when the seed dataset's imputation columns sum to zero (e.g. the zero-weight synthetic copy in `impute_income`).
+
+
+## [1.51.0] - 2026-04-17
+
+### Added
+
+- Add `policyengine_uk_data.utils.hf_destinations` with `PRIVATE_REPO` / `PUBLIC_REPO` constants and an AST-based xfail test (`tests/test_hf_destinations.py`) that flags every `upload(...)`, `upload_file(...)`, `upload_files_to_hf(...)`, and `upload_data_files(...)` call site that still bypasses the shared constants.
+- Add `policyengine_uk_data.utils.calibrate.load_weights`, a defensive loader that normalises calibration weights to 2D `(n_areas, n_records)` and validates expected shapes so consumers can't silently read the wrong axis layout across the L2 and L0 calibrators.
+
+### Fixed
+
+- Fix `calibrate_local_areas` non-verbose branch silently failing to save weights because the `if epoch % 10 == 0` save block was indented outside the training loop.
+- Replace bare `* 52` weekly-to-annual conversion in LCFS imputation with the shared `WEEKS_IN_YEAR = 365.25 / 7` constant used by `datasets/frs.py`, and replace two `np.random.seed(42)` calls with local `np.random.default_rng(42)` so consumption imputation stops mutating the process-wide RNG state.
+- Add OBR calibration targets for NIC Classes 2, 3 and 4 (self-employed flat-rate, voluntary and profit-based) alongside the existing Class 1 employee/employer rows, and accept common label-wording variants in OBR EFO Table 3.4.
+- Fix `datasets/spi.py` `__main__` crash (two-arg call to three-arg `create_spi`), parameterise the hardcoded £1,250 marriage allowance from policyengine-uk parameters, seed the age imputation RNG, and surface unknown GORCODE regions as `UNKNOWN` instead of silently mapping them to `SOUTH_EAST`.
+- Raise `UpratingYearOutOfRangeError` with a clear message when `uprate_values` or `uprate_dataset` is called with a year outside the `[START_YEAR, END_YEAR]` range of the uprating factor table, instead of surfacing a pandas `KeyError` or silently returning wrong values.
+- Parameterise the VAT standard rate and reduced-rate share in ETB-based VAT imputation by reading from `policyengine_uk.parameters.gov.hmrc.vat` keyed on the training year, with a `VAT_RATE_BY_YEAR` fallback for offline use. Promote the `etb.year == 2020` filter to a `year` argument with a `DEFAULT_ETB_YEAR` default.
+
+
+## [1.50.6] - 2026-04-17
+
+### Fixed
+
+- Include `gift_aid` (SPI `GIFTAID`) and `charitable_investment_gifts` (SPI `GIFTINV`) in the SPI income imputation model so synthetic high-earner rows carry plausible charitable giving drawn jointly with income, instead of a flat zero. Previously the 6-variable QRF ran over only the core income components; both charitable relief columns were in `SPI_RENAMES` but never reached the predicted output, so the SPI-donor half of the enhanced FRS carried its FRS donor's (always-zero) charitable giving. Adds both columns to the model's output list, renames the cache file to force retraining, and initialises the FRS-side columns to zero to keep the stacked dataset valid.
+
+
 ## [1.50.5] - 2026-04-17
 
 No significant changes.
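The 1.51.0 entry above swaps a bare `* 52` for `WEEKS_IN_YEAR = 365.25 / 7`. The size of that correction can be checked with a few lines of arithmetic. This is only an illustrative sketch: `WEEKS_IN_YEAR` is the constant named in the changelog, while `weekly_spend` and the other names are made up for the example.

```python
# WEEKS_IN_YEAR matches the constant named in the changelog; the rest is illustrative.
WEEKS_IN_YEAR = 365.25 / 7

# Relative shortfall from annualising with a flat 52 weeks instead.
shortfall = 1 - 52 / WEEKS_IN_YEAR

# Hypothetical weekly spend of 100 (any currency unit), annualised both ways.
weekly_spend = 100.0
annual_exact = weekly_spend * WEEKS_IN_YEAR
annual_flat = weekly_spend * 52

print(f"weeks per year: {WEEKS_IN_YEAR:.4f}")
print(f"shortfall with * 52: {shortfall:.2%}")
print(f"difference on 100/week: {annual_exact - annual_flat:.2f}")
```

The shortfall is small per record, but it applies uniformly to every annualised LCFS variable, which is why sharing one constant across loaders matters.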

changelog.d/368.md

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+- Set Marriage Allowance take-up rate to 0.5 (HMRC outturn ~2.1m claimants of ~4.2m eligible couples) instead of the placeholder 1.0, so microsimulation no longer overstates Marriage Allowance cost by ~£500m/year.
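The 0.5 rate follows directly from the two approximate counts quoted in the fragment. A short sketch of that arithmetic, using illustrative variable names that do not come from the codebase:

```python
# Approximate figures quoted in the changelog fragment.
claimants_m = 2.1  # couples claiming, millions
eligible_m = 4.2   # eligible couples, millions

take_up_rate = claimants_m / eligible_m

# With the old placeholder of 1.0, every eligible couple is modelled as claiming,
# so modelled claimants exceed the quoted outturn by this factor.
overstatement_factor = 1.0 / take_up_rate

print(f"take-up rate: {take_up_rate:.2f}")
print(f"claimant overstatement at 1.0: {overstatement_factor:.1f}x")
```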

policyengine_uk_data/datasets/childcare/takeup_rate.py

Lines changed: 14 additions & 4 deletions
@@ -3,23 +3,33 @@
 from policyengine_uk import Microsimulation
 
 # 🎯 Calibration targets
+#
+# TFC targets refreshed from HMRC "Tax-Free Childcare statistics: June 2025"
+# (published 27 Aug 2025, covering 2024-25 outturn):
+# - spending: £632.2 m (Table 1, annual government top-up)
+# - caseload: 985 thousand children received TFC in 2024-25 (annual unique)
+# The prior 0.6 / 660 targets were calibrated against the Sep 2024 release
+# (2023-24 outturn) and have since been overtaken by the TFC account
+# expansion and the Sep 2025 "30 free hours for under-5s" boost in uptake.
+#
+# Other programme targets kept at their prior DfE values.
 targets = {
     "spending": {
-        "tfc": 0.6,
+        "tfc": 0.63,
         "extended": 2.5,
         "targeted": 0.6,
         "universal": 1.7,
     },
     "caseload": {
-        "tfc": 660,
+        "tfc": 985,
         "extended": 740,
         "targeted": 130,
         "universal": 490,
     },
 }
 
-# Here is the link to the UK government’s aggregate data for Tax-Free Childcare:
-# https://www.gov.uk/government/statistics/tax-free-childcare-statistics-september-2024
+# UK government aggregate Tax-Free Childcare statistics:
+# https://www.gov.uk/government/statistics/tax-free-childcare-statistics-june-2025
 
 # This is the Department for Education (DfE) data for the other childcare programmes:
 # https://skillsfunding.service.gov.uk/view-latest-funding/national-funding-allocations/DSG/2024-to-2025
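To see how far each TFC target moved, the old and new values from this diff can be compared directly. The sketch below assumes spending is in £bn and caseload in thousands, as the comment block suggests; the dictionary and variable names are illustrative.

```python
# Values taken from the diff above (assumed units: spending in £bn, caseload in thousands).
old = {"spending": 0.6, "caseload": 660}
new = {"spending": 0.63, "caseload": 985}

# HMRC outturn quoted in the comment block, converted from £m to £bn.
outturn_spending_bn = 632.2 / 1_000

# Growth from the old target to the new one.
changes = {key: new[key] / old[key] - 1 for key in old}
# How far the old target sat below the new one.
shortfalls = {key: 1 - old[key] / new[key] for key in old}

for key in old:
    print(f"{key}: {old[key]} -> {new[key]} "
          f"({changes[key]:+.1%}; old target {shortfalls[key]:.1%} below new)")

print(f"rounded spending target: {round(outturn_spending_bn, 2)}")
```

On these figures the gap of roughly one third sits in the caseload target, while the spending target moves by about 5%.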

policyengine_uk_data/datasets/frs.py

Lines changed: 36 additions & 8 deletions
@@ -28,6 +28,13 @@
 from policyengine_uk_data.parameters import load_take_up_rate, load_parameter
 
 
+# Canonical weeks-per-year conversion factor for annualising weekly survey
+# variables. 365.25 / 7 ≈ 52.1786 accounts for leap years; using the rounded
+# integer 52 would under-count by ~0.34%. Exposed at module level so sibling
+# loaders (e.g. LCFS/ETB in `datasets/imputations/consumption.py`) can import
+# the same value rather than re-defining `* 52` locally and drifting.
+WEEKS_IN_YEAR = 365.25 / 7
+
 LEGACY_JOBSEEKER_MIN_AGE = 18
 HOURS_WORKED_WEEKS_PER_YEAR = 52
 ESA_MIN_AGE = 16
@@ -1210,24 +1217,45 @@ def determine_education_level(fted_val, typeed2_val, age_val):
     scp_under_6_rate = load_take_up_rate("scp_under_6", year)
     scp_6_plus_rate = load_take_up_rate("scp_6_plus", year)
 
-    # Generate take-up decisions by comparing random draws to take-up rates
+    # Generate take-up decisions by comparing random draws to take-up rates,
+    # anchored to reported receipts where the FRS captures them. Respondents
+    # who report positive receipt of a benefit are assigned takeup=True with
+    # certainty; the remaining non-reporters are filled probabilistically to
+    # hit the aggregate target rate. See policyengine_uk_data/utils/takeup.py.
+    from policyengine_uk_data.utils.takeup import (
+        assign_takeup_with_reported_anchors,
+    )
+
+    def _reported_benunit_mask(person_column: str) -> np.ndarray:
+        reporter_benunits = set(
+            pe_person.loc[pe_person[person_column] > 0, "person_benunit_id"].values
+        )
+        return pe_benunit["benunit_id"].isin(reporter_benunits).values
+
     # Person-level
     pe_person["would_claim_marriage_allowance"] = (
         generator.random(len(pe_person)) < marriage_allowance_rate
     )
 
-    # Benefit unit-level
-    pe_benunit["would_claim_child_benefit"] = (
-        generator.random(len(pe_benunit)) < child_benefit_rate
+    # Benefit unit-level — anchor on any adult in the benefit unit having
+    # reported positive receipt in the FRS benefits table.
+    pe_benunit["would_claim_child_benefit"] = assign_takeup_with_reported_anchors(
+        generator.random(len(pe_benunit)),
+        child_benefit_rate,
+        reported_mask=_reported_benunit_mask("child_benefit_reported"),
     )
     pe_benunit["child_benefit_opts_out"] = (
         generator.random(len(pe_benunit)) < child_benefit_opts_out_rate
     )
-    pe_benunit["would_claim_pc"] = (
-        generator.random(len(pe_benunit)) < pension_credit_rate
+    pe_benunit["would_claim_pc"] = assign_takeup_with_reported_anchors(
+        generator.random(len(pe_benunit)),
+        pension_credit_rate,
+        reported_mask=_reported_benunit_mask("pension_credit_reported"),
     )
-    pe_benunit["would_claim_uc"] = (
-        generator.random(len(pe_benunit)) < universal_credit_rate
+    pe_benunit["would_claim_uc"] = assign_takeup_with_reported_anchors(
+        generator.random(len(pe_benunit)),
+        universal_credit_rate,
+        reported_mask=_reported_benunit_mask("universal_credit_reported"),
     )
     pe_benunit["would_claim_tfc"] = generator.random(len(pe_benunit)) < tfc_rate
     pe_benunit["would_claim_extended_childcare"] = (
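The body of `assign_takeup_with_reported_anchors` lives in `utils/takeup.py` and is not part of this excerpt. The sketch below is one plausible reading of the behaviour described in the comments: reporters always take up, and non-reporters are filled to reach the target rate. The function name `anchored_takeup_sketch`, the unweighted fill-rate formula, and the clipping are all assumptions, not the repository's implementation.

```python
import numpy as np


def anchored_takeup_sketch(draws, target_rate, reported_mask):
    """Hypothetical stand-in for assign_takeup_with_reported_anchors.

    Reporters always take up; non-reporters take up with whatever
    probability is needed for the overall (unweighted) share to
    approach target_rate.
    """
    draws = np.asarray(draws, dtype=float)
    reported_mask = np.asarray(reported_mask, dtype=bool)

    n_total = draws.size
    n_reported = int(reported_mask.sum())
    n_other = n_total - n_reported

    # Take-up still needed among non-reporters to reach the target overall.
    # Assumption: clip to [0, 1] when reporters alone already exceed the target.
    if n_other > 0:
        fill_rate = (target_rate * n_total - n_reported) / n_other
        fill_rate = float(np.clip(fill_rate, 0.0, 1.0))
    else:
        fill_rate = 0.0

    return reported_mask | (draws < fill_rate)


rng = np.random.default_rng(0)
n = 10_000
reported = rng.random(n) < 0.30  # synthetic: about 30% of units report receipt
takeup = anchored_takeup_sketch(rng.random(n), 0.55, reported)

print(f"all reporters claim: {takeup[reported].all()}")
print(f"overall take-up:     {takeup.mean():.3f}")
```

The sketch keeps the argument order used at the call sites in the diff (draws, rate, `reported_mask`), so it can be read side by side with them; whether the real helper weights units or handles over-reporting differently is not visible here.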

policyengine_uk_data/datasets/imputations/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -2,6 +2,7 @@
 from .vat import *
 from .wealth import *
 from .income import *
+from .frs_only import impute_frs_only_variables
 from .capital_gains import *
 from .services import impute_services
 from .salary_sacrifice import impute_salary_sacrifice

policyengine_uk_data/datasets/imputations/consumption.py

Lines changed: 28 additions & 7 deletions
@@ -25,9 +25,16 @@
 from policyengine_uk_data.storage import STORAGE_FOLDER
 from policyengine_uk.data import UKSingleYearDataset
 from policyengine_uk import Microsimulation
+from policyengine_uk_data.datasets.frs import WEEKS_IN_YEAR
 
 LCFS_TAB_FOLDER = STORAGE_FOLDER / "lcfs_2021_22"
 
+# Default seed for the stochastic ICE-vehicle flag drawn from
+# `NTS_2024_ICE_VEHICLE_SHARE`. Kept at 42 for backward compatibility with
+# existing artefact fingerprints; callers can override via the fixture's
+# local RNG rather than the process-wide np.random global.
+_HAS_FUEL_SEED = 42
+
 # EV/ICE vehicle mix from NTS 2024
 NTS_2024_ICE_VEHICLE_SHARE = 0.90
 
@@ -338,6 +345,13 @@ def _derive_energy_from_lcfs(household: pd.DataFrame) -> pd.DataFrame:
     electricity[mask4] = p537[mask4] * mean_elec_share
     gas[mask4] = p537[mask4] * (1 - mean_elec_share)
 
+    # Clamp to non-negative; raw LCFS bill variables occasionally produce
+    # small negatives (e.g. B490 > B489 inconsistency, or implausible
+    # negative P537 entries). Consumption totals can't be negative by
+    # definition and downstream NEED calibration preserves zero.
+    electricity = np.maximum(electricity, 0.0)
+    gas = np.maximum(gas, 0.0)
+
     household = household.copy()
     household["electricity_consumption"] = electricity
     household["gas_consumption"] = gas
@@ -406,9 +420,12 @@ def create_has_fuel_model():
 
     num_vehicles = was["vcarnr7"].fillna(0).clip(lower=0)
     has_vehicle = num_vehicles > 0
-    np.random.seed(42)
+    # Use a local RNG so we don't mutate the global np.random state (which
+    # would silently change any unrelated consumer of np.random that runs
+    # after this function).
+    rng = np.random.default_rng(_HAS_FUEL_SEED)
     has_fuel = (
-        has_vehicle & (np.random.random(len(was)) < NTS_2024_ICE_VEHICLE_SHARE)
+        has_vehicle & (rng.random(len(was)) < NTS_2024_ICE_VEHICLE_SHARE)
     ).astype(float)
 
     was_df = pd.DataFrame(
@@ -481,18 +498,21 @@ def generate_lcfs_table(lcfs_person: pd.DataFrame, lcfs_household: pd.DataFrame)
 
     household = household.rename(columns=CONSUMPTION_VARIABLE_RENAMES)
 
-    # Annualise weekly LCFS values (× 52)
+    # Annualise weekly LCFS values. Use the same WEEKS_IN_YEAR constant
+    # (365.25 / 7 ≈ 52.1786) as `datasets/frs.py` rather than a bare `* 52`,
+    # which underestimates annual totals by ~0.34% and skews VAT / energy
+    # imputation targets against FRS income.
     annualise = list(CONSUMPTION_VARIABLE_RENAMES.values()) + [
         "hbai_household_net_income",
         "household_gross_income",
         "electricity_consumption",
         "gas_consumption",
     ]
     for variable in annualise:
-        household[variable] = household[variable] * 52
+        household[variable] = household[variable] * WEEKS_IN_YEAR
     for variable in PERSON_LCF_RENAMES.values():
         household[variable] = (
-            person[variable].groupby(person.case).sum()[household.case] * 52
+            person[variable].groupby(person.case).sum()[household.case] * WEEKS_IN_YEAR
         )
     household.household_weight *= 1_000
 
@@ -577,9 +597,10 @@ def impute_consumption(dataset: UKSingleYearDataset) -> UKSingleYearDataset:
     sim = Microsimulation(dataset=dataset)
     num_vehicles = sim.calculate("num_vehicles", map_to="household").values
 
-    np.random.seed(42)
+    # Local RNG — see note at module level (_HAS_FUEL_SEED).
+    rng = np.random.default_rng(_HAS_FUEL_SEED)
     has_vehicle = num_vehicles > 0
-    is_ice = np.random.random(len(num_vehicles)) < NTS_2024_ICE_VEHICLE_SHARE
+    is_ice = rng.random(len(num_vehicles)) < NTS_2024_ICE_VEHICLE_SHARE
     has_fuel_consumption = (has_vehicle & is_ice).astype(float)
     dataset.household["has_fuel_consumption"] = has_fuel_consumption
 
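The switch from `np.random.seed(42)` to `np.random.default_rng(...)` in this file is about side effects rather than reproducibility. A small sketch of the difference follows; the helper names (`draw_with_global_seed`, `draw_with_local_rng`, `global_state_changed`) are illustrative and not from the repository.

```python
import numpy as np

SEED = 42  # same value as _HAS_FUEL_SEED in the diff


def draw_with_global_seed(n):
    # Old pattern: reseeds the process-wide legacy RNG as a side effect.
    np.random.seed(SEED)
    return np.random.random(n)


def draw_with_local_rng(n):
    # New pattern: a private Generator; the global state is untouched.
    rng = np.random.default_rng(SEED)
    return rng.random(n)


def global_state_changed(func):
    """Return True if calling func alters the legacy global RNG state."""
    np.random.seed(123)  # put the global RNG in a known state
    before = np.random.get_state()
    before_key, before_pos = before[1].copy(), before[2]
    func(5)
    after = np.random.get_state()
    return not (np.array_equal(before_key, after[1]) and before_pos == after[2])


print("global seed mutates state:", global_state_changed(draw_with_global_seed))
print("local rng mutates state:  ", global_state_changed(draw_with_local_rng))
print("local rng reproducible:   ",
      np.array_equal(draw_with_local_rng(3), draw_with_local_rng(3)))
```

Both patterns give repeatable draws, but only the local generator leaves unrelated code that relies on `np.random` unaffected. Note that the two patterns use different bit generators, so the draws themselves differ between the old and new code.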