PolicyEngine
diff --git a/.github/CONTRIBUTING.md b/.github/CONTRIBUTING.md
Lines changed: 43 additions & 4 deletions
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 73 additions & 0 deletions
diff --git a/changelog.d/368.md b/changelog.d/368.md
Lines changed: 1 addition & 0 deletions
diff --git a/policyengine_uk_data/datasets/childcare/takeup_rate.py b/policyengine_uk_data/datasets/childcare/takeup_rate.py
Lines changed: 14 additions & 4 deletions
diff --git a/policyengine_uk_data/datasets/frs.py b/policyengine_uk_data/datasets/frs.py
Lines changed: 36 additions & 8 deletions
diff --git a/policyengine_uk_data/datasets/imputations/__init__.py b/policyengine_uk_data/datasets/imputations/__init__.py
Lines changed: 1 addition & 0 deletions
diff --git a/policyengine_uk_data/datasets/imputations/consumption.py b/policyengine_uk_data/datasets/imputations/consumption.py
Lines changed: 28 additions & 7 deletions
@@ -1,7 +1,46 @@
-## Updating data
+# Contributing to policyengine-uk-data
 
-If your changes present a non-bugfix change to one or more datasets which are cloud-hosted (FRS and EFRS), then please change both the filename and URL (in both the class definition file and in `storage/upload_completed_datasets.py`). This enables us to store historical versions of datasets separately and reproducibly.
+See the [shared PolicyEngine contribution guide](https://github.com/PolicyEngine/.github/blob/main/CONTRIBUTING.md) for cross-repo conventions (towncrier changelog fragments, `uv run`, PR description format, anti-patterns). This file covers policyengine-uk-data specifics.
 
-## Updating the versioning
+## Commands
 
-Please add to `changelog.yaml` and then run `make changelog` before committing the results ONCE in this PR.
+```bash
+make install            # install deps (uv)
+make format             # format (required)
+make download           # download raw FRS + SPI inputs from HF (needs HUGGING_FACE_TOKEN)
+make data               # full dataset build (impute, calibrate, upload)
+make test               # test suite
+uv run pytest policyengine_uk_data/tests/path/to/test.py -v
+```
+
+Python 3.13+. Default branch: `main`. Raw FRS / SPI microdata live on HuggingFace; set `HUGGING_FACE_TOKEN` before running anything that touches the dataset build.
+
+## What lives here
+
+This repo builds the `.h5` files that feed `policyengine-uk`:
+
+- `datasets/frs.py` — raw FRS → PolicyEngine variable mapping
+- `datasets/imputations/` — QRF / other imputations layered on top (income, wealth, consumption, etc.)
+- `datasets/local_areas/` — constituency and local-authority calibration
+- `targets/` — calibration target sources (OBR, DWP, HMRC, ONS, SLC, etc.)
+- `utils/calibrate.py` — the reweighting optimiser
+- `storage/` — raw inputs, intermediate artefacts, published outputs
+
+## Data-protection rules — no exceptions
+
+The enhanced FRS dataset is licensed under strict UK Data Service terms. Violating them risks losing access, which would end PolicyEngine UK.
+
+- **Never upload data to any public location.** The HuggingFace repo `policyengine/policyengine-uk-data-private` is private and authenticated.
+- **Never modify `upload_completed_datasets.py` or `utils/data_upload.py`** to change upload destinations without explicit confirmation from the data controller (currently Nikhil Woodruff).
+- **Never print, log, or output individual-level records.** Aggregates (sums, means, counts, weighted totals) are fine; individual rows are not.
+- **If you see a private/public repo split, assume it is intentional** — ask why before changing it.
+
+## Updating datasets
+
+If your change is a non-bugfix update to a cloud-hosted dataset (FRS, enhanced FRS), bump both the filename and URL in the class definition and in `storage/upload_completed_datasets.py`. That lets us store historical dataset versions separately and reproducibly.
+
+## Repo-specific anti-patterns
+
+- **Don't** hardcode dataset years in variable transforms; use `dataset.time_period` and the uprating pipeline.
+- **Don't** commit large binary artefacts — use HuggingFace storage.
+- **Don't** skip `make test` when touching the imputation or calibration pipeline; full CI rebuilds the dataset and takes ~25 minutes.
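
Review note on the data-protection rules in the hunk above: the line between "aggregates are fine" and "individual rows are not" can be illustrated with a short sketch. It runs on a synthetic frame only; the column names `household_weight` and `net_income` are illustrative assumptions, not necessarily the columns in the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a household table (no real survey data involved).
rng = np.random.default_rng(0)
households = pd.DataFrame(
    {
        "household_weight": rng.uniform(500, 1500, size=1_000),
        "net_income": rng.normal(30_000, 8_000, size=1_000),
    }
)

# Fine: weighted counts, means and totals.
weighted_count = households.household_weight.sum()
weighted_mean = np.average(
    households.net_income, weights=households.household_weight
)
weighted_total = (households.net_income * households.household_weight).sum()

print(f"weighted households: {weighted_count:,.0f}")
print(f"weighted mean income: £{weighted_mean:,.0f}")
print(f"weighted total income: £{weighted_total / 1e9:.2f}bn")

# Not fine on real data: anything that emits rows, for example
# print(households.head()) or households.to_csv("debug.csv").
```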
@@ -1,3 +1,76 @@
+## [1.53.1] - 2026-04-20
+
+No significant changes.
+
+
+## [1.53.0] - 2026-04-19
+
+### Added
+
+- Tightened `test_population` tolerance from 7% to 3% now that the stage-2 QRF (#362), TFC target refresh (#363), and reported-anchor takeup (#359) pulled the weighted UK population overshoot from ~6.5% down to ~1.6%. Added four regression tests in `test_population_fidelity.py` (weighted-total match, household-count range, non-inflation guard, country-sum consistency) extracted from the earlier #310 draft so any future calibration drift back toward the pre-April-2026 overshoot trips CI.
+
+
+## [1.52.2] - 2026-04-18
+
+### Changed
+
+- Add second-stage QRF imputation of FRS-only variables on SPI-donor rows. After the first-stage SPI-trained QRF overwrites income components on the zero-weight subsample, a new second-stage QRF trained on the full FRS rewrites benefit `_reported` columns, pension contributions, and savings-income so they correlate with the freshly-imputed incomes instead of staying as whatever middle-income FRS donor was sampled. Mirrors the `policyengine-us-data#589` pattern. Prevents synthetic £2 M earners from carrying a middle-income donor's UC / housing-benefit receipt into calibration, which was blowing up benefit aggregates under upweight.
+- Anchor stochastic takeup assignment for Universal Credit, Pension Credit, and Child Benefit to the FRS-reported receipt columns, matching the `policyengine-us-data` pattern. Respondents who report positive receipt in the FRS benefits table now receive `would_claim_* = True` with certainty, and non-reporters are filled probabilistically to hit the aggregate target rate. Removes a source of calibration noise where respondents who clearly took up a benefit could be randomly assigned `would_claim = False`.
+
+### Fixed
+
+- Refresh Tax-Free Childcare calibration targets and take-up rate using HMRC's June 2025 release (covering 2024-25 outturn: £632 m spending, 985 k children reached). The prior target set was calibrated against the September 2024 release and undershot current TFC spending by roughly a third. Bumps the default TFC take-up rate from 0.586 to 0.88 on 2024-04-06 to close most of the gap pending a full recalibration run.
+
+
+## [1.52.1] - 2026-04-18
+
+### Fixed
+
+- Update the `Raise VAT standard rate by 2pp` reform-impact test expectation from 25.0 bn to 43.0 bn — the enhanced FRS's total consumption aggregate has grown to a UK-realistic ~£1.6 T (matching ONS 2025 total consumer expenditure), so a 2pp rise on the current `microdata_vat_coverage = 0.38`-scaled base produces ~£43 bn, not the £25 bn calibrated against an older smaller dataset. Also clamps raw electricity/gas consumption in `impute_energy_splits` to be non-negative (a handful of LCFS bill-variable inconsistencies produced small negatives), fixing `test_non_negative_energy`. Follow-up: revisit `microdata_vat_coverage` itself now that the underlying base is fuller (#364).
+
+
+## [1.52.0] - 2026-04-17
+
+### Changed
+
+- Point CONTRIBUTING.md at the shared PolicyEngine contribution guide (https://github.com/PolicyEngine/.github) and trim the per-repo file to commands, repo-specific conventions, and anti-patterns. Removes the stale `changelog_entry.yaml` / `make changelog` instructions.
+
+### Removed
+
+- Remove `policyengine_uk_data/tests/test_changelog_encoding.py`. It validated UTF-8 and YAML structure of the deprecated `changelog_entry.yaml`, which was retired when the repo migrated to towncrier `changelog.d/` fragments. All three tests now unconditionally `pytest.skip` because the file no longer exists, and any fragment-format validation is already handled by the `Check changelog fragment` CI step.
+
+
+## [1.51.1] - 2026-04-17
+
+### Fixed
+
+- Guard the rent/mortgage rescaling in `impute_over_incomes` against `ZeroDivisionError` when the seed dataset's imputation columns sum to zero (e.g. the zero-weight synthetic copy in `impute_income`).
+
+
+## [1.51.0] - 2026-04-17
+
+### Added
+
+- Add `policyengine_uk_data.utils.hf_destinations` with `PRIVATE_REPO` / `PUBLIC_REPO` constants and an AST-based xfail test (`tests/test_hf_destinations.py`) that flags every `upload(...)`, `upload_file(...)`, `upload_files_to_hf(...)`, and `upload_data_files(...)` call site that still bypasses the shared constants.
+- Add `policyengine_uk_data.utils.calibrate.load_weights`, a defensive loader that normalises calibration weights to 2D `(n_areas, n_records)` and validates expected shapes so consumers can't silently read the wrong axis layout across the L2 and L0 calibrators.
+
+### Fixed
+
+- Fix `calibrate_local_areas` non-verbose branch silently failing to save weights because the `if epoch % 10 == 0` save block was indented outside the training loop.
+- Replace bare `* 52` weekly-to-annual conversion in LCFS imputation with the shared `WEEKS_IN_YEAR = 365.25 / 7` constant used by `datasets/frs.py`, and replace two `np.random.seed(42)` calls with local `np.random.default_rng(42)` so consumption imputation stops mutating the process-wide RNG state.
+- Add OBR calibration targets for NIC Classes 2, 3 and 4 (self-employed flat-rate, voluntary and profit-based) alongside the existing Class 1 employee/employer rows, and accept common label-wording variants in OBR EFO Table 3.4.
+- Fix `datasets/spi.py` `__main__` crash (two-arg call to three-arg `create_spi`), parameterise the hardcoded £1,250 marriage allowance from policyengine-uk parameters, seed the age imputation RNG, and surface unknown GORCODE regions as `UNKNOWN` instead of silently mapping them to `SOUTH_EAST`.
+- Raise `UpratingYearOutOfRangeError` with a clear message when `uprate_values` or `uprate_dataset` is called with a year outside the `[START_YEAR, END_YEAR]` range of the uprating factor table, instead of surfacing a pandas `KeyError` or silently returning wrong values.
+- Parameterise the VAT standard rate and reduced-rate share in ETB-based VAT imputation by reading from `policyengine_uk.parameters.gov.hmrc.vat` keyed on the training year, with a `VAT_RATE_BY_YEAR` fallback for offline use. Promote the `etb.year == 2020` filter to a `year` argument with a `DEFAULT_ETB_YEAR` default.
+
+
+## [1.50.6] - 2026-04-17
+
+### Fixed
+
+- Include `gift_aid` (SPI `GIFTAID`) and `charitable_investment_gifts` (SPI `GIFTINV`) in the SPI income imputation model so synthetic high-earner rows carry plausible charitable giving drawn jointly with income, instead of a flat zero. Previously the 6-variable QRF ran over only the core income components; both charitable relief columns were in `SPI_RENAMES` but never reached the predicted output, so the SPI-donor half of the enhanced FRS carried its FRS donor's (always-zero) charitable giving. Adds both columns to the model's output list, renames the cache file to force retraining, and initialises the FRS-side columns to zero to keep the stacked dataset valid.
+
+
 ## [1.50.5] - 2026-04-17
 
 No significant changes.
 
@@ -0,0 +1 @@
+- Set Marriage Allowance take-up rate to 0.5 (HMRC outturn ~2.1m claimants of ~4.2m eligible couples) instead of the placeholder 1.0, so microsimulation no longer overstates Marriage Allowance cost by ~£500m/year.
@@ -3,23 +3,33 @@
 from policyengine_uk import Microsimulation
 
 # 🎯 Calibration targets
+#
+# TFC targets refreshed from HMRC "Tax-Free Childcare statistics: June 2025"
+# (published 27 Aug 2025, covering 2024-25 outturn):
+#   - spending: £632.2 m (Table 1, annual government top-up)
+#   - caseload: 985 thousand children received TFC in 2024-25 (annual unique)
+# The prior 0.6 / 660 targets were calibrated against the Sep 2024 release
+# (2023-24 outturn) and have since been overtaken by the TFC account
+# expansion and the Sep 2025 "30 free hours for under-5s" boost in uptake.
+#
+# Other programme targets kept at their prior DfE values.
 targets = {
     "spending": {
-        "tfc": 0.6,
+        "tfc": 0.63,
         "extended": 2.5,
         "targeted": 0.6,
         "universal": 1.7,
     },
     "caseload": {
-        "tfc": 660,
+        "tfc": 985,
         "extended": 740,
         "targeted": 130,
         "universal": 490,
     },
 }
 
-# Here is the link to the UK government’s aggregate data for Tax-Free Childcare:
-# https://www.gov.uk/government/statistics/tax-free-childcare-statistics-september-2024
+# UK government aggregate Tax-Free Childcare statistics:
+# https://www.gov.uk/government/statistics/tax-free-childcare-statistics-june-2025
 
 # This is the Department for Education (DfE) data for the other childcare programmes:
 # https://skillsfunding.service.gov.uk/view-latest-funding/national-funding-allocations/DSG/2024-to-2025
 
@@ -28,6 +28,13 @@
 from policyengine_uk_data.parameters import load_take_up_rate, load_parameter
 
 
+# Canonical weeks-per-year conversion factor for annualising weekly survey
+# variables. 365.25 / 7 ≈ 52.1786 accounts for leap years; using the rounded
+# integer 52 would under-count by ~0.34%. Exposed at module level so sibling
+# loaders (e.g. LCFS/ETB in `datasets/imputations/consumption.py`) can import
+# the same value rather than re-defining `* 52` locally and drifting.
+WEEKS_IN_YEAR = 365.25 / 7
+
 LEGACY_JOBSEEKER_MIN_AGE = 18
 HOURS_WORKED_WEEKS_PER_YEAR = 52
 ESA_MIN_AGE = 16
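
Review note: the ~0.34% figure quoted in the comment above can be checked directly. A small standalone sketch; the £100 weekly amount is made up for illustration.

```python
# Compare a bare `* 52` with the 365.25 / 7 definition used in the hunk above.
WEEKS_IN_YEAR = 365.25 / 7

# Relative shortfall from annualising with the integer 52.
undercount = 1 - 52 / WEEKS_IN_YEAR

# Illustrative weekly amount (hypothetical, not survey data).
weekly_spend = 100.0
annual_52 = weekly_spend * 52
annual_exact = weekly_spend * WEEKS_IN_YEAR

print(f"weeks per year: {WEEKS_IN_YEAR:.4f}")                      # 52.1786
print(f"under-count from * 52: {undercount:.2%}")                  # 0.34%
print(f"annual gap on £100/week: £{annual_exact - annual_52:.2f}")  # £17.86
```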
@@ -1210,24 +1217,45 @@ def determine_education_level(fted_val, typeed2_val, age_val):
     scp_under_6_rate = load_take_up_rate("scp_under_6", year)
     scp_6_plus_rate = load_take_up_rate("scp_6_plus", year)
 
-    # Generate take-up decisions by comparing random draws to take-up rates
+    # Generate take-up decisions by comparing random draws to take-up rates,
+    # anchored to reported receipts where the FRS captures them. Respondents
+    # who report positive receipt of a benefit are assigned takeup=True with
+    # certainty; the remaining non-reporters are filled probabilistically to
+    # hit the aggregate target rate. See policyengine_uk_data/utils/takeup.py.
+    from policyengine_uk_data.utils.takeup import (
+        assign_takeup_with_reported_anchors,
+    )
+
+    def _reported_benunit_mask(person_column: str) -> np.ndarray:
+        reporter_benunits = set(
+            pe_person.loc[pe_person[person_column] > 0, "person_benunit_id"].values
+        )
+        return pe_benunit["benunit_id"].isin(reporter_benunits).values
+
     # Person-level
     pe_person["would_claim_marriage_allowance"] = (
         generator.random(len(pe_person)) < marriage_allowance_rate
     )
 
-    # Benefit unit-level
-    pe_benunit["would_claim_child_benefit"] = (
-        generator.random(len(pe_benunit)) < child_benefit_rate
+    # Benefit unit-level — anchor on any adult in the benefit unit having
+    # reported positive receipt in the FRS benefits table.
+    pe_benunit["would_claim_child_benefit"] = assign_takeup_with_reported_anchors(
+        generator.random(len(pe_benunit)),
+        child_benefit_rate,
+        reported_mask=_reported_benunit_mask("child_benefit_reported"),
     )
     pe_benunit["child_benefit_opts_out"] = (
         generator.random(len(pe_benunit)) < child_benefit_opts_out_rate
     )
-    pe_benunit["would_claim_pc"] = (
-        generator.random(len(pe_benunit)) < pension_credit_rate
+    pe_benunit["would_claim_pc"] = assign_takeup_with_reported_anchors(
+        generator.random(len(pe_benunit)),
+        pension_credit_rate,
+        reported_mask=_reported_benunit_mask("pension_credit_reported"),
     )
-    pe_benunit["would_claim_uc"] = (
-        generator.random(len(pe_benunit)) < universal_credit_rate
+    pe_benunit["would_claim_uc"] = assign_takeup_with_reported_anchors(
+        generator.random(len(pe_benunit)),
+        universal_credit_rate,
+        reported_mask=_reported_benunit_mask("universal_credit_reported"),
     )
     pe_benunit["would_claim_tfc"] = generator.random(len(pe_benunit)) < tfc_rate
     pe_benunit["would_claim_extended_childcare"] = (
 
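
Review note: the hunk above calls `assign_takeup_with_reported_anchors`, but the diff does not show its body. Below is a minimal sketch of the behaviour the comment describes, assuming the function receives uniform draws, a target rate and a boolean mask. The name `anchored_takeup_sketch`, the unweighted rate adjustment and the synthetic inputs are all assumptions for illustration; the real code in `utils/takeup.py` may differ (for example, it may use survey weights).

```python
import numpy as np


def anchored_takeup_sketch(
    draws: np.ndarray, rate: float, reported_mask: np.ndarray
) -> np.ndarray:
    """Hypothetical stand-in for assign_takeup_with_reported_anchors."""
    reported = np.asarray(reported_mask, dtype=bool)
    n_total = reported.size
    n_reported = int(reported.sum())
    n_unreported = n_total - n_reported

    # Share of non-reporters that must claim for the overall (unweighted)
    # take-up rate to land on `rate`, clipped to [0, 1].
    if n_unreported == 0:
        fill_rate = 0.0
    else:
        fill_rate = (rate * n_total - n_reported) / n_unreported
        fill_rate = min(max(fill_rate, 0.0), 1.0)

    # Reporters always claim; non-reporters claim if their draw is low enough.
    return reported | (np.asarray(draws) < fill_rate)


rng = np.random.default_rng(0)
n = 10_000
reported = rng.random(n) < 0.3  # synthetic: roughly 30% report receipt
takeup = anchored_takeup_sketch(rng.random(n), 0.8, reported)

print(takeup[reported].all())   # every reporter is assigned True
print(round(takeup.mean(), 2))  # close to the 0.8 target
```

Under this reading, the aggregate rate is preserved while no reporter can be randomly assigned `False`, which is the noise source the changelog entry describes.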
@@ -2,6 +2,7 @@
 from .vat import *
 from .wealth import *
 from .income import *
+from .frs_only import impute_frs_only_variables
 from .capital_gains import *
 from .services import impute_services
 from .salary_sacrifice import impute_salary_sacrifice
 
@@ -25,9 +25,16 @@
 from policyengine_uk_data.storage import STORAGE_FOLDER
 from policyengine_uk.data import UKSingleYearDataset
 from policyengine_uk import Microsimulation
+from policyengine_uk_data.datasets.frs import WEEKS_IN_YEAR
 
 LCFS_TAB_FOLDER = STORAGE_FOLDER / "lcfs_2021_22"
 
+# Default seed for the stochastic ICE-vehicle flag drawn from
+# `NTS_2024_ICE_VEHICLE_SHARE`. Kept at 42 for backward compatibility with
+# existing artefact fingerprints; callers can override via the fixture's
+# local RNG rather than the process-wide np.random global.
+_HAS_FUEL_SEED = 42
+
 # EV/ICE vehicle mix from NTS 2024
 NTS_2024_ICE_VEHICLE_SHARE = 0.90
 
@@ -338,6 +345,13 @@ def _derive_energy_from_lcfs(household: pd.DataFrame) -> pd.DataFrame:
     electricity[mask4] = p537[mask4] * mean_elec_share
     gas[mask4] = p537[mask4] * (1 - mean_elec_share)
 
+    # Clamp to non-negative; raw LCFS bill variables occasionally produce
+    # small negatives (e.g. B490 > B489 inconsistency, or implausible
+    # negative P537 entries). Consumption totals can't be negative by
+    # definition and downstream NEED calibration preserves zero.
+    electricity = np.maximum(electricity, 0.0)
+    gas = np.maximum(gas, 0.0)
+
     household = household.copy()
     household["electricity_consumption"] = electricity
     household["gas_consumption"] = gas
@@ -406,9 +420,12 @@ def create_has_fuel_model():
 
     num_vehicles = was["vcarnr7"].fillna(0).clip(lower=0)
     has_vehicle = num_vehicles > 0
-    np.random.seed(42)
+    # Use a local RNG so we don't mutate the global np.random state (which
+    # would silently change any unrelated consumer of np.random that runs
+    # after this function).
+    rng = np.random.default_rng(_HAS_FUEL_SEED)
     has_fuel = (
-        has_vehicle & (np.random.random(len(was)) < NTS_2024_ICE_VEHICLE_SHARE)
+        has_vehicle & (rng.random(len(was)) < NTS_2024_ICE_VEHICLE_SHARE)
     ).astype(float)
 
     was_df = pd.DataFrame(
@@ -481,18 +498,21 @@ def generate_lcfs_table(lcfs_person: pd.DataFrame, lcfs_household: pd.DataFrame)
 
     household = household.rename(columns=CONSUMPTION_VARIABLE_RENAMES)
 
-    # Annualise weekly LCFS values (× 52)
+    # Annualise weekly LCFS values. Use the same WEEKS_IN_YEAR constant
+    # (365.25 / 7 ≈ 52.1786) as `datasets/frs.py` rather than a bare `* 52`,
+    # which underestimates annual totals by ~0.34% and skews VAT / energy
+    # imputation targets against FRS income.
     annualise = list(CONSUMPTION_VARIABLE_RENAMES.values()) + [
         "hbai_household_net_income",
         "household_gross_income",
         "electricity_consumption",
         "gas_consumption",
     ]
     for variable in annualise:
-        household[variable] = household[variable] * 52
+        household[variable] = household[variable] * WEEKS_IN_YEAR
     for variable in PERSON_LCF_RENAMES.values():
         household[variable] = (
-            person[variable].groupby(person.case).sum()[household.case] * 52
+            person[variable].groupby(person.case).sum()[household.case] * WEEKS_IN_YEAR
         )
     household.household_weight *= 1_000
 
@@ -577,9 +597,10 @@ def impute_consumption(dataset: UKSingleYearDataset) -> UKSingleYearDataset:
     sim = Microsimulation(dataset=dataset)
     num_vehicles = sim.calculate("num_vehicles", map_to="household").values
 
-    np.random.seed(42)
+    # Local RNG — see note at module level (_HAS_FUEL_SEED).
+    rng = np.random.default_rng(_HAS_FUEL_SEED)
     has_vehicle = num_vehicles > 0
-    is_ice = np.random.random(len(num_vehicles)) < NTS_2024_ICE_VEHICLE_SHARE
+    is_ice = rng.random(len(num_vehicles)) < NTS_2024_ICE_VEHICLE_SHARE
     has_fuel_consumption = (has_vehicle & is_ice).astype(float)
     dataset.household["has_fuel_consumption"] = has_fuel_consumption
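
Review note: the effect of replacing `np.random.seed(42)` with a local generator can be demonstrated in isolation. The helper names below are hypothetical; only `np.random.seed`, `np.random.random` and `np.random.default_rng` come from NumPy.

```python
import numpy as np

SEED = 42  # mirrors _HAS_FUEL_SEED in the diff


def flag_with_global_seed(n: int, share: float) -> np.ndarray:
    """Old pattern: reseeds the process-wide legacy RNG as a side effect."""
    np.random.seed(SEED)
    return np.random.random(n) < share


def flag_with_local_rng(n: int, share: float) -> np.ndarray:
    """New pattern: draws from a private Generator, global state untouched."""
    rng = np.random.default_rng(SEED)
    return rng.random(n) < share


# An unrelated consumer of the global RNG.
np.random.seed(123)
expected = np.random.random(3)

np.random.seed(123)
flag_with_local_rng(5, 0.9)
after_local = np.random.random(3)

np.random.seed(123)
flag_with_global_seed(5, 0.9)
after_global = np.random.random(3)

print(np.array_equal(expected, after_local))   # True: global stream unaffected
print(np.array_equal(expected, after_global))  # False: global stream was reseeded
```

One point worth confirming before merge: the legacy `np.random.seed(42)` stream and `np.random.default_rng(42)` use different bit generators (MT19937 and PCG64), so the same seed does not reproduce the same draws. Flags generated after this change will therefore differ from earlier builds even though the seed stays at 42.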