This repository was archived by the owner on Jun 19, 2026. It is now read-only.

Commit 843293c

Make the FRS dataset build deterministic (seed all RNG draws) (#425)
* Make property_purchased assignment deterministic (seeded RNG)

  The data build set property_purchased via unseeded np.random.random(), so every build drew a different vector of purchasers. That made the dataset non-reproducible and intermittently spiked the first income decile's effective tax rate (the draw occasionally marked too many high-property, low-income households as purchasers), failing test_first_decile_tax_rate_reasonable and blocking releases.

  Draw from a seeded numpy Generator (default_rng(0)) instead of the global RNG, whose state depends on whatever ran earlier in the build. Same FRS input now always yields the same ~3.85% purchaser assignment.

  Pairs with the policyengine-uk fix flipping property_purchased's default to False, which fail-safes any household this build does not explicitly set.

* Fix review findings: seed capital gains and BRMA sampling too

  Independent review found the property_purchased seed was necessary but not sufficient for a reproducible build: two more assignments drew from the unseeded global numpy RNG.

  - imputations/capital_gains.py: quantile draws for the capital gains amount imputation now come from a seeded default_rng(0), so capital gains (and CGT revenue) are reproducible.
  - frs.py BRMA assignment: both pandas .sample() calls (region/category rent sampling and the household-level pick) now take a seeded random_state generator instead of the global RNG.

  The SPI synthetic sampling (income.py) was already seeded. The only remaining unseeded np.random is childcare/takeup_rate.py, which is not reached by the dataset build (test-only); left for separate cleanup.

  Broadened the changelog to reflect that the whole FRS build is now deterministic.
1 parent e5c7f84 commit 843293c
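The failure mode described in the commit message can be reproduced in isolation. The following is a minimal sketch, not the project's code: the helper names `assign_global` and `assign_seeded`, the array size, and the "unrelated earlier step" are hypothetical, while the 0.0385 rate and `default_rng(0)` are taken from the diff. It shows that a draw from the global RNG changes when anything earlier consumes random numbers, whereas a draw from a locally seeded Generator does not.

```python
import numpy as np

PURCHASE_RATE = 0.0385  # rate used in the diff


def assign_global(n):
    # Draws from the process-wide RNG, so the result depends on
    # every random draw made earlier in the process.
    return np.random.random(n) < PURCHASE_RATE


def assign_seeded(n, seed=0):
    # Draws from a fresh local Generator, so the result depends
    # only on n and seed.
    rng = np.random.default_rng(seed)
    return rng.random(n) < PURCHASE_RATE


# Global RNG: same starting seed, but an unrelated step consumes
# five draws before the second assignment.
np.random.seed(123)
global_first = assign_global(10_000)
np.random.seed(123)
np.random.random(5)  # hypothetical earlier build step
global_second = assign_global(10_000)

# Local Generator: the same unrelated step has no effect.
seeded_first = assign_seeded(10_000)
np.random.random(5)  # hypothetical earlier build step
seeded_second = assign_seeded(10_000)

print("global draws identical:", bool((global_first == global_second).all()))
print("seeded draws identical:", bool((seeded_first == seeded_second).all()))
```

In the sketch the global RNG is seeded only to make the demonstration repeatable; in the build described above it was not seeded at all, so the assignment also varied from run to run.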

3 files changed

Lines changed: 29 additions & 10 deletions

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+Make the FRS dataset build deterministic. Several assignments drew from the unseeded global numpy RNG, so otherwise-identical builds produced different datasets: property_purchased (which households are charged stamp duty), capital gains imputation quantiles (CGT revenue), and BRMA assignment (LHA/housing-benefit geography). Each now draws from a seeded generator, so the same inputs always produce the same dataset.
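One way to check the claim that the same inputs always produce the same dataset is to fingerprint two builds and compare them. The sketch below is only an illustration under stated assumptions: `toy_build` is a hypothetical stand-in for the real build (it is not part of policyengine_uk_data), the column names are borrowed from the changelog text, and `fingerprint` is a hypothetical helper that hashes the raw array bytes.

```python
import hashlib

import numpy as np


def toy_build(n_households=1000, seed=0):
    # Hypothetical stand-in for a dataset build: every random draw
    # comes from one locally seeded Generator.
    rng = np.random.default_rng(seed)
    return {
        "property_purchased": rng.random(n_households) < 0.0385,
        "capital_gains": rng.random(n_households) * 1000.0,
    }


def fingerprint(dataset):
    # Hash column names and raw values in a fixed (sorted) order so
    # that identical datasets always give identical digests.
    digest = hashlib.sha256()
    for name in sorted(dataset):
        digest.update(name.encode("utf-8"))
        digest.update(np.ascontiguousarray(dataset[name]).tobytes())
    return digest.hexdigest()


same_inputs_match = fingerprint(toy_build()) == fingerprint(toy_build())
different_seed_differs = fingerprint(toy_build(seed=0)) != fingerprint(toy_build(seed=1))
print(same_inputs_match, different_seed_differs)
```

A check of this shape fails as soon as any column is drawn from an unseeded source, which is the class of regression this change is meant to prevent.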

policyengine_uk_data/datasets/frs.py

Lines changed: 22 additions & 9 deletions
@@ -1251,9 +1251,12 @@ def determine_education_level(fted_val, typeed2_val, age_val):
 lha_category = sim.calculate("LHA_category", year)
 brma = np.empty(len(region), dtype=object)
 
-# Sample from a random BRMA in the region, weighted by the number of observations in each BRMA
+# Sample from a random BRMA in the region, weighted by the number of observations in each BRMA.
+# Use a seeded generator so the assignment is reproducible across builds;
+# pandas .sample() otherwise draws from the unseeded global numpy RNG.
 lha_list_of_rents = pd.read_csv(STORAGE_FOLDER / "lha_list_of_rents.csv.gz")
 lha_list_of_rents = lha_list_of_rents.copy()
+brma_rng = np.random.default_rng(0)
 
 for possible_region in lha_list_of_rents.region.unique():
     for possible_lha_category in lha_list_of_rents.lha_category.unique():
@@ -1262,7 +1265,7 @@ def determine_education_level(fted_val, typeed2_val, age_val):
         )
         mask = (region == possible_region) & (lha_category == possible_lha_category)
         brma[mask] = lha_list_of_rents[lor_mask].brma.sample(
-            n=len(region[mask]), replace=True
+            n=len(region[mask]), replace=True, random_state=brma_rng
         )
 
 # Convert benunit-level BRMAs to household-level BRMAs (pick a random one)
@@ -1276,7 +1279,9 @@ def determine_education_level(fted_val, typeed2_val, age_val):
     }
 )
 
-df = df.groupby("household_id").brma.aggregate(lambda x: x.sample(n=1).iloc[0])
+df = df.groupby("household_id").brma.aggregate(
+    lambda x: x.sample(n=1, random_state=brma_rng).iloc[0]
+)
 brmas = df[sim.calculate("household_id")].values
 
 pe_household["brma"] = brmas
@@ -1430,9 +1435,15 @@ def _reported_benunit_mask(person_column: str) -> np.ndarray:
 
 pe_benunit["is_married"] = frs["benunit"].famtypb2.isin([5, 7])
 
-# Stochastically set property_purchased based on UK housing transaction rate.
-# Previously defaulted to True in policyengine-uk, causing all households
-# to be charged SDLT as if they just bought their property (£370bn total).
+# Assign property_purchased to a share of households matching the UK
+# housing transaction rate, so only genuine purchasers are charged SDLT.
+#
+# This MUST be deterministic: a rules engine's inputs have to be
+# reproducible across builds. Use a seeded Generator (not global
+# np.random, whose state depends on whatever ran earlier in the build)
+# so the same FRS input always yields the same assignment. An unseeded
+# draw previously made the build non-reproducible and intermittently
+# spiked the first decile's effective tax rate.
 #
 # Sources:
 # - Transactions: HMRC 2024 - 1.1m/year
@@ -1443,11 +1454,13 @@ def _reported_benunit_mask(person_column: str) -> np.ndarray:
 #
 # Verification against official SDLT revenue (2024-25):
 # - Official SDLT: £13.9bn (https://www.gov.uk/government/statistics/uk-stamp-tax-statistics)
-# - With fix (3.85%): £15.7bn (close to official)
-# - Without fix (100%): £370bn (26x too high)
+# - With 3.85% purchasers: £15.7bn (close to official)
+# - With every household a purchaser: £370bn (26x too high)
 PROPERTY_PURCHASE_RATE = 0.0385
+PROPERTY_PURCHASE_SEED = 0
+purchase_rng = np.random.default_rng(PROPERTY_PURCHASE_SEED)
 pe_household["property_purchased"] = (
-    np.random.random(len(pe_household)) < PROPERTY_PURCHASE_RATE
+    purchase_rng.random(len(pe_household)) < PROPERTY_PURCHASE_RATE
 )
 
 if not include_internal_disability_reported_amounts:
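The BRMA hunks in frs.py pass one shared Generator to every pandas `.sample()` call. The sketch below illustrates that pattern on a toy frame; it is an assumption-laden example rather than the project's code. The function name `pick_one_per_household` and the data are hypothetical, and it assumes a pandas version whose `sample()` accepts a `numpy.random.Generator` as `random_state`.

```python
import numpy as np
import pandas as pd


def pick_one_per_household(df, seed=0):
    # One Generator is created per call and shared by every group,
    # so each pick advances the same reproducible stream.
    rng = np.random.default_rng(seed)
    return df.groupby("household_id").brma.aggregate(
        lambda x: x.sample(n=1, random_state=rng).iloc[0]
    )


# Hypothetical benunit-level data: households 1 and 2 contain two
# benefit units each, household 3 contains one.
toy = pd.DataFrame(
    {
        "household_id": [1, 1, 2, 2, 3],
        "brma": ["A", "B", "C", "D", "E"],
    }
)

first = pick_one_per_household(toy)
second = pick_one_per_household(toy)
print(first.equals(second))
```

Passing the Generator object rather than an integer seed matters here: an integer `random_state` would restart the same stream on every call, so groups of equal size would always pick the same position instead of drawing independently.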

policyengine_uk_data/datasets/imputations/capital_gains.py

Lines changed: 6 additions & 1 deletion
@@ -117,6 +117,11 @@ def loss(blend_factor):
 
 logging.info("Imputing capital gains among those with gains")
 
+# Draw imputation quantiles from a seeded generator so the build is
+# reproducible: an unseeded global np.random made capital gains (and hence
+# CGT revenue) differ between otherwise identical builds.
+cg_rng = np.random.default_rng(0)
+
 for i in range(len(capital_gains)):
     row = capital_gains.iloc[i]
     spline = UnivariateSpline(
@@ -128,7 +133,7 @@ def loss(blend_factor):
     upper = row.maximum_total_income
     ti_in_range = (ti >= lower) * (ti < upper)
     in_target_range = has_cg * ti_in_range > 0
-    quantiles = np.random.random(int(in_target_range.sum()))
+    quantiles = cg_rng.random(int(in_target_range.sum()))
     pred_capital_gains = spline(quantiles)
     new_cg[in_target_range] = pred_capital_gains
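The capital gains change creates the Generator once, before the loop, and draws quantiles from it inside each iteration. The following sketch shows why that placement is a reasonable choice. It is a simplified illustration, not the imputation itself: `impute_bands` is a hypothetical function, the quantile-to-amount table is made up, and `np.interp` stands in for the fitted `UnivariateSpline` used in the real code.

```python
import numpy as np


def impute_bands(counts, seed=0):
    # Created once, outside the loop: each income band consumes the
    # next stretch of a single reproducible stream.
    rng = np.random.default_rng(seed)

    # Hypothetical quantile function (quantile -> capital gains amount).
    probs = np.array([0.0, 0.5, 1.0])
    amounts = np.array([0.0, 1_000.0, 50_000.0])

    results = []
    for n in counts:
        quantiles = rng.random(n)  # uniform draws in [0, 1)
        results.append(np.interp(quantiles, probs, amounts))
    return results


first = impute_bands([4, 4])
second = impute_bands([4, 4])

runs_match = all(np.array_equal(a, b) for a, b in zip(first, second))
bands_differ = not np.array_equal(first[0], first[1])
print(runs_match, bands_differ)
```

Re-creating `default_rng(0)` inside the loop would also be reproducible, but every band would then reuse the same quantiles, which would correlate the imputed amounts across bands.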
