36 changes: 35 additions & 1 deletion CLAUDE.md
Original file line number Diff line number Diff line change
Expand Up @@ -28,4 +28,38 @@
- Organized to match the same structure as parameters/
- Comments should include relevant regulatory citations and calculation logic
- **tests/**: Test cases for validating correct implementation of policies
- For government departments, use the correct department for the policy (e.g., education policies under DfE, not DWP)
- For government departments, use the correct department for the policy (e.g., education policies under DfE, not DWP)

## Lessons Learned from Variable Refactoring (2025-05-31)

### Issue: Refactoring Script Bug
A refactoring script was used to split multi-variable Python files into single-variable files. The script had a systematic bug: when a file contained multiple Variable classes and one had the same name as the file, that variable was dropped entirely.
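
A check along the following lines might catch this kind of loss before a split is committed. It is a minimal sketch, not the script used in this refactoring: `split_variable_classes` and `dropped_classes` are hypothetical helper names, and the example source is invented for illustration. It uses only the standard-library `ast` module to list every top-level class in a file, including one that shares the file's name.

```python
import ast


def split_variable_classes(source):
    """Map each top-level class name to its source text (hypothetical helper)."""
    tree = ast.parse(source)
    return {
        node.name: ast.get_source_segment(source, node)
        for node in tree.body
        if isinstance(node, ast.ClassDef)
    }


def dropped_classes(source, written):
    """Return the class names found in `source` that were not written out."""
    return set(split_variable_classes(source)) - set(written)


# Invented example: a file named tax_credits.py holding two Variable classes.
example = '''
class tax_credits(Variable):
    label = "Tax credits"


class working_tax_credit(Variable):
    label = "Working Tax Credit"
'''

classes = split_variable_classes(example)
# Simulate a split that wrote every class except the one matching the file name.
missing = dropped_classes(example, {"working_tax_credit"})
print(sorted(classes), missing)
```

Comparing the set of classes before and after the split, rather than trusting the script's own output, makes a dropped class visible immediately.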

### Variables Dropped During Refactoring
21 variables were dropped, including:
- Wrapper variables: `jsa_income`, `esa_income`, `jsa_contrib`, `esa_contrib`, `afcs`, `bsp`, `iidb`
- Calculated variables: `benefit_cap`, `carers_allowance`, `income_support`, `maternity_allowance`, `sda`, `tax_credits`
- Core variables: `child_benefit`, `allowances`, `marriage_allowance`, `stamp_duty_land_tax`, `vat`, `land_transaction_tax`, `attendance_allowance`, `council_tax_benefit`, `business_rates`, `tax`, `total_wealth`, `private_school_vat`, `bi_phaseout`

### Enum Recovery Errors
During recovery, enums were recreated based on assumptions rather than checking original code, leading to incorrect values:
- **TenureType**: Missing `OWNED_OUTRIGHT` and `OWNED_WITH_MORTGAGE` (consolidated into `OWNER_OCCUPIED`)
- **AccommodationType**: Wrong labels (e.g., "House - detached" vs "Detached house")
- **EducationType**: Wrong capitalization ("Lower secondary" vs "Lower Secondary")
- **FamilyType**: Shortened labels (missing ", with children" suffix)
- **EmploymentStatus**: Complete restructure (missing FT_/PT_ prefixes)
- **MinimumWageCategory**: Wrong format ("18-20" vs "18 to 20", "Over 24" vs "25 or over")
- **CouncilTaxBand**: Added incorrect "Band " prefix
- **StatePensionType**: Wrong capitalization ("Basic" vs "basic")
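
One way to catch mismatches like these is to compare a recovered enum against the original member by member. The sketch below assumes both versions can be loaded side by side; it uses the standard-library `Enum` as a stand-in for the OpenFisca enum class, `compare_enums` is a hypothetical helper, and the label strings are invented for illustration rather than taken from the original code.

```python
from enum import Enum


def enum_members(enum_cls):
    """Return a name -> label mapping for an enum class."""
    return {member.name: member.value for member in enum_cls}


def compare_enums(original, recovered):
    """Report members that are missing, unexpected, or relabelled."""
    a, b = enum_members(original), enum_members(recovered)
    return {
        "missing": sorted(a.keys() - b.keys()),
        "unexpected": sorted(b.keys() - a.keys()),
        "changed": {
            name: (a[name], b[name])
            for name in sorted(a.keys() & b.keys())
            if a[name] != b[name]
        },
    }


# Hypothetical labels; only the member names echo the list above.
class OriginalTenureType(Enum):
    RENT_PRIVATELY = "Rent privately"
    OWNED_OUTRIGHT = "Owned outright"
    OWNED_WITH_MORTGAGE = "Owned with a mortgage"


class RecoveredTenureType(Enum):
    RENT_PRIVATELY = "Rented privately"
    OWNER_OCCUPIED = "Owner occupied"


report = compare_enums(OriginalTenureType, RecoveredTenureType)
print(report)
```

The comparison is exact on purpose: a difference in capitalization or a dropped suffix shows up under `changed` instead of passing silently.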

### Best Practices for Refactoring Recovery
1. **Always check original code** - Never rely on assumptions or context clues
2. **Use git history systematically** - `git show <commit>:path/to/file` to see original content
3. **Verify enum values exactly** - Even small differences in labels can break functionality
4. **Test incrementally** - Run tests after each fix to ensure progress
5. **Document the recovery process** - Track what was fixed for future reference
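
Step 2 can be scripted so that original file contents are fetched the same way every time. This is a minimal sketch, assuming `git` is on the PATH and the working directory is inside the repository; `git_show_command` and `read_file_at_commit` are hypothetical helper names, and the commit and path in the example are placeholders.

```python
import subprocess


def git_show_command(commit, path):
    """Build the `git show <commit>:<path>` argument list."""
    return ["git", "show", f"{commit}:{path}"]


def read_file_at_commit(commit, path, repo_dir="."):
    """Return the content of `path` as it existed at `commit`."""
    result = subprocess.run(
        git_show_command(commit, path),
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if git fails
    )
    return result.stdout


# Placeholder commit and path, for illustration only.
command = git_show_command("abc123", "path/to/file.py")
print(" ".join(command))
```

Calling `read_file_at_commit` for the commit just before the refactoring gives the text to compare recovered enums and variables against.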

### OpenFisca-Specific Patterns
- Use `.possible_values` to access enum values, not direct imports
- Import helper functions like `find_freeze_start` when needed
- Functions like `ceil` should be `np.ceil` in OpenFisca context
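
The `np.ceil` point follows from formulas operating on whole arrays rather than single numbers. The sketch below illustrates the difference on an invented array; it assumes only that NumPy is installed and does not use any OpenFisca API.

```python
import math

import numpy as np

values = np.array([0.2, 1.0, 2.5])  # invented sample values

# math.ceil expects a single number, so it rejects a multi-element array.
try:
    math.ceil(values)
    scalar_ceil_works = True
except TypeError:
    scalar_ceil_works = False

# np.ceil applies element-wise and returns an array of the same shape.
rounded = np.ceil(values)
print(scalar_ceil_works, rounded)
```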
5 changes: 5 additions & 0 deletions changelog_entry.yaml
@@ -0,0 +1,5 @@
- bump: patch
  changes:
    changed:
    - Refactored all Variable files to follow single-responsibility principle with one Variable class per file.
    - Split approximately 70 multi-Variable Python files into individual files, improving code organization and maintainability.
13 changes: 9 additions & 4 deletions docs/book/index.ipynb
Expand Up @@ -58,14 +58,16 @@
"\n",
"df = sim.calculate_dataframe(\n",
" [\n",
" \"household_id\", # If the first variable is household level, the dataframe will project everything to households. Same for people.\n",
" \"household_id\", # If the first variable is household level, the dataframe will project everything to households. Same for people.\n",
" \"income_tax\",\n",
" \"region\",\n",
" ],\n",
" period=2025\n",
" period=2025,\n",
")\n",
"\n",
"df.groupby(\"region\").income_tax.sum().sort_values(ascending=False)/1e9 # Weights automatically applied"
"df.groupby(\"region\").income_tax.sum().sort_values(\n",
" ascending=False\n",
") / 1e9 # Weights automatically applied"
]
},
{
Expand Down Expand Up @@ -115,7 +117,10 @@
"\n",
"baseline = Microsimulation(dataset=ENHANCED_FRS)\n",
"reformed = Microsimulation(dataset=ENHANCED_FRS, reform=reform)\n",
"revenue = reformed.calculate(\"gov_balance\", 2025).sum() - baseline.calc(\"gov_balance\", 2025).sum()\n",
"revenue = (\n",
" reformed.calculate(\"gov_balance\", 2025).sum()\n",
" - baseline.calc(\"gov_balance\", 2025).sum()\n",
")\n",
"f\"Revenue: £{round(revenue / 1e+9, 1)}bn\""
]
}
Expand Down
8 changes: 6 additions & 2 deletions docs/book/usage/getting-started.ipynb
Expand Up @@ -175,7 +175,9 @@
"source": [
"from policyengine_uk import Microsimulation\n",
"\n",
"sim = Microsimulation(dataset=\"hf://policyengine/policyengine-uk-data/enhanced_frs_2022_23.h5\")\n",
"sim = Microsimulation(\n",
" dataset=\"hf://policyengine/policyengine-uk-data/enhanced_frs_2022_23.h5\"\n",
")\n",
"\n",
"# The hf:// points to the private data-\n",
"# hf:// <- go get the data from huggingface\n",
Expand Down Expand Up @@ -211,7 +213,9 @@
"source": [
"ENHANCED_FRS = \"hf://policyengine/policyengine-uk-data/enhanced_frs_2022_23.h5\"\n",
"\n",
"baseline = Microsimulation(dataset=ENHANCED_FRS) # Enhanced FRS 2022 by default\n",
"baseline = Microsimulation(\n",
" dataset=ENHANCED_FRS\n",
") # Enhanced FRS 2022 by default\n",
"reformed = Microsimulation(dataset=ENHANCED_FRS, reform=increase_basic_rate)\n",
"\n",
"revenue = (\n",
Expand Down
49 changes: 49 additions & 0 deletions find_missing_enums.py
@@ -0,0 +1,49 @@
import os
import re
import ast


def find_undefined_names(directory):
    """Find all potentially undefined names in Python files."""
    undefined = set()

    for root, dirs, files in os.walk(directory):
        for file in files:
            if file.endswith(".py"):
                filepath = os.path.join(root, file)
                try:
                    with open(filepath, "r") as f:
                        content = f.read()

                    # Look for patterns like "possible_values = SomeName" or "SomeName.VALUE"
                    enum_refs = re.findall(
                        r"possible_values\s*=\s*(\w+)", content
                    )
                    enum_refs.extend(
                        re.findall(
                            r"(\w+)\.(?:NONE|LOWER|HIGHER|MIDDLE|STANDARD|ENHANCED|MALE|FEMALE|SINGLE|COUPLE)",
                            content,
                        )
                    )

                    for name in enum_refs:
                        if name not in [
                            "self",
                            "person",
                            "household",
                            "benunit",
                            "parameters",
                        ]:
                            # Check if it's defined in the file
                            if f"class {name}" not in content:
                                undefined.add((name, filepath))

                except Exception as e:
                    print(f"Error processing {filepath}: {e}")

    return undefined


undefined_names = find_undefined_names("policyengine_uk/variables/")
for name, filepath in sorted(undefined_names):
    print(f"{name} used in {filepath}")
Expand Up @@ -7,8 +7,7 @@ metadata:
propagate_metadata_to_children: true
reference:
- href: https://www.legislation.gov.uk/uksi/2002/2008/regulation/9
name: The Tax Credits (Income Thresholds and Determination of Rates) Regulations
2002
name: The Tax Credits (Income Thresholds and Determination of Rates) Regulations 2002
unit: currency-USD
values:
2002-08-01: 26
2 changes: 1 addition & 1 deletion policyengine_uk/tests/microsimulation/test_validity.py
@@ -1,4 +1,4 @@
from policyengine import Simulation
from policyengine_uk import Simulation
import pytest

YEARS = range(2024, 2026)
Expand Down
81 changes: 81 additions & 0 deletions policyengine_uk/variables/contrib/labour/attends_private_school.py
@@ -0,0 +1,81 @@
from policyengine_uk.model_api import *


def interpolate_percentile(param, percentile):
    if str(percentile) in param:
        return param[str(percentile)]
    else:
        idx = percentile - (percentile % 5)
        p1 = idx
        p2 = idx + 5
        v1 = param[str(idx)]
        v2 = param[str(idx + 5)]
        return v1 + (v2 - v1) * (percentile - p1) / (p2 - p1)


class attends_private_school(Variable):
    label = "attends private school"
    entity = Person
    definition_period = YEAR
    value_type = bool

    def formula(person, period, parameters):
        if not hasattr(person.simulation, "dataset"):
            return 0
        household = person.household
        # To ensure that our model matches
        # total number of students actually enrolled

        ps_vat_params = parameters(period).gov.simulation.private_school_vat
        private_school_attendance_rate = (
            ps_vat_params.private_school_attendance_rate
        )

        population_adjustment_factor = ps_vat_params.private_school_factor

        person = household.members

        is_child = person("is_child", period)

        taxes = household.sum(
            person("income_tax", period) + person("national_insurance", period)
        )

        net_income = (
            household("household_market_income", period)
            + household("household_benefits", period)
            - taxes
        )

        household_weight = household("household_weight", period)
        weighted_income = MicroSeries(net_income, weights=household_weight)

        if household_weight.sum() < 1e6:
            return 0

        percentile = np.zeros_like(weighted_income).astype(numpy.int64)
        mask = household_weight > 0

        percentile[mask] = (
            weighted_income[mask]
            .percentile_rank()
            .clip(0, 100)
            .values.astype(numpy.int64)
        )
        # STUDENT_POPULATION_ADJUSTMENT_FACTOR = 0.78
        STUDENT_POPULATION_ADJUSTMENT_FACTOR = population_adjustment_factor

        p_attends_private_school = (
            np.array(
                [
                    interpolate_percentile(private_school_attendance_rate, p)
                    for p in percentile
                ]
            )
            * STUDENT_POPULATION_ADJUSTMENT_FACTOR
            * is_child
        )

        value = random(person) < p_attends_private_school

        return value
81 changes: 0 additions & 81 deletions policyengine_uk/variables/contrib/labour/private_school_vat.py
@@ -1,73 +1,4 @@
from policyengine_uk.model_api import *
from policyengine_uk.variables.gov.hmrc.tax import household_tax


class attends_private_school(Variable):
    label = "attends private school"
    entity = Person
    definition_period = YEAR
    value_type = bool

    def formula(person, period, parameters):
        if not hasattr(person.simulation, "dataset"):
            return 0
        household = person.household
        # To ensure that our model matches
        # total number of students actually enrolled

        ps_vat_params = parameters(period).gov.simulation.private_school_vat
        private_school_attendance_rate = (
            ps_vat_params.private_school_attendance_rate
        )

        population_adjustment_factor = ps_vat_params.private_school_factor

        person = household.members

        is_child = person("is_child", period)

        taxes = household.sum(
            person("income_tax", period) + person("national_insurance", period)
        )

        net_income = (
            household("household_market_income", period)
            + household("household_benefits", period)
            - taxes
        )

        household_weight = household("household_weight", period)
        weighted_income = MicroSeries(net_income, weights=household_weight)

        if household_weight.sum() < 1e6:
            return 0

        percentile = np.zeros_like(weighted_income).astype(numpy.int64)
        mask = household_weight > 0

        percentile[mask] = (
            weighted_income[mask]
            .percentile_rank()
            .clip(0, 100)
            .values.astype(numpy.int64)
        )
        # STUDENT_POPULATION_ADJUSTMENT_FACTOR = 0.78
        STUDENT_POPULATION_ADJUSTMENT_FACTOR = population_adjustment_factor

        p_attends_private_school = (
            np.array(
                [
                    interpolate_percentile(private_school_attendance_rate, p)
                    for p in percentile
                ]
            )
            * STUDENT_POPULATION_ADJUSTMENT_FACTOR
            * is_child
        )

        value = random(person) < p_attends_private_school

        return value


class private_school_vat(Variable):
Expand All @@ -94,15 +25,3 @@ def formula(household, period, parameters):
            * private_school_vat_rate
            * private_school_vat_basis
        )


def interpolate_percentile(param, percentile):
    if str(percentile) in param:
        return param[str(percentile)]
    else:
        idx = percentile - (percentile % 5)
        p1 = idx
        p2 = idx + 5
        v1 = param[str(idx)]
        v2 = param[str(idx + 5)]
        return v1 + (v2 - v1) * (percentile - p1) / (p2 - p1)