Commit 4c06956

MaxGhenis and claude committed
Join region from household frame in stage-2 QRF predictors
CI surfaced the KeyError: 'region' because the FRS build stores region on the household frame, not the person frame. Route the lookup through person_household_id so the stage-2 QRF trains and predicts on the household-derived region column without needing a full Microsimulation bootstrap (which would require a host of unrelated household columns like council_tax, tenure_type, etc., that test fixtures don't carry).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
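
The household-to-person lookup described in the commit message can be sketched on toy data. The frames and values below are invented for illustration; only the column names household_id, person_household_id and region come from the commit itself.

```python
import pandas as pd

# Toy frames; the values are invented for illustration.
household = pd.DataFrame(
    {"household_id": [10, 20], "region": ["LONDON", "WALES"]}
)
person = pd.DataFrame(
    {"person_id": [1, 2, 3], "person_household_id": [10, 10, 20]}
)

# Index the household frame by its id so Series.map can use it as a lookup table.
hh_region = household.set_index("household_id")["region"]

# Each person row receives the region of the household it belongs to.
person["region"] = person["person_household_id"].map(hh_region)

print(person["region"].tolist())  # ['LONDON', 'LONDON', 'WALES']
```

Note that Series.map returns NaN for a person_household_id with no matching household row instead of raising, so a caller may want to check the joined column for missing values before one-hot encoding it.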
1 parent c886547 commit 4c06956

1 file changed

Lines changed: 34 additions & 4 deletions

policyengine_uk_data/datasets/imputations/frs_only.py

@@ -125,6 +125,38 @@ def _align_columns(
     )


+def _build_predictor_frame(dataset: UKSingleYearDataset) -> pd.DataFrame:
+    """Return a person-indexed DataFrame of stage-2 predictor columns.
+
+    ``region`` lives on the household frame in the enhanced-FRS build,
+    so it is joined onto each person row via ``person_household_id``.
+    Remaining predictors (age, gender, the six income components) are
+    read directly from the person frame. If the person frame already
+    carries ``region`` (as in some test fixtures and the standalone SPI
+    build) that value wins and no join is performed.
+    """
+    person = dataset.person
+    predictors = STAGE2_DEMOGRAPHIC_PREDICTORS + STAGE2_INCOME_PREDICTORS
+
+    if "region" in person.columns:
+        frame = person[predictors].copy()
+    elif (
+        "region" in dataset.household.columns
+        and "person_household_id" in person.columns
+    ):
+        hh_region = dataset.household.set_index("household_id")["region"]
+        person_region = person["person_household_id"].map(hh_region)
+        frame = person[[c for c in predictors if c != "region"]].copy()
+        frame["region"] = person_region.values
+        frame = frame[predictors]
+    else:
+        raise KeyError(
+            "Stage-2 imputation needs 'region' either on the person frame "
+            "or on the household frame with a 'person_household_id' join key."
+        )
+    return frame
+
+
 def impute_frs_only_variables(
     train_dataset: UKSingleYearDataset,
     target_dataset: UKSingleYearDataset,
@@ -171,10 +203,8 @@ def impute_frs_only_variables(
         )
         return target_dataset

-    predictors = STAGE2_DEMOGRAPHIC_PREDICTORS + STAGE2_INCOME_PREDICTORS
-
-    train_inputs_raw = train_person[predictors].copy()
-    target_inputs_raw = target_person[predictors].copy()
+    train_inputs_raw = _build_predictor_frame(train_dataset)
+    target_inputs_raw = _build_predictor_frame(target_dataset)

     train_inputs = _one_hot_encode(train_inputs_raw, columns=["gender", "region"])
     target_inputs = _one_hot_encode(target_inputs_raw, columns=["gender", "region"])
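
A fixture-style check of the three branches in _build_predictor_frame might look like the sketch below. It is an illustration, not the repository's test. The helper is re-stated here with early returns so the block runs on its own, SimpleNamespace stands in for UKSingleYearDataset, and the two predictor lists are placeholder assumptions because the real constants are not shown in this diff.

```python
from types import SimpleNamespace

import pandas as pd

# Placeholder predictor lists (assumption): the real constants live elsewhere
# in frs_only.py and are not visible in this diff.
STAGE2_DEMOGRAPHIC_PREDICTORS = ["age", "gender", "region"]
STAGE2_INCOME_PREDICTORS = ["employment_income"]


def _build_predictor_frame(dataset) -> pd.DataFrame:
    """Re-statement of the helper's logic from the diff, without the type hint."""
    person = dataset.person
    predictors = STAGE2_DEMOGRAPHIC_PREDICTORS + STAGE2_INCOME_PREDICTORS
    if "region" in person.columns:
        return person[predictors].copy()
    if (
        "region" in dataset.household.columns
        and "person_household_id" in person.columns
    ):
        hh_region = dataset.household.set_index("household_id")["region"]
        frame = person[[c for c in predictors if c != "region"]].copy()
        frame["region"] = person["person_household_id"].map(hh_region).values
        return frame[predictors]
    raise KeyError("'region' not found on the person or household frame")


# Toy data; all values are invented for illustration.
person = pd.DataFrame(
    {
        "person_household_id": [10, 10, 20],
        "age": [34, 8, 61],
        "gender": ["FEMALE", "MALE", "MALE"],
        "employment_income": [30_000.0, 0.0, 12_000.0],
    }
)
household = pd.DataFrame({"household_id": [10, 20], "region": ["LONDON", "WALES"]})

# Region only on the household frame: joined via person_household_id.
joined = _build_predictor_frame(SimpleNamespace(person=person, household=household))

# Region already on the person frame: that value wins and no join happens.
person_with_region = person.assign(region="SCOTLAND")
direct = _build_predictor_frame(
    SimpleNamespace(person=person_with_region, household=household)
)

# Region on neither frame: the helper raises KeyError.
try:
    _build_predictor_frame(
        SimpleNamespace(person=person, household=household.drop(columns="region"))
    )
    raised = False
except KeyError:
    raised = True
```

Because the stand-in dataset only needs person and household attributes, a check like this avoids the full Microsimulation bootstrap that the commit message says the test fixtures cannot support.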
