Fixing #872 (#892)
Conversation
`FinetunedTabPFNBase._fit` was overwriting `self.X_`/`self.y_` with the numpy arrays returned by sklearn validation, so the final inference estimator was refit on numpy inputs and never recorded the original DataFrame's feature names. Retain the raw inputs before validation so the inference model sees the DataFrame and sets `feature_names_in_`, avoiding spurious "X does not have valid feature names" warnings when predicting on DataFrames. Fixes #872.
Code Review
This pull request fixes an issue where pandas feature names were dropped from the final inference model in `FinetunedTabPFN` estimators. The change moves the assignment of the raw training inputs ahead of the validation step so that feature names are retained. A review comment suggests refactoring this logic to avoid setting fitted attributes before validation is complete, since a validation failure would leave the estimator in an inconsistent state, and recommends explicitly capturing feature names and counts for better scikit-learn API compliance.
Issue
Closes #872
We can just store `self.X_ = X` with the original data instead of assigning after `ensure_compatible_fit_inputs_sklearn`, which anyway refits at a later point via `self.finetuned_inference_classifier_.fit(self.X_, self.y_)` and internally re-runs `ensure_compatible_fit_inputs_sklearn`. I could not identify why this would only be an "issue" when providing `X_val`, as pointed out in the opened issue.
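The ordering suggested above, combined with the reviewer's point about scikit-learn API compliance, can be sketched as: validate into local variables only, record `feature_names_in_`/`n_features_in_` from the raw input, and keep the raw inputs for the later refit. This is a hypothetical sketch using sklearn's `check_X_y` as a stand-in for `ensure_compatible_fit_inputs_sklearn`; the class and attribute layout are assumptions, not the actual TabPFN code:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator
from sklearn.utils.validation import check_X_y

class SketchEstimator(BaseEstimator):
    """Hypothetical sketch: validate without overwriting the raw inputs."""

    def fit(self, X, y):
        # Validate into locals; self.X_/self.y_ are NOT overwritten here.
        X_checked, y_checked = check_X_y(X, y)  # stand-in for the TabPFN helper
        # scikit-learn conventions: names come from the raw input,
        # counts from the validated array.
        if hasattr(X, "columns"):
            self.feature_names_in_ = np.asarray(X.columns, dtype=object)
        self.n_features_in_ = X_checked.shape[1]
        # Retain the raw inputs so a later refit of the inference model
        # still sees the DataFrame and its feature names.
        self.X_, self.y_ = X, y
        return self

X = pd.DataFrame({"a": [0.0, 1.0, 2.0, 3.0], "b": [1.0, 0.0, 1.0, 0.0]})
est = SketchEstimator().fit(X, np.array([0, 1, 0, 1]))
print(list(est.feature_names_in_))  # ['a', 'b']
```

Setting the fitted attributes only after `check_X_y` succeeds also addresses the review concern: a validation error can no longer leave a half-fitted estimator behind.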