WIP: add preprocessing presets kwarg to TabPFNTSPipeline #113
Exposes tabpfn's inference-time X/y preprocessing pipeline via a simple
string enum on TabPFNTSPipeline. Today, changing it requires threading
`inference_config` through `tabpfn_model_config` with PreprocessorConfig
objects imported from a nested tabpfn module — not discoverable.
New kwarg: preprocessing in {"default", "none", "squashing_scaler"}.
- "default" keeps current behaviour (tabpfn library defaults).
- "none" disables X/y preprocessing entirely
(PREPROCESS_TRANSFORMS=[PreprocessorConfig("none")], REGRESSION_Y_PREPROCESS_TRANSFORMS=[None]).
- "squashing_scaler" uses squashing_scaler_max10 + svd_quarter_components
followed by a numeric "none" config.
Empirically on the fev-bench small/non-lite split, "none" and
"squashing_scaler" give a +0.05 SQL skill boost over defaults for both
the library default checkpoint and OOD-finetuned variants. On the
fev-bench lite split the defaults usually win (small regression from
removing preprocessing). Exposing the knob lets users try both.
An explicit `inference_config` in `tabpfn_model_config` still wins over
the preset; a warning is emitted if both are supplied.
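The precedence rule above can be sketched in isolation. This is a minimal, self-contained illustration and not the PR's code: `build_preset_config` and `apply_preset` are hypothetical stand-ins, and the preset payload is an opaque placeholder dict rather than real `PreprocessorConfig` objects.

```python
import warnings
from typing import Any, Dict, Optional

PRESETS = ("default", "none", "squashing_scaler")


def build_preset_config(preset: str) -> Optional[Dict[str, Any]]:
    """Hypothetical stand-in for build_preprocessing_inference_config().

    Returns None for "default" and an opaque placeholder otherwise; the real
    helper is described as producing PreprocessorConfig-based settings.
    """
    if preset not in PRESETS:
        raise ValueError(f"Unknown preprocessing preset: {preset!r}")
    return None if preset == "default" else {"preset": preset}


def apply_preset(model_config: Dict[str, Any], preset: str) -> Dict[str, Any]:
    """Merge a preset into a copy of model_config; an explicit config wins."""
    config = dict(model_config)  # shallow copy, the caller's dict is left alone
    preset_cfg = build_preset_config(preset)
    if preset_cfg is None:
        return config
    if "inference_config" in config:
        warnings.warn(
            "Both a preset and an explicit inference_config were given; "
            "ignoring the preset.",
            stacklevel=2,
        )
        return config
    config["inference_config"] = preset_cfg
    return config


# "default" leaves the config untouched.
assert apply_preset({}, "default") == {}
# A non-default preset is injected when nothing explicit is present.
assert apply_preset({}, "none") == {"inference_config": {"preset": "none"}}

# An explicit inference_config wins, and exactly one warning is recorded.
explicit = {"inference_config": {"user": True}}
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    merged = apply_preset(explicit, "squashing_scaler")
assert merged["inference_config"] == {"user": True}
assert len(caught) == 1
```

Warning instead of raising keeps existing callers that already pass `inference_config` working while still surfacing the conflicting kwargs.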
Code Review
This pull request introduces a preprocessing parameter to the TabPFNTSPipeline constructor, allowing users to select from predefined inference-time preprocessing presets. These presets are defined in a new preprocessing_presets module. A review comment pointed out that the _apply_preprocessing_preset method should consistently return a copy of the configuration dictionary to prevent accidental mutation of the global default settings and to match the method's docstring.
preset_cfg = build_preprocessing_inference_config(preprocessing)
if preset_cfg is None:
    return tabpfn_model_config
if "inference_config" in tabpfn_model_config:
    # User-supplied inference_config takes precedence; warn so the
    # mismatch between kwargs is discoverable.
    warnings.warn(
        "Both `preprocessing` and `tabpfn_model_config['inference_config']` "
        "were provided. Using the explicit `inference_config` from "
        "`tabpfn_model_config` and ignoring the preset.",
        stacklevel=3,
    )
    return tabpfn_model_config
return {**tabpfn_model_config, "inference_config": preset_cfg}
The docstring for _apply_preprocessing_preset states that it injects the configuration into a copy of the input, but the implementation returns the original tabpfn_model_config object in several branches (lines 293 and 303). Since the default value for tabpfn_model_config in the constructor is a shared global dictionary (TABPFN_DEFAULT_CONFIG), returning it directly can lead to accidental mutation of the default configuration if the predictor or other components modify it in-place.
Always returning a copy ensures the original configuration remains immutable and consistent with the docstring.
config = tabpfn_model_config.copy()
preset_cfg = build_preprocessing_inference_config(preprocessing)
if preset_cfg is None:
    return config
if "inference_config" in config:
    # User-supplied inference_config takes precedence; warn so the
    # mismatch between kwargs is discoverable.
    warnings.warn(
        "Both `preprocessing` and `tabpfn_model_config['inference_config']` "
        "were provided. Using the explicit `inference_config` from "
        "`tabpfn_model_config` and ignoring the preset.",
        stacklevel=3,
    )
    return config
config["inference_config"] = preset_cfg
return config
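The aliasing concern behind this suggestion can be reproduced with plain dictionaries. The sketch below uses a hypothetical stand-in for the shared default (its keys are invented for illustration, and whether `TABPFN_DEFAULT_CONFIG` actually holds nested mutable values is an assumption). It also shows that `dict.copy()` is shallow: top-level keys are protected, but nested objects remain shared unless `copy.deepcopy` is used.

```python
import copy


def make_default():
    # Hypothetical stand-in for a shared default config; keys are illustrative.
    return {"device": "cpu", "options": {"n_estimators": 8}}


# 1. Returning the input unchanged: the result IS the default object,
#    so a later in-place write leaks into the default.
default = make_default()
returned = default
returned["device"] = "cuda"
aliased_leak = default["device"] == "cuda"

# 2. Shallow copy: top-level writes are isolated, nested objects are shared.
default = make_default()
returned = default.copy()
returned["device"] = "cuda"
returned["options"]["n_estimators"] = 1
shallow_top_level_safe = default["device"] == "cpu"
shallow_nested_leak = default["options"]["n_estimators"] == 1

# 3. Deep copy: nested writes are isolated as well.
default = make_default()
returned = copy.deepcopy(default)
returned["options"]["n_estimators"] = 1
deep_nested_safe = default["options"]["n_estimators"] == 8

assert aliased_leak
assert shallow_top_level_safe and shallow_nested_leak
assert deep_nested_safe
```

A shallow copy is enough for the suggested change, since it only assigns the top-level `inference_config` key; a deep copy would only matter if downstream code mutates nested values in place.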
Summary
Expose tabpfn's inference-time X/y preprocessing pipeline via a simple string enum on `TabPFNTSPipeline`. Today, changing it requires threading `inference_config` through `tabpfn_model_config` with `PreprocessorConfig` objects imported from a nested tabpfn module, which is not discoverable.
New kwarg: `preprocessing: Literal["default", "none", "squashing_scaler"] = "default"`.
- `"default"` keeps current behaviour (tabpfn library defaults).
- `"none"` disables X/y preprocessing entirely (`PREPROCESS_TRANSFORMS=[PreprocessorConfig("none")]`, `REGRESSION_Y_PREPROCESS_TRANSFORMS=[None]`).
- `"squashing_scaler"` uses `squashing_scaler_max10` + `svd_quarter_components` followed by a numeric `"none"` config.
An explicit `inference_config` in `tabpfn_model_config` still wins over the preset; a warning is emitted if both are supplied.
Motivation
Empirical sweep on fev-bench (3 checkpoints × 2 splits × 3 preprocs, 846 per-task results): on the non-lite/small-datasets split, `"none"` and `"squashing_scaler"` both give a +0.05 SQL skill boost over defaults for both the library default checkpoint and OOD-finetuned variants. On the lite split the library defaults usually win (small regression from removing preprocessing). Worth exposing as a user-facing knob.
Changes
- `tabpfn_time_series/preprocessing_presets.py` (new): `PreprocessingPreset` type + `build_preprocessing_inference_config()` helper.
- `tabpfn_time_series/pipeline.py`: add `preprocessing` kwarg + `_apply_preprocessing_preset()` merging logic.
- `tests/test_preprocessing_presets.py`: 6 tests covering the helper and the pipeline integration (both with and without user-supplied `inference_config`).
Test plan
- `pytest tests/test_preprocessing_presets.py`
- `test_preprocessing.py`, `test_predictor.py` still pass.
- `test_pipeline.py::test_predict_client_mode` requires TabPFN-cloud credentials (prompts for login interactively), unrelated to this change.
🤖 Generated with Claude Code