WIP: add preprocessing presets kwarg to TabPFNTSPipeline#113

Draft: LeoGrin wants to merge 1 commit into main from leo/preprocessing-options

@LeoGrin LeoGrin commented Apr 22, 2026

Summary

Expose tabpfn's inference-time X/y preprocessing pipeline via a simple string enum on TabPFNTSPipeline. Today changing it requires threading inference_config through tabpfn_model_config with PreprocessorConfig objects imported from a nested tabpfn module — not discoverable.

New kwarg: preprocessing: Literal["default", "none", "squashing_scaler"] = "default".

  • "default" keeps current behaviour (tabpfn library defaults).
  • "none" disables X/y preprocessing entirely (PREPROCESS_TRANSFORMS=[PreprocessorConfig("none")], REGRESSION_Y_PREPROCESS_TRANSFORMS=[None]).
  • "squashing_scaler" uses squashing_scaler_max10 + svd_quarter_components followed by a numeric "none" config.

An explicit inference_config in tabpfn_model_config still wins over the preset; a warning is emitted if both are supplied.
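
The precedence rule above can be sketched in isolation. This is a minimal illustration under stated assumptions, not the code in this PR: plain strings stand in for tabpfn's `PreprocessorConfig` objects, and `PRESETS_SKETCH`, `build_preset_sketch`, and `apply_preset_sketch` are hypothetical names chosen for the example.

```python
import warnings
from typing import Any, Optional

# Placeholder payloads (assumption): the real presets hold tabpfn
# PreprocessorConfig objects, which are replaced here by plain strings.
PRESETS_SKETCH: dict[str, Optional[dict[str, Any]]] = {
    "default": None,  # None means "leave the library defaults untouched"
    "none": {
        "PREPROCESS_TRANSFORMS": ["none"],
        "REGRESSION_Y_PREPROCESS_TRANSFORMS": [None],
    },
    "squashing_scaler": {
        "PREPROCESS_TRANSFORMS": [
            "squashing_scaler_max10 + svd_quarter_components",
            "none",
        ],
    },
}


def build_preset_sketch(preset: str) -> Optional[dict[str, Any]]:
    """Return a fresh inference-config dict for a preset, or None for "default"."""
    if preset not in PRESETS_SKETCH:
        raise ValueError(f"Unknown preprocessing preset: {preset!r}")
    cfg = PRESETS_SKETCH[preset]
    return None if cfg is None else dict(cfg)


def apply_preset_sketch(model_config: dict[str, Any], preset: str) -> dict[str, Any]:
    """Merge a preset into a copy of model_config; an explicit inference_config wins."""
    merged = dict(model_config)  # top-level copy, the caller's dict is not modified
    preset_cfg = build_preset_sketch(preset)
    if preset_cfg is None:
        return merged
    if "inference_config" in merged:
        warnings.warn(
            "Both a preset and an explicit inference_config were provided; "
            "keeping the explicit inference_config.",
            stacklevel=2,
        )
        return merged
    merged["inference_config"] = preset_cfg
    return merged
```

With this sketch, `apply_preset_sketch({}, "none")` returns a new dict carrying the preset under `"inference_config"`, while passing a dict that already has an `"inference_config"` key returns that value unchanged and emits a single warning.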

Motivation

Empirical sweep on fev-bench (3 checkpoints × 2 splits × 3 preprocs, 846 per-task results): on the non-lite/small-datasets split, "none" and "squashing_scaler" both give a +0.05 SQL skill boost over defaults for both the library default checkpoint and OOD-finetuned variants. On the lite split the library defaults usually win (small regression from removing preprocessing). Worth exposing as a user-facing knob.

Changes

  • tabpfn_time_series/preprocessing_presets.py (new) — PreprocessingPreset type + build_preprocessing_inference_config() helper.
  • tabpfn_time_series/pipeline.py — add preprocessing kwarg + _apply_preprocessing_preset() merging logic.
  • tests/test_preprocessing_presets.py — 6 tests covering the helper and the pipeline integration (both with and without user-supplied inference_config).

Test plan

  • New unit tests pass (pytest tests/test_preprocessing_presets.py).
  • Existing test_preprocessing.py, test_predictor.py still pass.
  • test_pipeline.py::test_predict_client_mode requires TabPFN-cloud credentials (prompts for login interactively), unrelated to this change.
  • User-facing example in README / quickstart notebook — left for follow-up.

🤖 Generated with Claude Code

Exposes tabpfn's inference-time X/y preprocessing pipeline via a simple
string enum on TabPFNTSPipeline. Today, changing it requires threading
`inference_config` through `tabpfn_model_config` with PreprocessorConfig
objects imported from a nested tabpfn module — not discoverable.

New kwarg: preprocessing in {"default", "none", "squashing_scaler"}.

- "default" keeps current behaviour (tabpfn library defaults).
- "none" disables X/y preprocessing entirely
  (PREPROCESS_TRANSFORMS=[none], REGRESSION_Y_PREPROCESS_TRANSFORMS=[None]).
- "squashing_scaler" uses squashing_scaler_max10 + svd_quarter_components
  followed by a numeric "none" config.

Empirically on the fev-bench small/non-lite split, "none" and
"squashing_scaler" give a +0.05 SQL skill boost over defaults for both
the library default checkpoint and OOD-finetuned variants. On the
fev-bench lite split the defaults usually win (small regression from
removing preprocessing). Exposing the knob lets users try both.

An explicit `inference_config` in `tabpfn_model_config` still wins over
the preset; a warning is emitted if both are supplied.

CLAassistant commented Apr 22, 2026

CLA assistant check
All committers have signed the CLA.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a preprocessing parameter to the TabPFNTSPipeline constructor, allowing users to select from predefined inference-time preprocessing presets. These presets are defined in a new preprocessing_presets module. A review comment pointed out that the _apply_preprocessing_preset method should consistently return a copy of the configuration dictionary to prevent accidental mutation of the global default settings and to match the method's docstring.

Comment on lines +291 to +304
        preset_cfg = build_preprocessing_inference_config(preprocessing)
        if preset_cfg is None:
            return tabpfn_model_config
        if "inference_config" in tabpfn_model_config:
            # User-supplied inference_config takes precedence; warn so the
            # mismatch between kwargs is discoverable.
            warnings.warn(
                "Both `preprocessing` and `tabpfn_model_config['inference_config']` "
                "were provided. Using the explicit `inference_config` from "
                "`tabpfn_model_config` and ignoring the preset.",
                stacklevel=3,
            )
            return tabpfn_model_config
        return {**tabpfn_model_config, "inference_config": preset_cfg}

medium

The docstring for _apply_preprocessing_preset states that it injects the configuration into a copy of the input, but the implementation returns the original tabpfn_model_config object in two branches (lines 293 and 303). Since the default value for tabpfn_model_config in the constructor is a shared global dictionary (TABPFN_DEFAULT_CONFIG), returning it directly can lead to accidental mutation of the default configuration if the predictor or other components modify it in-place.

Always returning a copy keeps top-level changes to the returned dictionary from reaching the shared default and makes the behaviour consistent with the docstring.

        config = tabpfn_model_config.copy()
        preset_cfg = build_preprocessing_inference_config(preprocessing)
        if preset_cfg is None:
            return config
        if "inference_config" in config:
            # User-supplied inference_config takes precedence; warn so the
            # mismatch between kwargs is discoverable.
            warnings.warn(
                "Both `preprocessing` and `tabpfn_model_config['inference_config']` "
                "were provided. Using the explicit `inference_config` from "
                "`tabpfn_model_config` and ignoring the preset.",
                stacklevel=3,
            )
            return config
        config["inference_config"] = preset_cfg
        return config
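
The aliasing risk described in this comment can be reproduced without the library. The following is a minimal sketch: `make_default` is a hypothetical stand-in for `TABPFN_DEFAULT_CONFIG`, and its contents are made up for illustration. It also shows that `dict.copy()` is a shallow copy, so it protects top-level keys only.

```python
import copy


def make_default():
    """Hypothetical stand-in for a shared default config (contents are made up)."""
    return {"n_estimators": 8, "extra": {"flag": True}}


# 1. Returning the input unchanged hands out an alias to the shared dict.
shared = make_default()
alias = shared
alias["inference_config"] = {}
alias_leaks = "inference_config" in shared

# 2. A shallow copy protects top-level keys...
shared = make_default()
top_copy = shared.copy()
top_copy["inference_config"] = {}
top_level_safe = "inference_config" not in shared

# ...but nested objects are still shared with the original.
top_copy["extra"]["flag"] = False
nested_leaks = shared["extra"]["flag"] is False

# 3. A deep copy isolates nested objects as well.
shared = make_default()
deep = copy.deepcopy(shared)
deep["extra"]["flag"] = False
deep_safe = shared["extra"]["flag"] is True

print(alias_leaks, top_level_safe, nested_leaks, deep_safe)
```

All four flags come out `True`. Whether the shallow copy in the suggestion is enough depends on whether downstream code only adds or replaces top-level keys; if nested values are mutated in-place, `copy.deepcopy` would be the safer choice.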
