Refactor preprocessing logic for Chronos-2 #493

Draft
shchur wants to merge 5 commits into amazon-science:main from shchur:prepare-inputs-refactor

@shchur shchur commented Apr 28, 2026

Problem

Converting raw data to list[PreparedInput] is extremely slow when categorical covariates are present. For M5 (30K items, 50M rows, 7 categorical columns), preprocessing takes ~8 minutes. The bottlenecks are:

  1. Per-item target encoding loop using sklearn
  2. Redundant validation across validate_df_inputs and validate_and_prepare_single_dict_input

Proposed design

New module chronos/chronos2/preprocess.py with four entry points:

| Entry point | Input format |
|---|---|
| from_tensor() | 3D tensor |
| from_tensor_list() | list of 1D/2D tensors |
| from_dataframe() | pd.DataFrame + optional future_df |
| from_dict_list() | list[dict] |

Key changes:

  • Direct tensor conversion: no intermediate dict representation, no unnecessary validation
  • Vectorized target encoding: _target_encode() uses bincount across all items at once instead of per-item sklearn calls
  • Single validation point: each entry point validates once, internal methods assume valid input
  • Removes sklearn dependency for encoding
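
To make the bincount idea concrete, here is a minimal sketch of smoothed target encoding computed for all items in one pass. It is not the implementation from this PR (which contains only signatures and docstrings): the function names, the argument layout, the use of NumPy rather than torch, and the specific smoothing formula `(sum + m * item_mean) / (count + m)` are all assumptions made for illustration.

```python
import numpy as np


def target_encode_sketch(id_codes, cat_codes, target, n_items, n_cats, smoothing=1.0):
    """Hypothetical helper: smoothed per-item target encoding without a Python loop.

    id_codes / cat_codes are integer codes per row; cat_codes may contain -1.
    smoothing must be > 0 so that unseen (item, category) pairs are well defined.
    """
    id_codes = np.asarray(id_codes)
    cat_codes = np.asarray(cat_codes)
    target = np.asarray(target, dtype=float)

    # Per-item mean over rows with an observed target.
    has_y = ~np.isnan(target)
    item_sum = np.bincount(id_codes[has_y], weights=target[has_y], minlength=n_items)
    item_cnt = np.bincount(id_codes[has_y], minlength=n_items)
    item_mean = item_sum / np.maximum(item_cnt, 1)

    # Per-(item, category) statistics: flatten the pair into a single index
    # so that one bincount call covers every item at once.
    ok = has_y & (cat_codes >= 0)
    flat = id_codes[ok] * n_cats + cat_codes[ok]
    size = n_items * n_cats
    pair_sum = np.bincount(flat, weights=target[ok], minlength=size).reshape(n_items, n_cats)
    pair_cnt = np.bincount(flat, minlength=size).reshape(n_items, n_cats)

    # Assumed smoothing formula: a pair with count 0 collapses to the item mean.
    table = (pair_sum + smoothing * item_mean[:, None]) / (pair_cnt + smoothing)
    return table, item_mean


def lookup_encoding(table, item_mean, id_codes, cat_codes):
    """Map (item, category) codes to encodings; code -1 falls back to the item mean."""
    id_codes = np.asarray(id_codes)
    cat_codes = np.asarray(cat_codes)
    safe_cats = np.where(cat_codes >= 0, cat_codes, 0)
    encoded = table[id_codes, safe_cats]
    return np.where(cat_codes >= 0, encoded, item_mean[id_codes])
```

Under these assumptions, both fallback cases end at the same value through different routes: a pair that never occurred in the history has count 0 and reduces to the item mean through the formula, while a `-1` code is handled by the explicit branch in `lookup_encoding`.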

Breaking changes

  • Drops support for heterogeneous list[dict] inputs (all items must have same structure)
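
As an illustration of what "same structure" could mean in practice, the sketch below checks a list of dicts once, up front, and rejects heterogeneous input. The helper name is hypothetical, and the choice to compare only covariate names under the keys "past_covariates" and "future_covariates" is an assumption; the PR does not spell out the exact rule.

```python
def check_homogeneous(inputs):
    """Hypothetical single validation pass for list[dict] inputs.

    Assumes each dict may carry "past_covariates" and "future_covariates"
    mappings and treats the sorted covariate names as the item's structure.
    Returns the shared structure or raises ValueError.
    """
    if not inputs:
        raise ValueError("inputs must be a non-empty list of dicts")

    def signature(item):
        # A missing key and an explicit None are both treated as "no covariates".
        past = item.get("past_covariates") or {}
        future = item.get("future_covariates") or {}
        return tuple(sorted(past)), tuple(sorted(future))

    reference = signature(inputs[0])
    for index, item in enumerate(inputs[1:], start=1):
        if signature(item) != reference:
            raise ValueError(
                f"item {index} has a different covariate structure than item 0"
            )
    return reference
```

Once a check like this has passed, downstream code can stack covariates into tensors by position without re-validating each item.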

Expected speedup

~20x faster on the M5 dataset (8 min → ~25 s, based on earlier prototyping)


This PR contains only the design (signatures + docstrings). Implementation to follow.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@shchur shchur marked this pull request as draft April 28, 2026 08:48
df: "pd.DataFrame",
target_columns: list[str],
prediction_length: int,
future_df: "pd.DataFrame | None" = None,

@shchur (author) commented:

We could also introduce an optional kwarg here (e.g. future_covariates_names: list[str] | None) to specify the names of known covariates during training when future_df is unavailable.

Same for from_dict_list, to remove the need for the {col: None for col in future_covariates_names} hack there.
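
A rough sketch of how such a kwarg could sit alongside the existing workaround is shown here. The helper name and its behaviour are hypothetical and only illustrate the suggestion; nothing like it exists in the PR yet.

```python
def known_covariate_names(future_covariates=None, future_covariates_names=None):
    """Hypothetical resolution of known-covariate names at training time.

    future_covariates: mapping of name -> values (values may be None, as in
        the {col: None for col in ...} workaround).
    future_covariates_names: explicit list of names, as proposed in the comment.
    """
    if future_covariates is not None and future_covariates_names is not None:
        if set(future_covariates) != set(future_covariates_names):
            raise ValueError("future_covariates and future_covariates_names disagree")
    if future_covariates is not None:
        return sorted(future_covariates)
    return sorted(future_covariates_names or [])


# The workaround and the proposed kwarg describe the same information.
names = ["promo", "price"]
via_hack = known_covariate_names(future_covariates={col: None for col in names})
via_kwarg = known_covariate_names(future_covariates_names=names)
```

With an explicit kwarg, callers would no longer need to build a dict of placeholder `None` values just to convey the column names.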

@shchur shchur requested a review from abdulfatir April 28, 2026 09:37

@abdulfatir abdulfatir left a comment

Thanks @shchur!

Overall, I don't have major concerns. The main thing we need to be fine with is that now there will be multiple validation paths which must be maintained. However, I think the major ones are list_of_dicts and dataframe, so it should not be too difficult to keep them tidy.

...


def from_dict_list(

Maybe list_of_dicts to align with the naming elsewhere?

Comment on lines +264 to +265
- Unseen (item, category) pairs get the item mean as fallback (via smoothing formula)
- Completely unseen categories in future (cat_code=-1) get the item mean

What's the difference between these two?

- future_id_codes (if provided) are valid item IDs that appear in id_codes
- future_cat_codes may contain -1 for unseen categories (encoded as NaN)

Edge cases

What about NaNs in the categories (not in target)?
