# Refactor preprocessing logic for Chronos-2 #493

shchur wants to merge 5 commits into `amazon-science:main` from
Conversation
```python
df: "pd.DataFrame",
target_columns: list[str],
prediction_length: int,
future_df: "pd.DataFrame | None" = None,
```
We could also introduce an optional kwarg here (e.g. `future_covariates_names: list[str] | None`) to specify the names of known covariates during training, when `future_df` is unavailable.

Same for `from_dict_list`, to remove the need for the `{col: None for col in future_covariates_names}` hack there.
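A minimal sketch of how such a kwarg could absorb that hack inside the preprocessor (the function and parameter names here are hypothetical, not the actual Chronos-2 API):

```python
def resolve_future_covariates(
    entry: dict,
    future_covariates_names: "list[str] | None" = None,
) -> dict:
    """Hypothetical helper: ensure future-covariate keys exist on an entry.

    During training no future values are known, so instead of requiring the
    caller to pass `{col: None for col in names}`, the preprocessor fills in
    the None placeholders itself when only the names are given.
    """
    if future_covariates_names is None:
        return entry
    filled = dict(entry)
    for col in future_covariates_names:
        # Placeholder: values for this covariate are unknown at train time
        filled.setdefault(col, None)
    return filled

# Caller no longer needs the dict-comprehension hack:
item = {"target": [1.0, 2.0, 3.0], "price": [9.9, 9.9, 9.9]}
prepared = resolve_future_covariates(item, future_covariates_names=["price", "promo"])
# "promo" is added with None; "price" keeps its historical values
```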
**abdulfatir** left a comment:
Thanks @shchur!

Overall, I don't have major concerns. The main thing we need to accept is that there will now be multiple validation paths that must be maintained. However, the major ones are `list_of_dicts` and `dataframe`, so it should not be too difficult to keep them tidy.
```python
...

def from_dict_list(
```
Maybe `list_of_dicts`, to align with the naming elsewhere?
```
- Unseen (item, category) pairs get the item mean as fallback (via smoothing formula)
- Completely unseen categories in future (cat_code=-1) get the item mean
```
What's the difference between these two?
```
- future_id_codes (if provided) are valid item IDs that appear in id_codes
- future_cat_codes may contain -1 for unseen categories (encoded as NaN)

Edge cases
```
What about NaNs in the categories (not in target)?
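For context, the quoted docstring says future labels unseen at training time get code -1. A minimal sketch of that convention, in which missing/NaN labels are also mapped to -1 (one possible answer to the open question about NaNs; the actual implementation may differ):

```python
import math

def encode_categories(train_values, future_values):
    """Illustrative sketch: learn integer codes from training labels;
    future labels never seen during training, and missing/NaN labels,
    get code -1."""
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    vocab = {}
    for v in train_values:
        if not is_missing(v) and v not in vocab:
            vocab[v] = len(vocab)
    train_codes = [-1 if is_missing(v) else vocab[v] for v in train_values]
    future_codes = [-1 if is_missing(v) else vocab.get(v, -1) for v in future_values]
    return train_codes, future_codes

train_codes, future_codes = encode_categories(
    ["food", "toys", "food"], ["food", "books", float("nan")]
)
# "books" was never seen and NaN is missing, so both map to -1
```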
### Problem

Converting raw data to `list[PreparedInput]` is extremely slow when categorical covariates are present. For M5 (30K items, 50M rows, 7 categorical columns), preprocessing takes ~8 minutes. The bottlenecks are `validate_df_inputs` and `validate_and_prepare_single_dict_input`.

### Proposed design
New module `chronos/chronos2/preprocess.py` with four entry points:

- `from_tensor()`
- `from_tensor_list()`
- `from_dataframe()`
- `from_dict_list()`

Key changes:

- `_target_encode()` uses `bincount` across all items at once instead of per-item sklearn calls

### Breaking changes
- `list[dict]` inputs (all items must have the same structure)

### Expected speedup

~20x faster on the M5 dataset (8 min → ~25 s, based on earlier prototyping)
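For reference, the `bincount`-based target encoding mentioned under the key changes can be sketched as follows. This is an illustrative prototype, not the actual `_target_encode()` (its signature and the `smoothing` parameter are assumptions here); the point is that two `np.bincount` calls replace a per-item loop with per-item sklearn encoder calls:

```python
import numpy as np

def target_encode_bincount(cat_codes, target, smoothing=10.0):
    """Smoothed target encoding for all categories in one vectorized pass.

    Assumes non-negative integer category codes. Counts and target sums are
    accumulated for every category at once via bincount, then blended with
    the global mean using additive smoothing.
    """
    cat_codes = np.asarray(cat_codes)
    target = np.asarray(target, dtype=np.float64)
    n_cats = int(cat_codes.max()) + 1
    counts = np.bincount(cat_codes, minlength=n_cats)                # observations per category
    sums = np.bincount(cat_codes, weights=target, minlength=n_cats)  # target sum per category
    global_mean = target.mean()
    # Rare categories are pulled toward the global mean; when counts dominate
    # the smoothing term, the encoding approaches the plain per-category mean.
    per_cat = (sums + smoothing * global_mean) / (counts + smoothing)
    return per_cat[cat_codes]
```

With `smoothing=0` this reduces to each row receiving its category's mean target; as `smoothing` grows, all encodings shrink toward the global mean.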
This PR contains only the design (signatures + docstrings). Implementation to follow.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.