# Refactor preprocessing logic for Chronos-2 #493

shchur wants to merge 5 commits into `amazon-science:main` from
Conversation
```python
df: "pd.DataFrame",
target_columns: list[str],
prediction_length: int,
future_df: "pd.DataFrame | None" = None,
```
We could also introduce an optional kwarg here (e.g. `future_covariates_names: list[str] | None`) to specify the names of known covariates during training, when `future_df` is unavailable.

Same for `from_dict_list`, to remove the need for the `{col: None for col in future_covariates_names}` hack there.
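A minimal sketch of how such a kwarg could absorb that hack inside the preprocessor (the function and parameter names here are hypothetical, not the actual Chronos-2 API):

```python
def resolve_future_covariates(
    entry: dict,
    future_covariates_names: "list[str] | None" = None,
) -> dict:
    """Hypothetical helper: ensure future-covariate keys exist on an entry.

    During training no future values are known, so instead of requiring the
    caller to pass `{col: None for col in names}`, the preprocessor fills in
    the None placeholders itself when only the names are given.
    """
    if future_covariates_names is None:
        return entry
    filled = dict(entry)
    for col in future_covariates_names:
        # Placeholder: values for this covariate are unknown at train time
        filled.setdefault(col, None)
    return filled

# Caller no longer needs the dict-comprehension hack:
item = {"target": [1.0, 2.0, 3.0], "price": [9.9, 9.9, 9.9]}
prepared = resolve_future_covariates(item, future_covariates_names=["price", "promo"])
# "promo" is added with None; "price" keeps its historical values
```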
**abdulfatir** left a comment:
Thanks @shchur!

Overall, I don't have major concerns. The main thing we need to accept is that there will now be multiple validation paths that must be maintained. However, the major ones are `list_of_dicts` and `dataframe`, so it should not be too difficult to keep them tidy.
```python
...

def from_dict_list(
```
Maybe `list_of_dicts`, to align with the naming elsewhere?
```
- Unseen (item, category) pairs get the item mean as fallback (via smoothing formula)
- Completely unseen categories in future (cat_code=-1) get the item mean
```
What's the difference between these two?
```
- future_id_codes (if provided) are valid item IDs that appear in id_codes
- future_cat_codes may contain -1 for unseen categories (encoded as NaN)

Edge cases
```
What about NaNs in the categories (not in target)?
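For context, the quoted docstring says future labels unseen at training time get code -1. A minimal sketch of that convention, in which missing/NaN labels are also mapped to -1 (one possible answer to the open question about NaNs; the actual implementation may differ):

```python
import math

def encode_categories(train_values, future_values):
    """Illustrative sketch: learn integer codes from training labels;
    future labels never seen during training, and missing/NaN labels,
    get code -1."""
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    vocab = {}
    for v in train_values:
        if not is_missing(v) and v not in vocab:
            vocab[v] = len(vocab)
    train_codes = [-1 if is_missing(v) else vocab[v] for v in train_values]
    future_codes = [-1 if is_missing(v) else vocab.get(v, -1) for v in future_values]
    return train_codes, future_codes

train_codes, future_codes = encode_categories(
    ["food", "toys", "food"], ["food", "books", float("nan")]
)
# "books" was never seen and NaN is missing, so both map to -1
```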
### Problem

Converting raw data to `list[PreparedInput]` is extremely slow when categorical covariates are present. For M5 (30K items, 50M rows, 7 categorical columns), preprocessing takes ~8 minutes. The bottlenecks are `validate_df_inputs` and `validate_and_prepare_single_dict_input`.

### Proposed design
New module `chronos/chronos2/preprocess.py` with four entry points:

- `from_tensor()`
- `from_tensor_list()`
- `from_dataframe()`
- `from_dict_list()`

Key changes:

- `_target_encode()` uses `bincount` across all items at once instead of per-item sklearn calls

### Breaking changes
- `list[dict]` inputs (all items must have the same structure)

### Expected speedup

~20x faster on the M5 dataset (8 min → ~25 s, based on earlier prototyping)
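For reference, the `bincount`-based target encoding mentioned under the key changes can be sketched as follows. This is an illustrative prototype, not the actual `_target_encode()` (its signature and the `smoothing` parameter are assumptions here); the point is that two `np.bincount` calls replace a per-item loop with per-item sklearn encoder calls:

```python
import numpy as np

def target_encode_bincount(cat_codes, target, smoothing=10.0):
    """Smoothed target encoding for all categories in one vectorized pass.

    Assumes non-negative integer category codes. Counts and target sums are
    accumulated for every category at once via bincount, then blended with
    the global mean using additive smoothing.
    """
    cat_codes = np.asarray(cat_codes)
    target = np.asarray(target, dtype=np.float64)
    n_cats = int(cat_codes.max()) + 1
    counts = np.bincount(cat_codes, minlength=n_cats)                # observations per category
    sums = np.bincount(cat_codes, weights=target, minlength=n_cats)  # target sum per category
    global_mean = target.mean()
    # Rare categories are pulled toward the global mean; when counts dominate
    # the smoothing term, the encoding approaches the plain per-category mean.
    per_cat = (sums + smoothing * global_mean) / (counts + smoothing)
    return per_cat[cat_codes]
```

With `smoothing=0` this reduces to each row receiving its category's mean target; as `smoothing` grows, all encodings shrink toward the global mean.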
This PR contains only the design (signatures + docstrings). Implementation to follow.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.