Add partition_by support for lag transforms #609
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e62aa03ee5
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Codex Review: Didn't find any major issues. Keep it up!
simonez-tuidi left a comment
I think there are some inconsistent results when using the partition_by aggregations in global_ or groupby mode; it probably has to do with how the operations are combined. The partition_by results all look good when executed in the default mode.
Overall, great work, thanks so much @janrth!
("a", 1): np.nan,
("a", 2): 1.0,
("a", 3): 10.0,
("a", 4): 11.5,
@janrth I think the value for the 4th timestamp here should be the average of all previous values that have promo=True across all series, i.e. the average of [1, 2, 20], which is 7.67
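A quick numeric check of the values quoted in this comment (not code from the PR):

```python
import numpy as np

# Past promo=True values across all series, per the comment above
past_true = np.array([1.0, 2.0, 20.0])
print(round(past_true.mean(), 2))  # 7.67
```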
@simonez-tuidi thx for your close review!! super helpful. I will try to check all your comments asap :)
"promo": [True, True, False, True, False, True, False, True],
}
if include_brand:
    data["brand"] = ["x"] * 8
Might be worth expanding the test coverage to include more timestamps and scenarios where there's more than one value in the groupby column, e.g. two brands.
("a", 1): np.nan,
("a", 2): 1.0,
("a", 3): 10.0,
("a", 4): 11.5,
Same as for the groupby case
col = tfm._get_name(1)
np.testing.assert_allclose(
    features.loc[features["step"].eq(0), col].to_numpy(),
    np.array([np.nan, 30.0]),
Shouldn't the result here be [3.0, 30.0]?
The third value of series a has promo=False and y=3, so it's the only one used in the ExpandingMean.
For series b, 30.0 is the mean of [20, 40], which have promo=True, so it's correct.
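Checking the two expected entries from the lists quoted in this comment (a quick sanity check, not part of the PR's test suite):

```python
import numpy as np

# series a: the only past promo=False value is y=3
# series b: past promo=True values are [20, 40]
expected = np.array([np.mean([3.0]), np.mean([20.0, 40.0])])  # [3.0, 30.0]
```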
)
np.testing.assert_allclose(
    features.loc[features["step"].eq(1), col].to_numpy(),
    np.array([2.3333333333333335, 20.0]),
col = tfm._get_name(1)
np.testing.assert_allclose(
    features.loc[features["step"].eq(0), col].to_numpy(),
    np.array([21.5, 22.333333333333332]),
Step 5 of series a has promo=False, so all previous False values are [3, 10, 30] -> mean = 14.33
Step 5 of series b has promo=True, so all previous True values are [1, 2, 4, 20, 40] -> mean = 13.4
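Checking the means of the lists quoted in this comment (note that [1, 2, 4, 20, 40] averages to 13.4):

```python
import numpy as np

# step 5, series a (promo=False): previous False values
mean_a = np.mean([3.0, 10.0, 30.0])            # ≈ 14.33
# step 5, series b (promo=True): previous True values
mean_b = np.mean([1.0, 2.0, 4.0, 20.0, 40.0])  # 13.4
```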
col = tfm._get_name(1)
np.testing.assert_allclose(
    feats[col].to_numpy(),
    np.array([2.3333333333333335, 20.0]),
(
    ExpandingMean(groupby=["brand"], partition_by=["promo"]),
    {"static_features": ["brand"]},
    np.array([29.25, 16.0]),
Step 6 of series a has promo=True, so all previous True values across series are [1, 2, 4, 20, 40, 50] -> mean = 19.5
Step 6 of series b has promo=False, so all previous False values are [3, 5, 10, 30] -> mean = 12
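Checking the means of the lists quoted in this comment (a quick check, not part of the PR):

```python
import numpy as np

# step 6, series a (promo=True): previous True values across series
mean_a = np.mean([1.0, 2.0, 4.0, 20.0, 40.0, 50.0])  # 19.5
# step 6, series b (promo=False): previous False values
mean_b = np.mean([3.0, 5.0, 10.0, 30.0])             # 12.0
```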
This PR adds partition_by support to lag/rolling transforms.
partition_by enables SQL-like PARTITION BY behavior for lag transforms: features are still computed in time order, but only using past observations that match the current row on one or more partition columns.
This is useful for regime-aware forecasting features, e.g. an expanding mean computed only over past observations that share the current row's promo flag.
All three execution modes are supported:
local: partition within each unique_id
groupby=[...]: aggregate across series within each (groupby, partition_by) bucket
global_=True: aggregate globally within each partition_by bucket
Integrated the new behavior across:
fit_transform
recursive/direct predict
update
cross_validation
transfer-learning predict(new_df=...)
AutoMLForecast
Added tests covering core functionality, update behavior, CV, transfer learning, and AutoML integration.
Usage example:
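The snippet itself is not shown above. Based on the API exercised in the tests (`ExpandingMean(groupby=["brand"], partition_by=["promo"])`), a hedged sketch of what usage might look like — the surrounding MLForecast setup (freq, empty model list, column names) is illustrative, not taken from the PR, and partition_by only exists on this branch:

```python
import pandas as pd
from mlforecast import MLForecast
from mlforecast.lag_transforms import ExpandingMean

# df: long-format panel with unique_id, ds, y, a static brand column,
# and a boolean promo column (illustrative names).
fcst = MLForecast(
    models=[],  # preprocess-only here; pass real models to train
    freq=1,     # integer timestamps in this sketch
    lags=[1],
    lag_transforms={
        # Expanding mean of lag 1, pooled across series that share a brand,
        # restricted to past rows whose promo value matches the current row.
        1: [ExpandingMean(groupby=["brand"], partition_by=["promo"])],
    },
)
features = fcst.preprocess(df, static_features=["brand"])
```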
Closes #587
Checklist: