@@ -1168,6 +1168,18 @@ def splitter(X, y, index_generator):
11681168# %% [markdown]
11691169#
11701170# ## Predicting multiple horizons with a multi-output model
1171+ #
1172+ # Usually, it is really common to predict values for multiple horizons at once. The most
1173+ # naive approach is to train as many models as there are horizons. To achieve this,
1174+ # scikit-learn provides a meta-estimator called `MultiOutputRegressor` that can be used
1175+ # to train a single model that predicts multiple horizons at once.
1176+ #
1177+ # In short, we only need to provide multiple targets where each column corresponds to
1178+ # an horizon and this meta-estimator will train an independent model for each column.
1179+ # However, we could expect that the quality of the forecast might degrade as the horizon
1180+ # increases.
1181+ #
1182+ # Let's train a gradient boosting regressor for each horizon.
11711183
11721184# %%
11731185from sklearn .multioutput import MultiOutputRegressor
@@ -1179,6 +1191,11 @@ def splitter(X, y, index_generator):
11791191 y = targets .skb .drop (cols = ["prediction_time" , "load_mw" ]).skb .mark_as_y (),
11801192).skb .set_name ("multioutput_hgbr" )
11811193
1194+ # %% [markdown]
1195+ #
1196+ # Now, let's just rename the columns for the predictions to make it easier to plot
1197+ # the horizon forecast.
1198+
11821199# %%
11831200target_column_names = [target_column_name_pattern .format (horizon = h ) for h in horizons ]
11841201predicted_target_column_names = [
@@ -1188,6 +1205,11 @@ def splitter(X, y, index_generator):
11881205 {k : v for k , v in zip (target_column_names , predicted_target_column_names )}
11891206)
11901207
1208+ # %% [markdown]
1209+ #
1210+ # Let's plot the horizon forecast on a training data to check the validity of the
1211+ # output.
1212+
11911213# %%
11921214plot_at_time = datetime .datetime (2021 , 4 , 19 , 0 , 0 , tzinfo = datetime .timezone .utc )
11931215historical_timedelta = datetime .timedelta (hours = 24 * 5 )
@@ -1199,15 +1221,22 @@ def splitter(X, y, index_generator):
11991221 target_column_name_pattern ,
12001222).skb .preview ()
12011223
1202- # %%
1203- plot_at_time = datetime .datetime (2021 , 4 , 20 , 0 , 0 , tzinfo = datetime .timezone .utc )
1204- plot_horizon_forecast (
1205- targets ,
1206- named_predictions ,
1207- plot_at_time ,
1208- historical_timedelta ,
1209- target_column_name_pattern ,
1210- ).skb .preview ()
1224+ # %% [markdown]
1225+ #
1226+ # On this curve, the red line corresponds to the observed values past to the the date
1227+ # for which we would like to forecast. The orange line corresponds to the observed
1228+ # values for the next 24 hours and the blue line corresponds to the predicted values
1229+ # for the next 24 hours.
1230+ #
1231+ # Since we are using a strong model and very few training data to check the validity
1232+ # we observe that our model perfectly fits the training data.
1233+ #
1234+ # So, we are now ready to assess the performance of this multi-output model and we need
1235+ # to cross-validate it. Since we do not want to aggregate the metrics for the different
1236+ # horizons, we need to create a scikit-learn scorer in which we set
1237+ # `multioutput="raw_values"` to get the scores for each horizon.
1238+ #
1239+ # Passing this scorer to the `cross_validate` function returns all horizons scores.
12111240
12121241# %%
12131242from sklearn .metrics import r2_score
@@ -1236,10 +1265,20 @@ def scoring(regressor, X, y):
12361265 return_train_score = True ,
12371266 verbose = 1 ,
12381267 n_jobs = - 1 ,
1239- ).round (3 )
1268+ )
1269+
1270+ # %% [markdown]
1271+ #
1272+ # One thing that we observe is that training such multi-output model is expensive. It is
1273+ # expected since each horizon involves a different model and thus a training.
12401274
12411275# %%
1242- multioutput_cv_results
1276+ multioutput_cv_results .round (3 )
1277+
1278+ # %% [markdown]
1279+ #
1280+ # Instead of reading the results in the table, we can plot the scores depending on the
1281+ # type of data and the metric.
12431282
12441283# %%
12451284import itertools
@@ -1281,6 +1320,10 @@ def scoring(regressor, X, y):
12811320
12821321# %% [markdown]
12831322#
1323+ # An interesting and unexpected observation is that the MAPE error on the test
1324+ # data is first increases and then decreases once past the horizon 18h. We
1325+ # would not necessarily expect this behaviour.
1326+ #
12841327# ## Native multi-output handling using `RandomForestRegressor`
12851328#
12861329# In the previous section, we showed how to wrap a `HistGradientBoostingRegressor`
@@ -1410,6 +1453,10 @@ def scoring(regressor, X, y):
14101453# %% [markdown]
14111454#
14121455# ## Uncertainty quantification using quantile regression
1456+ #
1457+ # In this section, we show how one can use a gradient boosting but modify the loss
1458+ # function to predict different quantiles and thus obtain an uncertainty quantification
1459+ # of the predictions.
14131460
14141461# %%
14151462from sklearn .metrics import d2_pinball_score
0 commit comments