Skip to content

Commit a04d8e4

Browse files
committed
add more text
1 parent 6003103 commit a04d8e4

1 file changed

Lines changed: 58 additions & 11 deletions

File tree

content/python_files/feature_engineering.py

Lines changed: 58 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -1168,6 +1168,18 @@ def splitter(X, y, index_generator):
11681168
# %% [markdown]
11691169
#
11701170
# ## Predicting multiple horizons with a multi-output model
1171+
#
1172+
# Usually, it is really common to predict values for multiple horizons at once. The most
1173+
# naive approach is to train as many models as there are horizons. To achieve this,
1174+
# scikit-learn provides a meta-estimator called `MultiOutputRegressor` that can be used
1175+
# to train a single model that predicts multiple horizons at once.
1176+
#
1177+
# In short, we only need to provide multiple targets where each column corresponds to
1178+
# an horizon and this meta-estimator will train an independent model for each column.
1179+
# However, we could expect that the quality of the forecast might degrade as the horizon
1180+
# increases.
1181+
#
1182+
# Let's train a gradient boosting regressor for each horizon.
11711183

11721184
# %%
11731185
from sklearn.multioutput import MultiOutputRegressor
@@ -1179,6 +1191,11 @@ def splitter(X, y, index_generator):
11791191
y=targets.skb.drop(cols=["prediction_time", "load_mw"]).skb.mark_as_y(),
11801192
).skb.set_name("multioutput_hgbr")
11811193

1194+
# %% [markdown]
1195+
#
1196+
# Now, let's just rename the columns for the predictions to make it easier to plot
1197+
# the horizon forecast.
1198+
11821199
# %%
11831200
target_column_names = [target_column_name_pattern.format(horizon=h) for h in horizons]
11841201
predicted_target_column_names = [
@@ -1188,6 +1205,11 @@ def splitter(X, y, index_generator):
11881205
{k: v for k, v in zip(target_column_names, predicted_target_column_names)}
11891206
)
11901207

1208+
# %% [markdown]
1209+
#
1210+
# Let's plot the horizon forecast on a training data to check the validity of the
1211+
# output.
1212+
11911213
# %%
11921214
plot_at_time = datetime.datetime(2021, 4, 19, 0, 0, tzinfo=datetime.timezone.utc)
11931215
historical_timedelta = datetime.timedelta(hours=24 * 5)
@@ -1199,15 +1221,22 @@ def splitter(X, y, index_generator):
11991221
target_column_name_pattern,
12001222
).skb.preview()
12011223

1202-
# %%
1203-
plot_at_time = datetime.datetime(2021, 4, 20, 0, 0, tzinfo=datetime.timezone.utc)
1204-
plot_horizon_forecast(
1205-
targets,
1206-
named_predictions,
1207-
plot_at_time,
1208-
historical_timedelta,
1209-
target_column_name_pattern,
1210-
).skb.preview()
1224+
# %% [markdown]
1225+
#
1226+
# On this curve, the red line corresponds to the observed values past to the the date
1227+
# for which we would like to forecast. The orange line corresponds to the observed
1228+
# values for the next 24 hours and the blue line corresponds to the predicted values
1229+
# for the next 24 hours.
1230+
#
1231+
# Since we are using a strong model and very few training data to check the validity
1232+
# we observe that our model perfectly fits the training data.
1233+
#
1234+
# So, we are now ready to assess the performance of this multi-output model and we need
1235+
# to cross-validate it. Since we do not want to aggregate the metrics for the different
1236+
# horizons, we need to create a scikit-learn scorer in which we set
1237+
# `multioutput="raw_values"` to get the scores for each horizon.
1238+
#
1239+
# Passing this scorer to the `cross_validate` function returns all horizons scores.
12111240

12121241
# %%
12131242
from sklearn.metrics import r2_score
@@ -1236,10 +1265,20 @@ def scoring(regressor, X, y):
12361265
return_train_score=True,
12371266
verbose=1,
12381267
n_jobs=-1,
1239-
).round(3)
1268+
)
1269+
1270+
# %% [markdown]
1271+
#
1272+
# One thing that we observe is that training such multi-output model is expensive. It is
1273+
# expected since each horizon involves a different model and thus a training.
12401274

12411275
# %%
1242-
multioutput_cv_results
1276+
multioutput_cv_results.round(3)
1277+
1278+
# %% [markdown]
1279+
#
1280+
# Instead of reading the results in the table, we can plot the scores depending on the
1281+
# type of data and the metric.
12431282

12441283
# %%
12451284
import itertools
@@ -1281,6 +1320,10 @@ def scoring(regressor, X, y):
12811320

12821321
# %% [markdown]
12831322
#
1323+
# An interesting and unexpected observation is that the MAPE error on the test
1324+
# data is first increases and then decreases once past the horizon 18h. We
1325+
# would not necessarily expect this behaviour.
1326+
#
12841327
# ## Native multi-output handling using `RandomForestRegressor`
12851328
#
12861329
# In the previous section, we showed how to wrap a `HistGradientBoostingRegressor`
@@ -1410,6 +1453,10 @@ def scoring(regressor, X, y):
14101453
# %% [markdown]
14111454
#
14121455
# ## Uncertainty quantification using quantile regression
1456+
#
1457+
# In this section, we show how one can use a gradient boosting but modify the loss
1458+
# function to predict different quantiles and thus obtain an uncertainty quantification
1459+
# of the predictions.
14131460

14141461
# %%
14151462
from sklearn.metrics import d2_pinball_score

0 commit comments

Comments
 (0)