MLflow for Darts implementation by jakubchlapek · Pull Request #3022 · unit8co/darts

jakubchlapek · 2026-02-18T14:11:27Z

Checklist before merging this PR:

Mentioned all issues that this PR fixes or addresses.
Summarized the updates of this PR under Summary.
Added an entry under Unreleased in the Changelog.

Addresses #2092 .

Summary

Provides a custom MLflow flavor for Darts on Darts' side. Supports autologging, logging, saving and loading of the models.
This PR focuses on the base MLflow integration, leaving serving of the models to be discussed in the future.

An example quickstart for the integration is included, but consider all of this a draft :)
The example code is in the .ipynb; the snippet below is a quick reproducible example:

import mlflow
import tempfile
import os
from darts.metrics.metrics import smape
from darts.utils.mlflow import load_model, autolog
from darts.models import NBEATSModel, LinearRegressionModel
from darts.datasets import AirPassengersDataset
from torchmetrics import MeanAbsoluteError

# temp file setup
tmpdir = tempfile.mkdtemp()
mlflow_db = os.path.join(tmpdir, "mlflow.db")
mlflow.set_tracking_uri(f"sqlite:///{mlflow_db}")
mlflow.set_experiment("darts-forecasting")

train, val = AirPassengersDataset().load().astype("float32").split_before(0.7)

# autologging - patches .fit() on all ForecastingModel subclasses.
# for PyTorch-based models, inject_per_epoch_callbacks injects a Lightning callback
# that automatically logs train/val loss and/or user-specified torch metrics at the end of each epoch.
autolog(
    log_models=True,
    log_params=True,
    log_training_metrics=True,
    log_validation_metrics=True,   # requires val_series in .fit()
    inject_per_epoch_callbacks=True, 
    extra_metrics=[smape],         # optional extra darts metric functions
)

with mlflow.start_run(run_name="nbeats") as run:
    model = NBEATSModel(
        input_chunk_length=24,
        output_chunk_length=12,
        torch_metrics=MeanAbsoluteError(),
    )
    # val_series is forwarded to Lightning's val_dataloaders;
    # autolog captures per-epoch val metrics via the injected callback
    model.fit(train, val_series=val, epochs=10)
    run_id = run.info.run_id


# regression/sklearn models work identically
with mlflow.start_run(run_name="linreg"):
    model = LinearRegressionModel(lags=12)
    model.fit(train)  # logs params + in-sample metrics

# load back from MLflow
loaded = load_model(f"runs:/{run_id}/model")
preds = loaded.predict(12, series=train)  # series must be passed because save_model saves with clean=True

# import shutil
# shutil.rmtree(tmpdir)

review-notebook-app · 2026-02-18T14:11:34Z

Check out this pull request on ReviewNB to see visual diffs & provide feedback on Jupyter Notebooks.

jakubchlapek · 2026-02-18T14:16:18Z

Hey @daidahao, adding this draft PR in the meantime so you and @dennisbader can have a look at what I have currently regarding the integration. There are still some decisions I am not too thrilled about and decisions to be made about the overall direction, but I'm happy to talk more about it during the meeting. Thanks for being so active for the library, really nice to be working together :)

dennisbader · 2026-05-21T07:21:37Z

Thanks everyone for all the work and the recent pushes to this PR 🚀
@mizeller could you give a quick summary of the current state and what is still missing before the PR can be finalized?

mizeller · 2026-05-27T08:21:48Z

Off the top of my head, the status of the MLflow PR:

historical forecasts / backtesting are patched now, i.e. metrics are logged correctly in both cases. tested
- w/ local/global forecasting, torch models
- w/ all backtest(reduction=XXX) flags
deprecated the managed_run flag in autolog()
reason: following discussion w/ @dennisbader we decided to enforce a "desired" way of using MLflow x Darts (& make our lives easier in the process)

TODO

so far I've always worked with only one timeseries object. the following cases should be handled in a user-friendly manner:

import numpy as np
from darts.datasets import AirPassengersDataset

series = AirPassengersDataset().load().astype(np.float32)
series_multiple = [series, series / 3.]
series_multivariate = series.stack(series / 3.)
series_multiple_multivariate = [series.stack(series / 3.), series.stack(series / 10.)]

there's a problem ("bug") w/ metrics logging. currently, a metric's name is used as the key in MLflow, which is generally fine. but e.g. for mase with different kwargs, it is only logged once (same key). solution: when passing metric_kwargs, augment the metric name used in MLflow, e.g.:

    model.backtest(
        series=series,
        historical_forecasts=hfc,
        last_points_only=False,
        metric=[darts_metrics.mape, darts_metrics.rmse, darts_metrics.ape, darts_metrics.mase, darts_metrics.mase],
        metric_kwargs=[{}, {}, {}, {"m": 1}, {"m": 2}],
        reduction=None,
    )
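
A minimal sketch of the key augmentation described above, not the code from this PR. The helper name `metric_key`, the `backtest` prefix and the `name_kwarg_value` layout are assumptions made for illustration; the character filter is a conservative guess at what MLflow accepts in metric keys.

```python
import re


def metric_key(metric_name, metric_kwargs=None, prefix="backtest"):
    """Hypothetical helper: build a key that stays unique across metric kwargs."""
    parts = [prefix, metric_name]
    # Sort the kwargs so the key does not depend on the order they were passed in.
    for kwarg, value in sorted((metric_kwargs or {}).items()):
        parts.append(f"{kwarg}_{value}")
    key = "_".join(parts)
    # Replace anything outside a conservative character set with an underscore.
    return re.sub(r"[^A-Za-z0-9_\-. /]", "_", key)


names = ["mape", "rmse", "ape", "mase", "mase"]
kwargs_list = [{}, {}, {}, {"m": 1}, {"m": 2}]
keys = [metric_key(name, kw) for name, kw in zip(names, kwargs_list)]
print(keys)
```

With this layout the two mase calls end up under different keys, so neither overwrites the other.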

ensure in the multiple series case, the results are usable; the plots will probably explode with very long lists of timeseries.
(@dennisbader I think we talked about more todos/permutation of input params but I can't recall exactly which ones)

Also, I believe the TODO regarding metrics/kwargs was implemented in the most recent commit by @jakubchlapek - very cool! :)

jakubchlapek

Hey, looks nice @mizeller, just a few comments on the historical forecasts. The hfcs solution is nice.

jakubchlapek · 2026-05-26T13:16:37Z

+        if metric is None:
+            try:
+                sig = inspect.signature(original)
+                bound = sig.bind(self, *args, **kwargs)
+                bound.apply_defaults()
+                metric = bound.arguments.get("metric")
+            except Exception:
+                pass


I'd say we can remove this. I don't believe anyone will pass metrics positionally, and it adds unnecessary complexity to the code (the default mape will still be covered by the else branch).
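
For context, a small standalone sketch of what the quoted block does: `inspect.signature(...).bind(...)` maps positional and keyword arguments onto parameter names, and `apply_defaults()` fills in anything that was omitted. The `backtest` function and `resolve_metric` helper below are hypothetical stand-ins, not the patched Darts method.

```python
import inspect


def backtest(self, series, metric="mape", reduction=None):
    """Hypothetical stand-in for the patched method; only the signature matters."""


def resolve_metric(original, self_obj, args, kwargs):
    """Return `metric` whether it was passed by keyword, by position, or omitted."""
    sig = inspect.signature(original)
    bound = sig.bind(self_obj, *args, **kwargs)
    bound.apply_defaults()
    return bound.arguments.get("metric")


model = object()
by_keyword = resolve_metric(backtest, model, ("series",), {"metric": "rmse"})
by_position = resolve_metric(backtest, model, ("series", "mase"), {})
by_default = resolve_metric(backtest, model, ("series",), {})
print(by_keyword, by_position, by_default)
```

Dropping the block means only the keyword and default cases are handled, which is the trade-off being proposed.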

jakubchlapek · 2026-05-26T13:19:32Z

+            # 2-D and higher: skip to keep MVP simple
+


i'd prefer to include this in the PR if possible :)

jakubchlapek · 2026-05-26T13:23:09Z

+        if isinstance(metric, (list | tuple)) and isinstance(result, list):
+            # multiple metrics → result is list[scalar_or_array], one per metric
+            for name, r in zip(names, result):
+                _log(f"backtest_{name}", r)
+        elif (
+            isinstance(metric, list | tuple)
+            and result_arr is not None
+            and result_arr.ndim == 1
+            and len(result_arr) == len(names)
+        ):
+            # multiple metrics with scalar reduction returned as a 1-D ndarray
+            # (e.g. np.mean/median/percentile) — log each as a separate scalar
+            for name, r in zip(names, result_arr):
+                autologging_client.log_metrics(
+                    run_id=run_id, metrics={f"backtest_{name}": float(r)}
+                )
+        elif result_arr is not None and result_arr.ndim == 2:
+            # (N_windows, N_metrics) ndarray — multi-metric + reduction=None
+            for col_i, name in enumerate(names[: result_arr.shape[1]]):
+                for step, val in enumerate(result_arr[:, col_i]):
+                    autologging_client.log_metrics(
+                        run_id=run_id,
+                        metrics={f"backtest_{name}": float(val)},
+                        step=step,
+                    )
+        elif isinstance(result, list):
+            # single metric, multiple series → result is list[scalar_or_array]
+            for s_i, r in enumerate(result):
+                _log(f"backtest_{names[0]}_{s_i}", r)
+        else:
+            _log(f"backtest_{names[0]}", result)


ideally we would like to also support multivariate series where we can log per component if no reduction (e.g. maybe [backtest_MAE_x, backtest_MAE_y]). I worry that this approach can then get a bit complex with all the branches. Maybe we can think about normalizing the result to a dataframe first which could simplify logging? Let me know what you think here

Yes, we should definitely support this. Can we somehow be smarter here about inferring what the output dimensions represent? Right now we only look at the output, which I think can be dangerous because the dimensions might not be what we think (depending on the reductions, ...). In theory we should be able to look at the metric kwargs and the metric signature defaults to know what the output dimension should be (I say should because in the end the metrics will try to unpack the final results if possible).

There are many kwargs and input types that affect the output shape:

metric kwargs:
- time/component/series_reduction/label_reduction: aggregates over an axis
- q and q_interval (for computation on quantile and quantile interval metrics): goes into the component dimension
- label (for classification): goes into the component dimension (I believe)

metric input series that also affect the output shape:
- series: either a single series or multiple
- series: either univariate or multivariate

If we could bring the metrics into an expected shape, that would make downstream logging safer.

Something like:

shape: (n series, n times, n components * q/q_interval/label)

The reduced dimensions would be of length 1.
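
A rough sketch of that normalization, assuming the caller already knows (from the metric kwargs) which axes were reduced and that the values are laid out in (series, time, component) C order. The function name `to_canonical_shape` and its flags are hypothetical.

```python
import numpy as np


def to_canonical_shape(values, n_series, n_times, n_components,
                       series_reduced=False, time_reduced=False,
                       component_reduced=False):
    """Reshape metric output to (n_series, n_times, n_components).

    Reduced axes get length 1. Raises instead of guessing when the size is off.
    """
    shape = (
        1 if series_reduced else n_series,
        1 if time_reduced else n_times,
        1 if component_reduced else n_components,
    )
    arr = np.asarray(values, dtype=float)
    expected = int(np.prod(shape))
    if arr.size != expected:
        raise ValueError(
            f"expected {expected} values for shape {shape}, got {arr.size}"
        )
    return arr.reshape(shape)


# Two series, time axis reduced, two components per series.
per_component = to_canonical_shape(
    [[0.1, 0.2], [0.3, 0.4]], 2, 12, 2, time_reduced=True
)
# Fully reduced scalar.
scalar = to_canonical_shape(
    0.25, 2, 12, 2,
    series_reduced=True, time_reduced=True, component_reduced=True,
)
print(per_component.shape, scalar.shape)
```

Downstream logging can then loop over a fixed three-axis array instead of branching on the raw output type.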

dennisbader · 2026-06-30T15:12:42Z

@@ -0,0 +1,6067 @@
+{


I could not find the logged model tags (model_type, dataset) in the model UI (not sure if I looked at the wrong places). Does it work as intended?

For autolog runs I find the tags

dennisbader · 2026-06-30T15:12:42Z

@@ -0,0 +1,6067 @@
+{


a (bullet) list of what is supported would be nice. Also mention backtesting, and anything that we forgot here

dennisbader · 2026-06-30T15:12:42Z

@@ -0,0 +1,6067 @@
+{


Line #3. with mlflow.start_run(run_name="linear-regression-autolog") as run:
When I inspect linear-regression-autolog run I find 2 logged models. Could this be a bug?

dennisbader · 2026-06-30T15:12:42Z

@@ -0,0 +1,6067 @@
+{


Line #9. auto_mape = darts.metrics.mape(val, auto_predictions)
In my opinion we should ignore the name of that actual_series (e.g. "val"). The variable name should not matter. Imagine we're looping through a set of series, then we still have the same problem of identical names (we discussed and said it's okay if metrics are overwritten). So it's not adding a benefit and at the same time users need to be aware of how they name their variables.

for pred in preds: mape(pred, ...)

Also, this is ambiguous with the train/val loss logged by the torch models under the hood.

Instead, simply ignore it. You could show here already the metric "name" parameter, but i would choose a different name than "val_*"

dennisbader · 2026-06-30T15:12:42Z

@@ -0,0 +1,6067 @@
+{


Line #6. with mlflow.start_run(run_name="nbeats-epoch-metrics"):

Some notes here for Torch Autologging:

- I also see parameters logged which are not part of the TorchForecastingModel wrapper, for example:
  optimizer_name=Adam, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0
  Shouldn't we just log the top-level model parameters (e.g. NBEATSModel-level) to allow re-creating the model with the same parameters?
- There is an additional checkpoints folder under run > Artifacts > checkpoints. This folder contains the latest checkpoint, which basically means we store the model twice. Is there a way to remove this one and only rely on our artifact store under model > Artifacts?
- Under run > Metrics, I find some metrics (val_loss, ...) are assigned to model NBEATSModel but others are not (val_mape, ...). It should be identical for all.
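
One possible way to restrict logging to top-level parameters, sketched with a toy class. `ToyModel` and `constructor_params` are hypothetical; the sketch keeps only keys that appear by name in the constructor signature, so it would miss parameters that a real model accepts through `**kwargs`.

```python
import inspect


class ToyModel:
    """Hypothetical stand-in for a model wrapper such as NBEATSModel."""

    def __init__(self, input_chunk_length, output_chunk_length, n_epochs=100):
        self.input_chunk_length = input_chunk_length
        self.output_chunk_length = output_chunk_length
        self.n_epochs = n_epochs


def constructor_params(model_cls, logged_params):
    """Keep only the parameters named in the model constructor's signature."""
    accepted = set(inspect.signature(model_cls.__init__).parameters) - {"self"}
    return {k: v for k, v in logged_params.items() if k in accepted}


logged = {
    "input_chunk_length": 24,
    "output_chunk_length": 12,
    "n_epochs": 10,
    "optimizer_name": "Adam",  # inner optimizer detail, not a constructor argument
    "lr": 0.001,
    "weight_decay": 0,
}
top_level = constructor_params(ToyModel, logged)
print(top_level)
```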

dennisbader · 2026-06-30T15:12:42Z

@@ -0,0 +1,6067 @@
+{


Line #11. multi_preds = multi_model.predict(n=len(val), series=series_list)

single and multi preds on different levels seems unintuitive.

multi_preds = multi_model.predict(n=len(val), series=series_list)
single_pred = multi_preds[0]

dennisbader · 2026-06-30T15:12:42Z

@@ -0,0 +1,6067 @@
+{


Line #19. dm.ae(val, single_pred)
What happens when we compute non-aggregate metrics on a list of predictions with different horizons?
e.g.

dm.ae(val_list, [multi_preds[0], multi_preds[1][1:]])
I assume the multi-series aggregation will not work properly.
I don't really think that we need to support this case, but it should maybe raise an exception in the auto-logging that this isn't supported
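
A sketch of such a guard, assuming it runs before any per-series aggregation. `check_equal_horizons` is a hypothetical name, and plain lists stand in for the predictions since only `len()` is used.

```python
def check_equal_horizons(predictions):
    """Raise if the predictions do not all cover the same number of time steps."""
    lengths = [len(p) for p in predictions]
    if len(set(lengths)) > 1:
        raise ValueError(
            "per-series metric logging needs predictions of equal length, "
            f"got lengths {lengths}"
        )
    return lengths[0] if lengths else 0


# Lists stand in for forecast series here; only their length matters.
same = check_equal_horizons([[1, 2, 3], [4, 5, 6]])
try:
    check_equal_horizons([[1, 2, 3], [5, 6]])
    raised = False
except ValueError:
    raised = True
print(same, raised)
```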

dennisbader · 2026-06-30T15:12:42Z

@@ -0,0 +1,6067 @@
+{


Line #23. per_series_mae = dm.mae(val_list, multi_preds)
When we call two metrics on a list of series, then it writes two CSVs. Could / should we just write one CSV and appending subsequent metric calls to the first one?
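
A sketch of the single-CSV idea using only the standard library: each metric call adds one column to a shared per-series file. The helper name `append_metric_column`, the file name and the `series` index column are assumptions, not what the PR currently writes.

```python
import csv
import os
import tempfile


def append_metric_column(csv_path, metric_name, values):
    """Add one column per metric to a shared per-series CSV, creating it if needed."""
    rows = []
    if os.path.exists(csv_path):
        with open(csv_path, newline="") as f:
            rows = list(csv.DictReader(f))
    if not rows:
        # First metric call: start the table with one row per series.
        rows = [{"series": str(i)} for i in range(len(values))]
    if len(rows) != len(values):
        raise ValueError("number of series differs from the existing CSV")
    for row, value in zip(rows, values):
        row[metric_name] = value
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


csv_path = os.path.join(tempfile.mkdtemp(), "per_series_metrics.csv")
append_metric_column(csv_path, "mae", [1.5, 2.5])
append_metric_column(csv_path, "mape", [10.0, 20.0])
with open(csv_path, newline="") as f:
    table = list(csv.DictReader(f))
print(table)
```

Calling the same metric twice overwrites its column, which matches the earlier agreement that overwritten metrics are acceptable.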

dennisbader · 2026-06-30T15:12:43Z

@@ -0,0 +1,6067 @@
+{


Line #39. with open(csv_path) as f:
would be much simpler to use pandas here

print("\nPer-series breakdown (val_list_mae_per_series.csv):")
pd.read_csv(csv_path)

dennisbader · 2026-06-30T15:12:43Z

@@ -0,0 +1,6067 @@
+{


the python code block should be an actual executable notebook cell

dennisbader

Really nice work on this one @jakubchlapek, @mizeller and @daidahao 🚀

I've played around with it a bit and it's looking great! Really cool to see everything that has been included in the logging. This will help users a lot during experimentation and modelling :)

I do have a couple of suggestions that mainly revolve around:

- currently some parts of the code can lead to ambiguous / incorrect logging (e.g. aggregation of backtests for multi-series that have different time indices)
- we do a lot of skipping in case the metrics don't have the expected shape. This can lead to missing logs, which might not be intuitive for the user, or can even silently hide actual bugs. I would prefer raising exceptions; especially since the feature is new, we need to know what is not working.
- agreeing on the naming of what is logged
- alternatives to the metric and model method patching
- and some other minor things

After this we should be good to go 💯

dennisbader · 2026-07-01T07:25:23Z

    "Programming Language :: Python :: Implementation :: PyPy",
 ]
 dependencies = [
+    "coolname>=4.2.0",


are coolname and loguru really required? Would prefer to not include them in the core dependencies. I tested the notebook without these, and had no issues

dennisbader · 2026-07-01T07:33:07Z

    "statsforecast>=1.4",
    "xgboost>=2.1.4",
 ]
+mlflow = ["mlflow>=3.0"]


either leave mlflow only in the optional dependency group below, or drop it from optional and leave it here.

dennisbader · 2026-07-01T07:34:17Z

 .venv
 .env
 uv.lock
+repl/


what is repl/ ?

dennisbader · 2026-07-01T07:34:46Z

+mlruns/*
+examples/mlruns/*


Suggested change

-mlruns/*
-examples/mlruns/*
+*mlruns/

dennisbader · 2026-07-01T07:44:40Z

+MLflow Integration for Darts
+-----------------------------


Suggested change

-MLflow Integration for Darts
------------------------------
+MLflow Integration
+------------------

dennisbader · 2026-07-03T13:09:56Z

+            series = args[0]
+        else:
+            series = kwargs.get("actual_series", None)
+        if series is None:


is this even possible?

dennisbader · 2026-07-03T13:23:31Z

+            name_prefix = metric_names[0] if len(metric_names) == 1 else "metrics"
+            flat = np.asarray(r, dtype=float).flatten()
+            for i, val in enumerate(flat):
+                key = _sanitize_mlflow_key(f"backtest_{name_prefix}_{i}")


I don't fully understand why we can't produce the correct naming here. We do know what the axes and quantile / label names are per metric, no?
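
A sketch of label-based keys, assuming the component names and quantiles are known for each metric and that values are ordered component-first, then quantile. `labeled_keys` and the `q<value>` suffix are hypothetical choices for illustration.

```python
from itertools import product


def labeled_keys(metric_name, component_names, quantiles=None, prefix="backtest"):
    """Build one key per component (and quantile) instead of integer suffixes."""
    if quantiles:
        labels = [(c, f"q{q}") for c, q in product(component_names, quantiles)]
    else:
        labels = [(c,) for c in component_names]
    return ["_".join((prefix, metric_name) + label) for label in labels]


keys = labeled_keys("mae", ["x", "y"])
q_keys = labeled_keys("ql", ["x"], quantiles=[0.1, 0.9])
print(keys, q_keys)
```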

dennisbader · 2026-07-03T13:24:12Z

+        # this is an issue, then I'd suggest falling back to flat integer-indexed keys
+        # and enforcing explicit labels.
+        if labels_unknown:
+            inferred_labels = np.unique(s.values())


as mentioned somewhere else, we can drop support for this

dennisbader · 2026-07-03T13:24:56Z

+        rest, extra = divmod(arr.size, c_size * n_metrics)
+        if extra:
+            logger.warning(
+                "Backtest metric logging skipped: result size (%d) is not "


raise instead of skipping (and the other occurrences)
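
A sketch of the raising variant of the quoted check. The function name `windows_from_size` and the error wording are assumptions; only the `divmod` test comes from the diff.

```python
def windows_from_size(total_size, c_size, n_metrics):
    """Return the remaining axis size, raising when the result does not divide evenly."""
    rest, extra = divmod(total_size, c_size * n_metrics)
    if extra:
        raise ValueError(
            f"backtest result size ({total_size}) is not a multiple of "
            f"n_components * n_metrics ({c_size} * {n_metrics})"
        )
    return rest


ok = windows_from_size(12, 2, 3)
try:
    windows_from_size(13, 2, 3)
    raised = False
except ValueError:
    raised = True
print(ok, raised)
```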

dennisbader · 2026-07-03T13:32:28Z

+        # both time and window axes present: backtest returns (W*T*C*M,) in C order so we can
+        # recover W and T only if forecast_horizon is known (T = forecast_horizon)
+        if has_time_axis and has_windows:
+            if not forecast_horizon or rest % forecast_horizon:


Can we infer the horizon from the backtest input args?

- historical_forecasts=None: it is guaranteed to be the user forecast_horizon
- historical_forecasts!=None:
  - last_points_only=False: it is the length of the first historical forecast window
  - last_points_only=True: the forecast horizon shouldn't really matter, since the metrics are only computed on a single TimeSeries forecast that consists of the last predicted steps from each window
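
The rules above could be sketched roughly like this. `infer_forecast_horizon` and `FakeForecast` are hypothetical; the sketch assumes a nested list for multiple series and returns `None` where the horizon does not matter.

```python
class FakeForecast:
    """Hypothetical stand-in for a forecast series; only len() is used."""

    def __init__(self, n_steps):
        self.n_steps = n_steps

    def __len__(self):
        return self.n_steps


def infer_forecast_horizon(forecast_horizon=1, historical_forecasts=None,
                           last_points_only=False):
    """Infer the horizon from backtest-style arguments."""
    if historical_forecasts is None:
        # Forecasts are generated inside backtest, so the user value applies.
        return forecast_horizon
    if last_points_only:
        # One series of last points per input series: the horizon is irrelevant.
        return None
    first = historical_forecasts[0]
    if isinstance(first, list):
        # Multiple series: a list of per-series forecast lists.
        first = first[0]
    return len(first)


from_arg = infer_forecast_horizon(forecast_horizon=6)
single = infer_forecast_horizon(historical_forecasts=[FakeForecast(12), FakeForecast(12)])
multi = infer_forecast_horizon(historical_forecasts=[[FakeForecast(3)], [FakeForecast(3)]])
ignored = infer_forecast_horizon(historical_forecasts=[FakeForecast(40)], last_points_only=True)
print(from_arg, single, multi, ignored)
```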

jakubchlapek added 24 commits February 10, 2026 16:01

draft MVP

5789a99

covariate support

dc48b97

pyfunc series info extension

0cc488e

unit tests

b6ce538

add mlflow to dependencies

605ecbe

changed to sqlite

3c2a2e2

kwargs pyfunc extension

1750b11

removing pyfunc draft support: tbd if to include in future

30456f5

slight refactor by leveraging built-in mflflow validation util methods

b69de4a

restructuring module

c488cd7

autologging refactor w/ mlflow decorator

7fe3cf3

refactoring log_model to leverage Model.log

ada47b7

unit test improvenemnts

9fbe42a

ForecastingModel subclasses handling for autolog

a0e663f

save models with clean=True

93ca22a

logging update

ac4b0c6

unused var, tfmodel.load handles .ckpt internally

a35a834

documentation

2695cc9

added autolog logging default/provided metrics for all models

662b8e3

autolog metric unit tests

f19d6bf

removed redundant tests

bd932a9

changed callback inject to true by default

3b0ecb2

feat: ensure contiguous tensors in metric updates

2d6158d

example quickstart for mlflow

f684041

Merge branch 'master' into feat/mlflow-base

4495115

jakubchlapek added 3 commits February 18, 2026 16:47

unit test mps fix for torch

894779b

typehinting fix

c9a1301

CI hotfix

a65612d

mizeller added 5 commits April 21, 2026 09:15

fix: mlflow test script

5acd8de

fix: remove obsolte file

c0b1c6e

fix: formatting

d5977c8

fix: deprecate manage_run

bbe0bf9

Update mlflow_test_v2.py

09ad117

feat: metric kwarg support for autolog

5a9f493

jakubchlapek commented May 27, 2026

jakubchlapek and others added 11 commits June 23, 2026 12:54

chore: remove redundant metric check

9113c41

fix: check for active run in fit patch

845dcd6

feat: infer backtest output dim based on metric/backtests kwargs

b9a7b62

feat: backtest metric testing suite

bf497e6

fix: update failing tests due to manage_run deprecation

9e4934d

Merge branch 'master' into feat/mlflow-base

83bf76c

chore: remove temporary test file

7b8dbcd

feat: update mlflow notebook

502db71

fix: add MLFLOW_AVAILABLE flag

6dda789

feat: align direct metric logging shape to backtesting (based on metric kwargs)

c3f1df6

feat: add missing autolog shape unit test

8b8157b

dennisbader mentioned this pull request Jun 25, 2026

MLflow autologging issue # 1618 #2092

Closed

jakubchlapek added 5 commits June 25, 2026 13:15

feat: save per-series metrics to csv and the aggregate to mlflow

3deccb1

feat: autolog per-series csv saving tests

277153f

feat: add metric shape and per-series csv saving explanation section to example notebook

f746d7b

feat: handle name param for metrics

a486b25

chore: changelog

e2c6758

jakubchlapek marked this pull request as ready for review June 25, 2026 13:43

Merge branch 'master' into feat/mlflow-base

248770f

dennisbader reviewed Jun 30, 2026

dennisbader requested changes Jul 3, 2026
