Fix log_training_metric crash for statistical time series models#1468

Closed
Copilot wants to merge 5 commits into main from copilot/fix-log-training-metric-bug-again
Closed

Conversation


Copilot AI commented Jan 10, 2026

Statistical time series models (ARIMA, SARIMAX, Holt-Winters) fail with IndexError: single positional indexer is out-of-bounds when log_training_metric=True.

Root Cause

The _eval_estimator function attempts to compute training metrics by calling predict() on the training data. The statistical models wrap statsmodels' predict interface, which is designed for out-of-sample forecasting: it expects future timestamps and cannot produce in-sample predictions on the training data the way ML models can.
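The error class itself comes from pandas positional indexing. A minimal, standalone illustration of the same IndexError (this reproduces the message only, not FLAML's internal call path):

```python
import pandas as pd

# A 3-row frame stands in for the (too-short) data that ends up being indexed.
df = pd.DataFrame({"y": [1.0, 2.0, 3.0]})

try:
    df.iloc[10]  # positional index past the end of the frame
except IndexError as e:
    print(f"{type(e).__name__}: {e}")
```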

Changes

  • flaml/automl/ml.py: Skip training metric computation for ARIMA, SARIMAX, and HoltWinters when log_training_metric=True
  • test/automl/test_log_training_metric_ts.py: Add tests covering all three statistical models individually and together, plus ML models to verify normal behavior preserved
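The skip logic can be sketched roughly as follows. The estimator names match the PR description, but the helper name and signature are illustrative assumptions, not the merged code in flaml/automl/ml.py:

```python
# Illustrative sketch of the guard described above; the function is
# hypothetical, not the actual code in flaml/automl/ml.py.
STATISTICAL_TS_ESTIMATORS = {"arima", "sarimax", "holt-winters"}

def should_compute_training_metric(estimator_name: str, log_training_metric: bool) -> bool:
    """Return False for statistical TS models, whose predict() is
    out-of-sample only and cannot score the training data."""
    if not log_training_metric:
        return False
    return estimator_name not in STATISTICAL_TS_ESTIMATORS
```

ML estimators such as lgbm or xgboost fall through the guard and keep computing training metrics as before.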

Example

from flaml import AutoML

automl = AutoML()
automl.fit(
    dataframe=df,
    label="y",
    task="ts_forecast",
    estimator_list=["arima", "sarimax", "holt-winters"],
    log_training_metric=True,  # Now works without IndexError
    period=12
)

Statistical models will log validation metrics but not training loss (which is less meaningful for models that fit the entire sequence). ML models continue computing training metrics normally.

Original prompt

This section details the original issue you should resolve

<issue_title>[Bug]: Forecasting: log_training_metric causes arima, sarimax, holt-winters to fail when set to true.</issue_title>
<issue_description>### Describe the bug

The key findings are:

  • Individual TS estimators (arima, sarimax, holt-winters) FAIL with log_training_metric=True
  • ML estimators (xgboost, lgbm, catboost) PASS
  • When log_training_metric is NOT set, arima PASSES (see the holdout split test)

ROOT CAUSE HYPOTHESIS:

  • log_training_metric=True causes FLAML to call get_y_pred() on X_train
  • For time series models (arima, sarimax, holt-winters), this fails because
    the TS model's predict() method expects X to have timestamps, but during
    internal validation, X_train can be empty or malformed.

Steps to reproduce

Script for reproduction

"""
FLAML Root Cause Verification Test

Hypothesis: The bug is triggered by `log_training_metric=True` with time series models.

When log_training_metric=True, FLAML tries to compute training predictions
via get_y_pred() which calls estimator.predict(X_train). For TS models,
this fails because X_train can be empty during certain validation scenarios.
"""

import sys
import os
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

import numpy as np
import pandas as pd
import sktime.datasets
from flaml import AutoML

def prepare_airline_data():
    """Prepare Airline data in FLAML format."""
    airline = sktime.datasets.load_airline()
    airline.index = airline.index.to_timestamp()
    
    return pd.DataFrame({
        "ds": airline.index,
        "y": airline.values.astype(np.float64),
    })


def test_log_training_metric_hypothesis():
    """Test if log_training_metric=True is the root cause."""
    print("\n" + "="*70)
    print("ROOT CAUSE VERIFICATION: log_training_metric")
    print("="*70)
    
    train_df = prepare_airline_data()
    
    # Base config
    base_config = {
        "task": "ts_forecast",
        "time_budget": 10,
        "metric": "mape",
        "eval_method": "holdout",
        "seed": 42,
        "verbose": 0,
        "estimator_list": ["arima"],
    }
    
    # Test 1: WITHOUT log_training_metric
    print("\n--- Test 1: WITHOUT log_training_metric ---")
    config1 = base_config.copy()
    
    try:
        automl = AutoML()
        automl.fit(dataframe=train_df, label="y", period=1, **config1)
        print(f"  ✅ SUCCESS - Best: {automl.best_estimator}")
    except Exception as e:
        print(f"  ❌ FAILED - {type(e).__name__}: {e}")
    
    # Test 2: WITH log_training_metric=True
    print("\n--- Test 2: WITH log_training_metric=True ---")
    config2 = base_config.copy()
    config2["log_training_metric"] = True
    
    try:
        automl = AutoML()
        automl.fit(dataframe=train_df, label="y", period=1, **config2)
        print(f"  ✅ SUCCESS - Best: {automl.best_estimator}")
    except Exception as e:
        print(f"  ❌ FAILED - {type(e).__name__}: {e}")
    
    # Test 3: WITH log_training_metric=False (explicit)
    print("\n--- Test 3: WITH log_training_metric=False ---")
    config3 = base_config.copy()
    config3["log_training_metric"] = False
    
    try:
        automl = AutoML()
        automl.fit(dataframe=train_df, label="y", period=1, **config3)
        print(f"  ✅ SUCCESS - Best: {automl.best_estimator}")
    except Exception as e:
        print(f"  ❌ FAILED - {type(e).__name__}: {e}")


def test_all_ts_estimators_with_and_without_logging():
    """Test all TS estimators with and without log_training_metric."""
    print("\n" + "="*70)
    print("ALL TS ESTIMATORS: with/without log_training_metric")
    print("="*70)
    
    train_df = prepare_airline_data()
    
    ts_estimators = ["arima", "sarimax", "holt-winters"]
    
    for est in ts_estimators:
        print(f"\n--- Estimator: {est} ---")
        
        # Without logging
        config_no_log = {
            "task": "ts_forecast",
            "time_budget": 5,
            "metric": "mape",
            "eval_method": "holdout",
            "seed": 42,
            "verbose": 0,
            "estimator_list": [est],
        }
        
        try:
            automl = AutoML()
            automl.fit(dataframe=train_df, label="y", period=1, **config_no_log)
            print(f"  log_training_metric=False: ✅ SUCCESS")
        except Exception as e:
            print(f"  log_training_metric=False: ❌ FAILED - {str(e)[:50]}")
        
        # With logging
        config_with_log = config_no_log.copy()
        config_with_log["log_training_metric"] = True
        
        try:
            automl = AutoML()
            automl.fit(dataframe=train_df, label="y", period=1, **config_with_log)
            print(f"  log_training_metric=True:  ✅ SUCCESS")
        except Exception as e:
            print(f"  log_training_metric=True:  ❌ FAILED - {str(e)[:50]}")


def test_fix_remove_log_training_metric():
    """Test the fix: remove log_trainin...





- Fixes microsoft/FLAML#1464


Copilot AI and others added 3 commits January 10, 2026 02:29
Skip training metric computation for ARIMA, SARIMAX, and Holt-Winters models when log_training_metric=True to avoid IndexError. These statistical models don't support in-sample predictions the same way ML models do.

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>
Remove print statements and extract test data preparation to helper function

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>
Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix bug with log_training_metric causing time series models to fail Fix log_training_metric crash for statistical time series models Jan 10, 2026
Copilot AI requested a review from thinkall January 10, 2026 02:37
@thinkall thinkall closed this Jan 10, 2026
@thinkall thinkall deleted the copilot/fix-log-training-metric-bug-again branch January 10, 2026 04:21