Potential Look-ahead Bias in Feature Engineering and Model Training #1

@DyncEric

Summary

I found a potential look-ahead bias issue in the LSTM training pipeline (not the HMM labeling, which is correctly designed for regime discovery). The issue is specifically about feature timing alignment that may cause the model to use incomplete bar data during live inference.

Important: HMM Design is Correct

First, I want to clarify that I understand the HMM is correctly using full data for regime discovery (unsupervised learning), not prediction. The README clearly states:

"Unsupervised → Supervised learning: HMM discovers latent regimes first. LSTM then learns temporal structure to predict them."

This is the right approach. The issue I'm raising is specifically about the LSTM training phase and how features align with targets.

Issue Details

1. Feature Computation (src/compute_features.py)

Current behavior:

  • Context features (15m) are shifted by 1 bar (Lines 656-658)
  • Main features (5m) are NOT shifted

# Lines 646-658 in src/compute_features.py
# Shift context features by 1 main bar to prevent lookahead bias
context_feature_cols = []
for tf in context_tfs:
    suffix = f'_{tf}'
    cols = [
        c for c in df_features.columns
        if c.endswith(suffix) and not any(c.startswith(prefix) for prefix in ('open_', 'high_', 'low_', 'close_', 'volume_'))
    ]
    context_feature_cols.extend(cols)

if context_feature_cols:
    # shift by 1 row: assumption is df rows are main_tf cadence
    df_features.loc[:, context_feature_cols] = df_features.loc[:, context_feature_cols].shift(1)

Problem:
Main timeframe features (5m) such as log_ret_1_5m, rsi_14_5m, and atr_norm_5m are calculated from the current bar's OHLCV data but are not shifted.

2. Training Target (dashboard/pages/4_Model_Training.py)

# Lines 85-86
df['target'] = df['regime'].map(regime_map)
df['target'] = df['target'].shift(-1) # Predict next bar's regime

3. Combined Effect - Look-ahead Bias

Training Logic:

At time t=10:00:
  Features X[t]: close_5m[10:00], volume_5m[10:00], RSI[10:00]  # Current bar
  Target y[t]:   regime[10:05]                                   # Next bar (shifted -1)

Why this is problematic in live trading (assuming bars are labeled by their open time and the prediction is made at 10:00):

  • At 10:00, the 5-minute bar from 10:00-10:05 is not yet complete
  • We don't have close_5m[10:00], volume_5m[10:00], or any indicators calculated from them
  • But the model was trained using these "future" values

Expected behavior for live trading:

At time t=10:00:
  Features X[t]: close_5m[09:55], volume_5m[09:55], RSI[09:55]  # Previous completed bar
  Target y[t]:   regime[10:00]                                   # Current bar regime
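
The two alignments above can be compared on a toy frame. This is a minimal sketch, not the project's code: the four bars and the string values are made up, and only the column names rsi_14_5m and regime come from the issue. Each value encodes the bar it came from, so the pairing at row 10:00 is easy to read.

```python
import pandas as pd

# Toy frame: each value names the bar it was computed from (illustrative only).
bars = ["09:50", "09:55", "10:00", "10:05"]
df = pd.DataFrame(
    {"rsi_14_5m": [f"rsi@{b}" for b in bars], "regime": [f"regime@{b}" for b in bars]},
    index=bars,
)

# Current pipeline as described above: unshifted feature, target shifted by -1.
current = pd.DataFrame({"X": df["rsi_14_5m"], "y": df["regime"].shift(-1)})

# Expected behavior: feature from the previous completed bar, unshifted target.
expected = pd.DataFrame({"X": df["rsi_14_5m"].shift(1), "y": df["regime"]})

print(current.loc["10:00"].tolist())   # ['rsi@10:00', 'regime@10:05']
print(expected.loc["10:00"].tolist())  # ['rsi@09:55', 'regime@10:00']
```

For a single main-timeframe column, both layouts contain the same (feature, label) pairs attached to different row timestamps. The practical difference is which timestamp a row is taken to represent at inference time, and how the main features line up with the context features that are already shifted by one bar.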

Reproduction Steps

  1. Load feature file: XAUUSD_combined_klines_*_features.csv
  2. Check column log_ret_1_5m at row i
  3. Calculate: log(close_5m[i] / close_5m[i-1])
  4. Observe that log_ret_1_5m[i] equals the calculated value without any shift
  5. In training, this feature is paired with regime[i+1] (due to shift(-1))
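
The reproduction steps can be sketched as a small check. This is an illustration under assumptions: max_abs_gap is a hypothetical helper, and the data is a synthetic stand-in for the features file (random prices, with log_ret_1_5m built unshifted on purpose). On the real file, you would load the CSV into df and run the same two calls.

```python
import numpy as np
import pandas as pd


def max_abs_gap(df: pd.DataFrame, lag: int) -> float:
    """Largest |log_ret_1_5m - reference| when the reference is lagged by `lag` rows."""
    reference = np.log(df["close_5m"] / df["close_5m"].shift(1)).shift(lag)
    return float((df["log_ret_1_5m"] - reference).abs().max())


# Synthetic stand-in for the features file (values are made up for illustration).
rng = np.random.default_rng(0)
close = pd.Series(2000.0 + rng.normal(0.0, 1.0, 50).cumsum(), name="close_5m")
df = pd.DataFrame({"close_5m": close, "log_ret_1_5m": np.log(close / close.shift(1))})

gap_unshifted = max_abs_gap(df, lag=0)  # near zero if the feature uses bar i's own close
gap_shifted = max_abs_gap(df, lag=1)    # near zero only if the feature had been shifted by 1
print(f"unshifted gap: {gap_unshifted:.2e}, shifted gap: {gap_shifted:.2e}")
```

A near-zero unshifted gap together with a clearly non-zero shifted gap indicates that the column holds the current bar's return.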

Expected Behavior

All features should be shifted by 1 bar to ensure we only use information available at prediction time:

  • Use features from bar t-1 to predict regime at bar t
  • This matches the real-world scenario, where we can only use completed bars

Proposed Fix

Option 1: Shift all features forward (recommended)

# In compute_features.py, after computing features:

# Shift main timeframe features
main_feature_cols = [
    c for c in df_features.columns
    if c.endswith(f'_{main_tf}') and not c.startswith(('open_', 'high_', 'low_', 'close_', 'volume_'))
]
if main_feature_cols:
    df_features.loc[:, main_feature_cols] = df_features.loc[:, main_feature_cols].shift(1)

# (Keep existing context feature shift)

# In training (dashboard/pages/4_Model_Training.py), remove the target shift:
df['target'] = df['regime'].map(regime_map)
# Remove: df['target'] = df['target'].shift(-1)

Interpretation: Use bar t-1 features to identify bar t regime

Option 2: Remove target shift

Keep features as-is but don't shift target:

df['target'] = df['regime'].map(regime_map)
# Remove shift(-1)

Interpretation: Use bar t features to identify bar t regime (only works if we wait for bar completion)
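
If inference is meant to wait for bar completion, one way to enforce that is to drop the still-forming bar before computing features. The sketch below is an assumption about how that could look, not something taken from the repository: completed_bars is a hypothetical helper, the index is assumed to hold bar open times (as in the 10:00 example above), and the prices are made up.

```python
import pandas as pd


def completed_bars(bars: pd.DataFrame, now: pd.Timestamp, bar_length: str = "5min") -> pd.DataFrame:
    """Keep only bars whose interval [open, open + bar_length) has fully elapsed at `now`.

    Assumes the index holds bar OPEN times.
    """
    close_times = bars.index + pd.Timedelta(bar_length)
    return bars[close_times <= now]


# Illustrative prices only.
bars = pd.DataFrame(
    {"close_5m": [2650.1, 2650.8, 2651.4]},
    index=pd.to_datetime(["2025-01-02 09:50", "2025-01-02 09:55", "2025-01-02 10:00"]),
)

usable = completed_bars(bars, now=pd.Timestamp("2025-01-02 10:02"))
print(usable.index[-1])  # 2025-01-02 09:55:00
```

At 10:02 the 10:00 bar is excluded because it does not close until 10:05, so the latest usable features come from the 09:55 bar.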

Questions for Author

  1. Is this intended behavior? If so, what's the assumption for live trading?
    • Do we wait for the current 5m bar to complete before making predictions?
    • Or should we be using the previous bar's features?

  2. Have you deployed this to live trading? If so, how do you handle the timing of feature computation?

  3. Would you accept a PR to add the feature shift to prevent look-ahead bias?

Additional Context

  • This kind of feature/target misalignment is a common source of look-ahead bias in time-series ML projects
  • The HMM labeling using full data is fine (it's discovering regime definitions)
  • The problem is specifically about feature timing alignment in LSTM training
  • Similar projects often use X[t-1] -> y[t] or X[t] -> y[t+1] with careful timing
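
For sequence models, the X[t-1] -> y[t] convention can be applied when building input windows. The sketch below assumes a plain NumPy windowing step; make_windows and seq_len are hypothetical names, and the repository's actual LSTM windowing may differ. Each window ends at bar t-1 and is labeled with bar t, so no window contains the bar it is predicting.

```python
import numpy as np


def make_windows(features: np.ndarray, labels: np.ndarray, seq_len: int):
    """Pair features[t - seq_len : t] with labels[t]; the window never includes bar t."""
    X = np.stack([features[t - seq_len:t] for t in range(seq_len, len(features))])
    y = labels[seq_len:]
    return X, y


# Toy data: the feature value equals the bar index, the label is 10x the bar index.
features = np.arange(6, dtype=float).reshape(-1, 1)
labels = np.arange(6) * 10

X, y = make_windows(features, labels, seq_len=3)
print(X.shape)                 # (3, 3, 1)
print(X[0].ravel(), y[0])      # window over bars 0..2, label from bar 3
```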

Environment

  • Python 3.13
  • Using the Streamlit dashboard for training
  • Data: XAUUSD 5m/15m klines (2024-2025)

Thank you for this great project! I'm trying to use it for gold trading and want to make sure the timing logic is sound before going live.
