# Potential Look-ahead Bias in Feature Engineering and Model Training

## Summary
I found a potential look-ahead bias issue in the LSTM training pipeline (not the HMM labeling, which is correctly designed for regime discovery). The issue concerns feature timing alignment: the model may be trained on bar data that is not yet complete at the moment a live prediction is made.

## Important: HMM Design is Correct

First, I want to clarify that I understand the HMM is correctly using full data for **regime discovery** (unsupervised learning), not prediction. The README clearly states:

> "Unsupervised → Supervised learning: HMM discovers latent regimes first. LSTM then learns temporal structure to predict them."

This is the right approach. The issue I'm raising is specifically about the **LSTM training phase** and how features align with targets.

## Issue Details

### 1. Feature Computation (`src/compute_features.py`)

**Current behavior:**
- Context features (15m) are shifted by 1 main (5m) bar (lines 656-658)
- **Main features (5m) are NOT shifted**

```python
# Lines 646-658 in src/compute_features.py
# Shift context features by 1 main bar to prevent lookahead bias
context_feature_cols = []
for tf in context_tfs:
    suffix = f'_{tf}'
    cols = [
        c for c in df_features.columns
        if c.endswith(suffix) and not any(c.startswith(prefix) for prefix in ('open_', 'high_', 'low_', 'close_', 'volume_'))
    ]
    context_feature_cols.extend(cols)

if context_feature_cols:
    # shift by 1 row: assumption is df rows are main_tf cadence
    df_features.loc[:, context_feature_cols] = df_features.loc[:, context_feature_cols].shift(1)
```

**Problem:**
Main timeframe features (5m) such as `log_ret_1_5m`, `rsi_14_5m`, and `atr_norm_5m` are calculated from the current bar's OHLCV data but are **not shifted**.

### 2. Training Target (`dashboard/pages/4_Model_Training.py`)

```python
# Lines 85-86
df['target'] = df['regime'].map(regime_map)
df['target'] = df['target'].shift(-1) # Predict next bar's regime
```
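
To make the effect of `shift(-1)` concrete, here is a minimal sketch on a toy frame. The timestamps, regime labels, and `regime_map` values are made up for illustration and are not taken from the project.

```python
import pandas as pd

# Toy data: four 5-minute bars with invented regime labels.
idx = pd.date_range("2024-01-01 10:00", periods=4, freq="5min")
df = pd.DataFrame({"regime": ["bull", "bull", "bear", "bear"]}, index=idx)

regime_map = {"bull": 0, "bear": 1}  # hypothetical mapping
df["target"] = df["regime"].map(regime_map).shift(-1)

# Row t now carries the regime of row t+1; the last row has no successor,
# so its target is NaN and it has to be dropped before training.
aligned = df.dropna(subset=["target"])
print(df)
```

After the shift, the row stamped 10:05 holds the regime of the 10:10 bar, and the final row is unusable as a training sample.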

### 3. Combined Effect - Look-ahead Bias

**Training Logic:**
```
At time t=10:00:
  Features X[t]: close_5m[10:00], volume_5m[10:00], RSI[10:00]  # Current bar
  Target y[t]:   regime[10:05]                                   # Next bar (shifted -1)
```

**Why this is problematic in live trading** (assuming bars are timestamped by their open time):
- At 10:00, the 5-minute bar from 10:00-10:05 is **not yet complete**
- We don't have `close_5m[10:00]`, `volume_5m[10:00]`, or any indicators calculated from them
- But the model was trained using these "future" values

**Expected behavior for live trading:**
```
At time t=10:00:
  Features X[t]: close_5m[09:55], volume_5m[09:55], RSI[09:55]  # Previous completed bar
  Target y[t]:   regime[10:00]                                   # Current bar regime
```

## Reproduction Steps

1. Load feature file: `XAUUSD_combined_klines_*_features.csv`
2. Check column `log_ret_1_5m` at row i
3. Calculate: `log(close_5m[i] / close_5m[i-1])`
4. Observe that `log_ret_1_5m[i]` equals the calculated value **without any shift**
5. In training, this feature is paired with `regime[i+1]` (due to `shift(-1)`)
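
The check in steps 2-4 can be sketched as a small helper. This is an illustration, not project code: `feature_lag` is a hypothetical name, and the toy closes below stand in for the real CSV. It assumes the file has `close_5m` and `log_ret_1_5m` columns as described above.

```python
import numpy as np
import pandas as pd

def feature_lag(df, feature_col="log_ret_1_5m", close_col="close_5m", max_lag=3):
    """Return the smallest row lag k at which the stored feature matches the
    log return recomputed from closes, or None if no lag up to max_lag fits."""
    recomputed = np.log(df[close_col] / df[close_col].shift(1))
    for k in range(max_lag + 1):
        candidate = recomputed.shift(k)
        mask = candidate.notna() & df[feature_col].notna()
        if mask.any() and np.allclose(df.loc[mask, feature_col], candidate[mask]):
            return k
    return None

# Toy closes (invented values) with an unshifted and a shifted feature.
toy = pd.DataFrame({"close_5m": [2000.0, 2002.0, 2001.0, 2005.0, 2003.0]})
toy["log_ret_1_5m"] = np.log(toy["close_5m"] / toy["close_5m"].shift(1))
unshifted_lag = feature_lag(toy)

toy_shifted = toy.assign(log_ret_1_5m=toy["log_ret_1_5m"].shift(1))
shifted_lag = feature_lag(toy_shifted)
```

A result of `0` on the real feature file would confirm the observation in step 4; a result of `1` would mean the main features are already lagged by one bar.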

## Expected Behavior

All features should be shifted by 1 bar to ensure we only use information available **at prediction time**:
- Use features from bar `t-1` to predict regime at bar `t`
- This matches real-world scenario where we can only use completed bars
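
For live inference, the "completed bars only" rule could look roughly like the sketch below. It assumes the feature index holds bar open times in ascending order; `last_completed_bar` and the toy values are hypothetical and not part of the project.

```python
import pandas as pd

BAR = pd.Timedelta(minutes=5)  # main timeframe used in this issue

def last_completed_bar(features: pd.DataFrame, now: pd.Timestamp) -> pd.Series:
    """Return the feature row of the most recent bar that has fully closed.

    Assumes the index holds bar OPEN times, sorted ascending."""
    closed = features[features.index + BAR <= now]
    if closed.empty:
        raise ValueError("no completed bar available yet")
    return closed.iloc[-1]

# Toy features (invented values) for bars opening at 09:50, 09:55, 10:00.
idx = pd.date_range("2024-01-01 09:50", periods=3, freq="5min")
features = pd.DataFrame({"rsi_14_5m": [40.0, 45.0, 55.0]}, index=idx)

row = last_completed_bar(features, pd.Timestamp("2024-01-01 10:00"))
```

At 10:00 this selects the 09:55 bar, which matches the expected behavior described above; the 10:00 bar only becomes eligible at 10:05.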

## Proposed Fix

### Option 1: Shift all features forward (recommended)
```python
# In compute_features.py, after computing features:

# Shift main timeframe features
main_feature_cols = [
    c for c in df_features.columns
    if c.endswith(f'_{main_tf}') and not c.startswith(('open_', 'high_', 'low_', 'close_', 'volume_'))
]
if main_feature_cols:
    df_features.loc[:, main_feature_cols] = df_features.loc[:, main_feature_cols].shift(1)

# (Keep existing context feature shift)
```

```python
# In training, remove target shift:
df['target'] = df['regime'].map(regime_map)
# Remove: df['target'] = df['target'].shift(-1)
```

**Interpretation:** Use bar `t-1` features to identify bar `t` regime
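
As a sanity check, Option 1 can be applied to a toy frame to see which values end up on which row. This is a sketch under assumptions: `main_tf = "5m"`, the column values, and the integer `regime` column are invented for illustration.

```python
import pandas as pd

main_tf = "5m"  # assumed main timeframe
idx = pd.date_range("2024-01-01 09:50", periods=4, freq="5min")
df_features = pd.DataFrame(
    {
        "close_5m": [2000.0, 2002.0, 2001.0, 2005.0],
        "rsi_14_5m": [40.0, 45.0, 55.0, 60.0],
        "regime": [0, 0, 1, 1],
    },
    index=idx,
)

raw_prefixes = ("open_", "high_", "low_", "close_", "volume_")
main_feature_cols = [
    c for c in df_features.columns
    if c.endswith(f"_{main_tf}") and not c.startswith(raw_prefixes)
]

# Shift derived main-timeframe features by one bar; leave the target unshifted.
df_features[main_feature_cols] = df_features[main_feature_cols].shift(1)
df_features["target"] = df_features["regime"]

# The first row has no previous bar, so it is dropped before training.
train = df_features.dropna(subset=main_feature_cols)
```

Note that the prefix filter leaves raw OHLCV columns such as `close_5m` unshifted, so if those columns are fed to the model directly they would need the same treatment.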

### Option 2: Remove target shift
Keep features as-is but don't shift target:
```python
df['target'] = df['regime'].map(regime_map)
# Remove shift(-1)
```

**Interpretation:** Use bar `t` features to identify bar `t` regime (only works if we wait for bar completion)

## Questions for Author

1. Is this intended behavior? If so, what's the assumption for live trading?
   - Do we wait for the current 5m bar to complete before making predictions?
   - Or should we be using the previous bar's features?

2. Have you deployed this to live trading? If so, how do you handle the timing of feature computation?

3. Would you accept a PR to add the feature shift to prevent look-ahead bias?

## Additional Context

- This is a common issue in time-series ML projects
- The HMM labeling using full data is fine (it's discovering regime definitions)
- The problem is specifically about feature timing alignment in LSTM training
- Similar projects often use `X[t-1] -> y[t]` or `X[t] -> y[t+1]` with careful timing

## Environment
- Python 3.13
- Using streamlit dashboard for training
- Data: XAUUSD 5m/15m klines (2024-2025)

---

Thank you for this great project! I'm trying to use it for gold trading and want to make sure the timing logic is sound before going live.

