Epic: ML-Based Tournament Duration Prediction

### Overview
This epic introduces a two-phase machine learning system that predicts remaining tournament time and displays it in the live bracket progress embed. Phase 1 silently collects round-level data during real tourneys. Phase 2 trains a Random Forest model on that data and surfaces a time range estimate to staff, refreshing every 5 minutes alongside the existing progress dashboard.

### Success Metrics
- **Prediction Accuracy:** Predicted time range contains the actual end time in the majority of completed tourneys once the model has sufficient training data.
- **Data Collection Reliability:** Round duration snapshots are successfully captured for every round of every opted-in tourney without missing entries.
- **Cold Start Handling:** ETA field is correctly hidden until the minimum data threshold is met, with no erroneous predictions shown on insufficient data.
- **Model Stability:** Model retrains successfully after every `!endtourney` without errors and loads correctly from MongoDB on bot startup.

### User Personas
- **Tourney Admin:** Needs a reliable at-a-glance estimate of how much longer the tourney will run so they can communicate with participants, plan staffing, and manage expectations. Does not need to understand the model, only to see a trustworthy time range in the progress embed.
- **Bot Developer:** Needs clear opt-in controls so test tourneys never pollute the training data, and needs the model to retrain and persist automatically without manual intervention.

### High-Level Requirements
- **Data Collection Toggle:** Add an optional `collect_data` boolean parameter to the set-matcherino-id command (defaults to `False`). `!starttourney` should remind staff to set this flag if they want data collected. `!endtourney` automatically resets it to `False`.
- **Round Snapshot Storage:** Piggyback on the existing 5-minute `progress_dashboard_task` poll to detect round transitions and store per-round data in MongoDB.
- **Round Transition Detection:** Use `max(endAt)` across done matches per round to determine round end time, and `min(statusAt)` across in-progress/done matches to approximate round start time. Fall back to `statusAt` if `endAt` is null on a done match.
- **Statistical Validation Layer:** Before introducing the ML model, validate predictions using a simple per-round average and standard deviation range across historical data. Only proceed to the full ML model if the statistical approach proves insufficient.
- **ML Model:** Train a Random Forest Regressor via scikit-learn using round-level features to predict `duration_per_match` per remaining round. Serialize with joblib and store as binary in MongoDB. Load into memory on bot startup.
- **ETA Display:** Add an ETA field to the existing bracket progress embed showing a time range derived from the spread across Random Forest decision trees. Hide entirely until the minimum tourney threshold is met.
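
The round transition rule above can be sketched roughly as follows. This is a minimal illustration, not the bot's actual code: it assumes each match is a dict with `status`, `statusAt`, and `endAt` keys, that timestamps are ISO-8601 strings, and that the status values are literally `"in-progress"` and `"done"`. The helper names `parse_ts` and `round_bounds` are hypothetical.

```python
from datetime import datetime
from typing import Optional


def parse_ts(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO-8601 timestamp string; return None for missing values."""
    if not value:
        return None
    # Assumption: a trailing "Z" means UTC; fromisoformat needs an explicit offset
    # on older Python versions.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def round_bounds(matches: list[dict]) -> tuple[Optional[datetime], Optional[datetime]]:
    """Return (approximate_start, end) for the matches of a single round.

    end stays None until every match in the round has status "done".
    """
    # Round start: earliest statusAt among matches that have started or finished.
    starts = [
        ts
        for m in matches
        if m.get("status") in ("in-progress", "done")
        and (ts := parse_ts(m.get("statusAt"))) is not None
    ]
    start = min(starts) if starts else None

    # The round is only over once all of its matches are done.
    if not matches or any(m.get("status") != "done" for m in matches):
        return start, None

    # Round end: latest endAt among done matches, falling back to statusAt
    # when endAt is null. endAt on non-done matches is never consulted.
    ends = [parse_ts(m.get("endAt")) or parse_ts(m.get("statusAt")) for m in matches]
    ends = [ts for ts in ends if ts is not None]
    return start, (max(ends) if ends else None)
```

Returning `None` as the end time while any match is still open means the 5-minute poll can call this on every tick and only store a snapshot once a real end time appears.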

### Technical Notes
- **Dependencies:** Requires adding `scikit-learn` to `requirements.txt`. `joblib` is installed automatically as a dependency of scikit-learn and does not need to be added separately.
- **Model Storage:** The trained model is serialized to bytes via joblib and stored as a binary field in MongoDB. This avoids local file storage on Pella and ensures the model persists across restarts.
- **Features:** `round_position_from_end`, `match_count_in_round`, `bottleneck_count`, `time_of_day` (hour), `day_of_week`, `team_count`
- **Target:** `duration_per_match` per round (i.e. `round_duration / match_count_in_round`)
- **Round Position:** Stored relative to the end of the bracket (e.g. finals, semis, quarters) rather than absolute round number, so data generalizes across different bracket sizes.
- **Cold Start:** ETA field is hidden entirely until a minimum number of tourneys have been recorded. Threshold to be configured in `features/config.py`.
- **Retraining:** Model retrains automatically on `!endtourney` for opted-in tourneys only. Test tourneys (`collect_data=False`) never contribute to training data.
- **Validation:** Phase 1 data should be collected and reviewed across several tourneys before Phase 2 is built. Compare statistical predictions vs actual end times before committing to the full ML approach.
- **Data Integrity:** `endAt` may be populated on "in-progress" matches, where it appears to reflect individual game/set completion rather than match completion. Only use `endAt` from matches with status "done" to avoid incorrect round end timestamps.
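
To make the target, the round position, and the Phase 1 statistical check more concrete, here is a small standard-library sketch. It is illustrative only: the function names are hypothetical, rounds are assumed to be numbered from 1 within a single bracket, durations are assumed to be in seconds, and "mean ± one standard deviation" is one possible reading of the average-and-standard-deviation range described above.

```python
from statistics import mean, stdev


def round_position_from_end(round_number: int, total_rounds: int) -> int:
    """Position counted back from the final: 0 = finals, 1 = semis, 2 = quarters.

    Assumption: round_number is 1-indexed and total_rounds is known for the bracket.
    """
    return total_rounds - round_number


def duration_per_match(round_duration_s: float, match_count_in_round: int) -> float:
    """Target value: round_duration / match_count_in_round."""
    return round_duration_s / match_count_in_round


def baseline_round_range(
    history_per_match_s: list[float], match_count_in_round: int
) -> tuple[float, float]:
    """Estimate a (low, high) round duration from historical duration_per_match.

    history_per_match_s holds past duration_per_match values recorded for the
    same round position. The range is mean ± 1 standard deviation (an assumed
    choice), scaled back up by the number of matches in the round.
    """
    mu = mean(history_per_match_s)
    # stdev needs at least two samples; with one sample the range collapses.
    sigma = stdev(history_per_match_s) if len(history_per_match_s) > 1 else 0.0
    low = max(0.0, mu - sigma) * match_count_in_round
    high = (mu + sigma) * match_count_in_round
    return low, high
```

Summing the low and high values over the remaining rounds gives a rough overall range to compare against actual end times before committing to the Random Forest.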
