Epic: ML-Based Tournament Duration Prediction #203

Description

@RemainingDelta

Overview

This epic introduces a two-phase machine learning system that predicts remaining tournament time and displays it in the live bracket progress embed. Phase 1 silently collects round-level data during real tourneys. Phase 2 trains a Random Forest model on that data and surfaces a time range estimate to staff, refreshing every 5 minutes alongside the existing progress dashboard.

Success Metrics

  • Prediction Accuracy: Predicted time range contains the actual end time in the majority of completed tourneys once the model has sufficient training data.
  • Data Collection Reliability: Round duration snapshots are successfully captured for every round of every opted-in tourney without missing entries.
  • Cold Start Handling: ETA field is correctly hidden until the minimum data threshold is met, with no erroneous predictions shown on insufficient data.
  • Model Stability: Model retrains successfully after every !endtourney without errors and loads correctly from MongoDB on bot startup.

User Personas

  • Tourney Admin: Needs a reliable at-a-glance estimate of how much longer the tourney will run so they can communicate with participants, plan staffing, and manage expectations. Does not need to understand the model — just needs a trustworthy time range in the progress embed.
  • Bot Developer: Needs clear opt-in controls so test tourneys never pollute the training data, and needs the model to retrain and persist automatically without manual intervention.

High-Level Requirements

  • Data Collection Toggle: Add an optional collect_data boolean parameter to the set-matcherino-id command (defaults to False). !starttourney should remind staff to set this flag if they want data collected. !endtourney automatically resets it to False.
  • Round Snapshot Storage: Piggyback on the existing 5-minute progress_dashboard_task poll to detect round transitions and store per-round data in MongoDB.
  • Round Transition Detection: Use max(endAt) across done matches per round to determine round end time, and min(statusAt) across in-progress/done matches to approximate round start time. Fall back to statusAt if endAt is null on a done match.
  • Statistical Validation Layer: Before introducing the ML model, validate predictions using a simple per-round average and standard deviation range across historical data. Only proceed to the full ML model if the statistical approach proves insufficient.
  • ML Model: Train a Random Forest Regressor via scikit-learn using round-level features to predict duration_per_match per remaining round. Serialize with joblib and store as binary in MongoDB. Load into memory on bot startup.
  • ETA Display: Add an ETA field to the existing bracket progress embed showing a time range derived from the spread across Random Forest decision trees. Hide entirely until the minimum tourney threshold is met.
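The round transition rule above could be sketched roughly as follows. This is a minimal sketch, not the bot's actual code: it assumes each match is a dict carrying `status`, `statusAt`, and `endAt`, that timestamps are ISO 8601 strings, and that a round only counts as finished once every match in it is "done". The helper names `parse_ts` and `round_window` are hypothetical.

```python
from datetime import datetime
from typing import Dict, List, Optional, Tuple


def parse_ts(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO 8601 timestamp string; None or empty passes through as None."""
    return datetime.fromisoformat(value) if value else None


def round_window(matches: List[Dict]) -> Tuple[Optional[datetime], Optional[datetime]]:
    """Return (approx_start, end) for the matches of a single round.

    approx_start: min(statusAt) across in-progress/done matches.
    end: max(endAt) across done matches, falling back to statusAt when
         endAt is null. Stays None until every match is "done" (assumption).
    """
    started = [m for m in matches if m["status"] in ("in-progress", "done")]
    done = [m for m in matches if m["status"] == "done"]

    starts = [parse_ts(m.get("statusAt")) for m in started]
    starts = [t for t in starts if t is not None]
    approx_start = min(starts) if starts else None

    # endAt on an in-progress match is ignored: only "done" matches count.
    if not matches or len(done) < len(matches):
        return approx_start, None

    ends = [parse_ts(m.get("endAt")) or parse_ts(m.get("statusAt")) for m in done]
    ends = [t for t in ends if t is not None]
    return approx_start, (max(ends) if ends else None)
```

Called from the 5-minute poll, a non-None end for a round that had no end on the previous poll would mark the transition and trigger a snapshot write.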
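For the statistical validation layer, one possible shape of the per-round average and standard deviation range is sketched below. The function name `baseline_eta`, the `history` layout (round position from the end mapped to past round durations in minutes), and the choice to sum mean ± one standard deviation across remaining rounds are all assumptions made for illustration.

```python
from statistics import mean, stdev
from typing import Dict, List, Optional, Tuple


def baseline_eta(
    history: Dict[int, List[float]],
    remaining_positions: List[int],
    min_samples: int = 2,
) -> Optional[Tuple[float, float]]:
    """Return a (low, high) estimate in minutes for the remaining rounds.

    Returns None if any remaining round has too few historical samples,
    which mirrors hiding the ETA field on insufficient data.
    """
    low = high = 0.0
    for pos in remaining_positions:
        samples = history.get(pos, [])
        # stdev() needs at least two data points.
        if len(samples) < max(min_samples, 2):
            return None
        mu, sigma = mean(samples), stdev(samples)
        low += max(mu - sigma, 0.0)  # a duration cannot be negative
        high += mu + sigma
    return low, high
```

Summing the per-round bounds treats the rounds as moving together, so the range is on the wide side. That seems a reasonable starting point when the goal is to check whether the actual end time lands inside the range.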

Technical Notes

  • Dependencies: Requires adding scikit-learn to requirements.txt. joblib is a required dependency of scikit-learn and is installed automatically alongside it, so it does not need to be added separately.
  • Model Storage: The trained model is serialized to bytes via joblib and stored as a binary field in MongoDB. This avoids local file storage on Pella and ensures the model persists across restarts.
  • Features: round_position_from_end, match_count_in_round, bottleneck_count, time_of_day (hour), day_of_week, team_count
  • Target: duration_per_match per round (i.e. round_duration / match_count_in_round)
  • Round Position: Stored relative to the end of the bracket (e.g. finals, semis, quarters) rather than absolute round number, so data generalizes across different bracket sizes.
  • Cold Start: ETA field is hidden entirely until a minimum number of tourneys have been recorded. Threshold to be configured in features/config.py.
  • Retraining: Model retrains automatically on !endtourney for opted-in tourneys only. Test tourneys (collect_data=False) never contribute to training data.
  • Validation: Phase 1 data should be collected and reviewed across several tourneys before Phase 2 is built. Compare statistical predictions vs actual end times before committing to the full ML approach.
  • Data Integrity: endAt may be populated on "in-progress" matches, reflecting individual game/set completion rather than the end of the match. Only use endAt from matches with status "done" to avoid incorrect round end timestamps.
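The Round Position and Target notes can be illustrated with two small helpers. This is a sketch under assumptions: round numbers are taken to be 1-based within a single-elimination bracket, finals are encoded as 0, and durations are in seconds. The function names mirror the feature names above but are otherwise hypothetical.

```python
def round_position_from_end(round_number: int, total_rounds: int) -> int:
    """Map an absolute 1-based round number to its distance from the finals.

    0 = finals, 1 = semis, 2 = quarters, and so on, regardless of bracket size.
    """
    if not 1 <= round_number <= total_rounds:
        raise ValueError("round_number must be between 1 and total_rounds")
    return total_rounds - round_number


def duration_per_match(round_duration: float, match_count_in_round: int) -> float:
    """Training target: round_duration / match_count_in_round."""
    if match_count_in_round <= 0:
        raise ValueError("match_count_in_round must be positive")
    return round_duration / match_count_in_round
```

With this encoding, round 3 of a 5-round bracket and round 1 of a 3-round bracket both map to position 2 (quarters), which is what lets snapshots from different bracket sizes share one feature.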
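A rough sketch of the Phase 2 pieces (training, a range from the spread across trees, and serialization to bytes) might look like the following. The training rows here are synthetic and exist only to make the sketch runnable; they are not real tourney data. The 10th/90th percentile choice for the range and the helper names `predict_range`, `to_bytes`, and `from_bytes` are assumptions, not part of the epic.

```python
import io

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

FEATURES = [
    "round_position_from_end", "match_count_in_round", "bottleneck_count",
    "time_of_day", "day_of_week", "team_count",
]

# Synthetic placeholder rows, one column per entry in FEATURES.
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([
    rng.integers(0, 6, n),    # round_position_from_end
    rng.integers(1, 33, n),   # match_count_in_round
    rng.integers(0, 4, n),    # bottleneck_count
    rng.integers(0, 24, n),   # time_of_day (hour)
    rng.integers(0, 7, n),    # day_of_week
    rng.integers(8, 129, n),  # team_count
]).astype(float)
# Synthetic target: duration_per_match in seconds.
y = 300.0 + 60.0 * X[:, 0] + rng.normal(0.0, 30.0, n)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)


def predict_range(model, row, lo_pct=10.0, hi_pct=90.0):
    """Return (low, high) duration_per_match from the spread across trees."""
    row = np.asarray(row, dtype=float).reshape(1, -1)
    per_tree = np.array([tree.predict(row)[0] for tree in model.estimators_])
    lo, hi = np.percentile(per_tree, [lo_pct, hi_pct])
    return float(lo), float(hi)


def to_bytes(model) -> bytes:
    """Serialize the model with joblib into an in-memory buffer."""
    buf = io.BytesIO()
    joblib.dump(model, buf)
    return buf.getvalue()


def from_bytes(blob: bytes):
    """Rebuild a model from bytes produced by to_bytes()."""
    return joblib.load(io.BytesIO(blob))
```

The bytes returned by `to_bytes` could then be written to a binary field in a MongoDB document and passed to `from_bytes` on bot startup. Multiplying each remaining round's per-match range by its match count and summing would give a tournament-level range for the embed.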
