Overview
This epic introduces a two-phase machine learning system that predicts remaining tournament time and displays it in the live bracket progress embed. Phase 1 silently collects round-level data during real tourneys. Phase 2 trains a Random Forest model on that data and surfaces a time range estimate to staff, refreshing every 5 minutes alongside the existing progress dashboard.
Success Metrics
- Prediction Accuracy: Predicted time range contains the actual end time in the majority of completed tourneys once the model has sufficient training data.
- Data Collection Reliability: Round duration snapshots are successfully captured for every round of every opted-in tourney without missing entries.
- Cold Start Handling: ETA field is correctly hidden until the minimum data threshold is met, with no erroneous predictions shown on insufficient data.
- Model Stability: Model retrains successfully after every !endtourney without errors and loads correctly from MongoDB on bot startup.
User Personas
- Tourney Admin: Needs a reliable at-a-glance estimate of how much longer the tourney will run so they can communicate with participants, plan staffing, and manage expectations. Does not need to understand the model — just needs a trustworthy time range in the progress embed.
- Bot Developer: Needs clear opt-in controls so test tourneys never pollute the training data, and needs the model to retrain and persist automatically without manual intervention.
High-Level Requirements
- Data Collection Toggle: Add an optional collect_data boolean parameter to the set-matcherino-id command (defaults to False). !starttourney should remind staff to set this flag if they want data collected. !endtourney automatically resets it to False.
- Round Snapshot Storage: Piggyback on the existing 5-minute progress_dashboard_task poll to detect round transitions and store per-round data in MongoDB.
- Round Transition Detection: Use max(endAt) across done matches per round to determine round end time, and min(statusAt) across in-progress/done matches to approximate round start time. Fall back to statusAt if endAt is null on a done match.
- Statistical Validation Layer: Before introducing the ML model, validate predictions using a simple per-round average and standard deviation range across historical data. Only proceed to the full ML model if the statistical approach proves insufficient.
- ML Model: Train a Random Forest Regressor via scikit-learn using round-level features to predict duration_per_match for each remaining round. Serialize with joblib and store as binary in MongoDB. Load into memory on bot startup.
- ETA Display: Add an ETA field to the existing bracket progress embed showing a time range derived from the spread across Random Forest decision trees. Hide entirely until the minimum tourney threshold is met.
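The round transition rule above can be sketched as follows. This is a minimal sketch under assumptions: each match is a dict with status, statusAt and endAt keys holding ISO 8601 strings, the status values are "in-progress" and "done", and a round counts as finished only once every match in it is done. The function names and the payload shape are hypothetical and are not taken from the Matcherino API.

```python
from __future__ import annotations

from datetime import datetime
from typing import Optional


def parse_ts(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO 8601 timestamp, returning None for null or empty values."""
    return datetime.fromisoformat(value) if value else None


def round_window(matches: list[dict]) -> tuple[Optional[datetime], Optional[datetime]]:
    """Return (start, end) for one round; end stays None until the round is over."""
    started = [m for m in matches if m["status"] in ("in-progress", "done")]
    done = [m for m in matches if m["status"] == "done"]

    # Approximate round start: earliest statusAt among in-progress/done matches.
    starts = [ts for m in started if (ts := parse_ts(m.get("statusAt")))]
    start = min(starts) if starts else None

    # Assumption: the round has ended only when every match in it is done.
    if not matches or len(done) != len(matches):
        return start, None

    # Round end: latest endAt among done matches, falling back to statusAt
    # when endAt is null. endAt on in-progress matches is never read.
    ends = [ts for m in done if (ts := parse_ts(m.get("endAt") or m.get("statusAt")))]
    return start, (max(ends) if ends else None)
```

Because the function is pure, it could be called from each progress_dashboard_task poll and a snapshot written only when the returned end changes from None to a timestamp.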
Technical Notes
- Dependencies: Requires adding scikit-learn to requirements.txt. joblib is installed automatically as a scikit-learn dependency and does not need to be added separately.
- Model Storage: The trained model is serialized to bytes via joblib and stored as a binary field in MongoDB. This avoids local file storage on Pella and ensures the model persists across restarts.
- Features: round_position_from_end, match_count_in_round, bottleneck_count, time_of_day (hour), day_of_week, team_count
- Target: duration_per_match per round (i.e. round_duration / match_count_in_round)
- Round Position: Stored relative to the end of the bracket (e.g. finals, semis, quarters) rather than absolute round number, so data generalizes across different bracket sizes.
- Cold Start: ETA field is hidden entirely until a minimum number of tourneys have been recorded. Threshold to be configured in features/config.py.
- Retraining: Model retrains automatically on !endtourney for opted-in tourneys only. Test tourneys (collect_data=False) never contribute to training data.
- Validation: Phase 1 data should be collected and reviewed across several tourneys before Phase 2 is built. Compare statistical predictions vs actual end times before committing to the full ML approach.
- Data Integrity: endAt may be populated on "in-progress" matches, where it reflects individual game/set completion. Only use endAt from matches with status "done" to avoid incorrect round end timestamps.
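The target, round position, and cold start notes above can be combined into a rough sketch of the Phase 1 statistical baseline (per-round average plus or minus one standard deviation). It is not the Random Forest model. Several things here are assumptions for illustration: durations are in minutes, each stored row is a dict with round_duration, match_count_in_round and round_position_from_end keys, and the names MIN_TOURNEYS, baseline_ranges and estimate_remaining are hypothetical.

```python
from __future__ import annotations

from collections import defaultdict
from statistics import mean, stdev
from typing import Optional

# Hypothetical threshold; the real value would be configured in features/config.py.
MIN_TOURNEYS = 5


def round_position_from_end(round_number: int, total_rounds: int) -> int:
    """0 = finals, 1 = semis, 2 = quarters, regardless of bracket size."""
    return total_rounds - round_number


def baseline_ranges(rows: list[dict]) -> dict[int, tuple[float, float]]:
    """Per-position (low, high) range of duration_per_match: mean -/+ one std dev."""
    by_position: dict[int, list[float]] = defaultdict(list)
    for row in rows:
        # Target as defined in the notes: round_duration / match_count_in_round.
        per_match = row["round_duration"] / row["match_count_in_round"]
        by_position[row["round_position_from_end"]].append(per_match)

    ranges: dict[int, tuple[float, float]] = {}
    for position, values in by_position.items():
        center = mean(values)
        spread = stdev(values) if len(values) > 1 else 0.0
        ranges[position] = (max(center - spread, 0.0), center + spread)
    return ranges


def estimate_remaining(
    rows: list[dict],
    remaining: list[tuple[int, int]],
    tourney_count: int,
) -> Optional[tuple[float, float]]:
    """Sum per-round ranges over remaining (position, match_count) pairs.

    Returns None (ETA hidden) below the cold start threshold or when a
    remaining round position has no history yet.
    """
    if tourney_count < MIN_TOURNEYS:
        return None
    ranges = baseline_ranges(rows)
    low_total = high_total = 0.0
    for position, match_count in remaining:
        if position not in ranges:
            return None
        low, high = ranges[position]
        # Invert the target: round duration = duration_per_match * match count.
        low_total += low * match_count
        high_total += high * match_count
    return low_total, high_total
```

Comparing the returned range with the actual remaining time across several recorded tourneys gives the evidence needed to decide whether the full ML model is worth building.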