Predicting network throughput under congested and non-congested conditions using machine learning — built on a Mininet-emulated topology running on Microsoft Azure.
Modern applications like video streaming, cloud gaming, and real-time conferencing depend on stable network throughput. This project builds an end-to-end ML pipeline that predicts achievable throughput (Mbps) from lightweight, externally observable network metrics — RTT and jitter — without needing access to internal network state.
Five regression models were trained and evaluated on 600 samples generated from controlled Mininet experiments. Tree-based models outperformed the linear baselines (R² of roughly 0.48 to 0.50 versus 0.148), which points to a non-linear relationship between delay dynamics and throughput.
| Model | MAE (Mbps) | RMSE (Mbps) | R² |
|---|---|---|---|
| Linear Regression | 8.14 | 11.58 | 0.148 |
| Ridge Regression | 8.14 | 11.59 | 0.148 |
| KNN Regressor | 6.76 | 9.96 | 0.370 |
| Decision Tree | 6.28 | 8.90 | 0.497 |
| Random Forest | 6.18 | 9.01 | 0.484 |
Random Forest achieved the lowest MAE, while Decision Tree achieved the lowest RMSE and the highest R². Both results indicate that non-linear, tree-based models are better suited than linear ones for throughput prediction under mixed network conditions.
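The three metrics in the table can rank models differently, because RMSE and R² penalize a few large misses more heavily than MAE does. The following is a minimal sketch of how they are computed, using made-up throughput values rather than the project's data; the function name `regression_metrics` is hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)           # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2


# Made-up throughput values in Mbps, for illustration only.
actual = [10.0, 20.0, 30.0, 40.0]
model_a = [12.0, 18.0, 32.0, 38.0]  # every prediction is off by 2 Mbps
model_b = [10.0, 20.0, 30.0, 34.0]  # three exact predictions, one 6 Mbps miss

mae_a, rmse_a, r2_a = regression_metrics(actual, model_a)
mae_b, rmse_b, r2_b = regression_metrics(actual, model_b)

print(f"model_a: MAE={mae_a:.2f} RMSE={rmse_a:.2f} R2={r2_a:.3f}")
print(f"model_b: MAE={mae_b:.2f} RMSE={rmse_b:.2f} R2={r2_b:.3f}")
```

In this toy case `model_b` has the lower MAE but the higher RMSE and lower R², the same kind of split seen between Random Forest and Decision Tree in the table.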
Feature importance (Random Forest): RTT dominates as the primary predictor of throughput. Jitter contributes a secondary signal. RTT and jitter have low mutual correlation, confirming they provide independent predictive information.
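As a rough illustration of how these two checks can be made with scikit-learn and NumPy, the sketch below fits a Random Forest on synthetic data and reads `feature_importances_` and the RTT/jitter correlation. The data, the threshold at 50 ms, and the names `rtt_ms` and `jitter_ms` are assumptions for the example; it does not reproduce the project's dataset or numbers.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 600

# Synthetic features (assumed ranges), drawn independently of each other.
rtt = rng.uniform(1.0, 100.0, n)     # ms
jitter = rng.uniform(0.0, 10.0, n)   # ms

# Synthetic target: a step change at an arbitrary RTT threshold,
# a mild jitter effect, and measurement noise.
throughput = np.where(rtt > 50.0, 20.0, 60.0) - 0.5 * jitter + rng.normal(0.0, 2.0, n)

X = np.column_stack([rtt, jitter])
forest = RandomForestRegressor(n_estimators=150, max_depth=6, random_state=0)
forest.fit(X, throughput)

# Impurity-based importances, normalized by scikit-learn to sum to 1.
importances = dict(zip(["rtt_ms", "jitter_ms"], forest.feature_importances_))

# Pearson correlation between the two features.
feature_corr = np.corrcoef(rtt, jitter)[0, 1]

print(importances)
print(f"corr(rtt, jitter) = {feature_corr:.3f}")
```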
Azure VM (Ubuntu 24.04)
└── Mininet (2-host, 1-switch topology)
    ├── Traffic Control (tc): congestion/bandwidth limits
    ├── ping → RTT + jitter measurement
    └── iperf3 → throughput measurement
        └── final_dataset.csv (600 samples)
            └── ML Pipeline
                ├── 70/30 train-test split
                ├── Feature standardization
                ├── 5 regression models
                └── Metrics + diagnostic plots
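The ping step in the pipeline above yields both features. A minimal sketch of one way to derive them from ping's text output follows. It assumes jitter is taken as the mean absolute difference between consecutive RTTs; the project's `collect_quick.sh` may compute it differently (for example from ping's reported `mdev`). The function name and the sample output are hypothetical.

```python
import re
from statistics import mean

# Matches the per-reply latency that Linux ping prints, e.g. "time=10.0 ms".
RTT_PATTERN = re.compile(r"time=([\d.]+) ms")


def rtt_and_jitter(ping_output):
    """Return (mean RTT, jitter) in ms parsed from ping's text output.

    Jitter here is the mean absolute difference between consecutive
    RTT samples (an assumed definition, not confirmed by the project).
    """
    rtts = [float(value) for value in RTT_PATTERN.findall(ping_output)]
    if len(rtts) < 2:
        raise ValueError("need at least two replies to estimate jitter")
    diffs = [abs(b - a) for a, b in zip(rtts, rtts[1:])]
    return mean(rtts), mean(diffs)


# Hypothetical ping output between two Mininet hosts.
sample = """\
64 bytes from 10.0.0.2: icmp_seq=1 ttl=64 time=10.0 ms
64 bytes from 10.0.0.2: icmp_seq=2 ttl=64 time=12.0 ms
64 bytes from 10.0.0.2: icmp_seq=3 ttl=64 time=11.0 ms
"""

avg_rtt, jitter = rtt_and_jitter(sample)
print(avg_rtt, jitter)
```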
- 600 samples from 150 experimental runs × 4 scenarios per run
- Features: RTT (ms), Jitter (ms)
- Target: Throughput (Mbps)
- Scenarios: Congested (Token Bucket Filter bandwidth limiting via `tc`) and non-congested (netem delay/loss emulation)
- Split: 70% train / 30% test with standardized feature scaling
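A minimal sketch of the split and scaling step is shown below. It uses randomly generated placeholder arrays in place of `final_dataset.csv`, and the `random_state` value is an assumption. Fitting the scaler on the training split only is a common way to keep test-set statistics out of training; whether `split_data.py` does exactly this is not confirmed here.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Placeholder for the 600-row dataset: columns are [rtt_ms, jitter_ms].
X = rng.uniform([1.0, 0.0], [100.0, 10.0], size=(600, 2))
y = rng.uniform(5.0, 95.0, size=600)  # throughput in Mbps (random placeholder)

# 70% train / 30% test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

# Learn mean and standard deviation from the training split only,
# then apply the same transform to both splits.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

print(X_train_std.shape, X_test_std.shape)
```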
| Model | Notes |
|---|---|
| Linear Regression | Baseline — assumes linear RTT/jitter → throughput relationship |
| Ridge Regression | L2 regularization for noise stability |
| KNN Regressor | k=15, distance-weighted — captures local non-linearity |
| Decision Tree | max_depth=6, min_samples_leaf=10 — learns threshold-like congestion behavior |
| Random Forest | 150 trees, max_depth=6 — ensemble reduces variance, best MAE |
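The table above maps onto scikit-learn estimators roughly as follows. This is a sketch, not the contents of `train_models.py`: the Ridge `alpha`, the `random_state` values, and the synthetic threshold-like data are assumptions, while the remaining hyperparameters are taken from the table.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

models = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=1.0),  # alpha assumed (scikit-learn default)
    "KNN Regressor": KNeighborsRegressor(n_neighbors=15, weights="distance"),
    "Decision Tree": DecisionTreeRegressor(
        max_depth=6, min_samples_leaf=10, random_state=0
    ),
    "Random Forest": RandomForestRegressor(
        n_estimators=150, max_depth=6, random_state=0
    ),
}

# Synthetic, already-standardized features with a step-like target,
# standing in for the real train/test CSV files.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 2))
y = np.where(X[:, 0] > 0.0, 20.0, 60.0) + rng.normal(0.0, 3.0, 600)
X_train, X_test = X[:420], X[420:]
y_train, y_test = y[:420], y[420:]

# Fit each model and record its test MAE in Mbps.
mae_by_model = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    mae_by_model[name] = mean_absolute_error(y_test, model.predict(X_test))

for name, mae in mae_by_model.items():
    print(f"{name}: MAE = {mae:.2f} Mbps")
```

On step-like data of this kind the tree-based models can place a split at the threshold, which a single linear fit cannot do.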
- Correlation heatmap — RTT vs jitter vs throughput (RTT: +0.37, jitter: -0.10)
- Actual vs predicted plot — Random Forest predictions vs ground truth
- Residual plot — errors centered around zero, larger deviations at high throughput
- Feature importance chart — RTT dominates over jitter in Random Forest
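The pattern described for the residual plot (errors centered near zero, wider spread at high throughput) can also be checked numerically. The sketch below uses placeholder predictions whose noise is deliberately made to grow with throughput; it is an illustration of the check, not output from `plot_residuals.py`.

```python
import numpy as np

rng = np.random.default_rng(1)

# Placeholder test-set values in Mbps (180 rows, matching a 30% split of 600).
actual = rng.uniform(5.0, 95.0, 180)
# Placeholder predictions: noise scale increases with throughput by construction.
predicted = actual + rng.normal(0.0, 1.0, 180) * (1.0 + actual / 20.0)

residuals = actual - predicted
mean_residual = residuals.mean()

# Compare residual spread below and above the median throughput.
median_tp = np.median(actual)
spread_low = residuals[actual <= median_tp].std()
spread_high = residuals[actual > median_tp].std()

print(f"mean residual: {mean_residual:.2f} Mbps")
print(f"std (low throughput):  {spread_low:.2f} Mbps")
print(f"std (high throughput): {spread_high:.2f} Mbps")
```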
- Python, Pandas, NumPy, scikit-learn
- Mininet (network emulation)
- Azure VM — Ubuntu 24.04
- iperf3 (throughput measurement)
- ping / ICMP (RTT + jitter measurement)
- Linux `tc` (traffic control for congestion generation)
- Matplotlib, Seaborn (visualization)
ACN-Final-Project/
├── data/ # Raw and processed dataset files
├── models/ # Saved model artifacts
├── results/ # Output plots and evaluation results
├── collect_quick.sh # Mininet data collection script
├── split_data.py # Train/test split and preprocessing
├── train_models.py # Model training
├── test_models.py # Model evaluation and metrics
├── eda_heatmap.py # Correlation heatmap
├── plot_actual_vs_predicted.py # Actual vs predicted visualization
├── plot_residuals.py # Residual analysis
├── plot_feature_importance.py # Random Forest feature importance
├── final_dataset.csv # Complete dataset (600 samples)
├── train_data.csv # Training split
└── test_data.csv # Test split
git clone https://github.com/RajaReddy1718/ACN-Final-Project.git
cd ACN-Final-Project
pip install pandas numpy scikit-learn matplotlib seaborn
Requires Mininet installed on Linux/Azure VM
sudo python3 -c "from mininet.net import Mininet; print('Mininet ready')"
sudo bash collect_quick.sh
python train_models.py
python test_models.py
python eda_heatmap.py
python plot_actual_vs_predicted.py
python plot_residuals.py
python plot_feature_importance.py
- Linear models fail to capture congestion behavior: throughput drops are threshold-like, not gradual
- Decision Tree best explains variance (R² = 0.497), learning abrupt congestion boundaries
- Random Forest has the lowest MAE (6.18 Mbps) and is robust through ensembling, although Decision Tree has a slightly lower RMSE (8.90 vs. 9.01 Mbps)
- RTT is the dominant predictor — jitter adds a secondary independent signal
- Throughput can be predicted from lightweight external probes only — no internal network state required
- Rajasekhar Reddy Kallam — University of Nebraska at Omaha
- Dev Patel — University of Nebraska at Omaha
- Murali Krishna Panguluri — University of Nebraska at Omaha
Advanced Computer Networks — University of Nebraska at Omaha, Fall 2024