ML-Based Network Throughput Prediction

Predicting network throughput under congested and non-congested conditions using machine learning — built on a Mininet-emulated topology running on Microsoft Azure.

Overview

Modern applications like video streaming, cloud gaming, and real-time conferencing depend on stable network throughput. This project builds an end-to-end ML pipeline that predicts achievable throughput (Mbps) from lightweight, externally observable network metrics — RTT and jitter — without needing access to internal network state.

Five regression models were trained and evaluated on 600 samples generated from controlled Mininet experiments. The tree-based models clearly outperformed the linear baselines (R² of about 0.49 versus 0.15), which indicates a non-linear relationship between delay dynamics and throughput.

Results

Model	MAE (Mbps)	RMSE (Mbps)	R²
Linear Regression	8.14	11.58	0.148
Ridge Regression	8.14	11.59	0.148
KNN Regressor	6.76	9.96	0.370
Decision Tree	6.28	8.90	0.497
Random Forest	6.18	9.01	0.484

Random Forest achieved the lowest MAE (6.18 Mbps), while Decision Tree achieved the lowest RMSE (8.90 Mbps) and the highest R² (0.497). Of the five models tested, the two tree-based ones fit throughput best under mixed network conditions.
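For reference, the three metrics in the table can be computed as in the minimal sketch below. It uses plain Python and made-up throughput values, so it illustrates the definitions only; it is not the project's evaluation code, and the function name `regression_metrics` is hypothetical.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)           # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2

# Illustrative throughput values in Mbps (not taken from final_dataset.csv).
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
mae, rmse, r2 = regression_metrics(actual, predicted)
print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  R2={r2:.3f}")
```

Because RMSE squares each error before averaging, it penalizes a few large misses more heavily than MAE does. That is how one model can have the lowest MAE while another has the lowest RMSE, as in the table above.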

Feature importance (Random Forest): RTT dominates as the primary predictor of throughput, and jitter contributes a secondary signal. RTT and jitter have low mutual correlation, which suggests they carry largely independent predictive information.
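As a rough illustration of where such importances come from, the sketch below fits a scikit-learn Random Forest and reads its impurity-based `feature_importances_`. The data is synthetic and the coefficients are invented so that RTT drives the target more than jitter; only `n_estimators=150` and `max_depth=6` come from this README, and `random_state` is an assumption added for reproducibility.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
rtt = rng.uniform(1, 100, 600)      # synthetic RTT values in ms
jitter = rng.uniform(0, 10, 600)    # synthetic jitter values in ms

# Synthetic target: mostly driven by RTT, with a weaker jitter effect.
throughput = 50 - 0.4 * rtt - 0.5 * jitter + rng.normal(0, 1, 600)

X = np.column_stack([rtt, jitter])
forest = RandomForestRegressor(n_estimators=150, max_depth=6, random_state=0)
forest.fit(X, throughput)

# Importances are normalized so that they sum to 1 across features.
for name, score in zip(["RTT", "Jitter"], forest.feature_importances_):
    print(f"{name}: {score:.3f}")
```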

Pipeline

Azure VM (Ubuntu 24.04)
    └── Mininet (2-host, 1-switch topology)
            ├── Traffic Control (tc) — congestion/bandwidth limits
            ├── ping → RTT + jitter measurement
            └── iperf3 → throughput measurement
                    └── final_dataset.csv (600 samples)
                            └── ML Pipeline
                                    ├── 70/30 train-test split
                                    ├── Feature standardization
                                    ├── 5 regression models
                                    └── Metrics + diagnostic plots
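The traffic-control stage of the pipeline could be expressed roughly as follows. This sketch only builds `tc` command lines for the two scenario types (a Token Bucket Filter limit and a netem delay/loss profile) and does not run them. The interface name `h1-eth0` follows Mininet's default naming, and every rate, delay, and loss value here is a placeholder rather than a setting taken from collect_quick.sh.

```python
import shlex

def tbf_command(dev, rate_mbit, burst_kbit=32, latency_ms=400):
    """Build a tc command that rate-limits `dev` with a Token Bucket Filter."""
    return [
        "tc", "qdisc", "add", "dev", dev, "root", "tbf",
        "rate", f"{rate_mbit}mbit",
        "burst", f"{burst_kbit}kbit",
        "latency", f"{latency_ms}ms",
    ]

def netem_command(dev, delay_ms, jitter_ms=0, loss_pct=0):
    """Build a tc command that adds netem delay (with optional jitter and loss)."""
    cmd = ["tc", "qdisc", "add", "dev", dev, "root", "netem",
           "delay", f"{delay_ms}ms"]
    if jitter_ms:
        cmd.append(f"{jitter_ms}ms")      # second value after delay is the jitter
    if loss_pct:
        cmd += ["loss", f"{loss_pct}%"]
    return cmd

# Placeholder values for illustration only.
congested = tbf_command("h1-eth0", rate_mbit=10)
non_congested = netem_command("h1-eth0", delay_ms=20, jitter_ms=5, loss_pct=1)

print(shlex.join(congested))
print(shlex.join(non_congested))
```

Inside a Mininet script, strings like these would typically be passed to a host's `cmd()` method so that they run in that host's network namespace.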

Dataset

600 samples from 150 experimental runs × 4 scenarios per run
Features: RTT (ms), Jitter (ms)
Target: Throughput (Mbps)
Scenarios: Congested (Token Bucket Filter bandwidth limiting via tc) and non-congested (netem delay/loss emulation)
Split: 70% train / 30% test with standardized feature scaling
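The split and scaling step can be sketched with the standard library alone. The rows below are synthetic stand-ins for final_dataset.csv, the seed is arbitrary, and the helper names `fit_scaler` and `transform` are hypothetical; how split_data.py implements this step is not shown in this README. The point of the sketch is that the scaling statistics are computed from the training rows only and then reused for the test rows.

```python
import random
import statistics

random.seed(42)  # fixed seed so the split is reproducible

# Synthetic stand-in for the dataset: 600 rows of (rtt_ms, jitter_ms).
rows = [(random.uniform(1, 100), random.uniform(0, 10)) for _ in range(600)]

# Shuffle the row indices, then keep 70% for training and 30% for testing.
indices = list(range(len(rows)))
random.shuffle(indices)
n_train = len(rows) * 7 // 10
train = [rows[i] for i in indices[:n_train]]
test = [rows[i] for i in indices[n_train:]]

def fit_scaler(data):
    """Return (mean, population std) for each feature column."""
    return [(statistics.fmean(col), statistics.pstdev(col)) for col in zip(*data)]

def transform(data, params):
    """Standardize each row using previously fitted (mean, std) pairs."""
    return [tuple((x - m) / s for x, (m, s) in zip(row, params)) for row in data]

params = fit_scaler(train)               # statistics from training rows only
train_scaled = transform(train, params)
test_scaled = transform(test, params)    # test rows reuse the training statistics

print(len(train_scaled), len(test_scaled))
```

Fitting the scaler on the full dataset instead would leak information about the test rows into training, which tends to make test metrics look slightly better than they should.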

Models

Model	Notes
Linear Regression	Baseline — assumes linear RTT/jitter → throughput relationship
Ridge Regression	L2 regularization for noise stability
KNN Regressor	k=15, distance-weighted — captures local non-linearity
Decision Tree	max_depth=6, min_samples_leaf=10 — learns threshold-like congestion behavior
Random Forest	150 trees, max_depth=6 — ensemble reduces variance, best MAE
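The five models could be instantiated in scikit-learn roughly as shown below. The hyperparameters listed in the table are used as given; everything else is an assumption, including the synthetic data (a throughput step at an arbitrary RTT threshold), the `random_state` values, and Ridge's default `alpha`, which this README does not state. The printed numbers therefore have no relation to the results reported above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Synthetic features: column 0 is RTT (ms), column 1 is jitter (ms).
X = np.column_stack([rng.uniform(1, 100, 600), rng.uniform(0, 10, 600)])
# Synthetic target: throughput drops sharply once RTT crosses 50 ms.
y = np.where(X[:, 0] > 50, 10.0, 40.0) + rng.normal(0, 2, 600)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
scaler = StandardScaler().fit(X_train)   # fit on training data only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

models = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(),  # default alpha (assumption)
    "KNN Regressor": KNeighborsRegressor(n_neighbors=15, weights="distance"),
    "Decision Tree": DecisionTreeRegressor(
        max_depth=6, min_samples_leaf=10, random_state=0
    ),
    "Random Forest": RandomForestRegressor(
        n_estimators=150, max_depth=6, random_state=0
    ),
}

mae = {}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    mae[name] = mean_absolute_error(y_test, model.predict(X_test_s))
    print(f"{name}: MAE = {mae[name]:.2f} Mbps")
```

On step-shaped data like this, the tree-based models can place a split at the threshold while the linear models have to average across it, which is the behavior the notes in the table describe.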

Diagnostic Visualizations

Correlation heatmap: pairwise correlations among RTT, jitter, and throughput (correlation with throughput: RTT +0.37, jitter -0.10)
Actual vs predicted plot: Random Forest predictions vs ground truth
Residual plot: errors centered around zero, with larger deviations at high throughput
Feature importance chart: RTT dominates over jitter in Random Forest
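Each cell of a correlation heatmap is a pairwise Pearson coefficient. The sketch below computes one such coefficient by hand on a few made-up values, just to show what the numbers in the heatmap measure; eda_heatmap.py is not reproduced here, and the function name `pearson` is hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Made-up values for illustration (not taken from final_dataset.csv).
rtt = [10.0, 20.0, 30.0, 40.0, 50.0]
throughput = [12.0, 15.0, 11.0, 18.0, 20.0]

r = pearson(rtt, throughput)
print(f"corr(RTT, throughput) = {r:+.2f}")
```

Pearson correlation only captures linear association, so a modest coefficient does not rule out a strong threshold-like dependence of the kind the tree-based models pick up.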

Tech Stack

Python, Pandas, NumPy, scikit-learn
Mininet (network emulation)
Azure VM — Ubuntu 24.04
iperf3 (throughput measurement)
ping / ICMP (RTT + jitter measurement)
Linux tc (traffic control for congestion generation)
Matplotlib, Seaborn (visualization)

Repository Structure

ACN-Final-Project/
├── data/                        # Raw and processed dataset files
├── models/                      # Saved model artifacts
├── results/                     # Output plots and evaluation results
├── collect_quick.sh             # Mininet data collection script
├── split_data.py                # Train/test split and preprocessing
├── train_models.py              # Model training
├── test_models.py               # Model evaluation and metrics
├── eda_heatmap.py               # Correlation heatmap
├── plot_actual_vs_predicted.py  # Actual vs predicted visualization
├── plot_residuals.py            # Residual analysis
├── plot_feature_importance.py   # Random Forest feature importance
├── final_dataset.csv            # Complete dataset (600 samples)
├── train_data.csv               # Training split
└── test_data.csv                # Test split

How to Run

1. Clone the repo

git clone https://github.com/RajaReddy1718/ACN-Final-Project.git
cd ACN-Final-Project

2. Install dependencies

pip install pandas numpy scikit-learn matplotlib seaborn

3. (Optional) Re-generate the dataset

Requires Mininet installed on a Linux host, such as the Azure VM.

sudo python3 -c "from mininet.net import Mininet; print('Mininet ready')"
sudo bash collect_quick.sh
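To give a sense of how raw probe output becomes one dataset row, here is a hedged sketch that parses the summary line printed by Linux `ping` and the JSON report produced by `iperf3 --json`. It is not the logic of collect_quick.sh. In particular, using ping's `mdev` field as the jitter value is an assumption, since this README does not say how jitter is derived, and the sample outputs are invented.

```python
import json
import re

# Summary line printed by iputils ping, e.g.
# "rtt min/avg/max/mdev = 20.1/21.4/23.0/1.1 ms"
PING_SUMMARY = re.compile(
    r"rtt min/avg/max/mdev = ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms"
)

def parse_ping(output):
    """Return (average RTT, mdev) in milliseconds from ping output."""
    match = PING_SUMMARY.search(output)
    if match is None:
        raise ValueError("no rtt summary line found")
    _min, avg, _max, mdev = map(float, match.groups())
    return avg, mdev

def parse_iperf3(json_text):
    """Return receiver-side throughput in Mbps from an iperf3 JSON report."""
    report = json.loads(json_text)
    return report["end"]["sum_received"]["bits_per_second"] / 1e6

# Invented sample outputs for illustration.
sample_ping = (
    "5 packets transmitted, 5 received, 0% packet loss, time 4005ms\n"
    "rtt min/avg/max/mdev = 20.1/21.4/23.0/1.1 ms\n"
)
sample_iperf = json.dumps({"end": {"sum_received": {"bits_per_second": 9.4e6}}})

rtt_ms, jitter_ms = parse_ping(sample_ping)   # jitter approximated by mdev
throughput_mbps = parse_iperf3(sample_iperf)
print(rtt_ms, jitter_ms, throughput_mbps)
```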

4. Train models

python train_models.py

5. Evaluate and generate plots

python test_models.py
python eda_heatmap.py
python plot_actual_vs_predicted.py
python plot_residuals.py
python plot_feature_importance.py

Key Findings

Linear models fail to capture congestion behavior — throughput drops are threshold-like, not gradual
Decision Tree best explains variance (R² = 0.497), learning abrupt congestion boundaries
Random Forest has the lowest mean absolute error (MAE = 6.18 Mbps), helped by the variance reduction of ensembling
RTT is the dominant predictor — jitter adds a secondary independent signal
Throughput can be predicted from lightweight external probes only — no internal network state required

Authors

Rajasekhar Reddy Kallam — University of Nebraska at Omaha
Dev Patel — University of Nebraska at Omaha
Murali Krishna Panguluri — University of Nebraska at Omaha

Course

Advanced Computer Networks — University of Nebraska at Omaha, Fall 2024
