ML-Based Network Throughput Prediction

Predicting network throughput under congested and non-congested conditions using machine learning — built on a Mininet-emulated topology running on Microsoft Azure.


Overview

Modern applications like video streaming, cloud gaming, and real-time conferencing depend on stable network throughput. This project builds an end-to-end ML pipeline that predicts achievable throughput (Mbps) from lightweight, externally observable network metrics — RTT and jitter — without needing access to internal network state.

Five regression models were trained and evaluated on 600 samples generated from controlled Mininet experiments. Tree-based models clearly outperformed the linear baselines (MAE of about 6.2 Mbps versus 8.1 Mbps), which points to a non-linear relationship between delay dynamics and throughput.


Results

Model               MAE (Mbps)   RMSE (Mbps)   R²
Linear Regression   8.14         11.58         0.148
Ridge Regression    8.14         11.59         0.148
KNN Regressor       6.76         9.96          0.370
Decision Tree       6.28         8.90          0.497
Random Forest       6.18         9.01          0.484

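Random Forest achieved the lowest MAE (6.18 Mbps), while Decision Tree achieved the lowest RMSE (8.90 Mbps) and the highest R² (0.497). Both tree-based models beat the linear and KNN models on every metric, which suggests that non-linear models are better suited to throughput prediction under mixed network conditions. Even the best R² is about 0.50, so roughly half of the throughput variance remains unexplained by RTT and jitter alone.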
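
For reference, the three metrics reported in the table can be computed as in the minimal sketch below. It is written in plain Python to make the formulas explicit and is not taken from test_models.py; the function name `regression_metrics` is hypothetical. scikit-learn's `mean_absolute_error`, `mean_squared_error`, and `r2_score` provide equivalent calculations.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for two equal-length sequences of numbers.

    Hypothetical helper for illustration; not part of the repository.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and the same length")

    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    # MAE: average absolute error, in the same unit as the target (Mbps).
    mae = sum(abs(e) for e in errors) / n

    # RMSE: square root of the mean squared error; penalizes large misses more.
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)

    # R^2: 1 - (residual sum of squares / total sum of squares).
    # Undefined (division by zero) if every true value is identical.
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot

    return mae, rmse, r2


# Toy values, not project data.
mae, rmse, r2 = regression_metrics([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])
print(f"MAE={mae:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")
```

Because RMSE squares each error before averaging, a model can have the lowest MAE without having the lowest RMSE, which is what the Random Forest and Decision Tree rows show.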

Feature importance (Random Forest): RTT dominates as the primary predictor of throughput. Jitter contributes a secondary signal. RTT and jitter have low mutual correlation, which suggests that they carry largely non-redundant predictive information.
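
As an illustration of how such importances are read from a fitted model, the sketch below trains a Random Forest with the hyperparameters listed in this README (150 trees, max_depth=6) on synthetic data. The data, the column names `rtt_ms` and `jitter_ms`, and the random seed are assumptions made for the example; none of it reproduces the project's dataset or its actual importance values.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption): throughput depends on RTT only,
# and jitter is pure noise, so RTT should dominate by construction.
rng = np.random.default_rng(0)
n = 300
rtt_ms = rng.uniform(1.0, 100.0, n)
jitter_ms = rng.uniform(0.0, 10.0, n)
throughput_mbps = 100.0 / (1.0 + rtt_ms / 20.0) + rng.normal(0.0, 1.0, n)

X = np.column_stack([rtt_ms, jitter_ms])
model = RandomForestRegressor(n_estimators=150, max_depth=6, random_state=0)
model.fit(X, throughput_mbps)

# feature_importances_ holds impurity-based importances that sum to 1.
importances = dict(zip(["rtt_ms", "jitter_ms"], model.feature_importances_))
for name, value in importances.items():
    print(f"{name}: {value:.3f}")
```

Impurity-based importances describe how much each feature reduced error inside the trees; they do not by themselves establish a causal effect.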


Pipeline

Azure VM (Ubuntu 24.04)
    └── Mininet (2-host, 1-switch topology)
            ├── Traffic Control (tc) — congestion/bandwidth limits
            ├── ping → RTT + jitter measurement
            └── iperf3 → throughput measurement
                    └── final_dataset.csv (600 samples)
                            └── ML Pipeline
                                    ├── 70/30 train-test split
                                    ├── Feature standardization
                                    ├── 5 regression models
                                    └── Metrics + diagnostic plots
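
The "ping → RTT + jitter" step can be sketched roughly as follows. This README does not say how jitter is derived, so the example assumes jitter is the mean absolute difference between consecutive RTT samples; the function names and the sample output are hypothetical, and collect_quick.sh may do this differently.

```python
import re

# Matches the per-reply latency in Linux ping output, e.g. "time=12.3 ms".
_TIME_RE = re.compile(r"time=([0-9.]+) ms")


def parse_rtts(ping_output):
    """Extract per-packet RTTs (ms) from raw ping output text."""
    return [float(m.group(1)) for m in _TIME_RE.finditer(ping_output)]


def rtt_and_jitter(rtts):
    """Return (mean RTT, jitter) in ms.

    Assumption: jitter = mean absolute difference of consecutive RTTs.
    """
    if len(rtts) < 2:
        raise ValueError("need at least two RTT samples to compute jitter")
    mean_rtt = sum(rtts) / len(rtts)
    diffs = [abs(b - a) for a, b in zip(rtts, rtts[1:])]
    return mean_rtt, sum(diffs) / len(diffs)


# Hypothetical sample output, not captured from the project's experiments.
sample = """\
64 bytes from 10.0.0.2: icmp_seq=1 ttl=64 time=10.0 ms
64 bytes from 10.0.0.2: icmp_seq=2 ttl=64 time=12.0 ms
64 bytes from 10.0.0.2: icmp_seq=3 ttl=64 time=11.0 ms
"""

rtts = parse_rtts(sample)
mean_rtt, jitter = rtt_and_jitter(rtts)
print(mean_rtt, jitter)
```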

Dataset

  • 600 samples from 150 experimental runs × 4 scenarios per run
  • Features: RTT (ms), Jitter (ms)
  • Target: Throughput (Mbps)
  • Scenarios: Congested (Token Bucket Filter bandwidth limiting via tc) and non-congested (netem delay/loss emulation)
  • Split: 70% train / 30% test with standardized feature scaling
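
The two scenario types listed above map onto two tc queueing disciplines, tbf and netem. The sketch below only builds the command lines and does not execute them. The interface name `h1-eth0`, the rate, burst, latency, delay, and loss values are illustrative assumptions; the parameters actually used are defined in collect_quick.sh.

```python
import shlex


def congested_cmd(iface, rate_mbit):
    """Build a tc command that caps bandwidth with a Token Bucket Filter.

    tbf needs a rate, a burst size, and either a latency or a limit;
    the burst and latency values here are placeholders.
    """
    return ["tc", "qdisc", "add", "dev", iface, "root", "tbf",
            "rate", f"{rate_mbit}mbit", "burst", "32kbit", "latency", "400ms"]


def non_congested_cmd(iface, delay_ms, loss_pct):
    """Build a tc command that adds netem delay and packet loss."""
    return ["tc", "qdisc", "add", "dev", iface, "root", "netem",
            "delay", f"{delay_ms}ms", "loss", f"{loss_pct}%"]


# Hypothetical values for illustration only.
tbf_line = shlex.join(congested_cmd("h1-eth0", 10))
netem_line = shlex.join(non_congested_cmd("h1-eth0", 20, 1))
print(tbf_line)
print(netem_line)
```

Running commands like these requires root and, in Mininet, would typically be issued inside a host's network namespace.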

Models

Model               Notes
Linear Regression   Baseline; assumes a linear RTT/jitter → throughput relationship
Ridge Regression    L2 regularization for noise stability
KNN Regressor       k=15, distance-weighted; captures local non-linearity
Decision Tree       max_depth=6, min_samples_leaf=10; learns threshold-like congestion behavior
Random Forest       150 trees, max_depth=6; ensemble reduces variance, best MAE
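
A minimal sketch of how these five models could be set up with scikit-learn, using the hyperparameters listed above together with the 70/30 split and feature standardization. It is not a copy of train_models.py: the synthetic data, the random seeds, and the Ridge `alpha` value are assumptions, since this README does not state them.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for final_dataset.csv (assumption): 600 rows of
# [RTT ms, jitter ms] with a threshold-like throughput drop above 50 ms RTT.
rng = np.random.default_rng(42)
n = 600
X = np.column_stack([rng.uniform(1, 100, n), rng.uniform(0, 10, n)])
y = np.where(X[:, 0] > 50.0, 20.0, 80.0) + rng.normal(0, 5, n)

# 70/30 train-test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=1.0),  # alpha is an assumption
    "KNN Regressor": KNeighborsRegressor(n_neighbors=15, weights="distance"),
    "Decision Tree": DecisionTreeRegressor(
        max_depth=6, min_samples_leaf=10, random_state=42
    ),
    "Random Forest": RandomForestRegressor(
        n_estimators=150, max_depth=6, random_state=42
    ),
}

mae_by_model = {}
for name, model in models.items():
    # The scaler is fitted on the training split only, inside the pipeline,
    # so no test-set statistics leak into training.
    pipeline = make_pipeline(StandardScaler(), model)
    pipeline.fit(X_train, y_train)
    mae_by_model[name] = mean_absolute_error(y_test, pipeline.predict(X_test))

for name, mae in mae_by_model.items():
    print(f"{name}: MAE = {mae:.2f} Mbps")
```

Standardization matters mainly for KNN and Ridge; the tree-based models are insensitive to feature scale, but keeping one pipeline for all five keeps the comparison uniform.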

Diagnostic Visualizations

  • Correlation heatmap — RTT vs jitter vs throughput (RTT: +0.37, jitter: -0.10)
  • Actual vs predicted plot — Random Forest predictions vs ground truth
  • Residual plot — errors centered around zero, larger deviations at high throughput
  • Feature importance chart — RTT dominates over jitter in Random Forest
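
The numbers behind a correlation heatmap come from a pairwise correlation matrix. The sketch below computes one with pandas on a toy table; the column names and values are hypothetical and are not drawn from final_dataset.csv, so the coefficients differ from the ones quoted above.

```python
import pandas as pd

# Toy data for illustration only (assumed column names).
df = pd.DataFrame({
    "rtt_ms": [10.0, 20.0, 30.0, 40.0, 50.0],
    "jitter_ms": [1.0, 3.0, 2.0, 5.0, 4.0],
    "throughput_mbps": [90.0, 70.0, 60.0, 30.0, 20.0],
})

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr(method="pearson")

# Correlation of each feature with the target column.
target_corr = corr["throughput_mbps"].drop("throughput_mbps")
print(corr.round(2))
print(target_corr.round(2))
```

A matrix like `corr` can then be passed to `seaborn.heatmap(corr, annot=True)` to draw the figure.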

Tech Stack

  • Python, Pandas, NumPy, scikit-learn
  • Mininet (network emulation)
  • Azure VM — Ubuntu 24.04
  • iperf3 (throughput measurement)
  • ping / ICMP (RTT + jitter measurement)
  • Linux tc (traffic control for congestion generation)
  • Matplotlib, Seaborn (visualization)

Repository Structure

ACN-Final-Project/
├── data/                        # Raw and processed dataset files
├── models/                      # Saved model artifacts
├── results/                     # Output plots and evaluation results
├── collect_quick.sh             # Mininet data collection script
├── split_data.py                # Train/test split and preprocessing
├── train_models.py              # Model training
├── test_models.py               # Model evaluation and metrics
├── eda_heatmap.py               # Correlation heatmap
├── plot_actual_vs_predicted.py  # Actual vs predicted visualization
├── plot_residuals.py            # Residual analysis
├── plot_feature_importance.py   # Random Forest feature importance
├── final_dataset.csv            # Complete dataset (600 samples)
├── train_data.csv               # Training split
└── test_data.csv                # Test split

How to Run

1. Clone the repo

git clone https://github.com/RajaReddy1718/ACN-Final-Project.git
cd ACN-Final-Project

2. Install dependencies

pip install pandas numpy scikit-learn matplotlib seaborn

3. (Optional) Re-generate the dataset

Requires Mininet installed on Linux/Azure VM

sudo python3 -c "from mininet.net import Mininet; print('Mininet ready')"
sudo bash collect_quick.sh

4. Train models

python train_models.py

5. Evaluate and generate plots

python test_models.py
python eda_heatmap.py
python plot_actual_vs_predicted.py
python plot_residuals.py
python plot_feature_importance.py

Key Findings

  • Linear models fail to capture congestion behavior — throughput drops are threshold-like, not gradual
  • Decision Tree best explains variance (R² = 0.497), learning abrupt congestion boundaries
  • Random Forest has the lowest MAE (6.18 Mbps) and benefits from ensembling; Decision Tree has the lowest RMSE (8.90 Mbps)
  • RTT is the dominant predictor — jitter adds a secondary independent signal
  • Throughput can be partially predicted (R² up to about 0.50) from lightweight external probes alone, with no internal network state required

Authors

  • Rajasekhar Reddy Kallam — University of Nebraska at Omaha
  • Dev Patel — University of Nebraska at Omaha
  • Murali Krishna Panguluri — University of Nebraska at Omaha

Course

Advanced Computer Networks — University of Nebraska at Omaha, Fall 2024
