A professional, modular malware classification system that analyzes network traffic patterns to detect and classify malware using machine learning.
- Multiple Feature Sets: Core flow features, SPL packet size features, and combined feature sets
- Multiple Models: Neural Networks, Random Forest, and FAISS k-NN
- Binary & Multiclass Classification: Detect malware vs benign, or classify specific malware families
- Reproducible Results: Fixed random seeds ensure consistent results across runs
- Comprehensive Analysis: Detailed metrics, confusion matrices, ROC-AUC, cross-validation, and feature importance analysis
- Professional Architecture: Clean, modular, and extensible codebase
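The fixed-seed guarantee can be sketched as a small helper. The function name and the optional TensorFlow handling below are illustrative assumptions; the project itself centralizes seed configuration in config/hyperparameters.py.

```python
import os
import random

import numpy as np

# Illustrative helper mirroring the fixed-seed policy (seed 42).
SEED = 42

def set_global_seeds(seed: int = SEED) -> None:
    """Seed every source of randomness the pipeline touches."""
    random.seed(seed)
    np.random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import tensorflow as tf  # only seeded if TensorFlow is installed
        tf.random.set_seed(seed)
    except ImportError:
        pass
```

Calling such a helper at the start of every run makes repeated experiments draw identical random numbers, which is what makes results comparable across runs.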
├── config/ # Configuration files
│ ├── features.py # Feature definitions
│ └── hyperparameters.py # Model configurations
├── data/ # Data handling modules
│ ├── loader.py # Data loading utilities
│ └── preprocessor.py # Feature preprocessing
├── models/ # Model implementations
│ ├── base.py # Abstract base class
│ ├── neural_network.py # Neural network classifier
│ ├── random_forest.py # Random forest classifier
│ └── faiss_knn.py # FAISS k-NN classifier
├── evaluation/ # Evaluation and visualization
│ ├── metrics.py # Metrics calculation and analysis
│ └── visualization.py # Result visualization
├── experiments/ # Experiment orchestration
│ └── runner.py # Experiment runner
└── notebooks/ # Jupyter notebooks and results
├── experiments.ipynb # Main experimental notebook
├── results/ # Output CSV files
│ ├── classification_results.csv
│ ├── cv_results.csv
│ └── feature_importance.csv
└── plots/ # Confusion matrices and visualizations
- Setup Environment:

  ```bash
  # Install required packages
  pip install pandas numpy scikit-learn tensorflow faiss-cpu matplotlib seaborn
  ```

- Run Experiments:

  ```python
  from experiments.runner import ExperimentRunner

  # Initialize runner with your data path
  runner = ExperimentRunner('path/to/nfs_all_datasets_clean_final.csv')

  # Run a single experiment
  results = runner.run_single_experiment(
      feature_set='core',          # 'core', 'splt', or 'core+splt'
      task='binary',               # 'binary', 'malware_multiclass', or 'multiclass'
      model_type='random_forest'   # 'neural_network', 'random_forest', or 'faiss'
  )

  # Run all experiments
  all_results = runner.run_all_experiments()
  runner.save_results('results.csv')
  ```

- Use Jupyter Notebook: Open notebooks/experiments.ipynb for an interactive walkthrough of all experiments, including ROC curves, cross-validation, and feature importance analysis.
- core: 33 network flow features (duration, bytes, packets, packet sizes, inter-arrival times)
- splt: 25 packet size features from SPL analysis
- core+splt: Combined 58 features
- Neural Network: Multi-layer perceptron with batch normalization and dropout
- Random Forest: Ensemble classifier with 300 estimators
- FAISS k-NN: Fast similarity search with cosine similarity
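To illustrate the FAISS k-NN idea, the sketch below implements cosine-similarity k-NN in plain NumPy. FAISS obtains the same result much faster with an inner-product index (faiss.IndexFlatIP) over L2-normalized vectors, since the inner product of unit vectors is exactly their cosine similarity. The function and toy data are illustrative, not the project's implementation.

```python
import numpy as np

def cosine_knn(train_X, train_y, query_X, k=5):
    """Majority-vote label for each query row using cosine similarity."""
    def normalize(m):
        return m / np.linalg.norm(m, axis=1, keepdims=True)
    sims = normalize(query_X) @ normalize(train_X).T   # cosine similarity matrix
    nn_idx = np.argsort(-sims, axis=1)[:, :k]          # indices of top-k neighbors
    votes = train_y[nn_idx]
    return np.array([np.bincount(v).argmax() for v in votes])

# Toy example: two well-separated clusters
rng = np.random.default_rng(42)
X = np.vstack([rng.normal([1, 0], 0.1, (20, 2)),
               rng.normal([0, 1], 0.1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
pred = cosine_knn(X, y, np.array([[1.0, 0.05], [0.05, 1.0]]), k=5)
# pred is [0, 1]: each query is assigned to its nearest cluster
```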
Preprocessing policy:
- Random Forest is scale-invariant, so feature scaling is skipped for RF across all feature sets.
- Neural Network and FAISS use standardized features.
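The policy above amounts to a conditional standardization step. The sketch below shows one way to express it; the model-name strings are illustrative, not the project's actual keys.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Models whose predictions do not depend on feature scale.
SCALE_FREE_MODELS = {"random_forest"}

def preprocess(X_train, X_test, model_type):
    """Return (train, test) features, standardized unless the model is scale-invariant."""
    if model_type in SCALE_FREE_MODELS:
        return X_train, X_test              # trees split on thresholds; scaling is a no-op
    scaler = StandardScaler().fit(X_train)  # fit on training data only, to avoid leakage
    return scaler.transform(X_train), scaler.transform(X_test)
```

Fitting the scaler on the training split only, then applying it to the test split, keeps test-set statistics out of the preprocessing step.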
All hyperparameters are centrally configured in config/hyperparameters.py to ensure reproducible results:
- Fixed random seeds (42)
- Model-specific configurations
- Preprocessing parameters
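A plausible shape for config/hyperparameters.py is sketched below. Only the seed (42), the Random Forest estimator count (300), and the fold count (5) come from this README; every other key and value is an assumption for illustration.

```python
# Illustrative configuration layout; actual keys in the project may differ.
RANDOM_SEED = 42

HYPERPARAMETERS = {
    "random_forest": {
        "n_estimators": 300,            # stated in this README
        "random_state": RANDOM_SEED,
        "n_jobs": -1,
    },
    "neural_network": {
        "hidden_layers": [256, 128],    # assumed architecture
        "dropout": 0.3,                 # assumed rate
        "batch_norm": True,
        "random_state": RANDOM_SEED,
    },
    "faiss": {
        "k": 5,                         # assumed neighbor count
        "metric": "cosine",
    },
}

CV_FOLDS = 5
```

Keeping every tunable in one module means a single import gives any experiment the same configuration, which is what makes runs reproducible and comparable.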
The system expects a CSV file with the following structure:
- SourceFolder: Data source (desktop-malware, mobile-malware, desktop-apps, mobile-apps)
- StandardizedAppName: Malware family name (for multiclass classification)
- splt_ps: Packet size sequence (for SPL features)
- Network flow features (bidirectional_*, src2dst_*, dst2src_*)
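Given that layout, the binary label follows directly from SourceFolder. A minimal sketch with invented rows:

```python
import io

import pandas as pd

# Column names follow the data-format description above; the CSV rows are invented.
csv = io.StringIO(
    "SourceFolder,StandardizedAppName,bidirectional_bytes\n"
    "desktop-malware,emotet,1200\n"
    "mobile-apps,whatsapp,800\n"
)
df = pd.read_csv(csv)

MALWARE_SOURCES = {"desktop-malware", "mobile-malware"}
df["is_malware"] = df["SourceFolder"].isin(MALWARE_SOURCES).astype(int)
# df["is_malware"] is [1, 0]: the malware row is labeled 1, the benign app 0
```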
Notes on SPL features:
- NFStream typically provides 25-length SPL arrays (direction, packet sizes, and inter-arrival times). This project extracts all 25 elements into ps_1..ps_25.
- Shorter flows are padded by NFStream (with -1) and longer flows are truncated, so all sequences are equal-length and require no imputation.
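A minimal sketch of that extraction, assuming splt_ps is stored in the CSV as a bracketed, comma-separated string (the storage format is an assumption; NFStream's -1 padding means every sequence already has length 25):

```python
import pandas as pd

def expand_splt(df, col="splt_ps", n=25):
    """Expand the packet-size sequence column into ps_1..ps_n feature columns."""
    parsed = df[col].apply(lambda s: [int(v) for v in s.strip("[]").split(",")][:n])
    ps = pd.DataFrame(parsed.tolist(),
                      columns=[f"ps_{i}" for i in range(1, n + 1)],
                      index=df.index)
    return pd.concat([df.drop(columns=[col]), ps], axis=1)
```

Because every array is already length 25, the resulting frame needs no imputation; the -1 padding values simply become ordinary feature values.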
The system provides comprehensive analysis including:
- Classification metrics (accuracy, precision, recall, F1-score, ROC-AUC)
- ROC curves for binary and multiclass classification
- 5-fold stratified cross-validation for stability assessment
- Feature importance analysis from Random Forest models
- Confusion matrices with heatmap visualizations
- Per-class error analysis for multiclass tasks
- Feature set and model comparisons
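For reference, the core binary metrics listed above can be reproduced with scikit-learn; the prediction vectors below are invented toy data, not project output.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             f1_score, roc_auc_score)

y_true  = np.array([0, 0, 1, 1, 1, 0])
y_pred  = np.array([0, 1, 1, 1, 0, 0])
y_score = np.array([0.1, 0.6, 0.9, 0.8, 0.4, 0.2])  # predicted P(malware)

acc = accuracy_score(y_true, y_pred)      # fraction of correct labels
f1  = f1_score(y_true, y_pred)            # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_score)      # ranks scores, not hard labels
cm  = confusion_matrix(y_true, y_pred)    # rows = true class, cols = predicted
```

Note that ROC-AUC is computed from the continuous scores rather than the thresholded predictions, which is why the runner needs access to class probabilities.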
```python
# Compare all feature sets for binary classification
feature_comparison = runner.run_feature_set_comparison(
    task='binary',
    model_type='random_forest'
)

# Compare all models for a specific configuration
model_comparison = runner.run_model_comparison(
    feature_set='core',
    task='binary'
)

# Get best-performing configurations
best_results = runner.get_best_results(metric='f1')

# Run 5-fold cross-validation
cv_results = runner.run_kfold_experiment(
    feature_set='core+splt',
    task='binary',
    model_type='random_forest',
    n_folds=5
)
```