Malware Classification System

A modular malware classification system that applies machine learning to network flow features to detect malware and classify it by family.

Features

  • Multiple Feature Sets: Core flow features, SPL packet size features, and combined feature sets
  • Multiple Models: Neural Networks, Random Forest, and FAISS k-NN
  • Binary & Multiclass Classification: Detect malware vs benign, or classify specific malware families
  • Reproducible Results: Fixed random seeds ensure consistent results across runs
  • Comprehensive Analysis: Detailed metrics, confusion matrices, ROC-AUC, cross-validation, and feature importance analysis
  • Professional Architecture: Clean, modular, and extensible codebase

Project Structure

├── config/                 # Configuration files
│   ├── features.py         # Feature definitions
│   └── hyperparameters.py  # Model configurations
├── data/                   # Data handling modules
│   ├── loader.py          # Data loading utilities
│   └── preprocessor.py    # Feature preprocessing
├── models/                # Model implementations
│   ├── base.py           # Abstract base class
│   ├── neural_network.py # Neural network classifier
│   ├── random_forest.py  # Random forest classifier
│   └── faiss_knn.py      # FAISS k-NN classifier
├── evaluation/           # Evaluation and visualization
│   ├── metrics.py       # Metrics calculation and analysis
│   └── visualization.py # Result visualization
├── experiments/         # Experiment orchestration
│   └── runner.py       # Experiment runner
└── notebooks/          # Jupyter notebooks and results
    ├── experiments.ipynb       # Main experimental notebook
    ├── results/                # Output CSV files
    │   ├── classification_results.csv
    │   ├── cv_results.csv
    │   └── feature_importance.csv
    └── plots/                  # Confusion matrices and visualizations

Quick Start

  1. Set Up Environment:

    # Install required packages
    pip install pandas numpy scikit-learn tensorflow faiss-cpu matplotlib seaborn

  2. Run Experiments:

    from experiments.runner import ExperimentRunner
    
    # Initialize runner with your data path
    runner = ExperimentRunner('path/to/nfs_all_datasets_clean_final.csv')
    
    # Run a single experiment
    results = runner.run_single_experiment(
        feature_set='core',      # 'core', 'splt', or 'core+splt'
        task='binary',           # 'binary', 'malware_multiclass', or 'multiclass'
        model_type='random_forest'  # 'neural_network', 'random_forest', or 'faiss'
    )
    
    # Run all experiments
    all_results = runner.run_all_experiments()
    runner.save_results('results.csv')

  3. Use Jupyter Notebook: Open notebooks/experiments.ipynb to run all experiments interactively, including ROC curves, cross-validation, and feature importance analysis.

Configuration

Feature Sets

  • core: 33 network flow features (duration, bytes, packets, packet sizes, inter-arrival times)
  • splt: 25 packet size features from SPL analysis
  • core+splt: Combined 58 features
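
The way the three feature sets relate can be sketched as a small registry. This is only an illustration, not the contents of config/features.py: the `core_1..core_33` names are placeholders for the real flow feature columns, and `get_feature_columns` is a hypothetical helper.

```python
# Hypothetical layout; the real column names in config/features.py may differ.
N_PS = 25  # packet-size positions kept per flow (from this README)

# Placeholder names standing in for the 33 core flow features.
CORE_FEATURES = [f"core_{i}" for i in range(1, 34)]

# ps_1..ps_25, the packet-size columns described under Data Format.
SPLT_FEATURES = [f"ps_{i}" for i in range(1, N_PS + 1)]

FEATURE_SETS = {
    "core": CORE_FEATURES,
    "splt": SPLT_FEATURES,
    "core+splt": CORE_FEATURES + SPLT_FEATURES,
}


def get_feature_columns(feature_set):
    """Return the column names for a feature set, failing loudly on typos."""
    if feature_set not in FEATURE_SETS:
        raise ValueError(
            f"unknown feature set {feature_set!r}; "
            f"expected one of {sorted(FEATURE_SETS)}"
        )
    return list(FEATURE_SETS[feature_set])
```

Keeping the combined set as a plain concatenation makes the 33 + 25 = 58 count easy to verify.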

Models

  • Neural Network: Multi-layer perceptron with batch normalization and dropout
  • Random Forest: Ensemble classifier with 300 estimators
  • FAISS k-NN: Fast similarity search with cosine similarity
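
For readers unfamiliar with cosine-similarity k-NN, here is a brute-force sketch in plain Python. It relies on the fact that the inner product of two L2-normalized vectors equals their cosine similarity, which is a common way to get cosine search from an inner-product index (for example `faiss.IndexFlatIP` after `faiss.normalize_L2`). Whether models/faiss_knn.py does exactly this is an assumption; `l2_normalize` and `knn_predict` are hypothetical names.

```python
import math
from collections import Counter


def l2_normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def knn_predict(train_x, train_y, query, k=3):
    """Majority vote over the k training rows most cosine-similar to query."""
    q = l2_normalize(query)
    # The inner product of unit vectors is exactly the cosine similarity.
    sims = [
        (sum(a * b for a, b in zip(l2_normalize(row), q)), label)
        for row, label in zip(train_x, train_y)
    ]
    top_k = sorted(sims, key=lambda pair: pair[0], reverse=True)[:k]
    return Counter(label for _, label in top_k).most_common(1)[0][0]
```

FAISS performs the same nearest-neighbor lookup far faster on large datasets; the sketch only shows the similarity measure and the vote.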

Preprocessing policy:

  • Random Forest splits on per-feature thresholds and is therefore insensitive to feature scaling, so scaling is skipped for RF across all feature sets.
  • Neural Network and FAISS use standardized features.
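
The policy above can be sketched as a small dispatch on model type. This is a standard-library illustration under stated assumptions, not the code in data/preprocessor.py: `SCALED_MODELS`, `fit_standardizer`, `transform`, and `preprocess` are hypothetical names, and the scaling statistics are assumed to be fitted on the training split only.

```python
import statistics

# Models that receive standardized inputs, per the policy above.
SCALED_MODELS = {"neural_network", "faiss"}


def fit_standardizer(rows):
    """Compute per-column mean and population standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Constant columns have zero deviation; fall back to 1.0 to avoid dividing by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds


def transform(rows, means, stds):
    """Apply z-score scaling with previously fitted statistics."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def preprocess(train_rows, test_rows, model_type):
    """Standardize for scale-sensitive models; pass data through otherwise."""
    if model_type not in SCALED_MODELS:
        return train_rows, test_rows
    means, stds = fit_standardizer(train_rows)
    return transform(train_rows, means, stds), transform(test_rows, means, stds)
```

Fitting the statistics on the training rows and reusing them for the test rows avoids leaking test information into the scaler.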

Hyperparameters

All hyperparameters are centrally configured in config/hyperparameters.py to ensure reproducible results:

  • Fixed random seeds (42)
  • Model-specific configurations
  • Preprocessing parameters
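
As a rough sketch of what central configuration with a fixed seed can look like, the snippet below uses only the standard library. `SEED`, `RandomForestConfig`, and `seeded_shuffle` are hypothetical names; only the values 42 and 300 come from this README.

```python
import random
from dataclasses import dataclass

SEED = 42  # fixed seed stated in this README


@dataclass(frozen=True)
class RandomForestConfig:
    """Immutable settings so an experiment cannot change them by accident."""
    n_estimators: int = 300  # from the Models section
    random_state: int = SEED


def seeded_shuffle(items, seed=SEED):
    """Shuffle a copy of items with a dedicated, seeded generator."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled
```

Calling `seeded_shuffle` twice with the same input yields the same order. In a full pipeline, NumPy and TensorFlow keep their own random state, so they need seeding as well (for example with `np.random.seed` and `tf.random.set_seed`).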

Data Format

The system expects a CSV file with the following columns:

  • SourceFolder: Data source (desktop-malware, mobile-malware, desktop-apps, mobile-apps)
  • StandardizedAppName: Malware family name (for multiclass classification)
  • splt_ps: Packet size sequence (for SPL features)
  • Network flow features (bidirectional_*, src2dst_*, dst2src_*)
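
One plausible way to derive a binary label from this layout is to treat the two `*-malware` sources as malicious and the two `*-apps` sources as benign. That rule is an assumption, not confirmed by the project; `binary_label`, `load_rows`, and the sample rows are made up for illustration.

```python
import csv
import io


def binary_label(source_folder):
    """Return 1 for malware sources and 0 for benign app sources (assumed rule)."""
    return 1 if source_folder.strip().lower().endswith("-malware") else 0


def load_rows(csv_text):
    """Parse CSV text and attach a binary label derived from SourceFolder."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        row["label"] = binary_label(row["SourceFolder"])
        rows.append(row)
    return rows


# Tiny made-up sample with the documented columns plus one flow feature.
SAMPLE = """SourceFolder,StandardizedAppName,bidirectional_packets
desktop-malware,family_a,12
mobile-apps,app_b,40
"""
```

With a real dataset the same rule could be applied to the `SourceFolder` column of a pandas DataFrame instead.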

Notes on SPL features:

  • NFStream provides fixed-length SPL arrays per flow (direction, packet sizes, and inter-arrival times), here of length 25. This project uses the packet-size array (splt_ps) and extracts all 25 elements into ps_1..ps_25.
  • NFStream pads shorter flows with -1 and truncates longer flows, so all sequences have equal length and do not require imputation.
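
Expanding `splt_ps` into `ps_1..ps_25` can be sketched as follows. The sketch assumes the CSV stores each sequence as a list literal such as `"[60, 52, -1]"`, which may not match the actual file; `expand_splt_ps` is a hypothetical helper, and the pad and truncate step is only a safeguard given that NFStream already emits fixed-length arrays.

```python
import ast

N_PS = 25  # sequence length used in this project
PAD = -1   # padding value for missing packets, per the notes above


def expand_splt_ps(raw, n=N_PS):
    """Turn a splt_ps cell such as "[60, 52, -1]" into a dict ps_1..ps_n."""
    values = ast.literal_eval(raw) if isinstance(raw, str) else raw
    values = list(values)
    # Truncate long sequences and pad short ones so every flow has n entries.
    values = values[:n] + [PAD] * max(0, n - len(values))
    return {f"ps_{i}": v for i, v in enumerate(values, start=1)}
```

`ast.literal_eval` is used instead of `eval` so that only plain literals in the cell are parsed.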

Results Analysis

The system provides comprehensive analysis including:

  • Classification metrics (accuracy, precision, recall, F1-score, ROC-AUC)
  • ROC curves for binary and multiclass classification
  • 5-fold stratified cross-validation for stability assessment
  • Feature importance analysis from Random Forest models
  • Confusion matrices with heatmap visualizations
  • Per-class error analysis for multiclass tasks
  • Feature set and model comparisons
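
For reference, the binary classification metrics listed above follow directly from confusion-matrix counts. The sketch below is a plain-Python illustration with hypothetical function names, not the code in evaluation/metrics.py.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    return tp, fp, fn, tn


def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1, returning 0.0 when undefined."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

ROC-AUC is omitted because it needs predicted scores rather than hard labels.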

Example Usage

# Assumes `runner` is the ExperimentRunner instance created in Quick Start

# Compare all feature sets for binary classification
feature_comparison = runner.run_feature_set_comparison(
    task='binary',
    model_type='random_forest'
)

# Compare all models for specific configuration
model_comparison = runner.run_model_comparison(
    feature_set='core', 
    task='binary'
)

# Get best performing configurations
best_results = runner.get_best_results(metric='f1')

# Run 5-fold cross-validation
cv_results = runner.run_kfold_experiment(
    feature_set='core+splt',
    task='binary',
    model_type='random_forest',
    n_folds=5
)
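
To make the idea behind stratified k-fold concrete, here is a minimal standard-library sketch of how sample indices can be dealt into folds while keeping class ratios roughly equal. It is not the implementation behind `run_kfold_experiment`; `stratified_folds` and `train_test_splits` are hypothetical names, and scikit-learn's `StratifiedKFold` is the usual library choice for this.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=42):
    """Assign each sample index to a fold, keeping class ratios roughly equal."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets a near-equal share.
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds


def train_test_splits(labels, n_folds=5, seed=42):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    folds = stratified_folds(labels, n_folds, seed)
    for k, test_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield sorted(train_idx), sorted(test_idx)
```

Each sample appears in exactly one test fold, so the per-fold scores together cover the whole dataset.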