VuLLM

Fine-tuning Large Language Models for C Vulnerability Detection and Classification (CWE) with Structured Reasoning

Caution

Ongoing experimental work — agentic RAG pipeline with a CWE knowledge base over the full MITRE corpus, vector retrieval, multi-pass inference (LangGraph), and an MCP server — lives on the agent branch. main is the frozen fine-tuning baseline.

Overview

This repository contains the code and experimental pipeline for my master's thesis on using fine-tuned LLMs for automated vulnerability detection and classification in C code. The approach generates structured JSON outputs containing both natural language security reasoning and CWE (Common Weakness Enumeration) classifications.
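For illustration, a minimal sketch of consuming one such structured output. The field names (`reasoning`, `vulnerable`, `cwe`) are assumptions chosen for this example; the actual schema is defined by the prompts in text_prompts:

```python
import json

# Hypothetical model output in the structured format described above.
# Field names are illustrative, not the real schema.
raw = """
{
  "reasoning": "The function copies user input into a fixed-size stack buffer without bounds checking.",
  "vulnerable": true,
  "cwe": "CWE-787"
}
"""

result = json.loads(raw)

# Downstream evaluation can then compare the predicted CWE label
# against the ground-truth annotation for each sample.
assert result["vulnerable"] is True
print(result["cwe"])  # CWE-787
```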

Key findings:

  • Pessimistic assumptions combined with CWE guidance achieve a 4.3× higher F1-score than the Random Forest baseline
  • Neither assumptions alone nor CWE guidance alone achieves substantial improvement—only their combination is effective
  • Training data quality (not prompt design) is the primary bottleneck, with a 15.5% recall ceiling across all configurations
  • Diagnostic suite validation shows 91.7% accuracy on handcrafted test cases

Repository Structure

VuLLM/
├── .clang-format               # Clang-format config file
├── deepspeed                   # Deepspeed files
├── DoneBot                     # Submodule for async notifications
├── LICENSE
├── pixi.toml                   # Dependency management
├── pyproject.toml
├── README.md
├── rusty                       # Rust implementation
│   ├── Cargo.toml
│   ├── pixi.toml
│   ├── tests
│   └── src
│       ├── lib.rs
│       ├── main.rs
│       ├── mitre               # MITRE db entries
│       └── processor_lib       # Tree-sitter parsing, GCC repair, AST validation, CWE enrichment
├── src                         # Python
│   ├── core
│   │   ├── cot                 # CoT generation and Jury for quality assessment
│   │   ├── cot_training        # Fine-tuning
│   │   └── random_forest       # RFC baseline
│   ├── dataset                 # Dataset utilities
│   └── test_env_integrity      # Environment validation
└── text_prompts                # Prompts applied

Requirements

  • Preprocessing: Rust 1.89+
  • Training/Evaluation: Python 3.12
  • Hardware: NVIDIA L40s GPU with 48GB VRAM for training
  • Dependencies: Managed via pixi

Installation

# Clone the repository with submodules
git clone --recurse-submodules https://github.com/MatteoGuglielmi-tech/VuLLM.git
cd VuLLM

# If already cloned without submodules, initialize them:
# git submodule update --init --recursive

# Install dependencies
pixi install

# Build Rust preprocessing pipeline
cd rusty && cargo build --release

Usage

Each component includes built-in argument parsing. Use --help for available options and usage examples.

  • Preprocessing: cd rusty && cargo run --release -- --help
  • Training/Evaluation: pixi run python -m src.core.cot_training.main --help

Experimental Configurations

The thesis evaluates 6 configurations in a 3×2 factorial design:

Config   Assumption Mode   CWE Guidance   F1-Score   Recall
1        Free              No             15.1%      8.7%
2        Free              Yes            15.9%      9.1%
3        Optimistic        No             12.9%      7.2%
4        Optimistic        Yes            17.3%      9.8%
5        Pessimistic       No             15.2%      8.7%
6        Pessimistic       Yes            24.8%      15.5%
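The six configurations follow directly from crossing the three assumption modes with the binary CWE-guidance flag. A quick sketch of enumerating that grid (the labels are illustrative):

```python
from itertools import product

# 3×2 factorial design: three assumption modes crossed with CWE guidance on/off.
assumption_modes = ["free", "optimistic", "pessimistic"]
cwe_guidance = [False, True]

configs = [
    {"id": i + 1, "assumption": a, "cwe_guidance": g}
    for i, (a, g) in enumerate(product(assumption_modes, cwe_guidance))
]

# Six configurations, numbered in the same order as the table above.
assert len(configs) == 6
```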

Dataset

This work uses the DiverseVul dataset. Due to licensing, we cannot redistribute processed data. The preprocessing pipeline can be applied to the publicly available dataset to reproduce our results.

Final dataset: 5888 samples (4302 train / 743 val / 843 test)
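A sketch of one way to reproduce a deterministic split of that shape from the 5888 preprocessed samples; the actual seed and shuffling method used in the thesis are assumptions here:

```python
import random

def split(samples, sizes=(4302, 743, 843), seed=42):
    """Deterministically shuffle and cut samples into train/val/test."""
    assert sum(sizes) == len(samples)
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train, n_val, _ = sizes
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

train, val, test = split(list(range(5888)))
print(len(train), len(val), len(test))  # 4302 743 843
```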

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Acknowledgments

  • DiverseVul dataset authors for making vulnerability data publicly available
  • Qwen team for the Qwen2.5-Coder model
  • Unsloth for efficient fine-tuning infrastructure
