Fine-tuning Large Language Models for C Vulnerability Detection and Classification (CWE) with Structured Reasoning
Caution: Ongoing experimental work — agentic RAG pipeline with a CWE knowledge base over the full MITRE corpus, vector retrieval, multi-pass inference (LangGraph), and an MCP server — lives on the agent branch. main is the frozen fine-tuning baseline.
This repository contains the code and experimental pipeline for my master's thesis on using fine-tuned LLMs for automated vulnerability detection and classification in C code. The approach generates structured JSON outputs containing both natural language security reasoning and CWE (Common Weakness Enumeration) classifications.
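As a rough illustration of what such a structured prediction could look like once parsed in Python — the field names (`reasoning`, `is_vulnerable`, `cwe`) are hypothetical placeholders, not the exact schema defined by the prompts in text_prompts/:

```python
import json

# Illustrative example of a structured model output combining
# natural-language security reasoning with a CWE label.
# Field names and values are placeholders, not the pipeline's actual schema.
raw_model_output = """
{
  "reasoning": "User-controlled input is copied into a fixed-size stack buffer with strcpy without a length check, so an oversized input overflows the buffer.",
  "is_vulnerable": true,
  "cwe": "CWE-787"
}
"""

prediction = json.loads(raw_model_output)
print(prediction["cwe"], "-", prediction["reasoning"][:60], "...")
```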
Key findings:
- Pessimistic assumptions combined with CWE guidance achieve a 4.3× higher F1-score than the Random Forest baseline
- Neither pessimistic assumptions alone nor CWE guidance alone yields a substantial improvement; only their combination is effective
- Training data quality (not prompt design) is the primary bottleneck, with a 15.5% recall ceiling across all configurations
- Diagnostic suite validation shows 91.7% accuracy on handcrafted test cases
VuLLM/
├── .clang-format # Clang-format config file
├── deepspeed # DeepSpeed files
├── DoneBot # Submodule for async notifications
├── LICENSE
├── pixi.toml # Dependency management
├── pyproject.toml
├── README.md
├── rusty # Rust implementation
│ ├── Cargo.toml
│ ├── pixi.toml
│ ├── tests
│ ├── src
│ │ ├── lib.rs
│ │ ├── main.rs
│ │ ├── mitre # MITRE db entries
│ │ └── processor_lib # Tree-sitter parsing, GCC repair, AST validation, CWE enrichment
├── src # Python
│ ├── core
│ │ ├── cot # CoT generation and Jury for quality assessment
│ │ ├── cot_training # Fine-tuning
│ │ └── random_forest # RFC baseline
│ ├── dataset # Dataset utilities
│ └── test_env_integrity # Environment validation
└── text_prompts # Prompts applied

Requirements:
- Preprocessing: Rust 1.89+
- Training/Evaluation: Python 3.12
- Hardware: NVIDIA L40s GPU with 48GB VRAM for training (see the sanity-check sketch below)
- Dependencies: Managed via pixi
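A quick way to confirm the GPU environment matches these requirements is a small PyTorch check like the one below — a minimal sketch only, independent of the repository's src/test_env_integrity module:

```python
import torch

# Minimal GPU sanity check: confirms CUDA is visible and reports VRAM.
# Illustrative stand-in, not the repository's test_env_integrity suite.
if not torch.cuda.is_available():
    raise SystemExit("No CUDA device found; training requires a GPU (e.g. an L40S with 48 GB VRAM).")

props = torch.cuda.get_device_properties(0)
print(f"GPU: {props.name}, VRAM: {props.total_memory / 2**30:.1f} GiB")
```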
# Clone the repository with submodules
git clone --recurse-submodules https://github.com/MatteoGuglielmi-tech/VuLLM.git
cd VuLLM
# If already cloned without submodules, initialize them:
# git submodule update --init --recursive
# Install dependencies
pixi install
# Build Rust preprocessing pipeline
cd rusty && cargo build --release

Each component includes built-in argument parsing. Use --help for available options and usage examples.
| Component | Command |
|---|---|
| Preprocessing | cd rusty && cargo run --release -- --help |
| Training/Evaluation | pixi run python -m src.core.cot_training.main --help |
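For orientation, the fine-tuning component builds on Unsloth and a Qwen2.5-Coder checkpoint (see Acknowledgments). The sketch below shows a typical Unsloth LoRA setup; the model size, sequence length, and LoRA hyperparameters are assumptions for illustration, not the configuration used in the thesis:

```python
from unsloth import FastLanguageModel

# Illustrative Unsloth setup for a Qwen2.5-Coder checkpoint.
# Model size, max_seq_length, and LoRA hyperparameters are placeholders,
# not the values used in the thesis experiments.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-Coder-7B-Instruct",  # assumed checkpoint
    max_seq_length=4096,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```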
The thesis evaluates 6 configurations in a 3×2 factorial design:
| Config | Assumption Mode | CWE Guidance | F1-Score | Recall |
|---|---|---|---|---|
| 1 | Free | No | 15.1% | 8.7% |
| 2 | Free | Yes | 15.9% | 9.1% |
| 3 | Optimistic | No | 12.9% | 7.2% |
| 4 | Optimistic | Yes | 17.3% | 9.8% |
| 5 | Pessimistic | No | 15.2% | 8.7% |
| 6 | Pessimistic | Yes | 24.8% | 15.5% |
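For context, the precision implied by each row follows from the standard definition F1 = 2PR / (P + R). A quick back-of-the-envelope check for the best configuration, with the Random Forest baseline F1 inferred from the 4.3× claim above (assuming "4.3× higher" means 4.3× the baseline F1):

```python
# Derive the precision implied by a reported F1-score and recall:
# F1 = 2PR / (P + R)  =>  P = F1 * R / (2R - F1)
def implied_precision(f1: float, recall: float) -> float:
    return f1 * recall / (2 * recall - f1)

# Config 6 (pessimistic assumptions + CWE guidance): F1 = 24.8%, recall = 15.5%
print(f"Config 6 implied precision: {implied_precision(0.248, 0.155):.1%}")  # ~62%

# Random Forest baseline F1 inferred from the 4.3x improvement claim
print(f"Inferred RF baseline F1: {0.248 / 4.3:.1%}")  # ~5.8%
```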
This work uses the DiverseVul dataset. Due to licensing, we cannot redistribute the processed data. The preprocessing pipeline can be applied to the publicly available dataset to reproduce our results.
Final dataset: 5888 samples (4302 train / 743 val / 843 test)
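A minimal sketch of how a split with these exact counts could be reproduced with scikit-learn; the seed and any stratification by CWE label are assumptions, and the actual split logic lives in src/dataset:

```python
from sklearn.model_selection import train_test_split

# Illustrative 4302 / 743 / 843 split of 5888 preprocessed samples.
# Seed and stratification are assumptions, not necessarily the
# procedure implemented in src/dataset.
def split_dataset(samples, seed=42):
    train_val, test = train_test_split(samples, test_size=843, random_state=seed)
    train, val = train_test_split(train_val, test_size=743, random_state=seed)
    assert (len(train), len(val), len(test)) == (4302, 743, 843)
    return train, val, test

train, val, test = split_dataset(list(range(5888)))
```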
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
- DiverseVul dataset authors for making vulnerability data publicly available
- Qwen team for the Qwen2.5-Coder model
- Unsloth for efficient fine-tuning infrastructure