This repository contains the code for reproducing the experiments described in the paper "Evaluating Prompting Strategies for LLM-Based Code Generation in Security-Critical Tasks", submitted to the journal Information and Software Technology.
The repository provides a complete experimental pipeline for evaluating the reliability and functional correctness of Python code generated by Large Language Models (LLMs). The main components include:
- Application of state-of-the-art (SOTA) prompt engineering techniques.
- Python code generation from the Violent Python dataset.
- Secure and reproducible execution within isolated Docker environments.
- Dynamic validation using automated testing and statement coverage metrics.
- Combined analysis of textual similarity metrics (Edit Distance, BLEU, ROUGE, METEOR) and dynamic testing results inspired by the Ballista framework.
- Correlation analysis between textual metrics and functional test outcomes using Kendall’s τ coefficient.
This experimental design aims to quantify the gap between textual similarity and functional correctness, offering insights into the capabilities and limitations of LLMs in generating offensive Python code.
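To make the correlation step concrete, here is a minimal pure-Python sketch of Kendall's τ (the tie-aware τ-b variant). The function name `kendall_tau_b` and the sample scores are illustrative assumptions, not code or data from this repository; `correlation_analysis.py` may compute the coefficient differently, for example through a statistics library.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall's tau-b for two equal-length sequences (O(n^2) pair scan)."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from both denominator terms
        if dx == 0:
            ties_x += 1  # tied only in x
        elif dy == 0:
            ties_y += 1  # tied only in y
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + ties_x) * (untied + ties_y))
    return (concordant - discordant) / denom if denom else float("nan")


# Hypothetical per-sample scores, for illustration only.
bleu = [0.10, 0.35, 0.50, 0.80]
pass_rate = [0.2, 0.1, 0.6, 0.9]
print(round(kendall_tau_b(bleu, pass_rate), 3))
```

A value near 1 would mean the textual metric ranks samples almost the same way the functional tests do; a value near 0 would indicate the gap the experiments aim to quantify.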
Pipeline stages:
- Code Generation – Generate Python code using multiple prompt engineering strategies.
- Textual Similarity Analysis – Compare generated code against the ground-truth implementation.
- Execution Analysis – Assess correctness and robustness through automated tests inside a Docker environment.
- Correlation Analysis – Examine statistical relationships between textual metrics and test outcomes.
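As an illustration of the textual similarity stage, the sketch below computes a character-level edit (Levenshtein) distance and a normalized similarity between a generated snippet and its reference. The function names and the sample strings are assumptions made for this example; the repository lists `pylcs` among its dependencies and may compute the Edit Distance metric with it instead.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming, keeping one row in memory."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def edit_similarity(generated: str, reference: str) -> float:
    """Map the distance to [0, 1], where 1.0 means identical strings."""
    longest = max(len(generated), len(reference))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(generated, reference) / longest


# Hypothetical reference and generated lines, for illustration only.
reference = "s = socket.socket()"
generated = "sock = socket.socket()"
print(edit_distance(generated, reference), edit_similarity(generated, reference))
```

Two snippets can score high here and still behave differently at run time, which is why the pipeline pairs such metrics with execution-based tests.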
Models:
- GPT-4o (OpenAI)
- Meta-Llama-3.1-70B-Instruct (Meta)
- Qwen2.5-Coder-32B-Instruct (Alibaba Cloud)
- Phi-3.5-mini-instruct (Microsoft)
Tested Versions:
- Python 3.10.12
- pip 22.0.2
- Docker 23.0.1
- Docker Compose 1.29.2
Automatically installed by setup_env.sh:
- evaluate==0.4.5
- pylcs==0.1.1
- rouge==1.0.1
- nltk==3.9.1
- pandas==2.3.1
- openpyxl==3.1.5
The setup script verifies system requirements (Python, pip, Docker, Compose), installs the required libraries, and configures the necessary Docker networks and services.
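The prerequisite check itself lives in `setup_env.sh`. As a rough Python analogue of that first step only, the sketch below looks for the required executables on `PATH`. The executable names in `REQUIRED_TOOLS` and the helper `find_missing_tools` are assumptions for illustration; the script may check different names or also verify versions.

```python
import shutil

# Assumed executable names; adjust to match the local installation.
REQUIRED_TOOLS = ("python3", "pip", "docker", "docker-compose")


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = find_missing_tools()
if missing:
    print("Missing prerequisites:", ", ".join(missing))
else:
    print("All prerequisites found.")
```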
.
├── code generation/ # Code generation scripts and prompt patterns
│ ├── generate_code.py
│ ├── utils/
│ └── patterns_4o/
├── docker environment/ # Docker setup and environment configuration
│ ├── docker-sandbox.yml
│ ├── docker-service.yml
│ ├── requirements.txt
│ ├── setup_env.sh
│ ├── Dockerfile_sandbox
│ ├── docker_images/
│ ├── config/
│ ├── resources/
│ └── utils/
├── metrics evaluation/ # Similarity metrics and functional validation scripts
│ ├── bleu_score.py
│ ├── coverage_report.py
│ ├── expected_data.py
│ ├── output_validator.py
│ ├── similarity_metrics_analysis.py
│ └── textual_metrics.py
├── correlation analysis/ # Correlation computation and visualization
│ └── correlation_analysis.py
└── README.md
- Code Generation:
  - Generate code based on predefined prompt patterns:
    `python3 generate_code.py`
- Docker Environment:
  - Set up the environment and configure networks and services:
    `bash setup_env.sh`
  - Build the sandbox containers:
    `docker-compose -f docker-sandbox.yml build`
  - Run dynamic tests (specify the dataset ID as an argument):
    `python3 dynamic_runner.py <id_dataset>`
    where `<id_dataset>` identifies which dataset to execute (e.g., `1` for the original dataset, `2` for generated code, etc.).
  - Results will be stored in:
    - `output_{model}/`
    - `coverage_{model}/`
- Similarity Analysis:
  - Compute textual similarity metrics:
    `python3 textual_metrics.py`
    `python3 similarity_metrics_analysis.py`
- Dynamic Analysis:
  - Validate outputs and generate coverage reports:
    `python3 output_validator.py`
    `python3 coverage_report.py`
- Correlation Analysis:
  - Compute and visualize correlation results:
    `python3 correlation_analysis.py`
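For readers who want to see what isolated execution of a single generated script can look like, the sketch below builds a plain `docker run` command and runs it with a timeout. It is not how `dynamic_runner.py` works: the repository uses Docker Compose files and its own networks, whereas the image name `sandbox-image:latest`, the `/work` mount point, and both helper functions here are hypothetical.

```python
import subprocess
from pathlib import Path


def build_sandbox_command(script, image="sandbox-image:latest", network=None):
    """Build a `docker run` command that executes one script from a read-only mount."""
    script = Path(script).resolve()
    command = ["docker", "run", "--rm"]  # --rm removes the container afterwards
    if network is not None:
        command += ["--network", network]  # e.g. "none" to disable networking
    command += [
        "--volume", f"{script.parent}:/work:ro",  # mount the script directory read-only
        image, "python3", f"/work/{script.name}",
    ]
    return command


def run_in_sandbox(script, timeout=60, **kwargs):
    """Run the script in a container and capture stdout/stderr (requires Docker)."""
    return subprocess.run(
        build_sandbox_command(script, **kwargs),
        capture_output=True, text=True, timeout=timeout,
    )


# Only prints the command; call run_in_sandbox(...) on a machine with Docker installed.
print(" ".join(build_sandbox_command("generated/sample.py", network="none")))
```

Keeping command construction separate from execution makes the command easy to inspect before any untrusted code is run.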
| File/Folder | Description |
|---|---|
| `generate_code.py` | Generates code using predefined prompting patterns. |
| `patterns_4o/` | Contains all prompt templates. |
| `docker-sandbox.yml` | Defines test container configuration. |
| `docker-service.yml` | Defines service container configuration. |
| `setup_env.sh` | Checks prerequisites, installs dependencies, and sets up Docker services. |
| `execute_all_cases.py` | Defines test cases for each function. |
| `dynamic_runner.py` | Orchestrates the execution of functional tests. |
| `output_validator.py` | Validates the output of generated code. |
| `coverage_report.py` | Produces statement coverage reports. |
| `textual_metrics.py` | Computes textual similarity metrics (ED, BLEU, ROUGE, METEOR). |
| `similarity_metrics_analysis.py` | Aggregates and analyzes similarity results. |
| `correlation_analysis.py` | Computes and plots the correlation results. |
| `resources/`, `config/`, `utils/` | Auxiliary files and configurations. |
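The statement coverage metric reported by the pipeline can be illustrated with a small standard-library sketch: `ast` lists the lines on which statements start and `sys.settrace` records which of them actually run. This is an assumption-laden illustration, not the logic of `coverage_report.py`, which may rely on a dedicated coverage tool. The helper names and the `SAMPLE` snippet are hypothetical, a multi-line statement is counted by its first line only, and because `exec` runs the code in the current process it should only be tried on trusted input.

```python
import ast
import sys

FILENAME = "<generated>"  # label used to recognise frames of the traced code


def statement_lines(source):
    """Line numbers on which a statement starts."""
    return {node.lineno for node in ast.walk(ast.parse(source))
            if isinstance(node, ast.stmt)}


def executed_lines(source):
    """Run the source and collect the line numbers that were executed."""
    hit = set()

    def tracer(frame, event, arg):
        if frame.f_code.co_filename != FILENAME:
            return None  # do not trace code from other files
        if event == "line":
            hit.add(frame.f_lineno)
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        exec(compile(source, FILENAME, "exec"), {"__name__": "__traced__"})
    finally:
        sys.settrace(previous)  # restore whatever tracer was active before
    return hit


def statement_coverage(source):
    """Fraction of statements executed at least once."""
    statements = statement_lines(source)
    if not statements:
        return 1.0
    return len(executed_lines(source) & statements) / len(statements)


SAMPLE = '''\
def check(port):
    if port < 0:
        return "invalid"
    return "ok"

result = check(80)
'''
print(f"{statement_coverage(SAMPLE):.0%}")
```

In `SAMPLE`, the `return "invalid"` branch never runs, so four of the five statements are covered.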