dessertlab/Prompting-LLMs-for-Security-Critical-Tasks

Evaluating Prompting Strategies for LLM-Based Code Generation in Security-Critical Tasks

This repository contains the code for reproducing the experiments described in the paper "Evaluating Prompting Strategies for LLM-Based Code Generation in Security-Critical Tasks", submitted to the journal Information and Software Technology.

Overview

The repository provides a complete experimental pipeline for evaluating the reliability and functional correctness of Python code generated by Large Language Models (LLMs). The main components include:

  • Application of state-of-the-art (SOTA) prompt engineering techniques.
  • Python code generation from the Violent Python dataset.
  • Secure and reproducible execution within isolated Docker environments.
  • Dynamic validation using automated testing and statement coverage metrics.
  • Combined analysis of textual similarity metrics (Edit Distance, BLEU, ROUGE, METEOR) and dynamic testing results inspired by the Ballista framework.
  • Correlation analysis between textual metrics and functional test outcomes using Kendall’s τ coefficient.

This experimental design aims to quantify the gap between textual similarity and functional correctness, offering insights into the capabilities and limitations of LLMs in generating offensive Python code.
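
The gap between textual similarity and functional correctness can be illustrated with a minimal sketch. It uses a plain-Python Levenshtein distance, which is not necessarily how the repository's scripts compute Edit Distance. The two snippets and the `OPEN_PORTS` name are hypothetical and exist only for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (two-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


# Hypothetical ground truth and generated candidate (illustration only).
reference = "def is_open(port):\n    return port in OPEN_PORTS\n"
candidate = "def is_open(port):\n    return port not in OPEN_PORTS\n"

score = similarity(reference, candidate)
print(f"edit-distance similarity: {score:.3f}")
```

The candidate differs from the reference by a single inserted `not`, so it scores above 0.9 on this metric while returning the opposite result for every input. A functional test would reject it.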

Project Workflow

  1. Code Generation – Generate Python code using multiple prompt engineering strategies.
  2. Textual Similarity Analysis – Compare generated code against the ground-truth implementation.
  3. Execution Analysis – Assess correctness and robustness through automated tests inside a Docker environment.
  4. Correlation Analysis – Examine statistical relationships between textual metrics and test outcomes.
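
As a rough illustration of the correlation step, the sketch below computes Kendall's τ (the tau-b variant, which corrects for ties) in plain Python. It is not the repository's `correlation_analysis.py`, whose implementation is not shown here, and the input values are invented toy numbers, not experimental results.

```python
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equally long sequences of numbers."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue            # tied in both variables: ignored
            if dx == 0:
                ties_x += 1         # tied only in x
            elif dy == 0:
                ties_y += 1         # tied only in y
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denominator = sqrt((concordant + discordant + ties_x)
                       * (concordant + discordant + ties_y))
    if denominator == 0:
        return float("nan")         # undefined when one variable is constant
    return (concordant - discordant) / denominator


# Toy values for illustration: a textual score and a pass/fail outcome per sample.
textual_scores = [0.2, 0.4, 0.6, 0.8]
tests_passed = [0, 1, 1, 1]

tau = kendall_tau_b(textual_scores, tests_passed)
print(f"Kendall tau-b: {tau:.3f}")
```

Tau-b is a natural choice when one variable is binary or coarse, because pass/fail outcomes produce many tied pairs. Whether the paper uses tau-b or another variant is an assumption here.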

Models Evaluated

  • GPT-4o (OpenAI)
  • Meta-Llama-3.1-70B-Instruct (Meta)
  • Qwen2.5-Coder-32B-Instruct (Alibaba Cloud)
  • Phi-3.5-mini-instruct (Microsoft)

Requirements

Tested Versions:

  • Python 3.10.12
  • pip 22.0.2
  • Docker 23.0.1
  • Docker Compose 1.29.2

Python Dependencies

Automatically installed by setup_env.sh:

  • evaluate==0.4.5
  • pylcs==0.1.1
  • rouge==1.0.1
  • nltk==3.9.1
  • pandas==2.3.1
  • openpyxl==3.1.5

The setup script verifies system requirements (Python, pip, Docker, Compose), installs the required libraries, and configures the necessary Docker networks and services.
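
The prerequisite check can be approximated in Python as follows. This is a sketch of the idea, not a port of `setup_env.sh`: the tool names and the minimum Python version (3.10, taken from the tested versions above) are assumptions, and the script's actual checks may differ.

```python
import shutil
import sys

# Executables assumed to be required on PATH (assumption, based on the list above).
REQUIRED_TOOLS = ("pip", "docker", "docker-compose")


def check_prerequisites(tools=REQUIRED_TOOLS, min_python=(3, 10)):
    """Return a dict mapping each requirement to True (satisfied) or False."""
    report = {"python": sys.version_info[:2] >= min_python}
    for tool in tools:
        # shutil.which returns None when the executable is not found on PATH.
        report[tool] = shutil.which(tool) is not None
    return report


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name:15} {'found' if ok else 'MISSING'}")
```

This only confirms that the executables exist. It does not compare their versions against the tested ones.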

Repository Structure

.
├── code generation/         # Code generation scripts and prompt patterns
│   ├── generate_code.py
│   ├── utils/
│   └── patterns_4o/
├── docker environment/      # Docker setup and environment configuration
│   ├── docker-sandbox.yml
│   ├── docker-service.yml
│   ├── requirements.txt
│   ├── setup_env.sh
│   ├── Dockerfile_sandbox
│   ├── docker_images/
│   ├── config/
│   ├── resources/
│   └── utils/
├── metrics evaluation/      # Similarity metrics and functional validation scripts
│   ├── bleu_score.py
│   ├── coverage_report.py
│   ├── expected_data.py
│   ├── output_validator.py
│   ├── similarity_metrics_analysis.py
│   └── textual_metrics.py
├── correlation analysis/    # Correlation computation and visualization
│   └── correlation_analysis.py
└── README.md

Usage Instructions

  1. Code Generation:

    • Generate code based on predefined prompt patterns:
      python3 generate_code.py
  2. Docker Environment:

    • Set up the environment and configure networks and services:

      bash setup_env.sh
    • Build the sandbox containers:

      docker-compose -f docker-sandbox.yml build
    • Run dynamic tests (specify the dataset ID as argument):

      python3 dynamic_runner.py <id_dataset>

      Where <id_dataset> identifies which dataset to execute (e.g., 1 for the original dataset, 2 for code generated by a model, etc.).

    • Results will be stored in:

      • output_{model}/
      • coverage_{model}/
  3. Similarity Analysis:

    • Compute textual similarity metrics:
      python3 textual_metrics.py
      python3 similarity_metrics_analysis.py
  4. Dynamic Analysis:

    • Validate outputs and generate coverage reports:
      python3 output_validator.py
      python3 coverage_report.py
  5. Correlation Analysis:

    • Compute and visualize correlation results:
      python3 correlation_analysis.py
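
The steps above can be chained in a small driver script. The sketch below is hypothetical and not part of the repository. It assumes each command runs from the folder shown in the repository structure, and that `dynamic_runner.py` lives in `docker environment/`, which the tree does not confirm. By default it only prints the commands (dry run).

```python
import subprocess


def pipeline_steps(dataset_id):
    """Return (working directory, command) pairs in the documented order."""
    return [
        ("code generation", ["python3", "generate_code.py"]),
        ("docker environment", ["bash", "setup_env.sh"]),
        ("docker environment",
         ["docker-compose", "-f", "docker-sandbox.yml", "build"]),
        # Location of dynamic_runner.py is an assumption.
        ("docker environment", ["python3", "dynamic_runner.py", str(dataset_id)]),
        ("metrics evaluation", ["python3", "textual_metrics.py"]),
        ("metrics evaluation", ["python3", "similarity_metrics_analysis.py"]),
        ("metrics evaluation", ["python3", "output_validator.py"]),
        ("metrics evaluation", ["python3", "coverage_report.py"]),
        ("correlation analysis", ["python3", "correlation_analysis.py"]),
    ]


def run_pipeline(dataset_id, dry_run=True):
    """Print (dry run) or execute every step; stop at the first failure."""
    issued = []
    for workdir, command in pipeline_steps(dataset_id):
        issued.append(" ".join(command))
        if dry_run:
            print(f"[{workdir}] {' '.join(command)}")
        else:
            # check=True raises CalledProcessError on a non-zero exit code.
            subprocess.run(command, cwd=workdir, check=True)
    return issued


if __name__ == "__main__":
    run_pipeline(dataset_id=1)
```

Calling `run_pipeline(1, dry_run=False)` from the repository root would execute the steps for dataset 1, provided the assumed folder layout holds.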

Main Files

File/Folder                      Description
generate_code.py                 Generates code using predefined prompting patterns.
patterns_4o/                     Contains all prompt templates.
docker-sandbox.yml               Defines test container configuration.
docker-service.yml               Defines service container configuration.
setup_env.sh                     Checks prerequisites, installs dependencies, and sets up Docker services.
execute_all_cases.py             Defines test cases for each function.
dynamic_runner.py                Orchestrates the execution of functional tests.
output_validator.py              Validates the output of generated code.
coverage_report.py               Produces statement coverage reports.
textual_metrics.py               Computes textual similarity metrics (ED, BLEU, ROUGE, METEOR).
similarity_metrics_analysis.py   Aggregates and analyzes similarity results.
correlation_analysis.py          Computes and plots the correlation results.
resources/, config/, utils/      Auxiliary files and configurations.
