TF-Bench

Towards Sound Evaluation of Program Logic Reasoning with Type Inference under System F

Setup

Python

We use Python 3.11. We recommend using uv to manage your Python dependencies.

cd TF-Bench
uv sync # create a virtual environment, and install dependencies

Building TF-Bench From Scratch (Optional)

TF-Bench

The following script builds the benchmark (Prelude with NL) from the raw data:

uv run --project . scripts/preprocess_benchmark.py

TF-Bench_pure

git clone https://github.com/SecurityLab-UCD/alpharewrite.git
cd alpharewrite

stack build
stack exec alpharewrite-exe 1 TF-Bench.json > TF-Bench.pure.json

cd ..

For details, please refer to the README of alpharewrite.
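
Before passing a freshly built file to `main.py`, a quick check that it parses as JSON can catch a truncated or empty output early. The sketch below is a hypothetical helper, not part of the repository. It assumes only that the file contains valid JSON and makes no assumption about the benchmark's schema.

```python
import json
from pathlib import Path


def summarize_json(path: str) -> tuple[str, int]:
    """Return the top-level JSON type name and its number of entries.

    Hypothetical helper: assumes only that the file holds valid JSON.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    # Lists and dicts have a meaningful length; any other value counts as one entry.
    size = len(data) if isinstance(data, (list, dict)) else 1
    return type(data).__name__, size
```

For example, `summarize_json("TF-Bench.json")` raises `json.JSONDecodeError` if the file is malformed, and otherwise reports how many top-level entries it holds.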

Download Pre-built Benchmark

You can also download our pre-built benchmark from Zenodo.


Benchmarking!

GPT Models

To run a single model:

export OPENAI_API_KEY=<OPENAI_API_KEY> # make sure your API key is in environment
uv run main.py -i TF-Bench.json -m gpt-3.5-turbo
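
To evaluate a hand-picked subset of models, the single-model command can be wrapped in a short loop. This is a sketch, not the repository's `run_all.py`: `build_command` and `run_models` are hypothetical helpers, and the only flags assumed are the `-i` and `-m` options shown above.

```python
import os
import subprocess


def build_command(model: str, input_path: str = "TF-Bench.json") -> list[str]:
    """Build the single-model command from this README as an argument list."""
    return ["uv", "run", "main.py", "-i", input_path, "-m", model]


def run_models(models: list[str], input_path: str = "TF-Bench.json") -> None:
    """Run each model in turn, stopping at the first command that fails."""
    # Fail early with a clear message instead of partway through a run.
    if not os.environ.get("OPENAI_API_KEY"):
        raise RuntimeError("OPENAI_API_KEY is not set in the environment")
    for model in models:
        subprocess.run(build_command(model, input_path), check=True)
```

Calling `run_models(["gpt-3.5-turbo"])` from the repository root runs the same command as above; add further model names to the list as needed.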

To run all GPT models:

uv run run_all.py --option gpt

Open Source Models

We use Ollama to manage and run the OSS models.

curl -fsSL https://ollama.com/install.sh | sh # install ollama, you need sudo for this
ollama serve # start your own instance instead of system service
uv run ollama_pull.sh # install required models
uv run main.py -i TF-Bench.json -m llama3:70b
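
Before pulling models or starting a run, it can help to confirm that the `ollama serve` instance is accepting connections. The sketch below is a hypothetical helper, not part of the repository. It assumes Ollama's default address `127.0.0.1:11434`; pass a different host or port if your instance is configured otherwise.

```python
import socket


def is_listening(host: str = "127.0.0.1", port: int = 11434, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection raises OSError on refusal, timeout, or unreachable host.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A `False` result usually means `ollama serve` has not been started yet or is bound to a different address.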

To run all Ollama models:

uv run run_all.py --option ollama