Towards Sound Evaluation of Program Logic Reasoning with Type Inference under System F
We use Python 3.11. We recommend using uv to manage your Python dependencies.
cd TF-Bench
uv sync # create a virtual environment, and install dependencies
This script will build the benchmark (Prelude with NL) from the raw data.
uv run --project . scripts/preprocess_benchmark.py
git clone https://github.com/SecurityLab-UCD/alpharewrite.git
cd alpharewrite
stack build
stack exec alpharewrite-exe 1 TF-Bench.json > TF-Bench.pure.json
cd ..
For details, please refer to the README of alpharewrite.
You can also download our pre-built benchmark from Zenodo.
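Before running any model, it can help to confirm that the benchmark files parse as JSON. The sketch below is not part of TF-Bench: it assumes only that `TF-Bench.json` and `TF-Bench.pure.json` (the file names used in the commands above) contain valid JSON, and it makes no assumption about their schema. The helper name `summarize_json` is hypothetical.

```python
import json
from pathlib import Path


def summarize_json(path: str) -> str:
    """Return a one-line summary of a JSON file: top-level type and size."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    kind = type(data).__name__
    # Only lists and dicts have a meaningful entry count; treat scalars as 1.
    size = len(data) if isinstance(data, (list, dict)) else 1
    return f"{path}: top-level {kind} with {size} entries"


if __name__ == "__main__":
    # File names are taken from the build commands above (assumed to be
    # in the current directory).
    for p in ["TF-Bench.json", "TF-Bench.pure.json"]:
        if Path(p).exists():
            print(summarize_json(p))
        else:
            print(f"{p}: not found")
```

If a file fails to parse, `json.loads` raises `json.JSONDecodeError`, which points to a truncated download or an interrupted build step.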
To run a single model:
export OPENAI_API_KEY=<OPENAI_API_KEY> # make sure your API key is in environment
uv run main.py -i TF-Bench.json -m gpt-3.5-turbo
To run all GPT models:
uv run run_all.py --option gpt
We use Ollama to manage and run the OSS models.
curl -fsSL https://ollama.com/install.sh | sh # install ollama, you need sudo for this
ollama serve # start your own instance instead of system service
uv run ollama_pull.sh # install required models
uv run main.py -i Benchmark-F.json -m llama3:70b
To run all Ollama models:
uv run run_all.py --option ollama
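To evaluate a hand-picked subset of models rather than everything `run_all.py` covers, the single-model command can be looped. This is a minimal sketch, assuming only the `-i` and `-m` flags of `main.py` shown above. The helper names `build_command` and `run_models` are hypothetical, and this is not a description of how `run_all.py` is implemented.

```python
import os
import subprocess


def build_command(benchmark: str, model: str) -> list[str]:
    """Build the single-model command shown above as an argument list."""
    return ["uv", "run", "main.py", "-i", benchmark, "-m", model]


def run_models(benchmark: str, models: list[str], dry_run: bool = True) -> list[list[str]]:
    """Run (or, with dry_run, only print) one evaluation per model, in order."""
    commands = [build_command(benchmark, m) for m in models]
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops at the first failing run instead of continuing.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # GPT models need the key exported as shown above.
    if "OPENAI_API_KEY" not in os.environ:
        print("warning: OPENAI_API_KEY is not set")
    # Model names are the two examples from this README; dry_run avoids real calls.
    run_models("TF-Bench.json", ["gpt-3.5-turbo", "llama3:70b"], dry_run=True)
```

Set `dry_run=False` to actually launch the runs. Passing the command as a list avoids shell quoting issues with model names such as `llama3:70b`.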