# TF-Bench

[![python](https://img.shields.io/badge/Python-3.12-3776AB.svg?style=flat&logo=python&logoColor=white)](https://www.python.org)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

Evaluating Program Semantics Reasoning with Type Inference in System _F_

![evaluation workflow](./imgs/tfb.png)

If you find this work useful, please cite us as:

```bibtex
@inproceedings{he2025tfbench,
  author    = {He, Yifeng and Yang, Luning and Gonzalo, Christopher and Chen, Hao},
  title     = {Evaluating Program Semantics Reasoning with Type Inference in System F},
  booktitle = {Neural Information Processing Systems (NeurIPS)},
  date      = {2025-11-30/2025-12-07},
  address   = {San Diego, CA, USA},
}
```

## Development

### Python

and [impredicative polymorphism](https://ghc.gitlab.haskell.org/ghc/doc/users_gu),
so we require GHC version >= 9.2.1.
Our evaluation used GHC-9.6.7.

## Building TF-Bench from scratch (optional)

### TF-Bench (base)

This script builds the benchmark (Prelude with NL) from the raw data.

```sh
uv run scripts/preprocess_benchmark.py -o tfb.json
```

### TF-Bench (pure)

```sh
git clone https://github.com/SecurityLab-UCD/alpharewrite.git
cd ..
```

For details, please check out the README of [alpharewrite](https://github.com/SecurityLab-UCD/alpharewrite).

## Download pre-built benchmark

You can also load TF-Bench through HuggingFace `datasets`:

```python
from datasets import load_dataset

split = "pure"  # or "base"
dataset = load_dataset("SecLabUCD/TF-Bench", split=split)
```

Or through our provided package:

```python
from tfbench import load_tfb_from_hf

split = "pure"  # or "base"
dataset = load_tfb_from_hf(split)
```

## Using as an application

```sh
git clone https://github.com/SecurityLab-UCD/TF-Bench.git
cd TF-Bench
uv sync
```

Please have your API key ready in `.env`.
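
For example, a minimal `.env` could look like the fragment below; `OPENAI_API_KEY` is the variable name the OpenAI SDK reads, and you would add the analogous key variable for each provider you plan to benchmark:

```sh
OPENAI_API_KEY=<your key here>
```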

### Proprietary models

We use each provider's official SDK to access their models.
You can check our pre-supported models in the `tfbench.lm` module:

```python
from tfbench.lm import supported_models

print(supported_models)
```

To run a single model (this runs both the `base` and `pure` splits):

```sh
uv run main.py -m gpt-5-2025-08-07
```

### Open-weights models with Ollama

We used [Ollama](https://ollama.com/) to manage and run the OSS models reported in the Appendix
(our experiments used Ollama version 0.11.7),
but have since switched to vLLM for better performance and SDK design.

Run the benchmark:

```sh
uv run src/main.py -m llama3:8b
```

### Running any model on HuggingFace Hub

We also support running any model on the HuggingFace Hub out-of-the-box.
We provide an example using Qwen3:

```sh
uv run src/main.py Qwen/Qwen3-4B-Instruct-2507 # or other models
```

Note that our `main.py` uses a predefined model router,
which routes all unrecognized model names to HuggingFace.
We use the `</think>` token to parse the thinking process;
if your model delimits thinking differently, please see the next section.
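
As a rough sketch of what this parsing looks like (not the actual `tfbench` implementation; the `split_thinking` name is made up):

```python
def split_thinking(raw: str) -> tuple[str, str]:
    """Split a raw model response into (reasoning, answer) on the `</think>` token."""
    reasoning, sep, answer = raw.partition("</think>")
    if not sep:
        # No thinking block: the whole output is the answer.
        return "", raw.strip()
    return reasoning.removeprefix("<think>").strip(), answer.strip()


print(split_thinking("<think>apply the rule</think>f :: a -> b"))
# → ('apply the rule', 'f :: a -> b')
```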
115157
### Running your own model

To support your customized model,
pass the path to your HuggingFace-compatible checkpoint to `main.py`:

```sh
uv run src/main.py <path to your checkpoint>
```

## Using as a package

Our package is also available on PyPI:

```sh
uv add tfbench
```

Or directly with pip, you know the way:

```sh
pip install tfbench
```

### Proprietary models that are not yet supported

Our supported-model list is used to route each model name to the correct SDK.
Even if a newly released model is not in that list,
you can still use it by specifying the SDK client directly.
We take OpenAI GPT-4.1 as an example here:

```python
from tfbench import run_one_model
from tfbench.lm import OpenAIResponses

model = "gpt-4.1"
split = "pure"
client = OpenAIResponses(model_name=model, pure=split == "pure", effort=None)
eval_result = run_one_model(client, pure=split == "pure", effort=None)
```

### Supporting other customized models

You may implement your own `LM` subclass:

```python
from tfbench.lm._types import LM, LMAnswer


class YourLM(LM):
    def __init__(self, model_name: str, pure: bool = False):
        """Initialize your model."""
        super().__init__(model_name=model_name, pure=pure)
        ...

    def _gen(self, prompt: str) -> LMAnswer:
        """Your generation logic here."""
        # `content` and `thinking_content` come from your model's output.
        return LMAnswer(answer=content, reasoning_steps=thinking_content)
```
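
To see the shape of this contract in isolation, here is a toy, dependency-free sketch; the `LM` and `LMAnswer` classes below are simplified hypothetical stand-ins for the real `tfbench.lm._types` ones, and `EchoLM` is a made-up model that just echoes its prompt:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


# Hypothetical, simplified stand-ins for the tfbench.lm._types classes.
@dataclass
class LMAnswer:
    answer: str
    reasoning_steps: str = ""


class LM(ABC):
    def __init__(self, model_name: str, pure: bool = False):
        self.model_name = model_name
        self.pure = pure

    @abstractmethod
    def _gen(self, prompt: str) -> LMAnswer: ...


# A made-up "model" that echoes the prompt back as its answer.
class EchoLM(LM):
    def _gen(self, prompt: str) -> LMAnswer:
        return LMAnswer(answer=prompt, reasoning_steps="(none)")


ans = EchoLM("echo")._gen("id :: a -> a")
print(ans.answer)  # → id :: a -> a
```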