Evaluating Program Semantics Reasoning with Type Inference in System _F_

## Development

### Python

and [impredicative polymorphism](https://ghc.gitlab.haskell.org/ghc/doc/users_gu),
so we require GHC version >= 9.2.1.
Our evaluation used GHC-9.6.7.

## Building TF-Bench from scratch (optional)

### TF-Bench (base)

This script builds the benchmark (Prelude with NL) from the raw data:

```sh
uv run scripts/preprocess_benchmark.py -o tfb.json
```

### TF-Bench (pure)

```sh
git clone https://github.com/SecurityLab-UCD/alpharewrite.git
cd alpharewrite
# ... run the rewrite to produce the pure split (see the alpharewrite README) ...
cd ..
```

For details, please check out the README of [alpharewrite](https://github.com/SecurityLab-UCD/alpharewrite).

## Download pre-built benchmark

You can also load the pre-built TF-Bench from HuggingFace Datasets:

```python
from datasets import load_dataset

split = "pure"  # or "base"
dataset = load_dataset("SecLabUCD/TF-Bench", split=split)
```

Or load it through our provided package:

```python
from tfbench import load_tfb_from_hf

split = "pure"  # or "base"
dataset = load_tfb_from_hf(split)
```

## Using as an application

```sh
git clone https://github.com/SecurityLab-UCD/TF-Bench.git
cd TF-Bench
uv sync
```

Please have your API key ready in `.env`.

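A `.env` file holds one `KEY=value` pair per line; the exact variable names depend on the provider SDK (e.g. `OPENAI_API_KEY` for OpenAI). As a minimal stdlib sketch of what such a loader does (not the mechanism tfbench itself uses, and the key name below is made up for illustration):

```python
import os
import tempfile

def load_dotenv_minimal(path: str) -> None:
    """Read KEY=value lines from a .env file into os.environ."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            # skip blank lines and comments
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

# demo with a throwaway file; the key name is illustrative only
with tempfile.NamedTemporaryFile("w", suffix=".env", delete=False) as f:
    f.write("# provider keys\nEXAMPLE_API_KEY=sk-not-a-real-key\n")
    env_path = f.name

load_dotenv_minimal(env_path)
print(os.environ["EXAMPLE_API_KEY"])  # sk-not-a-real-key
```

In practice, put your real keys in the repo's `.env` and keep them out of version control.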
### Proprietary models

We use each provider's official SDK to access their models.
You can check our pre-supported models in the `tfbench.lm` module:

```python
from tfbench.lm import supported_models

print(supported_models)
```

To run a single model (this runs both the `base` and `pure` splits):

```sh
uv run main.py -m gpt-5-2025-08-07
```

### Open-weights models with Ollama

We use [Ollama](https://ollama.com/) to manage and run the OSS models reported in the Appendix.
We have since switched to vLLM for its better performance and SDK design.

Our experiments used ollama version 0.11.7.

Run the benchmark:

```sh
uv run src/main.py -m llama3:8b
```

### Running any model on HuggingFace Hub

We also support running any model hosted on the HuggingFace Hub out of the box.
Here is an example using Qwen3:

```sh
uv run src/main.py Qwen/Qwen3-4B-Instruct-2507  # or other models
```

Note that our `main.py` uses a pre-defined model router,
which routes all unrecognized model names to HuggingFace.
We use the `</think>` token to parse the thinking process;
if your model delimits its thinking differently, please see the customized-model section below.
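The actual routing and parsing logic lives inside tfbench; purely to illustrate the idea (the prefix table and helper names below are made up, not the real implementation):

```python
def route(model_name: str) -> str:
    """Toy router: known prefixes go to a provider SDK, everything else to HuggingFace."""
    prefixes = {"gpt-": "openai", "claude-": "anthropic", "gemini-": "google"}
    for prefix, provider in prefixes.items():
        if model_name.startswith(prefix):
            return provider
    return "huggingface"  # fallback for unrecognized names

def split_thinking(raw: str) -> tuple[str, str]:
    """Separate reasoning from the final answer on the last </think> token."""
    thinking, sep, answer = raw.rpartition("</think>")
    return (thinking, answer) if sep else ("", raw)

print(route("gpt-4.1"))                      # openai
print(route("Qwen/Qwen3-4B-Instruct-2507"))  # huggingface
print(split_thinking("step 1 ... step n</think>Int -> Int"))
```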

### Running your own model

To evaluate a customized model,
pass the path to your HuggingFace-compatible checkpoint to our `main.py`:

```sh
uv run src/main.py <path to your checkpoint>
```

## Using as a package

Our package is also available on PyPI:

```sh
uv add tfbench
```

Or install it directly with pip:

```sh
pip install tfbench
```

### Proprietary models that are not currently supported

Our supported-models list is used to route each model name to the correct SDK.
Even if a newly released model is not in the list,
you can still use it by specifying the SDK client directly.
We take OpenAI's GPT-4.1 as an example here:

```python
from tfbench.lm import OpenAIResponses
from tfbench import run_one_model

model = "gpt-4.1"
split = "pure"
client = OpenAIResponses(model_name=model, pure=split == "pure", effort=None)
eval_result = run_one_model(client, pure=split == "pure", effort=None)
```

### Support other customized models

You may implement your own `LM` subclass:

```python
from tfbench.lm._types import LM, LMAnswer

class YourLM(LM):
    def __init__(self, model_name: str, pure: bool = False):
        """Initialize your model."""
        super().__init__(model_name=model_name, pure=pure)
        ...

    def _gen(self, prompt: str) -> LMAnswer:
        """Your generation logic goes here."""
        # `content` and `thinking_content` come from your model's output
        return LMAnswer(answer=content, reasoning_steps=thinking_content)
```
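
As a self-contained illustration of the pattern, here is a runnable toy with stand-in `LM`/`LMAnswer` stubs so it works without tfbench installed; in practice, replace the stubs with the real imports from `tfbench.lm._types`:

```python
from dataclasses import dataclass

# Stand-in stubs for illustration only; use tfbench.lm._types in practice.
@dataclass
class LMAnswer:
    answer: str
    reasoning_steps: str

class LM:
    def __init__(self, model_name: str, pure: bool = False):
        self.model_name = model_name
        self.pure = pure

class EchoLM(LM):
    """A toy LM that 'answers' with a canned type signature."""

    def _gen(self, prompt: str) -> LMAnswer:
        # A real implementation would call your model here.
        return LMAnswer(answer="a -> a", reasoning_steps="identity function")

lm = EchoLM(model_name="echo", pure=True)
ans = lm._gen("id :: ?")
print(ans.answer)  # a -> a
```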