@@ -8,6 +8,7 @@ Evaluating Program Semantics Reasoning with Type Inference in System _F_
 ![evaluation workflow](./imgs/tfb.png)
 
 If you find this work useful, please cite us as:
+
 ```bibtex
 @inproceedings{he2025tfbench,
   author = {He, Yifeng and Yang, Luning and Gonzalo, Christopher and Chen, Hao},
@@ -22,7 +23,7 @@ If you find this work useful, please cite us as:
 
 ### Python
 
-We use Python 3.11.
+We use Python 3.12.
 We recommend using [uv](https://docs.astral.sh/uv/getting-started/installation/) to manage your Python dependencies.
 
 ```sh
@@ -71,7 +72,7 @@ For details, please check out the README of [alpharewrite](https://github.com/Se
 
 ## Download pre-built benchmark
 
-You can also use TF-Bench on HuggingFace datasets.
+You can also use TF-Bench via HuggingFace datasets.
 
 ```python
 from datasets import load_dataset
@@ -96,10 +97,9 @@ cd TF-Bench
 uv sync
 ```
 
-Please have your API key ready in `.env`.
-
 ### Proprietary models
 
+Please have your API key ready in `.env`.
 We use each provider's official SDK to access their models.
 You can check our pre-supported models in the `tfbench.lm` module.
 
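For reference, a `.env` in the repository root might look like the sketch below. This is an assumption for illustration: the variable names follow each provider's official SDK conventions, and you would include only the providers you actually use.

```sh
# Standard environment variables read by the official provider SDKs (sketch)
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_API_KEY=...
```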
@@ -111,7 +111,7 @@ print(supported_models)
 To run a single model, which runs both `base` and `pure` splits:
 
 ```sh
-uv run main.py -m gpt-5-2025-08-07
+uv run src/main.py -m gpt-5-2025-08-07
 ```
 
 ### Open-weights models with Ollama
@@ -153,7 +153,7 @@ uv run src/main.py Qwen/Qwen3-4B-Instruct-2507 # or other models
 Note that our `main.py` uses a pre-defined model router,
 which routes all unrecognized model names to HuggingFace.
 We use the `</think>` token to parse the thinking process,
-if the model do it differently, please see the next section.
+if the model does it differently, please see [Supporting other customized models].
 
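The `</think>`-based parsing mentioned above can be sketched as follows. This is a minimal illustration of the idea, not the exact `tfbench` implementation:

```python
# Minimal sketch: split a raw completion on the </think> token,
# separating the model's reasoning from its final answer.
raw = "<think>reason about the polymorphic type...</think>f :: a -> b"
thinking, sep, answer = raw.partition("</think>")
thinking = thinking.removeprefix("<think>").strip()
answer = answer.strip()
# The text after </think> is treated as the model's answer;
# models that emit a different delimiter need a custom LM instance.
```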
 ### Running your own model
159159
@@ -190,14 +190,14 @@ from tfbench.lm import OpenAIResponse
 from tfbench import run_one_model
 
 model = "gpt-4.1"
-split = "pure"
-client = OpenAIResponses(model_name=model, pure=split == "pure", effort=None)
-eval_result = run_one_model(client, pure=split == "pure", effort=None)
+pure = True
+client = OpenAIResponses(model_name=model, pure=pure, effort=None)
+eval_result = run_one_model(client, pure=pure)
 ```
197197
-### Support other customized models
+### Supporting other customized models
 
-You may implement an `LM` instance.
+Implementing an `LM` instance is all you need.
 
 ```python
 from tfbench.lm._types import LM, LMAnswer
@@ -211,4 +211,7 @@ class YourLM(LM):
     def _gen(self, prompt: str) -> LMAnswer:
         """your generation logic here"""
         return LMAnswer(answer=content, reasoning_steps=thinking_content)
+
+client = YourLM("xxx")
+eval_result = run_one_model(client)
 ```