
Commit d4bdb93

EYH0602 and Copilot authored
doc: make readme and export clearer (#68)
* add transformers generation as default
* Update src/tfbench/lm/_hf.py (Co-authored-by: Copilot)
* Update src/tfbench/lm/_hf.py (Co-authored-by: Copilot)
* remove unnecessary imports
* doc: improve instructions
* fix: unused parameter and import
* enable github actions on main commits
* doc: add badges and images

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
1 parent dadb45e commit d4bdb93

File tree: 9 files changed (+167, −51 lines)


.github/workflows/mypy.yml

Lines changed: 4 additions & 1 deletion
```diff
@@ -1,6 +1,9 @@
 name: MyPy Type Checking
 
-on: [pull_request]
+on:
+  push:
+    branches: [main]
+  pull_request:
 
 jobs:
   type-check:
```

.github/workflows/pylint.yml

Lines changed: 4 additions & 1 deletion
```diff
@@ -1,6 +1,9 @@
 name: Pylint Linting
 
-on: [pull_request]
+on:
+  push:
+    branches: [main]
+  pull_request:
 
 jobs:
   linting:
```

.github/workflows/ruff.yml

Lines changed: 4 additions & 1 deletion
```diff
@@ -1,6 +1,9 @@
 name: Ruff Linting
 
-on: [pull_request]
+on:
+  push:
+    branches: [main]
+  pull_request:
 
 jobs:
   linting:
```

.github/workflows/unitttest.yml

Lines changed: 4 additions & 1 deletion
```diff
@@ -1,6 +1,9 @@
 name: Unit Testing
 
-on: [pull_request]
+on:
+  push:
+    branches: [main]
+  pull_request:
 
 jobs:
   unittest:
```

README.md

Lines changed: 109 additions & 36 deletions
````diff
@@ -1,8 +1,24 @@
 # TF-Bench
 
+[![python](https://img.shields.io/badge/Python-3.12-3776AB.svg?style=flat&logo=python&logoColor=white)](https://www.python.org)
+[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
+
 Evaluating Program Semantics Reasoning with Type Inference in System _F_
 
-## Setup
+![evaluation workflow](./imgs/tfb.png)
+
+If you find this work useful, please cite us as:
+```bibtex
+@inproceedings{he2025tfbench,
+  author = {He, Yifeng and Yang, Luning and Gonzalo, Christopher and Chen, Hao},
+  title = {Evaluating Program Semantics Reasoning with Type Inference in System F},
+  booktitle = {Neural Information Processing Systems (NeurIPS)},
+  date = {2025-11-30/2025-12-07},
+  address = {San Diego, CA, USA},
+}
+```
+
+## Development
 
 ### Python
 
@@ -29,17 +45,17 @@ and [impredicative polymorphism](https://ghc.gitlab.haskell.org/ghc/doc/users_gu
 so we require GHC version >= 9.2.1.
 Our evaluation used GHC-9.6.7.
 
-## Building TF-Bench From Scratch (Optional)
+## Building TF-Bench from scratch (optional)
 
-### TF-Bench
+### TF-Bench (base)
 
 This script will build the benchmark (Prelude with NL) from the raw data.
 
 ```sh
 uv run scripts/preprocess_benchmark.py -o tfb.json
 ```
 
-### TF-Bench_pure
+### TF-Bench (pure)
 
 ```sh
 git clone https://github.com/SecurityLab-UCD/alpharewrite.git
@@ -53,38 +69,52 @@ cd ..
 
 For details, please check out the README of [alpharewrite](https://github.com/SecurityLab-UCD/alpharewrite).
 
-## Download Pre-built Benchmark
+## Download pre-built benchmark
+
+You can also use TF-Bench on HuggingFace datasets.
 
-You can also download our pre-built benchmark from [Zenodo](https://doi.org/10.5281/zenodo.14751813).
+```python
+from datasets import load_dataset
 
-<a href="https://doi.org/10.5281/zenodo.14751813"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.14751813.svg" alt="DOI"></a>
+split = "pure"  # or "base"
+dataset = load_dataset("SecLabUCD/TF-Bench", split=split)
+```
 
-## Benchmarking!
+Or through our provided package.
 
-Please have your API key ready in `.env`.
-Please note that the `.env` in the repository is tracked by git,
-we recommend telling your git to ignore its changes by
+```python
+from tfbench import load_tfb_from_hf
+
+dataset = load_tfb_from_hf(split)
+```
+
+## Using as an application
 
 ```sh
-git update-index --assume-unchanged .env
+git clone https://github.com/SecurityLab-UCD/TF-Bench.git
+cd TF-Bench
+uv sync
 ```
 
-### GPT Models
+Please have your API key ready in `.env`.
 
-To run single model:
+### Proprietary models
 
-```sh
-export OPENAI_API_KEY=<OPENAI_API_KEY> # make sure your API key is in the environment
-uv run main.py -i TF-Bench.json -m gpt-3.5-turbo
+We use each provider's official SDK to access their models.
+You can check our pre-supported models in the `tfbench.lm` module.
+
+```python
+from tfbench.lm import supported_models
+print(supported_models)
 ```
 
-To run all GPT models:
+To run a single model on both the `base` and `pure` splits:
 
 ```sh
-uv run run_all.py --option gpt
+uv run main.py -m gpt-5-2025-08-07
 ```
 
-### Open Source Models with Ollama
+### Open-weights models with Ollama
 
 We use [Ollama](https://ollama.com/) to manage and run the OSS models reported in the Appendix.
 We switched to vLLM for better performance and SDK design.
@@ -108,34 +138,77 @@ ollama version is 0.11.7
 Run the benchmark.
 
 ```sh
-uv run scripts/experiment_ollama.py -m llama3:8b
+uv run src/main.py -m llama3:8b
+```
+
+### Running any model on HuggingFace Hub
+
+We also support running any model that is on HuggingFace Hub out of the box.
+We provide an example using Qwen3.
+
+```sh
+uv run src/main.py Qwen/Qwen3-4B-Instruct-2507 # or other models
 ```
 
-### (WIP) Running Your Model with vLLM
+Note that our `main.py` uses a pre-defined model router,
+which routes all unrecognized model names to HuggingFace.
+We use the `</think>` token to parse the thinking process;
+if your model does this differently, please see the next section.
 
-#### OpenAI-Compatible Server
+### Running your own model
 
-First, launch the vLLM OpenAI-Compatible Server (with default values, please check vLLM's doc for setting your own):
+To support your customized model,
+you can pass the path to your HuggingFace-compatible checkpoint to our `main.py`.
 
 ```sh
-uv run vllm serve openai/gpt-oss-120b --tensor-parallel-size 2 --async-scheduling
+uv run src/main.py <path to your checkpoint>
 ```
 
-Then, run the benchmark:
+## Using as a package
+
+Our package is also available on PyPI.
 
 ```sh
-uv run main.py -i Benchmark-F.json -m vllm_openai_chat_completion
+uv add tfbench
 ```
 
-NOTE: if you set your API key, host, and port when launching the vLLM server,
-please add them to the `.env` file as well.
-Please modify `.env` for your vLLM api-key, host, and port.
-If they are left empty, the default values ("", "localhost", "8000") will be used.
-We do not recommend using the default values on machine connect to the public web,
-as they are not secure.
+Or directly with pip:
 
+```sh
+pip install tfbench
 ```
-VLLM_API_KEY=
-VLLM_HOST=
-VLLM_PORT=
+
+### Proprietary model checkpoints that are not currently supported
+
+Our supported model list is used to route the model name to the correct SDK.
+Even if a newly released model is not in our supported models list,
+you can still use it by specifying the SDK client directly.
+We take OpenAI GPT-4.1 as an example here.
+
+```python
+from tfbench.lm import OpenAIResponses
+from tfbench import run_one_model
+
+model = "gpt-4.1"
+split = "pure"
+client = OpenAIResponses(model_name=model, pure=split == "pure", effort=None)
+eval_result = run_one_model(client, pure=split == "pure")
+```
+
+### Support other customized models
+
+You may implement an `LM` instance.
+
+```python
+from tfbench.lm._types import LM, LMAnswer
+
+class YourLM(LM):
+    def __init__(self, model_name: str, pure: bool = False):
+        """initialize your model"""
+        super().__init__(model_name=model_name, pure=pure)
+        ...
+
+    def _gen(self, prompt: str) -> LMAnswer:
+        """your generation logic here"""
+        return LMAnswer(answer=content, reasoning_steps=thinking_content)
 ```
````
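To make the README's `YourLM` skeleton concrete, here is a self-contained sketch of the same subclassing pattern. The `LM` and `LMAnswer` definitions below are simplified stand-ins for `tfbench.lm._types` (the real classes surely carry more fields and behavior), and `EchoLM` is a hypothetical dummy model used only to exercise the interface:

```python
# Self-contained sketch of the LM subclassing pattern; the base classes here
# are simplified guesses at tfbench.lm._types, not the real implementations.
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class LMAnswer:
    answer: str
    reasoning_steps: str = ""


class LM(ABC):
    def __init__(self, model_name: str, pure: bool = False):
        self.model_name = model_name
        self.pure = pure

    @abstractmethod
    def _gen(self, prompt: str) -> LMAnswer:
        """Subclasses implement the actual generation call."""

    def generate(self, prompt: str) -> LMAnswer:
        # The real base class may add prompt formatting / response parsing here.
        return self._gen(prompt)


class EchoLM(LM):
    """A dummy model that 'infers' a fixed type, just to exercise the interface."""

    def _gen(self, prompt: str) -> LMAnswer:
        return LMAnswer(answer="a -> a", reasoning_steps="identity function")


client = EchoLM(model_name="echo", pure=True)
print(client.generate("id :: ?").answer)  # a -> a
```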

imgs/tfb.png

Binary image added (147 KB)

src/main.py

Lines changed: 11 additions & 3 deletions
```diff
@@ -5,6 +5,7 @@
 from orjsonl import orjsonl
 
 from tfbench import run_one_model, analysis_multi_runs, EvalResult
+from tfbench.lm import router
 
 
 def main(
@@ -13,20 +14,27 @@ def main(
     n_repeats: int = 3,
     log_file: str = "evaluation_log.jsonl",
 ):
-    """Main script to run experiments reported in the paper"""
+    """Ready-to-use evaluation script for a single model.
+
+    Args:
+        model (str): The model's name; please refer to `tfbench.lm.supported_models` for supported models.
+        effort (str | None, optional): The effort level to use for evaluation. Defaults to None.
+        n_repeats (int, optional): The number of times to repeat the evaluation. Defaults to 3.
+        log_file (str, optional): The file to log results to. Defaults to "evaluation_log.jsonl".
+    """
 
     def _run(pure: bool):
+        client = router(model, pure, effort)
         results: list[EvalResult] = []
         split = "pure" if pure else "base"
         result_dir = abspath(pjoin("results", model, split))
         for i in range(n_repeats):
             os.makedirs(result_dir, exist_ok=True)
             result_file = pjoin(result_dir, f"run-{i}.jsonl")
             r = run_one_model(
-                model,
+                client,
                 pure=pure,
                 output_file=result_file,
-                effort=effort,
             )
             results.append(r)
         return analysis_multi_runs(results)
```
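The reworked `main.py` follows a repeat-and-aggregate pattern: build the client once per split, run the benchmark `n_repeats` times, then summarize across runs. A minimal standalone sketch of that loop, where `run_once` and the toy `analysis_multi_runs` are stand-ins for the real tfbench functions:

```python
# Sketch of main.py's repeat-and-aggregate loop; run_once stands in for
# run_one_model(client, ...) and the aggregation is a plausible mean/stdev,
# not necessarily what tfbench's analysis_multi_runs computes.
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass
class EvalResult:
    accuracy: float


def run_once(seed: int) -> EvalResult:
    # Stand-in for one full generation + evaluation pass over the benchmark.
    return EvalResult(accuracy=0.70 + 0.01 * seed)


def analysis_multi_runs(results: list[EvalResult]) -> dict[str, float]:
    accs = [r.accuracy for r in results]
    return {"mean": mean(accs), "stdev": stdev(accs)}


results = [run_once(i) for i in range(3)]  # n_repeats = 3
summary = analysis_multi_runs(results)
print(summary)
```

Repeating the run and reporting spread (rather than a single accuracy) is what makes the logged numbers comparable across stochastic decoders.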

src/tfbench/experiment.py

Lines changed: 4 additions & 6 deletions
```diff
@@ -7,20 +7,19 @@
 
 from .common import get_prompt
 from .evaluation import prover_evaluate, EvalResult
-from .lm import router, LMAnswer
+from .lm import LMAnswer, LM
 from .load import load_tfb_from_hf
 
 
 def run_one_model(
-    model: str,
+    client: LM,
     pure: bool = False,
     output_file: str | None = None,
-    effort: str | None = None,
 ) -> EvalResult:
     """Running the generation & evaluation pipeline for one pre-supported model
 
     Args:
-        model (str): name of the model to evaluate
+        client (LM): an LM client wrapper exposing `generate`
         pure (bool, optional): To evaluate on the `pure` split or not. Defaults to False.
         output_file (str | None, optional): The file to save generation result. Defaults to None.
             Warning: If None, generation results will not be saved to disk.
@@ -30,11 +29,10 @@ def run_one_model(
     Returns:
         EvalResult: evaluation result including accuracy
     """
-    client = router(model, pure, effort)
 
     tasks = load_tfb_from_hf("pure" if pure else "base")
     gen_results: list[LMAnswer | None] = []
-    for task in tqdm(tasks, desc=model):
+    for task in tqdm(tasks, desc=client.model_name):
        prompt = get_prompt(task)
 
        response = client.generate(prompt)
```
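This change is a dependency-injection refactor: `run_one_model` now accepts any pre-built `LM` client instead of routing a model-name string itself, so custom clients plug in without touching the pipeline. A self-contained sketch of the resulting shape, with a hypothetical `FakeClient` and a simplified exact-match scorer standing in for tfbench's prover-based evaluation:

```python
# Sketch of the post-refactor pipeline: anything with `generate` and
# `model_name` can be passed in. FakeClient and the exact-match scorer are
# illustrative stand-ins, not tfbench's real evaluation.
from dataclasses import dataclass


@dataclass
class LMAnswer:
    answer: str


class FakeClient:
    model_name = "fake-model"

    def generate(self, prompt: str) -> LMAnswer:
        # Pretend every task is the identity function.
        return LMAnswer(answer="a -> a")


def run_one_model(client, tasks: list[str]) -> float:
    """Generate one answer per task and score exact matches."""
    answers = [client.generate(t) for t in tasks]
    correct = sum(a.answer == "a -> a" for a in answers)
    return correct / len(tasks)


accuracy = run_one_model(FakeClient(), tasks=["id :: ?", "const :: ?"])
print(accuracy)  # 1.0
```

Injecting the client also explains why the `effort` parameter disappeared here: effort is now the constructing caller's concern, fixed when the client is built.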

src/tfbench/lm/__init__.py

Lines changed: 27 additions & 2 deletions
```diff
@@ -2,14 +2,34 @@
 
 from .prompts import get_sys_prompt
 from .settings import MAX_TOKENS
-from ._openai import OpenAIChatCompletion, OpenAIResponses
-from ._google import GeminiChat, GeminiReasoning
+from ._openai import (
+    OAI_MODELS,
+    OAI_TTC_MODELS,
+    OAI_O5,
+    OpenAIChatCompletion,
+    OpenAIResponses,
+)
+from ._google import GEMINI_MODELS, GEMINI_TTC_MODELS, GeminiChat, GeminiReasoning
+from ._anthropic import CLAUDE_MODELS, CLAUDE_TTC_MODELS, ClaudeChat, ClaudeReasoning
+from ._ollama import OLLAMA_TTC_MODELS, OllamaChat
+from ._hf import HFChat
 from ._types import LM, LMAnswer
 from .utils import router, extract_response
 
 logging.getLogger("openai").setLevel(logging.ERROR)
 logging.getLogger("httpx").setLevel(logging.ERROR)
 
+supported_models = (
+    OAI_MODELS
+    + OAI_TTC_MODELS
+    + OAI_O5
+    + GEMINI_MODELS
+    + GEMINI_TTC_MODELS
+    + CLAUDE_MODELS
+    + CLAUDE_TTC_MODELS
+    + OLLAMA_TTC_MODELS
+)
+
 __all__ = [
     "get_sys_prompt",
     "MAX_TOKENS",
@@ -19,6 +39,11 @@
     "OpenAIResponses",
     "GeminiChat",
     "GeminiReasoning",
+    "ClaudeChat",
+    "ClaudeReasoning",
+    "OllamaChat",
+    "HFChat",
     "router",
     "extract_response",
+    "supported_models",
 ]
```
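The new `supported_models` value concatenates per-provider model lists, and `router` (in `tfbench.lm.utils`) presumably consults those same lists to pick a client class, falling back to HuggingFace for unrecognized names, as the README describes. A toy sketch of that registry-plus-fallback design; every list and class below is illustrative, not the real tfbench contents:

```python
# Toy registry-plus-router sketch: per-provider lists are concatenated into
# supported_models, and unknown names fall through to a HuggingFace client.
# All names here are illustrative stand-ins.
OAI_MODELS = ["gpt-4.1"]
GEMINI_MODELS = ["gemini-2.5-pro"]
CLAUDE_MODELS = ["claude-sonnet-4"]

supported_models = OAI_MODELS + GEMINI_MODELS + CLAUDE_MODELS


class OpenAIClient: ...
class GeminiClient: ...
class ClaudeClient: ...
class HFClient: ...  # fallback for unrecognized names


def router(model: str) -> type:
    """Map a model name to a client class; unknown names go to HF."""
    if model in OAI_MODELS:
        return OpenAIClient
    if model in GEMINI_MODELS:
        return GeminiClient
    if model in CLAUDE_MODELS:
        return ClaudeClient
    return HFClient


print(router("gpt-4.1").__name__)        # OpenAIClient
print(router("my/checkpoint").__name__)  # HFClient
```

Keeping the lists in the provider modules and only summing them in `__init__.py` means adding a provider never requires editing the router's callers.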
