Commit cba92af

doc: improve instructions
1 parent ac87084 commit cba92af

4 files changed: +134 −45 lines changed

README.md

Lines changed: 93 additions & 36 deletions
@@ -2,7 +2,7 @@

 Evaluating Program Semantics Reasoning with Type Inference in System _F_

-## Setup
+## Development

 ### Python

@@ -29,17 +29,17 @@ and [impredicative polymorphism](https://ghc.gitlab.haskell.org/ghc/doc/users_gu
 so we require GHC version >= 9.2.1.
 Our evaluation used GHC-9.6.7.

-## Building TF-Bench From Scratch (Optional)
+## Building TF-Bench from scratch (optional)

-### TF-Bench
+### TF-Bench (base)

 This script will build the benchmark (Prelude with NL) from the raw data.

 ```sh
 uv run scripts/preprocess_benchmark.py -o tfb.json
 ```

-### TF-Bench_pure
+### TF-Bench (pure)

 ```sh
 git clone https://github.com/SecurityLab-UCD/alpharewrite.git
@@ -53,38 +53,52 @@ cd ..

 For details, please check out the README of [alpharewrite](https://github.com/SecurityLab-UCD/alpharewrite).

-## Download Pre-built Benchmark
+## Download pre-built benchmark

-You can also download our pre-built benchmark from [Zenodo](https://doi.org/10.5281/zenodo.14751813).
+You can also load TF-Bench via the HuggingFace `datasets` library.

-<a href="https://doi.org/10.5281/zenodo.14751813"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.14751813.svg" alt="DOI"></a>
+```python
+from datasets import load_dataset

-## Benchmarking!
+split = "pure"  # or "base"
+dataset = load_dataset("SecLabUCD/TF-Bench", split=split)
+```

-Please have your API key ready in `.env`.
-Please note that the `.env` in the repository is tracked by git,
-we recommend telling your git to ignore its changes by
+Or through our provided package:
+
+```python
+from tfbench import load_tfb_from_hf
+
+dataset = load_tfb_from_hf(split)
+```
+
+## Using as an application

 ```sh
-git update-index --assume-unchanged .env
+git clone https://github.com/SecurityLab-UCD/TF-Bench.git
+cd TF-Bench
+uv sync
 ```

-### GPT Models
+Please have your API key ready in `.env`.

-To run single model:
+### Proprietary models

-```sh
-export OPENAI_API_KEY=<OPENAI_API_KEY> # make sure your API key is in the environment
-uv run main.py -i TF-Bench.json -m gpt-3.5-turbo
+We use each provider's official SDK to access their models.
+You can check the pre-supported models in the `tfbench.lm` module.
+
+```python
+from tfbench.lm import supported_models
+print(supported_models)
 ```

-To run all GPT models:
+To run a single model (this runs both the `base` and `pure` splits):

 ```sh
-uv run run_all.py --option gpt
+uv run main.py -m gpt-5-2025-08-07
 ```

-### Open Source Models with Ollama
+### Open-weights models with Ollama

 We use [Ollama](https://ollama.com/) to manage and run the OSS models reported in the Appendix.
 We switched to vLLM for better performance and SDK design.
@@ -108,34 +122,77 @@ ollama version is 0.11.7
 Run the benchmark.

 ```sh
-uv run scripts/experiment_ollama.py -m llama3:8b
+uv run src/main.py -m llama3:8b
 ```

-### (WIP) Running Your Model with vLLM
+### Running any model on HuggingFace Hub
+
+We also support running any model on HuggingFace Hub out-of-the-box.
+We provide an example using Qwen3.
+
+```sh
+uv run src/main.py Qwen/Qwen3-4B-Instruct-2507 # or other models
+```

-#### OpenAI-Compatible Server
+Note that our `main.py` uses a pre-defined model router,
+which routes all unrecognized model names to HuggingFace.
+We use the `</think>` token to parse the thinking process;
+if your model does this differently, please see the next section.

-First, launch the vLLM OpenAI-Compatible Server (with default values, please check vLLM's doc for setting your own):
+### Running your own model
+
+To support your customized model,
+pass the path to your HuggingFace-compatible checkpoint to our `main.py`.

 ```sh
-uv run vllm serve openai/gpt-oss-120b --tensor-parallel-size 2 --async-scheduling
+uv run src/main.py <path to your checkpoint>
 ```

-Then, run the benchmark:
+## Using as a package
+
+Our package is also available on PyPI.

 ```sh
-uv run main.py -i Benchmark-F.json -m vllm_openai_chat_completion
+uv add tfbench
 ```

-NOTE: if you set your API key, host, and port when launching the vLLM server,
-please add them to the `.env` file as well.
-Please modify `.env` for your vLLM api-key, host, and port.
-If they are left empty, the default values ("", "localhost", "8000") will be used.
-We do not recommend using the default values on machine connect to the public web,
-as they are not secure.
+Or directly with pip:

+```sh
+pip install tfbench
 ```
-VLLM_API_KEY=
-VLLM_HOST=
-VLLM_PORT=
+
+### Proprietary model checkpoints that are not currently supported
+
+Our supported-model list is used to route the model name to the correct SDK.
+Even if a newly released model is not in our supported-models list,
+you can still use it by specifying the SDK client directly.
+We take OpenAI GPT-4.1 as an example here.
+
+```python
+from tfbench.lm import OpenAIResponses
+from tfbench import run_one_model
+
+model = "gpt-4.1"
+split = "pure"
+client = OpenAIResponses(model_name=model, pure=split == "pure", effort=None)
+eval_result = run_one_model(client, pure=split == "pure", effort=None)
+```
+
+### Support other customized models
+
+You may implement an `LM` instance.
+
+```python
+from tfbench.lm._types import LM, LMAnswer
+
+class YourLM(LM):
+    def __init__(self, model_name: str, pure: bool = False):
+        """Initialize your model."""
+        super().__init__(model_name=model_name, pure=pure)
+        ...
+
+    def _gen(self, prompt: str) -> LMAnswer:
+        """Your generation logic here."""
+        return LMAnswer(answer=content, reasoning_steps=thinking_content)
 ```

src/main.py

Lines changed: 11 additions & 2 deletions
@@ -5,6 +5,7 @@
 from orjsonl import orjsonl

 from tfbench import run_one_model, analysis_multi_runs, EvalResult
+from tfbench.lm import router


 def main(
@@ -13,17 +14,25 @@ def main(
     n_repeats: int = 3,
     log_file: str = "evaluation_log.jsonl",
 ):
-    """Main script to run experiments reported in the paper"""
+    """Ready-to-use evaluation script for a single model.
+
+    Args:
+        model (str): The model's name; please refer to `tfbench.lm.supported_models` for supported models.
+        effort (str | None, optional): The effort level to use for evaluation. Defaults to None.
+        n_repeats (int, optional): The number of times to repeat the evaluation. Defaults to 3.
+        log_file (str, optional): The file to log results to. Defaults to "evaluation_log.jsonl".
+    """

     def _run(pure: bool):
+        client = router(model, pure, effort)
         results: list[EvalResult] = []
         split = "pure" if pure else "base"
         result_dir = abspath(pjoin("results", model, split))
         for i in range(n_repeats):
             os.makedirs(result_dir, exist_ok=True)
             result_file = pjoin(result_dir, f"run-{i}.jsonl")
             r = run_one_model(
-                model,
+                client,
                 pure=pure,
                 output_file=result_file,
                 effort=effort,

src/tfbench/experiment.py

Lines changed: 4 additions & 5 deletions
@@ -7,20 +7,20 @@

 from .common import get_prompt
 from .evaluation import prover_evaluate, EvalResult
-from .lm import router, LMAnswer
+from .lm import router, LMAnswer, LM
 from .load import load_tfb_from_hf


 def run_one_model(
-    model: str,
+    client: LM,
     pure: bool = False,
     output_file: str | None = None,
     effort: str | None = None,
 ) -> EvalResult:
     """Running the generation & evaluation pipeline for one pre-supported model

     Args:
-        model (str): name of the model to evaluate
+        client (LM): an LM client wrapper exposing `generate`
         pure (bool, optional): To evaluate on the `pure` split or not. Defaults to False.
         output_file (str | None, optional): The file to save generation result. Defaults to None.
             Warning: If None, generation results will not be saved to disk.
@@ -30,11 +30,10 @@ def run_one_model(
     Returns:
         EvalResult: evaluation result including accuracy
     """
-    client = router(model, pure, effort)

     tasks = load_tfb_from_hf("pure" if pure else "base")
     gen_results: list[LMAnswer | None] = []
-    for task in tqdm(tasks, desc=model):
+    for task in tqdm(tasks, desc=client.model_name):
         prompt = get_prompt(task)

         response = client.generate(prompt)
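The refactor above is dependency injection: `run_one_model` now accepts any client object exposing `model_name` and `generate`, instead of routing by name internally, which is what lets the README's custom `LM` subclasses plug in. A minimal sketch of that seam, with stand-in names rather than the real tfbench API:

```python
class FakeClient:
    """Hypothetical client satisfying the injected interface."""

    model_name = "fake-lm"

    def generate(self, prompt: str) -> str:
        return "a -> a"


def run_one_model_sketch(client, tasks: list[str]) -> list[str]:
    # The real loop wraps tasks in tqdm(..., desc=client.model_name)
    # and evaluates each response; here we only collect generations.
    outputs = []
    for task in tasks:
        outputs.append(client.generate(task))
    return outputs


print(run_one_model_sketch(FakeClient(), ["id", "const"]))  # → ['a -> a', 'a -> a']
```

Any object with the same two members, a provider SDK wrapper or a test double, works unchanged.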

src/tfbench/lm/__init__.py

Lines changed: 26 additions & 2 deletions
@@ -2,14 +2,34 @@

 from .prompts import get_sys_prompt
 from .settings import MAX_TOKENS
-from ._openai import OpenAIChatCompletion, OpenAIResponses
-from ._google import GeminiChat, GeminiReasoning
+from ._openai import (
+    OAI_MODELS,
+    OAI_TTC_MODELS,
+    OAI_O5,
+    OpenAIChatCompletion,
+    OpenAIResponses,
+)
+from ._google import GEMINI_MODELS, GEMINI_TTC_MODELS, GeminiChat, GeminiReasoning
+from ._anthropic import CLAUDE_MODELS, CLAUDE_TTC_MODELS, ClaudeChat, ClaudeReasoning
+from ._ollama import OLLAMA_TTC_MODELS, OllamaChat
+from ._hf import HFChat
 from ._types import LM, LMAnswer
 from .utils import router, extract_response

 logging.getLogger("openai").setLevel(logging.ERROR)
 logging.getLogger("httpx").setLevel(logging.ERROR)

+supported_models = (
+    OAI_MODELS
+    + OAI_TTC_MODELS
+    + OAI_O5
+    + GEMINI_MODELS
+    + GEMINI_TTC_MODELS
+    + CLAUDE_MODELS
+    + CLAUDE_TTC_MODELS
+    + OLLAMA_TTC_MODELS
+)
+
 __all__ = [
     "get_sys_prompt",
     "MAX_TOKENS",
@@ -19,6 +39,10 @@
     "OpenAIResponses",
     "GeminiChat",
     "GeminiReasoning",
+    "ClaudeChat",
+    "ClaudeReasoning",
+    "HFChat",
+    "supported_models",
     "router",
     "extract_response",
 ]
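The new `supported_models` above is a concatenation of per-provider name lists, and the README notes that the router falls back to HuggingFace for anything unrecognized. A self-contained sketch of that membership-based routing, with hypothetical model names and a toy `route` function (the real lists and router live in `tfbench.lm`):

```python
# Hypothetical per-provider lists, mirroring how the package builds
# supported_models by concatenating them.
OAI_MODELS = ["gpt-4.1"]
GEMINI_MODELS = ["gemini-2.0-flash"]
CLAUDE_MODELS = ["claude-3-7-sonnet"]

supported_models = OAI_MODELS + GEMINI_MODELS + CLAUDE_MODELS


def route(model_name: str) -> str:
    # Membership checks decide which SDK client to construct;
    # unrecognized names fall through to HuggingFace.
    if model_name in OAI_MODELS:
        return "openai"
    if model_name in GEMINI_MODELS:
        return "google"
    if model_name in CLAUDE_MODELS:
        return "anthropic"
    return "huggingface"


print(route("gpt-4.1"), route("Qwen/Qwen3-4B-Instruct-2507"))  # → openai huggingface
```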
