Commit aeb237b

Authored by ActuallyEdward, Edward Wang, and gemini-code-assist[bot]
feat(example): add Terminal Bench training example (#1224)
* Add Terminal Bench training example
* Update terminal bench example configs
* Update examples/terminal_bench/command.sh

  Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* Fix terminal bench lint issues
* style: apply pre-commit fixes
* Pin terminal bench example dependencies
* chore: remove terminal bench example artifacts
* chore: update terminal bench config dataset paths
* chore: fix terminal bench npu dataset path

---------

Co-authored-by: Edward Wang <edwardwang@Edwards-MacBook-Pro.local>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Edward Wang <edwardwang@dhcp-128-189-195-4.ubcsecure.wireless.ubc.ca>
1 parent ae8c792 commit aeb237b

15 files changed

Lines changed: 1947 additions & 0 deletions

examples/terminal_bench/README.md

Lines changed: 318 additions & 0 deletions
@@ -0,0 +1,318 @@
# Terminal Agent Training with Terminal Bench 1.0

## Overview

This example demonstrates how to train terminal agents with AReaL's PPO/GRPO-style
training pipeline on Terminal Bench tasks.

It is an AReaL adaptation of the training workflow originally developed in
[SETA](https://github.com/camel-ai/seta), with the environment management and rollout
loop refactored into an AReaL example. In this example, we focus on an easy subset of
Terminal Bench 1.0 derived from the SETA conversion of Terminal Bench tasks.

[Terminal Bench](https://github.com/harbor-framework/terminal-bench) is a benchmark for
evaluating AI agents in real terminal environments. It provides a task dataset plus an
execution harness, where each task includes a natural-language instruction, a runnable
environment, and outcome-based verification. This example targets the Terminal Bench 1.0
style workflow used in SETA and trains on the easy subset prepared for that pipeline.

## Relation to SETA

This directory is not a copy of SETA. It is a conversion of the Terminal Bench training
path in SETA into AReaL's workflow abstraction and launcher model.

Compared with SETA, this example:

- is updated to work with the current AReaL stack (`v1.0.2`)
- supports single-controller mode through AReaL's `PPOTrainer`
- implements the rollout logic as an AReaL `RolloutWorkflow`
- packages the CAMEL-based terminal agent as an example-local agent module
- still creates and verifies Terminal Bench task environments through
  `terminal_bench`

## Code Architecture

- `train.py`: Entry point that loads config, builds the dataset, and launches AReaL
  training.
- `workflow/camel_rlvr_workflow.py`: Rollout workflow that builds task images, runs
  trajectories, collects rewards, and exports interactions.
- `workflow/pre_build_tasks_utils.py`: Helper for pre-building Terminal Bench task
  images before rollout.
- `agent/camel_terminal_agent.py`: CAMEL-based terminal agent wrapper used for each
  trajectory.
- `agent/chat_agent_trace.py`: Traced `ChatAgent` variant used by the agent.
- `agent/prompts.py`: Developer-agent prompt construction.
- `agent_rl_config.py`: Example-specific config extensions on top of AReaL `GRPOConfig`.

## Included Configurations

Two example configs are currently included:

| Config                    | Backend | Cluster Target        | Use Case                      |
| ------------------------- | ------- | --------------------- | ----------------------------- |
| `config_tb_sglang.yaml`   | SGLang  | single-node GPU setup | local or small-scale training |
| `config_tb_vllm_npu.yaml` | vLLM    | Ascend NPU setup      | NPU training                  |

## Running the Example

### Prerequisites

Please make sure AReaL itself is already installed and working.

You will need:

- Python `>=3.10`
- a working AReaL environment
- Docker CLI available inside the AReaL runtime
- Docker Compose and Buildx available as Docker CLI plugins
- the `terminal_bench` Python package

For NPU usage, you will also need:

- Ascend drivers and runtime
- access to the required `/dev/davinci*` devices
- `sglang[srt_npu]`, since this workflow currently depends on SGLang tool parsing even
  when using the vLLM-based config

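The general prerequisites above can be checked before a run. The following is a minimal sketch, not part of the example itself: the function names are hypothetical, and it only probes the Python version, whether `terminal_bench` is importable, and whether `docker compose version` and `docker buildx version` succeed. It does not check the AReaL installation or any NPU requirement.

```python
import importlib.util
import shutil
import subprocess
import sys


def check_python(min_version=(3, 10)):
    """Return True if the running interpreter meets the minimum version."""
    return sys.version_info >= min_version


def check_module(name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(name) is not None


def check_docker_subcommand(subcommand):
    """Return True if `docker <subcommand> version` exits with status 0."""
    if shutil.which("docker") is None:
        return False
    try:
        result = subprocess.run(
            ["docker", subcommand, "version"],
            capture_output=True,
            text=True,
            check=False,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def preflight():
    """Collect the prerequisite checks into a name -> bool mapping."""
    return {
        "python>=3.10": check_python(),
        "terminal_bench importable": check_module("terminal_bench"),
        "docker CLI on PATH": shutil.which("docker") is not None,
        "docker compose plugin": check_docker_subcommand("compose"),
        "docker buildx plugin": check_docker_subcommand("buildx"),
    }


if __name__ == "__main__":
    for name, ok in preflight().items():
        print(f"{'ok     ' if ok else 'MISSING'}  {name}")
```

Run it inside the AReaL runtime, since that is where the Docker CLI and plugins need to be visible.
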
### Recommended Runtime Model

This example is intended to run inside the AReaL runtime, with host Docker mounted into
that runtime container.

That structure is important: Terminal Bench task environments are launched via
`docker compose`, and the `docker compose` invocation needs to happen from the same
AReaL runtime that is performing rollout and evaluation.

The recommended setup is:

- run AReaL inside a runtime container
- mount the host Docker socket into that container
- mount the Docker CLI and Docker CLI plugins into that container
- run this example from inside that AReaL runtime container

Minimum mounts:

```bash
-v /var/run/docker.sock:/var/run/docker.sock
-v /usr/bin/docker:/usr/bin/docker:ro
-v /usr/libexec/docker/cli-plugins:/usr/libexec/docker/cli-plugins:ro
```

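To show where the minimum mounts fit, here is a small sketch that assembles a `docker run` argument list from them. The image name `areal-runtime:latest` and the helper `build_docker_run` are placeholders, not names from AReaL. A real launch will usually need further flags (devices, model mounts, proxies) that depend on your environment.

```python
import shlex

# Bind mounts copied from the "Minimum mounts" list; `:ro` marks read-only mounts.
MINIMUM_MOUNTS = [
    "/var/run/docker.sock:/var/run/docker.sock",
    "/usr/bin/docker:/usr/bin/docker:ro",
    "/usr/libexec/docker/cli-plugins:/usr/libexec/docker/cli-plugins:ro",
]


def build_docker_run(image, command, extra_mounts=()):
    """Return a `docker run` argument list that includes the minimum mounts."""
    args = ["docker", "run", "--rm", "-it"]
    for mount in [*MINIMUM_MOUNTS, *extra_mounts]:
        args += ["-v", mount]
    args.append(image)
    args += list(command)
    return args


if __name__ == "__main__":
    # "areal-runtime:latest" is a placeholder image name.
    print(shlex.join(build_docker_run("areal-runtime:latest", ["bash"])))
```

Building the command as a list keeps each mount a separate argument, so it can be passed to `subprocess.run` without shell quoting issues.
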
### Install Example Dependencies

From the AReaL repo root:

```bash
cd examples/terminal_bench
pip install -e .
```

This installs the example-scoped dependencies declared in
[`pyproject.toml`](./pyproject.toml):

- `ipython`
- `ruamel.yaml`
- `streamlit`
- `sqlalchemy`
- `docker`
- `camel_ai`
- `terminal_bench`

If you are using the NPU / vLLM path, also install the optional extra:

```bash
pip install -e ".[npu]"
```

If `terminal_bench` fails to install because of an upstream Python-version constraint
mismatch, which can happen on some NPU runtime images, install it from source and relax
its Python requirement to `>=3.11`:

```bash
git clone https://github.com/harbor-framework/terminal-bench.git
cd terminal-bench
```

Edit `pyproject.toml`:

```toml
requires-python = ">=3.11"
```

Then install it manually:

```bash
pip install --no-deps -e .
```

If you use this fallback path, you can install the rest of the example dependencies
separately. The `cd` below assumes `terminal-bench` was cloned next to the AReaL
checkout. These commands do not install `camel_ai`, so install it separately if it is
not already present in your environment:

```bash
cd ../AReaL/examples/terminal_bench
pip install --no-deps -e .
pip install ipython ruamel.yaml streamlit sqlalchemy docker
```

### Manual Dependency Path

If you already manage some dependencies separately, you can use the same manual setup
pattern used in SETA.

Install CAMEL and Terminal Bench from a SETA checkout:

```bash
git clone https://github.com/camel-ai/seta.git
cd seta
git submodule update --init --recursive

cd external/camel
pip install --no-deps -e .

cd ../terminal-bench
pip install --no-deps -e .
```

Then install the remaining example dependencies:

```bash
pip install ipython ruamel.yaml streamlit sqlalchemy docker
```

### Install SGLang for NPU

One working installation path from the original setup is:

```bash
git clone -b v0.5.6.post2 https://github.com/sgl-project/sglang.git
cd sglang
mv python/pyproject_other.toml python/pyproject.toml
# Quote the extras spec so shells such as zsh do not treat the brackets as a glob
pip install -e "python[srt_npu]" --no-deps
```

### Configure `tiktoken`

This example assumes `o200k_base.tiktoken` is cached locally.

```bash
export TIKTOKEN_CACHE_DIR=/tmp/tiktoken-cache
mkdir -p "$TIKTOKEN_CACHE_DIR"
# -k skips TLS certificate verification; drop it if your network does not require it
curl -k -o "$TIKTOKEN_CACHE_DIR/o200k_base.tiktoken" \
  https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken
```

If you need the hashed cache filename used by `tiktoken`, compute it with:

```bash
python3 - <<'PY'
import hashlib
url = "https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken"
print(hashlib.sha1(url.encode()).hexdigest())
PY
```

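If the hashed filename is needed, the downloaded file can be copied to that name inside the cache directory. The sketch below assumes, as the hash computation above implies, that `tiktoken` looks up cached encodings by the SHA-1 hex digest of the download URL. The helper names are hypothetical.

```python
import hashlib
import os
import shutil
from pathlib import Path

O200K_URL = "https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken"


def hashed_cache_path(cache_dir, url=O200K_URL):
    """Return the cache path named after the SHA-1 hex digest of the URL."""
    return Path(cache_dir) / hashlib.sha1(url.encode()).hexdigest()


def link_into_cache(downloaded_file, cache_dir):
    """Copy a downloaded encoding file to its hashed name inside cache_dir."""
    target = hashed_cache_path(cache_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():
        shutil.copyfile(downloaded_file, target)
    return target


if __name__ == "__main__":
    cache_dir = os.environ.get("TIKTOKEN_CACHE_DIR", "/tmp/tiktoken-cache")
    source = Path(cache_dir) / "o200k_base.tiktoken"
    if source.exists():
        print(link_into_cache(source, cache_dir))
    else:
        print(f"{source} not found; download it first")
```

The copy leaves the original `o200k_base.tiktoken` in place, so both the readable name and the hashed name are available.
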
### Prepare the Dataset

This example does not work with the parquet file alone. The parquet rows point to task
assets that must also exist under `AReaL/dataset/`.

You should prepare the converted Terminal Bench dataset from either of these sources:

- SETA: https://github.com/camel-ai/seta
- terminal-bench-seta: https://github.com/ActuallyEdward/terminal-bench-seta

For this example, those two sources should be treated as equivalent dataset sources.

The configs in this directory expect the easy-subset parquet to be available at:

```bash
AReaL/dataset/tbench-tasks_convert/tbench-selected-tasks-easy.parquet
```

and they also expect the referenced task files and directories from the same converted
dataset to be present under `AReaL/dataset/`.

One workable setup is shown below. Both command blocks use paths relative to the
directory that contains the `AReaL` checkout:

```bash
# Clone inside a subshell so the working directory is unchanged afterwards
(cd AReaL/dataset && git clone https://github.com/ActuallyEdward/terminal-bench-seta.git)
```

That checkout provides the `train_filtered_easy.parquet` file. Place or link it at the
path expected by the configs:

```bash
mkdir -p AReaL/dataset/tbench-tasks_convert
cp AReaL/dataset/terminal-bench-seta/train_filtered_easy.parquet \
  AReaL/dataset/tbench-tasks_convert/tbench-selected-tasks-easy.parquet
```

If you source the data from SETA instead, use the same converted dataset layout and
place the parquet and referenced task assets under `AReaL/dataset/` in the same way.

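A quick layout check can catch a missing parquet or an empty dataset directory before training starts. This is a rough sketch with a hypothetical helper name. It confirms only that the parquet sits at the expected relative path and that at least one other file exists under the dataset root. It does not read the parquet or resolve the asset paths stored in its rows, because the column names are not described here.

```python
from pathlib import Path

# Relative location of the easy-subset parquet inside the dataset root.
EXPECTED_PARQUET = Path("tbench-tasks_convert") / "tbench-selected-tasks-easy.parquet"


def check_dataset_layout(dataset_root):
    """Return a list of human-readable problems; an empty list means the layout looks plausible."""
    root = Path(dataset_root)
    if not root.is_dir():
        return [f"dataset root not found: {root}"]

    problems = []
    parquet = root / EXPECTED_PARQUET
    if not parquet.is_file():
        problems.append(f"missing parquet: {parquet}")

    # Heuristic only: the parquet alone is not enough, so expect other files too.
    has_other_files = any(p.is_file() and p != parquet for p in root.rglob("*"))
    if not has_other_files:
        problems.append(f"no task assets found under: {root}")
    return problems


if __name__ == "__main__":
    for problem in check_dataset_layout("AReaL/dataset") or ["layout looks plausible"]:
        print(problem)
```
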
### Docker Compose / Buildx

Docker Compose and Buildx should be available inside the AReaL runtime at:

```bash
/usr/libexec/docker/cli-plugins/
```

If needed:

```bash
chmod +x /usr/libexec/docker/cli-plugins/docker-compose
chmod +x /usr/libexec/docker/cli-plugins/docker-buildx
```

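Whether the `chmod` step is needed can be determined by inspecting the plugin files first. The sketch below uses a hypothetical helper, `plugin_status`, to report each plugin as `ok`, `not executable`, or `missing` for the directory named above.

```python
import os
from pathlib import Path

PLUGIN_DIR = Path("/usr/libexec/docker/cli-plugins")
PLUGINS = ("docker-compose", "docker-buildx")


def plugin_status(plugin_dir=PLUGIN_DIR, plugins=PLUGINS):
    """Map each plugin name to 'ok', 'not executable', or 'missing'."""
    status = {}
    for name in plugins:
        path = Path(plugin_dir) / name
        if not path.is_file():
            status[name] = "missing"
        elif not os.access(path, os.X_OK):
            status[name] = "not executable"
        else:
            status[name] = "ok"
    return status


if __name__ == "__main__":
    for name, state in plugin_status().items():
        print(f"{name}: {state}")
```
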
### Training Commands

The following commands are intended to be executed from the AReaL repo root.

#### SGLang

```bash
python3 examples/terminal_bench/train.py \
  --config examples/terminal_bench/config_tb_sglang.yaml
```

#### vLLM on NPU

```bash
python3 examples/terminal_bench/train.py \
  --config examples/terminal_bench/config_tb_vllm_npu.yaml
```

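When the launch is scripted, the two commands differ only in the config path. A small sketch of that selection follows. The backend labels `sglang` and `vllm_npu` and the helper `build_train_command` are illustrative names, not options defined by AReaL.

```python
import subprocess

# Config paths from the training commands above, keyed by an illustrative label.
CONFIGS = {
    "sglang": "examples/terminal_bench/config_tb_sglang.yaml",
    "vllm_npu": "examples/terminal_bench/config_tb_vllm_npu.yaml",
}


def build_train_command(backend):
    """Return the training command for a backend label as an argument list."""
    if backend not in CONFIGS:
        raise ValueError(f"unknown backend {backend!r}; expected one of {sorted(CONFIGS)}")
    return ["python3", "examples/terminal_bench/train.py", "--config", CONFIGS[backend]]


def launch(backend):
    """Run training from the current directory, which should be the AReaL repo root."""
    return subprocess.run(build_train_command(backend), check=True)
```

Calling `launch("sglang")` from the AReaL repo root runs the same command as the SGLang block above.
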
## Results

The following figure shows a representative training reward curve on the easy subset
derived from SETA:

<p align="left">
<img src="reward.png" width="500">
</p>

On this setup, we observe reward-curve behavior qualitatively similar to the GRPO
training trends reported in
[terminal-bench-rl](https://github.com/Danau5tin/terminal-bench-rl). This is a
directional comparison of training dynamics rather than a claim of identical setup,
identical scale, or identical leaderboard numbers.

## Notes

1. This example currently targets the easy subset used in the SETA conversion, not the
   full Terminal Bench task distribution.
1. `pyproject.toml` in this directory is intentionally example-scoped. It does not
   replace installing AReaL itself.
1. Docker, proxy, model-mount, and NPU device details are environment-specific and
   should be adapted locally.

## References

- SETA: https://github.com/camel-ai/seta
- Terminal Bench: https://github.com/harbor-framework/terminal-bench
- Terminal-Bench-RL: https://github.com/Danau5tin/terminal-bench-rl

examples/terminal_bench/__init__.py

Whitespace-only changes.
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
from .camel_terminal_agent import CamelTerminalAgent
from .chat_agent_trace import ChatAgentTrace
from .prompts import get_developer_agent_prompt

__all__ = [
    "CamelTerminalAgent",
    "ChatAgentTrace",
    "get_developer_agent_prompt",
]
