|
| 1 | +# Terminal Agent Training with Terminal Bench 1.0 |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +This example demonstrates how to train terminal agents with AReaL's PPO/GRPO-style |
| 6 | +training pipeline on Terminal Bench tasks. |
| 7 | + |
| 8 | +It is an AReaL adaptation of the training workflow originally developed in |
| 9 | +[SETA](https://github.com/camel-ai/seta), with the environment management and rollout |
| 10 | +loop refactored into an AReaL example. In this example, we focus on an easy subset of |
| 11 | +Terminal Bench 1.0 derived from the SETA conversion of Terminal Bench tasks. |
| 12 | + |
| 13 | +[Terminal Bench](https://github.com/harbor-framework/terminal-bench) is a benchmark for |
| 14 | +evaluating AI agents in real terminal environments. It provides a task dataset plus an |
| 15 | +execution harness, where each task includes a natural language instruction, a runnable |
| 16 | +environment, and outcome-based verification. This example targets the Terminal Bench |
| 17 | +1.0-style workflow used in SETA and trains on the easy subset prepared for that pipeline. |
| 18 | + |
| 19 | +## Relation to SETA |
| 20 | + |
| 21 | +This directory is not a copy of SETA. It is a conversion of the Terminal Bench training |
| 22 | +path in SETA into AReaL's workflow abstraction and launcher model. |
| 23 | + |
| 24 | +Compared with SETA, this example: |
| 25 | + |
| 26 | +- works with the current AReaL stack (`v1.0.2`) |
| 27 | +- supports single-controller mode through AReaL's `PPOTrainer` |
| 28 | +- implements the rollout logic as an AReaL `RolloutWorkflow` |
| 29 | +- packages the CAMEL-based terminal agent as an example-local agent module |
| 30 | +- still creates and verifies Terminal Bench task environments through |
| 31 | + `terminal_bench` |
| 32 | + |
| 33 | +## Code Architecture |
| 34 | + |
| 35 | +- `train.py`: Entry point that loads config, builds the dataset, and launches AReaL |
| 36 | + training. |
| 37 | +- `workflow/camel_rlvr_workflow.py`: Rollout workflow that builds task images, runs |
| 38 | + trajectories, collects rewards, and exports interactions. |
| 39 | +- `workflow/pre_build_tasks_utils.py`: Helper for pre-building Terminal Bench task |
| 40 | + images before rollout. |
| 41 | +- `agent/camel_terminal_agent.py`: CAMEL-based terminal agent wrapper used for each |
| 42 | + trajectory. |
| 43 | +- `agent/chat_agent_trace.py`: Traced `ChatAgent` variant used by the agent. |
| 44 | +- `agent/prompts.py`: Developer-agent prompt construction. |
| 45 | +- `agent_rl_config.py`: Example-specific config extensions on top of AReaL `GRPOConfig`. |
| 46 | + |
| 47 | +## Included Configurations |
| 48 | + |
| 49 | +Two example configs are currently included: |
| 50 | + |
| 51 | +| Config | Backend | Cluster Target | Use Case | |
| 52 | +| ------------------------- | ------- | --------------------- | ----------------------------- | |
| 53 | +| `config_tb_sglang.yaml` | SGLang | single-node GPU setup | local or small-scale training | |
| 54 | +| `config_tb_vllm_npu.yaml` | vLLM | Ascend NPU setup | NPU training | |
| 55 | + |
| 56 | +## Running the Example |
| 57 | + |
| 58 | +### Prerequisites |
| 59 | + |
| 60 | +Please make sure AReaL itself is already installed and working. |
| 61 | + |
| 62 | +You will need: |
| 63 | + |
| 64 | +- Python `>=3.10` |
| 65 | +- a working AReaL environment |
| 66 | +- Docker CLI available inside the AReaL runtime |
| 67 | +- Docker Compose and Buildx available as Docker CLI plugins |
| 68 | +- the `terminal_bench` Python package |
| 69 | + |
| 70 | +For NPU usage, you will also need: |
| 71 | + |
| 72 | +- Ascend drivers and runtime |
| 73 | +- access to the required `/dev/davinci*` devices |
| 74 | +- `sglang[srt_npu]`, since this workflow currently depends on SGLang tool parsing even |
| 75 | + when using the vLLM-based config |
| 76 | + |
| 77 | +### Recommended Runtime Model |
| 78 | + |
| 79 | +This example is intended to run inside the AReaL runtime, with host Docker mounted into |
| 80 | +that runtime container. |
| 81 | + |
| 82 | +This structure matters because Terminal Bench task environments are launched via |
| 83 | +`docker compose`, and that invocation must happen from the same AReaL runtime that |
| 84 | +performs rollout and evaluation. |
| 85 | + |
| 86 | +The recommended setup is: |
| 87 | + |
| 88 | +- run AReaL inside a runtime container |
| 89 | +- mount the host Docker socket into that container |
| 90 | +- mount the Docker CLI and Docker CLI plugins into that container |
| 91 | +- run this example from inside that AReaL runtime container |
| 92 | + |
| 93 | +Minimum mounts: |
| 94 | + |
| 95 | +```bash |
| 96 | +-v /var/run/docker.sock:/var/run/docker.sock |
| 97 | +-v /usr/bin/docker:/usr/bin/docker:ro |
| 98 | +-v /usr/libexec/docker/cli-plugins:/usr/libexec/docker/cli-plugins:ro |
| 99 | +``` |
| 100 | + |
| 101 | +### Install Example Dependencies |
| 102 | + |
| 103 | +From the AReaL repo root: |
| 104 | + |
| 105 | +```bash |
| 106 | +cd examples/terminal_bench |
| 107 | +pip install -e . |
| 108 | +``` |
| 109 | + |
| 110 | +This installs the example-scoped dependencies declared in |
| 111 | +[`pyproject.toml`](./pyproject.toml): |
| 112 | + |
| 113 | +- `ipython` |
| 114 | +- `ruamel.yaml` |
| 115 | +- `streamlit` |
| 116 | +- `sqlalchemy` |
| 117 | +- `docker` |
| 118 | +- `camel_ai` |
| 119 | +- `terminal_bench` |
| 120 | + |
| 121 | +If you are using the NPU / vLLM path, also install the optional extra: |
| 122 | + |
| 123 | +```bash |
| 124 | +pip install -e ".[npu]" |
| 125 | +``` |
| 126 | + |
| 127 | +If `terminal_bench` fails to install because of an upstream Python-version constraint |
| 128 | +mismatch, which can happen on some NPU runtime images, install it from source and relax |
| 129 | +its Python requirement to `>=3.11`: |
| 130 | + |
| 131 | +```bash |
| 132 | +git clone https://github.com/harbor-framework/terminal-bench.git |
| 133 | +cd terminal-bench |
| 134 | +``` |
| 135 | + |
| 136 | +Edit `pyproject.toml`: |
| 137 | + |
| 138 | +```toml |
| 139 | +requires-python = ">=3.11" |
| 140 | +``` |
| 141 | + |
| 142 | +Then install it manually: |
| 143 | + |
| 144 | +```bash |
| 145 | +pip install --no-deps -e . |
| 146 | +``` |
| 147 | + |
| 148 | +If you use this fallback path, you can install the rest of the example dependencies |
| 149 | +separately: |
| 150 | + |
| 151 | +```bash |
| 152 | +cd ../AReaL/examples/terminal_bench |
| 153 | +pip install --no-deps -e . |
| 154 | +pip install ipython ruamel.yaml streamlit sqlalchemy docker camel_ai |
| 155 | +``` |
| 156 | + |
| 157 | +### Manual Dependency Path |
| 158 | + |
| 159 | +If you already manage some dependencies separately, you can use the same manual setup |
| 160 | +pattern used in SETA. |
| 161 | + |
| 162 | +Install CAMEL and Terminal Bench from a SETA checkout: |
| 163 | + |
| 164 | +```bash |
| 165 | +git clone https://github.com/camel-ai/seta.git |
| 166 | +cd seta |
| 167 | +git submodule update --init --recursive |
| 168 | + |
| 169 | +cd external/camel |
| 170 | +pip install --no-deps -e . |
| 171 | + |
| 172 | +cd ../terminal-bench |
| 173 | +pip install --no-deps -e . |
| 174 | +``` |
| 175 | + |
| 176 | +Then install the remaining example dependencies: |
| 177 | + |
| 178 | +```bash |
| 179 | +pip install ipython ruamel.yaml streamlit sqlalchemy docker |
| 180 | +``` |
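
After either installation path, it can help to confirm that the listed dependencies are actually importable before launching training. The following is a minimal sketch, not part of the example itself. The import names in `REQUIRED_MODULES` are assumptions (a distribution's import name can differ from its pip name), and `is_importable` and `missing_distributions` are hypothetical helper names.

```python
import importlib.util
import sys

# Assumed import names for the distributions listed above. The import name can
# differ from the pip name (for example, `ipython` is imported as `IPython`).
REQUIRED_MODULES = {
    "ipython": "IPython",
    "ruamel.yaml": "ruamel.yaml",
    "streamlit": "streamlit",
    "sqlalchemy": "sqlalchemy",
    "docker": "docker",
    "camel_ai": "camel",
    "terminal_bench": "terminal_bench",
}


def is_importable(module_name: str) -> bool:
    """Return True if `module_name` can be located on the current Python path.

    For dotted names, parent packages may be imported as a side effect.
    """
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # A missing parent package (e.g. `ruamel`) raises instead of returning None.
        return False


def missing_distributions(required: dict = REQUIRED_MODULES) -> list:
    """Return the pip names whose assumed import name cannot be found."""
    return [dist for dist, module in required.items() if not is_importable(module)]


if __name__ == "__main__":
    if sys.version_info < (3, 10):
        print(f"Python >=3.10 expected, found {sys.version.split()[0]}")
    for dist in missing_distributions():
        print(f"missing: {dist}")
```

An empty output means every assumed module was found. It does not check versions or the optional NPU extra.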
| 181 | + |
| 182 | +### Install SGLang for NPU |
| 183 | + |
| 184 | +One working installation path from the original setup is: |
| 185 | + |
| 186 | +```bash |
| 187 | +git clone -b v0.5.6.post2 https://github.com/sgl-project/sglang.git |
| 188 | +cd sglang |
| 189 | +mv python/pyproject_other.toml python/pyproject.toml |
| 190 | +pip install -e "python[srt_npu]" --no-deps |
| 191 | +``` |
| 192 | + |
| 193 | +### Configure `tiktoken` |
| 194 | + |
| 195 | +This example assumes `o200k_base.tiktoken` is cached locally. |
| 196 | + |
| 197 | +```bash |
| 198 | +export TIKTOKEN_CACHE_DIR=/tmp/tiktoken-cache |
| 199 | +mkdir -p "$TIKTOKEN_CACHE_DIR" |
| 200 | +curl -k -o "$TIKTOKEN_CACHE_DIR/o200k_base.tiktoken" \ |
| 201 | + https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken |
| 202 | +``` |
| 203 | + |
| 204 | +`tiktoken` names cached files by the SHA-1 hash of their URL; compute that name with: |
| 205 | + |
| 206 | +```bash |
| 207 | +python3 - <<'PY' |
| 208 | +import hashlib |
| 209 | +url = "https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken" |
| 210 | +print(hashlib.sha1(url.encode()).hexdigest()) |
| 211 | +PY |
| 212 | +``` |
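
If the downloaded file needs to sit in the cache under its hashed name, the two steps above can be combined. This is a minimal sketch under the assumption that the cache lookup uses the SHA-1 hex digest of the URL as the filename inside `TIKTOKEN_CACHE_DIR`, as the snippet above suggests. `cache_filename` and `copy_into_cache` are hypothetical helper names, not part of `tiktoken`.

```python
import hashlib
import shutil
from pathlib import Path

O200K_URL = "https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken"


def cache_filename(url: str) -> str:
    """Hashed filename for `url`: the SHA-1 hex digest of the URL string."""
    return hashlib.sha1(url.encode()).hexdigest()


def copy_into_cache(downloaded: Path, cache_dir: Path, url: str = O200K_URL) -> Path:
    """Copy `downloaded` to `cache_dir/<sha1(url)>` and return the new path."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / cache_filename(url)
    if not target.exists():
        # Keep the original download in place; only add the hashed copy.
        shutil.copyfile(downloaded, target)
    return target
```

For example, `copy_into_cache(Path("/tmp/tiktoken-cache/o200k_base.tiktoken"), Path("/tmp/tiktoken-cache"))` would leave both the plain and the hashed filename in the cache directory used above.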
| 213 | + |
| 214 | +### Prepare the Dataset |
| 215 | + |
| 216 | +This example does not work with the parquet file alone. The parquet rows point to task |
| 217 | +assets that must also exist under `AReaL/dataset/`. |
| 218 | + |
| 219 | +You should prepare the converted Terminal Bench dataset from either of these sources: |
| 220 | + |
| 221 | +- SETA: https://github.com/camel-ai/seta |
| 222 | +- terminal-bench-seta: https://github.com/ActuallyEdward/terminal-bench-seta |
| 223 | + |
| 224 | +For this example, treat these two sources as equivalent. |
| 225 | + |
| 226 | +The configs in this directory expect the easy-subset parquet to be available at: |
| 227 | + |
| 228 | +```bash |
| 229 | +AReaL/dataset/tbench-tasks_convert/tbench-selected-tasks-easy.parquet |
| 230 | +``` |
| 231 | + |
| 232 | +and they also expect the referenced task files and directories from the same converted |
| 233 | +dataset to be present under `AReaL/dataset/`. |
| 234 | + |
| 235 | +One workable setup is: |
| 236 | + |
| 237 | +```bash |
| 238 | +cd AReaL/dataset |
| 239 | +git clone https://github.com/ActuallyEdward/terminal-bench-seta.git |
| 240 | +``` |
| 241 | + |
| 242 | +The [`terminal-bench-seta`](https://github.com/ActuallyEdward/terminal-bench-seta) |
| 243 | +checkout provides the easy-subset parquet as `train_filtered_easy.parquet`. |
| 244 | + |
| 245 | +Copy or link that file to the path expected by the configs (run from the directory |
| 246 | +that contains `AReaL/`): |
| 247 | + |
| 248 | +```bash |
| 249 | +mkdir -p AReaL/dataset/tbench-tasks_convert |
| 250 | +cp AReaL/dataset/terminal-bench-seta/train_filtered_easy.parquet \ |
| 251 | + AReaL/dataset/tbench-tasks_convert/tbench-selected-tasks-easy.parquet |
| 252 | +``` |
| 253 | + |
| 254 | +If you source the data from SETA instead, use the same converted dataset layout and |
| 255 | +place the parquet and referenced task assets under `AReaL/dataset/` in the same way. |
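
Because a missing parquet file only surfaces once training starts, a quick layout check beforehand can save a failed launch. The sketch below only verifies the parquet location described above. It does not inspect the parquet rows or the task assets they reference, since the column layout is not described here. `check_dataset_layout` is a hypothetical helper name.

```python
from pathlib import Path

# Relative location the configs are described as expecting, under the AReaL root.
EXPECTED_PARQUET = Path("dataset/tbench-tasks_convert/tbench-selected-tasks-easy.parquet")


def check_dataset_layout(areal_root: Path) -> list:
    """Return human-readable problems; an empty list means the basic layout is present."""
    problems = []
    dataset_dir = areal_root / "dataset"
    if not dataset_dir.is_dir():
        problems.append(f"missing directory: {dataset_dir}")
    parquet = areal_root / EXPECTED_PARQUET
    if not parquet.is_file():
        problems.append(f"missing parquet: {parquet}")
    elif parquet.stat().st_size == 0:
        # A zero-byte file usually means an interrupted copy or a broken link target.
        problems.append(f"empty parquet: {parquet}")
    return problems
```

Calling `check_dataset_layout(Path("AReaL"))` from the directory that contains the checkout returns an empty list when the parquet is in place.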
| 256 | + |
| 257 | +### Docker Compose / Buildx |
| 258 | + |
| 259 | +Docker Compose and Buildx should be available inside the AReaL runtime at: |
| 260 | + |
| 261 | +```bash |
| 262 | +/usr/libexec/docker/cli-plugins/ |
| 263 | +``` |
| 264 | + |
| 265 | +If needed: |
| 266 | + |
| 267 | +```bash |
| 268 | +chmod +x /usr/libexec/docker/cli-plugins/docker-compose |
| 269 | +chmod +x /usr/libexec/docker/cli-plugins/docker-buildx |
| 270 | +``` |
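
The Docker requirements above can be checked from inside the runtime before any rollout starts. This is a minimal sketch that only tests for the socket, CLI binary, and plugin files at the paths named in this README. It does not contact the Docker daemon. `docker_preflight` is a hypothetical helper name, and the paths are parameters so they can be adapted to other layouts.

```python
import os
from pathlib import Path

DEFAULT_SOCKET = Path("/var/run/docker.sock")
DEFAULT_CLI = Path("/usr/bin/docker")
DEFAULT_PLUGIN_DIR = Path("/usr/libexec/docker/cli-plugins")
PLUGINS = ("docker-compose", "docker-buildx")


def docker_preflight(
    socket: Path = DEFAULT_SOCKET,
    cli: Path = DEFAULT_CLI,
    plugin_dir: Path = DEFAULT_PLUGIN_DIR,
) -> list:
    """Return a list of problems found; an empty list means all paths look usable."""
    problems = []
    if not socket.exists():
        problems.append(f"Docker socket not mounted: {socket}")
    if not (cli.is_file() and os.access(cli, os.X_OK)):
        problems.append(f"Docker CLI missing or not executable: {cli}")
    for name in PLUGINS:
        plugin = plugin_dir / name
        if not plugin.is_file():
            problems.append(f"plugin missing: {plugin}")
        elif not os.access(plugin, os.X_OK):
            problems.append(f"plugin not executable (try chmod +x): {plugin}")
    return problems


if __name__ == "__main__":
    for problem in docker_preflight():
        print(problem)
```

Passing explicit paths keeps the check usable on hosts where the CLI plugins live elsewhere.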
| 271 | + |
| 272 | +### Training Commands |
| 273 | + |
| 274 | +The following commands are intended to be executed from the AReaL repo root. |
| 275 | + |
| 276 | +#### SGLang |
| 277 | + |
| 278 | +```bash |
| 279 | +python3 examples/terminal_bench/train.py \ |
| 280 | + --config examples/terminal_bench/config_tb_sglang.yaml |
| 281 | +``` |
| 282 | + |
| 283 | +#### vLLM on NPU |
| 284 | + |
| 285 | +```bash |
| 286 | +python3 examples/terminal_bench/train.py \ |
| 287 | + --config examples/terminal_bench/config_tb_vllm_npu.yaml |
| 288 | +``` |
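
When the two commands above are driven from a script, the backend-to-config choice can be kept in one place. The sketch below only rebuilds the command lines shown in this section. The `sglang` and `vllm_npu` keys and the `training_command` helper are hypothetical labels introduced for illustration.

```python
# Config paths as given in the training commands above, relative to the AReaL repo root.
CONFIGS = {
    "sglang": "examples/terminal_bench/config_tb_sglang.yaml",
    "vllm_npu": "examples/terminal_bench/config_tb_vllm_npu.yaml",
}


def training_command(backend: str, python: str = "python3") -> list:
    """Build the argv list for one of the included configs."""
    try:
        config = CONFIGS[backend]
    except KeyError:
        raise ValueError(f"unknown backend {backend!r}; expected one of {sorted(CONFIGS)}") from None
    return [python, "examples/terminal_bench/train.py", "--config", config]
```

From the repo root, the result can be passed to `subprocess.run(training_command("sglang"), check=True)`.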
| 289 | + |
| 290 | +## Results |
| 291 | + |
| 292 | +The following figure shows a representative training reward curve on the easy subset |
| 293 | +derived from SETA: |
| 294 | + |
| 295 | +<p align="left"> |
| 296 | + <img src="reward.png" width="500"> |
| 297 | +</p> |
| 298 | + |
| 299 | +On this setup, we observe reward-curve behavior qualitatively similar to the GRPO |
| 300 | +training trends reported in |
| 301 | +[terminal-bench-rl](https://github.com/Danau5tin/terminal-bench-rl). This is a |
| 302 | +directional comparison of training dynamics rather than a claim of identical setup, |
| 303 | +identical scale, or identical leaderboard numbers. |
| 304 | + |
| 305 | +## Notes |
| 306 | + |
| 307 | +1. This example currently targets the easy subset used in the SETA conversion, not the |
| 308 | + full Terminal Bench task distribution. |
| 309 | +1. `pyproject.toml` in this directory is intentionally example-scoped. It does not |
| 310 | + replace installing AReaL itself. |
| 311 | +1. Docker, proxy, model-mount, and NPU device details are environment-specific and |
| 312 | + should be adapted locally. |
| 313 | + |
| 314 | +## References |
| 315 | + |
| 316 | +- SETA: https://github.com/camel-ai/seta |
| 317 | +- Terminal Bench: https://github.com/harbor-framework/terminal-bench |
| 318 | +- Terminal-Bench-RL: https://github.com/Danau5tin/terminal-bench-rl |