feat(example): add Terminal Bench training example #1224

Merged
garrett4wade merged 11 commits into inclusionAI:main from ActuallyEdward:edward/terminal-bench-example
Apr 24, 2026

Conversation

@ActuallyEdward
Contributor

Description

This PR adds a new examples/terminal_bench example for training terminal agents with AReaL on Terminal Bench 1.0 tasks.

The example is an AReaL adaptation of the Terminal Bench training workflow from SETA, targeting an easy subset from the converted SETA dataset. It includes a full training entrypoint, rollout workflow, CAMEL-based terminal agent, example configs for SGLang and vLLM-on-NPU, example-scoped dependency metadata, a reward figure, and a README covering setup, runtime assumptions, dataset preparation, and training commands.

Two points matter for users of this example. First, the workflow is intended to run inside the AReaL runtime with the host Docker socket mounted in, because Terminal Bench task environments are launched through docker compose from inside the rollout runtime. Second, the example depends on the converted dataset layout under AReaL/dataset, sourced from either SETA or terminal-bench-seta; the bundled parquet is only a convenience copy and is not usable without the task assets it references.

Related Issue

NA

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📝 Documentation update
  • ♻️ Refactoring
  • ⚡ Performance improvement
  • ✅ Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • Pre-commit hooks pass (pre-commit run --all-files)
  • Relevant tests pass; new tests added for new functionality
  • Documentation updated (if applicable; built with ./docs/build_all.sh)
  • Branch is up to date with main
  • Self-reviewed via /review-pr command
  • This PR was created by a coding agent via /create-pr
  • This PR is a breaking change

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request implements a training pipeline for terminal agents using AReaL and Terminal Bench, featuring a CAMEL-based agent and specialized rollout workflows. Key feedback includes fixing variable substitution syntax in the YAML configurations, resolving code duplication in the tracing agent, and correcting markdown formatting in the prompts. Additionally, recommendations were made to improve security by removing insecure curl flags, enhance maintainability by avoiding hardcoded paths and environment modifications, and follow best practices regarding logging and dependency pinning.

Comment thread examples/terminal_bench/config_tb_sglang.yaml Outdated
Comment thread examples/terminal_bench/config_tb_vllm_npu.yaml Outdated
Comment thread examples/terminal_bench/agent/chat_agent_trace.py
Comment thread examples/terminal_bench/pyproject.toml Outdated
Comment thread examples/terminal_bench/command.sh Outdated
from .prompts import get_developer_agent_prompt


DATASET_ROOT = Path(__file__).resolve().parents[3] / "dataset"

medium

Hardcoding the dataset root path relative to the current file (Path(__file__).resolve().parents[3] / "dataset") is fragile. If the directory structure changes, this will break. It's better to make this path configurable, for instance by passing it through the agent's configuration or as an environment variable. This improves maintainability and makes the example more robust.
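
One possible shape for the environment-variable option, as a minimal sketch only. The variable name `TERMINAL_BENCH_DATASET_ROOT` and the helper `resolve_dataset_root` are hypothetical and are not part of this PR; the fallback argument stands in for whatever default the example chooses to keep.

```python
import os
from pathlib import Path
from typing import Optional

# Hypothetical environment variable name; not defined by this PR.
DATASET_ROOT_ENV = "TERMINAL_BENCH_DATASET_ROOT"


def resolve_dataset_root(default: Optional[Path] = None) -> Path:
    """Return the dataset root, preferring an explicit environment override."""
    override = os.environ.get(DATASET_ROOT_ENV)
    if override:
        # An explicit override wins over any location derived from the code layout.
        return Path(override).expanduser().resolve()
    if default is not None:
        return default
    raise RuntimeError(
        f"Dataset root is not configured; set {DATASET_ROOT_ENV} or pass a default."
    )
```

A caller could keep the current behaviour as the fallback, for example `resolve_dataset_root(Path(__file__).resolve().parents[3] / "dataset")`, so existing checkouts keep working while other layouts can override the location.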

Comment on lines +35 to +38
Path(__file__).parent.parent.parent
/ "dataset"
/ config.train_dataset.path
)

medium

Constructing the dataset path relative to the current file's location (Path(__file__).parent.parent.parent) makes the script fragile and dependent on a specific directory structure. A more robust approach would be to define a root directory in the configuration and construct paths relative to that, or expect absolute paths. This would make the example easier to adapt to different environments.
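
A rough sketch of the "configured root or absolute path" approach follows. The function name `resolve_dataset_path` and its `root` parameter are illustrative assumptions; how the root would actually be exposed in the AReaL config is not specified here.

```python
from pathlib import Path
from typing import Optional


def resolve_dataset_path(path: str, root: Optional[str] = None) -> Path:
    """Resolve a dataset path without relying on the location of this file."""
    candidate = Path(path).expanduser()
    if candidate.is_absolute():
        # Absolute paths are used as given.
        return candidate
    if root is None:
        raise ValueError(
            f"Relative dataset path {path!r} needs a configured root directory."
        )
    # Relative paths are interpreted against the configured root.
    return Path(root).expanduser() / candidate
```

With this shape, a relative value such as the one in `config.train_dataset.path` would be joined to the configured root, while an absolute value would pass through unchanged.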

timeout=self.task_timeouts._reset_env + 60.0,
)
except asyncio.TimeoutError:
print(f"Timeout while building docker image for task {data.get('task_name')}")

medium

Using print for logging can make it difficult to manage log levels, format messages consistently, and redirect output. It's better to use the logging module. This allows for more flexible and powerful logging, which is especially important in a complex workflow like this.
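
As an illustration of the suggestion, the timeout branch could log through a module-level logger instead of printing. This is a sketch under assumptions: the logger name and the wrapper `build_with_timeout` are hypothetical, and the awaited coroutine stands in for the actual image build step.

```python
import asyncio
import logging

# Hypothetical logger name; the example may prefer its own logging setup.
logger = logging.getLogger("terminal_bench_example")


async def build_with_timeout(coro, task_name: str, timeout: float) -> bool:
    """Await a build coroutine and log a warning if it times out."""
    try:
        await asyncio.wait_for(coro, timeout=timeout)
        return True
    except asyncio.TimeoutError:
        # Lazy %-style formatting lets the logging module skip string
        # construction when the level is disabled.
        logger.warning(
            "Timeout while building docker image for task %s", task_name
        )
        return False
```

Routing the message through `logging` means its level, format, and destination can be changed in one place rather than at each call site.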

from terminal_bench.terminal.docker_compose_manager import DockerComposeManager


DATASET_ROOT = Path(__file__).resolve().parents[3] / "dataset"

medium

Hardcoding the dataset root path relative to the current file (Path(__file__).resolve().parents[3] / "dataset") is fragile. If the directory structure changes, this will break. It's better to make this path configurable, for instance by passing it as an argument to the function or reading it from a central configuration. This improves maintainability and reusability.

input_path=task_path,
output_path=Path("build_outputs"),
)
print(f"Task path: {task_path}")

medium

Using print for logging is generally discouraged in library or application code. It's better to use the logging module, which provides more control over verbosity, formatting, and output streams (e.g., stdout, stderr, files).

Collaborator

@garrett4wade garrett4wade left a comment

LGTM except for some cleanups

Comment thread examples/terminal_bench/pre_commit.txt Outdated
Comment thread examples/terminal_bench/gemini_rebuttals.png Outdated
Comment thread examples/terminal_bench/train_filtered_easy.parquet Outdated
Comment thread examples/terminal_bench/command.sh Outdated
Comment thread examples/terminal_bench/config_tb_sglang.yaml Outdated
Comment thread examples/terminal_bench/config_tb_sglang.yaml Outdated

@garrett4wade garrett4wade left a comment

LGTM

@garrett4wade garrett4wade merged commit aeb237b into inclusionAI:main Apr 24, 2026
1 check failed