
Commit e0ce670

Authored by nicoleqiwt, ByteDance, and claude
feat(tau2/vikingbot): benchmark updates (volcengine#2244)
* feat(benchmark/tau2): add VikingBot agent runner for tau2-bench

  Adds benchmark/tau2/vikingbot/, an end-to-end harness that runs the full VikingBot AgentLoop on tau2-bench tasks and commits trajectories back into OpenViking memory for epoch-based self-improvement. This complements the existing memory-retrieval harness in benchmark/tau2/ (which is retrieval-only).

  Contents:
  - scripts/vikingbot_tau2_runner.py: run one tau2 task through the agent loop (tau2 tool registry swap, simulated-time patch, advisory memory scope guard).
  - scripts/run_tau2_domain.sh / run_eval_reward.sh: run a domain split with bounded concurrency and score average reward.
  - scripts/commit_trajectory_to_memory.py: commit train trajectories to memory.
  - scripts/stat_trajectory.py, check_openviking_tool_calls.py: analysis helpers.
  - tau2_env/: tau2 environment + tool-provider integration.
  - run_full_test.sh and run_{airline,retail}_*epochs.sh: full / multi-epoch runs.
  - setup_env.sh, README.md, .gitignore.

  tau2-bench is referenced as an external dependency (cloned + installed by the user); no OpenViking core changes are required. The runner is API-compatible with bot/vikingbot on current main.

  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* refactor(benchmark/tau2): split into llm/ and vikingbot/ subfolders

  Mirror the two evaluation approaches as sibling subfolders under benchmark/tau2/:
  - llm/: the existing OpenViking Memory V2 retrieval harness, moved from benchmark/tau2/. All internal benchmark/tau2/... path references and the REPO_ROOT depth computations (run_full_eval.sh, tau2_common.py, run_memory_v2_eval.py) are updated for the extra directory level.
  - vikingbot/: the VikingBot agent runner (added in the previous commit).

  vikingbot/ cleanup:
  - make memory-block extraction time-independent: anchor on the stable session header and trailing reply instruction instead of a fixed simulated timestamp (the sim-time patch was removed, so the current time is now system-generated).
  - drop the now-removed sim-time / scope-guard notes from the README.
  - remove the unused stat_trajectory.py and check_openviking_tool_calls.py helpers.

  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* refactor(benchmark/tau2/vikingbot): train-once/test-8x eval, drop smolagents, doc updates

  - run_full_test.sh: run train once per epoch (experience extraction) and test N times in parallel (--test-repeats, default 8), reporting the averaged test accuracy; keep --commit/--no-commit.
  - tau2_environment.py: remove the unused smolagents Tool path (CommunicateWithUser / create_tool_from_json_schema / self.tools); communicate_with_user is handled directly in tool_call. tau2-bench has no smolagents dependency, so it is dropped.
  - README: reorder install (tau2-bench first so setup_env can derive TAU2_DATA_ROOT), explain train-once/test-8x methodology and train-only memory extraction, document the required bot/vikingbot core changes (agent_id isolation + agent-experience memory), fix sibling links to ../llm/.
  - Remove run_retail_3epochs.sh.

  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(tau2/vikingbot): one-step setup_env.sh + communicate_with_user refactor

  setup_env.sh now does full environment setup in a single `source`: creates a fresh repo-root .venv, clones tau2-bench (external dep), installs openviking + vikingbot (pip install -e ., runs the Cargo build) + tau2-bench + smolagents, then activates and exports the runtime env vars. Idempotent via a marker file; supports --reinstall. README updated to document the one-step flow and the overridable env vars.

  Also move the communicate_with_user tool into a CommunicateWithUser class in tau2_environment.py (owns both schema and execution) and drop the duplicated inline schema from tau2_tool_provider.py.

  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(tau2/vikingbot): sync setup_env.sh fixes + README port/diff clarifications

  Backport the environment-setup fixes and README clarifications discovered while running the harness end-to-end (the core bot/vikingbot code changes live on the test/tau2-vikingbot-core-changes branch, not here):
  - setup_env.sh: install the [bot] extra (prompt_toolkit/gradio/mcp/...), build + bundle ragfs_python via maturin when the editable install skips it under pip build isolation, and install tau2-bench with the [gym] extra (gymnasium)
  - README.md: explain the server port (default 1933 vs bot.ov_server.server_url) and show the None-safe forms of the Change-1 diffs

  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* clean README message

---------

Co-authored-by: ByteDance <wenting.qi@bytedance.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
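The time-independent memory-block extraction mentioned in the commit message above can be pictured with a small Python sketch. The two marker strings below are hypothetical placeholders, not the text the runner actually matches; the point is only that both anchors are stable prompt text rather than a timestamp.

```python
from typing import Optional

# Hypothetical anchors: the real header and instruction text are not shown in this commit.
SESSION_HEADER = "## Current Session"
REPLY_INSTRUCTION = "Reply to the user"


def extract_memory_block(prompt: str) -> Optional[str]:
    """Return the text between the session header and the trailing reply instruction."""
    start = prompt.find(SESSION_HEADER)
    if start == -1:
        return None
    start += len(SESSION_HEADER)
    # rfind takes the last occurrence, so an earlier mention inside the block is kept.
    end = prompt.rfind(REPLY_INSTRUCTION)
    if end == -1 or end < start:
        return None
    return prompt[start:end].strip()
```

Because neither anchor contains a time value, the same slice is found whatever the system-generated current time happens to be.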
1 parent bb22bac commit e0ce670

28 files changed

Lines changed: 1666 additions & 57 deletions
Lines changed: 35 additions & 35 deletions
@@ -12,7 +12,7 @@ Category rerank and other harness-only diagnostics are intentionally left out.
 ## Layout

 ```text
-benchmark/tau2/
+benchmark/tau2/llm/
 ├── config/
 │ ├── baseline.yaml
 │ ├── official.yaml
@@ -25,9 +25,9 @@ benchmark/tau2/
 └── run_full_eval.sh
 ```

-Generated eval artifacts are written to `benchmark/tau2/result/<run_id>/`.
+Generated eval artifacts are written to `benchmark/tau2/llm/result/<run_id>/`.
 Memory corpus artifacts are cached outside the run id at
-`benchmark/tau2/result/memory_corpora/` by default.
+`benchmark/tau2/llm/result/memory_corpora/` by default.

 ## Quick Start

@@ -59,26 +59,26 @@ For a local one-command setup, clone and install TAU-2 into ignored benchmark
 directories:

 ```bash
-benchmark/tau2/scripts/setup_tau2_repo.sh
-source benchmark/tau2/.env.tau2
+benchmark/tau2/llm/scripts/setup_tau2_repo.sh
+source benchmark/tau2/llm/.env.tau2
 ```

 For PR-B-compatible reproduction, pin the TAU-2 checkout to a ref that includes
 the confirmation-aware text-user-simulator prompt. The original PR-B evidence
 used the open TAU-2 fix PR head (`79dbf0c18ac7637aedf869cb3122babcd57aaf17`):

 ```bash
-benchmark/tau2/scripts/setup_tau2_repo.sh \
+benchmark/tau2/llm/scripts/setup_tau2_repo.sh \
 --ref refs/pull/297/head
-source benchmark/tau2/.env.tau2
+source benchmark/tau2/llm/.env.tau2
 ```

 Reference: [sierra-research/tau2-bench#297](https://github.com/sierra-research/tau2-bench/pull/297).

 Plan the default benchmark without running TAU-2:

 ```bash
-python benchmark/tau2/scripts/run_eval.py --config benchmark/tau2/config/baseline.yaml --plan-only
+python benchmark/tau2/llm/scripts/run_eval.py --config benchmark/tau2/llm/config/baseline.yaml --plan-only
 ```

 Add `--preflight` or `--strict-preflight` when you want the runner to write a
@@ -87,8 +87,8 @@ small environment/config check next to the run plan.
 After setup, verify the local TAU-2 link and write a one-cell run plan:

 ```bash
-benchmark/tau2/run_full_eval.sh \
---config benchmark/tau2/config/baseline.yaml \
+benchmark/tau2/llm/run_full_eval.sh \
+--config benchmark/tau2/llm/config/baseline.yaml \
 --strict-preflight \
 --domain retail \
 --strategy-id memory_v2_experience_only \
@@ -99,8 +99,8 @@ benchmark/tau2/run_full_eval.sh \
 Plan a one-cell Memory V2 pre-write smoke:

 ```bash
-benchmark/tau2/run_full_eval.sh \
---config benchmark/tau2/config/baseline.yaml \
+benchmark/tau2/llm/run_full_eval.sh \
+--config benchmark/tau2/llm/config/baseline.yaml \
 --domain retail \
 --strategy-id memory_v2_prewrite \
 --num-tasks 1 \
@@ -110,8 +110,8 @@ benchmark/tau2/run_full_eval.sh \
 Plan a one-cell trajectory memory smoke:

 ```bash
-benchmark/tau2/run_full_eval.sh \
---config benchmark/tau2/config/trajectory.yaml \
+benchmark/tau2/llm/run_full_eval.sh \
+--config benchmark/tau2/llm/config/trajectory.yaml \
 --domain retail \
 --strategy-id memory_v2_trajectory_view \
 --num-tasks 1 \
@@ -122,8 +122,8 @@ benchmark/tau2/run_full_eval.sh \
 Run the Memory V2 8-trial matrix (`retail + airline` x 2 strategies x 8 repeats):

 ```bash
-benchmark/tau2/run_full_eval.sh \
---config benchmark/tau2/config/baseline.yaml \
+benchmark/tau2/llm/run_full_eval.sh \
+--config benchmark/tau2/llm/config/baseline.yaml \
 --execute
 ```

@@ -143,15 +143,15 @@ the same confirmation-aware simulator policy but does not require fixed fixtures
 Run one bootstrap pass per domain:

 ```bash
-benchmark/tau2/run_full_eval.sh \
---config benchmark/tau2/config/fixed_first_user_bootstrap.yaml \
+benchmark/tau2/llm/run_full_eval.sh \
+--config benchmark/tau2/llm/config/fixed_first_user_bootstrap.yaml \
 --domain retail \
 --run-id fixed_first_user_bootstrap_retail \
 --strict-preflight \
 --execute

-benchmark/tau2/run_full_eval.sh \
---config benchmark/tau2/config/fixed_first_user_bootstrap.yaml \
+benchmark/tau2/llm/run_full_eval.sh \
+--config benchmark/tau2/llm/config/fixed_first_user_bootstrap.yaml \
 --domain airline \
 --run-id fixed_first_user_bootstrap_airline \
 --strict-preflight \
@@ -161,40 +161,40 @@ benchmark/tau2/run_full_eval.sh \
 Then convert each bootstrap `results.json` into a fixture:

 ```bash
-RETAIL_RESULTS=benchmark/tau2/result/fixed_first_user_bootstrap_retail/memory_cells/fixed_first_user_bootstrap_retail_retail_no_memory_r1/fixed_first_user_bootstrap_retail_retail_no_memory_r1.json
-AIRLINE_RESULTS=benchmark/tau2/result/fixed_first_user_bootstrap_airline/memory_cells/fixed_first_user_bootstrap_airline_airline_no_memory_r1/fixed_first_user_bootstrap_airline_airline_no_memory_r1.json
+RETAIL_RESULTS=benchmark/tau2/llm/result/fixed_first_user_bootstrap_retail/memory_cells/fixed_first_user_bootstrap_retail_retail_no_memory_r1/fixed_first_user_bootstrap_retail_retail_no_memory_r1.json
+AIRLINE_RESULTS=benchmark/tau2/llm/result/fixed_first_user_bootstrap_airline/memory_cells/fixed_first_user_bootstrap_airline_airline_no_memory_r1/fixed_first_user_bootstrap_airline_airline_no_memory_r1.json

-python benchmark/tau2/scripts/build_fixed_first_user_fixture.py \
+python benchmark/tau2/llm/scripts/build_fixed_first_user_fixture.py \
 --repo "$TAU2_REPO" \
 --results-json "$RETAIL_RESULTS" \
 --domain retail \
 --task-split-name test \
---output benchmark/tau2/result/fixed_first_user_fixtures/retail/fixed_first_user_fixture.json \
+--output benchmark/tau2/llm/result/fixed_first_user_fixtures/retail/fixed_first_user_fixture.json \
 --require-full-split

-python benchmark/tau2/scripts/build_fixed_first_user_fixture.py \
+python benchmark/tau2/llm/scripts/build_fixed_first_user_fixture.py \
 --repo "$TAU2_REPO" \
 --results-json "$AIRLINE_RESULTS" \
 --domain airline \
 --task-split-name test \
---output benchmark/tau2/result/fixed_first_user_fixtures/airline/fixed_first_user_fixture.json \
+--output benchmark/tau2/llm/result/fixed_first_user_fixtures/airline/fixed_first_user_fixture.json \
 --require-full-split
 ```
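The two bootstrap result paths above appear to share one naming pattern: `<output_dir>/<run_id>/memory_cells/<cell>/<cell>.json`, where `<cell>` joins the run id, domain, strategy id, and repeat index. A minimal Python sketch of that pattern follows. It is inferred only from the two examples, the helper name `cell_result_path` is hypothetical, and the harness may build its paths differently.

```python
from pathlib import PurePosixPath


def cell_result_path(
    output_dir: str, run_id: str, domain: str, strategy_id: str, repeat: int
) -> PurePosixPath:
    """Build the raw TAU-2 result path for one cell (pattern inferred, not confirmed)."""
    cell = f"{run_id}_{domain}_{strategy_id}_r{repeat}"
    return PurePosixPath(output_dir) / run_id / "memory_cells" / cell / f"{cell}.json"


print(cell_result_path(
    "benchmark/tau2/llm/result",
    "fixed_first_user_bootstrap_retail",
    "retail",
    "no_memory",
    1,
))
```

With the arguments shown, the printed path matches the `RETAIL_RESULTS` value in the new version of the README.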
 Export the generated fixture paths for subsequent strict runs:

 ```bash
-export TAU2_RETAIL_FIXED_FIRST_USER_FILE="$PWD/benchmark/tau2/result/fixed_first_user_fixtures/retail/fixed_first_user_fixture.json"
-export TAU2_AIRLINE_FIXED_FIRST_USER_FILE="$PWD/benchmark/tau2/result/fixed_first_user_fixtures/airline/fixed_first_user_fixture.json"
+export TAU2_RETAIL_FIXED_FIRST_USER_FILE="$PWD/benchmark/tau2/llm/result/fixed_first_user_fixtures/retail/fixed_first_user_fixture.json"
+export TAU2_AIRLINE_FIXED_FIRST_USER_FILE="$PWD/benchmark/tau2/llm/result/fixed_first_user_fixtures/airline/fixed_first_user_fixture.json"
 ```
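Since the fixture paths moved one directory deeper, a stale export from an older shell session would point at a file that no longer exists. The sketch below is one way to check both variables before a strict run. The function name `missing_fixtures` is hypothetical, and `--strict-preflight` may already perform an equivalent check.

```python
import os
from pathlib import Path
from typing import Dict, Mapping

# Variable names taken from the export lines above.
FIXTURE_VARS = (
    "TAU2_RETAIL_FIXED_FIRST_USER_FILE",
    "TAU2_AIRLINE_FIXED_FIRST_USER_FILE",
)


def missing_fixtures(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Map each problematic fixture variable to a short reason; empty means all good."""
    problems: Dict[str, str] = {}
    for name in FIXTURE_VARS:
        value = env.get(name)
        if not value:
            problems[name] = "not set"
        elif not Path(value).is_file():
            problems[name] = f"file not found: {value}"
    return problems


if __name__ == "__main__":
    for name, reason in missing_fixtures().items():
        print(f"{name}: {reason}")
```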
 ### 2. Run smoke and full PR-B matrix

 First run one tiny end-to-end smoke against a clean local OpenViking service:

 ```bash
-benchmark/tau2/run_full_eval.sh \
---config benchmark/tau2/config/prb_content_matrix_new_prompt.yaml \
+benchmark/tau2/llm/run_full_eval.sh \
+--config benchmark/tau2/llm/config/prb_content_matrix_new_prompt.yaml \
 --domain retail \
 --strategy-id new_traj_fixed_first_user_prewrite \
 --num-tasks 1 \
@@ -207,24 +207,24 @@ benchmark/tau2/run_full_eval.sh \
 Then run the full PR-B matrix:

 ```bash
-benchmark/tau2/run_full_eval.sh \
---config benchmark/tau2/config/prb_content_matrix_new_prompt.yaml \
+benchmark/tau2/llm/run_full_eval.sh \
+--config benchmark/tau2/llm/config/prb_content_matrix_new_prompt.yaml \
 --run-id prb_content_matrix_new_prompt_full8 \
 --strict-preflight \
 --execute
 ```

 The main result is written to
-`benchmark/tau2/result/prb_content_matrix_new_prompt_full8/scoreboard.json`.
+`benchmark/tau2/llm/result/prb_content_matrix_new_prompt_full8/scoreboard.json`.
 Per-cell execution records live under `cell_results/`, raw TAU-2 result JSON
 lives under `memory_cells/`, and corpus identity / generated memory checks live
 under `memory_corpora/`.

 For a small E2E smoke, keep both the eval and train slices tiny:

 ```bash
-benchmark/tau2/run_full_eval.sh \
---config benchmark/tau2/config/baseline.yaml \
+benchmark/tau2/llm/run_full_eval.sh \
+--config benchmark/tau2/llm/config/baseline.yaml \
 --domain retail \
 --strategy-id memory_v2_experience_only \
 --num-tasks 1 \
Lines changed: 2 additions & 2 deletions
@@ -18,10 +18,10 @@ benchmark:
 paths:
 tau2_repo: ${TAU2_REPO:-data/external_benchmarks/tau2-bench}
 tau2_cli: ${TAU2_CLI:-tau2}
-output_dir: benchmark/tau2/result
+output_dir: benchmark/tau2/llm/result
 # Corpus writes are expensive and should be reused across eval run ids when
 # the train split and memory prompt/config did not change.
-corpus_cache_dir: benchmark/tau2/result/memory_corpora
+corpus_cache_dir: benchmark/tau2/llm/result/memory_corpora

 eval:
 # Default OpenViking TAU-2 memory evidence uses the fixed-first-user full8
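The `tau2_repo` and `tau2_cli` values in this config use shell-style `${VAR:-default}` placeholders. How the harness resolves them is not part of this diff. The Python sketch below shows one plausible resolution, assuming POSIX-like semantics where the default applies when the variable is unset or empty; `expand_env_defaults` is a hypothetical helper name.

```python
import re
from typing import Mapping

# Matches ${NAME} and ${NAME:-default}; nested expansions are not handled.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::-([^}]*))?\}")


def expand_env_defaults(value: str, env: Mapping[str, str]) -> str:
    """Replace each placeholder with the env value, or its default when unset or empty."""

    def _substitute(match: "re.Match[str]") -> str:
        name, default = match.group(1), match.group(2)
        current = env.get(name)
        if current:  # set and non-empty
            return current
        return default if default is not None else ""

    return _PLACEHOLDER.sub(_substitute, value)


print(expand_env_defaults("${TAU2_REPO:-data/external_benchmarks/tau2-bench}", {}))
print(expand_env_defaults("${TAU2_CLI:-tau2}", {"TAU2_CLI": "/opt/bin/tau2"}))
```

Passing the environment as an argument keeps the function easy to test; a caller would normally pass `os.environ`.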

benchmark/tau2/config/fixed_first_user_bootstrap.yaml renamed to benchmark/tau2/llm/config/fixed_first_user_bootstrap.yaml

File renamed without changes.

benchmark/tau2/config/prb_content_matrix_new_prompt.yaml renamed to benchmark/tau2/llm/config/prb_content_matrix_new_prompt.yaml

Lines changed: 8 additions & 8 deletions
@@ -24,7 +24,7 @@ strategies:
 retrieval_mode: first_user
 retrieval_top_k: 4
 first_user_inject_top_k: 4
-scope_prompt_file: benchmark/tau2/config/scope_prompts/generic_memory_scope.md
+scope_prompt_file: benchmark/tau2/llm/config/scope_prompts/generic_memory_scope.md

 - id: new_traj_fixed_prewrite_only
 label: PR-B new trajectory fixed-count prewrite top2
@@ -40,7 +40,7 @@ strategies:
 retrieval_top_k: 4
 prewrite_retrieval_top_k: 2
 prewrite_inject_top_k: 2
-scope_prompt_file: benchmark/tau2/config/scope_prompts/generic_memory_scope.md
+scope_prompt_file: benchmark/tau2/llm/config/scope_prompts/generic_memory_scope.md

 - id: new_traj_fixed_first_user_prewrite
 label: PR-B new trajectory fixed-count first-user top4 + prewrite top2
@@ -57,7 +57,7 @@ strategies:
 first_user_inject_top_k: 4
 prewrite_retrieval_top_k: 2
 prewrite_inject_top_k: 2
-scope_prompt_file: benchmark/tau2/config/scope_prompts/generic_memory_scope.md
+scope_prompt_file: benchmark/tau2/llm/config/scope_prompts/generic_memory_scope.md

 - id: new_exp_fixed_first_user
 label: PR-B new experience fixed-count first-user top2
@@ -72,7 +72,7 @@ strategies:
 retrieval_mode: first_user
 retrieval_top_k: 2
 first_user_inject_top_k: 2
-scope_prompt_file: benchmark/tau2/config/scope_prompts/generic_memory_scope.md
+scope_prompt_file: benchmark/tau2/llm/config/scope_prompts/generic_memory_scope.md

 - id: new_exp_fixed_prewrite_only
 label: PR-B new experience fixed-count prewrite top2
@@ -88,7 +88,7 @@ strategies:
 retrieval_top_k: 2
 prewrite_retrieval_top_k: 2
 prewrite_inject_top_k: 2
-scope_prompt_file: benchmark/tau2/config/scope_prompts/generic_memory_scope.md
+scope_prompt_file: benchmark/tau2/llm/config/scope_prompts/generic_memory_scope.md

 - id: new_exp_fixed_first_user_prewrite
 label: PR-B new experience fixed-count first-user + prewrite top2
@@ -105,7 +105,7 @@ strategies:
 first_user_inject_top_k: 2
 prewrite_retrieval_top_k: 2
 prewrite_inject_top_k: 2
-scope_prompt_file: benchmark/tau2/config/scope_prompts/generic_memory_scope.md
+scope_prompt_file: benchmark/tau2/llm/config/scope_prompts/generic_memory_scope.md

 - id: new_traj_4000_prewrite_only
 label: PR-B new trajectory 4000-char prewrite
@@ -122,7 +122,7 @@ strategies:
 prewrite_retrieval_top_k: 8
 prewrite_inject_top_k: 8
 memory_inject_max_chars: 4000
-scope_prompt_file: benchmark/tau2/config/scope_prompts/generic_memory_scope.md
+scope_prompt_file: benchmark/tau2/llm/config/scope_prompts/generic_memory_scope.md

 - id: new_exp_4000_first_user_prewrite
 label: PR-B new experience 4000-char first-user + prewrite
@@ -140,4 +140,4 @@ strategies:
 prewrite_retrieval_top_k: 8
 prewrite_inject_top_k: 8
 memory_inject_max_chars: 4000
-scope_prompt_file: benchmark/tau2/config/scope_prompts/generic_memory_scope.md
+scope_prompt_file: benchmark/tau2/llm/config/scope_prompts/generic_memory_scope.md
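The strategy keys above (`*_inject_top_k`, `memory_inject_max_chars`) suggest a fixed-count cap on injected memories, optionally combined with a character budget. The harness logic that consumes them is not part of this diff, so the Python sketch below is only one plausible reading. The function name `select_memories` is hypothetical, and dropping a memory that would overflow the budget (rather than truncating it) is an assumption.

```python
from typing import List, Optional, Sequence, Tuple


def select_memories(
    scored: Sequence[Tuple[float, str]],
    inject_top_k: int,
    max_chars: Optional[int] = None,
) -> List[str]:
    """Keep the top-k memories by score, stopping early if a character budget is hit."""
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)[:inject_top_k]
    selected: List[str] = []
    used = 0
    for _score, text in ranked:
        # Assumption: stop at the first memory that would exceed the budget.
        if max_chars is not None and used + len(text) > max_chars:
            break
        selected.append(text)
        used += len(text)
    return selected
```

Under this reading, a fixed-count strategy such as `prewrite_inject_top_k: 2` is bounded by the count alone, while the 4000-char strategies with a top-k of 8 are bounded in practice by the character budget.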

benchmark/tau2/config/prb_scope_fairness.yaml renamed to benchmark/tau2/llm/config/prb_scope_fairness.yaml

Lines changed: 2 additions & 2 deletions
@@ -17,7 +17,7 @@ strategies:
 - id: no_memory_generic_scope
 label: TAU-2 no-memory same-seed baseline with generic memory scope prompt
 memory_backend: none
-scope_prompt_file: benchmark/tau2/config/scope_prompts/generic_memory_scope.md
+scope_prompt_file: benchmark/tau2/llm/config/scope_prompts/generic_memory_scope.md

 - id: trajectory_top4_first_user_prewrite_generic_scope
 label: Trajectory top4 first-user + pre-write top2 with generic memory scope prompt
@@ -31,4 +31,4 @@ strategies:
 first_user_inject_top_k: 4
 prewrite_retrieval_top_k: 2
 prewrite_inject_top_k: 2
-scope_prompt_file: benchmark/tau2/config/scope_prompts/generic_memory_scope.md
+scope_prompt_file: benchmark/tau2/llm/config/scope_prompts/generic_memory_scope.md

benchmark/tau2/config/scope_prompts/generic_memory_scope.md renamed to benchmark/tau2/llm/config/scope_prompts/generic_memory_scope.md

File renamed without changes.
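Moving every file one directory deeper is why the commit message mentions updated REPO_ROOT depth computations. The actual code in `run_full_eval.sh` and the Python helpers is not shown here, so the sketch below only illustrates the arithmetic with `pathlib`. The `/repo` checkout prefix and the helper name `repo_root_for` are hypothetical.

```python
from pathlib import PurePosixPath


def repo_root_for(file_path: str, depth: int) -> PurePosixPath:
    """Return the ancestor `depth` levels above the file's own directory.

    parents[0] is the directory containing the file, so depth counts from there.
    """
    return PurePosixPath(file_path).parents[depth]


# Before the move: benchmark/tau2/run_full_eval.sh sits two levels below the repo root.
old_root = repo_root_for("/repo/benchmark/tau2/run_full_eval.sh", 2)
# After the move: the extra llm/ level makes it three.
new_root = repo_root_for("/repo/benchmark/tau2/llm/run_full_eval.sh", 3)
print(old_root, new_root)
```

Leaving the old depth in place after the move would resolve to `/repo/benchmark`, which is the kind of off-by-one the path updates in this commit guard against.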
