README.md: 12 additions & 13 deletions
@@ -223,36 +223,36 @@ bayesian-agent summarize \
   --out temp/summary.json
 ```

-Run a live GenericAgent-backed SOP/Lifelong experiment. Use `--model` to switch between `deepseek-v4-flash` and `deepseek-v4-pro`:
+Run a live GenericAgent-backed benchmark experiment. Use the same script for SOP-Bench, Lifelong AgentBench, and RealFin-Bench; switch benchmarks with `--bench core`, `--bench sop`, `--bench lifelong`, or `--bench realfin`. Use `--model` to switch between `deepseek-v4-flash` and `deepseek-v4-pro`:

 ```bash
 cd Bayesian-Agent
 export GENERICAGENT_ROOT="/path/to/GenericAgent"
 export DEEPSEEK_API_KEY="sk-..."
 export MODEL="deepseek-v4-flash"
 "$GENERICAGENT_ROOT/.venv/bin/python" \
-  experiments/run_sop_lifelong.py \
+  experiments/run_benchmarks.py \
   --genericagent-root "$GENERICAGENT_ROOT" \
   --model "$MODEL" \
   --mode all \
-  --bench core \
-  --out-root "temp/sop_lifelong_${MODEL//-/_}"
+  --bench core
 ```

-Use `--limit 1` for a smoke test before running the full benchmark.
+With `--bench core`, the runner fans out into separate benchmark roots instead of sharing one combined directory: `results/sop_${MODEL//-/_}` and `results/lifelong_${MODEL//-/_}`. If you pass `--out-root temp/core_${MODEL//-/_}`, it is treated as a parent directory and the runs go to `temp/core_${MODEL//-/_}/sop` and `temp/core_${MODEL//-/_}/lifelong`.
+
+Use `--limit 1` for a smoke test before running the full benchmark. For RealFin-Bench, keep the same command shape and set `--bench realfin`; the default root becomes `results/realfin_${MODEL//-/_}`.

 Run incremental repair against an existing GA baseline by passing its result files. The script reruns only failed tasks:
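
Note on the result roots in the hunk above: the `${MODEL//-/_}` form is bash pattern substitution, which replaces every `-` in the model name with `_`. The sketch below only echoes the directory names that the new README text describes for `--bench core`; it does not call `run_benchmarks.py`, and the variable names (`model_slug`, `sop_root`, and so on) are illustrative, not taken from the repository.

```bash
#!/usr/bin/env bash
# Illustration only: mirrors the path layout described in the README text.
MODEL="deepseek-v4-flash"
model_slug="${MODEL//-/_}"   # replace every "-" with "_"

# Default roots described for --bench core when --out-root is not given.
sop_root="results/sop_${model_slug}"
lifelong_root="results/lifelong_${model_slug}"

# With --out-root, the given path is described as a parent directory.
out_root="temp/core_${model_slug}"
sop_custom="${out_root}/sop"
lifelong_custom="${out_root}/lifelong"

printf '%s\n' "$sop_root" "$lifelong_root" "$sop_custom" "$lifelong_custom"
```

Because `${var//pattern/replacement}` is a bash extension, the expansion should be run under bash rather than a strictly POSIX `sh`.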
@@ -344,20 +344,19 @@ The result shows that Bayesian-Agent can work as a plug-in repair layer: it can

 Experiment artifacts are stored under [`artifacts/`](artifacts/), and the method note is in [`docs/method.md`](docs/method.md).

-To reproduce the same experiment shape with another model, change only `--model` and `--out-root`:
+To reproduce the same experiment shape with another model, change only `--model`:

 ```bash
 export MODEL="deepseek-v4-pro"
 "$GENERICAGENT_ROOT/.venv/bin/python" \
-  experiments/run_sop_lifelong.py \
+  experiments/run_benchmarks.py \
   --genericagent-root "$GENERICAGENT_ROOT" \
   --model "$MODEL" \
   --mode all \
-  --bench core \
-  --out-root "temp/sop_lifelong_${MODEL//-/_}"
+  --bench core
 ```

-The script runs three phases by default: GA baseline, Bayesian full self-evolution, and Bayesian incremental repair using the fresh baseline for the selected model. Results are written to `<out-root>/summary.md`.
+The script runs three phases by default: GA baseline, Bayesian full self-evolution, and Bayesian incremental repair using the fresh baseline for the selected model. Each selected benchmark writes its own `summary.md` under its benchmark-specific result root.
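
Since each benchmark now writes its own `summary.md`, a quick check after a `--bench core` run might look like the sketch below. It assumes the default `results/<bench>_<model>` roots named in the README text and no `--out-root` override; `summary_path` is a hypothetical helper, not something the repository provides.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): builds the expected
# summary.md path for one benchmark, assuming the default results/ roots.
summary_path() {
  local bench="$1" model="$2"
  printf 'results/%s_%s/summary.md\n' "$bench" "${model//-/_}"
}

MODEL="${MODEL:-deepseek-v4-pro}"
for bench in sop lifelong; do
  path="$(summary_path "$bench" "$MODEL")"
  if [ -f "$path" ]; then
    echo "found:   $path"
  else
    echo "missing: $path"
  fi
done
```

If `--out-root` was passed, the paths would instead sit under that parent directory, so the helper's format string would need adjusting.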