Commit 9814e8f

feat: add sop lifelong benchmark runner
1 parent faf8453 commit 9814e8f

22 files changed

Lines changed: 1469 additions & 5491 deletions

.env_template

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+DEEPSEEK_API_KEY=your_deepseek_api_key_here
+GENERICAGENT_ROOT=../GenericAgent
+MODEL=deepseek-v4-flash
+MODE=all
+BENCH=core

.gitignore

Lines changed: 4 additions & 0 deletions
@@ -1,3 +1,7 @@
+# Project-specific gitignore file for Python projects
+temp/
+results/
+
 # Byte-compiled / optimized / DLL files
 __pycache__/
 *.py[codz]

CONTRIBUTING.md

Lines changed: 64 additions & 3 deletions
@@ -19,9 +19,11 @@ Bayesian-Agent is not meant to become a monolithic agent runtime. Its core value
 bayesian_agent/
 core/ # Framework-agnostic evidence, beliefs, registry, policy, context, repair
 adapters/ # Adapter protocol and optional external harness boundaries
+benchmarks/ # Benchmark orchestration owned by Bayesian-Agent
 schemas/ # Portable JSON schemas for trajectories and Skill beliefs
 artifacts/ # Experiment result artifacts
 docs/ # Documentation site content
+experiments/ # Reproducible experiment entry points
 tests/ # unittest test suite
 ```
 
@@ -36,7 +38,7 @@ External Harness Run -> trajectory-like mapping -> TrajectoryEvidence -> Bayesia
 Bayesian Skill Context -> Adapter -> External Harness Next Run
 ```
 
-An adapter should execute one task with posterior-weighted Skill context and return a trajectory-like mapping that Bayesian-Agent can normalize.
+An adapter should execute one task with posterior-weighted Skill context and return a trajectory-like mapping that Bayesian-Agent can normalize. It should not own benchmark loops, graders, Skill registries, or posterior update logic.
 
 ### 1. Start From the Protocol
 
@@ -98,7 +100,7 @@ class MyHarnessAdapter:
     def run(self, task: Mapping[str, Any], skill_context: str) -> Mapping[str, Any]:
         """Run one task and return a trajectory-like result.
 
-        Keep this method as a thin boundary around the external harness.
+        Keep this method as a thin boundary around one external harness run.
         """
         raise NotImplementedError(
             "Wire this method to your local MyHarness runner."
@@ -107,6 +109,8 @@ class MyHarnessAdapter:
 
 If the external harness has expensive imports, import them inside `run()` or behind a helper so `import bayesian_agent` remains lightweight.
 
+If your harness needs a lower-level task runner, expose it as an explicit method such as `run_task(prompt=..., workspace=..., max_turns=...)`. The GenericAgent adapter follows this pattern: it runs one prompt in one workspace and reports token usage, while SOP-Bench/Lifelong orchestration stays in `bayesian_agent/benchmarks/`.
+
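The explicit-method pattern added in this hunk could look roughly like the sketch below. It is only an illustration under stated assumptions: the injected `harness` callable, the `fake_harness` stand-in, the default `max_turns`, and the result keys (`output`, `input_tokens`, `output_tokens`, `total_tokens`) are hypothetical and are not taken from the actual GenericAgent adapter.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Callable, Mapping


@dataclass
class MyHarnessAdapter:
    """Hypothetical adapter: one prompt, one workspace, one result."""

    # Injected callable standing in for the external harness (an assumption).
    harness: Callable[[str, Path, int], Mapping[str, Any]]

    def run_task(self, *, prompt: str, workspace: Path, max_turns: int = 20) -> dict[str, Any]:
        # Execute exactly one harness run; no loops, grading, or registry access here.
        raw = self.harness(prompt, workspace, max_turns)
        input_tokens = int(raw.get("input_tokens", 0))
        output_tokens = int(raw.get("output_tokens", 0))
        return {
            "output": raw.get("output", ""),
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "total_tokens": input_tokens + output_tokens,
        }


def fake_harness(prompt: str, workspace: Path, max_turns: int) -> Mapping[str, Any]:
    """Stand-in harness so the sketch runs without GenericAgent installed."""
    return {"output": f"echo: {prompt}", "input_tokens": 12, "output_tokens": 3}


result = MyHarnessAdapter(harness=fake_harness).run_task(prompt="hello", workspace=Path("."))
```

Keeping the harness behind an injected callable is one way to defer expensive imports and to test the adapter without a live model; the real adapter may be wired differently.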
 ### 3. Return a Trajectory-Like Mapping
 
 The returned mapping should include fields compatible with `TrajectoryEvidence.from_run(...)`.
@@ -153,9 +157,66 @@ Avoid:
 - copying external framework source code into `bayesian_agent/adapters/`
 - importing a large framework at module import time
 - hiding benchmark graders inside the adapter
+- depending on historical experiment scripts from another repository
 - returning only free-form text without token usage or success signal
 - changing `bayesian_agent/core/` to fit one harness
 
+## Adding Benchmark Orchestration
+
+Benchmark runners belong under `bayesian_agent/benchmarks/`, not in external harness adapters.
+
+A benchmark module should:
+
+- build isolated task workspaces
+- construct task prompts
+- call an adapter for execution
+- grade results with deterministic local logic
+- record `TrajectoryEvidence` through the Bayesian Skill registry
+- write `results.json` and a small Markdown table with accuracy, input tokens, output tokens, total tokens, and efficiency
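The responsibilities listed above can be sketched as a small loop. This is a minimal illustration, not the runner added in this commit: the task shape, `stub_run_task`, the row keys, and the `summary.md` layout are assumptions. The registry step is left as a comment because its API is not shown in this diff, and the efficiency column is omitted because its formula is not defined here.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Callable

# Hypothetical task shape: id, question, expected answer.
TASKS = [
    {"id": "t1", "question": "2+2", "expected": "4"},
    {"id": "t2", "question": "3+3", "expected": "6"},
]


def stub_run_task(*, prompt: str, workspace: Path) -> dict[str, Any]:
    """Stand-in for an adapter call; always answers "4"."""
    return {"output": "4", "input_tokens": 100, "output_tokens": 5}


def run_benchmark(tasks, run_task: Callable[..., dict[str, Any]], out_root: Path):
    rows = []
    for task in tasks:
        workspace = out_root / "workspaces" / task["id"]  # isolated per-task workspace
        workspace.mkdir(parents=True, exist_ok=True)
        prompt = f"Answer with the number only: {task['question']}"  # prompt construction
        run = run_task(prompt=prompt, workspace=workspace)  # adapter does execution only
        rows.append({
            "task_id": task["id"],
            "success": run["output"].strip() == task["expected"],  # deterministic grade
            "input_tokens": run["input_tokens"],
            "output_tokens": run["output_tokens"],
            "total_tokens": run["input_tokens"] + run["output_tokens"],
        })
        # A real runner would record TrajectoryEvidence through the Skill registry here.
    (out_root / "results.json").write_text(json.dumps(rows, indent=2))
    accuracy = sum(r["success"] for r in rows) / len(rows)
    totals = {k: sum(r[k] for r in rows) for k in ("input_tokens", "output_tokens", "total_tokens")}
    table = "\n".join([
        "| Accuracy | Input tokens | Output tokens | Total tokens |",
        "|---:|---:|---:|---:|",
        f"| {accuracy:.0%} | {totals['input_tokens']} | {totals['output_tokens']} | {totals['total_tokens']} |",
    ])
    (out_root / "summary.md").write_text(table + "\n")
    return rows


with tempfile.TemporaryDirectory() as tmp:
    rows = run_benchmark(TASKS, stub_run_task, Path(tmp))
    summary = (Path(tmp) / "summary.md").read_text()
```

Passing the adapter in as a callable keeps grading and result writing in the benchmark module, which matches the boundary this section describes.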
+
+The current SOP-Bench/Lifelong runner is invoked through:
+
+```bash
+export GENERICAGENT_ROOT="/path/to/GenericAgent"
+export DEEPSEEK_API_KEY="sk-..."
+export MODEL="deepseek-v4-flash"
+
+"$GENERICAGENT_ROOT/.venv/bin/python" \
+  experiments/run_sop_lifelong.py \
+  --genericagent-root "$GENERICAGENT_ROOT" \
+  --model "$MODEL" \
+  --mode all \
+  --bench core \
+  --out-root "temp/sop_lifelong_${MODEL//-/_}"
+```
+
+Use the same script for `deepseek-v4-pro` by changing `MODEL`. Do not add model-specific scripts unless the model requires a genuinely different protocol.
+
+For incremental repair from an existing baseline:
+
+```bash
+"$GENERICAGENT_ROOT/.venv/bin/python" \
+  experiments/run_sop_lifelong.py \
+  --genericagent-root "$GENERICAGENT_ROOT" \
+  --model "$MODEL" \
+  --mode bayesian-incremental \
+  --bench core \
+  --baseline-results artifacts/ga_deepseek_baseline/sop_results.json \
+  --baseline-results artifacts/ga_deepseek_baseline/lifelong_results.json \
+  --out-root "temp/sop_lifelong_${MODEL//-/_}_incremental_from_ga"
+```
+
+### Benchmark PR Checklist
+
+Before opening a benchmark PR, make sure:
+
+- [ ] The runner is named after the benchmark, not after a paper table or temporary comparison label.
+- [ ] The adapter is used only for task execution.
+- [ ] Token accounting is included in every result row.
+- [ ] Incremental mode reruns only failed baseline tasks.
+- [ ] Heavy transcripts from imported baselines are compacted before writing new artifacts.
+- [ ] A smoke test with `--limit 1` succeeds before a full run.
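Two of the checklist items (rerun only failed baseline tasks, compact heavy transcripts) can be illustrated with a short sketch. The baseline file schema is not shown in this diff, so the row keys `task_id`, `success`, and `transcript`, as well as the `transcript_excerpt` field and the 200-character cutoff, are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def load_baseline_rows(paths: list[Path]) -> list[dict[str, Any]]:
    """Concatenate rows from every baseline results file passed in."""
    rows: list[dict[str, Any]] = []
    for path in paths:
        rows.extend(json.loads(path.read_text()))
    return rows


def failed_task_ids(rows: list[dict[str, Any]]) -> list[str]:
    """Only tasks the baseline failed are candidates for an incremental rerun."""
    return [row["task_id"] for row in rows if not row.get("success", False)]


def compact(row: dict[str, Any], max_chars: int = 200) -> dict[str, Any]:
    """Replace a heavy transcript with a short excerpt before writing new artifacts."""
    slim = dict(row)
    transcript = slim.pop("transcript", None)
    if transcript is not None:
        slim["transcript_excerpt"] = transcript[:max_chars]
    return slim


with tempfile.TemporaryDirectory() as tmp:
    sop = Path(tmp) / "sop_results.json"
    lifelong = Path(tmp) / "lifelong_results.json"
    sop.write_text(json.dumps([
        {"task_id": "sop_1", "success": True, "transcript": "x" * 1000},
        {"task_id": "sop_2", "success": False, "transcript": "y" * 1000},
    ]))
    lifelong.write_text(json.dumps([{"task_id": "ll_1", "success": False}]))
    baseline = load_baseline_rows([sop, lifelong])

to_rerun = failed_task_ids(baseline)
compacted = [compact(row) for row in baseline]
```

Accepting a list of paths mirrors the repeated `--baseline-results` flag in the command above.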
+
 ### 5. Add Tests
 
 Add tests under `tests/`.
@@ -228,7 +289,7 @@ Before opening a PR, make sure:
 
 ```bash
 python3 -m unittest discover -v
-python3 -m compileall bayesian_agent
+PYTHONPYCACHEPREFIX=/private/tmp/ba_pycache python3 -m compileall bayesian_agent experiments
 git diff --check
 ```
 
README.md

Lines changed: 50 additions & 3 deletions
@@ -19,7 +19,7 @@ It is designed to stand out from monolithic agent frameworks in three ways:
 - **Repair incrementally**: attach to an existing agent, read its failed trajectories, and rerun only the tasks that need repair.
 - **Adapt across harnesses**: integrate with GenericAgent today, and with other agent frameworks through a portable trajectory schema and adapter boundary.
 
-> v0.4 is the first standalone release. It includes the core Bayesian Skill Evolution package, schemas, CLI utilities, experiment artifacts, and a clean GenericAgent integration boundary. GenericAgent itself is not copied, vendored, or forked.
+> v0.4 is the first standalone release. It includes the core Bayesian Skill Evolution package, schemas, CLI utilities, experiment artifacts, and a runnable GenericAgent adapter boundary. GenericAgent itself is not copied, vendored, or forked.
 
 ## 📅 News
 
@@ -182,6 +182,38 @@ bayesian-agent summarize \
   --out temp/summary.json
 ```
 
+Run a live GenericAgent-backed SOP/Lifelong experiment. Use `--model` to switch between `deepseek-v4-flash` and `deepseek-v4-pro`:
+
+```bash
+cd Bayesian-Agent
+export GENERICAGENT_ROOT="/path/to/GenericAgent"
+export DEEPSEEK_API_KEY="sk-..."
+export MODEL="deepseek-v4-flash"
+"$GENERICAGENT_ROOT/.venv/bin/python" \
+  experiments/run_sop_lifelong.py \
+  --genericagent-root "$GENERICAGENT_ROOT" \
+  --model "$MODEL" \
+  --mode all \
+  --bench core \
+  --out-root "temp/sop_lifelong_${MODEL//-/_}"
+```
+
+Use `--limit 1` for a smoke test before running the full benchmark.
+
+Run incremental repair against an existing GA baseline by passing its result files. The script reruns only failed tasks:
+
+```bash
+"$GENERICAGENT_ROOT/.venv/bin/python" \
+  experiments/run_sop_lifelong.py \
+  --genericagent-root "$GENERICAGENT_ROOT" \
+  --model "$MODEL" \
+  --mode bayesian-incremental \
+  --bench core \
+  --baseline-results artifacts/ga_deepseek_baseline/sop_results.json \
+  --baseline-results artifacts/ga_deepseek_baseline/lifelong_results.json \
+  --out-root "temp/sop_lifelong_${MODEL//-/_}_incremental_from_ga"
+```
+
 ## 🐍 Python API
 
 ```python
@@ -262,13 +294,28 @@ In incremental mode, Bayesian-Agent only reran failed GenericAgent tasks:
 
 | Benchmark | Agent | Model | Final Accuracy | Incremental Input | Incremental Output | Incremental Total | Incremental Efficiency |
 |---|---|---|---:|---:|---:|---:|---:|
-| SOP-Bench | GA+BayesianIncremental | deepseek-v4-flash | 100% | 216k | 10k | 226k | 17.73 |
-| Lifelong AgentBench | GA+BayesianIncremental | deepseek-v4-flash | 100% | 71k | 7k | 78k | 25.57 |
+| SOP-Bench | GA+BayesianIncremental | deepseek-v4-flash | 100% | 254k | 14k | 268k | 14.93 |
+| Lifelong AgentBench | GA+BayesianIncremental | deepseek-v4-flash | 100% | 129k | 10k | 139k | 14.41 |
 
 The result shows that Bayesian-Agent can work as a plug-in repair layer: it can take an existing agent below 100% accuracy and improve it with a small amount of incremental inference. This is the practical advantage over one-off benchmark agents: Bayesian-Agent can sit beside a harness, learn from its failures, and improve it without replacing it.
 
 Experiment artifacts are stored under [`artifacts/`](artifacts/), and the method note is in [`docs/method.md`](docs/method.md).
 
+To reproduce the same experiment shape with another model, change only `--model` and `--out-root`:
+
+```bash
+export MODEL="deepseek-v4-pro"
+"$GENERICAGENT_ROOT/.venv/bin/python" \
+  experiments/run_sop_lifelong.py \
+  --genericagent-root "$GENERICAGENT_ROOT" \
+  --model "$MODEL" \
+  --mode all \
+  --bench core \
+  --out-root "temp/sop_lifelong_${MODEL//-/_}"
+```
+
+The script runs three phases by default: GA baseline, Bayesian full self-evolution, and Bayesian incremental repair using the fresh baseline for the selected model. Results are written to `<out-root>/summary.md`.
+
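The `${MODEL//-/_}` expansion in the commands above is Bash-specific: it replaces every `-` in the model name with `_`. For anyone driving the script from Python instead of a shell, the sketch below builds the same argument list per model. The `build_command` helper is hypothetical; only the flags and path layout shown in the README commands are used.

```python
import shlex


def build_command(model: str, genericagent_root: str, mode: str = "all", bench: str = "core") -> list[str]:
    """Build the argv for experiments/run_sop_lifelong.py for one model."""
    # Python equivalent of the Bash expansion ${MODEL//-/_}.
    out_root = f"temp/sop_lifelong_{model.replace('-', '_')}"
    return [
        f"{genericagent_root}/.venv/bin/python",
        "experiments/run_sop_lifelong.py",
        "--genericagent-root", genericagent_root,
        "--model", model,
        "--mode", mode,
        "--bench", bench,
        "--out-root", out_root,
    ]


commands = {
    model: build_command(model, "/path/to/GenericAgent")
    for model in ("deepseek-v4-flash", "deepseek-v4-pro")
}
for argv in commands.values():
    print(shlex.join(argv))  # shell-quoted form, ready to paste or pass to subprocess.run(argv)
```

Only `--model` and `--out-root` differ between the two generated commands, which is the property the README text relies on.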
 ## 🔌 GenericAgent and Cross-Harness Adaptation
 
 The first prototype was validated inside GenericAgent, but Bayesian-Agent is not a GenericAgent fork and not just a GenericAgent add-on.

README_ZH.md

Lines changed: 50 additions & 3 deletions
@@ -19,7 +19,7 @@ Bayesian-Agent is a cross-agent-framework / execution-harness Bayesi
 - **Incremental repair**: attach behind an existing agent, read its failed trajectories, and rerun only the tasks that need repair.
 - **Cross-harness adaptation**: currently adapted to GenericAgent; other agent frameworks can be integrated later through a unified trajectory schema and adapter boundary.
 
-> v0.4 is the first standalone open-source release. It includes the Bayesian Skill Evolution core package, schemas, CLI tools, experiment artifacts, and a clean GenericAgent integration boundary. GenericAgent itself is not copied, vendored, or forked into this repository.
+> v0.4 is the first standalone open-source release. It includes the Bayesian Skill Evolution core package, schemas, CLI tools, experiment artifacts, and a runnable GenericAgent adapter boundary. GenericAgent itself is not copied, vendored, or forked into this repository.
 
 ## 📅 News
 
@@ -182,6 +182,38 @@ bayesian-agent summarize \
   --out temp/summary.json
 ```
 
+Run a live GenericAgent-backed SOP/Lifelong experiment. Use `--model` to switch between `deepseek-v4-flash` and `deepseek-v4-pro`:
+
+```bash
+cd Bayesian-Agent
+export GENERICAGENT_ROOT="/path/to/GenericAgent"
+export DEEPSEEK_API_KEY="sk-..."
+export MODEL="deepseek-v4-flash"
+"$GENERICAGENT_ROOT/.venv/bin/python" \
+  experiments/run_sop_lifelong.py \
+  --genericagent-root "$GENERICAGENT_ROOT" \
+  --model "$MODEL" \
+  --mode all \
+  --bench core \
+  --out-root "temp/sop_lifelong_${MODEL//-/_}"
+```
+
+For a smoke test first, add `--limit 1`; once the script and token accounting look right, run the full benchmark.
+
+To run incremental repair on top of an existing GA baseline, pass its result files via `--baseline-results`. The script reruns only failed tasks:
+
+```bash
+"$GENERICAGENT_ROOT/.venv/bin/python" \
+  experiments/run_sop_lifelong.py \
+  --genericagent-root "$GENERICAGENT_ROOT" \
+  --model "$MODEL" \
+  --mode bayesian-incremental \
+  --bench core \
+  --baseline-results artifacts/ga_deepseek_baseline/sop_results.json \
+  --baseline-results artifacts/ga_deepseek_baseline/lifelong_results.json \
+  --out-root "temp/sop_lifelong_${MODEL//-/_}_incremental_from_ga"
+```
+
 ## 🐍 Python API
 
 ```python
@@ -262,13 +294,28 @@ The v0.4 prototype is based on GenericAgent and `deepseek-v4-flash`, on SOP-Bench and Lifel
 
 | Benchmark | Agent | Model | Final Accuracy | Incremental Input | Incremental Output | Incremental Total | Incremental Efficiency |
 |---|---|---|---:|---:|---:|---:|---:|
-| SOP-Bench | GA+BayesianIncremental | deepseek-v4-flash | 100% | 216k | 10k | 226k | 17.73 |
-| Lifelong AgentBench | GA+BayesianIncremental | deepseek-v4-flash | 100% | 71k | 7k | 78k | 25.57 |
+| SOP-Bench | GA+BayesianIncremental | deepseek-v4-flash | 100% | 254k | 14k | 268k | 14.93 |
+| Lifelong AgentBench | GA+BayesianIncremental | deepseek-v4-flash | 100% | 129k | 10k | 139k | 14.41 |
 
 This shows that Bayesian-Agent can serve as a plug-and-play repair layer: attached behind an agent that has not reached 100% accuracy, it fills in the failed tasks at a small incremental inference cost. This is also what sets it apart from an ordinary benchmark agent: it can sit beside a harness, learn from its failures, and improve it without replacing it.
 
 Experiment artifacts are under [`artifacts/`](artifacts/), and the method note is in [`docs/method.md`](docs/method.md).
 
+To reproduce the same experiment shape with another model, change only `--model` and `--out-root`:
+
+```bash
+export MODEL="deepseek-v4-pro"
+"$GENERICAGENT_ROOT/.venv/bin/python" \
+  experiments/run_sop_lifelong.py \
+  --genericagent-root "$GENERICAGENT_ROOT" \
+  --model "$MODEL" \
+  --mode all \
+  --bench core \
+  --out-root "temp/sop_lifelong_${MODEL//-/_}"
+```
+
+By default it runs three phases in order: GA baseline, Bayesian full self-evolution, and Bayesian incremental repair based on the fresh baseline for the selected model. Results are written to `<out-root>/summary.md`.
+
 ## 🔌 GenericAgent and Cross-Harness Adaptation
 
 The first prototype was validated inside GenericAgent, but Bayesian-Agent is not a GenericAgent fork, nor merely an add-on module for GenericAgent.
