 Bayesian Skill Context -> Adapter -> External Harness Next Run
 ```
 
-An adapter should execute one task with posterior-weighted Skill context and return a trajectory-like mapping that Bayesian-Agent can normalize.
+An adapter should execute one task with posterior-weighted Skill context and return a trajectory-like mapping that Bayesian-Agent can normalize. It should not own benchmark loops, graders, Skill registries, or posterior update logic.
         """Run one task and return a trajectory-like result.
 
-        Keep this method as a thin boundary around the external harness.
+        Keep this method as a thin boundary around one external harness run.
         """
         raise NotImplementedError(
             "Wire this method to your local MyHarness runner."
@@ -107,6 +109,8 @@ class MyHarnessAdapter:
 
 If the external harness has expensive imports, import them inside `run()` or behind a helper so `import bayesian_agent` remains lightweight.
 
+If your harness needs a lower-level task runner, expose it as an explicit method such as `run_task(prompt=..., workspace=..., max_turns=...)`. The GenericAgent adapter follows this pattern: it runs one prompt in one workspace and reports token usage, while SOP-Bench/Lifelong orchestration stays in `bayesian_agent/benchmarks/`.
+
 ### 3. Return a Trajectory-Like Mapping
 
 The returned mapping should include fields compatible with `TrajectoryEvidence.from_run(...)`.
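The adapter guidance in this hunk (deferred imports, an explicit `run_task`, a mapping that reports token usage) can be sketched roughly as follows. This is only an illustration under stated assumptions: the module name `my_harness`, its `run_prompt` entry point, the injected `runner` argument, and the result keys (`success`, `final_text`, `input_tokens`, `output_tokens`) are all hypothetical. The real keys are whatever `TrajectoryEvidence.from_run(...)` accepts, which this excerpt does not list.

```python
import importlib
from pathlib import Path
from typing import Any, Callable, Mapping, Optional

Runner = Callable[..., Mapping[str, Any]]


class MyHarnessAdapter:
    """Thin boundary around one external harness run (illustrative only)."""

    def __init__(self, harness_module: str = "my_harness",
                 runner: Optional[Runner] = None) -> None:
        self._harness_module = harness_module  # hypothetical module name
        self._runner = runner                  # injectable, e.g. for tests

    def _get_runner(self) -> Runner:
        # Deferred import: the heavy framework is loaded on first use,
        # not when the adapter module itself is imported.
        if self._runner is None:
            module = importlib.import_module(self._harness_module)
            self._runner = module.run_prompt  # hypothetical entry point
        return self._runner

    def run_task(self, *, prompt: str, workspace: Path, max_turns: int) -> dict:
        """Run one prompt in one workspace and report token usage."""
        raw = self._get_runner()(prompt=prompt, workspace=workspace,
                                 max_turns=max_turns)
        # Hypothetical keys; align them with TrajectoryEvidence.from_run(...).
        return {
            "success": bool(raw.get("success", False)),
            "final_text": str(raw.get("text", "")),
            "input_tokens": int(raw.get("input_tokens", 0)),
            "output_tokens": int(raw.get("output_tokens", 0)),
        }


def fake_runner(*, prompt: str, workspace: Path, max_turns: int) -> dict:
    """Stand-in for a real harness so the sketch runs without one."""
    return {"success": True, "text": f"done: {prompt}",
            "input_tokens": 12, "output_tokens": 3}


adapter = MyHarnessAdapter(runner=fake_runner)
result = adapter.run_task(prompt="list files", workspace=Path("."), max_turns=5)
print(result)
```

Injecting the runner keeps the adapter testable without the external framework installed, and the adapter still contains no benchmark loop, grader, or posterior update.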
@@ -153,9 +157,66 @@ Avoid:
 - copying external framework source code into `bayesian_agent/adapters/`
 - importing a large framework at module import time
 - hiding benchmark graders inside the adapter
+- depending on historical experiment scripts from another repository
 - returning only free-form text without token usage or success signal
 - changing `bayesian_agent/core/` to fit one harness
 
+## Adding Benchmark Orchestration
+
+Benchmark runners belong under `bayesian_agent/benchmarks/`, not in external harness adapters.
+
+A benchmark module should:
+
+- build isolated task workspaces
+- construct task prompts
+- call an adapter for execution
+- grade results with deterministic local logic
+- record `TrajectoryEvidence` through the Bayesian Skill registry
+- write `results.json` and a small Markdown table with accuracy, input tokens, output tokens, total tokens, and efficiency
+
+The current SOP-Bench/Lifelong runner is invoked through:
+
+```bash
+export GENERICAGENT_ROOT="/path/to/GenericAgent"
+export DEEPSEEK_API_KEY="sk-..."
+export MODEL="deepseek-v4-flash"
+
+"$GENERICAGENT_ROOT/.venv/bin/python" \
+  experiments/run_sop_lifelong.py \
+  --genericagent-root "$GENERICAGENT_ROOT" \
+  --model "$MODEL" \
+  --mode all \
+  --bench core \
+  --out-root "temp/sop_lifelong_${MODEL//-/_}"
+```
+
+Use the same script for `deepseek-v4-pro` by changing `MODEL`. Do not add model-specific scripts unless the model requires a genuinely different protocol.
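The responsibilities listed for a benchmark module in this section can be sketched as a single loop. This is a minimal illustration, not the SOP-Bench/Lifelong runner: `TASKS`, `fake_adapter`, `run_benchmark`, the `record` hook, the exact-match grader, and the JSON layout are all assumptions made for the example. A real module would record `TrajectoryEvidence` through the Bayesian Skill registry where the hook is called.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Optional

# Hypothetical task records; a real benchmark loads these from its dataset.
TASKS = [
    {"id": "t1", "question": "2 + 2", "expected": "4"},
    {"id": "t2", "question": "3 * 3", "expected": "9"},
]


def fake_adapter(*, prompt: str, workspace: Path) -> dict:
    """Stand-in for a real adapter call; always answers "4"."""
    return {"answer": "4", "input_tokens": 10, "output_tokens": 1}


def run_benchmark(adapter: Callable[..., dict], out_dir: Path,
                  record: Optional[Callable[[dict], None]] = None) -> dict:
    rows = []
    for task in TASKS:
        # Build an isolated workspace for this task only.
        with tempfile.TemporaryDirectory(prefix=f"{task['id']}_") as ws:
            # Prompt construction lives in the benchmark, not the adapter.
            prompt = f"Compute: {task['question']}"
            # Execution is delegated to the adapter.
            run = adapter(prompt=prompt, workspace=Path(ws))
        # Deterministic local grading (exact match in this sketch).
        success = run["answer"].strip() == task["expected"]
        row = {"task_id": task["id"], "success": success,
               "input_tokens": run["input_tokens"],
               "output_tokens": run["output_tokens"]}
        rows.append(row)
        if record is not None:
            record(row)  # placeholder for registry-based evidence recording
    total_in = sum(r["input_tokens"] for r in rows)
    total_out = sum(r["output_tokens"] for r in rows)
    summary = {
        "accuracy": sum(r["success"] for r in rows) / len(rows),
        "input_tokens": total_in,
        "output_tokens": total_out,
        "total_tokens": total_in + total_out,
        "tasks": rows,
    }
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "results.json").write_text(json.dumps(summary, indent=2))
    table = (
        "| accuracy | input tokens | output tokens | total tokens |\n"
        "|---|---|---|---|\n"
        f"| {summary['accuracy']:.2f} | {total_in} | {total_out} "
        f"| {total_in + total_out} |\n"
    )
    (out_dir / "summary.md").write_text(table)
    return summary


recorded = []
with tempfile.TemporaryDirectory() as tmp:
    summary = run_benchmark(fake_adapter, Path(tmp) / "out",
                            record=recorded.append)
print(summary["accuracy"], summary["total_tokens"])
```

The efficiency column is left out here because this excerpt does not define how it is computed.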
README.md: 50 additions & 3 deletions
@@ -19,7 +19,7 @@ It is designed to stand out from monolithic agent frameworks in three ways:
 - **Repair incrementally**: attach to an existing agent, read its failed trajectories, and rerun only the tasks that need repair.
 - **Adapt across harnesses**: integrate with GenericAgent today, and with other agent frameworks through a portable trajectory schema and adapter boundary.
 
-> v0.4 is the first standalone release. It includes the core Bayesian Skill Evolution package, schemas, CLI utilities, experiment artifacts, and a clean GenericAgent integration boundary. GenericAgent itself is not copied, vendored, or forked.
+> v0.4 is the first standalone release. It includes the core Bayesian Skill Evolution package, schemas, CLI utilities, experiment artifacts, and a runnable GenericAgent adapter boundary. GenericAgent itself is not copied, vendored, or forked.
 
 ## 📅 News
 
@@ -182,6 +182,38 @@ bayesian-agent summarize \
   --out temp/summary.json
 ```
 
+Run a live GenericAgent-backed SOP/Lifelong experiment. Use `--model` to switch between `deepseek-v4-flash` and `deepseek-v4-pro`:
+
+```bash
+cd Bayesian-Agent
+export GENERICAGENT_ROOT="/path/to/GenericAgent"
+export DEEPSEEK_API_KEY="sk-..."
+export MODEL="deepseek-v4-flash"
+"$GENERICAGENT_ROOT/.venv/bin/python" \
+  experiments/run_sop_lifelong.py \
+  --genericagent-root "$GENERICAGENT_ROOT" \
+  --model "$MODEL" \
+  --mode all \
+  --bench core \
+  --out-root "temp/sop_lifelong_${MODEL//-/_}"
+```
+
+Use `--limit 1` for a smoke test before running the full benchmark.
+
+Run incremental repair against an existing GA baseline by passing its result files. The script reruns only failed tasks:
 The result shows that Bayesian-Agent can work as a plug-in repair layer: it can take an existing agent below 100% accuracy and improve it with a small amount of incremental inference. This is the practical advantage over one-off benchmark agents: Bayesian-Agent can sit beside a harness, learn from its failures, and improve it without replacing it.
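The repair-layer idea in this paragraph (spend inference only on what the baseline got wrong) comes down to a filter over baseline results. The sketch below assumes a hypothetical result-file layout with a `tasks` list of `task_id` and `success` entries, and the helper name `failed_task_ids` is invented for illustration. The schema actually read by `experiments/run_sop_lifelong.py` is not shown in this diff.

```python
import json
import tempfile
from pathlib import Path


def failed_task_ids(result_files: list[Path]) -> list[str]:
    """Collect IDs of tasks that failed in any of the given baseline files."""
    failed: list[str] = []
    for path in result_files:
        data = json.loads(path.read_text())
        # Assumed layout: {"tasks": [{"task_id": str, "success": bool}, ...]}
        for task in data["tasks"]:
            if not task["success"] and task["task_id"] not in failed:
                failed.append(task["task_id"])
    return failed


# Demo with a throwaway baseline file.
with tempfile.TemporaryDirectory() as tmp:
    baseline = Path(tmp) / "results.json"
    baseline.write_text(json.dumps({"tasks": [
        {"task_id": "t1", "success": True},
        {"task_id": "t2", "success": False},
        {"task_id": "t3", "success": False},
    ]}))
    to_rerun = failed_task_ids([baseline])

print(to_rerun)  # ['t2', 't3']
```

Only the returned IDs would be passed back to the adapter, which is where the saving in inference comes from.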
 
 Experiment artifacts are stored under [`artifacts/`](artifacts/), and the method note is in [`docs/method.md`](docs/method.md).
 
+To reproduce the same experiment shape with another model, change only `--model` and `--out-root`:
+
+```bash
+export MODEL="deepseek-v4-pro"
+"$GENERICAGENT_ROOT/.venv/bin/python" \
+  experiments/run_sop_lifelong.py \
+  --genericagent-root "$GENERICAGENT_ROOT" \
+  --model "$MODEL" \
+  --mode all \
+  --bench core \
+  --out-root "temp/sop_lifelong_${MODEL//-/_}"
+```
+
+The script runs three phases by default: GA baseline, Bayesian full self-evolution, and Bayesian incremental repair using the fresh baseline for the selected model. Results are written to `<out-root>/summary.md`.
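Since `--out-root` is derived from the model name, it can help to see where each model's summary ends up. The sketch below mirrors the shell expansion `${MODEL//-/_}` in Python. The helper name `out_root_for` and its `base` parameter are hypothetical and not part of the repository.

```python
from pathlib import Path


def out_root_for(model: str, base: str = "temp") -> Path:
    """Mirror the shell expansion ${MODEL//-/_} used for --out-root."""
    return Path(base) / f"sop_lifelong_{model.replace('-', '_')}"


for model in ("deepseek-v4-flash", "deepseek-v4-pro"):
    root = out_root_for(model)
    # Each model writes to its own directory, so runs do not overwrite each other.
    print(root.as_posix(), "->", (root / "summary.md").as_posix())
```

For `deepseek-v4-pro` this gives `temp/sop_lifelong_deepseek_v4_pro/summary.md`.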
+
 ## 🔌 GenericAgent and Cross-Harness Adaptation
 
 The first prototype was validated inside GenericAgent, but Bayesian-Agent is not a GenericAgent fork and not just a GenericAgent add-on.