DataArcTech
diff --git a/.env_template b/.env_template
Lines changed: 5 additions & 0 deletions
diff --git a/.gitignore b/.gitignore
Lines changed: 4 additions & 0 deletions
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
Lines changed: 64 additions & 3 deletions
diff --git a/README.md b/README.md
Lines changed: 50 additions & 3 deletions
diff --git a/README_ZH.md b/README_ZH.md
Lines changed: 50 additions & 3 deletions
@@ -0,0 +1,5 @@
+DEEPSEEK_API_KEY=your_deepseek_api_key_here
+GENERICAGENT_ROOT=../GenericAgent
+MODEL=deepseek-v4-flash
+MODE=all
+BENCH=core
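
The template uses plain `KEY=value` lines. A minimal sketch of how such a file could be read with only the standard library follows. `parse_env_text` and `load_settings` are hypothetical helpers, not functions from this repository, and this diff does not show how (or whether) the experiment script reads the file.

```python
import os
from typing import Dict, Mapping, Optional


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse KEY=value lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and one layer of optional quotes.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_settings(text: str, environ: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Merge file values with the process environment; the environment wins."""
    environ = os.environ if environ is None else environ
    settings = parse_env_text(text)
    for key in settings:
        if key in environ:
            settings[key] = environ[key]
    return settings


# The five keys from .env_template, used here as sample input.
TEMPLATE = """\
DEEPSEEK_API_KEY=your_deepseek_api_key_here
GENERICAGENT_ROOT=../GenericAgent
MODEL=deepseek-v4-flash
MODE=all
BENCH=core
"""

settings = load_settings(TEMPLATE, environ={"MODEL": "deepseek-v4-pro"})
```

Letting exported variables override file values keeps the documented `export MODEL=...` workflow working even when a local `.env` exists.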
@@ -1,3 +1,7 @@
+# Project-specific gitignore file for Python projects
+temp/
+results/
+
 # Byte-compiled / optimized / DLL files
 __pycache__/
 *.py[codz]
 
@@ -19,9 +19,11 @@ Bayesian-Agent is not meant to become a monolithic agent runtime. Its core value
 bayesian_agent/
   core/                 # Framework-agnostic evidence, beliefs, registry, policy, context, repair
   adapters/             # Adapter protocol and optional external harness boundaries
+  benchmarks/           # Benchmark orchestration owned by Bayesian-Agent
 schemas/                # Portable JSON schemas for trajectories and Skill beliefs
 artifacts/              # Experiment result artifacts
 docs/                   # Documentation site content
+experiments/            # Reproducible experiment entry points
 tests/                  # unittest test suite
 ```
 
@@ -36,7 +38,7 @@ External Harness Run -> trajectory-like mapping -> TrajectoryEvidence -> Bayesia
 Bayesian Skill Context -> Adapter -> External Harness Next Run
 ```
 
-An adapter should execute one task with posterior-weighted Skill context and return a trajectory-like mapping that Bayesian-Agent can normalize.
+An adapter should execute one task with posterior-weighted Skill context and return a trajectory-like mapping that Bayesian-Agent can normalize. It should not own benchmark loops, graders, Skill registries, or posterior update logic.
 
 ### 1. Start From the Protocol
 
@@ -98,7 +100,7 @@ class MyHarnessAdapter:
     def run(self, task: Mapping[str, Any], skill_context: str) -> Mapping[str, Any]:
         """Run one task and return a trajectory-like result.
 
-        Keep this method as a thin boundary around the external harness.
+        Keep this method as a thin boundary around one external harness run.
         """
         raise NotImplementedError(
             "Wire this method to your local MyHarness runner."
@@ -107,6 +109,8 @@ class MyHarnessAdapter:
 
 If the external harness has expensive imports, import them inside `run()` or behind a helper so `import bayesian_agent` remains lightweight.
 
+If your harness needs a lower-level task runner, expose it as an explicit method such as `run_task(prompt=..., workspace=..., max_turns=...)`. The GenericAgent adapter follows this pattern: it runs one prompt in one workspace and reports token usage, while SOP-Bench/Lifelong orchestration stays in `bayesian_agent/benchmarks/`.
+
 ### 3. Return a Trajectory-Like Mapping
 
 The returned mapping should include fields compatible with `TrajectoryEvidence.from_run(...)`.
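
A minimal sketch of the `run_task(prompt=..., workspace=..., max_turns=...)` pattern described above. `EchoHarnessAdapter` and the result keys (`success`, `output`, `turns`, `input_tokens`, `output_tokens`) are hypothetical; the real field names should follow whatever `TrajectoryEvidence.from_run(...)` expects, which this diff does not show.

```python
import tempfile
from pathlib import Path
from typing import Any, Dict, Mapping


class EchoHarnessAdapter:
    """Hypothetical adapter: one prompt, one workspace, one result mapping."""

    name = "echo-harness"

    def run_task(self, *, prompt: str, workspace: Path, max_turns: int = 8) -> Dict[str, Any]:
        # A real adapter would import and call the external harness here,
        # keeping heavy imports out of module import time.
        workspace.mkdir(parents=True, exist_ok=True)
        answer = prompt.upper()  # stand-in for the harness's final answer
        (workspace / "answer.txt").write_text(answer, encoding="utf-8")
        return {
            "success": None,  # grading is the benchmark's job, not the adapter's
            "output": answer,
            "turns": 1,
            "max_turns": max_turns,
            # Whitespace token counts are a placeholder for real usage reporting.
            "input_tokens": len(prompt.split()),
            "output_tokens": len(answer.split()),
        }

    def run(self, task: Mapping[str, Any], skill_context: str) -> Mapping[str, Any]:
        # Thin boundary: build the prompt, delegate to run_task, return the mapping.
        prompt = f"{skill_context}\n\n{task['prompt']}".strip()
        return self.run_task(prompt=prompt, workspace=Path(task["workspace"]))


with tempfile.TemporaryDirectory() as tmp:
    result = EchoHarnessAdapter().run({"prompt": "say hi", "workspace": tmp}, "use skill a")
    saved_answer = (Path(tmp) / "answer.txt").read_text(encoding="utf-8")
```

The adapter reports token usage but leaves `success` unset, so grading stays with the benchmark module.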
@@ -153,9 +157,66 @@ Avoid:
 - copying external framework source code into `bayesian_agent/adapters/`
 - importing a large framework at module import time
 - hiding benchmark graders inside the adapter
+- depending on historical experiment scripts from another repository
 - returning only free-form text without token usage or success signal
 - changing `bayesian_agent/core/` to fit one harness
 
+## Adding Benchmark Orchestration
+
+Benchmark runners belong under `bayesian_agent/benchmarks/`, not in external harness adapters.
+
+A benchmark module should:
+
+- build isolated task workspaces
+- construct task prompts
+- call an adapter for execution
+- grade results with deterministic local logic
+- record `TrajectoryEvidence` through the Bayesian Skill registry
+- write `results.json` and a small Markdown table with accuracy, input tokens, output tokens, total tokens, and efficiency
+
+The current SOP-Bench/Lifelong runner is invoked through:
+
+```bash
+export GENERICAGENT_ROOT="/path/to/GenericAgent"
+export DEEPSEEK_API_KEY="sk-..."
+export MODEL="deepseek-v4-flash"
+
+"$GENERICAGENT_ROOT/.venv/bin/python" \
+  experiments/run_sop_lifelong.py \
+  --genericagent-root "$GENERICAGENT_ROOT" \
+  --model "$MODEL" \
+  --mode all \
+  --bench core \
+  --out-root "temp/sop_lifelong_${MODEL//-/_}"
+```
+
+Use the same script for `deepseek-v4-pro` by changing `MODEL`. Do not add model-specific scripts unless the model requires a genuinely different protocol.
+
+For incremental repair from an existing baseline:
+
+```bash
+"$GENERICAGENT_ROOT/.venv/bin/python" \
+  experiments/run_sop_lifelong.py \
+  --genericagent-root "$GENERICAGENT_ROOT" \
+  --model "$MODEL" \
+  --mode bayesian-incremental \
+  --bench core \
+  --baseline-results artifacts/ga_deepseek_baseline/sop_results.json \
+  --baseline-results artifacts/ga_deepseek_baseline/lifelong_results.json \
+  --out-root "temp/sop_lifelong_${MODEL//-/_}_incremental_from_ga"
+```
+
+### Benchmark PR Checklist
+
+Before opening a benchmark PR, make sure:
+
+- [ ] The runner is named after the benchmark, not after a paper table or temporary comparison label.
+- [ ] The adapter is used only for task execution.
+- [ ] Token accounting is included in every result row.
+- [ ] Incremental mode reruns only failed baseline tasks.
+- [ ] Heavy transcripts from imported baselines are compacted before writing new artifacts.
+- [ ] A smoke test with `--limit 1` succeeds before a full run.
+
 ### 5. Add Tests
 
 Add tests under `tests/`.
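
The benchmark responsibilities listed in "Adding Benchmark Orchestration" can be sketched as one small loop. Everything here is hypothetical scaffolding: `run_benchmark`, the task dictionary shape, the keys returned by the adapter call, and the efficiency definition (passed tasks per million total tokens) are assumptions for illustration. The `TrajectoryEvidence` recording step is left as a comment because its API is not shown in this diff.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Callable, Dict, List


def run_benchmark(tasks: List[Dict[str, Any]], run_task: Callable[..., Dict[str, Any]],
                  out_root: Path) -> Dict[str, Any]:
    """Run each task in an isolated workspace, grade locally, write summaries."""
    out_root.mkdir(parents=True, exist_ok=True)
    rows = []
    for task in tasks:
        workspace = out_root / "workspaces" / task["id"]  # isolated per task
        workspace.mkdir(parents=True, exist_ok=True)
        prompt = f"Follow the procedure and answer.\n\n{task['question']}"
        result = run_task(prompt=prompt, workspace=workspace, max_turns=8)
        # Deterministic local grading: exact match on the normalized answer.
        passed = result["output"].strip().lower() == task["expected"].strip().lower()
        # A real runner would record TrajectoryEvidence via the Skill registry here.
        rows.append({
            "task_id": task["id"],
            "success": passed,
            "input_tokens": result["input_tokens"],
            "output_tokens": result["output_tokens"],
            "total_tokens": result["input_tokens"] + result["output_tokens"],
        })
    total = sum(r["total_tokens"] for r in rows)
    n_pass = sum(r["success"] for r in rows)
    summary = {
        "accuracy": n_pass / len(rows) if rows else 0.0,
        "input_tokens": sum(r["input_tokens"] for r in rows),
        "output_tokens": sum(r["output_tokens"] for r in rows),
        "total_tokens": total,
        # Assumed metric: passed tasks per million tokens.
        "efficiency": n_pass / (total / 1_000_000) if total else 0.0,
    }
    (out_root / "results.json").write_text(
        json.dumps({"rows": rows, "summary": summary}, indent=2), encoding="utf-8")
    table = (
        "| Accuracy | Input | Output | Total | Efficiency |\n"
        "|---:|---:|---:|---:|---:|\n"
        f"| {summary['accuracy']:.0%} | {summary['input_tokens']} | "
        f"{summary['output_tokens']} | {total} | {summary['efficiency']:.2f} |\n"
    )
    (out_root / "summary.md").write_text(table, encoding="utf-8")
    return summary


def fake_run_task(*, prompt: str, workspace: Path, max_turns: int) -> Dict[str, Any]:
    """Stand-in for an adapter call; answers '4' to every prompt."""
    return {"output": "4", "input_tokens": 100, "output_tokens": 10}


with tempfile.TemporaryDirectory() as tmp:
    demo_summary = run_benchmark(
        [{"id": "t1", "question": "2 + 2?", "expected": "4"},
         {"id": "t2", "question": "3 + 3?", "expected": "6"}],
        fake_run_task, Path(tmp) / "run")
    demo_table = (Path(tmp) / "run" / "summary.md").read_text(encoding="utf-8")
    demo_results = json.loads((Path(tmp) / "run" / "results.json").read_text(encoding="utf-8"))
```

Passing the execution step in as a callable keeps the loop independent of any one harness, which matches the rule that the adapter is used only for task execution.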
@@ -228,7 +289,7 @@ Before opening a PR, make sure:
 
 ```bash
 python3 -m unittest discover -v
-python3 -m compileall bayesian_agent
+PYTHONPYCACHEPREFIX=/private/tmp/ba_pycache python3 -m compileall bayesian_agent experiments
 git diff --check
 ```
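
`PYTHONPYCACHEPREFIX` redirects bytecode caches away from the source tree, and `/private/tmp` is a macOS path. A minimal sketch of a portable equivalent follows, assuming Python 3.8+ (where `sys.pycache_prefix` exists). `compile_tree_outside_source` is a hypothetical helper, not part of this repository.

```python
import compileall
import sys
import tempfile
from pathlib import Path


def compile_tree_outside_source(source_dir: Path, cache_dir: Path) -> bool:
    """Byte-compile source_dir while writing .pyc files under cache_dir."""
    previous = sys.pycache_prefix
    sys.pycache_prefix = str(cache_dir)  # runtime counterpart of PYTHONPYCACHEPREFIX
    try:
        return bool(compileall.compile_dir(str(source_dir), quiet=1, force=True))
    finally:
        sys.pycache_prefix = previous  # restore the interpreter-wide setting


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "pkg"
    src.mkdir()
    (src / "mod.py").write_text("X = 1\n", encoding="utf-8")
    cache = Path(tmp) / "pycache"
    ok = compile_tree_outside_source(src, cache)
    left_in_source = (src / "__pycache__").exists()
    compiled = list(cache.rglob("*.pyc"))
```

Using a temporary directory avoids hard-coding a platform-specific cache path in the check.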
 
 
@@ -19,7 +19,7 @@ It is designed to stand out from monolithic agent frameworks in three ways:
 - **Repair incrementally**: attach to an existing agent, read its failed trajectories, and rerun only the tasks that need repair.
 - **Adapt across harnesses**: integrate with GenericAgent today, and with other agent frameworks through a portable trajectory schema and adapter boundary.
 
-> v0.4 is the first standalone release. It includes the core Bayesian Skill Evolution package, schemas, CLI utilities, experiment artifacts, and a clean GenericAgent integration boundary. GenericAgent itself is not copied, vendored, or forked.
+> v0.4 is the first standalone release. It includes the core Bayesian Skill Evolution package, schemas, CLI utilities, experiment artifacts, and a runnable GenericAgent adapter boundary. GenericAgent itself is not copied, vendored, or forked.
 
 ## 📅 News
 
@@ -182,6 +182,38 @@ bayesian-agent summarize \
   --out temp/summary.json
 ```
 
+Run a live GenericAgent-backed SOP/Lifelong experiment. Use `--model` to switch between `deepseek-v4-flash` and `deepseek-v4-pro`:
+
+```bash
+cd Bayesian-Agent
+export GENERICAGENT_ROOT="/path/to/GenericAgent"
+export DEEPSEEK_API_KEY="sk-..."
+export MODEL="deepseek-v4-flash"
+"$GENERICAGENT_ROOT/.venv/bin/python" \
+  experiments/run_sop_lifelong.py \
+  --genericagent-root "$GENERICAGENT_ROOT" \
+  --model "$MODEL" \
+  --mode all \
+  --bench core \
+  --out-root "temp/sop_lifelong_${MODEL//-/_}"
+```
+
+Use `--limit 1` for a smoke test before running the full benchmark.
+
+Run incremental repair against an existing GA baseline by passing its result files. The script reruns only failed tasks:
+
+```bash
+"$GENERICAGENT_ROOT/.venv/bin/python" \
+  experiments/run_sop_lifelong.py \
+  --genericagent-root "$GENERICAGENT_ROOT" \
+  --model "$MODEL" \
+  --mode bayesian-incremental \
+  --bench core \
+  --baseline-results artifacts/ga_deepseek_baseline/sop_results.json \
+  --baseline-results artifacts/ga_deepseek_baseline/lifelong_results.json \
+  --out-root "temp/sop_lifelong_${MODEL//-/_}_incremental_from_ga"
+```
+
 ## 🐍 Python API
 
 ```python
@@ -262,13 +294,28 @@ In incremental mode, Bayesian-Agent only reran failed GenericAgent tasks:
 
 | Benchmark | Agent | Model | Final Accuracy | Incremental Input | Incremental Output | Incremental Total | Incremental Efficiency |
 |---|---|---|---:|---:|---:|---:|---:|
-| SOP-Bench | GA+BayesianIncremental | deepseek-v4-flash | 100% | 216k | 10k | 226k | 17.73 |
-| Lifelong AgentBench | GA+BayesianIncremental | deepseek-v4-flash | 100% | 71k | 7k | 78k | 25.57 |
+| SOP-Bench | GA+BayesianIncremental | deepseek-v4-flash | 100% | 254k | 14k | 268k | 14.93 |
+| Lifelong AgentBench | GA+BayesianIncremental | deepseek-v4-flash | 100% | 129k | 10k | 139k | 14.41 |
 
 The result shows that Bayesian-Agent can work as a plug-in repair layer: it can take an existing agent below 100% accuracy and improve it with a small amount of incremental inference. This is the practical advantage over one-off benchmark agents: Bayesian-Agent can sit beside a harness, learn from its failures, and improve it without replacing it.
 
 Experiment artifacts are stored under [`artifacts/`](artifacts/), and the method note is in [`docs/method.md`](docs/method.md).
 
+To reproduce the same experiment shape with another model, change only `--model` and `--out-root`:
+
+```bash
+export MODEL="deepseek-v4-pro"
+"$GENERICAGENT_ROOT/.venv/bin/python" \
+  experiments/run_sop_lifelong.py \
+  --genericagent-root "$GENERICAGENT_ROOT" \
+  --model "$MODEL" \
+  --mode all \
+  --bench core \
+  --out-root "temp/sop_lifelong_${MODEL//-/_}"
+```
+
+The script runs three phases by default: GA baseline, Bayesian full self-evolution, and Bayesian incremental repair using the fresh baseline for the selected model. Results are written to `<out-root>/summary.md`.
+
 ## 🔌 GenericAgent and Cross-Harness Adaptation
 
 The first prototype was validated inside GenericAgent, but Bayesian-Agent is not a GenericAgent fork and not just a GenericAgent add-on.
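
Incremental mode reruns only the tasks that failed in the baseline. A minimal sketch of that selection step follows, under an assumed result format: each baseline file is taken to hold a list of rows with hypothetical `task_id` and `success` keys, which may not match the real `sop_results.json` layout. The sketch also mirrors the `${MODEL//-/_}` shell expansion used for `--out-root`.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable, List


def failed_task_ids(baseline_files: Iterable[Path]) -> List[str]:
    """Collect IDs of failed tasks across several baseline result files."""
    failed: List[str] = []
    for path in baseline_files:
        rows = json.loads(Path(path).read_text(encoding="utf-8"))
        # Assumed row shape: {"task_id": str, "success": bool}
        failed.extend(row["task_id"] for row in rows if not row.get("success"))
    return failed


def out_root_for(model: str, suffix: str = "") -> str:
    """Python equivalent of temp/sop_lifelong_${MODEL//-/_} plus an optional suffix."""
    return f"temp/sop_lifelong_{model.replace('-', '_')}{suffix}"


with tempfile.TemporaryDirectory() as tmp:
    sop = Path(tmp) / "sop_results.json"
    lifelong = Path(tmp) / "lifelong_results.json"
    sop.write_text(json.dumps([
        {"task_id": "sop-1", "success": True},
        {"task_id": "sop-2", "success": False},
    ]), encoding="utf-8")
    lifelong.write_text(json.dumps([
        {"task_id": "ll-1", "success": False},
    ]), encoding="utf-8")
    to_rerun = failed_task_ids([sop, lifelong])
```

Accepting several files matches the repeated `--baseline-results` flag in the commands above, one file per benchmark.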
 
@@ -19,7 +19,7 @@ Bayesian-Agent 是一个面向跨 Agent framework / execution harness 的 Bayesi
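 - **Repair incrementally**: attach to an existing agent, read its failed trajectories, and rerun only the tasks that need repair.
 - **Adapt across harnesses**: currently adapted to GenericAgent, with other agent frameworks able to plug in later through a unified trajectory schema and adapter boundary.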
 
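-> v0.4 is the first standalone open-source release. It includes the Bayesian Skill Evolution core package, schemas, CLI tools, experiment artifacts, and a clean GenericAgent integration boundary. GenericAgent itself is not copied, vendored, or forked into this repository.
+> v0.4 is the first standalone open-source release. It includes the Bayesian Skill Evolution core package, schemas, CLI tools, experiment artifacts, and a runnable GenericAgent adapter boundary. GenericAgent itself is not copied, vendored, or forked into this repository.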
 
 ## 📅 News
 
@@ -182,6 +182,38 @@ bayesian-agent summarize \
   --out temp/summary.json
 ```
 
+Run a live GenericAgent-backed SOP/Lifelong experiment. Use `--model` to switch between `deepseek-v4-flash` and `deepseek-v4-pro`:
+
+```bash
+cd Bayesian-Agent
+export GENERICAGENT_ROOT="/path/to/GenericAgent"
+export DEEPSEEK_API_KEY="sk-..."
+export MODEL="deepseek-v4-flash"
+"$GENERICAGENT_ROOT/.venv/bin/python" \
+  experiments/run_sop_lifelong.py \
+  --genericagent-root "$GENERICAGENT_ROOT" \
+  --model "$MODEL" \
+  --mode all \
+  --bench core \
+  --out-root "temp/sop_lifelong_${MODEL//-/_}"
+```
+
+For a smoke test first, add `--limit 1`, then run the full benchmark once the script and token accounting look correct.
+
+To run incremental repair on top of an existing GA baseline, pass its result files through `--baseline-results`. The script reruns only failed tasks:
+
+```bash
+"$GENERICAGENT_ROOT/.venv/bin/python" \
+  experiments/run_sop_lifelong.py \
+  --genericagent-root "$GENERICAGENT_ROOT" \
+  --model "$MODEL" \
+  --mode bayesian-incremental \
+  --bench core \
+  --baseline-results artifacts/ga_deepseek_baseline/sop_results.json \
+  --baseline-results artifacts/ga_deepseek_baseline/lifelong_results.json \
+  --out-root "temp/sop_lifelong_${MODEL//-/_}_incremental_from_ga"
+```
+
 ## 🐍 Python API
 
 ```python
@@ -262,13 +294,28 @@ v0.4 原型基于 GenericAgent 与 `deepseek-v4-flash`，在 SOP-Bench 和 Lifel
 
 | Benchmark | Agent | Model | Final Accuracy | Incremental Input | Incremental Output | Incremental Total | Incremental Efficiency |
 |---|---|---|---:|---:|---:|---:|---:|
-| SOP-Bench | GA+BayesianIncremental | deepseek-v4-flash | 100% | 216k | 10k | 226k | 17.73 |
-| Lifelong AgentBench | GA+BayesianIncremental | deepseek-v4-flash | 100% | 71k | 7k | 78k | 25.57 |
+| SOP-Bench | GA+BayesianIncremental | deepseek-v4-flash | 100% | 254k | 14k | 268k | 14.93 |
+| Lifelong AgentBench | GA+BayesianIncremental | deepseek-v4-flash | 100% | 129k | 10k | 139k | 14.41 |
 
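 This shows that Bayesian-Agent can serve as a plug-and-play repair layer: attached to an agent that has not reached 100% accuracy, it fixes the failed tasks at a small incremental inference cost. This is also what sets it apart from an ordinary benchmark agent: it can sit beside a harness, learn from its failures, and improve it without replacing it.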
 
 Experiment artifacts are stored under [`artifacts/`](artifacts/), and the method note is in [`docs/method.md`](docs/method.md).
 
+To reproduce the same experiment shape with another model, change only `--model` and `--out-root`:
+
+```bash
+export MODEL="deepseek-v4-pro"
+"$GENERICAGENT_ROOT/.venv/bin/python" \
+  experiments/run_sop_lifelong.py \
+  --genericagent-root "$GENERICAGENT_ROOT" \
+  --model "$MODEL" \
+  --mode all \
+  --bench core \
+  --out-root "temp/sop_lifelong_${MODEL//-/_}"
+```
+
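+By default the script runs three phases in sequence: GA baseline, Bayesian full self-evolution, and Bayesian incremental repair based on the fresh baseline for the selected model. Results are written to `<out-root>/summary.md`.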
+
 ## 🔌 GenericAgent and Cross-Harness Adaptation
 
 The first prototype was validated inside GenericAgent, but Bayesian-Agent is not a GenericAgent fork and not merely a GenericAgent add-on module.