Commit 4e29cf5

feat: unify benchmark runner
1 parent 6125b29 commit 4e29cf5

16 files changed

Lines changed: 1647 additions & 249 deletions

README.md

Lines changed: 12 additions & 13 deletions
@@ -223,36 +223,36 @@ bayesian-agent summarize \
   --out temp/summary.json
 ```
 
-Run a live GenericAgent-backed SOP/Lifelong experiment. Use `--model` to switch between `deepseek-v4-flash` and `deepseek-v4-pro`:
+Run a live GenericAgent-backed benchmark experiment. Use the same script for SOP-Bench, Lifelong AgentBench, and RealFin-Bench; switch benchmarks with `--bench core`, `--bench sop`, `--bench lifelong`, or `--bench realfin`. Use `--model` to switch between `deepseek-v4-flash` and `deepseek-v4-pro`:
 
 ```bash
 cd Bayesian-Agent
 export GENERICAGENT_ROOT="/path/to/GenericAgent"
 export DEEPSEEK_API_KEY="sk-..."
 export MODEL="deepseek-v4-flash"
 "$GENERICAGENT_ROOT/.venv/bin/python" \
-  experiments/run_sop_lifelong.py \
+  experiments/run_benchmarks.py \
   --genericagent-root "$GENERICAGENT_ROOT" \
   --model "$MODEL" \
   --mode all \
-  --bench core \
-  --out-root "temp/sop_lifelong_${MODEL//-/_}"
+  --bench core
 ```
 
-Use `--limit 1` for a smoke test before running the full benchmark.
+With `--bench core`, the runner fans out into separate benchmark roots instead of sharing one combined directory: `results/sop_${MODEL//-/_}` and `results/lifelong_${MODEL//-/_}`. If you pass `--out-root temp/core_${MODEL//-/_}`, it is treated as a parent directory and the runs go to `temp/core_${MODEL//-/_}/sop` and `temp/core_${MODEL//-/_}/lifelong`.
+
+Use `--limit 1` for a smoke test before running the full benchmark. For RealFin-Bench, keep the same command shape and set `--bench realfin`; the default root becomes `results/realfin_${MODEL//-/_}`.
 
 Run incremental repair against an existing GA baseline by passing its result files. The script reruns only failed tasks:
 
 ```bash
 "$GENERICAGENT_ROOT/.venv/bin/python" \
-  experiments/run_sop_lifelong.py \
+  experiments/run_benchmarks.py \
   --genericagent-root "$GENERICAGENT_ROOT" \
   --model "$MODEL" \
   --mode bayesian-incremental \
   --bench core \
   --baseline-results artifacts/ga_deepseek_baseline/sop_results.json \
-  --baseline-results artifacts/ga_deepseek_baseline/lifelong_results.json \
-  --out-root "temp/sop_lifelong_${MODEL//-/_}_incremental_from_ga"
+  --baseline-results artifacts/ga_deepseek_baseline/lifelong_results.json
 ```
 
 ## 🐍 Python API
@@ -344,20 +344,19 @@ The result shows that Bayesian-Agent can work as a plug-in repair layer: it can

 Experiment artifacts are stored under [`artifacts/`](artifacts/), and the method note is in [`docs/method.md`](docs/method.md).
 
-To reproduce the same experiment shape with another model, change only `--model` and `--out-root`:
+To reproduce the same experiment shape with another model, change only `--model`:
 
 ```bash
 export MODEL="deepseek-v4-pro"
 "$GENERICAGENT_ROOT/.venv/bin/python" \
-  experiments/run_sop_lifelong.py \
+  experiments/run_benchmarks.py \
   --genericagent-root "$GENERICAGENT_ROOT" \
   --model "$MODEL" \
   --mode all \
-  --bench core \
-  --out-root "temp/sop_lifelong_${MODEL//-/_}"
+  --bench core
 ```
 
-The script runs three phases by default: GA baseline, Bayesian full self-evolution, and Bayesian incremental repair using the fresh baseline for the selected model. Results are written to `<out-root>/summary.md`.
+The script runs three phases by default: GA baseline, Bayesian full self-evolution, and Bayesian incremental repair using the fresh baseline for the selected model. Each selected benchmark writes its own `summary.md` under its benchmark-specific result root.
 
 ## 🔌 GenericAgent and Cross-Harness Adaptation

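The result-root behavior described in the README changes can be sketched in a few lines of Python. This is a minimal illustration, not the code in `experiments/run_benchmarks.py`: the function name `resolve_out_roots` and the constant `CORE_BENCHES` are hypothetical, and the handling of `--out-root` for a single benchmark is an assumption, since the README only documents the `--bench core` case.

```python
from pathlib import Path
from typing import Dict, Optional

# Hypothetical constant: the benchmarks that `--bench core` expands to, per the README text.
CORE_BENCHES = ("sop", "lifelong")


def resolve_out_roots(bench: str, model: str, out_root: Optional[str] = None) -> Dict[str, Path]:
    """Map each selected benchmark to the directory its results would go to."""
    slug = model.replace("-", "_")  # same effect as ${MODEL//-/_} in bash
    benches = CORE_BENCHES if bench == "core" else (bench,)
    if out_root is None:
        # Default: one root per benchmark, e.g. results/sop_deepseek_v4_flash
        return {name: Path("results") / f"{name}_{slug}" for name in benches}
    if bench == "core":
        # An explicit root is treated as a parent directory for the fan-out.
        return {name: Path(out_root) / name for name in benches}
    # Assumption: the README does not say how a single benchmark treats --out-root;
    # this sketch uses the path as given.
    return {bench: Path(out_root)}


for name, root in resolve_out_roots("core", "deepseek-v4-flash").items():
    print(name, root)
```

Keeping one root per benchmark means each `summary.md` describes exactly one benchmark, which matches the last sentence of the updated README paragraph.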
README_ZH.md

Lines changed: 12 additions & 13 deletions
@@ -223,36 +223,36 @@ bayesian-agent summarize \
   --out temp/summary.json
 ```
 
-Run a live GenericAgent-backed SOP/Lifelong experiment. Use `--model` to switch between `deepseek-v4-flash` and `deepseek-v4-pro`:
+Run a live GenericAgent-backed benchmark experiment. SOP-Bench, Lifelong AgentBench, and RealFin-Bench all use the same script; switch benchmarks with `--bench core`, `--bench sop`, `--bench lifelong`, or `--bench realfin`. Use `--model` to switch between `deepseek-v4-flash` and `deepseek-v4-pro`:
 
 ```bash
 cd Bayesian-Agent
 export GENERICAGENT_ROOT="/path/to/GenericAgent"
 export DEEPSEEK_API_KEY="sk-..."
 export MODEL="deepseek-v4-flash"
 "$GENERICAGENT_ROOT/.venv/bin/python" \
-  experiments/run_sop_lifelong.py \
+  experiments/run_benchmarks.py \
   --genericagent-root "$GENERICAGENT_ROOT" \
   --model "$MODEL" \
   --mode all \
-  --bench core \
-  --out-root "temp/sop_lifelong_${MODEL//-/_}"
+  --bench core
 ```
 
-For a smoke test, add `--limit 1` first and confirm that the script and token accounting work before running the full benchmark.
+With `--bench core`, the runner fans out to separate benchmark roots instead of sharing one combined directory: `results/sop_${MODEL//-/_}` and `results/lifelong_${MODEL//-/_}`. If you explicitly pass `--out-root temp/core_${MODEL//-/_}`, it is treated as a parent directory, and the results are written to `temp/core_${MODEL//-/_}/sop` and `temp/core_${MODEL//-/_}/lifelong`.
+
+For a smoke test, add `--limit 1` first and confirm that the script and token accounting work before running the full benchmark. RealFin-Bench keeps the same command shape: change `--bench` to `realfin`; the default root is `results/realfin_${MODEL//-/_}`.
 
 To run incremental repair on top of an existing GA baseline, pass its result files via `--baseline-results`. The script reruns only the failed tasks:
 
 ```bash
 "$GENERICAGENT_ROOT/.venv/bin/python" \
-  experiments/run_sop_lifelong.py \
+  experiments/run_benchmarks.py \
   --genericagent-root "$GENERICAGENT_ROOT" \
   --model "$MODEL" \
   --mode bayesian-incremental \
   --bench core \
   --baseline-results artifacts/ga_deepseek_baseline/sop_results.json \
-  --baseline-results artifacts/ga_deepseek_baseline/lifelong_results.json \
-  --out-root "temp/sop_lifelong_${MODEL//-/_}_incremental_from_ga"
+  --baseline-results artifacts/ga_deepseek_baseline/lifelong_results.json
 ```
 
 ## 🐍 Python API
@@ -344,20 +344,19 @@ The v0.4 prototype is built on GenericAgent and `deepseek-v4-flash`; on SOP-Bench and Lifel

 Experiment artifacts are in [`artifacts/`](artifacts/), and the method note is in [`docs/method.md`](docs/method.md).
 
-To reproduce the same experiment shape with another model, change only `--model` and `--out-root`:
+To reproduce the same experiment shape with another model, change only `--model`:
 
 ```bash
 export MODEL="deepseek-v4-pro"
 "$GENERICAGENT_ROOT/.venv/bin/python" \
-  experiments/run_sop_lifelong.py \
+  experiments/run_benchmarks.py \
   --genericagent-root "$GENERICAGENT_ROOT" \
   --model "$MODEL" \
   --mode all \
-  --bench core \
-  --out-root "temp/sop_lifelong_${MODEL//-/_}"
+  --bench core
 ```
 
-By default, three phases run in sequence: GA baseline, Bayesian full self-evolution, and Bayesian incremental repair based on the fresh baseline of the selected model. Results are written to `<out-root>/summary.md`.
+By default, three phases run in sequence: GA baseline, Bayesian full self-evolution, and Bayesian incremental repair based on the fresh baseline of the selected model. Each selected benchmark writes its own `summary.md` under its benchmark-specific result root.
 
 ## 🔌 GenericAgent and Cross-Harness Adaptation

bayesian_agent/adapters/generic_agent.py

Lines changed: 5 additions & 0 deletions
@@ -58,6 +58,8 @@ class GenericAgentAdapter:
     protocol: str = "openai"
     max_tokens: int = 8192
     context_win: int = 50000
+    verify_ssl: bool = True
+    host_header: str = ""
 
     def integration_note(self) -> str:
         return (
@@ -135,7 +137,10 @@ def model_configs(self, api_key: str) -> MutableMapping[str, Mapping[str, Any]]:
             "read_timeout": 180,
             "max_tokens": self.max_tokens,
             "context_win": self.context_win,
+            "verify": self.verify_ssl,
         }
+        if self.host_header:
+            common["host_header"] = self.host_header
         if self.protocol == "anthropic":
             return {
                 "native_claude_config_bayesian_agent": {
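The effect of the two new adapter fields can be illustrated with a small stand-in class. This is a sketch under assumptions: `AdapterSketch` and `common_config` are hypothetical names, and only the keys visible in the hunk above are reproduced, so the real `GenericAgentAdapter.model_configs` carries more fields than shown here.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class AdapterSketch:
    """Hypothetical stand-in for GenericAgentAdapter with only the fields shown in the diff."""

    max_tokens: int = 8192
    context_win: int = 50000
    verify_ssl: bool = True
    host_header: str = ""

    def common_config(self) -> Dict[str, Any]:
        common: Dict[str, Any] = {
            "read_timeout": 180,
            "max_tokens": self.max_tokens,
            "context_win": self.context_win,
            "verify": self.verify_ssl,  # always present, defaults to True
        }
        if self.host_header:  # the key is added only for a non-empty header
            common["host_header"] = self.host_header
        return common


print(AdapterSketch().common_config())
print(AdapterSketch(verify_ssl=False, host_header="api.internal.example").common_config())
```

With the defaults, the resulting mapping gains only `"verify": True`, so existing configurations stay unchanged apart from that one explicit key.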
Lines changed: 2 additions & 1 deletion
@@ -1,5 +1,6 @@
 """Benchmark orchestration helpers owned by Bayesian-Agent."""
 
 from bayesian_agent.benchmarks.evolution import build_benchmark_skill_context, classify_failure
+from bayesian_agent.benchmarks.realfin import run_realfin
 
-__all__ = ["build_benchmark_skill_context", "classify_failure"]
+__all__ = ["build_benchmark_skill_context", "classify_failure", "run_realfin"]

bayesian_agent/benchmarks/evolution.py

Lines changed: 97 additions & 0 deletions
@@ -38,6 +38,21 @@ def classify_failure(benchmark: str, run: Mapping[str, Any]) -> str:
             return "wrote_transcript_instead_of_sql_after_workspace_confusion"
         if run.get("error"):
             return str(run.get("error"))[:160]
+    if benchmark == "realfin_benchmark":
+        scores = dict(run.get("scores") or {})
+        if run.get("error"):
+            return str(run.get("error"))[:160]
+        if scores.get("file_created") == 0.0:
+            transcript = str(run.get("transcript") or "")
+            if "could not convert string to float" in transcript or "ValueError" in transcript:
+                return "blank_ohlcv_field_crashed_calculation"
+            return "missing_requested_output_file"
+        if any(float(scores.get(key) or 0.0) < 1.0 for key in _realfin_format_score_keys(scores)):
+            return "invalid_realfin_output_format"
+        if any(float(scores.get(key) or 0.0) < 1.0 for key in _realfin_analysis_trace_score_keys(scores)):
+            return "missing_required_analysis_trace"
+        if scores:
+            return "realfin_automated_check_failed"
     return str(run.get("error") or "benchmark_failure")[:160]

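The order of checks in the new `realfin_benchmark` branch determines which label a run gets when several scores are low. The sketch below replays that cascade on a few hand-made run records. It is an illustration only: `classify_realfin` is a hypothetical name, `FORMAT_KEYS` is an abbreviated stand-in for the commit's `_realfin_format_score_keys` helper, and the sample run dictionaries are invented.

```python
from typing import Any, Mapping

# Abbreviated key set for illustration; the commit's helper lists more keys.
FORMAT_KEYS = {"valid_codes", "valid_format", "sorted_desc"}


def _is_trace_key(key: str) -> bool:
    # Mirrors the suffix checks in the diff, plus the one non-suffix key.
    return key.endswith("_computed") or key.endswith("_checked") or key == "data_fetched"


def classify_realfin(run: Mapping[str, Any]) -> str:
    """Replay the RealFin cascade: error, missing file, format, trace, generic."""
    scores = dict(run.get("scores") or {})
    if run.get("error"):
        return str(run["error"])[:160]
    if scores.get("file_created") == 0.0:
        transcript = str(run.get("transcript") or "")
        if "could not convert string to float" in transcript or "ValueError" in transcript:
            return "blank_ohlcv_field_crashed_calculation"
        return "missing_requested_output_file"
    if any(float(scores.get(k) or 0.0) < 1.0 for k in scores if k in FORMAT_KEYS):
        return "invalid_realfin_output_format"
    if any(float(scores.get(k) or 0.0) < 1.0 for k in scores if _is_trace_key(k)):
        return "missing_required_analysis_trace"
    if scores:
        return "realfin_automated_check_failed"
    return "benchmark_failure"


crashed = {"scores": {"file_created": 0.0}, "transcript": "ValueError: could not convert string to float: ''"}
both_low = {"scores": {"file_created": 1.0, "valid_format": 0.5, "macd_computed": 0.0}}
trace_only = {"scores": {"file_created": 1.0, "valid_format": 1.0, "macd_computed": 0.0}}

print(classify_realfin(crashed))
print(classify_realfin(both_low))    # format problems win over trace problems
print(classify_realfin(trace_only))
```

Note that a run whose listed scores are all `1.0` still falls through to `realfin_automated_check_failed`, so the function appears to assume it is only called on runs that already failed.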
@@ -155,6 +170,18 @@ def _stable_rules(benchmark: str):
             "For mutation tasks, write executable SQL that reproduces the expected table state.",
             "If SQL ranking is needed, express ranking inside a subquery and keep the final output to one SQL statement.",
         ]
+    if benchmark == "realfin_benchmark":
+        return [
+            "Read `task.json` and `realfin_cache_manifest.json` in the current workspace before calculating.",
+            "Use the local `api_cache` symlink for market data; do not call EastMoney historical endpoints such as `push2his.eastmoney.com`.",
+            "Create exactly the requested output file in the workspace; do not wrap the file content in Markdown.",
+            "Map 创业板 code `300XXX` to baostock CSV `api_cache/baostock/daily_qfq_20230101_20260331/sz.300XXX.csv`.",
+            "When writing stock codes to output files, strip cache market prefixes unless explicitly requested: use `300531`, not `sz.300531`.",
+            "Use auxiliary baostock cache for indexes such as `sh.000001` and `sz.399006`.",
+            "Use Tencent ETF cache files for ETF symbols such as `sz159642` or `sh511010`.",
+            "When a task asks for indicators or constraints, compute them from cached OHLCV data and keep the output format aligned with the prompt.",
+            "Filter cached rows to valid trading rows with non-empty numeric OHLCV fields; skip blank rows instead of crashing numeric conversion.",
+        ]
     return []

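Two of the stable rules above describe a mechanical mapping between stock codes and cache file names. A minimal sketch of what an agent following those rules might compute is shown below. The helper names `chinext_cache_path` and `strip_market_prefix` are hypothetical and do not appear in the commit; the directory name is copied from the rule text.

```python
import re

# Directory name taken from the rule text; the helper names below are hypothetical.
BAOSTOCK_DAILY_DIR = "api_cache/baostock/daily_qfq_20230101_20260331"


def chinext_cache_path(code: str) -> str:
    """Return the cached CSV path for a six-digit 创业板 code such as 300531."""
    if not re.fullmatch(r"300\d{3}", code):
        raise ValueError(f"not a 300XXX code: {code!r}")
    return f"{BAOSTOCK_DAILY_DIR}/sz.{code}.csv"


def strip_market_prefix(symbol: str) -> str:
    """Drop a leading `sz.` or `sh.` cache prefix, e.g. sz.300531 becomes 300531."""
    return re.sub(r"^(sz|sh)\.", "", symbol)


print(chinext_cache_path("300531"))
print(strip_market_prefix("sz.300531"))
```

The sketch only covers the dotted prefixes named in the rules; ETF symbols such as `sz159642` have no dot and pass through unchanged.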
@@ -201,9 +228,79 @@ def _patch_rule_catalog(benchmark: str):
                 "Read only the current task workspace and avoid copying content from sibling benchmark runs.",
             ],
         }
+    if benchmark == "realfin_benchmark":
+        return {
+            "missing_requested_output_file": [
+                "Before finishing, list the task's requested `.txt` output file and verify it exists in the workspace.",
+                "If calculations find no qualifying symbols, still create the requested file with the task-accepted empty-result wording or header.",
+            ],
+            "blank_ohlcv_field_crashed_calculation": [
+                "When reading cached CSV files, skip rows where open/high/low/close/volume is blank or non-numeric.",
+                "Filter to `tradestatus == 1` where available before indicator calculations.",
+                "After handling sparse rows, re-run the calculation and create the requested output file.",
+            ],
+            "invalid_realfin_output_format": [
+                "Match the prompt's output format exactly: headers, comma-separated columns, code format, numeric precision, and sort order.",
+                "For stock-code outputs, strip cache prefixes like `sz.` and `sh.` unless the task explicitly requests prefixed codes.",
+                "Re-read the output file and validate it against the task's automated format constraints before finishing.",
+            ],
+            "missing_required_analysis_trace": [
+                "Run an explicit calculation script over cached OHLCV data and mention the required indicators or checks in the analysis transcript.",
+                "For indicator tasks, use the task's exact indicator names such as MACD, RSI, KDJ, Bollinger, MA, volume, ATR, or correlation.",
+            ],
+        }
     return {}
 
 
+def _realfin_format_score_keys(scores: Mapping[str, Any]):
+    return [
+        key
+        for key in scores
+        if key
+        in {
+            "valid_codes",
+            "valid_format",
+            "valid_values",
+            "reasonable_values",
+            "five_records",
+            "three_records",
+            "has_count",
+            "has_ratio",
+            "correlation_value",
+            "correlation_type",
+            "consistency",
+            "sorted_desc",
+            "valid_dates",
+            "count_limit",
+            "price_positive",
+            "vol_negative",
+            "divergence_positive",
+        }
+    ]
+
+
+def _realfin_analysis_trace_score_keys(scores: Mapping[str, Any]):
+    return [
+        key
+        for key in scores
+        if key.endswith("_computed")
+        or key.endswith("_checked")
+        or key
+        in {
+            "data_fetched",
+            "kdj_computed",
+            "macd_computed",
+            "rsi_computed",
+            "ma_computed",
+            "bollinger_computed",
+            "histogram_computed",
+            "consecutive_checked",
+            "volume_checked",
+            "range_checked",
+        }
+    ]
+
+
 def _benchmark_skill_state(benchmark: str, registry: BayesianSkillRegistry):
     skill_id = f"benchmark/{benchmark}"
     known = skill_id in registry.data.get("skills", {})
