DataArcTech
diff --git a/README.md b/README.md
Lines changed: 12 additions & 13 deletions
diff --git a/README_ZH.md b/README_ZH.md
Lines changed: 12 additions & 13 deletions
diff --git a/bayesian_agent/adapters/generic_agent.py b/bayesian_agent/adapters/generic_agent.py
Lines changed: 5 additions & 0 deletions
diff --git a/bayesian_agent/benchmarks/__init__.py b/bayesian_agent/benchmarks/__init__.py
Lines changed: 2 additions & 1 deletion
diff --git a/bayesian_agent/benchmarks/evolution.py b/bayesian_agent/benchmarks/evolution.py
Lines changed: 97 additions & 0 deletions
@@ -223,36 +223,36 @@ bayesian-agent summarize \
   --out temp/summary.json
 ```
 
-Run a live GenericAgent-backed SOP/Lifelong experiment. Use `--model` to switch between `deepseek-v4-flash` and `deepseek-v4-pro`:
+Run a live GenericAgent-backed benchmark experiment. Use the same script for SOP-Bench, Lifelong AgentBench, and RealFin-Bench; switch benchmarks with `--bench core`, `--bench sop`, `--bench lifelong`, or `--bench realfin`. Use `--model` to switch between `deepseek-v4-flash` and `deepseek-v4-pro`:
 
 ```bash
 cd Bayesian-Agent
 export GENERICAGENT_ROOT="/path/to/GenericAgent"
 export DEEPSEEK_API_KEY="sk-..."
 export MODEL="deepseek-v4-flash"
 "$GENERICAGENT_ROOT/.venv/bin/python" \
-  experiments/run_sop_lifelong.py \
+  experiments/run_benchmarks.py \
   --genericagent-root "$GENERICAGENT_ROOT" \
   --model "$MODEL" \
   --mode all \
-  --bench core \
-  --out-root "temp/sop_lifelong_${MODEL//-/_}"
+  --bench core
 ```
 
-Use `--limit 1` for a smoke test before running the full benchmark.
+With `--bench core`, the runner fans out into separate benchmark roots instead of sharing one combined directory: `results/sop_${MODEL//-/_}` and `results/lifelong_${MODEL//-/_}`. If you pass `--out-root temp/core_${MODEL//-/_}`, it is treated as a parent directory and the runs go to `temp/core_${MODEL//-/_}/sop` and `temp/core_${MODEL//-/_}/lifelong`.
+
+Use `--limit 1` for a smoke test before running the full benchmark. For RealFin-Bench, keep the same command shape and set `--bench realfin`; the default root becomes `results/realfin_${MODEL//-/_}`.
 
 Run incremental repair against an existing GA baseline by passing its result files. The script reruns only failed tasks:
 
 ```bash
 "$GENERICAGENT_ROOT/.venv/bin/python" \
-  experiments/run_sop_lifelong.py \
+  experiments/run_benchmarks.py \
   --genericagent-root "$GENERICAGENT_ROOT" \
   --model "$MODEL" \
   --mode bayesian-incremental \
   --bench core \
   --baseline-results artifacts/ga_deepseek_baseline/sop_results.json \
-  --baseline-results artifacts/ga_deepseek_baseline/lifelong_results.json \
-  --out-root "temp/sop_lifelong_${MODEL//-/_}_incremental_from_ga"
+  --baseline-results artifacts/ga_deepseek_baseline/lifelong_results.json
 ```
 
 ## 🐍 Python API
@@ -344,20 +344,19 @@ The result shows that Bayesian-Agent can work as a plug-in repair layer: it can
 
 Experiment artifacts are stored under [`artifacts/`](artifacts/), and the method note is in [`docs/method.md`](docs/method.md).
 
-To reproduce the same experiment shape with another model, change only `--model` and `--out-root`:
+To reproduce the same experiment shape with another model, change only `--model`:
 
 ```bash
 export MODEL="deepseek-v4-pro"
 "$GENERICAGENT_ROOT/.venv/bin/python" \
-  experiments/run_sop_lifelong.py \
+  experiments/run_benchmarks.py \
   --genericagent-root "$GENERICAGENT_ROOT" \
   --model "$MODEL" \
   --mode all \
-  --bench core \
-  --out-root "temp/sop_lifelong_${MODEL//-/_}"
+  --bench core
 ```
 
-The script runs three phases by default: GA baseline, Bayesian full self-evolution, and Bayesian incremental repair using the fresh baseline for the selected model. Results are written to `<out-root>/summary.md`.
+The script runs three phases by default: GA baseline, Bayesian full self-evolution, and Bayesian incremental repair using the fresh baseline for the selected model. Each selected benchmark writes its own `summary.md` under its benchmark-specific result root.
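Review note: the README changes in this file describe how result roots are chosen once `--out-root` is no longer passed. A minimal sketch of that rule follows, assuming the behaviour is exactly what the README text states. `BENCH_GROUPS`, `model_slug`, and `resolve_result_roots` are hypothetical names used only for illustration and are not taken from `experiments/run_benchmarks.py`. The handling of `--out-root` for a single benchmark is not described in the README, so that branch is an assumption.

```python
from pathlib import PurePosixPath
from typing import Dict, Optional

# Hypothetical grouping: the README only states that `core` covers SOP and Lifelong.
BENCH_GROUPS = {
    "core": ["sop", "lifelong"],
    "sop": ["sop"],
    "lifelong": ["lifelong"],
    "realfin": ["realfin"],
}


def model_slug(model: str) -> str:
    """Mirror the shell expansion ${MODEL//-/_} (replace every '-' with '_')."""
    return model.replace("-", "_")


def resolve_result_roots(
    bench: str, model: str, out_root: Optional[str] = None
) -> Dict[str, str]:
    """Return one result root per selected benchmark."""
    roots: Dict[str, str] = {}
    for name in BENCH_GROUPS[bench]:
        if out_root is None:
            # Default documented in the README: results/<bench>_<model slug>
            roots[name] = f"results/{name}_{model_slug(model)}"
        elif bench == "core":
            # Documented: with `core`, --out-root is treated as a parent directory.
            roots[name] = str(PurePosixPath(out_root) / name)
        else:
            # Assumption: a single benchmark uses --out-root as given.
            roots[name] = out_root
    return roots
```

Under these assumptions, `resolve_result_roots("core", "deepseek-v4-flash")` yields the two default roots named in the README, and passing `out_root="temp/core_deepseek_v4_flash"` yields the `/sop` and `/lifelong` subdirectories.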
 
 ## 🔌 GenericAgent and Cross-Harness Adaptation
 
 
@@ -223,36 +223,36 @@ bayesian-agent summarize \
   --out temp/summary.json
 ```
 
-跑一次真实 GenericAgent-backed SOP/Lifelong 实验。用 `--model` 在 `deepseek-v4-flash` 和 `deepseek-v4-pro` 之间切换：
+跑一次真实 GenericAgent-backed benchmark 实验。SOP-Bench、Lifelong AgentBench 和 RealFin-Bench 都用同一个脚本；通过 `--bench core`、`--bench sop`、`--bench lifelong` 或 `--bench realfin` 切换。用 `--model` 在 `deepseek-v4-flash` 和 `deepseek-v4-pro` 之间切换：
 
 ```bash
 cd Bayesian-Agent
 export GENERICAGENT_ROOT="/path/to/GenericAgent"
 export DEEPSEEK_API_KEY="sk-..."
 export MODEL="deepseek-v4-flash"
 "$GENERICAGENT_ROOT/.venv/bin/python" \
-  experiments/run_sop_lifelong.py \
+  experiments/run_benchmarks.py \
   --genericagent-root "$GENERICAGENT_ROOT" \
   --model "$MODEL" \
   --mode all \
-  --bench core \
-  --out-root "temp/sop_lifelong_${MODEL//-/_}"
+  --bench core
 ```
 
-想先 smoke test 可以加 `--limit 1`，确认脚本和 token 统计正常后再跑全量。
+使用 `--bench core` 时，runner 会 fan-out 到独立 benchmark root，而不是共用一个组合目录：`results/sop_${MODEL//-/_}` 和 `results/lifelong_${MODEL//-/_}`。如果显式传 `--out-root temp/core_${MODEL//-/_}`，它会被当作父目录，实际结果写到 `temp/core_${MODEL//-/_}/sop` 和 `temp/core_${MODEL//-/_}/lifelong`。
+
+想先 smoke test 可以加 `--limit 1`，确认脚本和 token 统计正常后再跑全量。RealFin-Bench 也保持同样命令形态，把 `--bench` 改成 `realfin` 即可，默认 root 是 `results/realfin_${MODEL//-/_}`。
 
 如果要接一个已有 GA baseline 做增量修复，把结果文件通过 `--baseline-results` 传进来即可。脚本只会重跑失败任务：
 
 ```bash
 "$GENERICAGENT_ROOT/.venv/bin/python" \
-  experiments/run_sop_lifelong.py \
+  experiments/run_benchmarks.py \
   --genericagent-root "$GENERICAGENT_ROOT" \
   --model "$MODEL" \
   --mode bayesian-incremental \
   --bench core \
   --baseline-results artifacts/ga_deepseek_baseline/sop_results.json \
-  --baseline-results artifacts/ga_deepseek_baseline/lifelong_results.json \
-  --out-root "temp/sop_lifelong_${MODEL//-/_}_incremental_from_ga"
+  --baseline-results artifacts/ga_deepseek_baseline/lifelong_results.json
 ```
 
 ## 🐍 Python API
@@ -344,20 +344,19 @@ v0.4 原型基于 GenericAgent 与 `deepseek-v4-flash`，在 SOP-Bench 和 Lifel
 
 实验 artifacts 位于 [`artifacts/`](artifacts/)，方法说明位于 [`docs/method.md`](docs/method.md)。
 
-如果要用另一个模型复现实验形态，只需要改 `--model` 和 `--out-root`：
+如果要用另一个模型复现实验形态，只需要改 `--model`：
 
 ```bash
 export MODEL="deepseek-v4-pro"
 "$GENERICAGENT_ROOT/.venv/bin/python" \
-  experiments/run_sop_lifelong.py \
+  experiments/run_benchmarks.py \
   --genericagent-root "$GENERICAGENT_ROOT" \
   --model "$MODEL" \
   --mode all \
-  --bench core \
-  --out-root "temp/sop_lifelong_${MODEL//-/_}"
+  --bench core
 ```
 
-默认会依次跑三段：GA baseline、Bayesian 全量自进化、Bayesian 基于所选模型新 baseline 的增量修复。结果会写到 `<out-root>/summary.md`。
+默认会依次跑三段：GA baseline、Bayesian 全量自进化、Bayesian 基于所选模型新 baseline 的增量修复。每个选中的 benchmark 都会写入自己的 benchmark-specific result root 和 `summary.md`。
 
 ## 🔌 GenericAgent 与跨 Harness 适配
 
 
@@ -58,6 +58,8 @@ class GenericAgentAdapter:
     protocol: str = "openai"
     max_tokens: int = 8192
     context_win: int = 50000
+    verify_ssl: bool = True
+    host_header: str = ""
 
     def integration_note(self) -> str:
         return (
@@ -135,7 +137,10 @@ def model_configs(self, api_key: str) -> MutableMapping[str, Mapping[str, Any]]:
             "read_timeout": 180,
             "max_tokens": self.max_tokens,
             "context_win": self.context_win,
+            "verify": self.verify_ssl,
         }
+        if self.host_header:
+            common["host_header"] = self.host_header
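Review note: the two new fields change the shared config only in a narrow way: `verify` is always emitted, while `host_header` is added only when it is non-empty. The sketch below isolates that behaviour in a stand-in class. `AdapterSketch` and `common_config` are hypothetical names, and how GenericAgent consumes the `verify` and `host_header` keys is not shown in this diff.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class AdapterSketch:
    """Stand-in for the fields touched by this hunk (not the real adapter)."""

    max_tokens: int = 8192
    context_win: int = 50000
    verify_ssl: bool = True
    host_header: str = ""

    def common_config(self) -> Dict[str, Any]:
        common: Dict[str, Any] = {
            "read_timeout": 180,
            "max_tokens": self.max_tokens,
            "context_win": self.context_win,
            "verify": self.verify_ssl,  # always present, defaults to True
        }
        if self.host_header:
            # Only emitted when the caller sets a non-empty value.
            common["host_header"] = self.host_header
        return common


default_cfg = AdapterSketch().common_config()
custom_cfg = AdapterSketch(
    verify_ssl=False, host_header="api.internal.example"
).common_config()
```

Keeping `host_header` out of the dict when it is empty means existing configurations are unchanged apart from the new `verify` key.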
         if self.protocol == "anthropic":
             return {
                 "native_claude_config_bayesian_agent": {
 
@@ -1,5 +1,6 @@
 """Benchmark orchestration helpers owned by Bayesian-Agent."""
 
 from bayesian_agent.benchmarks.evolution import build_benchmark_skill_context, classify_failure
+from bayesian_agent.benchmarks.realfin import run_realfin
 
-__all__ = ["build_benchmark_skill_context", "classify_failure"]
+__all__ = ["build_benchmark_skill_context", "classify_failure", "run_realfin"]
@@ -38,6 +38,21 @@ def classify_failure(benchmark: str, run: Mapping[str, Any]) -> str:
             return "wrote_transcript_instead_of_sql_after_workspace_confusion"
         if run.get("error"):
             return str(run.get("error"))[:160]
+    if benchmark == "realfin_benchmark":
+        scores = dict(run.get("scores") or {})
+        if run.get("error"):
+            return str(run.get("error"))[:160]
+        if scores.get("file_created") == 0.0:
+            transcript = str(run.get("transcript") or "")
+            if "could not convert string to float" in transcript or "ValueError" in transcript:
+                return "blank_ohlcv_field_crashed_calculation"
+            return "missing_requested_output_file"
+        if any(float(scores.get(key) or 0.0) < 1.0 for key in _realfin_format_score_keys(scores)):
+            return "invalid_realfin_output_format"
+        if any(float(scores.get(key) or 0.0) < 1.0 for key in _realfin_analysis_trace_score_keys(scores)):
+            return "missing_required_analysis_trace"
+        if scores:
+            return "realfin_automated_check_failed"
     return str(run.get("error") or "benchmark_failure")[:160]
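Review note: the order of checks in the new `realfin_benchmark` branch determines which label a run receives when several scores fail at once. The standalone sketch below reproduces that order so it can be exercised without the package. `classify_realfin`, `FORMAT_KEYS`, `is_trace_key`, and `below_one` are hypothetical names; `FORMAT_KEYS` is a shortened subset of the keys listed in `_realfin_format_score_keys`. The trace predicate is written as two suffix checks plus `data_fetched`, which matches the diff because every other key in its explicit set already ends in `_computed` or `_checked`.

```python
from typing import Any, Iterable, Mapping

# Subset of the format keys from the diff, shortened for the example.
FORMAT_KEYS = {"valid_codes", "valid_format", "valid_values", "sorted_desc"}


def is_trace_key(key: str) -> bool:
    return key.endswith("_computed") or key.endswith("_checked") or key == "data_fetched"


def below_one(scores: Mapping[str, Any], keys: Iterable[str]) -> bool:
    """True if any of the given scores is missing, falsy, or below 1.0."""
    return any(float(scores.get(k) or 0.0) < 1.0 for k in keys)


def classify_realfin(run: Mapping[str, Any]) -> str:
    scores = dict(run.get("scores") or {})
    if run.get("error"):
        return str(run["error"])[:160]  # an explicit error wins over any score
    if scores.get("file_created") == 0.0:
        transcript = str(run.get("transcript") or "")
        if "could not convert string to float" in transcript or "ValueError" in transcript:
            return "blank_ohlcv_field_crashed_calculation"
        return "missing_requested_output_file"
    if below_one(scores, [k for k in scores if k in FORMAT_KEYS]):
        return "invalid_realfin_output_format"
    if below_one(scores, [k for k in scores if is_trace_key(k)]):
        return "missing_required_analysis_trace"
    if scores:
        return "realfin_automated_check_failed"
    return "benchmark_failure"


crashed = classify_realfin(
    {"scores": {"file_created": 0.0}, "transcript": "ValueError: could not convert string to float: ''"}
)
both_failed = classify_realfin(
    {"scores": {"file_created": 1.0, "valid_format": 0.0, "macd_computed": 0.0}}
)
```

One consequence of this order is that a run failing both a format score and a trace score is labelled `invalid_realfin_output_format`, so only the format patch rules would be selected for it.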
 
 
@@ -155,6 +170,18 @@ def _stable_rules(benchmark: str):
             "For mutation tasks, write executable SQL that reproduces the expected table state.",
             "If SQL ranking is needed, express ranking inside a subquery and keep the final output to one SQL statement.",
         ]
+    if benchmark == "realfin_benchmark":
+        return [
+            "Read `task.json` and `realfin_cache_manifest.json` in the current workspace before calculating.",
+            "Use the local `api_cache` symlink for market data; do not call EastMoney historical endpoints such as `push2his.eastmoney.com`.",
+            "Create exactly the requested output file in the workspace; do not wrap the file content in Markdown.",
+            "Map 创业板 code `300XXX` to baostock CSV `api_cache/baostock/daily_qfq_20230101_20260331/sz.300XXX.csv`.",
+            "When writing stock codes to output files, strip cache market prefixes unless explicitly requested: use `300531`, not `sz.300531`.",
+            "Use auxiliary baostock cache for indexes such as `sh.000001` and `sz.399006`.",
+            "Use Tencent ETF cache files for ETF symbols such as `sz159642` or `sh511010`.",
+            "When a task asks for indicators or constraints, compute them from cached OHLCV data and keep the output format aligned with the prompt.",
+            "Filter cached rows to valid trading rows with non-empty numeric OHLCV fields; skip blank rows instead of crashing numeric conversion.",
+        ]
     return []
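Review note: two of the stable rules above are path and code conventions that are easy to get backwards, so a small sketch may help when checking agent output. It assumes `300XXX` means `300` followed by three digits and that only the dotted prefixes `sz.` and `sh.` are stripped. `chinext_cache_path` and `strip_market_prefix` are hypothetical helpers, not functions from this PR.

```python
import re

# Directory taken verbatim from the rule text in the diff.
BAOSTOCK_DAILY_DIR = "api_cache/baostock/daily_qfq_20230101_20260331"


def chinext_cache_path(code: str) -> str:
    """Map a bare 300XXX code to its cached baostock CSV path."""
    if not re.fullmatch(r"300\d{3}", code):
        raise ValueError(f"not a 300XXX code: {code!r}")
    return f"{BAOSTOCK_DAILY_DIR}/sz.{code}.csv"


def strip_market_prefix(symbol: str) -> str:
    """Drop a leading 'sz.' or 'sh.' so outputs use bare codes such as 300531."""
    return re.sub(r"^(?:sz|sh)\.", "", symbol)


path = chinext_cache_path("300531")
bare = strip_market_prefix("sz.300531")
```

ETF symbols such as `sz159642` carry no dot, so this helper leaves them unchanged; whether they should be stripped in output files is not stated in the rules.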
 
 
@@ -201,9 +228,79 @@ def _patch_rule_catalog(benchmark: str):
                 "Read only the current task workspace and avoid copying content from sibling benchmark runs.",
             ],
         }
+    if benchmark == "realfin_benchmark":
+        return {
+            "missing_requested_output_file": [
+                "Before finishing, list the task's requested `.txt` output file and verify it exists in the workspace.",
+                "If calculations find no qualifying symbols, still create the requested file with the task-accepted empty-result wording or header.",
+            ],
+            "blank_ohlcv_field_crashed_calculation": [
+                "When reading cached CSV files, skip rows where open/high/low/close/volume is blank or non-numeric.",
+                "Filter to `tradestatus == 1` where available before indicator calculations.",
+                "After handling sparse rows, re-run the calculation and create the requested output file.",
+            ],
+            "invalid_realfin_output_format": [
+                "Match the prompt's output format exactly: headers, comma-separated columns, code format, numeric precision, and sort order.",
+                "For stock-code outputs, strip cache prefixes like `sz.` and `sh.` unless the task explicitly requests prefixed codes.",
+                "Re-read the output file and validate it against the task's automated format constraints before finishing.",
+            ],
+            "missing_required_analysis_trace": [
+                "Run an explicit calculation script over cached OHLCV data and mention the required indicators or checks in the analysis transcript.",
+                "For indicator tasks, use the task's exact indicator names such as MACD, RSI, KDJ, Bollinger, MA, volume, ATR, or correlation.",
+            ],
+        }
     return {}
 
 
+def _realfin_format_score_keys(scores: Mapping[str, Any]):
+    return [
+        key
+        for key in scores
+        if key
+        in {
+            "valid_codes",
+            "valid_format",
+            "valid_values",
+            "reasonable_values",
+            "five_records",
+            "three_records",
+            "has_count",
+            "has_ratio",
+            "correlation_value",
+            "correlation_type",
+            "consistency",
+            "sorted_desc",
+            "valid_dates",
+            "count_limit",
+            "price_positive",
+            "vol_negative",
+            "divergence_positive",
+        }
+    ]
+
+
+def _realfin_analysis_trace_score_keys(scores: Mapping[str, Any]):
+    return [
+        key
+        for key in scores
+        if key.endswith("_computed")
+        or key.endswith("_checked")
+        or key
+        in {
+            "data_fetched",
+            "kdj_computed",
+            "macd_computed",
+            "rsi_computed",
+            "ma_computed",
+            "bollinger_computed",
+            "histogram_computed",
+            "consecutive_checked",
+            "volume_checked",
+            "range_checked",
+        }
+    ]
+
+
 def _benchmark_skill_state(benchmark: str, registry: BayesianSkillRegistry):
     skill_id = f"benchmark/{benchmark}"
     known = skill_id in registry.data.get("skills", {})