DataArcTech
diff --git a/README.md b/README.md
Lines changed: 26 additions & 11 deletions
diff --git a/README_ZH.md b/README_ZH.md
Lines changed: 26 additions & 11 deletions
diff --git a/docs/articles/bayesian-evidence-acquired-learning.md b/docs/articles/bayesian-evidence-acquired-learning.md
Lines changed: 4 additions & 4 deletions
diff --git a/docs/articles/complex-bayesian-rewrite-example.md b/docs/articles/complex-bayesian-rewrite-example.md
Lines changed: 6 additions & 6 deletions
@@ -85,9 +85,9 @@ P(success | theta, C, skill)
 
 After each verified trajectory, the framework updates a posterior belief over that Skill. The posterior is used internally for Skill ranking, rewrite decisions, and failure-mode patches; model-facing benchmark prompts receive executable Skill/SOP text instead of raw probability summaries.
 
-### What "Bayesian" Means in v0.x
+### What "Bayesian" Means in v0.5
 
-Current Bayesian-Agent v0.x defaults to a **Bayesian Evidence Model** for each Skill/SOP. The default implementation is a feature-conditioned categorical likelihood model: it estimates whether a Skill will succeed under observed evidence features such as task context, failure mode, token bucket, turn bucket, latency bucket, and selected metadata.
+Current Bayesian-Agent v0.5 defaults to a **Bayesian Evidence Model** for each Skill/SOP. The default implementation is a feature-conditioned categorical likelihood model: it estimates whether a Skill will succeed under observed evidence features such as task context, failure mode, token bucket, turn bucket, latency bucket, and selected metadata.
 
 For a Skill hypothesis `h_k`, evidence `D_k = {(x_i, y_i)}` contains discrete features `x_i` and verified labels `y_i in {success, failure}`:
 
@@ -99,14 +99,27 @@ P(y = success | h_k, x) ∝ P(y = success | h_k) * Π_j P(x_j | y = success, h_k
 
 The implementation uses Laplace smoothing with `alpha = 1`. This is Bayesian in the posterior-belief sense: verified experience updates the probability of a Skill succeeding under a particular context and runtime signature. The default backend is exposed as `algorithm="categorical_bayes"`; `algorithm="naive_bayes"` remains accepted as a legacy alias for the same factorized categorical likelihood.
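
A minimal sketch of the factorized categorical likelihood with Laplace smoothing (`alpha = 1`) may make the update concrete. The class name `CategoricalBayesSketch` and its methods are hypothetical and written only for illustration; they are not the library's actual API, and the real backend may count or normalize differently.

```python
from collections import Counter, defaultdict

ALPHA = 1.0  # Laplace smoothing constant stated in the text


class CategoricalBayesSketch:
    """Hypothetical illustration of a per-Skill categorical evidence model."""

    LABELS = ("success", "failure")

    def __init__(self):
        self.label_counts = Counter()
        # counts[label][feature_name][value] -> number of occurrences
        self.counts = {label: defaultdict(Counter) for label in self.LABELS}
        # distinct values seen so far for each feature name
        self.values = defaultdict(set)

    def update(self, features, label):
        """Record one verified trajectory (discrete features + label)."""
        self.label_counts[label] += 1
        for name, value in features.items():
            self.counts[label][name][value] += 1
            self.values[name].add(value)

    def _score(self, features, label):
        # Smoothed prior P(y | h_k)
        total = sum(self.label_counts.values())
        n_label = self.label_counts[label]
        score = (n_label + ALPHA) / (total + ALPHA * len(self.LABELS))
        # Smoothed product over features: prod_j P(x_j | y, h_k)
        for name, value in features.items():
            n_values = len(self.values.get(name, set()) | {value})
            seen = self.counts[label][name][value]
            score *= (seen + ALPHA) / (n_label + ALPHA * n_values)
        return score

    def success_probability(self, features):
        """Normalize the two unnormalized scores into P(success | h_k, x)."""
        s = self._score(features, "success")
        f = self._score(features, "failure")
        return s / (s + f)
```

With no evidence the sketch returns 0.5 for any query; each verified trajectory then shifts the estimate toward the label it was observed with.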
 
+The current likelihood model uses **five fixed categorical evidence terms plus optional short metadata terms**:
+
+| Evidence term | Why it is included |
+|---|---|
+| `context` | Captures task family, benchmark, or harness context. |
+| `failure_mode` | Captures reusable error patterns that can become concrete Skill/SOP patches. |
+| `token_bucket` | Captures whether a trajectory succeeded cheaply or only after expensive search. |
+| `turn_bucket` | Captures recovery loops and interaction complexity. |
+| `latency_bucket` | Captures slow tool, data, or API paths that may require different SOPs. |
+| `metadata.*` | Adds harness-specific short scalar diagnostics without baking one harness schema into the core. |
+
+`metadata.*` features are included only when the value is a short scalar (`str`, `int`, `float`, or `bool`, with string length at most 80). Runtime numbers are bucketed before entering the likelihood model so sparse exact values do not dominate early evidence.
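
The filtering and bucketing described above could look roughly like the sketch below. The helper names, the bucket edges, and the bucket labels are all assumptions made for illustration (the source does not state the actual boundaries); only the five term names, the scalar types, and the 80-character limit come from the text.

```python
MAX_METADATA_STR_LEN = 80  # limit stated in the text


def bucket(value, edges, labels):
    """Map a number to a coarse label. `labels` has one more entry than `edges`."""
    for edge, label in zip(edges, labels):
        if value <= edge:
            return label
    return labels[-1]


def short_scalar_metadata(metadata):
    """Keep only short scalar values, prefixed as `metadata.<key>`."""
    kept = {}
    for key, value in metadata.items():
        if isinstance(value, str):
            if len(value) <= MAX_METADATA_STR_LEN:
                kept[f"metadata.{key}"] = value
        elif isinstance(value, (bool, int, float)):
            kept[f"metadata.{key}"] = value
    return kept


def extract_features(context, failure_mode, tokens, turns, latency_s, metadata):
    """Build the five fixed terms plus optional metadata terms.

    The edges below are hypothetical placeholders, not the project's values.
    """
    labels = ("low", "mid", "high")
    features = {
        "context": context,
        "failure_mode": failure_mode,
        "token_bucket": bucket(tokens, (2_000, 20_000), labels),
        "turn_bucket": bucket(turns, (3, 10), labels),
        "latency_bucket": bucket(latency_s, (5.0, 60.0), labels),
    }
    features.update(short_scalar_metadata(metadata))
    return features
```

Bucketing first means that two trajectories using 1,830 and 1,910 tokens contribute to the same categorical count instead of two distinct, one-off values.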
+
 For compatibility and ablation, the original **Beta-Bernoulli** posterior is still available via `algorithm="beta_bernoulli"` or `bayesian-agent evolve --algorithm beta_bernoulli`:
 
 ```text
 p_k | D_k ~ Beta(alpha_0 + s_k, beta_0 + f_k)
 E[p_k | D_k] = (alpha_0 + s_k) / (alpha_0 + beta_0 + s_k + f_k)
 ```
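
As a worked instance of the formula above, the following sketch computes the posterior parameters and mean from success and failure counts. The function name is hypothetical, and the uniform prior `alpha_0 = beta_0 = 1` is an assumption; the source does not state the default prior.

```python
def beta_posterior(successes, failures, alpha_0=1.0, beta_0=1.0):
    """Return (alpha, beta, mean) of Beta(alpha_0 + s_k, beta_0 + f_k).

    The default prior Beta(1, 1) is an assumed placeholder.
    """
    alpha = alpha_0 + successes
    beta = beta_0 + failures
    mean = alpha / (alpha + beta)
    return alpha, beta, mean


# Three verified successes and one failure under the assumed uniform prior:
alpha, beta, mean = beta_posterior(successes=3, failures=1)
print(alpha, beta, round(mean, 3))
```

Under this assumed prior, three successes and one failure give `Beta(4, 2)` with posterior mean `4 / 6`.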
 
-Both backends feed the same Skill ranking, posterior audit rendering, and rewrite actions such as `patch`, `split`, `compress`, `retire`, and `explore`. Full Bayesian model selection over competing Skill hypotheses is planned, but not claimed in v0.x.
+Both backends feed the same Skill ranking, posterior audit rendering, and rewrite actions such as `patch`, `split`, `compress`, `retire`, and `explore`. Full Bayesian model selection over competing Skill hypotheses is planned, but not claimed in v0.5.
 
 ## 📋 Core Features
 
@@ -161,15 +174,17 @@ For each Skill or benchmark SOP, Bayesian-Agent maintains:
 - context distribution
 - rewrite policy recommendations
 
-The default rewrite policy is intentionally small:
+The default rewrite policy is intentionally small and matches the current implementation:
 
-| Posterior signal | Policy action |
-|---|---|
-| repeated verified success | compress or reinforce |
-| clustered failures | patch |
-| mixed outcomes across contexts | split or specialize |
-| dominant failures | retire or rewrite |
-| sparse evidence | explore |
+| Policy action | Current trigger | Why |
+|---|---|---|
+| `explore` | no observations, or posterior remains uncertain | Avoids rewriting before verified evidence exists. |
+| `retire` | `beta >= 4` and `success_probability < 0.45` | Avoids retiring after one or two unlucky failures, but removes clearly harmful Skills. |
+| `patch` | one `failure_mode` appears at least twice | Treats repeated failures as actionable evidence while avoiding one-off overfitting. |
+| `split` | at least 3 contexts and at least 4 observations | Prevents one broad SOP from covering incompatible task contexts. |
+| `compress` | at least 3 observations and `success_probability >= 0.72` | Distills stable Skills to reduce token cost after enough positive evidence. |
+
+These thresholds are conservative v0.5 heuristics, not claims of optimality. The design goal is an inspectable posterior-driven policy that can be swapped out by downstream harnesses.
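
The table can be read as a small decision function. The sketch below uses the thresholds listed above, but the `SkillStats` container, the function name, and especially the order in which the rules are checked are assumptions for illustration; the actual `RewritePolicy` may evaluate them differently.

```python
from dataclasses import dataclass, field


@dataclass
class SkillStats:
    """Hypothetical summary of one Skill's evidence."""

    observations: int = 0
    beta: float = 1.0  # assumed to be the Beta posterior's failure-side parameter
    success_probability: float = 0.5
    failure_modes: dict = field(default_factory=dict)  # failure_mode -> count
    contexts: set = field(default_factory=set)


def recommend_action(stats):
    """Apply the table's thresholds in table order (the order is an assumption)."""
    if stats.observations == 0:
        return "explore"
    if stats.beta >= 4 and stats.success_probability < 0.45:
        return "retire"
    if any(count >= 2 for count in stats.failure_modes.values()):
        return "patch"
    if len(stats.contexts) >= 3 and stats.observations >= 4:
        return "split"
    if stats.observations >= 3 and stats.success_probability >= 0.72:
        return "compress"
    # Evidence exists but no rule fired: the posterior is still uncertain.
    return "explore"
```

Keeping the policy as a pure function of summary statistics is one way to make it swappable by a downstream harness, in line with the stated design goal.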
 
 ## 🚀 Install
 
 
@@ -85,9 +85,9 @@ P(success | theta, C, skill)
 
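 After each verified execution trajectory, the framework updates its posterior belief over that Skill. The posterior is used internally for Skill ranking, rewrite decisions, and failure-mode patch generation; the actual model input for a benchmark receives only executable Skill/SOP text, not raw probability summaries.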
 
-### What "Bayesian" Precisely Means in v0.x
+### What "Bayesian" Precisely Means in v0.5
 
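-Current Bayesian-Agent v0.x defaults to a **Bayesian Evidence Model**. Its default implementation is a feature-conditioned categorical likelihood model: for each Skill/SOP, it estimates the probability of success or failure under a given class of evidence features. The features include task context, failure mode, token bucket, turn bucket, latency bucket, and selected metadata.
+Current Bayesian-Agent v0.5 defaults to a **Bayesian Evidence Model**. Its default implementation is a feature-conditioned categorical likelihood model: for each Skill/SOP, it estimates the probability of success or failure under a given class of evidence features. The features include task context, failure mode, token bucket, turn bucket, latency bucket, and selected metadata.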
 
 For a Skill hypothesis `h_k`, the evidence `D_k = {(x_i, y_i)}` contains discrete features `x_i` and verified labels `y_i in {success, failure}`:
 
@@ -99,14 +99,27 @@ P(y = success | h_k, x) ∝ P(y = success | h_k) * Π_j P(x_j | y = success, h_k
 
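 The current implementation uses Laplace smoothing with `alpha = 1`. It is Bayesian in this sense: verified experience is treated as evidence that continuously updates the posterior belief that a Skill succeeds under a particular context and runtime signature. The default backend is exposed as `algorithm="categorical_bayes"`; `algorithm="naive_bayes"` is still accepted as a legacy alias for the same factorized categorical likelihood.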
 
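+The current likelihood model uses **5 fixed categorical evidence terms plus optional short metadata terms**:
+
+| Evidence term | Why it is included |
+|---|---|
+| `context` | Represents the task family, benchmark, or harness scenario. |
+| `failure_mode` | Records reusable error patterns that can later become concrete Skill/SOP patches. |
+| `token_bucket` | Distinguishes low-cost successes from successes reached through token-heavy search. |
+| `turn_bucket` | Represents interaction complexity and whether repeated recovery loops occurred. |
+| `latency_bucket` | Represents slow execution paths such as slow tools, data sources, or APIs. |
+| `metadata.*` | Accepts harness-specific short scalar diagnostics without hard-coding any one harness schema into the core. |
+
+`metadata.*` accepts only short scalar values: `str`, `int`, `float`, or `bool`, with string length at most 80. Token, turn, and latency values are discretized into buckets before entering the likelihood model, so that exact values do not become too sparse in early samples.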
+
 For compatibility and ablation experiments, the original **Beta-Bernoulli** posterior is retained as an optional backend, available via `algorithm="beta_bernoulli"` or `bayesian-agent evolve --algorithm beta_bernoulli`:
 
 ```text
 p_k | D_k ~ Beta(alpha_0 + s_k, beta_0 + f_k)
 E[p_k | D_k] = (alpha_0 + s_k) / (alpha_0 + beta_0 + s_k + f_k)
 ```
 
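-Both backends feed the same Skill ranking, posterior audit rendering, and rewrite actions such as `patch`, `split`, `compress`, `retire`, and `explore`. Full Bayesian model selection over multiple Skill hypotheses is on the roadmap and is not advertised as a completed capability of v0.x.
+Both backends feed the same Skill ranking, posterior audit rendering, and rewrite actions such as `patch`, `split`, `compress`, `retire`, and `explore`. Full Bayesian model selection over multiple Skill hypotheses is on the roadmap and is not advertised as a completed capability of v0.5.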
 
 ## 📋 Core Features
 
@@ -161,15 +174,17 @@ E[p_k | D_k] = (alpha_0 + s_k) / (alpha_0 + beta_0 + s_k + f_k)
 - context distribution
 - rewrite policy recommendations
 
-The default rewrite policy is kept small and clear:
+The default rewrite policy is kept small and clear, and matches the current code implementation:
 
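-| Posterior signal | Policy action |
-|---|---|
-| repeated verified success | compress or reinforce |
-| clustered failure modes | patch |
-| diverging performance across contexts | split or specialize |
-| dominant failures | retire or rewrite |
-| sparse evidence | explore |
+| Policy action | Current trigger | Why it is set this way |
+|---|---|---|
+| `explore` | no observations, or the posterior is still uncertain | Does not rush to change a Skill before verified evidence exists. |
+| `retire` | `beta >= 4` and `success_probability < 0.45` | Avoids discarding a Skill after one or two chance failures, but removes clearly harmful Skills. |
+| `patch` | one `failure_mode` appears at least 2 times | Treats repeated failures as actionable evidence while reducing single-sample overfitting. |
+| `split` | at least 3 contexts and at least 4 observations | Prevents one overly broad SOP from covering mutually incompatible task scenarios. |
+| `compress` | at least 3 observations and `success_probability >= 0.72` | Compresses the Skill once success evidence is stable, lowering token cost. |
+
+These thresholds are conservative v0.5 heuristics and are not claimed to be optimal. The current goal is to provide an auditable, replaceable posterior-driven rewrite policy.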
 
 ## 🚀 Install
 
 
@@ -143,7 +143,7 @@ P(y | x_1, x_2, ..., x_m)
 = P(x_1, x_2, ..., x_m | y) P(y) / P(x_1, x_2, ..., x_m)
 ```
 
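-If there are many features, directly estimating the joint probability `P(x_1, x_2, ..., x_m | y)` becomes very hard. v0.x adopts a lightweight factorized categorical likelihood: given the label `y`, the features are treated as approximately conditionally independent:
+If there are many features, directly estimating the joint probability `P(x_1, x_2, ..., x_m | y)` becomes very hard. v0.5 adopts a lightweight factorized categorical likelihood: given the label `y`, the features are treated as approximately conditionally independent: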
 
 ```text
 P(x_1, x_2, ..., x_m | y)
@@ -256,7 +256,7 @@ P(spam | x)
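 | Posterior | <code>P(spam &#124; x)</code>, the probability that the email is spam after observing these features |
 | Factorization assumption | Given `spam` or `normal`, discount, link, and unfamiliar sender are approximately independent |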
 
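-This is the intuition behind the categorical evidence model: it does not need to understand the deeper semantics of "discount" or "link"; by continually counting from historical samples which features tend to co-occur with which label, it can make interpretable probabilistic judgments. In Bayesian-Agent it is positioned as the first interpretable evidence backend of v0.x, not as the ceiling of the method.
+This is the intuition behind the categorical evidence model: it does not need to understand the deeper semantics of "discount" or "link"; by continually counting from historical samples which features tend to co-occur with which label, it can make interpretable probabilistic judgments. In Bayesian-Agent it is positioned as the first interpretable evidence backend of v0.5, not as the ceiling of the method.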
 
 ## 3. Acquired Learning in Humans: How a Skill Grows out of Experience
 
@@ -716,7 +716,7 @@ rewrite = patch
 reason = failures cluster around left_expected_output_blank
 ```
 
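-The rewritten Skill context does not vaguely say "be more careful"; it turns a recurring failure mode into an executable constraint. The current v0.x implementation first stores a single failure as candidate evidence in the audit artifact; only after the same failure mode has appeared at least twice does it inject an active patch section like the following into the next round's prompt:
+The rewritten Skill context does not vaguely say "be more careful"; it turns a recurring failure mode into an executable constraint. The current v0.5 implementation first stores a single failure as candidate evidence in the audit artifact; only after the same failure mode has appeared at least twice does it inject an active patch section like the following into the next round's prompt: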
 
 ```text
 ### Bayesian Failure-Mode Patches: sop_bench
@@ -1195,7 +1195,7 @@ Acquired learning lets the agent learn from its own verified experience.
 
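 The Monty Hall problem teaches us that information is not "how many options appear to remain" but "how probable this evidence is under each hypothesis".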
 
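-The Bayesian Evidence Model teaches us that when evidence consists of multiple features, an interpretable likelihood model can tally each feature's contribution to success or failure; v0.x uses a factorized categorical likelihood, which can later be replaced by a stronger Bayesian model.
+The Bayesian Evidence Model teaches us that when evidence consists of multiple features, an interpretable likelihood model can tally each feature's contribution to success or failure; v0.5 uses a factorized categorical likelihood, which can later be replaced by a stronger Bayesian model.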
 
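 Acquired learning in humans teaches us that a skill is not a rule written once and frozen, but an operating hypothesis continually calibrated through experience, feedback, and practice.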
 
 
@@ -12,7 +12,7 @@ P(y | h_k, x) ∝ P(y | h_k) * Π_j P(x_j | y, h_k)
 
 ## 1. Which evidence features the current implementation actually uses
 
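-The default backend of Bayesian-Agent v0.x is `categorical_bayes`. For each `TrajectoryEvidence`, the current implementation extracts these discrete features:
+The default backend of Bayesian-Agent v0.5 is `categorical_bayes`. For each `TrajectoryEvidence`, the current implementation extracts these discrete features: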
 
 ```text
 features = {
@@ -237,7 +237,7 @@ left_expected_output_blank + high token bucket + long turns + high latency
 
 ## 6. What actually triggers a Skill rewrite
 
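-Precision matters here. The current v0.x `RewritePolicy` does not directly read the `P_h(failure | x_risk)` above to decide on a rewrite.
+Precision matters here. The current v0.5 `RewritePolicy` does not directly read the `P_h(failure | x_risk)` above to decide on a rewrite.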
 
 The trigger order in the current code can be summarized as:
 
@@ -285,7 +285,7 @@ Current task files and runtime feedback remain authoritative.
 
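 - `posterior_success`: used for Skill ranking and audit display.
 - feature-conditioned posterior: used to explain why a failure cluster is dangerous.
-- `failure_modes` count: the rule that actually triggers `patch` in the current v0.x.
+- `failure_modes` count: the rule that actually triggers `patch` in the current v0.5.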
 
 ## 7. A patch is not a generic reminder but a failure-mode-specific guardrail
 
@@ -304,7 +304,7 @@ failure_mode = left_expected_output_blank
   - If the target cell is empty, write the computed raw category string before finishing.
 ```
 
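-This is the concrete form Skill rewrite takes in the current v0.x: it does not directly generate a brand-new child Skill, nor does it update model parameters; instead it adds executable constraints targeting the failure mode to the next round's prompt/context.
+This is the concrete form Skill rewrite takes in the current v0.5: it does not directly generate a brand-new child Skill, nor does it update model parameters; instead it adds executable constraints targeting the failure mode to the next round's prompt/context.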
 
 This patch enters the next round's task context together with the stable benchmark guardrails, for example:
 
@@ -434,7 +434,7 @@ P_h(failure | x_risk) ≈ 0.997
 2. Failure clusters such as left_expected_output_blank still need to be constrained by a guardrail.
 ```
 
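-So in the current v0.x, even if a subsequent repair succeeds, the `failure_modes` count remains in the registry. The first occurrence of a failure mode is stored only as candidate evidence in the audit artifact; only after the same failure mode has appeared at least twice is the related active patch retained in the context. This is more stable than "rewrite the skill on every error" and lowers the risk of overfitting to a single anomalous sample.
+So in the current v0.5, even if a subsequent repair succeeds, the `failure_modes` count remains in the registry. The first occurrence of a failure mode is stored only as candidate evidence in the audit artifact; only after the same failure mode has appeared at least twice is the related active patch retained in the context. This is more stable than "rewrite the skill on every error" and lowers the risk of overfitting to a single anomalous sample.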
 
 ## 10. What this example shows
 
@@ -472,7 +472,7 @@ repair succeeded:
 2. **Likelihood**: context, failure mode, token bucket, turn bucket, latency bucket, and metadata are all tallied as categorical likelihoods.
 3. **Posterior**: new evidence changes the next round's Skill ranking, failure patches, context rendering, and repair behavior.
 
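-It is equally important to be accurate: v0.x is not full Bayesian model selection. It does not yet place multiple child Skill hypotheses into a unified posterior competition framework:
+It is equally important to be accurate: v0.5 is not full Bayesian model selection. It does not yet place multiple child Skill hypotheses into a unified posterior competition framework: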
 
 ```text
 P(h_k | D) ∝ P(D | h_k) P(h_k)