Commit 6125b29

docs: prepare v0.5 documentation
1 parent 80b4593 commit 6125b29

8 files changed

Lines changed: 133 additions & 57 deletions

README.md

Lines changed: 26 additions & 11 deletions
@@ -85,9 +85,9 @@ P(success | theta, C, skill)
8585

8686
After each verified trajectory, the framework updates a posterior belief over that Skill. The posterior is used internally for Skill ranking, rewrite decisions, and failure-mode patches; model-facing benchmark prompts receive executable Skill/SOP text instead of raw probability summaries.
8787

88-
### What "Bayesian" Means in v0.x
88+
### What "Bayesian" Means in v0.5
8989

90-
Current Bayesian-Agent v0.x defaults to a **Bayesian Evidence Model** for each Skill/SOP. The default implementation is a feature-conditioned categorical likelihood model: it estimates whether a Skill will succeed under observed evidence features such as task context, failure mode, token bucket, turn bucket, latency bucket, and selected metadata.
90+
Current Bayesian-Agent v0.5 defaults to a **Bayesian Evidence Model** for each Skill/SOP. The default implementation is a feature-conditioned categorical likelihood model: it estimates whether a Skill will succeed under observed evidence features such as task context, failure mode, token bucket, turn bucket, latency bucket, and selected metadata.
9191

9292
For a Skill hypothesis `h_k`, evidence `D_k = {(x_i, y_i)}` contains discrete features `x_i` and verified labels `y_i in {success, failure}`:
9393

@@ -99,14 +99,27 @@ P(y = success | h_k, x) ∝ P(y = success | h_k) * Π_j P(x_j | y = success, h_k
9999

100100
The implementation uses Laplace smoothing with `alpha = 1`. This is Bayesian in the posterior-belief sense: verified experience updates the probability of a Skill succeeding under a particular context and runtime signature. The default backend is exposed as `algorithm="categorical_bayes"`; `algorithm="naive_bayes"` remains accepted as a legacy alias for the same factorized categorical likelihood.
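The smoothed, factorized likelihood described above can be sketched in a few lines. This is a minimal illustration, not the project's code: the function names `fit` and `posterior_success`, the data layout, and the choice to smooth the class prior with the same `alpha` are all assumptions made for the example.

```python
from collections import Counter, defaultdict

ALPHA = 1.0  # Laplace smoothing constant stated in the README


def fit(evidence):
    """Count labels and per-label feature values.

    evidence: list of (features, label) pairs, where features is a dict of
    discrete feature name -> value and label is "success" or "failure".
    (Hypothetical data layout, chosen for this sketch.)
    """
    label_counts = Counter()
    feature_counts = defaultdict(Counter)  # (label, feature name) -> value counts
    vocab = defaultdict(set)               # feature name -> values seen so far
    for features, label in evidence:
        label_counts[label] += 1
        for name, value in features.items():
            feature_counts[(label, name)][value] += 1
            vocab[name].add(value)
    return label_counts, feature_counts, vocab


def posterior_success(model, features):
    """P(success | x) under the factorized categorical likelihood."""
    label_counts, feature_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label in ("success", "failure"):
        # Smoothed class prior (assumption: the prior is smoothed like the features).
        score = (label_counts[label] + ALPHA) / (total + 2 * ALPHA)
        for name, value in features.items():
            counts = feature_counts[(label, name)]
            n_values = len(vocab[name] | {value})  # include the queried value
            score *= (counts[value] + ALPHA) / (sum(counts.values()) + ALPHA * n_values)
        scores[label] = score
    # Normalize the two unnormalized scores into a probability.
    return scores["success"] / (scores["success"] + scores["failure"])
```

With no evidence at all the sketch returns 0.5, which is the behavior one would expect from a symmetric smoothed prior; verified successes in a matching context push the value above 0.5.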
101101

102+
The current likelihood model uses **five fixed categorical evidence terms plus optional short metadata terms**:
103+
104+
| Evidence term | Why it is included |
105+
|---|---|
106+
| `context` | Captures task family, benchmark, or harness context. |
107+
| `failure_mode` | Captures reusable error patterns that can become concrete Skill/SOP patches. |
108+
| `token_bucket` | Captures whether a trajectory succeeded cheaply or only after expensive search. |
109+
| `turn_bucket` | Captures recovery loops and interaction complexity. |
110+
| `latency_bucket` | Captures slow tool, data, or API paths that may require different SOPs. |
111+
| `metadata.*` | Adds harness-specific short scalar diagnostics without baking one harness schema into the core. |
112+
113+
`metadata.*` features are included only when the value is a short scalar (`str`, `int`, `float`, or `bool`, with string length at most 80). Runtime numbers are bucketed before entering the likelihood model so sparse exact values do not dominate early evidence.
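The filtering and bucketing rules in this paragraph could look roughly like the sketch below. The 80-character limit and the accepted types come from the text; the helper names `short_scalar_metadata` and `bucket`, and the bucket edges and labels, are hypothetical and chosen only for illustration.

```python
MAX_STR_LEN = 80  # string length limit stated in the README


def short_scalar_metadata(metadata):
    """Keep only short scalar metadata values, as `metadata.<key>` features."""
    features = {}
    for key, value in metadata.items():
        if isinstance(value, str):
            if len(value) <= MAX_STR_LEN:
                features[f"metadata.{key}"] = value
        elif isinstance(value, (bool, int, float)):
            # Stored as a string so every feature value is categorical.
            features[f"metadata.{key}"] = str(value)
        # Lists, dicts, None, and long strings are dropped.
    return features


def bucket(value, edges=(1_000, 10_000), labels=("low", "medium", "high")):
    """Map a runtime number to a coarse label (edges here are hypothetical)."""
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]
```

Bucketing first means that two trajectories using 4,812 and 5,130 tokens land in the same category instead of becoming two distinct, once-seen feature values.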
114+
102115
For compatibility and ablation, the original **Beta-Bernoulli** posterior is still available via `algorithm="beta_bernoulli"` or `bayesian-agent evolve --algorithm beta_bernoulli`:
103116

104117
```text
105118
p_k | D_k ~ Beta(alpha_0 + s_k, beta_0 + f_k)
106119
E[p_k | D_k] = (alpha_0 + s_k) / (alpha_0 + beta_0 + s_k + f_k)
107120
```
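As a worked instance of the two formulas above, the following sketch computes the posterior parameters and mean from success and failure counts. The uniform prior `alpha_0 = beta_0 = 1` is an assumption for the example; the source does not state the prior values used by the `beta_bernoulli` backend.

```python
def beta_posterior(successes, failures, alpha0=1.0, beta0=1.0):
    """Return (alpha, beta, mean) of the Beta posterior after the given counts.

    alpha0 and beta0 default to a uniform prior, which is an assumption here.
    """
    alpha = alpha0 + successes
    beta = beta0 + failures
    mean = alpha / (alpha + beta)
    return alpha, beta, mean


# Three verified successes and one failure under a uniform prior.
alpha, beta, mean = beta_posterior(successes=3, failures=1)
```

In this example the posterior is Beta(4, 2), so the expected success probability is 4 / 6.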
108121

109-
Both backends feed the same Skill ranking, posterior audit rendering, and rewrite actions such as `patch`, `split`, `compress`, `retire`, and `explore`. Full Bayesian model selection over competing Skill hypotheses is planned, but not claimed in v0.x.
122+
Both backends feed the same Skill ranking, posterior audit rendering, and rewrite actions such as `patch`, `split`, `compress`, `retire`, and `explore`. Full Bayesian model selection over competing Skill hypotheses is planned, but not claimed in v0.5.
110123

111124
## 📋 Core Features
112125

@@ -161,15 +174,17 @@ For each Skill or benchmark SOP, Bayesian-Agent maintains:
161174
- context distribution
162175
- rewrite policy recommendations
163176

164-
The default rewrite policy is intentionally small:
177+
The default rewrite policy is intentionally small and matches the current implementation:
165178

166-
| Posterior signal | Policy action |
167-
|---|---|
168-
| repeated verified success | compress or reinforce |
169-
| clustered failures | patch |
170-
| mixed outcomes across contexts | split or specialize |
171-
| dominant failures | retire or rewrite |
172-
| sparse evidence | explore |
179+
| Policy action | Current trigger | Why |
180+
|---|---|---|
181+
| `explore` | no observations, or posterior remains uncertain | Avoids rewriting before verified evidence exists. |
182+
| `retire` | `beta >= 4` and `success_probability < 0.45` | Avoids retiring after one or two unlucky failures, but removes clearly harmful Skills. |
183+
| `patch` | one `failure_mode` appears at least twice | Treats repeated failures as actionable evidence while avoiding one-off overfitting. |
184+
| `split` | at least 3 contexts and at least 4 observations | Prevents one broad SOP from covering incompatible task contexts. |
185+
| `compress` | at least 3 observations and `success_probability >= 0.72` | Distills stable Skills to reduce token cost after enough positive evidence. |
186+
187+
These thresholds are conservative v0.5 heuristics, not claims of optimality. The design goal is an inspectable posterior-driven policy that can be swapped out by downstream harnesses.
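To make the table concrete, here is a minimal sketch of a policy function using the thresholds listed above. The `SkillStats` container, its field names, and the `recommend` function are hypothetical, and the rule order simply follows the table; the actual precedence inside the project's rewrite policy may differ.

```python
from dataclasses import dataclass, field


@dataclass
class SkillStats:
    """Hypothetical summary of one Skill's evidence (not the project's type)."""
    observations: int = 0
    beta: float = 1.0                  # posterior failure parameter
    success_probability: float = 0.5
    failure_modes: dict = field(default_factory=dict)  # mode -> count
    contexts: set = field(default_factory=set)


def recommend(stats):
    """Return one rewrite action, checking rules in table order (an assumption)."""
    if stats.observations == 0:
        return "explore"
    if stats.beta >= 4 and stats.success_probability < 0.45:
        return "retire"
    if any(count >= 2 for count in stats.failure_modes.values()):
        return "patch"
    if len(stats.contexts) >= 3 and stats.observations >= 4:
        return "split"
    if stats.observations >= 3 and stats.success_probability >= 0.72:
        return "compress"
    return "explore"  # posterior still uncertain
```

Because the function takes a plain summary object and returns a string, a downstream harness could swap in its own thresholds or ordering without touching the evidence model.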
173188

174189
## 🚀 Install
175190

README_ZH.md

Lines changed: 26 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -85,9 +85,9 @@ P(success | theta, C, skill)
8585

8686
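After each verified execution trajectory, the framework updates its posterior belief over that Skill. The posterior is used internally for Skill ranking, rewrite decisions, and failure-mode patch generation; the actual model input in benchmarks receives only executable Skill/SOP text, not raw probability summaries.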
8787

88-
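### What "Bayesian" Means Precisely in v0.x
88+
### What "Bayesian" Means Precisely in v0.5
8989

90-
Bayesian-Agent v0.x currently defaults to a **Bayesian Evidence Model**. Its default implementation is a feature-conditioned categorical likelihood model: for each Skill/SOP, it estimates the probability of success or failure under a given class of evidence features. Features include task context, failure mode, token bucket, turn bucket, latency bucket, and selected metadata.
90+
Bayesian-Agent v0.5 currently defaults to a **Bayesian Evidence Model**. Its default implementation is a feature-conditioned categorical likelihood model: for each Skill/SOP, it estimates the probability of success or failure under a given class of evidence features. Features include task context, failure mode, token bucket, turn bucket, latency bucket, and selected metadata.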
9191

9292
For a Skill hypothesis `h_k`, the evidence `D_k = {(x_i, y_i)}` contains discrete features `x_i` and verified labels `y_i in {success, failure}`:
9393

@@ -99,14 +99,27 @@ P(y = success | h_k, x) ∝ P(y = success | h_k) * Π_j P(x_j | y = success, h_k
9999

100100
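The current implementation uses Laplace smoothing with `alpha = 1`. It is Bayesian in this sense: verified experience is treated as evidence that continually updates the posterior belief that a Skill succeeds under a particular context and runtime signature. The default backend is exposed as `algorithm="categorical_bayes"`; `algorithm="naive_bayes"` is still accepted as a legacy-compatible alias for the same factorized categorical likelihood.
101101

102+
The current likelihood model uses **5 fixed categorical evidence terms, plus optional short metadata terms**: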
103+
104+
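| Evidence term | Why it is included |
105+
|---|---|
106+
| `context` | Represents the task family, benchmark, or harness scenario. |
107+
| `failure_mode` | Records reusable error patterns that can later be turned into concrete Skill/SOP patches. |
108+
| `token_bucket` | Distinguishes low-cost success from success reached through high-token search. |
109+
| `turn_bucket` | Represents interaction complexity and whether repeated recovery loops occurred. |
110+
| `latency_bucket` | Represents execution paths that involve slow tools, slow data sources, or slow APIs. |
111+
| `metadata.*` | Accepts harness-specific short scalar diagnostics without hard-coding any one harness schema into the core. |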
112+
113+
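`metadata.*` accepts only short scalar values: `str`, `int`, `float`, or `bool`, with string length at most 80. Token, turn, and latency values are discretized into buckets before entering the likelihood model, so that exact values do not become too sparse in early samples.
114+
102115
For compatibility and ablation experiments, the original **Beta-Bernoulli** posterior is retained as an optional backend, available via `algorithm="beta_bernoulli"` or `bayesian-agent evolve --algorithm beta_bernoulli`: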
103116

104117
```text
105118
p_k | D_k ~ Beta(alpha_0 + s_k, beta_0 + f_k)
106119
E[p_k | D_k] = (alpha_0 + s_k) / (alpha_0 + beta_0 + s_k + f_k)
107120
```
108121

109-
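Both backends feed the same Skill ranking, posterior audit rendering, and rewrite actions such as `patch`, `split`, `compress`, `retire`, and `explore`. Full Bayesian model selection over multiple Skill hypotheses is on the roadmap and is not advertised as a completed capability of v0.x.
122+
Both backends feed the same Skill ranking, posterior audit rendering, and rewrite actions such as `patch`, `split`, `compress`, `retire`, and `explore`. Full Bayesian model selection over multiple Skill hypotheses is on the roadmap and is not advertised as a completed capability of v0.5.
110123

111124
## 📋 Core Features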
112125

@@ -161,15 +174,17 @@ E[p_k | D_k] = (alpha_0 + s_k) / (alpha_0 + beta_0 + s_k + f_k)
161174
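- context distribution
162175
- rewrite policy recommendations
163176

164-
The default rewrite policy stays small and clear:
177+
The default rewrite policy stays small and clear, and is consistent with the current code implementation: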
165178

166-
| Posterior signal | Policy action |
167-
|---|---|
168-
| repeated verified success | compress or reinforce |
169-
| clustered failure modes | patch |
170-
| divergent performance across contexts | split or specialize |
171-
| failures dominate | retire or rewrite |
172-
| sparse evidence | explore |
179+
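| Policy action | Current trigger | Why it is set this way |
180+
|---|---|---|
181+
| `explore` | no observations, or the posterior is still uncertain | Does not rush to change a Skill before verified evidence exists. |
182+
| `retire` | `beta >= 4` and `success_probability < 0.45` | Avoids discarding a Skill after one or two chance failures, but removes clearly harmful Skills. |
183+
| `patch` | some `failure_mode` appears at least 2 times | Treats repeated failures as actionable evidence while reducing single-sample overfitting. |
184+
| `split` | at least 3 contexts and at least 4 observations | Prevents one overly broad SOP from covering mutually incompatible task scenarios. |
185+
| `compress` | at least 3 observations and `success_probability >= 0.72` | Compresses the Skill once success evidence is stable, reducing token cost. |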
186+
187+
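These thresholds are conservative v0.5 heuristics and make no claim of optimality. The current goal is to provide an auditable, replaceable posterior-driven rewrite policy.
173188

174189
## 🚀 Install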
175190

docs/articles/bayesian-evidence-acquired-learning.md

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -143,7 +143,7 @@ P(y | x_1, x_2, ..., x_m)
143143
= P(x_1, x_2, ..., x_m | y) P(y) / P(x_1, x_2, ..., x_m)
144144
```
145145

146-
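With many features, directly estimating the joint probability `P(x_1, x_2, ..., x_m | y)` becomes very difficult. v0.x adopts a lightweight factorized categorical likelihood: given the label `y`, the features are treated as approximately conditionally independent:
146+
With many features, directly estimating the joint probability `P(x_1, x_2, ..., x_m | y)` becomes very difficult. v0.5 adopts a lightweight factorized categorical likelihood: given the label `y`, the features are treated as approximately conditionally independent: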
147147

148148
```text
149149
P(x_1, x_2, ..., x_m | y)
@@ -256,7 +256,7 @@ P(spam | x)
256256
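| Posterior | <code>P(spam &#124; x)</code>, the probability that the email is spam after seeing these features |
257257
| Factorization assumption | Given `spam` or `normal`, discounts, links, and unknown senders are approximately independent |
258258

259-
This is the intuition behind the categorical evidence model: it does not need to understand the deeper semantics of "discount" or "link". It only needs to keep counting, from historical samples, which features tend to appear with which label, and it can then make interpretable probability judgments. In Bayesian-Agent, it is positioned as the first interpretable evidence backend of v0.x, not as the upper bound of the method.
259+
This is the intuition behind the categorical evidence model: it does not need to understand the deeper semantics of "discount" or "link". It only needs to keep counting, from historical samples, which features tend to appear with which label, and it can then make interpretable probability judgments. In Bayesian-Agent, it is positioned as the first interpretable evidence backend of v0.5, not as the upper bound of the method.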
260260

261261
## 3. Acquired Learning in Humans: How a Skill Grows out of Experience
262262

@@ -716,7 +716,7 @@ rewrite = patch
716716
reason = failures cluster around left_expected_output_blank
717717
```
718718

719-
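The rewritten Skill context does not vaguely say "be more careful"; it turns recurring failure modes into executable constraints. The current v0.x implementation first saves a single failure as candidate evidence in the audit artifact; only after the same failure mode has appeared at least twice does it inject an active patch section like the following into the next round's prompt:
719+
The rewritten Skill context does not vaguely say "be more careful"; it turns recurring failure modes into executable constraints. The current v0.5 implementation first saves a single failure as candidate evidence in the audit artifact; only after the same failure mode has appeared at least twice does it inject an active patch section like the following into the next round's prompt: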
720720

721721
```text
722722
### Bayesian Failure-Mode Patches: sop_bench
@@ -1195,7 +1195,7 @@ Acquired learning lets the agent learn from its own verified experience.
11951195

11961196
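The Monty Hall problem tells us: information is not "how many options appear to remain" but "how probable this evidence is under each hypothesis".
11971197

1198-
The Bayesian Evidence Model tells us: when evidence consists of multiple features, an interpretable likelihood model can tally each feature's contribution to success or failure; v0.x uses a factorized categorical likelihood, which can later be replaced by a stronger Bayesian model.
1198+
The Bayesian Evidence Model tells us: when evidence consists of multiple features, an interpretable likelihood model can tally each feature's contribution to success or failure; v0.5 uses a factorized categorical likelihood, which can later be replaced by a stronger Bayesian model.
11991199

12001200
Acquired learning in humans tells us: a skill is not a rule hard-coded once, but an operating hypothesis continually calibrated through experience, feedback, and training.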
12011201

docs/articles/complex-bayesian-rewrite-example.md

Lines changed: 6 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -12,7 +12,7 @@ P(y | h_k, x) ∝ P(y | h_k) * Π_j P(x_j | y, h_k)
1212

1313
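## 1. Which evidence features the current implementation actually uses
1414

15-
The default backend of Bayesian-Agent v0.x is `categorical_bayes`. For each `TrajectoryEvidence`, the current implementation extracts these discrete features:
15+
The default backend of Bayesian-Agent v0.5 is `categorical_bayes`. For each `TrajectoryEvidence`, the current implementation extracts these discrete features: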
1616

1717
```text
1818
features = {
@@ -237,7 +237,7 @@ left_expected_output_blank + high token bucket + long turns + high latency
237237

238238
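## 6. What actually triggers a Skill rewrite
239239

240-
Precision matters here. The current v0.x `RewritePolicy` does not directly read the `P_h(failure | x_risk)` above to decide on a rewrite.
240+
Precision matters here. The current v0.5 `RewritePolicy` does not directly read the `P_h(failure | x_risk)` above to decide on a rewrite.
241241

242242
The trigger order in the current code can be summarized as: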
243243

@@ -285,7 +285,7 @@ Current task files and runtime feedback remain authoritative.
285285

286286
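- `posterior_success`: used for Skill ranking and audit display.
287287
- feature-conditioned posterior: used to explain why a given failure cluster is dangerous.
288-
- `failure_modes` count: the rule that actually triggers `patch` in the current v0.x.
288+
- `failure_modes` count: the rule that actually triggers `patch` in the current v0.5.
289289

290290
## 7. A patch is not a generic reminder but a failure-mode-specific guardrail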
291291

@@ -304,7 +304,7 @@ failure_mode = left_expected_output_blank
304304
- If the target cell is empty, write the computed raw category string before finishing.
305305
```
306306

307-
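This is the concrete form Skill rewrite takes in the current v0.x: it does not directly generate a brand-new child Skill, nor does it update model parameters. Instead, it adds executable constraints targeting the failure mode to the next round's prompt/context.
307+
This is the concrete form Skill rewrite takes in the current v0.5: it does not directly generate a brand-new child Skill, nor does it update model parameters. Instead, it adds executable constraints targeting the failure mode to the next round's prompt/context.
308308

309309
This patch enters the next round's task context together with the stable benchmark guardrails, for example: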
310310

@@ -434,7 +434,7 @@ P_h(failure | x_risk) ≈ 0.997
434434
2. Failure clusters such as left_expected_output_blank still need to be constrained by a guardrail.
435435
```
436436

437-
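So in the current v0.x, even if a later repair succeeds, the `failure_modes` count stays in the registry. A failure mode seen for the first time is only saved as candidate evidence in the audit artifact; only after the same failure mode has appeared at least twice does the context retain the corresponding active patch. This is more stable than "rewrite the skill on every error" and also lowers the risk of overfitting to a single anomalous sample.
437+
So in the current v0.5, even if a later repair succeeds, the `failure_modes` count stays in the registry. A failure mode seen for the first time is only saved as candidate evidence in the audit artifact; only after the same failure mode has appeared at least twice does the context retain the corresponding active patch. This is more stable than "rewrite the skill on every error" and also lowers the risk of overfitting to a single anomalous sample.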
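The candidate-versus-active distinction described above can be sketched with a simple counter. The class `FailureModeRegistry` and its method names are hypothetical, introduced only for illustration; the sketch assumes nothing about the project's real registry beyond the two rules in the text (counts persist after a successful repair, and a patch becomes active at two occurrences).

```python
from collections import Counter

ACTIVE_THRESHOLD = 2  # "at least twice", as stated in the text


class FailureModeRegistry:
    """Hypothetical registry that tracks failure-mode counts for one Skill."""

    def __init__(self):
        self.counts = Counter()

    def record_failure(self, mode):
        self.counts[mode] += 1

    def record_repair_success(self, mode):
        # Deliberately a no-op: a later successful repair does not erase
        # the evidence that this failure mode occurred.
        return None

    def candidates(self):
        """Modes seen fewer times than the threshold: audit-only evidence."""
        return sorted(m for m, c in self.counts.items() if c < ACTIVE_THRESHOLD)

    def active_patches(self):
        """Modes seen often enough to be rendered into the next context."""
        return sorted(m for m, c in self.counts.items() if c >= ACTIVE_THRESHOLD)
```

Keeping the count after a repair is the design choice that matters here: the guardrail stays in place for a failure cluster that has already recurred, rather than disappearing the first time a retry happens to succeed.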
438438

439439
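## 10. What this example shows
440440

@@ -472,7 +472,7 @@ repair succeeded:
472472
2. **Likelihood**: context, failure mode, token bucket, turn bucket, latency bucket, and metadata are all tallied as categorical likelihoods.
473473
3. **Posterior**: new evidence changes the next round's Skill ranking, failure patches, context rendering, and repair behavior.
474474

475-
It should also be said precisely that v0.x is not full Bayesian model selection. It has not yet placed multiple child Skill hypotheses into a unified posterior competition framework:
475+
It should also be said precisely that v0.5 is not full Bayesian model selection. It has not yet placed multiple child Skill hypotheses into a unified posterior competition framework: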
476476

477477
```text
478478
P(h_k | D) ∝ P(D | h_k) P(h_k)