README.md: 26 additions & 11 deletions (+26 -11)
@@ -85,9 +85,9 @@ P(success | theta, C, skill)

 After each verified trajectory, the framework updates a posterior belief over that Skill. The posterior is used internally for Skill ranking, rewrite decisions, and failure-mode patches; model-facing benchmark prompts receive executable Skill/SOP text instead of raw probability summaries.

-### What "Bayesian" Means in v0.x
+### What "Bayesian" Means in v0.5

-Current Bayesian-Agent v0.x defaults to a **Bayesian Evidence Model** for each Skill/SOP. The default implementation is a feature-conditioned categorical likelihood model: it estimates whether a Skill will succeed under observed evidence features such as task context, failure mode, token bucket, turn bucket, latency bucket, and selected metadata.
+Current Bayesian-Agent v0.5 defaults to a **Bayesian Evidence Model** for each Skill/SOP. The default implementation is a feature-conditioned categorical likelihood model: it estimates whether a Skill will succeed under observed evidence features such as task context, failure mode, token bucket, turn bucket, latency bucket, and selected metadata.

 For a Skill hypothesis `h_k`, evidence `D_k = {(x_i, y_i)}` contains discrete features `x_i` and verified labels `y_i in {success, failure}`:
 The implementation uses Laplace smoothing with `alpha = 1`. This is Bayesian in the posterior-belief sense: verified experience updates the probability of a Skill succeeding under a particular context and runtime signature. The default backend is exposed as `algorithm="categorical_bayes"`; `algorithm="naive_bayes"` remains accepted as a legacy alias for the same factorized categorical likelihood.

+The current likelihood model uses **five fixed categorical evidence terms plus optional short metadata terms**:
+
+| Evidence term | Why it is included |
+|---|---|
+| `context` | Captures task family, benchmark, or harness context. |
+| `failure_mode` | Captures reusable error patterns that can become concrete Skill/SOP patches. |
+| `token_bucket` | Captures whether a trajectory succeeded cheaply or only after expensive search. |
+| `turn_bucket` | Captures recovery loops and interaction complexity. |
+| `latency_bucket` | Captures slow tool, data, or API paths that may require different SOPs. |
+| `metadata.*` | Adds harness-specific short scalar diagnostics without baking one harness schema into the core. |
+
+`metadata.*` features are included only when the value is a short scalar (`str`, `int`, `float`, or `bool`, with string length at most 80). Runtime numbers are bucketed before entering the likelihood model so sparse exact values do not dominate early evidence.
+
 For compatibility and ablation, the original **Beta-Bernoulli** posterior is still available via `algorithm="beta_bernoulli"` or `bayesian-agent evolve --algorithm beta_bernoulli`:
-Both backends feed the same Skill ranking, posterior audit rendering, and rewrite actions such as `patch`, `split`, `compress`, `retire`, and `explore`. Full Bayesian model selection over competing Skill hypotheses is planned, but not claimed in v0.x.
+Both backends feed the same Skill ranking, posterior audit rendering, and rewrite actions such as `patch`, `split`, `compress`, `retire`, and `explore`. Full Bayesian model selection over competing Skill hypotheses is planned, but not claimed in v0.5.

 ## 📋 Core Features

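As an illustration of the evidence model documented in the hunk above, here is a minimal sketch of a factorized categorical likelihood with Laplace smoothing (`alpha = 1`) and the short-scalar `metadata.*` filter. It is not Bayesian-Agent's actual code: the names `evidence_features` and `CategoricalSkillModel`, the record layout, the `"unknown"` default for missing terms, and the smoothed class prior are all assumptions made for this example.

```python
import math
from collections import defaultdict

FIXED_TERMS = ("context", "failure_mode", "token_bucket", "turn_bucket", "latency_bucket")


def evidence_features(record):
    """Flatten one trajectory record (hypothetical layout) into discrete features."""
    feats = {term: str(record.get(term, "unknown")) for term in FIXED_TERMS}
    for key, value in record.get("metadata", {}).items():
        # Keep only short scalars: str/int/float/bool, strings of at most 80 chars.
        if not isinstance(value, (str, int, float, bool)):
            continue
        if isinstance(value, str) and len(value) > 80:
            continue
        feats[f"metadata.{key}"] = str(value)
    return feats


class CategoricalSkillModel:
    """Factorized categorical likelihood for one Skill (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.label_counts = {"success": 0, "failure": 0}
        # counts[label][feature_name][feature_value] -> number of observations
        self.counts = {
            label: defaultdict(lambda: defaultdict(int)) for label in self.label_counts
        }
        self.vocab = defaultdict(set)  # feature_name -> values seen so far

    def update(self, features, success):
        label = "success" if success else "failure"
        self.label_counts[label] += 1
        for name, value in features.items():
            self.counts[label][name][value] += 1
            self.vocab[name].add(value)

    def success_probability(self, features):
        total = sum(self.label_counts.values())
        log_score = {}
        for label, n in self.label_counts.items():
            # Smoothed class prior (an assumption of this sketch).
            lp = math.log((n + self.alpha) / (total + 2 * self.alpha))
            for name, value in features.items():
                vocab_size = len(self.vocab[name] | {value})
                count = self.counts[label][name][value]
                # Laplace-smoothed P(feature value | label).
                lp += math.log((count + self.alpha) / (n + self.alpha * vocab_size))
            log_score[label] = lp
        # Normalize in log space to avoid underflow with many features.
        top = max(log_score.values())
        weights = {label: math.exp(lp - top) for label, lp in log_score.items()}
        return weights["success"] / (weights["success"] + weights["failure"])
```

With no observations the sketch returns 0.5 for any feature set; each verified trajectory then shifts the estimate toward the outcomes seen under matching feature values.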
@@ -161,15 +174,17 @@ For each Skill or benchmark SOP, Bayesian-Agent maintains:
 - context distribution
 - rewrite policy recommendations

-The default rewrite policy is intentionally small:
+The default rewrite policy is intentionally small and matches the current implementation:

-| Posterior signal | Policy action |
-|---|---|
-| repeated verified success | compress or reinforce |
-| clustered failures | patch |
-| mixed outcomes across contexts | split or specialize |
-| dominant failures | retire or rewrite |
-| sparse evidence | explore |
+| Policy action | Current trigger | Why |
+|---|---|---|
+| `explore` | no observations, or posterior remains uncertain | Avoids rewriting before verified evidence exists. |
+| `retire` | `beta >= 4` and `success_probability < 0.45` | Avoids retiring after one or two unlucky failures, but removes clearly harmful Skills. |
+| `patch` | one `failure_mode` appears at least twice | Treats repeated failures as actionable evidence while avoiding one-off overfitting. |
+| `split` | at least 3 contexts and at least 4 observations | Prevents one broad SOP from covering incompatible task contexts. |
+| `compress` | at least 3 observations and `success_probability >= 0.72` | Distills stable Skills to reduce token cost after enough positive evidence. |
+
+These thresholds are conservative v0.5 heuristics, not claims of optimality. The design goal is an inspectable posterior-driven policy that can be swapped out by downstream harnesses.
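
The trigger table in this hunk can be read as a small decision function. The sketch below is one possible reading, not the project's implementation. It assumes a Beta(1, 1) prior so that `beta = 1 + failures` and `success_probability` is the posterior mean. It also assumes that several actions may fire at once and that `explore` is the fallback when no other trigger fires. The function name `recommend_actions` and the observation layout are hypothetical.

```python
from collections import Counter


def recommend_actions(observations):
    """Map verified observations for one Skill to rewrite actions.

    Each observation is a dict with 'success' (bool), 'context' (str),
    and an optional 'failure_mode' (str). This layout is an assumption.
    """
    if not observations:
        return ["explore"]

    successes = sum(1 for obs in observations if obs["success"])
    failures = len(observations) - successes
    # Beta(1, 1) prior (assumption): posterior is Beta(1 + successes, 1 + failures).
    alpha, beta = 1 + successes, 1 + failures
    success_probability = alpha / (alpha + beta)

    actions = []
    if beta >= 4 and success_probability < 0.45:
        actions.append("retire")

    failure_modes = Counter(
        obs["failure_mode"]
        for obs in observations
        if not obs["success"] and obs.get("failure_mode")
    )
    if any(count >= 2 for count in failure_modes.values()):
        actions.append("patch")

    contexts = {obs["context"] for obs in observations}
    if len(contexts) >= 3 and len(observations) >= 4:
        actions.append("split")

    if len(observations) >= 3 and success_probability >= 0.72:
        actions.append("compress")

    # Treat "posterior remains uncertain" as "no other trigger fired" (assumption).
    return actions or ["explore"]
```

Returning a list keeps the policy inspectable: a harness that wants a single action can apply its own precedence order on top.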