README.md: 81 additions & 16 deletions
@@ -25,7 +25,18 @@ On a 30-task benchmark with 7 MCP tools, using Gemini 2.5 Pro:
McNemar's test: **p = 0.004** vs zero-shot.

- > **Note:** These numbers are from the published paper. To reproduce them yourself, see [Running the Real Benchmark](#running-the-real-benchmark) below.
+ **Reproduced live run** (gemini-2.5-pro, pgvector, May 2026):
McNemar's test: **p = 0.039** vs zero-shot (statistically significant).
+
+ > All reproduced metrics fall within the paper's 95% bootstrap confidence intervals. See [Running the Real Benchmark](#running-the-real-benchmark) to reproduce them yourself.
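
For readers who want to sanity-check a McNemar p-value by hand, here is a minimal sketch of the exact (binomial) two-sided form of the test. It assumes you have already counted the discordant task pairs: `b` tasks that only one strategy solved and `c` tasks that only the other solved. The function name and the example counts are hypothetical, chosen only for illustration. The README does not report the discordant counts, nor whether the exact or the chi-square variant was used.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: tasks solved by strategy A only; c: tasks solved by strategy B only.
    Under the null hypothesis each discordant pair is a fair coin flip,
    so the smaller count follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence either way
    k = min(b, c)
    # Probability of a split at least as lopsided as the observed one.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts, not taken from the paper or the live run.
print(mcnemar_exact(1, 8))  # 0.0390625
print(mcnemar_exact(0, 9))  # 0.00390625
```

Only the discordant pairs enter the calculation, so with 30 tasks the p-value is driven by the handful of tasks on which the two strategies disagree.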
---
@@ -169,11 +180,15 @@ The benchmark sends 30 tasks through 3 strategies (zero-shot, static few-shot, d
### Prerequisites

- Only a Google API key. No PostgreSQL required; the benchmark uses `InMemoryTraceStore`.
+ Only a Google API key is required. PostgreSQL is optional: the benchmark defaults to `InMemoryTraceStore`, but for exact paper reproduction use `--postgres`.
```bash
pip install -e ".[agent,eval]"
export GOOGLE_API_KEY=your-key-here
+
+ # Optional: for pgvector mode (paper reproduction)
```
Results include per-task breakdowns, difficulty-tier analysis, and McNemar's test.

+ ### Reproducing Paper Numbers Exactly
+
+ The paper used PostgreSQL+pgvector for trace storage. The in-memory store gives equivalent TSA/ESA results but lower PV/PCR due to differences in nearest-neighbor retrieval fidelity. To reproduce the exact paper numbers:
| `TraceStore` (pgvector) | Production, paper reproduction, persistent memory | PostgreSQL + pgvector (`podman-compose up -d`) | Exact paper numbers |
### The Framework is Model-Agnostic
@@ -326,24 +367,46 @@ behavioral-memory/
---

- ## Feedback Loop (Langfuse)
+ ## How the Agent Learns (Feedback Loop)

- The system learns from human feedback via Langfuse:
+ The architecture implements a continuous learning cycle via Langfuse (Section III.F):

- 1. Agent generates a plan → logged to Langfuse
- 2. SME reviews and scores the trace in Langfuse
- 3. FeedbackPoller detects positive scores
- 4. Gatekeeper validates the trace (schema + sandbox + dedup)
- 5. Validated trace enters behavioral memory
- 6. Future queries retrieve this trace as a reference example
+ ```
+ User Query → Agent generates plan → Logged to Langfuse
+ ↓
+ SME reviews in Langfuse dashboard
+ Assigns quality score (≥1.0 = positive)
+ ↓
+ FeedbackPoller detects positive scores
+ ↓
+ GatekeeperPipeline.submit(trace)
+ ├── Gate 1: Schema validation
+ ├── Gate 2: Sandboxed execution
+ └── Gate 3: Semantic deduplication
+ ↓
+ If all gates pass → stored in memory
+ ↓
+ Future queries retrieve this trace
+ → Agent produces better plans
+ ```
+
+ **Key insight:** The gatekeeper ensures only high-quality, non-duplicate, structurally valid traces enter memory. This is what separates our approach from systems like Reflexion that store unstructured reflections without validation.
+
+ > **Note:** The paper's benchmark used a fixed memory of 12 seed traces to isolate the impact of retrieval. The feedback loop is implemented but was not exercised during evaluation (see Section V.C). Longitudinal testing with a growing memory is identified as the most important next step.
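
The three-gate idea can be sketched in plain Python. This is an illustration under stated assumptions, not the `behavioral_memory` API: `Trace`, `submit`, the gate functions, the 0.95 similarity threshold, and the tool allow-list are all hypothetical, and the "sandbox" gate is reduced to a caller-supplied dry-run check in place of real isolated execution.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Callable, Sequence

# Every name here is hypothetical; none of it is the behavioral_memory API.


@dataclass
class Trace:
    query: str
    plan: list                   # steps shaped like {"tool": str, "args": dict}
    embedding: Sequence[float]   # precomputed embedding of the query


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def schema_gate(trace: Trace, memory: list) -> bool:
    """Gate 1: the plan is non-empty and every step is well-formed."""
    return bool(trace.plan) and all(
        isinstance(s, dict)
        and isinstance(s.get("tool"), str)
        and isinstance(s.get("args"), dict)
        for s in trace.plan
    )


def make_sandbox_gate(dry_run: Callable[[dict], bool]):
    """Gate 2 stand-in: every step must pass a caller-supplied dry run."""
    def gate(trace: Trace, memory: list) -> bool:
        return all(dry_run(step) for step in trace.plan)
    return gate


def make_dedup_gate(threshold: float = 0.95):
    """Gate 3: reject traces too similar to anything already stored."""
    def gate(trace: Trace, memory: list) -> bool:
        return all(cosine(trace.embedding, m.embedding) < threshold for m in memory)
    return gate


def submit(trace: Trace, memory: list, gates: list):
    """Store the trace if all gates pass; otherwise return the failing gate's name."""
    for name, gate in gates:
        if not gate(trace, memory):
            return name
    memory.append(trace)
    return None


KNOWN_TOOLS = {"search", "summarize"}  # assumed allow-list for the dry run
gates = [
    ("schema", schema_gate),
    ("sandbox", make_sandbox_gate(lambda step: step["tool"] in KNOWN_TOOLS)),
    ("dedup", make_dedup_gate(0.95)),
]
memory = []
step = {"tool": "search", "args": {"q": "docs"}}
results = [
    submit(Trace("find docs", [step], [1.0, 0.0]), memory, gates),
    submit(Trace("find the docs", [step], [0.99, 0.05]), memory, gates),
    submit(Trace("malformed", [{"tool": 3}], [0.0, 1.0]), memory, gates),
    submit(Trace("unknown tool", [{"tool": "rm", "args": {}}], [0.0, 1.0]), memory, gates),
]
print(results)  # [None, 'dedup', 'schema', 'sandbox']
```

Running the gates in order from cheapest to most expensive means a malformed trace is rejected before any execution or similarity search happens.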
```python
- from behavioral_memory import FeedbackPoller, GatekeeperPipeline
+ from behavioral_memory import FeedbackPoller, GatekeeperPipeline, AnnotationHandler
```