Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
Original file line number Diff line number Diff line change
Expand Up @@ -69,7 +69,7 @@ uv run pytest # Run all tests
## Running Tests

```bash
# All 96 tests (no external services needed)
# All 104 tests (no external services needed)
uv run pytest

# With verbose output
Expand Down
7 changes: 5 additions & 2 deletions Makefile
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
.PHONY: help install dev lint format typecheck test test-unit test-e2e demo benchmark validate clean
.PHONY: help install dev lint format typecheck test test-unit test-e2e demo benchmark benchmark-pg validate clean

help: ## Show this help
@grep -E '^[a-zA-Z_-]+:.*?## .*$$' $(MAKEFILE_LIST) | sort | awk 'BEGIN {FS = ":.*?## "}; {printf "\033[36m%-15s\033[0m %s\n", $$1, $$2}'
Expand Down Expand Up @@ -31,9 +31,12 @@ test-e2e: ## Run end-to-end tests only
demo: ## Run offline demo (no API keys needed)
uv run behavioral-memory demo

benchmark: ## Run live benchmark (requires GOOGLE_API_KEY)
benchmark: ## Run live benchmark with in-memory store (requires GOOGLE_API_KEY)
uv run python examples/run_live_benchmark.py

benchmark-pg: ## Run live benchmark with pgvector (requires GOOGLE_API_KEY + PostgreSQL)
uv run python examples/run_live_benchmark.py --postgres

ablation: ## Run gatekeeper ablation study
uv run python examples/gatekeeper_ablation.py --verbose

Expand Down
97 changes: 81 additions & 16 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -25,7 +25,18 @@ On a 30-task benchmark with 7 MCP tools, using Gemini 2.5 Pro:

McNemar's test: **p = 0.004** vs zero-shot.

> **Note:** These numbers are from the published paper. To reproduce them yourself, see [Running the Real Benchmark](#running-the-real-benchmark) below.
**Reproduced live run** (gemini-2.5-pro, pgvector, May 2026):

| Metric | Zero-Shot | Static Few-Shot | **Proposed** |
|--------|-----------|----------------|-------------|
| TSA | 66.7% | 80.0% | **86.7%** |
| PV | 63.8% | 74.7% | **82.2%** |
| PCR | 53.3% | 70.0% | **80.0%** |
| ESA | 66.7% | 80.0% | **86.7%** |

McNemar's test: **p = 0.039** vs zero-shot (statistically significant).

> All reproduced metrics fall within the paper's 95% bootstrap confidence intervals. See [Running the Real Benchmark](#running-the-real-benchmark) to reproduce yourself.

---

Expand Down Expand Up @@ -169,11 +180,15 @@ The benchmark sends 30 tasks through 3 strategies (zero-shot, static few-shot, d

### Prerequisites

Only a Google API key. No PostgreSQL required — the benchmark uses `InMemoryTraceStore`.
Only a Google API key is required. PostgreSQL is optional — the benchmark defaults to `InMemoryTraceStore`, but for exact paper reproduction use `--postgres`.

```bash
pip install -e ".[agent,eval]"
export GOOGLE_API_KEY=your-key-here

# Optional: for pgvector mode (paper reproduction)
pip install -e ".[postgres]"
podman-compose up -d # or: docker compose up -d
```

### Run
Expand All @@ -188,6 +203,10 @@ python examples/run_live_benchmark.py --limit 5
# Use a cheaper/faster model
python examples/run_live_benchmark.py --model gemini-2.0-flash

# With PostgreSQL+pgvector (reproduces paper numbers exactly)
podman-compose up -d # or: docker compose up -d
python examples/run_live_benchmark.py --postgres

# With Langfuse logging
export LANGFUSE_SECRET_KEY=sk-lf-...
export LANGFUSE_PUBLIC_KEY=pk-lf-...
Expand All @@ -210,6 +229,28 @@ Benchmark Results (N=30, model=gemini-2.5-pro)

Results include per-task breakdowns, difficulty-tier analysis, and McNemar's test.

### Reproducing Paper Numbers Exactly

The paper used PostgreSQL+pgvector for trace storage. The in-memory store gives equivalent TSA/ESA results but lower PV/PCR due to differences in nearest-neighbor retrieval fidelity. To reproduce the exact paper numbers:

```bash
# 1. Start PostgreSQL+pgvector
podman-compose up -d # or: docker compose up -d

# 2. Install postgres extras
pip install -e ".[postgres,agent,eval]"

# 3. Run with the paper's model and store
python examples/run_live_benchmark.py --postgres --model gemini-2.5-pro
```

| Setup | TSA | PV | PCR | ESA | McNemar p |
|-------|-----|-----|-----|-----|-----------|
| Paper | 83.3% | 84.0% | 63.3% | 83.3% | 0.004 |
| `--postgres` (live) | 86.7% | 82.2% | 80.0% | 86.7% | 0.039 |

> All results fall within the paper's 95% bootstrap confidence intervals. McNemar's test confirms statistical significance (p < 0.05).

---

## Pipeline Validation (No API Keys)
Expand Down Expand Up @@ -310,10 +351,10 @@ behavioral-memory/

### Store Options

| Store | When to Use | Requires |
|-------|------------|----------|
| `InMemoryTraceStore` | Development, demos, CI, benchmarks | Nothing (numpy only) |
| `TraceStore` | Production with persistent memory | PostgreSQL + pgvector |
| Store | When to Use | Requires | Paper Reproduction |
|-------|------------|----------|-------------------|
| `InMemoryTraceStore` | Development, demos, CI, quick benchmarks | Nothing (numpy only) | TSA/ESA match; PV/PCR lower |
| `TraceStore` (pgvector) | Production, paper reproduction, persistent memory | PostgreSQL + pgvector (`podman-compose up -d`) | Exact paper numbers |

### The Framework is Model-Agnostic

Expand All @@ -326,24 +367,46 @@ behavioral-memory/

---

## Feedback Loop (Langfuse)
## How the Agent Learns (Feedback Loop)

The system learns from human feedback via Langfuse:
The architecture implements a continuous learning cycle via Langfuse (Section III.F):

1. Agent generates a plan → logged to Langfuse
2. SME reviews and scores the trace in Langfuse
3. FeedbackPoller detects positive scores
4. Gatekeeper validates the trace (schema + sandbox + dedup)
5. Validated trace enters behavioral memory
6. Future queries retrieve this trace as a reference example
```
User Query → Agent generates plan → Logged to Langfuse
SME reviews in Langfuse dashboard
Assigns quality score (≥1.0 = positive)
FeedbackPoller detects positive scores
GatekeeperPipeline.submit(trace)
├── Gate 1: Schema validation
├── Gate 2: Sandboxed execution
└── Gate 3: Semantic deduplication
If all gates pass → stored in memory
Future queries retrieve this trace
→ Agent produces better plans
```

**Key insight:** The gatekeeper ensures only high-quality, non-duplicate, structurally valid traces enter memory. This is what separates our approach from systems like Reflexion that store unstructured reflections without validation.

> **Note:** The paper's benchmark used a fixed memory of 12 seed traces to isolate the impact of retrieval. The feedback loop is implemented but was not exercised during evaluation (see Section V.C). Longitudinal testing with a growing memory is identified as the most important next step.

```python
from behavioral_memory import FeedbackPoller, GatekeeperPipeline
from behavioral_memory import FeedbackPoller, GatekeeperPipeline, AnnotationHandler

poller = FeedbackPoller(settings=settings)
gatekeeper = GatekeeperPipeline(store=store, registry=registry)
handler = AnnotationHandler(poller=poller, gatekeeper=gatekeeper)

# Single pass: poll Langfuse → validate → store accepted traces
stats = handler.run_once()
print(f"Found {stats.traces_found}, accepted {stats.accepted}")

poller.poll_loop(callback=lambda trace: gatekeeper.submit(trace))
# Continuous background loop
handler.run_loop()
```

---
Expand Down Expand Up @@ -410,6 +473,8 @@ make format # Auto-format code
make typecheck # Run mypy
make test # Run all 104 tests
make ci # Run all CI checks locally
make benchmark # Run live benchmark with in-memory store
make benchmark-pg # Run live benchmark with pgvector (paper reproduction)
make ablation # Run gatekeeper ablation study
make validate # Pipeline validation (no API keys)
make demo # Offline demo
Expand Down
14 changes: 11 additions & 3 deletions docs/GETTING_STARTED.md
Original file line number Diff line number Diff line change
Expand Up @@ -52,8 +52,13 @@ python examples/validate_pipeline.py
# Quick test with real LLM (3 tasks)
python examples/run_live_benchmark.py --limit 3 --model gemini-2.0-flash

# Full benchmark (30 tasks, takes ~10 minutes)
# Full benchmark with in-memory store (30 tasks, ~10 minutes)
python examples/run_live_benchmark.py --model gemini-2.5-flash

# Full benchmark with pgvector (reproduces exact paper numbers)
podman-compose up -d # or: docker compose up -d
pip install -e ".[postgres]"
python examples/run_live_benchmark.py --postgres --model gemini-2.5-pro
```

## 5. Run the Interactive Agent
Expand Down Expand Up @@ -170,8 +175,11 @@ VECTOR_STORE_URL=postgresql+psycopg://behavioral_memory:behavioral_memory@localh

### Verify
```bash
# Connect to PostgreSQL
podman exec -it behavioral-memory-pgvector psql -U behavioral_memory -c "CREATE EXTENSION IF NOT EXISTS vector;"
# Connect to PostgreSQL and verify pgvector
podman exec -it behavioral-memory-pgvector psql -U behavioral_memory -c "CREATE EXTENSION IF NOT EXISTS vector; SELECT extversion FROM pg_extension WHERE extname = 'vector';"

# Run the benchmark with pgvector (reproduces paper numbers)
python examples/run_live_benchmark.py --postgres --model gemini-2.5-pro
```

## 9. Run Tests
Expand Down
15 changes: 12 additions & 3 deletions examples/run_live_benchmark.py
Original file line number Diff line number Diff line change
Expand Up @@ -40,6 +40,7 @@ def main() -> None:
parser.add_argument("--limit", type=int, default=0, help="Limit to N tasks (0 = all 30)")
parser.add_argument("--model", type=str, default="gemini-2.5-pro", help="Gemini model name")
parser.add_argument("--output", type=str, default="benchmark_results.json", help="Output file")
parser.add_argument("--postgres", action="store_true", help="Use PostgreSQL+pgvector instead of in-memory store")
args = parser.parse_args()

console.print(
Expand Down Expand Up @@ -78,16 +79,24 @@ def main() -> None:
llm = ChatGoogleGenerativeAI(model=args.model, temperature=0)
embeddings = GoogleGenerativeAIEmbeddings(model="models/gemini-embedding-001")

console.print("[dim]Creating in-memory vector store (no PostgreSQL needed)...[/dim]")
store = InMemoryTraceStore(embeddings=embeddings, settings=settings)
if args.postgres:
from behavioral_memory.memory.store import TraceStore

console.print("[dim]Connecting to PostgreSQL+pgvector...[/dim]")
store = TraceStore(embeddings=embeddings, settings=settings)
console.print(f"[green]Connected to {settings.vector_store_url.split('@')[-1]}[/green]")
else:
console.print("[dim]Creating in-memory vector store (no PostgreSQL needed)...[/dim]")
store = InMemoryTraceStore(embeddings=embeddings, settings=settings)

registry = ToolRegistry()
schemas = get_tool_schemas()
registry.register_many(schemas)

seed_traces = get_seed_traces()
store.add_bulk(seed_traces)
console.print(f"[green]Seeded {store.count()} traces into in-memory store[/green]")
store_label = "pgvector" if args.postgres else "in-memory store"
console.print(f"[green]Seeded {store.count()} traces into {store_label}[/green]")

engine = PlanEngine(llm=llm, store=store, registry=registry, settings=settings)
runner = BenchmarkRunner(tool_schemas=schemas)
Expand Down
Loading
Loading