Commit 7a8b92a

Merge pull request #2 from harsh-kr11/feat/live-agent-and-benchmark
feat: Add InMemoryTraceStore, live benchmark, and interactive agent
2 parents c1e4a44 + 9179bd0 commit 7a8b92a

15 files changed

Lines changed: 1081 additions & 96 deletions

README.md

Lines changed: 207 additions & 37 deletions
@@ -12,9 +12,9 @@ Based on the IEEE paper: *"Behavioral Memory for Tool Orchestration: Semantic Re
1212

1313
---
1414

15-
## Key Results
15+
## Key Results (from the paper)
1616

17-
On a 30-task benchmark with 7 MCP tools:
17+
On a 30-task benchmark with 7 MCP tools, using Gemini 2.5 Pro:
1818

1919
| Metric | Zero-Shot | Static Few-Shot | **Proposed** |
2020
|--------|-----------|----------------|-------------|
@@ -25,6 +25,45 @@ On a 30-task benchmark with 7 MCP tools:
2525

2626
McNemar's test: **p = 0.004** vs zero-shot.
2727

28+
> **Note:** These numbers are from the published paper. To reproduce them yourself, see [Running the Real Benchmark](#running-the-real-benchmark) below.
29+
30+
---
31+
32+
## Quick Start
33+
34+
### Option A: No API keys needed (validation + demo)
35+
36+
```bash
37+
git clone https://github.com/harsh-kr11/behavioral-memory.git
38+
cd behavioral-memory
39+
pip install -e ".[agent,eval,dev]"
40+
41+
# Validate the entire pipeline (30/30 checks, no external services)
42+
python examples/validate_pipeline.py
43+
44+
# Quick demo showing behavioral memory impact
45+
behavioral-memory demo
46+
```
47+
48+
### Option B: With a Google API key (real benchmark)
49+
50+
```bash
51+
export GOOGLE_API_KEY=your-key-here
52+
python examples/run_live_benchmark.py # all 30 tasks
53+
python examples/run_live_benchmark.py --limit 5 # quick test with 5 tasks
54+
python examples/run_live_benchmark.py --model gemini-2.0-flash # cheaper model
55+
```
56+
57+
### Option C: Interactive agent
58+
59+
```bash
60+
export GOOGLE_API_KEY=your-key-here
61+
python -m agent.app --interactive
62+
63+
# Or single query:
64+
python -m agent.app "Build a revenue analysis pipeline"
65+
```
66+
2867
---
2968

3069
## How It Works
@@ -35,7 +74,8 @@ User Query
3574
3675
┌─────────────────────────────────────────────────────┐
3776
│ 1. BEHAVIORAL LAYER │
38-
│ Retrieve top-k similar traces from pgvector │
77+
│ Retrieve top-k similar traces from memory │
78+
│ (pgvector or in-memory — your choice) │
3979
│ │
4080
│ 2. TOOL LAYER │
4181
│ Fetch available tool schemas via MCP │
@@ -69,48 +109,136 @@ User Query
69109

70110
## Two Ways to Use
71111

72-
### 1. Bring Your Own Agent (library)
112+
### 1. As a Library (Bring Your Own Agent)
73113

74-
Install the framework and plug it into your existing agent:
114+
Install and plug into your existing agent:
75115

76116
```bash
77117
pip install behavioral-memory
78118
```
79119

80120
```python
81-
from behavioral_memory import TraceStore, PlanEngine, ToolRegistry
82-
from langchain_openai import ChatOpenAI, OpenAIEmbeddings # or any provider
121+
from behavioral_memory import PlanEngine, ToolRegistry, InMemoryTraceStore
122+
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
83123

84124
llm = ChatOpenAI(model="gpt-4o", temperature=0)
85125
embeddings = OpenAIEmbeddings()
86126

87-
store = TraceStore(embeddings=embeddings, connection_url="postgresql+psycopg://...")
127+
# No PostgreSQL needed — InMemoryTraceStore works anywhere
128+
store = InMemoryTraceStore(embeddings=embeddings)
88129
registry = ToolRegistry()
89130
engine = PlanEngine(llm=llm, store=store, registry=registry)
90131

91132
plan = engine.generate(query="Get revenue data and email a report")
92133
```
93134

94-
### 2. Run the Reference Agent (LangGraph 1.x)
135+
For production with PostgreSQL + pgvector:
95136

96-
Clone the repo and run the complete system:
137+
```python
138+
from behavioral_memory import TraceStore
139+
140+
store = TraceStore(embeddings=embeddings, connection_url="postgresql+psycopg://...")
141+
```
142+
143+
### 2. Run the Reference Agent (LangGraph 1.x)
97144

98145
```bash
99146
git clone https://github.com/harsh-kr11/behavioral-memory.git
100147
cd behavioral-memory
101148
pip install -e ".[agent]"
102149

150+
export GOOGLE_API_KEY=your-key
151+
152+
# Interactive mode
153+
python -m agent.app --interactive
154+
155+
# Single query
103156
python -m agent.app "Build a revenue analysis pipeline"
104157
```
105158

159+
The interactive agent supports:
160+
- `/compare <query>` — run with AND without memory, see the difference
161+
- `/memory` — inspect what's in behavioral memory
162+
- `/quit` — exit
163+
164+
---
165+
166+
## Running the Real Benchmark
167+
168+
The benchmark sends 30 tasks through 3 strategies (zero-shot, static few-shot, dynamic retrieval), scoring each plan against gold tool chains.
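Comparing a generated plan against a gold tool chain can be sketched roughly as below. This is an illustration only: the function names and the two measures (exact sequence match and tool-set overlap) are assumptions made here, not the paper's TSA, PV, PCR, or ESA definitions, and the tool names are invented.

```python
def exact_chain_match(predicted: list[str], gold: list[str]) -> bool:
    """True when the plan calls the same tools in the same order as the gold chain."""
    return predicted == gold


def tool_overlap(predicted: list[str], gold: list[str]) -> float:
    """Jaccard overlap of the two tool sets, ignoring order and repeats."""
    p, g = set(predicted), set(gold)
    if not p and not g:
        return 1.0  # two empty chains are treated as identical
    return len(p & g) / len(p | g)


# Hypothetical tool names, used only for this example.
gold = ["query_db", "summarize", "send_email"]
plan = ["query_db", "send_email"]  # a plan that skips one step

print(exact_chain_match(plan, gold))
print(round(tool_overlap(plan, gold), 3))
```

An order-sensitive check and an order-insensitive one are shown side by side because a plan can pick the right tools yet sequence them wrongly; the real benchmark may weigh these differently.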
169+
170+
### Prerequisites
171+
172+
You only need a Google API key. PostgreSQL is not required, because the benchmark uses `InMemoryTraceStore`.
173+
174+
```bash
175+
pip install -e ".[agent,eval]"
176+
export GOOGLE_API_KEY=your-key-here
177+
```
178+
179+
### Run
180+
181+
```bash
182+
# Full benchmark (30 tasks × 3 strategies = 90 LLM calls)
183+
python examples/run_live_benchmark.py
184+
185+
# Quick test (5 tasks × 3 strategies = 15 LLM calls)
186+
python examples/run_live_benchmark.py --limit 5
187+
188+
# Use a cheaper/faster model
189+
python examples/run_live_benchmark.py --model gemini-2.0-flash
190+
191+
# With Langfuse logging
192+
export LANGFUSE_SECRET_KEY=sk-lf-...
193+
export LANGFUSE_PUBLIC_KEY=pk-lf-...
194+
python examples/run_live_benchmark.py
195+
```
196+
197+
### What you'll see
198+
199+
```
200+
Benchmark Results (N=30, model=gemini-2.5-pro)
201+
┏━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓
202+
┃ Metric ┃ Zero-Shot ┃ Static Few-Shot ┃ Dynamic (Proposed) ┃
203+
┡━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩
204+
│ TSA │ 63.3% [53%, 73%] │ 70.0% [56%, 83%] │ 83.3% [70%, 93%] │
205+
│ PV │ 72.2% │ 79.6% │ 84.0% │
206+
│ PCR │ 33.3% [16%, 50%] │ 50.0% [33%, 66%] │ 63.3% [46%, 80%] │
207+
│ ESA │ 63.3% [46%, 80%] │ 70.0% [53%, 86%] │ 83.3% [70%, 93%] │
208+
└────────┴──────────────────┴─────────────────────┴──────────────────────────┘
209+
```
210+
211+
Results include per-task breakdowns, difficulty-tier analysis, and McNemar's test.
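For readers unfamiliar with McNemar's test, here is a minimal sketch of the exact (binomial) variant for paired per-task outcomes. The README does not say which variant the benchmark uses, so treat this as an assumption. The function names and the example outcomes are made up for illustration.

```python
from math import comb


def discordant_counts(baseline: list[bool], proposed: list[bool]) -> tuple[int, int]:
    """Count tasks where exactly one of the two strategies succeeded."""
    b = sum(1 for x, y in zip(baseline, proposed) if x and not y)  # baseline only
    c = sum(1 for x, y in zip(baseline, proposed) if y and not x)  # proposed only
    return b, c


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts."""
    n = b + c
    if n == 0:
        return 1.0  # no disagreement, so no evidence of a difference
    k = min(b, c)
    # Under H0 each discordant pair falls either way with probability 0.5,
    # so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


# Made-up outcomes for four tasks (True = plan judged correct).
baseline = [True, False, False, True]
proposed = [True, True, True, False]
b, c = discordant_counts(baseline, proposed)
print(b, c, mcnemar_exact(b, c))
```

Only the discordant pairs matter: tasks that both strategies solve, or both fail, do not affect the p-value.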
212+
213+
---
214+
215+
## Pipeline Validation (No API Keys)
216+
217+
Validates that every component works correctly, using mock services:
218+
219+
```bash
220+
python examples/validate_pipeline.py
221+
```
222+
223+
This verifies:
224+
- 12 seed traces load and pass schema validation
225+
- 30 ground truth tasks have correct structure
226+
- InMemoryTraceStore embeds, stores, and retrieves traces
227+
- PlanEngine generates plans (zero-shot, static, dynamic)
228+
- BenchmarkRunner scores and compares strategies
229+
- Gatekeeper pipeline accepts/rejects traces
230+
- Langfuse tracer handles offline mode gracefully
231+
232+
All **30 checks** pass with zero external dependencies.
233+
106234
---
107235

108236
## Installation
109237

110238
### Prerequisites
111239

112240
- Python 3.11+
113-
- PostgreSQL with [pgvector](https://github.com/pgvector/pgvector) extension
241+
- (Optional) PostgreSQL with [pgvector](https://github.com/pgvector/pgvector) for production deployments
114242

115243
### Install with uv (recommended)
116244

@@ -128,19 +256,22 @@ pip install behavioral-memory
128256
pip install "behavioral-memory[agent,eval]"
129257
```
130258

131-
### Configure
259+
### Environment Setup
132260

133261
```bash
262+
# Interactive setup (guides you through each variable)
263+
behavioral-memory setup
264+
265+
# Or manual
134266
cp .env.example .env
135-
# Edit .env with your credentials
136267
```
137268

138269
| Variable | Required | Description |
139270
|----------|----------|-------------|
140-
| `VECTOR_STORE_URL` | Yes | PostgreSQL+pgvector connection string |
141-
| `GOOGLE_API_KEY` | For reference agent | Gemini API key |
142-
| `LANGFUSE_SECRET_KEY` | For feedback loop | Langfuse secret key |
143-
| `LANGFUSE_PUBLIC_KEY` | For feedback loop | Langfuse public key |
271+
| `GOOGLE_API_KEY` | For LLM calls | Gemini API key (or use any LangChain-compatible LLM) |
272+
| `VECTOR_STORE_URL` | For PostgreSQL mode | `postgresql+psycopg://localhost/behavioral_memory` |
273+
| `LANGFUSE_SECRET_KEY` | For observability | Langfuse secret key |
274+
| `LANGFUSE_PUBLIC_KEY` | For observability | Langfuse public key |
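In your own code, one way to switch between the two stores is to check whether `VECTOR_STORE_URL` is set. The helper below is a hypothetical sketch, not part of the library, and the README does not state whether the library performs this fallback itself.

```python
import os
from collections.abc import Mapping


def pick_store_kind(env: Mapping[str, str] = os.environ) -> str:
    """Return "postgres" when VECTOR_STORE_URL is set and non-empty, else "memory"."""
    url = env.get("VECTOR_STORE_URL", "").strip()
    return "postgres" if url else "memory"


kind = pick_store_kind()
# With the library installed, the result could drive the choice shown in the README:
#   "postgres" -> TraceStore(embeddings=embeddings, connection_url=os.environ["VECTOR_STORE_URL"])
#   "memory"   -> InMemoryTraceStore(embeddings=embeddings)
print(kind)
```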
144275

145276
---
146277

@@ -152,7 +283,7 @@ cp .env.example .env
152283
behavioral-memory/
153284
├── src/behavioral_memory/ # The pip-installable library
154285
│ ├── core/ # Schemas, config, exceptions
155-
│ ├── memory/ # Behavioral Layer (TraceStore, dedup, token budget)
286+
│ ├── memory/ # Behavioral Layer (TraceStore, InMemoryTraceStore, dedup)
156287
│ ├── tools/ # Tool Layer (MCP client, registry, mock tools)
157288
│ ├── planner/ # Executive Layer (PlanEngine, prompt, postprocess)
158289
│ ├── gatekeeper/ # Gatekeeper (schema validator, sandbox, dedup gate)
@@ -162,13 +293,25 @@ behavioral-memory/
162293
│ ├── graph.py # StateGraph definition
163294
│ ├── state.py # Agent state
164295
│ └── nodes/ # Graph nodes (retrieve, plan, execute, observe)
165-
├── tests/ # Unit + integration tests
166-
└── examples/ # Usage examples
296+
├── tests/ # 104 tests (unit + integration + e2e)
297+
│ ├── unit/ # 61 unit tests
298+
│ ├── integration/ # 3 integration tests
299+
│ └── e2e/ # 40 end-to-end tests
300+
├── examples/
301+
│ ├── validate_pipeline.py # Full pipeline validation (no API keys)
302+
│ ├── run_live_benchmark.py # Real benchmark (needs API key)
303+
│ └── run_benchmark.py # Benchmark with PostgreSQL
304+
└── .github/workflows/ # CI/CD
167305
```
168306

169-
### The Framework is Model-Agnostic
307+
### Store Options
170308

171-
The library accepts any LangChain-compatible model:
309+
| Store | When to Use | Requires |
310+
|-------|------------|----------|
311+
| `InMemoryTraceStore` | Development, demos, CI, benchmarks | Nothing (numpy only) |
312+
| `TraceStore` | Production with persistent memory | PostgreSQL + pgvector |
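To give a feel for what an in-memory store does, the toy class below keeps vectors in a list and ranks them by cosine similarity. It is a simplified stand-in written for this explanation, not the library's `InMemoryTraceStore`. The class name, methods, and sample vectors are assumptions, and it uses only the standard library rather than numpy.

```python
import math
from dataclasses import dataclass, field


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class ToyTraceStore:
    """Hypothetical in-memory store: a list of (vector, trace) pairs."""

    rows: list[tuple[list[float], str]] = field(default_factory=list)

    def add(self, vector: list[float], trace: str) -> None:
        self.rows.append((vector, trace))

    def search(self, query: list[float], k: int = 3) -> list[str]:
        """Return the k traces whose vectors are most similar to the query."""
        ranked = sorted(self.rows, key=lambda row: cosine(query, row[0]), reverse=True)
        return [trace for _, trace in ranked[:k]]


store = ToyTraceStore()
store.add([1.0, 0.0], "trace-a")
store.add([0.0, 1.0], "trace-b")
store.add([1.0, 1.0], "trace-c")
print(store.search([1.0, 0.1], k=2))
```

A linear scan like this is fine for a few dozen seed traces; a persistent index such as pgvector becomes worthwhile once memory must survive restarts or grow large.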
313+
314+
### The Framework is Model-Agnostic
172315

173316
| Provider | LLM | Embeddings |
174317
|----------|-----|------------|
@@ -179,7 +322,7 @@ The library accepts any LangChain-compatible model:
179322

180323
---
181324

182-
## Feedback Loop
325+
## Feedback Loop (Langfuse)
183326

184327
The system learns from human feedback via Langfuse:
185328

@@ -196,31 +339,45 @@ from behavioral_memory import FeedbackPoller, GatekeeperPipeline
196339
poller = FeedbackPoller(settings=settings)
197340
gatekeeper = GatekeeperPipeline(store=store, registry=registry)
198341

199-
# Auto-learn in the background
200342
poller.poll_loop(callback=lambda trace: gatekeeper.submit(trace))
201343
```
202344

203345
---
204346

205-
## Evaluation
347+
## Testing
206348

207-
### Reproduce Paper Results
349+
### Run all tests (104 tests, no external services needed)
208350

209351
```bash
210-
pip install behavioral-memory[agent,eval]
211-
python examples/run_benchmark.py
352+
pip install -e ".[dev]"
353+
pytest tests/ -v
354+
```
355+
356+
### Test breakdown
357+
358+
| Suite | Tests | What it covers |
359+
|-------|-------|---------------|
360+
| `tests/unit/` | 61 | Schemas, metrics, postprocessing, prompt assembly, token budget, in-memory store |
361+
| `tests/integration/` | 3 | Schema validator + sandbox with real traces |
362+
| `tests/e2e/` | 40 | Full pipeline: seed traces → prompt → mock LLM → metrics → gatekeeper |
363+
364+
### Pipeline validation
365+
366+
```bash
367+
python examples/validate_pipeline.py # 30 checks, 0 external deps
212368
```
213369

214-
### CLI Tools
370+
### Linting and type checking
215371

216372
```bash
217-
behavioral-memory benchmark info # Dataset summary
218-
behavioral-memory benchmark ground-truth # View all 30 tasks
219-
behavioral-memory benchmark seed-traces # View 12 seed traces
220-
behavioral-memory benchmark tools # View 7 tool definitions
373+
ruff check src/ tests/ agent/
374+
ruff format src/ tests/ agent/
375+
mypy src/
221376
```
222377

223-
### Metrics (Section IV.C)
378+
---
379+
380+
## Evaluation Metrics (Section IV.C)
224381

225382
| Metric | Description |
226383
|--------|-------------|
@@ -231,6 +388,19 @@ behavioral-memory benchmark tools # View 7 tool definitions
231388

232389
---
233390

391+
## CLI Tools
392+
393+
```bash
394+
behavioral-memory setup # Interactive .env setup
395+
behavioral-memory demo # Offline demo of behavioral memory
396+
behavioral-memory benchmark info # Dataset summary
397+
behavioral-memory benchmark ground-truth # View all 30 tasks
398+
behavioral-memory benchmark seed-traces # View 12 seed traces
399+
behavioral-memory benchmark tools # View 7 tool definitions
400+
```
401+
402+
---
403+
234404
## Configuration
235405

236406
All settings via environment variables or `.env`:
@@ -253,16 +423,16 @@ All settings via environment variables or `.env`:
253423

254424
| Component | Technology |
255425
|-----------|-----------|
256-
| Vector Store | PostgreSQL + pgvector |
426+
| Vector Store | PostgreSQL + pgvector (production) / In-memory (development) |
257427
| Embeddings | Any LangChain Embeddings (default: Gemini) |
258428
| LLM | Any LangChain ChatModel (default: Gemini 2.5 Pro) |
259429
| Agent Framework | LangGraph 1.x (reference agent) |
260430
| Observability | Langfuse |
261431
| Config | Pydantic Settings |
262432
| Tokenization | tiktoken |
263433
| CLI | Typer + Rich |
264-
| Testing | pytest |
265-
| Linting | ruff |
434+
| Testing | pytest (104 tests) |
435+
| Linting | ruff + pre-commit hooks |
266436
| Type Checking | mypy (strict) |
267437
| Package Management | uv |
268438
