@@ -12,9 +12,9 @@ Based on the IEEE paper: *"Behavioral Memory for Tool Orchestration: Semantic Re
1212
1313---
1414
15- ## Key Results
15+ ## Key Results (from the paper)
1616
17- On a 30-task benchmark with 7 MCP tools:
17+ On a 30-task benchmark with 7 MCP tools, using Gemini 2.5 Pro:
1818
1919| Metric | Zero-Shot | Static Few-Shot | **Proposed** |
2020|--------|-----------|-----------------|--------------|
@@ -25,6 +25,45 @@ On a 30-task benchmark with 7 MCP tools:
2525
2626McNemar's test: **p = 0.004** vs. zero-shot.
2727
28+ > **Note:** These numbers are from the published paper. To reproduce them yourself, see [Running the Real Benchmark](#running-the-real-benchmark) below.
29+
30+ ---
31+
32+ ## Quick Start
33+
34+ ### Option A: No API keys needed (validation + demo)
35+
36+ ```bash
37+ git clone https://github.com/harsh-kr11/behavioral-memory.git
38+ cd behavioral-memory
39+ pip install -e ".[agent,eval,dev]"
40+
41+ # Validate the entire pipeline (30/30 checks, no external services)
42+ python examples/validate_pipeline.py
43+
44+ # Quick demo showing behavioral memory impact
45+ behavioral-memory demo
46+ ```
47+
48+ ### Option B: With a Google API key (real benchmark)
49+
50+ ```bash
51+ export GOOGLE_API_KEY=your-key-here
52+ python examples/run_live_benchmark.py # all 30 tasks
53+ python examples/run_live_benchmark.py --limit 5 # quick test with 5 tasks
54+ python examples/run_live_benchmark.py --model gemini-2.0-flash # cheaper model
55+ ```
56+
57+ ### Option C: Interactive agent
58+
59+ ```bash
60+ export GOOGLE_API_KEY=your-key-here
61+ python -m agent.app --interactive
62+
63+ # Or single query:
64+ python -m agent.app "Build a revenue analysis pipeline"
65+ ```
66+
2867---
2968
3069## How It Works
@@ -35,7 +74,8 @@ User Query
3574 ▼
3675┌─────────────────────────────────────────────────────┐
3776│ 1. BEHAVIORAL LAYER │
38- │ Retrieve top-k similar traces from pgvector │
77+ │ Retrieve top-k similar traces from memory │
78+ │ (pgvector or in-memory — your choice) │
3979│ │
4080│ 2. TOOL LAYER │
4181│ Fetch available tool schemas via MCP │
@@ -69,48 +109,136 @@ User Query
69109
70110## Two Ways to Use
71111
72- ### 1. Bring Your Own Agent (library)
112+ ### 1. As a Library (Bring Your Own Agent)
73113
74- Install the framework and plug it into your existing agent:
114+ Install and plug into your existing agent:
75115
76116```bash
77117pip install behavioral-memory
78118```
79119
80120```python
81- from behavioral_memory import TraceStore, PlanEngine, ToolRegistry
82- from langchain_openai import ChatOpenAI, OpenAIEmbeddings  # or any provider
121+ from behavioral_memory import PlanEngine, ToolRegistry, InMemoryTraceStore
122+ from langchain_openai import ChatOpenAI, OpenAIEmbeddings
83123
84124llm = ChatOpenAI(model="gpt-4o", temperature=0)
85125embeddings = OpenAIEmbeddings()
86126
87- store = TraceStore(embeddings=embeddings, connection_url="postgresql+psycopg://...")
127+ # No PostgreSQL needed: InMemoryTraceStore works anywhere
128+ store = InMemoryTraceStore(embeddings=embeddings)
88129registry = ToolRegistry()
89130engine = PlanEngine(llm=llm, store=store, registry=registry)
90131
91132plan = engine.generate(query="Get revenue data and email a report")
92133```
93134
94- ### 2. Run the Reference Agent (LangGraph 1.x)
135+ For production with PostgreSQL + pgvector:
95136
96- Clone the repo and run the complete system:
137+ ```python
138+ from behavioral_memory import TraceStore
139+
140+ store = TraceStore(embeddings=embeddings, connection_url="postgresql+psycopg://...")
141+ ```
142+
143+ ### 2. Run the Reference Agent (LangGraph 1.x)
97144
98145```bash
99146git clone https://github.com/harsh-kr11/behavioral-memory.git
100147cd behavioral-memory
101148pip install -e ".[agent]"
102149
150+ export GOOGLE_API_KEY=your-key
151+
152+ # Interactive mode
153+ python -m agent.app --interactive
154+
155+ # Single query
103156python -m agent.app "Build a revenue analysis pipeline"
104157```
105158
159+ The interactive agent supports:
160+ - `/compare <query>`: run with and without memory to compare the results
161+ - `/memory`: inspect what's in behavioral memory
162+ - `/quit`: exit
163+
164+ ---
165+
166+ ## Running the Real Benchmark
167+
168+ The benchmark sends 30 tasks through 3 strategies (zero-shot, static few-shot, dynamic retrieval), scoring each plan against gold tool chains.
169+
170+ ### Prerequisites
171+
172+ Only a Google API key is needed. No PostgreSQL is required: the benchmark uses `InMemoryTraceStore`.
173+
174+ ```bash
175+ pip install -e ".[agent,eval]"
176+ export GOOGLE_API_KEY=your-key-here
177+ ```
178+
179+ ### Run
180+
181+ ```bash
182+ # Full benchmark (30 tasks × 3 strategies = 90 LLM calls)
183+ python examples/run_live_benchmark.py
184+
185+ # Quick test (5 tasks × 3 strategies = 15 LLM calls)
186+ python examples/run_live_benchmark.py --limit 5
187+
188+ # Use a cheaper/faster model
189+ python examples/run_live_benchmark.py --model gemini-2.0-flash
190+
191+ # With Langfuse logging
192+ export LANGFUSE_SECRET_KEY=sk-lf-...
193+ export LANGFUSE_PUBLIC_KEY=pk-lf-...
194+ python examples/run_live_benchmark.py
195+ ```
196+
197+ ### What you'll see
198+
199+ ```
200+ Benchmark Results (N=30, model=gemini-2.5-pro)
201+ ┏━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓
202+ ┃ Metric ┃ Zero-Shot ┃ Static Few-Shot ┃ Dynamic (Proposed) ┃
203+ ┡━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩
204+ │ TSA │ 63.3% [53%, 73%] │ 70.0% [56%, 83%] │ 83.3% [70%, 93%] │
205+ │ PV │ 72.2% │ 79.6% │ 84.0% │
206+ │ PCR │ 33.3% [16%, 50%] │ 50.0% [33%, 66%] │ 63.3% [46%, 80%] │
207+ │ ESA │ 63.3% [46%, 80%] │ 70.0% [53%, 86%] │ 83.3% [70%, 93%] │
208+ └────────┴──────────────────┴─────────────────────┴──────────────────────────┘
209+ ```
210+
211+ Results include per-task breakdowns, difficulty-tier analysis, and McNemar's test.
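
For readers unfamiliar with McNemar's test, the sketch below shows one common form of it, the two-sided exact (binomial) variant, applied to paired pass/fail outcomes from two strategies. It is an illustration only: the function name `mcnemar_exact` and the outcome lists are assumptions made for this example, the counts are invented rather than taken from the paper, and the benchmark runner may compute the statistic differently.

```python
from math import comb


def mcnemar_exact(outcomes_a, outcomes_b):
    """Two-sided exact McNemar test on paired pass/fail outcomes.

    Hypothetical helper for illustration; not part of behavioral_memory.
    """
    if len(outcomes_a) != len(outcomes_b):
        raise ValueError("paired outcomes must have equal length")
    # Only discordant pairs (exactly one strategy succeeded) carry information.
    only_a = sum(1 for a, b in zip(outcomes_a, outcomes_b) if a and not b)
    only_b = sum(1 for a, b in zip(outcomes_a, outcomes_b) if b and not a)
    n = only_a + only_b
    if n == 0:
        return 1.0  # the strategies never disagree
    k = min(only_a, only_b)
    # Binomial(n, 0.5) probability of a split at least this lopsided, doubled.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Invented outcomes for 30 tasks: 15 both pass, 8 only dynamic passes,
# 1 only zero-shot passes, 6 both fail.
zero_shot = [True] * 15 + [False] * 8 + [True] * 1 + [False] * 6
dynamic = [True] * 15 + [True] * 8 + [False] * 1 + [False] * 6

p_value = mcnemar_exact(zero_shot, dynamic)
print(f"p = {p_value:.4f}")  # p = 0.0391
```

The test ignores tasks where both strategies agree, which is why it suits a paired benchmark in which every strategy sees the same 30 tasks.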
212+
213+ ---
214+
215+ ## Pipeline Validation (No API Keys)
216+
217+ Validates that every component works correctly, using mock services:
218+
219+ ```bash
220+ python examples/validate_pipeline.py
221+ ```
222+
223+ This verifies:
224+ - 12 seed traces load and pass schema validation
225+ - 30 ground truth tasks have correct structure
226+ - InMemoryTraceStore embeds, stores, and retrieves traces
227+ - PlanEngine generates plans (zero-shot, static, dynamic)
228+ - BenchmarkRunner scores and compares strategies
229+ - Gatekeeper pipeline accepts/rejects traces
230+ - Langfuse tracer handles offline mode gracefully
231+
232+ All **30 checks** pass without any external services.
233+
106234---
107235
108236## Installation
109237
110238### Prerequisites
111239
112240- Python 3.11+
113- - PostgreSQL with [pgvector](https://github.com/pgvector/pgvector) extension
241+ - (Optional) PostgreSQL with [pgvector](https://github.com/pgvector/pgvector) for production deployments
114242
115243### Install with uv (recommended)
116244
@@ -128,19 +256,22 @@ pip install behavioral-memory
128256pip install "behavioral-memory[agent,eval]"
129257```
130258
131- ### Configure
259+ ### Environment Setup
132260
133261```bash
262+ # Interactive setup (guides you through each variable)
263+ behavioral-memory setup
264+
265+ # Or manual
134266cp .env.example .env
135- # Edit .env with your credentials
136267```
137268
138269| Variable | Required | Description |
139270|----------|----------|-------------|
140- | `VECTOR_STORE_URL` | Yes | PostgreSQL+pgvector connection string |
141- | `GOOGLE_API_KEY` | For reference agent | Gemini API key |
142- | `LANGFUSE_SECRET_KEY` | For feedback loop | Langfuse secret key |
143- | `LANGFUSE_PUBLIC_KEY` | For feedback loop | Langfuse public key |
271+ | `GOOGLE_API_KEY` | For LLM calls | Gemini API key (or use any LangChain-compatible LLM) |
272+ | `VECTOR_STORE_URL` | For PostgreSQL mode | PostgreSQL + pgvector connection string, e.g. `postgresql+psycopg://localhost/behavioral_memory` |
273+ | `LANGFUSE_SECRET_KEY` | For observability | Langfuse secret key |
274+ | `LANGFUSE_PUBLIC_KEY` | For observability | Langfuse public key |
144275
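Since `VECTOR_STORE_URL` is only needed for PostgreSQL mode, calling code can treat its absence as a signal to fall back to the in-memory store. The sketch below shows one way to make that decision. `choose_store_mode` is a hypothetical helper written for this example, not part of the library, and the library's own settings loading (Pydantic Settings) may behave differently.

```python
import os


def choose_store_mode(env=None):
    """Return ("postgres", url) if VECTOR_STORE_URL is set, else ("in-memory", None).

    Hypothetical helper for illustration; not part of behavioral_memory.
    """
    env = os.environ if env is None else env
    # Treat an unset, empty, or whitespace-only value as "not configured".
    url = (env.get("VECTOR_STORE_URL") or "").strip()
    if url:
        return "postgres", url
    return "in-memory", None


mode, url = choose_store_mode()
print(f"store mode: {mode}")
```

With the result in hand, a caller would construct `TraceStore` with the URL in the `"postgres"` case and `InMemoryTraceStore` otherwise.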
145276---
146277
@@ -152,7 +283,7 @@ cp .env.example .env
152283behavioral-memory/
153284├── src/behavioral_memory/ # The pip-installable library
154285│ ├── core/ # Schemas, config, exceptions
155- │ ├── memory/ # Behavioral Layer (TraceStore, dedup, token budget )
286+ │ ├── memory/ # Behavioral Layer (TraceStore, InMemoryTraceStore, dedup )
156287│ ├── tools/ # Tool Layer (MCP client, registry, mock tools)
157288│ ├── planner/ # Executive Layer (PlanEngine, prompt, postprocess)
158289│ ├── gatekeeper/ # Gatekeeper (schema validator, sandbox, dedup gate)
@@ -162,13 +293,25 @@ behavioral-memory/
162293│ ├── graph.py # StateGraph definition
163294│ ├── state.py # Agent state
164295│ └── nodes/ # Graph nodes (retrieve, plan, execute, observe)
165- ├── tests/ # Unit + integration tests
166- └── examples/ # Usage examples
296+ ├── tests/ # 104 tests (unit + integration + e2e)
297+ │ ├── unit/ # 61 unit tests
298+ │ ├── integration/ # 3 integration tests
299+ │ └── e2e/ # 40 end-to-end tests
300+ ├── examples/
301+ │ ├── validate_pipeline.py # Full pipeline validation (no API keys)
302+ │ ├── run_live_benchmark.py # Real benchmark (needs API key)
303+ │ └── run_benchmark.py # Benchmark with PostgreSQL
304+ └── .github/workflows/ # CI/CD
167305```
168306
169- ### The Framework is Model-Agnostic
307+ ### Store Options
170308
171- The library accepts any LangChain-compatible model:
309+ | Store | When to Use | Requires |
310+ |-------|-------------|----------|
311+ | `InMemoryTraceStore` | Development, demos, CI, benchmarks | No external services (numpy only) |
312+ | `TraceStore` | Production with persistent memory | PostgreSQL + pgvector |
313+
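To make the idea of an in-memory trace store concrete, here is a minimal sketch of top-k retrieval by cosine similarity over stored embedding vectors. It is not the implementation of `InMemoryTraceStore`: the class `TinyTraceIndex`, its methods, and the three-dimensional vectors are all assumptions for illustration, and it uses only the standard library rather than numpy.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


class TinyTraceIndex:
    """Hypothetical in-memory index; not the library's InMemoryTraceStore."""

    def __init__(self):
        self._items = []  # list of (trace_id, embedding vector)

    def add(self, trace_id, vector):
        self._items.append((trace_id, list(vector)))

    def top_k(self, query_vector, k=3):
        # Score every stored trace, then keep the k most similar ids.
        scored = [(cosine(query_vector, vec), tid) for tid, vec in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [tid for _, tid in scored[:k]]


index = TinyTraceIndex()
index.add("revenue_report", [1.0, 0.0, 0.0])
index.add("email_summary", [0.6, 0.8, 0.0])
index.add("weather_lookup", [0.0, 0.0, 1.0])

print(index.top_k([1.0, 0.05, 0.0], k=2))  # ['revenue_report', 'email_summary']
```

A linear scan like this is adequate for a handful of seed traces; a persistent store such as pgvector becomes worthwhile when traces must survive restarts or the collection grows large.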
314+ ### The Framework is Model-Agnostic
172315
173316| Provider | LLM | Embeddings |
174317| ----------| -----| ------------|
@@ -179,7 +322,7 @@ The library accepts any LangChain-compatible model:
179322
180323---
181324
182- ## Feedback Loop
325+ ## Feedback Loop (Langfuse)
183326
184327The system learns from human feedback via Langfuse:
185328
@@ -196,31 +339,45 @@ from behavioral_memory import FeedbackPoller, GatekeeperPipeline
196339poller = FeedbackPoller(settings=settings)
197340gatekeeper = GatekeeperPipeline(store=store, registry=registry)
198341
199- # Auto-learn in the background
200342poller.poll_loop(callback=lambda trace: gatekeeper.submit(trace))
201343```
202344
203345---
204346
205- ## Evaluation
347+ ## Testing
206348
207- ### Reproduce Paper Results
349+ ### Run all tests (104 tests, no external services needed)
208350
209351```bash
210- pip install behavioral-memory[agent,eval]
211- python examples/run_benchmark.py
352+ pip install -e ".[dev]"
353+ pytest tests/ -v
354+ ```
355+
356+ ### Test breakdown
357+
358+ | Suite | Tests | What it covers |
359+ |-------|-------|----------------|
360+ | `tests/unit/` | 61 | Schemas, metrics, postprocessing, prompt assembly, token budget, in-memory store |
361+ | `tests/integration/` | 3 | Schema validator + sandbox with real traces |
362+ | `tests/e2e/` | 40 | Full pipeline: seed traces → prompt → mock LLM → metrics → gatekeeper |
363+
364+ ### Pipeline validation
365+
366+ ```bash
367+ python examples/validate_pipeline.py # 30 checks, 0 external deps
212368```
213369
214- ### CLI Tools
370+ ### Linting and type checking
215371
216372```bash
217- behavioral-memory benchmark info # Dataset summary
218- behavioral-memory benchmark ground-truth # View all 30 tasks
219- behavioral-memory benchmark seed-traces # View 12 seed traces
220- behavioral-memory benchmark tools # View 7 tool definitions
373+ ruff check src/ tests/ agent/
374+ ruff format src/ tests/ agent/
375+ mypy src/
221376```
222377
223- ### Metrics (Section IV.C)
378+ ---
379+
380+ ## Evaluation Metrics (Section IV.C)
224381
225382| Metric | Description |
226383| --------| -------------|
@@ -231,6 +388,19 @@ behavioral-memory benchmark tools # View 7 tool definitions
231388
232389---
233390
391+ ## CLI Tools
392+
393+ ```bash
394+ behavioral-memory setup # Interactive .env setup
395+ behavioral-memory demo # Offline demo of behavioral memory
396+ behavioral-memory benchmark info # Dataset summary
397+ behavioral-memory benchmark ground-truth # View all 30 tasks
398+ behavioral-memory benchmark seed-traces # View 12 seed traces
399+ behavioral-memory benchmark tools # View 7 tool definitions
400+ ```
401+
402+ ---
403+
234404## Configuration
235405
236406All settings are read from environment variables or `.env`:
@@ -253,16 +423,16 @@ All settings via environment variables or `.env`:
253423
254424| Component | Technology |
255425| -----------| -----------|
256- | Vector Store | PostgreSQL + pgvector |
426+ | Vector Store | PostgreSQL + pgvector (production) / In-memory (development) |
257427| Embeddings | Any LangChain Embeddings (default: Gemini) |
258428| LLM | Any LangChain ChatModel (default: Gemini 2.5 Pro) |
259429| Agent Framework | LangGraph 1.x (reference agent) |
260430| Observability | Langfuse |
261431| Config | Pydantic Settings |
262432| Tokenization | tiktoken |
263433| CLI | Typer + Rich |
264- | Testing | pytest |
265- | Linting | ruff |
434+ | Testing | pytest (104 tests) |
435+ | Linting | ruff + pre-commit hooks |
266436| Type Checking | mypy (strict) |
267437| Package Management | uv |
268438