diff --git a/README.md b/README.md
index 8557526..f3f7eb7 100644
--- a/README.md
+++ b/README.md
@@ -12,9 +12,9 @@ Based on the IEEE paper: *"Behavioral Memory for Tool Orchestration: Semantic Re
 
 ---
 
-## Key Results
+## Key Results (from the paper)
 
-On a 30-task benchmark with 7 MCP tools:
+On a 30-task benchmark with 7 MCP tools, using Gemini 2.5 Pro:
 
 | Metric | Zero-Shot | Static Few-Shot | **Proposed** |
 |--------|-----------|----------------|-------------|
@@ -25,6 +25,45 @@ On a 30-task benchmark with 7 MCP tools:
 
 McNemar's test: **p = 0.004** vs zero-shot.
 
+> **Note:** These numbers are from the published paper. To reproduce them yourself, see [Running the Real Benchmark](#running-the-real-benchmark) below.
+
+---
+
+## Quick Start
+
+### Option A: No API keys needed (validation + demo)
+
+```bash
+git clone https://github.com/harsh-kr11/behavioral-memory.git
+cd behavioral-memory
+pip install -e ".[agent,eval,dev]"
+
+# Validate the entire pipeline (30/30 checks, no external services)
+python examples/validate_pipeline.py
+
+# Quick demo showing behavioral memory impact
+behavioral-memory demo
+```
+
+### Option B: With a Google API key (real benchmark)
+
+```bash
+export GOOGLE_API_KEY=your-key-here
+python examples/run_live_benchmark.py               # all 30 tasks
+python examples/run_live_benchmark.py --limit 5      # quick test with 5 tasks
+python examples/run_live_benchmark.py --model gemini-2.0-flash  # cheaper model
+```
+
+### Option C: Interactive agent
+
+```bash
+export GOOGLE_API_KEY=your-key-here
+python -m agent.app --interactive
+
+# Or single query:
+python -m agent.app "Build a revenue analysis pipeline"
+```
+
 ---
 
 ## How It Works
@@ -35,7 +74,8 @@ User Query
     ▼
 ┌─────────────────────────────────────────────────────┐
 │  1. BEHAVIORAL LAYER                                │
-│     Retrieve top-k similar traces from pgvector     │
+│     Retrieve top-k similar traces from memory       │
+│     (pgvector or in-memory — your choice)           │
 │                                                     │
 │  2. TOOL LAYER                                      │
 │     Fetch available tool schemas via MCP             │
@@ -69,40 +109,128 @@ User Query
 
 ## Two Ways to Use
 
-### 1. Bring Your Own Agent (library)
+### 1. As a Library (Bring Your Own Agent)
 
-Install the framework and plug it into your existing agent:
+Install and plug into your existing agent:
 
 ```bash
 pip install behavioral-memory
 ```
 
 ```python
-from behavioral_memory import TraceStore, PlanEngine, ToolRegistry
-from langchain_openai import ChatOpenAI, OpenAIEmbeddings  # or any provider
+from behavioral_memory import PlanEngine, ToolRegistry, InMemoryTraceStore
+from langchain_openai import ChatOpenAI, OpenAIEmbeddings
 
 llm = ChatOpenAI(model="gpt-4o", temperature=0)
 embeddings = OpenAIEmbeddings()
 
-store = TraceStore(embeddings=embeddings, connection_url="postgresql+psycopg://...")
+# No PostgreSQL needed — InMemoryTraceStore works anywhere
+store = InMemoryTraceStore(embeddings=embeddings)
 registry = ToolRegistry()
 engine = PlanEngine(llm=llm, store=store, registry=registry)
 
 plan = engine.generate(query="Get revenue data and email a report")
 ```
 
-### 2. Run the Reference Agent (LangGraph 1.x)
+For production with PostgreSQL + pgvector:
 
-Clone the repo and run the complete system:
+```python
+from behavioral_memory import TraceStore
+
+store = TraceStore(embeddings=embeddings, connection_url="postgresql+psycopg://...")
+```
+
+### 2. Run the Reference Agent (LangGraph 1.x)
 
 ```bash
 git clone https://github.com/harsh-kr11/behavioral-memory.git
 cd behavioral-memory
 pip install -e ".[agent]"
 
+export GOOGLE_API_KEY=your-key
+
+# Interactive mode
+python -m agent.app --interactive
+
+# Single query
 python -m agent.app "Build a revenue analysis pipeline"
 ```
 
+The interactive agent supports:
+- `/compare <query>` — run with AND without memory, see the difference
+- `/memory` — inspect what's in behavioral memory
+- `/quit` — exit
+
+---
+
+## Running the Real Benchmark
+
+The benchmark sends 30 tasks through 3 strategies (zero-shot, static few-shot, dynamic retrieval), scoring each plan against gold tool chains.
+
+### Prerequisites
+
+Only a Google API key. No PostgreSQL required — the benchmark uses `InMemoryTraceStore`.
+
+```bash
+pip install -e ".[agent,eval]"
+export GOOGLE_API_KEY=your-key-here
+```
+
+### Run
+
+```bash
+# Full benchmark (30 tasks × 3 strategies = 90 LLM calls)
+python examples/run_live_benchmark.py
+
+# Quick test (5 tasks × 3 strategies = 15 LLM calls)
+python examples/run_live_benchmark.py --limit 5
+
+# Use a cheaper/faster model
+python examples/run_live_benchmark.py --model gemini-2.0-flash
+
+# With Langfuse logging
+export LANGFUSE_SECRET_KEY=sk-lf-...
+export LANGFUSE_PUBLIC_KEY=pk-lf-...
+python examples/run_live_benchmark.py
+```
+
+### What you'll see
+
+```
+Benchmark Results (N=30, model=gemini-2.5-pro)
+┏━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Metric ┃ Zero-Shot        ┃ Static Few-Shot     ┃ Dynamic (Proposed)       ┃
+┡━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ TSA    │ 63.3% [53%, 73%] │ 70.0% [56%, 83%]    │ 83.3% [70%, 93%]         │
+│ PV     │ 72.2%            │ 79.6%               │ 84.0%                    │
+│ PCR    │ 33.3% [16%, 50%] │ 50.0% [33%, 66%]    │ 63.3% [46%, 80%]         │
+│ ESA    │ 63.3% [46%, 80%] │ 70.0% [53%, 86%]    │ 83.3% [70%, 93%]         │
+└────────┴──────────────────┴─────────────────────┴──────────────────────────┘
+```
+
+Results include per-task breakdowns, difficulty-tier analysis, and McNemar's test.
+
+---
+
+## Pipeline Validation (No API Keys)
+
+Validates every component works correctly using mock services:
+
+```bash
+python examples/validate_pipeline.py
+```
+
+This verifies:
+- 12 seed traces load and pass schema validation
+- 30 ground truth tasks have correct structure
+- InMemoryTraceStore embeds, stores, and retrieves traces
+- PlanEngine generates plans (zero-shot, static, dynamic)
+- BenchmarkRunner scores and compares strategies
+- Gatekeeper pipeline accepts/rejects traces
+- Langfuse tracer handles offline mode gracefully
+
+All **30 checks** pass with zero external dependencies.
+
 ---
 
 ## Installation
@@ -110,7 +238,7 @@ python -m agent.app "Build a revenue analysis pipeline"
 ### Prerequisites
 
 - Python 3.11+
-- PostgreSQL with [pgvector](https://github.com/pgvector/pgvector) extension
+- (Optional) PostgreSQL with [pgvector](https://github.com/pgvector/pgvector) for production deployments
 
 ### Install with uv (recommended)
 
@@ -128,19 +256,22 @@ pip install behavioral-memory
 pip install behavioral-memory[agent,eval]
 ```
 
-### Configure
+### Environment Setup
 
 ```bash
+# Interactive setup (guides you through each variable)
+behavioral-memory setup
+
+# Or manual
 cp .env.example .env
-# Edit .env with your credentials
 ```
 
 | Variable | Required | Description |
 |----------|----------|-------------|
-| `VECTOR_STORE_URL` | Yes | PostgreSQL+pgvector connection string |
-| `GOOGLE_API_KEY` | For reference agent | Gemini API key |
-| `LANGFUSE_SECRET_KEY` | For feedback loop | Langfuse secret key |
-| `LANGFUSE_PUBLIC_KEY` | For feedback loop | Langfuse public key |
+| `GOOGLE_API_KEY` | For LLM calls | Gemini API key (or use any LangChain-compatible LLM) |
+| `VECTOR_STORE_URL` | For PostgreSQL mode | `postgresql+psycopg://localhost/behavioral_memory` |
+| `LANGFUSE_SECRET_KEY` | For observability | Langfuse secret key |
+| `LANGFUSE_PUBLIC_KEY` | For observability | Langfuse public key |
 
 ---
 
@@ -152,7 +283,7 @@ cp .env.example .env
 behavioral-memory/
 ├── src/behavioral_memory/     # The pip-installable library
 │   ├── core/                  #   Schemas, config, exceptions
-│   ├── memory/                #   Behavioral Layer (TraceStore, dedup, token budget)
+│   ├── memory/                #   Behavioral Layer (TraceStore, InMemoryTraceStore, dedup)
 │   ├── tools/                 #   Tool Layer (MCP client, registry, mock tools)
 │   ├── planner/               #   Executive Layer (PlanEngine, prompt, postprocess)
 │   ├── gatekeeper/            #   Gatekeeper (schema validator, sandbox, dedup gate)
@@ -162,13 +293,25 @@ behavioral-memory/
 │   ├── graph.py               #   StateGraph definition
 │   ├── state.py               #   Agent state
 │   └── nodes/                 #   Graph nodes (retrieve, plan, execute, observe)
-├── tests/                     # Unit + integration tests
-└── examples/                  # Usage examples
+├── tests/                     # 104 tests (unit + integration + e2e)
+│   ├── unit/                  #   61 unit tests
+│   ├── integration/           #   3 integration tests
+│   └── e2e/                   #   40 end-to-end tests
+├── examples/
+│   ├── validate_pipeline.py   #   Full pipeline validation (no API keys)
+│   ├── run_live_benchmark.py  #   Real benchmark (needs API key)
+│   └── run_benchmark.py       #   Benchmark with PostgreSQL
+└── .github/workflows/         # CI/CD
 ```
 
-### The Framework is Model-Agnostic
+### Store Options
 
-The library accepts any LangChain-compatible model:
+| Store | When to Use | Requires |
+|-------|------------|----------|
+| `InMemoryTraceStore` | Development, demos, CI, benchmarks | Nothing (numpy only) |
+| `TraceStore` | Production with persistent memory | PostgreSQL + pgvector |
+
+### The Framework is Model-Agnostic
 
 | Provider | LLM | Embeddings |
 |----------|-----|------------|
@@ -179,7 +322,7 @@ The library accepts any LangChain-compatible model:
 
 ---
 
-## Feedback Loop
+## Feedback Loop (Langfuse)
 
 The system learns from human feedback via Langfuse:
 
@@ -196,31 +339,45 @@ from behavioral_memory import FeedbackPoller, GatekeeperPipeline
 poller = FeedbackPoller(settings=settings)
 gatekeeper = GatekeeperPipeline(store=store, registry=registry)
 
-# Auto-learn in the background
 poller.poll_loop(callback=lambda trace: gatekeeper.submit(trace))
 ```
 
 ---
 
-## Evaluation
+## Testing
 
-### Reproduce Paper Results
+### Run all tests (104 tests, no external services needed)
 
 ```bash
-pip install behavioral-memory[agent,eval]
-python examples/run_benchmark.py
+pip install -e ".[dev]"
+pytest tests/ -v
+```
+
+### Test breakdown
+
+| Suite | Tests | What it covers |
+|-------|-------|---------------|
+| `tests/unit/` | 61 | Schemas, metrics, postprocessing, prompt assembly, token budget, in-memory store |
+| `tests/integration/` | 3 | Schema validator + sandbox with real traces |
+| `tests/e2e/` | 40 | Full pipeline: seed traces → prompt → mock LLM → metrics → gatekeeper |
+
+### Pipeline validation
+
+```bash
+python examples/validate_pipeline.py   # 30 checks, 0 external deps
 ```
 
-### CLI Tools
+### Linting and type checking
 
 ```bash
-behavioral-memory benchmark info          # Dataset summary
-behavioral-memory benchmark ground-truth  # View all 30 tasks
-behavioral-memory benchmark seed-traces   # View 12 seed traces
-behavioral-memory benchmark tools         # View 7 tool definitions
+ruff check src/ tests/ agent/
+ruff format src/ tests/ agent/
+mypy src/
 ```
 
-### Metrics (Section IV.C)
+---
+
+## Evaluation Metrics (Section IV.C)
 
 | Metric | Description |
 |--------|-------------|
@@ -231,6 +388,19 @@ behavioral-memory benchmark tools         # View 7 tool definitions
 
 ---
 
+## CLI Tools
+
+```bash
+behavioral-memory setup                    # Interactive .env setup
+behavioral-memory demo                     # Offline demo of behavioral memory
+behavioral-memory benchmark info           # Dataset summary
+behavioral-memory benchmark ground-truth   # View all 30 tasks
+behavioral-memory benchmark seed-traces    # View 12 seed traces
+behavioral-memory benchmark tools          # View 7 tool definitions
+```
+
+---
+
 ## Configuration
 
 All settings via environment variables or `.env`:
@@ -253,7 +423,7 @@ All settings via environment variables or `.env`:
 
 | Component | Technology |
 |-----------|-----------|
-| Vector Store | PostgreSQL + pgvector |
+| Vector Store | PostgreSQL + pgvector (production) / In-memory (development) |
 | Embeddings | Any LangChain Embeddings (default: Gemini) |
 | LLM | Any LangChain ChatModel (default: Gemini 2.5 Pro) |
 | Agent Framework | LangGraph 1.x (reference agent) |
@@ -261,8 +431,8 @@ All settings via environment variables or `.env`:
 | Config | Pydantic Settings |
 | Tokenization | tiktoken |
 | CLI | Typer + Rich |
-| Testing | pytest |
-| Linting | ruff |
+| Testing | pytest (104 tests) |
+| Linting | ruff + pre-commit hooks |
 | Type Checking | mypy (strict) |
 | Package Management | uv |
 
diff --git a/agent/app.py b/agent/app.py
index d727b33..0917cd2 100644
--- a/agent/app.py
+++ b/agent/app.py
@@ -1,82 +1,216 @@
-"""Reference agent entry point.
+"""Reference agent entry point — works with real LLM + in-memory store.
 
-Demonstrates the full behavioral memory system end-to-end using
-LangGraph 1.x with Gemini as the default LLM.
+Run modes:
+    python -m agent.app "Build a revenue analysis pipeline"   # single query
+    python -m agent.app --interactive                          # REPL mode
+    python -m agent.app --benchmark --limit 5                  # quick benchmark
 """
 
 from __future__ import annotations
 
-import json
 import sys
 
 from rich.console import Console
+from rich.panel import Panel
 
 console = Console()
 
 
-def run_agent(query: str, verbose: bool = False) -> dict:
-    """Run the reference agent on a single query."""
-    from langchain_core.embeddings import Embeddings
+def create_agent(model: str = "gemini-2.5-pro", use_postgres: bool = False):
+    """Create the agent with all components wired up.
+
+    Returns (graph, store, registry, tracer, settings).
+    """
     from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings
 
     from agent.graph import build_agent_graph
     from behavioral_memory.core.config import Settings
-    from behavioral_memory.memory.store import TraceStore
+    from behavioral_memory.evaluation.seed_traces import get_seed_traces
+    from behavioral_memory.observability.tracer import LangfuseTracer
     from behavioral_memory.tools.mock_tools import get_tool_schemas
     from behavioral_memory.tools.registry import ToolRegistry
 
     settings = Settings()
 
-    llm = ChatGoogleGenerativeAI(
-        model="gemini-2.5-pro",
-        temperature=0,
-    )
+    llm = ChatGoogleGenerativeAI(model=model, temperature=0)
+    embeddings = GoogleGenerativeAIEmbeddings(model="models/gemini-embedding-001")
 
-    embeddings: Embeddings = GoogleGenerativeAIEmbeddings(
-        model="models/gemini-embedding-001",
-    )
+    if use_postgres:
+        from behavioral_memory.memory.store import TraceStore
+        store = TraceStore(embeddings=embeddings, settings=settings)
+    else:
+        from behavioral_memory.memory.in_memory_store import InMemoryTraceStore
+        store = InMemoryTraceStore(embeddings=embeddings, settings=settings)
 
-    store = TraceStore(embeddings=embeddings, settings=settings)
     registry = ToolRegistry()
     registry.register_many(get_tool_schemas())
+    tracer = LangfuseTracer(settings=settings)
+
+    seed_traces = get_seed_traces()
+    store.add_bulk(seed_traces)
 
     graph = build_agent_graph(
-        llm=llm,
-        store=store,
-        registry=registry,
-        settings=settings,
+        llm=llm, store=store, registry=registry, settings=settings,
     )
 
+    return graph, store, registry, tracer, settings
+
+
+def run_single(query: str, model: str = "gemini-2.5-pro", verbose: bool = True) -> dict:
+    """Run the agent on a single query and display results."""
+    console.print(f"[dim]Model: {model}[/dim]")
+    console.print(f"[dim]Query: {query}[/dim]\n")
+
+    graph, store, _registry, tracer, _settings = create_agent(model=model)
+
+    console.print(f"[dim]Memory: {store.count()} seed traces loaded[/dim]")
+    if tracer.enabled:
+        console.print("[green]Langfuse tracing: enabled[/green]")
+
     compiled = graph.compile()
     result = compiled.invoke({"query": query})
 
-    if verbose:
+    plan = result.get("plan")
+    if plan:
+        console.print(f"\n[bold green]Plan generated — {len(plan.steps)} steps:[/bold green]")
+        for step in plan.steps:
+            console.print(f"  [cyan]{step.step_id}[/cyan]: {step.tool_name}")
+            if verbose:
+                for k, v in step.parameters.items():
+                    val_str = str(v)[:100]
+                    console.print(f"    {k}: {val_str}")
+            if step.depends_on:
+                console.print(f"    [dim]depends_on: {step.depends_on}[/dim]")
+
+        if plan.retrieved_traces:
+            console.print(f"\n[dim]Retrieved {len(plan.retrieved_traces)} traces from memory:[/dim]")
+            for t in plan.retrieved_traces:
+                console.print(f"  [dim]• {t.task_description[:70]}[/dim]")
+
+        console.print(f"\n[dim]Token budget used: {plan.token_budget_used}[/dim]")
+
+        if tracer.enabled:
+            trace_id = tracer.log_plan(plan, tags=["agent-run"])
+            if trace_id:
+                console.print(f"[green]Logged to Langfuse: {trace_id}[/green]")
+            tracer.flush()
+    else:
+        console.print(f"\n[red]Planning failed: {result.get('error', 'unknown')}[/red]")
+
+    return result
+
+
+def run_interactive(model: str = "gemini-2.5-pro") -> None:
+    """Interactive REPL — type queries, see plans, compare with/without memory."""
+    console.print(Panel.fit(
+        "[bold]Behavioral Memory Agent — Interactive Mode[/bold]\n\n"
+        f"Model: {model}\n"
+        "Type a query to generate a plan. The agent retrieves relevant\n"
+        "traces from behavioral memory to guide its planning.\n\n"
+        "Special commands:\n"
+        "  /compare <query>  — run with AND without memory, show difference\n"
+        "  /memory            — show what's in behavioral memory\n"
+        "  /quit              — exit",
+        title="Interactive Agent",
+    ))
+
+    graph, store, registry, tracer, settings = create_agent(model=model)
+    compiled = graph.compile()
+
+    console.print(f"[green]Ready. Memory: {store.count()} traces loaded.[/green]\n")
+
+    while True:
+        try:
+            query = console.input("[bold]Query>[/bold] ").strip()
+        except (EOFError, KeyboardInterrupt):
+            break
+
+        if not query:
+            continue
+        if query.lower() in ("/quit", "/exit", "quit", "exit"):
+            break
+
+        if query.startswith("/memory"):
+            from behavioral_memory.evaluation.seed_traces import get_seed_traces
+            for trace in get_seed_traces():
+                tools = " → ".join(trace.tool_names)
+                console.print(f"  [cyan]{trace.task_description[:60]}[/cyan]")
+                console.print(f"    [dim]{tools}[/dim]")
+            console.print(f"\n  [dim]Total: {store.count()} traces[/dim]\n")
+            continue
+
+        if query.startswith("/compare "):
+            actual_query = query[9:].strip()
+            _run_comparison(compiled, actual_query, store, registry, settings, tracer)
+            continue
+
+        result = compiled.invoke({"query": query})
         plan = result.get("plan")
         if plan:
-            console.print(f"\n[cyan]Plan ({len(plan.steps)} steps):[/cyan]")
+            console.print(f"\n[green]Plan ({len(plan.steps)} steps):[/green]")
             for step in plan.steps:
-                console.print(f"  {step.step_id}: {step.tool_name}")
-                console.print(f"    params: {json.dumps(step.parameters, indent=4)}")
+                console.print(f"  [cyan]{step.step_id}[/cyan]: {step.tool_name}")
+                for k, v in step.parameters.items():
+                    console.print(f"    {k}: {str(v)[:80]}")
+            if plan.retrieved_traces:
+                console.print(f"\n  [dim]Retrieved {len(plan.retrieved_traces)} traces from memory[/dim]")
+            if tracer.enabled:
+                tracer.log_plan(plan, tags=["interactive"])
+                tracer.flush()
+        else:
+            console.print(f"[red]Failed: {result.get('error', 'unknown')}[/red]")
+        console.print()
+
+
+def _run_comparison(compiled, query, store, registry, settings, tracer):
+    """Run with and without memory, show the difference."""
+    from behavioral_memory.planner.prompt import build_prompt
+    from behavioral_memory.tools.mock_tools import get_tool_schemas
 
-    return result
+    console.print(f"\n[bold]Comparing: \"{query}\"[/bold]\n")
 
+    result_with = compiled.invoke({"query": query})
+    plan_with = result_with.get("plan")
 
-def main() -> None:
-    """CLI entry point for the reference agent."""
-    if len(sys.argv) < 2:
-        console.print("[red]Usage: python -m agent.app 'your query here'[/red]")
-        sys.exit(1)
+    schemas = get_tool_schemas()
 
-    query = " ".join(sys.argv[1:])
-    console.print(f"[dim]Query:[/dim] {query}")
+    console.print("[yellow]WITHOUT memory (zero-shot):[/yellow]")
+    zs_prompt = build_prompt(query=query, traces=[], tool_schemas=schemas)
+    console.print(f"  Prompt: {len(zs_prompt)} chars, 0 reference examples")
 
-    result = run_agent(query, verbose=True)
+    console.print("\n[green]WITH memory (dynamic retrieval):[/green]")
+    if plan_with:
+        console.print(f"  Retrieved: {len(plan_with.retrieved_traces)} traces")
+        for t in plan_with.retrieved_traces:
+            console.print(f"    [dim]• {t.task_description[:60]}[/dim]")
+        console.print(f"\n  Plan ({len(plan_with.steps)} steps):")
+        for step in plan_with.steps:
+            console.print(f"    [cyan]{step.step_id}[/cyan]: {step.tool_name}")
+    else:
+        console.print(f"  [red]Failed: {result_with.get('error')}[/red]")
+    console.print()
 
-    plan = result.get("plan")
-    if plan:
-        console.print(f"\n[green]Plan generated with {len(plan.steps)} steps[/green]")
+
+def main() -> None:
+    import argparse
+
+    parser = argparse.ArgumentParser(description="Behavioral Memory Reference Agent")
+    parser.add_argument("query", nargs="*", help="Query to process")
+    parser.add_argument("--interactive", "-i", action="store_true", help="Interactive REPL mode")
+    parser.add_argument("--model", default="gemini-2.5-pro", help="Gemini model to use")
+    parser.add_argument("--postgres", action="store_true", help="Use PostgreSQL instead of in-memory store")
+    args = parser.parse_args()
+
+    if args.interactive:
+        run_interactive(model=args.model)
+    elif args.query:
+        query = " ".join(args.query)
+        run_single(query, model=args.model)
     else:
-        console.print(f"\n[red]Planning failed: {result.get('error', 'unknown')}[/red]")
+        console.print("[red]Usage:[/red]")
+        console.print("  python -m agent.app 'your query here'")
+        console.print("  python -m agent.app --interactive")
+        sys.exit(1)
 
 
 if __name__ == "__main__":
diff --git a/agent/graph.py b/agent/graph.py
index 6392143..4113c3c 100644
--- a/agent/graph.py
+++ b/agent/graph.py
@@ -8,6 +8,8 @@
 
 from __future__ import annotations
 
+from typing import Any
+
 from langchain_core.language_models import BaseChatModel
 from langgraph.graph import END, START, StateGraph
 
@@ -18,7 +20,6 @@
 from agent.nodes.retrieve import make_retrieve_node
 from agent.state import AgentState
 from behavioral_memory.core.config import Settings
-from behavioral_memory.memory.store import TraceStore
 from behavioral_memory.observability.tracer import LangfuseTracer
 from behavioral_memory.planner.engine import PlanEngine
 from behavioral_memory.tools.registry import ToolRegistry
@@ -26,7 +27,7 @@
 
 def build_agent_graph(
     llm: BaseChatModel,
-    store: TraceStore,
+    store: Any,
     registry: ToolRegistry,
     settings: Settings | None = None,
 ) -> StateGraph:
diff --git a/agent/nodes/retrieve.py b/agent/nodes/retrieve.py
index 522c475..4295526 100644
--- a/agent/nodes/retrieve.py
+++ b/agent/nodes/retrieve.py
@@ -2,12 +2,13 @@
 
 from __future__ import annotations
 
+from typing import Any
+
 from agent.state import AgentState
-from behavioral_memory.memory.store import TraceStore
 from behavioral_memory.memory.token_budget import select_traces_within_budget
 
 
-def make_retrieve_node(store: TraceStore):
+def make_retrieve_node(store: Any):
     """Factory that creates a retrieve_traces node bound to a TraceStore."""
 
     def retrieve_traces(state: AgentState) -> dict:
diff --git a/examples/run_live_benchmark.py b/examples/run_live_benchmark.py
new file mode 100644
index 0000000..0b5399c
--- /dev/null
+++ b/examples/run_live_benchmark.py
@@ -0,0 +1,238 @@
+"""Run the REAL benchmark — calls the LLM and produces actual numbers.
+
+This script:
+  1. Seeds 12 traces into an in-memory vector store (no PostgreSQL needed)
+  2. Runs all 30 tasks through 3 strategies (zero-shot, static, dynamic)
+  3. Scores every plan against gold tool chains
+  4. Prints real TSA/PV/PCR/ESA numbers with bootstrap confidence intervals
+  5. Optionally logs every plan to Langfuse
+
+Prerequisites:
+    pip install behavioral-memory[agent,eval]
+    export GOOGLE_API_KEY=your-key-here
+
+    Optional (for Langfuse tracing):
+    export LANGFUSE_SECRET_KEY=sk-lf-...
+    export LANGFUSE_PUBLIC_KEY=pk-lf-...
+
+Usage:
+    python examples/run_live_benchmark.py
+    python examples/run_live_benchmark.py --limit 5    # quick test with 5 tasks
+    python examples/run_live_benchmark.py --model gemini-2.0-flash  # cheaper model
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import sys
+import time
+
+from rich.console import Console
+from rich.panel import Panel
+from rich.table import Table
+
+console = Console()
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser(description="Run the behavioral memory benchmark")
+    parser.add_argument("--limit", type=int, default=0, help="Limit to N tasks (0 = all 30)")
+    parser.add_argument("--model", type=str, default="gemini-2.5-pro", help="Gemini model name")
+    parser.add_argument("--output", type=str, default="benchmark_results.json", help="Output file")
+    args = parser.parse_args()
+
+    console.print(Panel.fit(
+        "[bold]Behavioral Memory — Live Benchmark[/bold]\n\n"
+        "This runs the REAL benchmark from the paper.\n"
+        "It calls the LLM for every task and scores plans against gold chains.\n"
+        f"Model: {args.model}  |  Tasks: {'all 30' if args.limit == 0 else args.limit}",
+        title="Live Benchmark",
+    ))
+
+    try:
+        from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings
+    except ImportError:
+        console.print("[red]Missing dependency: pip install langchain-google-genai[/red]")
+        sys.exit(1)
+
+    from behavioral_memory.core.config import Settings
+    from behavioral_memory.evaluation.benchmark import BenchmarkRunner
+    from behavioral_memory.evaluation.seed_traces import get_seed_traces
+    from behavioral_memory.evaluation.strategies import (
+        DynamicRetrievalStrategy,
+        StaticFewShotStrategy,
+        ZeroShotStrategy,
+    )
+    from behavioral_memory.memory.in_memory_store import InMemoryTraceStore
+    from behavioral_memory.observability.tracer import LangfuseTracer
+    from behavioral_memory.planner.engine import PlanEngine
+    from behavioral_memory.tools.mock_tools import get_tool_schemas
+    from behavioral_memory.tools.registry import ToolRegistry
+
+    settings = Settings()
+
+    console.print("\n[dim]Initializing LLM and embeddings...[/dim]")
+    llm = ChatGoogleGenerativeAI(model=args.model, temperature=0)
+    embeddings = GoogleGenerativeAIEmbeddings(model="models/gemini-embedding-001")
+
+    console.print("[dim]Creating in-memory vector store (no PostgreSQL needed)...[/dim]")
+    store = InMemoryTraceStore(embeddings=embeddings, settings=settings)
+
+    registry = ToolRegistry()
+    schemas = get_tool_schemas()
+    registry.register_many(schemas)
+
+    seed_traces = get_seed_traces()
+    store.add_bulk(seed_traces)
+    console.print(f"[green]Seeded {store.count()} traces into in-memory store[/green]")
+
+    engine = PlanEngine(llm=llm, store=store, registry=registry, settings=settings)
+    runner = BenchmarkRunner(tool_schemas=schemas)
+
+    tracer = LangfuseTracer(settings=settings)
+    if tracer.enabled:
+        console.print("[green]Langfuse tracing enabled — plans will be logged[/green]")
+    else:
+        console.print("[dim]Langfuse not configured — set LANGFUSE_SECRET_KEY to enable[/dim]")
+
+    limit = args.limit if args.limit > 0 else None
+
+    # --- Zero-shot ---
+    console.print("\n[cyan]Running zero-shot baseline...[/cyan]")
+    t0 = time.time()
+    zero_shot = runner.run(ZeroShotStrategy(engine), limit=limit)
+    zs_time = time.time() - t0
+    console.print(f"  [dim]Completed in {zs_time:.1f}s[/dim]")
+    _log_results_to_langfuse(tracer, zero_shot, "zero-shot")
+
+    # --- Static few-shot ---
+    console.print("[cyan]Running static few-shot baseline...[/cyan]")
+    t0 = time.time()
+    static = runner.run(StaticFewShotStrategy(engine, seed_traces[:3]), limit=limit)
+    sf_time = time.time() - t0
+    console.print(f"  [dim]Completed in {sf_time:.1f}s[/dim]")
+    _log_results_to_langfuse(tracer, static, "static-few-shot")
+
+    # --- Dynamic retrieval (proposed) ---
+    console.print("[cyan]Running dynamic retrieval (proposed)...[/cyan]")
+    t0 = time.time()
+    dynamic = runner.run(DynamicRetrievalStrategy(engine), limit=limit)
+    dr_time = time.time() - t0
+    console.print(f"  [dim]Completed in {dr_time:.1f}s[/dim]")
+    _log_results_to_langfuse(tracer, dynamic, "dynamic-retrieval")
+
+    # --- Results table ---
+    n = zero_shot["n_tasks"]
+    table = Table(title=f"Benchmark Results (N={n}, model={args.model})")
+    table.add_column("Metric", style="bold")
+    table.add_column("Zero-Shot", justify="right")
+    table.add_column("Static Few-Shot", justify="right")
+    table.add_column("Dynamic (Proposed)", justify="right", style="bold green")
+
+    for metric in ["tsa", "pv", "pcr", "esa"]:
+        zs = zero_shot["aggregate"][metric]
+        sf = static["aggregate"][metric]
+        dy = dynamic["aggregate"][metric]
+
+        zs_str = _fmt_metric(zs)
+        sf_str = _fmt_metric(sf)
+        dy_str = _fmt_metric(dy)
+
+        table.add_row(metric.upper(), zs_str, sf_str, dy_str)
+
+    console.print("\n")
+    console.print(table)
+
+    comparison = runner.compare(zero_shot, dynamic, "Zero-Shot", "Proposed")
+    p_val = comparison["mcnemar_pcr"]["p_value"]
+    console.print(f"\nMcNemar's test (zero-shot vs proposed): p = {p_val:.4f}")
+    if p_val < 0.05:
+        console.print("[green]  → Statistically significant (p < 0.05)[/green]")
+    else:
+        console.print("[yellow]  → Not statistically significant (p >= 0.05)[/yellow]")
+
+    # --- Per-difficulty breakdown ---
+    diff_table = Table(title="Plan Correctness by Difficulty")
+    diff_table.add_column("Difficulty", style="bold")
+    diff_table.add_column("n", justify="right")
+    diff_table.add_column("Zero-Shot PCR", justify="right")
+    diff_table.add_column("Static PCR", justify="right")
+    diff_table.add_column("Dynamic PCR", justify="right", style="bold green")
+
+    for diff in ["simple", "moderate", "challenging"]:
+        zs_diff = runner.results_by_difficulty(zero_shot).get(diff, {})
+        sf_diff = runner.results_by_difficulty(static).get(diff, {})
+        dy_diff = runner.results_by_difficulty(dynamic).get(diff, {})
+        diff_table.add_row(
+            diff,
+            str(zs_diff.get("n", 0)),
+            f"{zs_diff.get('pcr', 0):.0%}",
+            f"{sf_diff.get('pcr', 0):.0%}",
+            f"{dy_diff.get('pcr', 0):.0%}",
+        )
+
+    console.print(diff_table)
+
+    # --- Per-task details ---
+    console.print("\n[bold]Per-task breakdown (dynamic retrieval):[/bold]")
+    for task_result in dynamic["per_task"]:
+        m = task_result["metrics"]
+        status = "✓" if m["pcr"] else "✗"
+        style = "green" if m["pcr"] else "red"
+        console.print(
+            f"  [{style}]{status}[/{style}] Task {task_result['task_id']} ({task_result['difficulty']}): "
+            f"TSA={'✓' if m['tsa'] else '✗'} PV={m['pv']:.0%} ESA={'✓' if m['esa'] else '✗'}"
+        )
+
+    # --- Save results ---
+    all_results = {
+        "model": args.model,
+        "n_tasks": n,
+        "zero_shot": zero_shot,
+        "static_few_shot": static,
+        "dynamic_retrieval": dynamic,
+        "comparison": comparison,
+        "timing": {"zero_shot_s": zs_time, "static_s": sf_time, "dynamic_s": dr_time},
+    }
+    with open(args.output, "w") as f:
+        json.dump(all_results, f, indent=2, default=str)
+    console.print(f"\n[dim]Full results saved to {args.output}[/dim]")
+
+    if tracer.enabled:
+        tracer.flush()
+        console.print("[green]All results logged to Langfuse[/green]")
+
+
+def _fmt_metric(m: dict) -> str:
+    mean = m["mean"]
+    if isinstance(mean, bool):
+        return "✓" if mean else "✗"
+    ci = m.get("ci_95")
+    if ci:
+        return f"{mean:.1%} [{ci[0]:.1%}, {ci[1]:.1%}]"
+    return f"{mean:.1%}"
+
+
+def _log_results_to_langfuse(tracer, results: dict, strategy_name: str) -> None:
+    """Log each plan to Langfuse for observability."""
+    if not tracer.enabled:
+        return
+    for task_result in results.get("per_task", []):
+        if "predicted_steps" in task_result:
+            from behavioral_memory.core.schemas import Plan, ToolCall
+
+            steps = [ToolCall(**s) for s in task_result["predicted_steps"]]
+            plan = Plan(
+                query=task_result["task"],
+                steps=steps,
+                raw_llm_output=json.dumps(task_result["predicted_steps"]),
+            )
+            tracer.log_plan(
+                plan,
+                tags=["benchmark", strategy_name, task_result["difficulty"]],
+            )
+
+
+if __name__ == "__main__":
+    main()
diff --git a/examples/validate_pipeline.py b/examples/validate_pipeline.py
new file mode 100644
index 0000000..33b3b13
--- /dev/null
+++ b/examples/validate_pipeline.py
@@ -0,0 +1,266 @@
+"""Validate the entire pipeline end-to-end without any external services.
+
+This script proves:
+  1. InMemoryTraceStore works (embed + search)
+  2. Seed traces load and validate
+  3. PlanEngine generates plans (with a mock LLM)
+  4. Benchmark runner scores plans correctly
+  5. Gatekeeper validates traces
+  6. Langfuse tracer handles offline mode gracefully
+  7. The full zero-shot vs dynamic retrieval pipeline works
+
+No API keys, no PostgreSQL, no network access required.
+
+Usage:
+    python examples/validate_pipeline.py
+"""
+
+from __future__ import annotations
+
+import json
+import sys
+from unittest.mock import MagicMock
+
+from rich.console import Console
+from rich.panel import Panel
+from rich.table import Table
+
+console = Console()
+
+
+def make_mock_embeddings(dim: int = 64):
+    """Deterministic embedding model for validation."""
+    emb = MagicMock()
+
+    def embed_query(text: str) -> list[float]:
+        import hashlib
+
+        h = hashlib.sha256(text.encode()).digest()
+        return [float(b) / 255.0 for b in h[:dim]]
+
+    def embed_documents(texts: list[str]) -> list[list[float]]:
+        return [embed_query(t) for t in texts]
+
+    emb.embed_query = embed_query
+    emb.embed_documents = embed_documents
+    return emb
+
+
+def make_mock_llm(gold_tasks):
+    """Mock LLM that returns the gold tool chain for the closest matching task."""
+
+    def invoke(messages):
+        query = messages[-1].content if hasattr(messages[-1], "content") else str(messages[-1])
+
+        best_match = None
+        best_overlap = 0
+        query_words = set(query.lower().split())
+
+        for task in gold_tasks:
+            task_words = set(task["task"].lower().split())
+            overlap = len(query_words & task_words)
+            if overlap > best_overlap:
+                best_overlap = overlap
+                best_match = task
+
+        if best_match:
+            steps = []
+            for i, gold_step in enumerate(best_match["gold_tool_chain"]):
+                steps.append({
+                    "step_id": f"step_{i + 1}",
+                    "tool_name": gold_step["tool"],
+                    "parameters": gold_step["params"],
+                    "depends_on": [f"step_{j + 1}" for j in range(i)],
+                })
+            response = MagicMock()
+            response.content = json.dumps(steps)
+            return response
+
+        response = MagicMock()
+        response.content = json.dumps([{
+            "step_id": "step_1",
+            "tool_name": "data_fetch",
+            "parameters": {"source": "default"},
+            "depends_on": [],
+        }])
+        return response
+
+    llm = MagicMock()
+    llm.invoke = invoke
+    return llm
+
+
+def main() -> None:
+    console.print(Panel.fit(
+        "[bold]Pipeline Validation — Full End-to-End Check[/bold]\n\n"
+        "Tests every component with mock services.\n"
+        "No API keys or external services needed.",
+        title="Validation",
+    ))
+
+    from behavioral_memory.core.config import Settings
+    from behavioral_memory.evaluation.benchmark import BenchmarkRunner
+    from behavioral_memory.evaluation.ground_truth import EVALUATION_TASKS
+    from behavioral_memory.evaluation.seed_traces import get_seed_traces
+    from behavioral_memory.evaluation.strategies import (
+        DynamicRetrievalStrategy,
+        StaticFewShotStrategy,
+        ZeroShotStrategy,
+    )
+    from behavioral_memory.gatekeeper.pipeline import GatekeeperPipeline
+    from behavioral_memory.memory.in_memory_store import InMemoryTraceStore
+    from behavioral_memory.observability.tracer import LangfuseTracer
+    from behavioral_memory.planner.engine import PlanEngine
+    from behavioral_memory.tools.mock_tools import get_tool_schemas
+    from behavioral_memory.tools.registry import ToolRegistry
+
+    passed = 0
+    failed = 0
+
+    def check(name: str, condition: bool, detail: str = ""):
+        nonlocal passed, failed
+        if condition:
+            console.print(f"  [green]✓[/green] {name}")
+            passed += 1
+        else:
+            console.print(f"  [red]✗[/red] {name}: {detail}")
+            failed += 1
+
+    # --- 1. Seed traces ---
+    console.print("\n[bold cyan]1. Seed Traces[/bold cyan]")
+    seed_traces = get_seed_traces()
+    check("12 seed traces loaded", len(seed_traces) == 12, f"got {len(seed_traces)}")
+    check("All traces validated", all(t.validated for t in seed_traces))
+    check("All have tool chains", all(len(t.tool_chain) > 0 for t in seed_traces))
+
+    # --- 2. Tool schemas ---
+    console.print("\n[bold cyan]2. Tool Schemas[/bold cyan]")
+    schemas = get_tool_schemas()
+    check("7 mock tools loaded", len(schemas) == 7, f"got {len(schemas)}")
+    registry = ToolRegistry()
+    registry.register_many(schemas)
+    check("Registry populated", len(registry) == 7)
+
+    # --- 3. Ground truth tasks ---
+    console.print("\n[bold cyan]3. Ground Truth Tasks[/bold cyan]")
+    check("30 evaluation tasks", len(EVALUATION_TASKS) == 30, f"got {len(EVALUATION_TASKS)}")
+    difficulties = {t["difficulty"] for t in EVALUATION_TASKS}
+    check("Three difficulty tiers", difficulties == {"simple", "moderate", "challenging"})
+    check(
+        "All gold chains reference known tools",
+        all(
+            step["tool"] in registry._tools
+            for task in EVALUATION_TASKS
+            for step in task["gold_tool_chain"]
+        ),
+    )
+
+    # --- 4. InMemoryTraceStore ---
+    console.print("\n[bold cyan]4. InMemory Vector Store[/bold cyan]")
+    embeddings = make_mock_embeddings()
+    settings = Settings()
+    store = InMemoryTraceStore(embeddings=embeddings, settings=settings)
+    n_added = store.add_bulk(seed_traces)
+    check("Bulk add succeeds", n_added == 12)
+    check("Count matches", store.count() == 12)
+
+    results = store.search("Build a revenue analysis pipeline", k=3)
+    check("Search returns results", len(results) > 0, "empty search")
+    check("Results are (trace, score) tuples", all(isinstance(r[1], float) for r in results))
+
+    # --- 5. PlanEngine with mock LLM ---
+    console.print("\n[bold cyan]5. PlanEngine[/bold cyan]")
+    llm = make_mock_llm(EVALUATION_TASKS)
+    engine = PlanEngine(llm=llm, store=store, registry=registry, settings=settings)
+
+    plan = engine.generate(query="Build a revenue analysis pipeline", tool_schemas=schemas)
+    check("Plan generated", plan is not None)
+    check("Plan has steps", len(plan.steps) > 0, "empty plan")
+    check("Retrieved traces attached", len(plan.retrieved_traces) > 0, "no retrieval")
+    check("Token budget tracked", plan.token_budget_used > 0)
+
+    zs_plan = engine.generate_zero_shot("Build a revenue analysis pipeline", schemas)
+    check("Zero-shot plan works", len(zs_plan.steps) > 0)
+    check("Zero-shot has no retrieved traces", len(zs_plan.retrieved_traces) == 0)
+
+    static_plan = engine.generate_static_few_shot(
+        "Build a revenue analysis pipeline", schemas, seed_traces[:3],
+    )
+    check("Static few-shot works", len(static_plan.steps) > 0)
+    check("Static uses provided traces", len(static_plan.retrieved_traces) == 3)
+
+    # --- 6. Benchmark Runner ---
+    console.print("\n[bold cyan]6. Benchmark Runner[/bold cyan]")
+    runner = BenchmarkRunner(tool_schemas=schemas)
+
+    zs_results = runner.run(ZeroShotStrategy(engine), limit=5)
+    check("Zero-shot runs on 5 tasks", zs_results["n_tasks"] == 5)
+    check("Has aggregate metrics", "tsa" in zs_results["aggregate"])
+
+    sf_results = runner.run(StaticFewShotStrategy(engine, seed_traces[:3]), limit=5)
+    check("Static few-shot runs on 5 tasks", sf_results["n_tasks"] == 5)
+
+    dr_results = runner.run(DynamicRetrievalStrategy(engine), limit=5)
+    check("Dynamic retrieval runs on 5 tasks", dr_results["n_tasks"] == 5)
+
+    comparison = runner.compare(zs_results, dr_results, "Zero-Shot", "Dynamic")
+    check("McNemar test runs", "mcnemar_pcr" in comparison)
+    check("p-value is numeric", isinstance(comparison["mcnemar_pcr"]["p_value"], float))
+
+    by_diff = runner.results_by_difficulty(dr_results)
+    check("Difficulty breakdown works", len(by_diff) > 0)
+
+    # --- 7. Gatekeeper ---
+    console.print("\n[bold cyan]7. Gatekeeper Pipeline[/bold cyan]")
+    gk = GatekeeperPipeline(store=store, registry=registry)
+    gk_result = gk.evaluate(seed_traces[0])
+    check("Gatekeeper accepts valid trace", gk_result.accepted or gk_result.is_duplicate,
+          f"rejected: {gk_result.rejection_reason}")
+
+    # --- 8. Langfuse offline ---
+    console.print("\n[bold cyan]8. Langfuse Tracer (offline)[/bold cyan]")
+    tracer = LangfuseTracer(settings=settings)
+    check("Tracer disabled without keys", not tracer.enabled)
+    trace_id = tracer.log_plan(plan)
+    check("Log returns None when disabled", trace_id is None)
+
+    # --- Print metrics from mock run ---
+    console.print("\n[bold cyan]9. Mock Benchmark Results (5 tasks)[/bold cyan]")
+    table = Table(title="Mock Results (N=5, mock LLM)")
+    table.add_column("Metric", style="bold")
+    table.add_column("Zero-Shot", justify="right")
+    table.add_column("Static", justify="right")
+    table.add_column("Dynamic", justify="right", style="bold green")
+
+    for metric in ["tsa", "pv", "pcr", "esa"]:
+        zs = zs_results["aggregate"][metric]
+        sf = sf_results["aggregate"][metric]
+        dy = dr_results["aggregate"][metric]
+        table.add_row(
+            metric.upper(),
+            f"{zs['mean']:.1%}" if isinstance(zs["mean"], float) else str(zs["mean"]),
+            f"{sf['mean']:.1%}" if isinstance(sf["mean"], float) else str(sf["mean"]),
+            f"{dy['mean']:.1%}" if isinstance(dy["mean"], float) else str(dy["mean"]),
+        )
+
+    console.print(table)
+    console.print("[dim]Note: These numbers are from a mock LLM — run with a real "
+                  "API key for actual results.[/dim]")
+
+    # --- Summary ---
+    console.print(f"\n{'=' * 50}")
+    total = passed + failed
+    console.print(f"[bold]Results: {passed}/{total} checks passed[/bold]")
+    if failed > 0:
+        console.print(f"[red]{failed} checks failed[/red]")
+        sys.exit(1)
+    else:
+        console.print("[bold green]All pipeline checks passed![/bold green]")
+        console.print("\n[dim]Next steps:[/dim]")
+        console.print("[dim]  1. Set GOOGLE_API_KEY and run: python examples/run_live_benchmark.py[/dim]")
+        console.print("[dim]  2. Set Langfuse keys for tracing[/dim]")
+        console.print("[dim]  3. Run the interactive agent: python -m agent.app --interactive[/dim]")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/pyproject.toml b/pyproject.toml
index fc96d91..bcbd937 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -48,6 +48,7 @@ dependencies = [
     "tiktoken>=0.7",
     "mcp>=1.0",
     "rich>=13.0",
+    "numpy>=1.26",
 ]
 
 [project.optional-dependencies]
diff --git a/src/behavioral_memory/__init__.py b/src/behavioral_memory/__init__.py
index f82404a..199455a 100644
--- a/src/behavioral_memory/__init__.py
+++ b/src/behavioral_memory/__init__.py
@@ -19,6 +19,7 @@
 )
 from behavioral_memory.gatekeeper.pipeline import GatekeeperPipeline
 from behavioral_memory.memory.dedup import Deduplicator
+from behavioral_memory.memory.in_memory_store import InMemoryTraceStore
 from behavioral_memory.memory.store import TraceStore
 from behavioral_memory.observability.annotation import AnnotationHandler
 from behavioral_memory.observability.feedback import FeedbackPoller
@@ -33,6 +34,7 @@
     "FeedbackPoller",
     "GatekeeperPipeline",
     "GatekeeperResult",
+    "InMemoryTraceStore",
     "LangfuseTracer",
     "Plan",
     "PlanEngine",
diff --git a/src/behavioral_memory/gatekeeper/pipeline.py b/src/behavioral_memory/gatekeeper/pipeline.py
index d8b13b6..aebcd63 100644
--- a/src/behavioral_memory/gatekeeper/pipeline.py
+++ b/src/behavioral_memory/gatekeeper/pipeline.py
@@ -11,6 +11,7 @@
 from __future__ import annotations
 
 import logging
+from typing import Any
 
 from behavioral_memory.core.config import Settings
 from behavioral_memory.core.schemas import ExecutionTrace, GatekeeperResult
@@ -18,7 +19,6 @@
 from behavioral_memory.gatekeeper.sandbox import SandboxExecutor
 from behavioral_memory.gatekeeper.schema_validator import SchemaValidator
 from behavioral_memory.memory.dedup import Deduplicator
-from behavioral_memory.memory.store import TraceStore
 from behavioral_memory.tools.registry import ToolRegistry
 
 logger = logging.getLogger(__name__)
@@ -29,7 +29,7 @@ class GatekeeperPipeline:
 
     def __init__(
         self,
-        store: TraceStore,
+        store: Any,
         registry: ToolRegistry,
         settings: Settings | None = None,
     ) -> None:
diff --git a/src/behavioral_memory/memory/__init__.py b/src/behavioral_memory/memory/__init__.py
index be38d26..c9e48c8 100644
--- a/src/behavioral_memory/memory/__init__.py
+++ b/src/behavioral_memory/memory/__init__.py
@@ -1,5 +1,6 @@
 from behavioral_memory.memory.dedup import Deduplicator
+from behavioral_memory.memory.in_memory_store import InMemoryTraceStore
 from behavioral_memory.memory.store import TraceStore
 from behavioral_memory.memory.token_budget import select_traces_within_budget
 
-__all__ = ["Deduplicator", "TraceStore", "select_traces_within_budget"]
+__all__ = ["Deduplicator", "InMemoryTraceStore", "TraceStore", "select_traces_within_budget"]
diff --git a/src/behavioral_memory/memory/dedup.py b/src/behavioral_memory/memory/dedup.py
index 314b373..e904948 100644
--- a/src/behavioral_memory/memory/dedup.py
+++ b/src/behavioral_memory/memory/dedup.py
@@ -9,10 +9,10 @@
 from __future__ import annotations
 
 import logging
+from typing import Any
 
 from behavioral_memory.core.config import Settings
 from behavioral_memory.core.schemas import ExecutionTrace
-from behavioral_memory.memory.store import TraceStore
 
 logger = logging.getLogger(__name__)
 
@@ -22,7 +22,7 @@ class Deduplicator:
 
     def __init__(
         self,
-        store: TraceStore,
+        store: Any,
         threshold: float | None = None,
         settings: Settings | None = None,
     ) -> None:
@@ -34,22 +34,16 @@ def is_duplicate(self, trace: ExecutionTrace) -> tuple[bool, float]:
         """Check if a trace is too similar to an existing one.
 
         Returns (is_duplicate, similarity_score).
+        Works with both TraceStore (PGVector) and InMemoryTraceStore.
         """
-        results = self._store.vectorstore.similarity_search_with_score(
-            trace.task_description, k=1
-        )
-        if not results:
-            return False, 0.0
-
-        doc, score = results[0]
+        score = self._store.similarity_score(trace.task_description)
         is_dup = score >= self.threshold
         if is_dup:
             logger.info(
-                "Duplicate detected (%.3f >= %.3f): '%s' ~ '%s'",
+                "Duplicate detected (%.3f >= %.3f): '%s'",
                 score,
                 self.threshold,
                 trace.task_description[:60],
-                doc.page_content[:60],
             )
         return is_dup, float(score)
 
diff --git a/src/behavioral_memory/memory/in_memory_store.py b/src/behavioral_memory/memory/in_memory_store.py
new file mode 100644
index 0000000..3c9dc82
--- /dev/null
+++ b/src/behavioral_memory/memory/in_memory_store.py
@@ -0,0 +1,89 @@
+"""In-memory trace store — no PostgreSQL required.
+
+Drop-in replacement for TraceStore that keeps embeddings in memory
+using numpy cosine similarity. Perfect for:
+  - Running the benchmark without database setup
+  - Local development and demos
+  - CI/CD testing with a real LLM
+
+Implements the same public interface as TraceStore so PlanEngine,
+Deduplicator, and the full pipeline work identically.
+"""
+
+from __future__ import annotations
+
+import logging
+
+import numpy as np
+from langchain_core.embeddings import Embeddings
+
+from behavioral_memory.core.config import Settings
+from behavioral_memory.core.schemas import ExecutionTrace
+
+logger = logging.getLogger(__name__)
+
+
+class InMemoryTraceStore:
+    """Vector store backed by in-memory numpy arrays.
+
+    Same interface as TraceStore but needs zero infrastructure.
+    """
+
+    def __init__(
+        self,
+        embeddings: Embeddings,
+        settings: Settings | None = None,
+    ) -> None:
+        self._embeddings = embeddings
+        self._settings = settings or Settings()
+        self._traces: list[ExecutionTrace] = []
+        self._vectors: list[list[float]] = []
+
+    def search(
+        self, query: str, k: int | None = None
+    ) -> list[tuple[ExecutionTrace, float]]:
+        k = k or self._settings.few_shot_k
+        if not self._traces:
+            return []
+
+        query_vec = self._embeddings.embed_query(query)
+        scores = self._cosine_similarities(query_vec, self._vectors)
+
+        top_indices = np.argsort(scores)[::-1][:k]
+        results = []
+        for idx in top_indices:
+            results.append((self._traces[idx], float(scores[idx])))
+        return results
+
+    def add(self, trace: ExecutionTrace) -> None:
+        vec = self._embeddings.embed_query(trace.task_description)
+        self._traces.append(trace)
+        self._vectors.append(vec)
+        logger.info("Stored trace (in-memory): %s", trace.task_description[:80])
+
+    def add_bulk(self, traces: list[ExecutionTrace]) -> int:
+        texts = [t.task_description for t in traces]
+        vecs = self._embeddings.embed_documents(texts)
+        self._traces.extend(traces)
+        self._vectors.extend(vecs)
+        logger.info("Bulk-added %d traces (in-memory)", len(traces))
+        return len(traces)
+
+    def similarity_score(self, query: str) -> float:
+        results = self.search(query, k=1)
+        if not results:
+            return 0.0
+        return results[0][1]
+
+    def count(self) -> int:
+        return len(self._traces)
+
+    @staticmethod
+    def _cosine_similarities(
+        query_vec: list[float], doc_vecs: list[list[float]]
+    ) -> np.ndarray:
+        q = np.array(query_vec)
+        d = np.array(doc_vecs)
+        q_norm = q / (np.linalg.norm(q) + 1e-10)
+        d_norms = d / (np.linalg.norm(d, axis=1, keepdims=True) + 1e-10)
+        return d_norms @ q_norm
diff --git a/src/behavioral_memory/memory/token_budget.py b/src/behavioral_memory/memory/token_budget.py
index edc3fea..883af9d 100644
--- a/src/behavioral_memory/memory/token_budget.py
+++ b/src/behavioral_memory/memory/token_budget.py
@@ -7,11 +7,12 @@
 
 from __future__ import annotations
 
+from typing import Any
+
 import tiktoken
 
 from behavioral_memory.core.config import Settings
 from behavioral_memory.core.schemas import ExecutionTrace, ToolSchema
-from behavioral_memory.memory.store import TraceStore
 
 _ENCODER = tiktoken.get_encoding("cl100k_base")
 
@@ -35,7 +36,7 @@ def _schema_tokens(schemas: list[ToolSchema]) -> int:
 
 
 def select_traces_within_budget(
-    store: TraceStore,
+    store: Any,
     query: str,
     tool_schemas: list[ToolSchema],
     *,
diff --git a/src/behavioral_memory/planner/engine.py b/src/behavioral_memory/planner/engine.py
index 10d90b5..0c1b52d 100644
--- a/src/behavioral_memory/planner/engine.py
+++ b/src/behavioral_memory/planner/engine.py
@@ -12,6 +12,7 @@
 from __future__ import annotations
 
 import logging
+from typing import Any
 
 from langchain_core.language_models import BaseChatModel
 from langchain_core.messages import HumanMessage, SystemMessage
@@ -19,7 +20,6 @@
 from behavioral_memory.core.config import Settings
 from behavioral_memory.core.exceptions import PlanGenerationError
 from behavioral_memory.core.schemas import ExecutionTrace, Plan, ToolSchema
-from behavioral_memory.memory.store import TraceStore
 from behavioral_memory.memory.token_budget import select_traces_within_budget
 from behavioral_memory.planner.postprocess import postprocess_plan
 from behavioral_memory.planner.prompt import SYSTEM_PROMPT, build_prompt
@@ -38,7 +38,7 @@ class PlanEngine:
     def __init__(
         self,
         llm: BaseChatModel,
-        store: TraceStore,
+        store: Any,
         registry: ToolRegistry | None = None,
         settings: Settings | None = None,
     ) -> None:
diff --git a/tests/unit/test_in_memory_store.py b/tests/unit/test_in_memory_store.py
new file mode 100644
index 0000000..2f79dd9
--- /dev/null
+++ b/tests/unit/test_in_memory_store.py
@@ -0,0 +1,87 @@
+"""Tests for InMemoryTraceStore — validates it matches TraceStore interface."""
+
+from __future__ import annotations
+
+from unittest.mock import MagicMock
+
+from behavioral_memory.core.schemas import ExecutionTrace, ToolCall
+from behavioral_memory.memory.in_memory_store import InMemoryTraceStore
+
+
+def _make_trace(desc: str) -> ExecutionTrace:
+    return ExecutionTrace(
+        task_description=desc,
+        tool_chain=[ToolCall(step_id="s1", tool_name="test_tool", parameters={"key": "val"})],
+    )
+
+
+def _make_mock_embeddings(dim: int = 4):
+    """Create a mock embeddings model that returns deterministic vectors."""
+    emb = MagicMock()
+
+    def embed_query(text: str) -> list[float]:
+        h = hash(text) % 10000
+        return [float(h % (i + 2)) / 10.0 for i in range(dim)]
+
+    def embed_documents(texts: list[str]) -> list[list[float]]:
+        return [embed_query(t) for t in texts]
+
+    emb.embed_query = embed_query
+    emb.embed_documents = embed_documents
+    return emb
+
+
+class TestInMemoryTraceStore:
+    def test_empty_search(self):
+        store = InMemoryTraceStore(embeddings=_make_mock_embeddings())
+        results = store.search("anything")
+        assert results == []
+
+    def test_add_and_count(self):
+        store = InMemoryTraceStore(embeddings=_make_mock_embeddings())
+        store.add(_make_trace("test task"))
+        assert store.count() == 1
+
+    def test_bulk_add(self):
+        store = InMemoryTraceStore(embeddings=_make_mock_embeddings())
+        traces = [_make_trace(f"task {i}") for i in range(5)]
+        added = store.add_bulk(traces)
+        assert added == 5
+        assert store.count() == 5
+
+    def test_search_returns_results(self):
+        store = InMemoryTraceStore(embeddings=_make_mock_embeddings())
+        store.add(_make_trace("build a data pipeline"))
+        store.add(_make_trace("deploy a web application"))
+        store.add(_make_trace("analyze revenue data"))
+
+        results = store.search("data pipeline", k=2)
+        assert len(results) == 2
+        for trace, score in results:
+            assert isinstance(trace, ExecutionTrace)
+            assert isinstance(score, float)
+
+    def test_search_respects_k(self):
+        store = InMemoryTraceStore(embeddings=_make_mock_embeddings())
+        for i in range(10):
+            store.add(_make_trace(f"task number {i}"))
+
+        results = store.search("query", k=3)
+        assert len(results) == 3
+
+    def test_similarity_score_empty(self):
+        store = InMemoryTraceStore(embeddings=_make_mock_embeddings())
+        assert store.similarity_score("anything") == 0.0
+
+    def test_similarity_score_with_data(self):
+        store = InMemoryTraceStore(embeddings=_make_mock_embeddings())
+        store.add(_make_trace("test task"))
+        score = store.similarity_score("test task")
+        assert isinstance(score, float)
+        assert score >= -1.0
+
+    def test_interface_matches_trace_store(self):
+        """Verify InMemoryTraceStore has the same public methods as TraceStore."""
+        required_methods = ["search", "add", "add_bulk", "similarity_score", "count"]
+        for method in required_methods:
+            assert hasattr(InMemoryTraceStore, method), f"Missing method: {method}"