feat: Add InMemoryTraceStore, live benchmark, and interactive agent by harsh-kr11 · Pull Request #2 · harsh-kr11/behavioral-memory

harsh-kr11 · 2026-05-18T14:07:27Z

Summary

This PR makes the behavioral-memory framework actually runnable end-to-end without PostgreSQL:

InMemoryTraceStore — drop-in replacement for TraceStore that uses numpy cosine similarity. No PostgreSQL needed.
Live benchmark runner (examples/run_live_benchmark.py) — runs the real 30-task benchmark through 3 strategies with a real LLM. Only needs GOOGLE_API_KEY. Produces actual TSA/PV/PCR/ESA numbers with bootstrap CIs and McNemar's test.
Pipeline validator (examples/validate_pipeline.py) — 30-check end-to-end validation with zero external deps.
Interactive agent (python -m agent.app --interactive) — REPL with /compare (with vs without memory) and /memory commands.
Store-agnostic components — PlanEngine, Deduplicator, Gatekeeper, and graph nodes now accept both store types via duck typing.
Honest README — clearly marks paper numbers as paper-sourced, documents how to reproduce them.
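
To illustrate the idea behind InMemoryTraceStore, here is a minimal sketch of an in-memory store that ranks entries by numpy cosine similarity. The class name `TinyInMemoryStore` and its `add`/`search` methods are hypothetical and chosen for illustration only; the real class in `in_memory_store.py` may expose a different interface.

```python
import numpy as np


class TinyInMemoryStore:
    """Hypothetical stand-in for illustration; not the PR's InMemoryTraceStore."""

    def __init__(self) -> None:
        self._ids: list[str] = []
        self._vectors: list[np.ndarray] = []

    def add(self, trace_id: str, embedding: list[float]) -> None:
        vec = np.asarray(embedding, dtype=np.float64)
        norm = np.linalg.norm(vec)
        if norm == 0.0:
            raise ValueError("embedding must be non-zero")
        # Store unit vectors so that search reduces to a dot product.
        self._ids.append(trace_id)
        self._vectors.append(vec / norm)

    def search(self, query: list[float], top_k: int = 3) -> list[tuple[str, float]]:
        if not self._vectors:
            return []
        q = np.asarray(query, dtype=np.float64)
        q_norm = np.linalg.norm(q)
        if q_norm == 0.0:
            raise ValueError("query must be non-zero")
        # Cosine similarity between the query and every stored vector.
        sims = np.stack(self._vectors) @ (q / q_norm)
        order = np.argsort(-sims)[:top_k]
        return [(self._ids[i], float(sims[i])) for i in order]


store = TinyInMemoryStore()
store.add("a", [1.0, 0.0])
store.add("b", [0.0, 1.0])
store.add("c", [1.0, 1.0])
results = store.search([1.0, 0.0], top_k=2)
```

A brute-force scan like this is linear in the number of stored vectors, which is usually acceptable for a 30-task benchmark but would not replace an indexed vector database at scale.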

What changed

Area	Changes
New files	`in_memory_store.py`, `run_live_benchmark.py`, `validate_pipeline.py`, `test_in_memory_store.py`
Modified	`engine.py`, `dedup.py`, `pipeline.py`, `token_budget.py`, `graph.py`, `retrieve.py` — all now store-agnostic
Agent	`app.py` rewritten with interactive mode, `/compare`, `/memory`
README	Comprehensive rewrite with quick-start, honest numbers, full testing guide
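
The store-agnostic change relies on duck typing: a component only needs the store to provide the methods it calls. The sketch below shows one way to express that contract with `typing.Protocol`. All names here (`TraceStoreLike`, `FakePostgresStore`, `FakeMemoryStore`, `retrieve_plans`, and the `search` signature) are assumptions for illustration and are not taken from the PR's code.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class TraceStoreLike(Protocol):
    """Hypothetical structural type: anything with a compatible search()."""

    def search(self, query: list[float], top_k: int) -> list[str]: ...


class FakePostgresStore:
    """Stand-in for a database-backed store (no real database involved)."""

    def search(self, query: list[float], top_k: int) -> list[str]:
        return ["pg-trace"][:top_k]


class FakeMemoryStore:
    """Stand-in for an in-memory store."""

    def search(self, query: list[float], top_k: int) -> list[str]:
        return ["mem-trace"][:top_k]


def retrieve_plans(store: TraceStoreLike, query: list[float], top_k: int = 1) -> list[str]:
    # No isinstance check on a concrete class: any object with search() works.
    return store.search(query, top_k)


pg_result = retrieve_plans(FakePostgresStore(), [0.1, 0.2])
mem_result = retrieve_plans(FakeMemoryStore(), [0.1, 0.2])
```

Neither fake store inherits from the protocol, yet both satisfy it, which is the property that lets one code path serve two store implementations.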

Test plan

104 pytest tests pass (pytest tests/ -v)
30/30 pipeline validation checks pass (python examples/validate_pipeline.py)
Ruff lint clean
Run live benchmark with real API key to get actual numbers
Test Langfuse integration with real credentials

Made with Cursor

Major changes:

- InMemoryTraceStore: drop-in replacement for TraceStore that uses numpy cosine similarity instead of PostgreSQL+pgvector. Enables running the full benchmark and agent with zero infrastructure.
- run_live_benchmark.py: sends all 30 tasks through the real LLM across 3 strategies, producing actual TSA/PV/PCR/ESA numbers with bootstrap CIs. Only needs a GOOGLE_API_KEY, no PostgreSQL.
- validate_pipeline.py: 30-check validation script that tests the entire pipeline end-to-end with mock services, zero external deps required.
- Interactive agent mode (python -m agent.app --interactive) with /compare and /memory commands.
- Made all components store-agnostic: PlanEngine, Deduplicator, Gatekeeper, BenchmarkRunner, and graph nodes now accept both TraceStore and InMemoryTraceStore via duck typing.
- Updated README with honest documentation: clearly marks paper numbers as paper-sourced, documents how to reproduce them, and adds quick-start guide.
- 104 tests passing, 30/30 pipeline validation checks passing.

Co-authored-by: Cursor <cursoragent@cursor.com>
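
For readers unfamiliar with the statistics named for the live benchmark, the following is a rough sketch of a percentile bootstrap confidence interval and an exact two-sided McNemar test over paired per-task outcomes. The function names, the resample count, and the toy data are assumptions made for illustration; they are not the benchmark's implementation or its results.

```python
import random
from math import comb


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of 0/1 task outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample tasks with replacement and record the success rate each time.
    means = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def mcnemar_exact(a_correct, b_correct):
    """Exact two-sided McNemar p-value from paired pass/fail outcomes."""
    b = sum(1 for x, y in zip(a_correct, b_correct) if x and not y)
    c = sum(1 for x, y in zip(a_correct, b_correct) if y and not x)
    n = b + c  # only discordant pairs carry information
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Synthetic toy data for 30 tasks (NOT real benchmark results).
with_memory = [True] * 30
without_memory = [False] * 5 + [True] * 25
p_value = mcnemar_exact(with_memory, without_memory)
ci_low, ci_high = bootstrap_ci([int(x) for x in without_memory])
```

McNemar's test fits this setting because every strategy is run on the same 30 tasks, so the comparison is paired rather than between independent samples.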

harsh-kr11 merged commit 7a8b92a into main May 18, 2026
0 of 5 checks passed

harsh-kr11 merged 1 commit into main from feat/live-agent-and-benchmark
