Skip to content

feat: Add InMemoryTraceStore, live benchmark, and interactive agent#2

Merged
harsh-kr11 merged 1 commit into
mainfrom
feat/live-agent-and-benchmark
May 18, 2026
Merged

feat: Add InMemoryTraceStore, live benchmark, and interactive agent#2
harsh-kr11 merged 1 commit into
mainfrom
feat/live-agent-and-benchmark

Conversation

@harsh-kr11
Copy link
Copy Markdown
Owner

Summary

This PR makes the behavioral-memory framework actually runnable end-to-end without PostgreSQL:

  • InMemoryTraceStore — drop-in replacement for TraceStore that uses numpy cosine similarity. No PostgreSQL needed.
  • Live benchmark runner (examples/run_live_benchmark.py) — runs the real 30-task benchmark through 3 strategies with a real LLM. Only needs GOOGLE_API_KEY. Produces actual TSA/PV/PCR/ESA numbers with bootstrap CIs and McNemar's test.
  • Pipeline validator (examples/validate_pipeline.py) — 30-check end-to-end validation with zero external deps.
  • Interactive agent (python -m agent.app --interactive) — REPL with /compare (with vs without memory) and /memory commands.
  • Store-agnostic components — PlanEngine, Deduplicator, Gatekeeper, and graph nodes now accept both store types via duck typing.
  • Honest README — clearly marks paper numbers as paper-sourced, documents how to reproduce them.

What changed

Area Changes
New files in_memory_store.py, run_live_benchmark.py, validate_pipeline.py, test_in_memory_store.py
Modified engine.py, dedup.py, pipeline.py, token_budget.py, graph.py, retrieve.py — all now store-agnostic
Agent app.py rewritten with interactive mode, /compare, /memory
README Comprehensive rewrite with quick-start, honest numbers, full testing guide

Test plan

  • 104 pytest tests pass (pytest tests/ -v)
  • 30/30 pipeline validation checks pass (python examples/validate_pipeline.py)
  • Ruff lint clean
  • Run live benchmark with real API key to get actual numbers
  • Test Langfuse integration with real credentials

Made with Cursor

Major changes:
- InMemoryTraceStore: drop-in replacement for TraceStore that uses numpy
  cosine similarity instead of PostgreSQL+pgvector. Enables running the
  full benchmark and agent with zero infrastructure.
- run_live_benchmark.py: sends all 30 tasks through the real LLM across
  3 strategies, producing actual TSA/PV/PCR/ESA numbers with bootstrap CIs.
  Only needs a GOOGLE_API_KEY, no PostgreSQL.
- validate_pipeline.py: 30-check validation script that tests the entire
  pipeline end-to-end with mock services, zero external deps required.
- Interactive agent mode (python -m agent.app --interactive) with /compare
  and /memory commands.
- Made all components store-agnostic: PlanEngine, Deduplicator, Gatekeeper,
  BenchmarkRunner, and graph nodes now accept both TraceStore and
  InMemoryTraceStore via duck typing.
- Updated README with honest documentation: clearly marks paper numbers as
  paper-sourced, documents how to reproduce them, and adds quick-start guide.
- 104 tests passing, 30/30 pipeline validation checks passing.

Co-authored-by: Cursor <cursoragent@cursor.com>
@harsh-kr11 harsh-kr11 merged commit 7a8b92a into main May 18, 2026
0 of 5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants