Add LoCoMo benchmark for OpenSearch Agentic Memory #4814
Adds a reproducible benchmark comparing OpenSearch Agentic Memory against mem0 on the LoCoMo dataset (10 long conversations, 1,540 QA questions).

Results: Hybrid search achieves 79.6% accuracy, beating mem0's 66.9% baseline by 12.7 percentage points.

Includes:
- setup.py: One-time model registration and container creation
- benchmark.py: Resumable benchmark with BM25/semantic/hybrid comparison
- memory_client.py: REST client for agentic memory APIs
- README with full setup instructions and results

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>
PR Code Analyzer ❗ AI-powered 'Code-Diff-Analyzer' found issues on commit b625478.
The report lists the top 10 most important findings. Pull request author(s): please update your pull request according to the report.
Addresses Code-Diff-Analyzer finding: wildcard regex was too permissive. Now only allows Bedrock runtime and OpenAI API endpoints. Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>
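The tightened allowlist described in this commit can be illustrated with a small sketch. The two patterns below are assumptions written for illustration, not the expressions committed in setup.py, and the helper name `is_trusted_endpoint` is hypothetical. The point is that each pattern is anchored and pins the full host name, so a look-alike domain cannot slip through the way it would with a wildcard.

```python
import re

# Hypothetical allowlist: the actual patterns in the PR's setup.py may differ.
TRUSTED_ENDPOINT_PATTERNS = [
    # Bedrock runtime endpoints in any region (assumed host shape).
    r"^https://bedrock-runtime\.[a-z0-9-]+\.amazonaws\.com/.*$",
    # OpenAI API endpoint.
    r"^https://api\.openai\.com/.*$",
]


def is_trusted_endpoint(url: str) -> bool:
    """Return True only if the whole URL matches one of the allowlisted patterns."""
    return any(re.fullmatch(p, url) for p in TRUSTED_ENDPOINT_PATTERNS)


print(is_trusted_endpoint("https://api.openai.com/v1/chat/completions"))
print(is_trusted_endpoint("https://api.openai.com.evil.example/v1"))
```

Escaping the dots and requiring a `/` right after the host are what reject `api.openai.com.evil.example`; a pattern such as `^.*$` would accept it.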
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@             Coverage Diff              @@
##               main    #4814      +/-   ##
============================================
- Coverage     77.42%   77.41%   -0.01%
  Complexity    11907    11907
============================================
  Files           963      963
  Lines         53326    53325       -1
  Branches       6503     6503
============================================
- Hits          41285    41284       -1
  Misses         9289     9289
  Partials       2752     2752
Description
Adds a reproducible benchmark comparing OpenSearch Agentic Memory against mem0 on the LoCoMo dataset (10 long conversations, 1,540 QA questions).
Results
What's Included
- setup.py: One-time model registration and container creation
- benchmark.py: Resumable benchmark with BM25/semantic/hybrid comparison
- memory_client.py: REST client for agentic memory APIs
- README.md: Full setup instructions, results, and methodology
Key Finding
Namespace configuration is the primary factor in accuracy. Per-conversation namespaces scored 79.6% versus 34.9% for per-speaker namespaces, a 44.7 percentage-point difference. The lesson: namespace boundaries should match query scope, not data source.
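A toy sketch may help show why the namespace boundary matters. It does not call the agentic memory APIs; the `Memory` class, the namespace field names (`conversation_id`, `speaker`), and the sample data are all hypothetical. It only illustrates that a question about a conversation needs memories from both speakers, which a per-speaker namespace filters out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Memory:
    conversation_id: str
    speaker: str
    text: str


def per_conversation(m: Memory) -> Dict[str, str]:
    # Namespace keyed by query scope: the conversation the question is about.
    return {"conversation_id": m.conversation_id}


def per_speaker(m: Memory) -> Dict[str, str]:
    # Namespace keyed by data source: whoever said the sentence.
    return {"speaker": m.speaker}


def visible(
    memories: List[Memory],
    namespace_of: Callable[[Memory], Dict[str, str]],
    query_namespace: Dict[str, str],
) -> List[str]:
    """Return the texts a search could reach, given a namespace filter."""
    return [m.text for m in memories if namespace_of(m) == query_namespace]


# Made-up sample data, not taken from LoCoMo.
MEMORIES = [
    Memory("conv-1", "Alice", "Alice adopted a dog in May."),
    Memory("conv-1", "Bob", "Bob moved to Denver in June."),
]

both = visible(MEMORIES, per_conversation, {"conversation_id": "conv-1"})
alice_only = visible(MEMORIES, per_speaker, {"speaker": "Alice"})
print(len(both), len(alice_only))
```

With the per-conversation namespace, both facts are candidates for retrieval. With the per-speaker namespace, a query scoped to Alice can never surface what Bob said, whichever search method is used.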
Testing
Ran end-to-end on OpenSearch 3.6.0 with Bedrock Titan v2 (embedding) + Claude Sonnet (extraction/answering) + GPT-4o-mini (judge, same as mem0).
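Because benchmark.py is described as resumable, a long end-to-end run like this one can presumably be interrupted and restarted without re-judging finished questions. The sketch below shows one common way to get that behavior. It assumes a JSON checkpoint file and a caller-supplied judge callback; the function name `run_resumable`, the checkpoint layout, and the question dictionaries are hypothetical and not taken from the PR.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List


def run_resumable(
    questions: List[Dict[str, str]],
    answer_is_correct: Callable[[Dict[str, str]], bool],
    checkpoint: Path,
) -> float:
    """Judge each question once, persisting progress; return accuracy in percent."""
    # Load verdicts from a previous (possibly interrupted) run, if any.
    done: Dict[str, bool] = (
        json.loads(checkpoint.read_text()) if checkpoint.exists() else {}
    )
    for question in questions:
        qid = question["id"]
        if qid in done:
            continue  # already judged in an earlier run
        done[qid] = bool(answer_is_correct(question))
        # Write after every question so a crash loses at most one verdict.
        checkpoint.write_text(json.dumps(done))
    return 100.0 * sum(done.values()) / len(done) if done else 0.0
```

Calling `run_resumable` a second time with the same checkpoint path skips every stored question and returns the same accuracy, which is the property that makes a 1,540-question run cheap to restart.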
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.