Add LoCoMo benchmark for OpenSearch Agentic Memory #4814

Open
dhrubo-os wants to merge 2 commits into opensearch-project:main from dhrubo-os:add-locomo-benchmark

Conversation

@dhrubo-os
Collaborator

Description

Adds a reproducible benchmark comparing OpenSearch Agentic Memory against mem0 on the LoCoMo dataset (10 long conversations, 1,540 QA questions).

Results

| Method | Overall accuracy | vs mem0 |
|---|---|---|
| BM25 (keyword) | 70.4% | +3.5 pp |
| Semantic (neural) | 78.9% | +12.0 pp |
| Hybrid | 79.6% | +12.7 pp |
| mem0 (baseline) | 66.9% | |
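
The "vs mem0" figures are absolute differences in percentage points against the 66.9% baseline, not relative improvements. A minimal sketch that recomputes them from the reported accuracies (the variable and function names are illustrative and do not come from benchmark.py):

```python
# Overall accuracy (%) as reported in the Results table.
overall = {
    "BM25 (keyword)": 70.4,
    "Semantic (neural)": 78.9,
    "Hybrid": 79.6,
}
MEM0_BASELINE = 66.9  # mem0 accuracy (%) from the same table


def delta_pp(score, baseline=MEM0_BASELINE):
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(score - baseline, 1)


deltas = {method: delta_pp(score) for method, score in overall.items()}

for method, delta in deltas.items():
    print(f"{method}: {delta:+.1f} pp")
```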

What's Included

  • setup.py — One-time model registration and container creation
  • benchmark.py — Resumable benchmark with BM25/semantic/hybrid comparison
  • memory_client.py — REST client for agentic memory APIs
  • README.md — Full setup instructions, results, and methodology

Key Finding

Namespace configuration is the primary factor. Per-conversation namespaces scored 79.6% and per-speaker namespaces scored 34.9%, a 44.7 pp gap. The lesson: namespace boundaries should match query scope, not data source.
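
One plausible way to picture why this matters is a toy in-memory store, sketched below. It is not the agentic memory API and not the benchmark's code: the facts, namespace keys, and helper names are all made up for illustration. A question about a conversation usually needs facts from both speakers, which a per-speaker namespace splits apart while also mixing in that speaker's other conversations.

```python
from collections import defaultdict

# Toy data: (conversation_id, speaker, fact). All values are hypothetical.
facts = [
    ("conv-1", "alice", "Alice adopted a dog named Rex"),
    ("conv-1", "bob", "Bob offered to walk Rex on weekends"),
    ("conv-2", "alice", "Alice moved to Denver"),
]


def build_store(namespace_of):
    """Group facts under the namespace chosen by namespace_of(conv, speaker)."""
    store = defaultdict(list)
    for conv, speaker, text in facts:
        store[namespace_of(conv, speaker)].append(text)
    return store


def search(store, namespace, keyword):
    """Naive keyword match restricted to a single namespace."""
    return [t for t in store[namespace] if keyword.lower() in t.lower()]


per_conversation = build_store(lambda conv, speaker: conv)
per_speaker = build_store(lambda conv, speaker: speaker)

# "Who walks Rex?" is a question about conv-1 as a whole.
hits_by_conversation = search(per_conversation, "conv-1", "rex")
hits_by_speaker = search(per_speaker, "alice", "rex")

print(hits_by_conversation)  # both speakers' facts about Rex
print(hits_by_speaker)       # only Alice's fact; Bob's is in another namespace
```

In this toy, the per-conversation namespace matches the scope of the question, so both relevant facts are reachable from one lookup.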

Testing

Ran end-to-end on OpenSearch 3.6.0 with Bedrock Titan v2 (embedding) + Claude Sonnet (extraction/answering) + GPT-4o-mini (judge, same as mem0).


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Adds a reproducible benchmark comparing OpenSearch Agentic Memory
against mem0 on the LoCoMo dataset (10 long conversations, 1,540 QA
questions).

Results: Hybrid search achieves 79.6% accuracy, beating mem0's 66.9%
baseline by 12.7 percentage points.

Includes:
- setup.py: One-time model registration and container creation
- benchmark.py: Resumable benchmark with BM25/semantic/hybrid comparison
- memory_client.py: REST client for agentic memory APIs
- README with full setup instructions and results

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

github-actions Bot commented May 6, 2026

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit b625478.

| Path | Line | Severity | Description |
|---|---|---|---|
| docs/benchmarks/locomo-agentic-memory/requirements.txt | 1 | high | Four new Python package dependencies added (requests, boto3, openai, urllib3). Per mandatory supply chain policy, all dependency additions must be flagged regardless of apparent legitimacy — maintainers must verify artifact authenticity and integrity. |
| docs/benchmarks/locomo-agentic-memory/setup.py | 107 | medium | AWS credentials (access_key, secret_key, session_token) are retrieved via boto3 and stored as plaintext inside OpenSearch ML connector configurations via REST API. This persists live AWS credentials in OpenSearch's internal index, accessible to any user with cluster read access. |
| docs/benchmarks/locomo-agentic-memory/.env.example | 5 | medium | SSL certificate verification is disabled by default (OPENSEARCH_VERIFY_SSL=false) and urllib3 warnings are silenced in both benchmark.py and setup.py. This allows silent MITM attacks against the OpenSearch endpoint, which handles credential-bearing requests including AWS keys. |
| docs/benchmarks/locomo-agentic-memory/README.md | 92 | low | The documented credential-refresh procedure passes AWS access key, secret key, and session token as inline shell variables used directly in curl command arguments. This exposes credentials in shell history and process listings (e.g., via /proc or ps output). |

The table above displays the top 10 most important findings.

Total: 4 | Critical: 0 | High: 1 | Medium: 2 | Low: 1
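
One possible way to address the SSL finding is to make verification the default and require an explicit opt-out. The sketch below is an assumption about how that could look, not what benchmark.py or setup.py currently do; only the OPENSEARCH_VERIFY_SSL variable name comes from the report, and the helper names are hypothetical.

```python
import os
import ssl


def verify_ssl_enabled(env=os.environ):
    """Return True unless OPENSEARCH_VERIFY_SSL is explicitly set to a false value."""
    raw = env.get("OPENSEARCH_VERIFY_SSL", "true")
    return raw.strip().lower() not in {"false", "0", "no", "off"}


def make_ssl_context(env=os.environ):
    """Build an SSL context that verifies certificates by default."""
    ctx = ssl.create_default_context()
    if not verify_ssl_enabled(env):
        # Explicit opt-out, e.g. for a local cluster with a self-signed cert.
        # check_hostname must be disabled before verify_mode can be CERT_NONE.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    return ctx


default_ctx = make_ssl_context({})
optout_ctx = make_ssl_context({"OPENSEARCH_VERIFY_SSL": "false"})
print(default_ctx.verify_mode, optout_ctx.verify_mode)
```

With this shape, an unset or misspelled variable leaves verification on, so disabling it is always a deliberate choice.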


Pull Requests Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass diff analyzer by adding label skip-diff-analyzer after reviewing the changes carefully, then re-run failed actions. To re-enable the analyzer, remove the label, then re-run all actions.


⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.

@dhrubo-os dhrubo-os temporarily deployed to ml-commons-cicd-env May 6, 2026 20:42 — with GitHub Actions Inactive
@dhrubo-os dhrubo-os had a problem deploying to ml-commons-cicd-env May 6, 2026 20:42 — with GitHub Actions Error
@dhrubo-os dhrubo-os had a problem deploying to ml-commons-cicd-env May 6, 2026 20:42 — with GitHub Actions Failure
@dhrubo-os dhrubo-os temporarily deployed to ml-commons-cicd-env May 6, 2026 20:42 — with GitHub Actions Inactive
Addresses Code-Diff-Analyzer finding: wildcard regex was too permissive.
Now only allows Bedrock runtime and OpenAI API endpoints.

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>
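
The difference between a wildcard and a narrowed endpoint allowlist can be sketched as follows. The patterns here are assumptions written for illustration and are not copied from the commit; the URLs in the checks are examples only.

```python
import re

# Hypothetical narrowed allowlist: Bedrock runtime and OpenAI API hosts only.
ALLOWED = [
    re.compile(r"^https://bedrock-runtime\.[a-z0-9-]+\.amazonaws\.com/.*$"),
    re.compile(r"^https://api\.openai\.com/.*$"),
]

# Hypothetical permissive pattern of the kind the analyzer would flag.
WILDCARD = [re.compile(r"^https://.*$")]


def is_trusted(url, patterns):
    """Return True if the URL matches at least one allowed pattern."""
    return any(p.match(url) for p in patterns)


bedrock = "https://bedrock-runtime.us-east-1.amazonaws.com/model/example/invoke"
openai_api = "https://api.openai.com/v1/chat/completions"
other = "https://attacker.example.com/collect"
lookalike = "https://bedrock-runtime.us-east-1.amazonaws.com.example.org/"

for url in (bedrock, openai_api, other, lookalike):
    print(url, is_trusted(url, ALLOWED), is_trusted(url, WILDCARD))
```

Anchoring each pattern at both ends and escaping the dots keeps lookalike hosts from matching.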
@dhrubo-os dhrubo-os temporarily deployed to ml-commons-cicd-env May 6, 2026 23:08 — with GitHub Actions Inactive
@dhrubo-os dhrubo-os temporarily deployed to ml-commons-cicd-env May 6, 2026 23:08 — with GitHub Actions Inactive
@dhrubo-os dhrubo-os temporarily deployed to ml-commons-cicd-env May 6, 2026 23:08 — with GitHub Actions Inactive
@dhrubo-os dhrubo-os had a problem deploying to ml-commons-cicd-env May 6, 2026 23:08 — with GitHub Actions Failure

codecov Bot commented May 6, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 77.41%. Comparing base (12f884e) to head (b625478).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #4814      +/-   ##
============================================
- Coverage     77.42%   77.41%   -0.01%     
  Complexity    11907    11907              
============================================
  Files           963      963              
  Lines         53326    53325       -1     
  Branches       6503     6503              
============================================
- Hits          41285    41284       -1     
  Misses         9289     9289              
  Partials       2752     2752              
| Flag | Coverage Δ |
|---|---|
| ml-commons | 77.41% <ø> (-0.01%) ⬇️ |
