Add LoCoMo benchmark for OpenSearch Agentic Memory by dhrubo-os · Pull Request #4814 · opensearch-project/ml-commons

dhrubo-os · 2026-05-06T20:41:23Z

Description

Adds a reproducible benchmark comparing OpenSearch Agentic Memory against mem0 on the LoCoMo dataset (10 long conversations, 1,540 QA questions).

Results

Method	Overall	vs mem0
BM25 (keyword)	70.4%	+3.5 pp
Semantic (neural)	78.9%	+12.0 pp
Hybrid	79.6%	+12.7 pp
mem0 (baseline)	66.9%	—
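
The last column is an absolute difference in accuracy (percentage points) against the mem0 baseline, not a relative change. A minimal sketch of that arithmetic, using only the numbers from the table; the variable and function names are illustrative and are not taken from benchmark.py:

```python
# Overall accuracy (%) as reported in the results table.
overall = {
    "BM25 (keyword)": 70.4,
    "Semantic (neural)": 78.9,
    "Hybrid": 79.6,
}
MEM0_BASELINE = 66.9


def delta_vs_baseline(score: float, baseline: float = MEM0_BASELINE) -> float:
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(score - baseline, 1)


deltas = {method: delta_vs_baseline(score) for method, score in overall.items()}

for method, delta in deltas.items():
    print(f"{method}: {delta:+.1f} pp")
```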

What's Included

setup.py — One-time model registration and container creation
benchmark.py — Resumable benchmark with BM25/semantic/hybrid comparison
memory_client.py — REST client for agentic memory APIs
README.md — Full setup instructions, results, and methodology
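
benchmark.py is described as resumable. One common way to get that property is to append each scored question to a checkpoint file and skip already-recorded IDs on restart. The sketch below shows that pattern only; the JSON Lines format, the `run_resumable` name, and the `answer_fn` callback are assumptions for illustration and may not match the script's actual implementation:

```python
import json
from pathlib import Path
from typing import Callable, Iterable


def load_done(checkpoint: Path) -> dict[str, bool]:
    """Read previously scored question IDs from a JSON Lines checkpoint."""
    done: dict[str, bool] = {}
    if checkpoint.exists():
        for line in checkpoint.read_text().splitlines():
            if line.strip():
                record = json.loads(line)
                done[record["id"]] = record["correct"]
    return done


def run_resumable(
    questions: Iterable[tuple[str, str]],
    answer_fn: Callable[[str], bool],
    checkpoint: Path,
) -> dict[str, bool]:
    """Score each (id, question) pair once, appending results as we go."""
    done = load_done(checkpoint)
    with checkpoint.open("a") as out:
        for qid, question in questions:
            if qid in done:
                continue  # already scored in an earlier run
            done[qid] = answer_fn(question)
            out.write(json.dumps({"id": qid, "correct": done[qid]}) + "\n")
            out.flush()  # keep progress if the process is interrupted
    return done
```

Appending one record per question means an interrupted run loses at most the question in flight.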

Key Finding

Namespace configuration is the primary factor in accuracy. Per-conversation namespaces scored 79.6%, while per-speaker namespaces scored 34.9%, a difference of 44.7 pp. The lesson: namespace boundaries should match query scope, not data source.
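
To make the namespace lesson concrete, here is a toy in-memory illustration. It is not the Agentic Memory API, and it reflects one plausible reading of the finding: it assumes a query can only retrieve memories stored in the namespace it is scoped to. The data, `namespace_for`, `build_store`, and `search` are all hypothetical.

```python
from collections import defaultdict

# Toy memories from one two-speaker conversation (hypothetical data).
memories = [
    {"conversation": "conv-1", "speaker": "alice", "text": "Alice adopted a dog in May."},
    {"conversation": "conv-1", "speaker": "bob", "text": "Bob named the dog Rex."},
]


def namespace_for(memory: dict, scheme: str) -> str:
    """Choose the namespace key for a memory under a given scheme."""
    if scheme == "per_conversation":
        return memory["conversation"]
    if scheme == "per_speaker":
        return f'{memory["conversation"]}/{memory["speaker"]}'
    raise ValueError(f"unknown scheme: {scheme}")


def build_store(scheme: str) -> dict[str, list[str]]:
    """Group memory texts by namespace."""
    store: dict[str, list[str]] = defaultdict(list)
    for memory in memories:
        store[namespace_for(memory, scheme)].append(memory["text"])
    return store


def search(store: dict[str, list[str]], namespace: str) -> list[str]:
    """Return every memory visible to a query scoped to one namespace."""
    return store.get(namespace, [])


per_conversation = search(build_store("per_conversation"), "conv-1")
per_speaker = search(build_store("per_speaker"), "conv-1/alice")
print(len(per_conversation), len(per_speaker))
```

A question such as "What did Bob name Alice's dog?" needs both facts. Under this assumption, only the per-conversation scope makes both visible to a single query.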

Testing

Ran end-to-end on OpenSearch 3.6.0 with Bedrock Titan v2 (embedding) + Claude Sonnet (extraction/answering) + GPT-4o-mini (judge, same as mem0).

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Adds a reproducible benchmark comparing OpenSearch Agentic Memory against mem0 on the LoCoMo dataset (10 long conversations, 1,540 QA questions).

Results: Hybrid search achieves 79.6% accuracy, beating mem0's 66.9% baseline by 12.7 percentage points.

Includes:
- setup.py: One-time model registration and container creation
- benchmark.py: Resumable benchmark with BM25/semantic/hybrid comparison
- memory_client.py: REST client for agentic memory APIs
- README with full setup instructions and results

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

github-actions · 2026-05-06T20:42:21Z

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit b625478.

Path	Line	Severity	Description
docs/benchmarks/locomo-agentic-memory/requirements.txt	1	high	Four new Python package dependencies added (requests, boto3, openai, urllib3). Per mandatory supply chain policy, all dependency additions must be flagged regardless of apparent legitimacy — maintainers must verify artifact authenticity and integrity.
docs/benchmarks/locomo-agentic-memory/setup.py	107	medium	AWS credentials (access_key, secret_key, session_token) are retrieved via boto3 and stored as plaintext inside OpenSearch ML connector configurations via REST API. This persists live AWS credentials in OpenSearch's internal index, accessible to any user with cluster read access.
docs/benchmarks/locomo-agentic-memory/.env.example	5	medium	SSL certificate verification is disabled by default (OPENSEARCH_VERIFY_SSL=false) and urllib3 warnings are silenced in both benchmark.py and setup.py. This allows silent MITM attacks against the OpenSearch endpoint, which handles credential-bearing requests including AWS keys.
docs/benchmarks/locomo-agentic-memory/README.md	92	low	The documented credential-refresh procedure passes AWS access key, secret key, and session token as inline shell variables used directly in curl command arguments. This exposes credentials in shell history and process listings (e.g., via /proc or ps output).

The table above displays the top 10 most important findings.

Total: 4 | Critical: 0 | High: 1 | Medium: 2 | Low: 1

Pull Requests Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass diff analyzer by adding label skip-diff-analyzer after reviewing the changes carefully, then re-run failed actions. To re-enable the analyzer, remove the label, then re-run all actions.

⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.
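
One way to address the SSL finding above is to make certificate verification the default and require an explicit opt-out. The sketch below is not the change made in this PR. It assumes the `OPENSEARCH_VERIFY_SSL` variable from `.env.example` and introduces a hypothetical `OPENSEARCH_CA_BUNDLE` variable for clusters with a private certificate authority. The returned value is meant for the `verify` argument of `requests` calls, which accepts either a boolean or a path to a CA bundle.

```python
import os
from typing import Mapping, Union

FALSE_VALUES = {"0", "false", "no", "off"}


def resolve_verify(env: Mapping[str, str] = os.environ) -> Union[bool, str]:
    """Return a value for requests' `verify` argument.

    Verification stays on unless OPENSEARCH_VERIFY_SSL is explicitly falsy.
    OPENSEARCH_CA_BUNDLE (hypothetical name) points at a CA file for
    clusters that use a private or self-signed certificate authority.
    """
    flag = env.get("OPENSEARCH_VERIFY_SSL", "true").strip().lower()
    if flag in FALSE_VALUES:
        return False  # explicit opt-out, e.g. a throwaway local cluster
    ca_bundle = env.get("OPENSEARCH_CA_BUNDLE", "").strip()
    return ca_bundle or True
```

A caller would then pass the result through, for example `requests.get(url, verify=resolve_verify())`, and would leave urllib3 warnings enabled so that an opt-out stays visible in the logs.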

codecov · 2026-05-06T23:31:01Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 77.41%. Comparing base (12f884e) to head (b625478).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #4814      +/-   ##
============================================
- Coverage     77.42%   77.41%   -0.01%     
  Complexity    11907    11907              
============================================
  Files           963      963              
  Lines         53326    53325       -1     
  Branches       6503     6503              
============================================
- Hits          41285    41284       -1     
  Misses         9289     9289              
  Partials       2752     2752

Flag	Coverage Δ
ml-commons	`77.41% <ø> (-0.01%)`	⬇️


dhrubo-os requested review from HenryL27, Zhangxunmt, akolarkunnu, austintlee, b4sjoo, jngz-es, mingshl, model-collapse, pyek-bot, rbhavna, sam-herman, xinyual, ylwu-amzn and zane-neo as code owners May 6, 2026 20:41

dhrubo-os temporarily deployed to ml-commons-cicd-env May 6, 2026 20:42 — with GitHub Actions Inactive

dhrubo-os had a problem deploying to ml-commons-cicd-env May 6, 2026 20:42 — with GitHub Actions Error

dhrubo-os had a problem deploying to ml-commons-cicd-env May 6, 2026 20:42 — with GitHub Actions Failure

dhrubo-os temporarily deployed to ml-commons-cicd-env May 6, 2026 20:42 — with GitHub Actions Inactive

Scope trusted_connector_endpoints_regex to Bedrock and OpenAI only

b625478

Addresses Code-Diff-Analyzer finding: wildcard regex was too permissive. Now only allows Bedrock runtime and OpenAI API endpoints. Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>
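
The commit's actual regex values are not shown in this thread. As a rough illustration of what scoping `trusted_connector_endpoints_regex` to two providers can look like, the sketch below checks URLs against two assumed patterns. Both patterns and the `is_trusted` helper are assumptions, and Python's `re` module stands in for the regex engine OpenSearch uses.

```python
import re

# Assumed patterns; the values used in the commit are not shown here.
TRUSTED_ENDPOINT_PATTERNS = [
    r"^https://bedrock-runtime\.[a-z0-9-]+\.amazonaws\.com/.*$",
    r"^https://api\.openai\.com/.*$",
]


def is_trusted(url: str) -> bool:
    """True if the URL fully matches at least one scoped pattern."""
    return any(re.fullmatch(p, url) for p in TRUSTED_ENDPOINT_PATTERNS)
```

Escaping the dots and anchoring the host up to the first `/` matters here: it keeps look-alike hosts such as `api.openai.com.evil.example` from matching.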

dhrubo-os temporarily deployed to ml-commons-cicd-env May 6, 2026 23:08 — with GitHub Actions Inactive

dhrubo-os had a problem deploying to ml-commons-cicd-env May 6, 2026 23:08 — with GitHub Actions Failure

dhrubo-os wants to merge 2 commits into opensearch-project:main from dhrubo-os:add-locomo-benchmark
