feat(exp): add LongMemEval and LoCoMo expr #1937
Open
heaoxiang-ai wants to merge 11 commits into volcengine:main from
PR Code Suggestions ✨ No code suggestions found for the PR.
Description
This PR adds OpenViking benchmark support for the LongMemEval and LoCoMo datasets, and standardizes both evaluation pipelines around OpenViking-native retrieval instead of the legacy VikingBot agentic loop.
For LongMemEval, the benchmark now uses a single-search-context evaluation flow with OpenViking find/read/rerank, updated answer prompts, memory-token accounting, and supporting analysis/debug outputs.
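As a rough illustration of the single-search-context flow described above, here is a minimal sketch. It is not the PR's implementation: `find`, `read`, and `rerank` are hypothetical callables standing in for the OpenViking operations, the find, read, rerank ordering is an assumption, and whitespace splitting is only a placeholder for whatever tokenizer the benchmark actually uses for memory-token accounting.

```python
from typing import Callable, Sequence


def build_context(
    query: str,
    find: Callable[[str], Sequence[str]],
    read: Callable[[str], str],
    rerank: Callable[[str, Sequence[str]], Sequence[str]],
    top_k: int = 3,
) -> tuple[str, int]:
    """Run one search pass and return (context, memory_token_count).

    All three callables are hypothetical stand-ins, not OpenViking APIs.
    """
    # Single search: one find call per question, no agentic loop.
    uris = find(query)
    # Fetch the text behind each hit.
    passages = [read(uri) for uri in uris]
    # Reorder the passages by relevance and keep only the best top_k.
    ranked = list(rerank(query, passages))[:top_k]
    context = "\n\n".join(ranked)
    # Placeholder accounting: whitespace-separated tokens in the context.
    memory_tokens = len(context.split())
    return context, memory_tokens
```

Passing the retrieval steps in as callables keeps the sketch independent of any particular client and makes the token accounting easy to check with stubs.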
For LoCoMo, this PR adds an OpenViking-native benchmark pipeline for import, evaluation, and judging, aligned more closely with the Mem0 evaluation methodology where applicable. It also improves LoCoMo ingestion fidelity by chunking sessions and preserving speaker roles during import.
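The ingestion change for LoCoMo (chunking sessions while preserving speaker roles) could look roughly like the sketch below. The `Turn` type, the `chunk_session` and `render_chunk` helpers, and the fixed turns-per-chunk rule are all assumptions for illustration; the PR may chunk by tokens or by some other boundary.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class Turn:
    """One utterance in a session (hypothetical shape)."""
    speaker: str
    text: str


def chunk_session(turns: list[Turn], max_turns: int = 4) -> Iterator[list[Turn]]:
    """Yield consecutive slices of at most max_turns turns, in order."""
    if max_turns < 1:
        raise ValueError("max_turns must be at least 1")
    for start in range(0, len(turns), max_turns):
        yield turns[start:start + max_turns]


def render_chunk(chunk: list[Turn]) -> str:
    """Serialize a chunk with a speaker prefix on every line,
    so the role of each utterance survives import."""
    return "\n".join(f"{turn.speaker}: {turn.text}" for turn in chunk)
```

Keeping the speaker label on every line means each chunk remains interpretable on its own, even when a retrieval hit returns it without the rest of the session.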
Type of Change
Changes Made
benchmark work does not depend on or overwrite the older VikingBot path.
Testing
Checklist