feat(exp): add LongMemEval and LoCoMo expr #1937

Open
heaoxiang-ai wants to merge 11 commits into volcengine:main from heaoxiang-ai:feat_add_longmemeval
Conversation

@heaoxiang-ai
Contributor

Description

This PR adds OpenViking benchmark support for the LongMemEval and LoCoMo datasets, and standardizes both evaluation pipelines around OpenViking-native retrieval instead of the legacy VikingBot agentic loop.

For LongMemEval, the benchmark now uses a single-search-context evaluation flow with OpenViking find/read/rerank, updated answer prompts, memory-token accounting, and supporting analysis/debug outputs.
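The single-search-context flow described above can be sketched roughly as follows. This is an illustrative outline only, not the PR's actual code: `memory`, `reranker`, and `llm` are hypothetical interfaces standing in for the OpenViking find/read/rerank APIs and the answer model.

```python
# Illustrative sketch of the single-search answering flow:
# one find call, read the hits, rerank, then a single-round answer.
async def answer_single_search(memory, reranker, llm, question: str, top_k: int = 5):
    hits = await memory.find(question)                # 1. retrieve candidate memory URIs
    docs = [await memory.read(h.uri) for h in hits]   # 2. read full memory content
    ranked = reranker.rank(question, docs)[:top_k]    # 3. rerank, keep top_k
    context = "\n\n".join(ranked)
    prompt = f"Memories:\n{context}\n\nQuestion: {question}"
    return await llm.complete(prompt)                 # 4. single-round answer, no agent loop
```

Compared to an agentic loop, this makes retrieval cost and memory-token accounting deterministic: exactly one search per question, and the token count of `context` is the memory-token figure.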

For LoCoMo, this PR adds an OpenViking-native benchmark pipeline for import, evaluation, and judging, aligned more closely with the Mem0 evaluation methodology where applicable. It also improves LoCoMo ingestion fidelity by chunking sessions and preserving speaker roles during import.
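The chunked, role-preserving ingestion can be sketched as below. The names (`chunk_session`, `ROLE_MAP`, the `speaker`/`text` keys) are illustrative assumptions about the LoCoMo record shape, not the PR's actual API; only the speaker_a → user, speaker_b → assistant mapping is taken from the PR description.

```python
# Illustrative sketch: map LoCoMo speakers to chat roles, then split each
# session into fixed-size chunks for import into OpenViking memory.
ROLE_MAP = {"speaker_a": "user", "speaker_b": "assistant"}

def chunk_session(turns: list[dict], chunk_size: int = 10) -> list[list[dict]]:
    """Split one LoCoMo session into chunks of role-mapped turns."""
    mapped = [
        {"role": ROLE_MAP.get(t["speaker"], t["speaker"]), "content": t["text"]}
        for t in turns
    ]
    return [mapped[i : i + chunk_size] for i in range(0, len(mapped), chunk_size)]
```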

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)
  • Performance improvement
  • Test update

Changes Made

  • Added LongMemEval benchmark support on top of OpenViking retrieval.
  • Switched LongMemEval answering from agentic-loop-style execution to a single round of search -> read -> rerank -> answer.
  • Added debug output for model input prompt and richer retrieval traces in LongMemEval results.
  • Added memory-only token accounting for retrieved context in LongMemEval evaluation/stat scripts.
  • Updated LongMemEval answer/judge prompt wiring and evaluation utilities.
  • Simplified facts.yaml for benchmark-oriented memory extraction:
    • reduced redundant structured fields
    • strengthened stable fact_key guidance for current-state style facts
  • Added/updated analysis helpers and result-processing scripts for LongMemEval experiments.
  • Added LoCoMo benchmark support under benchmark scripts.
  • Added LoCoMo import pipeline for OpenViking memory construction.
  • Improved LoCoMo import behavior:
    • chunked session ingestion
    • speaker_a -> user, speaker_b -> assistant
  • Switched LoCoMo evaluation to OpenViking-native single-search retrieval flow.
  • Added LoCoMo prompt module aligned with the benchmark answer-generation style.
  • Updated LoCoMo judge to align with the Mem0-style benchmark judge flow:
    • category-aware answer preprocessing
    • evidence-aware judge prompt
    • exclusion of category 5 adversarial questions from scoring
  • Updated LoCoMo eval output writing to persist incrementally during long runs.
  • Added a separate benchmark/locomo/openviking/ benchmark directory so OpenViking-specific
    benchmark work does not depend on or overwrite the older VikingBot path.
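The incremental output writing mentioned above can be sketched as follows. This is a minimal illustration, assuming a JSON Lines layout and an `append_result` helper; neither name is taken from the PR itself.

```python
# Illustrative sketch: append each finished eval row and flush immediately,
# so a crash mid-run loses at most the in-flight row.
import json
from pathlib import Path

def append_result(path: Path, row: dict) -> None:
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
        f.flush()  # hand the line to the OS before the next row starts
```

Appending one line per result also makes partial runs resumable: on restart, the already-written rows can be counted and skipped.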

Testing

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have tested this on the following platforms:
    • Linux
    • macOS
    • Windows

Checklist

  • My code follows the project's coding style
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

@github-actions

github-actions Bot commented May 9, 2026

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🏅 Score: 85
🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Add LongMemEval Benchmark Support

Relevant files:

  • benchmark/longmemeval/**/*
  • tests/unit/benchmark/test_longmemeval_vikingbot.py

Sub-PR theme: Add LoCoMo Benchmark Support

Relevant files:

  • benchmark/locomo/**/*

Sub-PR theme: Update VikingBot for Benchmark Eval Mode

Relevant files:

  • bot/vikingbot/agent/loop.py
  • bot/vikingbot/agent/context.py
  • bot/vikingbot/agent/tools/ov_file.py
  • bot/vikingbot/agent/tools/registry.py
  • bot/vikingbot/utils/openviking_routing.py

Sub-PR theme: Simplify Memory Extraction Templates

Relevant files:

  • openviking/prompts/templates/memory/entities.yaml
  • openviking/prompts/templates/memory/facts.yaml

⚡ Recommended focus areas for review

Potential Breaking Change

The entities template's enabled flag was changed from true to false. This may break existing users who rely on entity memory extraction unless the change is specifically scoped to benchmark runs.

enabled: false
Error Handling Improvement

The broad exception handlers in judge_prefix and judge_response_correctness catch all exceptions without logging them. Consider adding logging for easier debugging.

    except Exception as exc:
        return {
            "sufficient": False,
            "reason": f"[API ERROR] {exc}",
            "supporting_uris": [],
            "raw_response": "",
        }


async def judge_response_correctness(
    client,
    model: str,
    row: dict[str, str],
    timeout: int,
) -> dict[str, Any]:
    prompt = RESPONSE_CORRECTNESS_PROMPT.format(
        question=row.get("question", ""),
        answer=row.get("answer", ""),
        question_type=row.get("question_type", ""),
        question_time=row.get("question_time", ""),
        response=row.get("response", ""),
    )
    try:
        resp = await client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
            timeout=timeout,
        )
        content = resp.choices[0].message.content or ""
        result = parse_correctness_response(content)
        result["raw_response"] = content
        return result
    except Exception as exc:
        return {
            "correct": False,
            "reason": f"[API ERROR] {exc}",
            "raw_response": "",
        }
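One way to act on this suggestion is to log the exception before returning the error sentinel. The sketch below is illustrative, not the PR's code: `judge_with_logging` is a hypothetical wrapper, and `call` stands in for either judge coroutine.

```python
# Illustrative fix: record the full traceback before returning the
# "[API ERROR]" sentinel, instead of silently swallowing the exception.
import logging

logger = logging.getLogger(__name__)

async def judge_with_logging(call, row: dict) -> dict:
    try:
        return await call(row)
    except Exception as exc:
        # logger.exception captures the traceback at ERROR level
        logger.exception("judge call failed for question %r", row.get("question"))
        return {
            "correct": False,
            "reason": f"[API ERROR] {exc}",
            "raw_response": "",
        }
```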
Minor Code Clarity

uri_to_local_path returns candidates[0] even when no candidate exists. This is handled downstream in read_uri_content, but returning None explicitly might improve clarity.

def uri_to_local_path(uri: str, data_root: Path) -> Path | None:
    if uri.startswith("viking://user/"):
        rest = uri[len("viking://user/") :]
        kind = "user"
    elif uri.startswith("viking://agent/"):
        rest = uri[len("viking://agent/") :]
        kind = "agent"
    else:
        return None

    candidates = [
        data_root / "viking" / "default" / kind / rest,
        data_root / "default" / kind / rest,
        data_root / kind / rest,
    ]
    for path in candidates:
        if path.exists():
            return path
    return candidates[0]

@github-actions

github-actions Bot commented May 9, 2026

PR Code Suggestions ✨

No code suggestions found for the PR.
