feat(exp): add LongMemEval and LoCoMo expr #1937

Open
heaoxiang-ai wants to merge 11 commits into volcengine:main from heaoxiang-ai:feat_add_longmemeval
Conversation

@heaoxiang-ai
Contributor

Description

This PR adds OpenViking benchmark support for the LongMemEval and LoCoMo datasets, and standardizes both evaluation pipelines around OpenViking-native retrieval instead of the legacy VikingBot agentic loop.

For LongMemEval, the benchmark now uses a single-search-context evaluation flow with OpenViking find/read/rerank, updated answer prompts, memory-token accounting, and supporting analysis/debug outputs.
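The single-search-context flow described above can be sketched roughly as follows. This is an illustrative outline only, not the PR's actual code: `memory`, `reranker`, and `llm` are hypothetical interfaces standing in for the OpenViking find/read/rerank APIs and the answer model.

```python
# Illustrative sketch of the single-search answering flow:
# one find call, read the hits, rerank, then a single-round answer.
async def answer_single_search(memory, reranker, llm, question: str, top_k: int = 5):
    hits = await memory.find(question)                # 1. retrieve candidate memory URIs
    docs = [await memory.read(h.uri) for h in hits]   # 2. read full memory content
    ranked = reranker.rank(question, docs)[:top_k]    # 3. rerank, keep top_k
    context = "\n\n".join(ranked)
    prompt = f"Memories:\n{context}\n\nQuestion: {question}"
    return await llm.complete(prompt)                 # 4. single-round answer, no agent loop
```

Compared to an agentic loop, this makes retrieval cost and memory-token accounting deterministic: exactly one search per question, and the token count of `context` is the memory-token figure.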

For LoCoMo, this PR adds an OpenViking-native benchmark pipeline for import, evaluation, and judging, aligned more closely with the Mem0 evaluation methodology where applicable. It also improves LoCoMo ingestion fidelity by chunking sessions and preserving speaker roles during import.
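The chunked, role-preserving ingestion can be sketched as below. The names (`chunk_session`, `ROLE_MAP`, the `speaker`/`text` keys) are illustrative assumptions about the LoCoMo record shape, not the PR's actual API; only the speaker_a → user, speaker_b → assistant mapping is taken from the PR description.

```python
# Illustrative sketch: map LoCoMo speakers to chat roles, then split each
# session into fixed-size chunks for import into OpenViking memory.
ROLE_MAP = {"speaker_a": "user", "speaker_b": "assistant"}

def chunk_session(turns: list[dict], chunk_size: int = 10) -> list[list[dict]]:
    """Split one LoCoMo session into chunks of role-mapped turns."""
    mapped = [
        {"role": ROLE_MAP.get(t["speaker"], t["speaker"]), "content": t["text"]}
        for t in turns
    ]
    return [mapped[i : i + chunk_size] for i in range(0, len(mapped), chunk_size)]
```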

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)
  • Performance improvement
  • Test update

Changes Made

  • Added LongMemEval benchmark support on top of OpenViking retrieval.
  • Switched LongMemEval answering from agentic-loop-style execution to a single round of search -> read -> rerank -> answer.
  • Added debug output for model input prompt and richer retrieval traces in LongMemEval results.
  • Added memory-only token accounting for retrieved context in LongMemEval evaluation/stat scripts.
  • Updated LongMemEval answer/judge prompt wiring and evaluation utilities.
  • Simplified facts.yaml for benchmark-oriented memory extraction:
    • reduced redundant structured fields
    • strengthened stable fact_key guidance for current-state style facts
  • Added/updated analysis helpers and result-processing scripts for LongMemEval experiments.
  • Added LoCoMo benchmark support under benchmark scripts.
  • Added LoCoMo import pipeline for OpenViking memory construction.
  • Improved LoCoMo import behavior:
    • chunked session ingestion
    • speaker_a -> user, speaker_b -> assistant
  • Switched LoCoMo evaluation to OpenViking-native single-search retrieval flow.
  • Added LoCoMo prompt module aligned with the benchmark answer-generation style.
  • Updated LoCoMo judge to align with the Mem0-style benchmark judge flow:
    • category-aware answer preprocessing
    • evidence-aware judge prompt
    • exclusion of category 5 adversarial questions from scoring
  • Updated LoCoMo eval output writing to persist incrementally during long runs.
  • Added a separate benchmark/locomo/openviking/ benchmark directory so OpenViking-specific
    benchmark work does not depend on or overwrite the older VikingBot path.
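The incremental output writing mentioned above can be sketched as follows. This is a minimal illustration, assuming a JSON Lines layout and an `append_result` helper; neither name is taken from the PR itself.

```python
# Illustrative sketch: append each finished eval row and flush immediately,
# so a crash mid-run loses at most the in-flight row.
import json
from pathlib import Path

def append_result(path: Path, row: dict) -> None:
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
        f.flush()  # hand the line to the OS before the next row starts
```

Appending one line per result also makes partial runs resumable: on restart, the already-written rows can be counted and skipped.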

Testing

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have tested this on the following platforms:
    • Linux
    • macOS
    • Windows

Checklist

  • My code follows the project's coding style
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

@github-actions

github-actions Bot commented May 9, 2026

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🏅 Score: 85
🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Add LongMemEval Benchmark Support

Relevant files:

  • benchmark/longmemeval/**/*
  • tests/unit/benchmark/test_longmemeval_vikingbot.py

Sub-PR theme: Add LoCoMo Benchmark Support

Relevant files:

  • benchmark/locomo/**/*

Sub-PR theme: Update VikingBot for Benchmark Eval Mode

Relevant files:

  • bot/vikingbot/agent/loop.py
  • bot/vikingbot/agent/context.py
  • bot/vikingbot/agent/tools/ov_file.py
  • bot/vikingbot/agent/tools/registry.py
  • bot/vikingbot/utils/openviking_routing.py

Sub-PR theme: Simplify Memory Extraction Templates

Relevant files:

  • openviking/prompts/templates/memory/entities.yaml
  • openviking/prompts/templates/memory/facts.yaml

⚡ Recommended focus areas for review

Potential Breaking Change

The entities template's enabled flag was changed from true to false. This may break existing users who rely on entity memory extraction unless the change is specifically scoped to benchmark runs.

enabled: false
Error Handling Improvement

The broad exception handlers in judge_prefix and judge_response_correctness catch all exceptions without logging them. Consider adding logging for easier debugging.

    except Exception as exc:
        return {
            "sufficient": False,
            "reason": f"[API ERROR] {exc}",
            "supporting_uris": [],
            "raw_response": "",
        }


async def judge_response_correctness(
    client,
    model: str,
    row: dict[str, str],
    timeout: int,
) -> dict[str, Any]:
    prompt = RESPONSE_CORRECTNESS_PROMPT.format(
        question=row.get("question", ""),
        answer=row.get("answer", ""),
        question_type=row.get("question_type", ""),
        question_time=row.get("question_time", ""),
        response=row.get("response", ""),
    )
    try:
        resp = await client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
            timeout=timeout,
        )
        content = resp.choices[0].message.content or ""
        result = parse_correctness_response(content)
        result["raw_response"] = content
        return result
    except Exception as exc:
        return {
            "correct": False,
            "reason": f"[API ERROR] {exc}",
            "raw_response": "",
        }
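One way to act on this suggestion is to log the exception before returning the error sentinel. The sketch below is illustrative, not the PR's code: `judge_with_logging` is a hypothetical wrapper, and `call` stands in for either judge coroutine.

```python
# Illustrative fix: record the full traceback before returning the
# "[API ERROR]" sentinel, instead of silently swallowing the exception.
import logging

logger = logging.getLogger(__name__)

async def judge_with_logging(call, row: dict) -> dict:
    try:
        return await call(row)
    except Exception as exc:
        # logger.exception captures the traceback at ERROR level
        logger.exception("judge call failed for question %r", row.get("question"))
        return {
            "correct": False,
            "reason": f"[API ERROR] {exc}",
            "raw_response": "",
        }
```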
Minor Code Clarity

uri_to_local_path returns candidates[0] even when no candidate exists. This is handled downstream in read_uri_content, but returning None explicitly might improve clarity.

def uri_to_local_path(uri: str, data_root: Path) -> Path | None:
    if uri.startswith("viking://user/"):
        rest = uri[len("viking://user/") :]
        kind = "user"
    elif uri.startswith("viking://agent/"):
        rest = uri[len("viking://agent/") :]
        kind = "agent"
    else:
        return None

    candidates = [
        data_root / "viking" / "default" / kind / rest,
        data_root / "default" / kind / rest,
        data_root / kind / rest,
    ]
    for path in candidates:
        if path.exists():
            return path
    return candidates[0]

@github-actions

github-actions Bot commented May 9, 2026

PR Code Suggestions ✨

No code suggestions found for the PR.
