LoCoMo public evaluation on evals branch is hard to reproduce: SQLite failure, memory tool-call errors, and lower global_idx=1 score #128

@qzds

For a controlled test, I first ran only global_idx=1.

The run completes when using PostgreSQL, but the score is much lower than the official score range for the same questions. I also observed several memory-tool failures during ingestion.

Environment

  • Branch: evals
  • Evaluation directory: public_evaluations
  • Entry point: public_evaluations/main.py
  • Dataset: LOCOMO
  • Test index: global_idx=1
  • Answer model: gpt-4.1-mini
  • Embedding model: text-embedding-3-small
  • Backend: OpenAI-compatible endpoint
  • DB:
    • SQLite: fails
    • PostgreSQL + pgvector: run completes

Reproduction

1. SQLite path fails

Running the README-style command with an isolated HOME still uses SQLite by default:

cd public_evaluations

python main.py \
  --agent_name mirix \
  --dataset LOCOMO \
  --global_idx 1 \
  --num_exp -1 \
  --sub_datasets 'longmemeval_s*' \
  --config_path ../mirix/configs/mirix_openai_compat_41mini.yaml

It fails with:

sqlite3.OperationalError: near ">>": syntax error

ORDER BY CAST(knowledge_vault.last_modify ->> 'timestamp' AS DATETIME) DESC

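This appears to be caused by the SQLite library in use not supporting the PostgreSQL-style ->> JSON operator. SQLite added -> and ->> only in version 3.38.0, so an older library bundled with the Python runtime would fail this way.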
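A quick diagnostic is to ask the sqlite3 module which SQLite library it is linked against and whether that library parses ->> at all. The sketch below is only a check, not a fix; the helper name supports_json_arrow is hypothetical and not part of MIRIX.

```python
import sqlite3


def supports_json_arrow(conn: sqlite3.Connection) -> bool:
    """Return True if this SQLite build parses the ->> JSON operator."""
    try:
        conn.execute(
            """SELECT '{"timestamp": "2023-01-29"}' ->> 'timestamp'"""
        ).fetchone()
        return True
    except sqlite3.OperationalError:
        # Builds without the operator reject the statement at parse time,
        # e.g. with: near ">>": syntax error
        return False


conn = sqlite3.connect(":memory:")
print("SQLite library version:", sqlite3.sqlite_version)
print("->> operator supported:", supports_json_arrow(conn))
```

If the second line prints False, json_extract(knowledge_vault.last_modify, '$.timestamp') is the older SQLite spelling of the same lookup, assuming the JSON1 functions are compiled into that build.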

2. PostgreSQL path completes

Using PostgreSQL via:

export MIRIX_PG_URI="postgresql+pg8000://mirix:mirix@127.0.0.1:5432/<db_name>"
export mirix_pg_uri="$MIRIX_PG_URI"

the official main.py -> run_instance.py path completes successfully for global_idx=1.

Generated result:

results/mirix_LOCOMO-modelgpt-4.1-mini/1_subsetNone_cksizeNone/results.json

Run status:

rows: 81
ERROR_RESPONSE_FAILED: 0
content_filter: 0
Retry limit reached: 0
model_not_found: 0

Observed score

Using the official evaluator:

python evals.py \
  --input_file results/mirix_LOCOMO-modelgpt-4.1-mini/1_subsetNone_cksizeNone/results.json \
  --output_file results/mirix_LOCOMO-modelgpt-4.1-mini/readme_pg_idx1_evaluation_metrics.json \
  --max_workers 3

I get:

Total Questions: 81
Overall Average BLEU Score: 0.2391
Overall Average F1 Score: 0.3282
Overall Average LLM Score: 0.7654

For the same 81 questions, the provided official evaluation metrics appear to be around:

0.8889 - 0.9259 LLM score

so the gap is roughly 0.12 to 0.16 in LLM score.
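To put the difference in terms of questions rather than fractions, the scores can be converted back to counts. This is a small arithmetic sketch and it assumes the LLM score is the mean of per-question 0/1 judgements, which I have not confirmed against evals.py.

```python
TOTAL_QUESTIONS = 81
observed = 0.7654
official_low, official_high = 0.8889, 0.9259


def as_count(score: float, total: int = TOTAL_QUESTIONS) -> int:
    # Assumption: the score is a mean of per-question 0/1 judgements,
    # so score * total is (up to rounding) the number judged correct.
    return round(score * total)


observed_correct = as_count(observed)
gap_low = as_count(official_low) - observed_correct
gap_high = as_count(official_high) - observed_correct

print(f"observed correct: {observed_correct}/{TOTAL_QUESTIONS}")
print(f"questions behind the official range: {gap_low} to {gap_high}")
```

Under that assumption, my run has 62 of 81 judged correct, which is 10 to 13 questions behind the official range.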

Memory tool failures observed

During memory ingestion, I see tool failures such as:

core_memory_append: ValidationError
Value error, Edit failed: Exceeds 5000 character limit

Example:

Error executing function core_memory_append:
ValidationError: 1 validation error for Human
Value error, Edit failed: Exceeds 5000 character limit (requested 5551)
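In this example the append overshoots the limit by 551 characters. A minimal sketch of a pre-check that a caller could run before attempting the append is shown below. The function name overflow_after_append is hypothetical, the 5000 limit is taken from the error message, and this is not how MIRIX implements the tool; it also ignores any separator the real append may insert.

```python
CORE_MEMORY_LIMIT = 5000  # taken from the error message, not from MIRIX source


def overflow_after_append(current: str, addition: str,
                          limit: int = CORE_MEMORY_LIMIT) -> int:
    """Return by how many characters the append would exceed the limit (0 if it fits)."""
    requested = len(current) + len(addition)
    return max(0, requested - limit)


# Sizes chosen to match the log above: the append would request 5551 characters.
current_block = "x" * 5000
new_text = "y" * 551
print(overflow_after_append(current_block, new_text))  # 551
```

A non-zero result would be the point at which a rewrite, compression, or truncation step could run instead of letting the tool call fail.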

In earlier runs, I also observed repeated tool-call argument mismatches such as:

episodic_memory_merge() got an unexpected keyword argument 'actor'
episodic_memory_merge() got an unexpected keyword argument 'tree_path'

This seems to happen when the model provides full episodic-memory fields to episodic_memory_merge, although the function only accepts:

event_id
combined_summary
combined_details
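One possible way to tolerate the extra arguments is to filter the tool-call payload against the function signature before dispatch. The sketch below uses a hypothetical stand-in for episodic_memory_merge with only the three parameters listed above; call_with_known_args is also a hypothetical helper, not an existing MIRIX function.

```python
import inspect


def episodic_memory_merge(event_id, combined_summary, combined_details):
    # Hypothetical stand-in with the three parameters listed above.
    return {
        "event_id": event_id,
        "summary": combined_summary,
        "details": combined_details,
    }


def call_with_known_args(func, **kwargs):
    """Call func with only the keyword arguments its signature declares.

    Returns the function result and the sorted names of dropped arguments.
    """
    accepted = inspect.signature(func).parameters
    dropped = sorted(name for name in kwargs if name not in accepted)
    kept = {name: value for name, value in kwargs.items() if name in accepted}
    return func(**kept), dropped


result, dropped = call_with_known_args(
    episodic_memory_merge,
    event_id="evt-1",
    combined_summary="summary",
    combined_details="details",
    actor="user",          # extra field the model sent
    tree_path=["work"],    # extra field the model sent
)
print(dropped)  # ['actor', 'tree_path']
```

Returning the dropped names keeps the mismatch visible in logs, so ignoring the arguments does not hide a prompt or schema problem.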

Error pattern in wrong answers

The low score does not seem to be caused only by the core_memory_append 5000-character failures. The bigger pattern is that answers often retrieve or use adjacent/generic memories instead of the exact evidence.

Example 1

Question:

What offer does Gina make to Jon regarding social media?

Gold:

Helping with making content and managing his social media accounts.

Observed response:

Jon plans to expand his dance studio's social media presence, offer workshops and classes to local schools and centers, and host dance competitions...

Example 2

Question:

What plans does Jon have after receiving advice at the networking event?

Gold:

Sprucing up his business plan, tweaking his pitch to investors, and working on an online platform.

Observed response:

Jon plans to expand his dance studio's social media presence, offer workshops and classes...

Example 3

Question:

When did Gina launch an ad campaign for her store?

Gold:

29 January, 2023

Observed response:

January 2023

So the main issue seems to be loss of precise details / retrieval of nearby memories rather than total API failure.

Questions

  1. Is the README evaluation expected to work with SQLite on Linux, or should PostgreSQL be documented as required for the current evals branch?
  2. Were the official LoCoMo numbers generated from this exact evals branch code and prompt/tool schema?
  3. Is core_memory_append expected to fail when the 5000-character limit is exceeded, or should the tool automatically rewrite/compress/truncate?
  4. Should memory tools ignore extra tool-call arguments, especially for episodic_memory_merge?
  5. Are there recommended backend/router settings for reproducing the official LoCoMo score with gpt-4.1-mini and text-embedding-3-small?

Expected behavior

Following the README evaluation path should either:

  • reproduce scores close to the published LoCoMo result, or
  • document the exact required backend, DB, model, embedding, and config assumptions needed for reproduction.

At minimum, SQLite incompatibility and memory-tool failure modes should be documented or handled gracefully.
