For a controlled test, I first ran only global_idx=1.
The run completes when using PostgreSQL, but the score is much lower than the official score range for the same questions. I also observed several memory-tool failures during ingestion.
Environment
- Branch: evals
- Evaluation directory: public_evaluations
- Entry point: public_evaluations/main.py
- Dataset: LOCOMO
- Test index: global_idx=1
- Answer model: gpt-4.1-mini
- Embedding model: text-embedding-3-small
- Backend: OpenAI-compatible endpoint
- DB:
  - SQLite: fails
  - PostgreSQL + pgvector: run completes
Reproduction
1. SQLite path fails
Running the README-style command with an isolated HOME still uses SQLite by default:
cd public_evaluations
python main.py \
--agent_name mirix \
--dataset LOCOMO \
--global_idx 1 \
--num_exp -1 \
--sub_datasets 'longmemeval_s*' \
--config_path ../mirix/configs/mirix_openai_compat_41mini.yaml
It fails with:
sqlite3.OperationalError: near ">>": syntax error
ORDER BY CAST(knowledge_vault.last_modify ->> 'timestamp' AS DATETIME) DESC
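This appears to be caused by the local SQLite build not supporting the PostgreSQL-style ->> JSON operator. SQLite added -> and ->> in version 3.38.0, so older builds reject the query with exactly this kind of syntax error.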
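A quick local check can be sketched as follows. It assumes Python's bundled sqlite3 module with the JSON functions compiled in. The table is a throwaway stand-in for knowledge_vault, not the project's real schema, and supports_json_arrow is a hypothetical helper. Since the ->> operator only exists from SQLite 3.38.0 onward, the sketch orders by json_extract(), which also works on older builds.

```python
import sqlite3


def supports_json_arrow(conn: sqlite3.Connection) -> bool:
    """Return True if this SQLite build parses the ->> JSON operator."""
    try:
        conn.execute("""SELECT '{"a": 1}' ->> '$.a'""").fetchone()
        return True
    except sqlite3.OperationalError:
        # Older builds fail with: near ">>": syntax error
        return False


conn = sqlite3.connect(":memory:")
# Throwaway stand-in table; the real knowledge_vault schema is not shown here.
conn.execute("CREATE TABLE knowledge_vault (id INTEGER PRIMARY KEY, last_modify TEXT)")
conn.executemany(
    "INSERT INTO knowledge_vault (id, last_modify) VALUES (?, ?)",
    [
        (1, '{"timestamp": "2024-01-01T00:00:00"}'),
        (2, '{"timestamp": "2024-03-01T00:00:00"}'),
        (3, '{"timestamp": "2024-02-01T00:00:00"}'),
    ],
)

print("SQLite library version:", sqlite3.sqlite_version)
print("->> operator supported:", supports_json_arrow(conn))

# json_extract() predates ->>; ISO-8601 timestamps sort correctly as text.
ordered_ids = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM knowledge_vault "
        "ORDER BY json_extract(last_modify, '$.timestamp') DESC"
    )
]
print(ordered_ids)
```

If the second line prints False, the SQLite library linked into the Python interpreter is older than 3.38.0, which would be consistent with the error above.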
2. PostgreSQL path completes
Using PostgreSQL via:
export MIRIX_PG_URI="postgresql+pg8000://mirix:mirix@127.0.0.1:5432/<db_name>"
export mirix_pg_uri="$MIRIX_PG_URI"
the official main.py -> run_instance.py path completes successfully for global_idx=1.
Generated result:
results/mirix_LOCOMO-modelgpt-4.1-mini/1_subsetNone_cksizeNone/results.json
Run status:
rows: 81
ERROR_RESPONSE_FAILED: 0
content_filter: 0
Retry limit reached: 0
model_not_found: 0
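For reference, counts like these can be produced with a small scan along the following lines. This is a sketch only: it assumes the results are a list of JSON-serializable rows, and the sample rows and the summarize_rows helper are hypothetical, not part of the evaluation code.

```python
import json

# Failure markers counted in the run status above.
MARKERS = [
    "ERROR_RESPONSE_FAILED",
    "content_filter",
    "Retry limit reached",
    "model_not_found",
]


def summarize_rows(rows):
    """Count rows, plus how many rows mention each failure marker anywhere."""
    summary = {"rows": len(rows)}
    for marker in MARKERS:
        summary[marker] = sum(marker in json.dumps(row) for row in rows)
    return summary


# Hypothetical sample; the real results.json layout may differ.
sample_rows = [
    {"question": "q1", "response": "an answer"},
    {"question": "q2", "response": "ERROR_RESPONSE_FAILED: Retry limit reached"},
]
print(summarize_rows(sample_rows))
```

For the real file, the rows would come from json.load() on results.json, assuming its top level is a list.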
Observed score
Using the official evaluator:
python evals.py \
--input_file results/mirix_LOCOMO-modelgpt-4.1-mini/1_subsetNone_cksizeNone/results.json \
--output_file results/mirix_LOCOMO-modelgpt-4.1-mini/readme_pg_idx1_evaluation_metrics.json \
--max_workers 3
I get:
Total Questions: 81
Overall Average BLEU Score: 0.2391
Overall Average F1 Score: 0.3282
Overall Average LLM Score: 0.7654
For the same 81 questions, the provided official evaluation metrics appear to be around:
0.8889 - 0.9259 LLM score
so the gap is large: roughly 0.12 to 0.16 in LLM score.
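To put the gap in terms of questions, the averages can be converted back to counts. The sketch below assumes the LLM score is a binary correct/incorrect judgment per question, so that the average times 81 gives the number of correct answers. That is my assumption about the evaluator, not something I verified in evals.py.

```python
TOTAL_QUESTIONS = 81


def implied_correct(avg_score: float, total: int = TOTAL_QUESTIONS) -> int:
    """Number of correct answers implied by an average binary score."""
    return round(avg_score * total)


observed = implied_correct(0.7654)       # my run
official_low = implied_correct(0.8889)   # lower end of the official range
official_high = implied_correct(0.9259)  # upper end of the official range

print(observed, official_low, official_high)
print(official_low - observed, official_high - observed)
```

Under that assumption, my run answers 62 of 81 questions correctly versus 72 to 75 in the official metrics, a difference of 10 to 13 questions.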
Memory tool failures observed
During memory ingestion, I see tool failures such as:
core_memory_append: ValidationError
Value error, Edit failed: Exceeds 5000 character limit
Example:
Error executing function core_memory_append:
ValidationError: 1 validation error for Human
Value error, Edit failed: Exceeds 5000 character limit (requested 5551)
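One possible way to handle this gracefully is sketched below. It is not the project's implementation: append_with_limit is a hypothetical helper, and keeping only the most recent characters is just one policy (rewriting or compressing the block would be alternatives). The 5000-character limit is taken from the error message, and I assume "requested" refers to the length of the block after the append.

```python
CORE_MEMORY_LIMIT = 5000  # limit quoted in the error message


def append_with_limit(current: str, addition: str, limit: int = CORE_MEMORY_LIMIT):
    """Append `addition` to `current` without exceeding `limit` characters.

    If the combined text is too long, keep only the most recent `limit`
    characters. Returns (new_value, truncated).
    """
    requested = current + addition
    if len(requested) <= limit:
        return requested, False
    # Drop the oldest characters so the newest content survives.
    return requested[-limit:], True


# Reproduce the size from the error: 4000 existing + 1551 new = 5551 requested.
value, truncated = append_with_limit("a" * 4000, "b" * 1551)
print(len(value), truncated)
```

A guard like this would turn the ValidationError into a lossy but successful write. Whether silent truncation is acceptable for core memory is a design question for the maintainers.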
In earlier runs, I also observed repeated tool-call argument mismatches such as:
episodic_memory_merge() got an unexpected keyword argument 'actor'
episodic_memory_merge() got an unexpected keyword argument 'tree_path'
This seems to happen when the model provides full episodic-memory fields to episodic_memory_merge, although the function only accepts:
event_id
combined_summary
combined_details
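If extra arguments should be tolerated, a dispatcher could drop keyword arguments the target function does not declare. The sketch below uses inspect.signature from the standard library. The episodic_memory_merge defined here is only a stand-in with the three parameters listed above, and call_dropping_unknown_kwargs is a hypothetical helper, not an existing MIRIX function.

```python
import inspect


def episodic_memory_merge(event_id, combined_summary, combined_details):
    """Stand-in with the three accepted parameters; not the real tool."""
    return {
        "event_id": event_id,
        "summary": combined_summary,
        "details": combined_details,
    }


def call_dropping_unknown_kwargs(func, **kwargs):
    """Call `func` with only the keyword arguments it declares.

    Returns (result, dropped), where `dropped` lists the ignored names.
    """
    params = inspect.signature(func).parameters
    # A function that declares **kwargs accepts everything as-is.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return func(**kwargs), []
    accepted = {name: value for name, value in kwargs.items() if name in params}
    dropped = sorted(set(kwargs) - set(accepted))
    return func(**accepted), dropped


result, dropped = call_dropping_unknown_kwargs(
    episodic_memory_merge,
    event_id="evt-1",
    combined_summary="summary",
    combined_details="details",
    actor="user",         # extra field sent by the model
    tree_path=["a", "b"], # extra field sent by the model
)
print(result, dropped)
```

Logging the dropped names instead of discarding them silently would keep the schema mismatch visible.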
Error pattern in wrong answers
The low score does not seem to be caused only by the core_memory_append 5000-character failures. The bigger pattern is that answers often retrieve or use adjacent/generic memories instead of the exact evidence.
Example 1
Question:
What offer does Gina make to Jon regarding social media?
Gold:
Helping with making content and managing his social media accounts.
Observed response:
Jon plans to expand his dance studio's social media presence, offer workshops and classes to local schools and centers, and host dance competitions...
Example 2
Question:
What plans does Jon have after receiving advice at the networking event?
Gold:
Sprucing up his business plan, tweaking his pitch to investors, and working on an online platform.
Observed response:
Jon plans to expand his dance studio's social media presence, offer workshops and classes...
Example 3
Question:
When did Gina launch an ad campaign for her store?
Gold:
Observed response:
So the main issue seems to be loss of precise details / retrieval of nearby memories rather than total API failure.
Questions
- Is the README evaluation expected to work with SQLite on Linux, or should PostgreSQL be documented as required for the current evals branch?
- Were the official LoCoMo numbers generated from this exact evals branch code and prompt/tool schema?
- Is core_memory_append expected to fail when the 5000-character limit is exceeded, or should the tool automatically rewrite/compress/truncate?
- Should memory tools ignore extra tool-call arguments, especially for episodic_memory_merge?
- Are there recommended backend/router settings for reproducing the official LoCoMo score with gpt-4.1-mini and text-embedding-3-small?
Expected behavior
Following the README evaluation path should either:
- reproduce scores close to the published LoCoMo result, or
- document the exact required backend, DB, model, embedding, and config assumptions needed for reproduction.
At minimum, SQLite incompatibility and memory-tool failure modes should be documented or handled gracefully.