LoCoMo public evaluation on `evals` branch is hard to reproduce: SQLite failure, memory tool-call errors, and lower `global_idx=1` score

For a controlled test, I first ran only `global_idx=1`.

The run completes with PostgreSQL, but the LLM score is much lower than the official score range for the same questions. I also observed several memory-tool failures during ingestion.

## Environment

- Branch: `evals`
- Evaluation directory: `public_evaluations`
- Entry point: `public_evaluations/main.py`
- Dataset: `LOCOMO`
- Test index: `global_idx=1`
- Answer model: `gpt-4.1-mini`
- Embedding model: `text-embedding-3-small`
- Backend: OpenAI-compatible endpoint
- DB:
  - SQLite: fails
  - PostgreSQL + pgvector: run completes

## Reproduction

### 1. SQLite path fails

Running the README-style command with an isolated `HOME` uses SQLite by default:

```bash
cd public_evaluations

python main.py \
  --agent_name mirix \
  --dataset LOCOMO \
  --global_idx 1 \
  --num_exp -1 \
  --sub_datasets 'longmemeval_s*' \
  --config_path ../mirix/configs/mirix_openai_compat_41mini.yaml
```

It fails with:

```text
sqlite3.OperationalError: near ">>": syntax error

ORDER BY CAST(knowledge_vault.last_modify ->> 'timestamp' AS DATETIME) DESC
```

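This appears to be caused by the SQLite library in this environment rejecting the PostgreSQL-style `->>` JSON operator. SQLite only added `->` and `->>` in version 3.38.0, and older versions fail on them with this kind of syntax error, so the default SQLite path likely depends on which SQLite version Python is linked against.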
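
A quick way to check a given machine is to probe the operator directly. The following is a minimal sketch, assuming the evaluation goes through Python's built-in `sqlite3` module (as the `sqlite3.OperationalError` in the traceback suggests). The helper name `sqlite_supports_json_arrow` is made up for illustration and is not part of the repository.

```python
import sqlite3


def sqlite_supports_json_arrow() -> bool:
    """Return True if the linked SQLite library parses the ->> JSON operator."""
    conn = sqlite3.connect(":memory:")
    try:
        # Same operator shape as the failing ORDER BY clause, applied to a literal.
        row = conn.execute(
            """SELECT '{"timestamp": "2023-01-29"}' ->> 'timestamp'"""
        ).fetchone()
        return row[0] == "2023-01-29"
    except sqlite3.OperationalError:
        # Older libraries fail here with: near ">>": syntax error
        return False
    finally:
        conn.close()


if __name__ == "__main__":
    print("SQLite library version:", sqlite3.sqlite_version)
    print("supports ->> :", sqlite_supports_json_arrow())
```

If this reports no support, the SQLite failure is reproducible on that machine regardless of the MIRIX configuration.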

### 2. PostgreSQL path completes

Using PostgreSQL via:

```bash
export MIRIX_PG_URI="postgresql+pg8000://mirix:mirix@127.0.0.1:5432/<db_name>"
export mirix_pg_uri="$MIRIX_PG_URI"
```

the official `main.py -> run_instance.py` path completes successfully for `global_idx=1`.

Generated result:

```text
results/mirix_LOCOMO-modelgpt-4.1-mini/1_subsetNone_cksizeNone/results.json
```

Run status:

```text
rows: 81
ERROR_RESPONSE_FAILED: 0
content_filter: 0
Retry limit reached: 0
model_not_found: 0
```

## Observed score

Using the official evaluator:

```bash
python evals.py \
  --input_file results/mirix_LOCOMO-modelgpt-4.1-mini/1_subsetNone_cksizeNone/results.json \
  --output_file results/mirix_LOCOMO-modelgpt-4.1-mini/readme_pg_idx1_evaluation_metrics.json \
  --max_workers 3
```

I get:

```text
Total Questions: 81
Overall Average BLEU Score: 0.2391
Overall Average F1 Score: 0.3282
Overall Average LLM Score: 0.7654
```

For the same 81 questions, the provided official evaluation metrics appear to be around:

```text
0.8889 - 0.9259 LLM score
```

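so the gap is roughly 0.12 to 0.16 in LLM score. If the LLM score is a per-question 0/1 average, that corresponds to 62 of 81 questions judged correct in my run versus 72 to 75 of 81 in the official metrics.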

## Memory tool failures observed

During memory ingestion, I see tool failures such as:

```text
core_memory_append: ValidationError
Value error, Edit failed: Exceeds 5000 character limit
```

Example:

```text
Error executing function core_memory_append:
ValidationError: 1 validation error for Human
Value error, Edit failed: Exceeds 5000 character limit (requested 5551)
```

In earlier runs, I also observed repeated tool-call argument mismatches such as:

```text
episodic_memory_merge() got an unexpected keyword argument 'actor'
episodic_memory_merge() got an unexpected keyword argument 'tree_path'
```

This seems to happen when the model provides full episodic-memory fields to `episodic_memory_merge`, although the function only accepts:

```python
event_id
combined_summary
combined_details
```
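
If ignoring extra arguments is acceptable, the dispatcher could filter the model's arguments against the tool's signature before calling it. The sketch below uses only the standard library. `filter_tool_args` is a hypothetical helper, and the `episodic_memory_merge` defined here is a stand-in with the three parameters listed above, not the real implementation.

```python
import inspect


def filter_tool_args(func, args: dict):
    """Split model-supplied arguments into accepted and dropped names."""
    params = inspect.signature(func).parameters
    # A function that takes **kwargs accepts everything, so drop nothing.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(args), []
    accepted = {k: v for k, v in args.items() if k in params}
    dropped = sorted(k for k in args if k not in params)
    return accepted, dropped


def episodic_memory_merge(event_id, combined_summary, combined_details):
    # Stand-in for illustration only; the real tool lives in the repository.
    return {"event_id": event_id, "summary": combined_summary,
            "details": combined_details}


if __name__ == "__main__":
    call = {"event_id": "e1", "combined_summary": "s", "combined_details": "d",
            "actor": "user", "tree_path": ["work"]}
    accepted, dropped = filter_tool_args(episodic_memory_merge, call)
    print("dropped:", dropped)
    print(episodic_memory_merge(**accepted))
```

Logging the dropped names, rather than discarding them silently, would keep schema mismatches visible during evaluation runs.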

## Error pattern in wrong answers

The low score does not seem to be caused only by the `core_memory_append` 5000-character failures. The bigger pattern is that answers often retrieve or use adjacent/generic memories instead of the exact evidence.

### Example 1

Question:

```text
What offer does Gina make to Jon regarding social media?
```

Gold:

```text
Helping with making content and managing his social media accounts.
```

Observed response:

```text
Jon plans to expand his dance studio's social media presence, offer workshops and classes to local schools and centers, and host dance competitions...
```

### Example 2

Question:

```text
What plans does Jon have after receiving advice at the networking event?
```

Gold:

```text
Sprucing up his business plan, tweaking his pitch to investors, and working on an online platform.
```

Observed response:

```text
Jon plans to expand his dance studio's social media presence, offer workshops and classes...
```

### Example 3

Question:

```text
When did Gina launch an ad campaign for her store?
```

Gold:

```text
29 January, 2023
```

Observed response:

```text
January 2023
```

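So the main issue seems to be loss of precise details and retrieval of nearby memories rather than outright API failure. Examples 1 and 2 receive nearly the same generic answer for two different questions, and Example 3 keeps the month but loses the day.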

## Questions

1. Is the README evaluation expected to work with SQLite on Linux, or should PostgreSQL be documented as required for the current `evals` branch?
2. Were the official LoCoMo numbers generated from this exact `evals` branch code and prompt/tool schema?
3. Is `core_memory_append` expected to fail when the 5000-character limit is exceeded, or should the tool automatically rewrite/compress/truncate?
4. Should memory tools ignore extra tool-call arguments, especially for `episodic_memory_merge`?
5. Are there recommended backend/router settings for reproducing the official LoCoMo score with `gpt-4.1-mini` and `text-embedding-3-small`?

## Expected behavior

Following the README evaluation path should either:

- reproduce scores close to the published LoCoMo result, or
- document the exact required backend, DB, model, embedding, and config assumptions needed for reproduction.

At minimum, SQLite incompatibility and memory-tool failure modes should be documented or handled gracefully.

