Question: Is the LoCoMo `_JUDGE_TEMPLATE` too lenient for answer grading?

Hi, thanks for open-sourcing the benchmark implementation.

I have a question about the LoCoMo judge prompt in [`benchmarks/locomo/prompts.py`](https://github.com/mem0ai/memory-benchmarks/blob/main/benchmarks/locomo/prompts.py), specifically `_JUDGE_TEMPLATE`.

Some of the current grading rules seem intentionally permissive, for example:

- Partial credit marks an answer as `CORRECT` if it includes at least one correct item from the gold answer list.
- Evidence is used only to accept answers, never to reject them.
- Same named entity / same referent can be marked `CORRECT` even if the generated answer provides different details.
- The prompt says to mark `WRONG` only when the generated answer contains zero correct items from the gold answer or addresses a completely different topic.
- Dates within 14 days and durations within 50% are accepted.

In our local runs, these rules appear to make the judge accept some answers that are only partially correct or that miss key details, which may inflate accuracy through false-positive `CORRECT` labels.
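
To make the width of the date and duration tolerances concrete, here is a minimal sketch. The helper names are hypothetical and not part of the benchmark code. It assumes the bounds are inclusive and that "within 50%" means relative to the gold value; the prompt may intend something slightly different.

```python
from datetime import date

# Hypothetical helpers, not taken from the benchmark. They only mirror my
# reading of the tolerance rules (inclusive bounds, error relative to gold).

def date_within_tolerance(gold: date, predicted: date, max_days: int = 14) -> bool:
    """Return True if predicted is at most max_days away from gold."""
    return abs((predicted - gold).days) <= max_days


def duration_within_tolerance(gold: float, predicted: float,
                              max_rel_error: float = 0.5) -> bool:
    """Return True if predicted deviates from gold by at most max_rel_error * gold."""
    return abs(predicted - gold) <= max_rel_error * gold


# A gold date of 1 May 2023 accepts anything from 17 April to 15 May.
print(date_within_tolerance(date(2023, 5, 1), date(2023, 5, 15)))  # True
# A gold duration of 4 weeks accepts anything from 2 to 6 weeks.
print(duration_within_tolerance(4, 6))  # True
```

Under this reading, a four-week window around every gold date counts as correct, which is wide for questions that ask when something happened.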

Could you clarify whether this level of leniency is intended for the reported LoCoMo scores? Or would you consider adding a stricter judge mode / prompt variant for cases where the generated answer should match all required answer items and avoid conflicting extra details?

A stricter alternative might require:

- list/count questions to include all required items, not just one;
- extra details to be accepted only if they do not conflict with the gold answer or evidence;
- evidence to be usable for both accepting and rejecting answers;
- same referent to be insufficient when the question asks for a specific attribute, date, action, or reason.
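
As a rough illustration of the first point, the difference between the current rule and a stricter one for list questions comes down to `any` versus `all`. The sketch below uses plain substring matching as a stand-in for the LLM judge. The function names and the example data are made up for illustration and do not come from the benchmark or the dataset.

```python
from typing import Iterable

# Hypothetical stand-in for the judge: substring matching on normalized text.
# The real judge is an LLM prompt; this only contrasts the two decision rules.

def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so matching is less brittle."""
    return " ".join(text.lower().split())


def lenient_list_match(gold_items: Iterable[str], answer: str) -> bool:
    """Lenient rule: one matching gold item is enough."""
    normalized = _normalize(answer)
    return any(_normalize(item) in normalized for item in gold_items)


def strict_list_match(gold_items: Iterable[str], answer: str) -> bool:
    """Stricter rule: every gold item must appear in the answer."""
    normalized = _normalize(answer)
    return all(_normalize(item) in normalized for item in gold_items)


# Made-up example: the answer names one of three gold items plus a wrong one.
gold = ["hiking", "pottery", "painting"]
answer = "She enjoys hiking and cooking."

print(lenient_list_match(gold, answer))  # True
print(strict_list_match(gold, answer))   # False
```

A strict judge mode could apply the `all` rule only to list and count questions and leave the existing rule in place elsewhere.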

Happy to provide concrete false-positive examples if useful.

