Hi, thanks for open-sourcing the benchmark implementation.
I had a question about the LoCoMo judge prompt in benchmarks/locomo/prompts.py, especially _JUDGE_TEMPLATE.
Some of the current grading rules seem intentionally permissive, for example:
- Partial credit marks an answer as CORRECT if it includes at least one correct item from the gold answer list.
- Evidence is used only to accept answers, not to reject them more strictly.
- Same named entity / same referent can be marked CORRECT even if the generated answer provides different details.
- The prompt says to mark WRONG only when the generated answer contains zero correct items from the gold answer or addresses a completely different topic.
- Dates within 14 days and durations within 50% are accepted.
In our local runs, these rules appear to make the judge accept some answers that are only partially correct or that miss key details. Such false positives are counted as CORRECT, which may inflate the reported accuracy.
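To make the tolerance rule concrete, here is a minimal sketch of how we read the "dates within 14 days" and "durations within 50%" thresholds. The helper names, the inclusive comparison, and the use of relative error against the gold value are our assumptions for illustration; they are not taken from the repository, where the judge is an LLM prompt rather than code.

```python
from datetime import date

# Hypothetical helpers: they only encode the thresholds as we read them
# in the prompt text, not anything implemented in the benchmark itself.

def date_within_tolerance(gold: date, predicted: date, max_days: int = 14) -> bool:
    """True if the predicted date is at most max_days away from the gold date."""
    return abs((predicted - gold).days) <= max_days


def duration_within_tolerance(gold: float, predicted: float,
                              max_rel_error: float = 0.5) -> bool:
    """True if the predicted duration deviates from gold by at most max_rel_error."""
    if gold == 0:
        return predicted == 0
    return abs(predicted - gold) / abs(gold) <= max_rel_error


# An answer almost two weeks off the gold date still passes.
print(date_within_tolerance(date(2023, 5, 1), date(2023, 5, 14)))
# "6 weeks" is accepted when the gold answer is "4 weeks".
print(duration_within_tolerance(4.0, 6.0))
```

Under this reading, an answer that is 13 days off, or half again as long as the gold duration, is still graded CORRECT.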
Could you clarify whether this level of leniency is intended for the reported LoCoMo scores? Or would you consider adding a stricter judge mode / prompt variant for cases where the generated answer should match all required answer items and avoid conflicting extra details?
A stricter alternative might require:
- list/count questions to include all required items, not just one;
- extra details to be accepted only if they do not conflict with the gold answer or evidence;
- evidence to be usable for both accepting and rejecting answers;
- same referent to be insufficient when the question asks for a specific attribute, date, action, or reason.
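As a rough illustration of the first point, the difference between the current "at least one item" rule and an "all items" rule for list questions could be sketched as below. The function `grade_list_answer` and its `strict` flag are hypothetical, and the case-insensitive substring matching is only a stand-in for the LLM judge's semantic matching.

```python
def grade_list_answer(gold_items: list[str], answer: str, strict: bool = False) -> str:
    """Grade a list-style answer against the gold items.

    Lenient mode: CORRECT if at least one gold item appears in the answer.
    Strict mode:  CORRECT only if every gold item appears in the answer.
    """
    text = answer.casefold()
    hits = [item for item in gold_items if item.casefold() in text]
    if strict:
        is_correct = len(gold_items) > 0 and len(hits) == len(gold_items)
    else:
        is_correct = len(hits) >= 1
    return "CORRECT" if is_correct else "WRONG"


# Made-up example data, not drawn from LoCoMo.
gold = ["hiking", "pottery", "chess"]
partial = "She enjoys hiking."

print(grade_list_answer(gold, partial))               # lenient mode
print(grade_list_answer(gold, partial, strict=True))  # strict mode
```

With the made-up data above, the partial answer passes in lenient mode and fails in strict mode, which is the kind of gap we see in our runs.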
Happy to provide concrete false-positive examples if useful.