Question: Is the LoCoMo _JUDGE_TEMPLATE too lenient for answer grading? #10

@heaoxiang-ai

Hi, thanks for open-sourcing the benchmark implementation.

I had a question about the LoCoMo judge prompt in benchmarks/locomo/prompts.py, especially _JUDGE_TEMPLATE.

Some of the current grading rules seem intentionally permissive, for example:

  • Partial credit marks an answer as CORRECT if it includes at least one correct item from the gold answer list.
  • Evidence is used only to accept answers, never to reject them.
  • Same named entity / same referent can be marked CORRECT even if the generated answer provides different details.
  • The prompt says to mark WRONG only when the generated answer contains zero correct items from the gold answer or addresses a completely different topic.
  • Dates within 14 days and durations within 50% are accepted.

In our local runs, these rules appear to make the judge accept answers that are only partially correct or that miss key details. Each such false positive is counted as CORRECT, which may inflate the reported accuracy.
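To make the date and duration tolerances concrete, here is a minimal sketch of what those two rules would accept if applied mechanically. It is not code from the repository: the judge is an LLM prompt, the function names are hypothetical, and I am assuming that "within 50%" means relative to the gold duration.

```python
from datetime import date


def date_accepted(gold: date, predicted: date, max_days: int = 14) -> bool:
    """Hypothetical helper: accept a date at most `max_days` away from gold."""
    return abs((predicted - gold).days) <= max_days


def duration_accepted(gold: float, predicted: float, max_rel_error: float = 0.5) -> bool:
    """Hypothetical helper: accept a duration within `max_rel_error` of gold.

    Assumption: the tolerance is measured relative to the gold value.
    """
    return abs(predicted - gold) <= max_rel_error * abs(gold)


# Made-up values, not taken from LoCoMo.
print(date_accepted(date(2023, 5, 1), date(2023, 5, 14)))  # True: 13 days off
print(duration_accepted(6.0, 9.0))  # True: 50% longer than gold
```

Under this reading, a gold answer of "6 months" would accept anything from 3 to 9 months, which is the kind of gap I am asking about.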

Could you clarify whether this level of leniency is intended for the reported LoCoMo scores? Or would you consider adding a stricter judge mode / prompt variant for cases where the generated answer should match all required answer items and avoid conflicting extra details?

A stricter alternative might require:

  • list/count questions to include all required items, not just one;
  • extra details to be accepted only if they do not conflict with the gold answer or evidence;
  • evidence to be usable for both accepting and rejecting answers;
  • same referent to be insufficient when the question asks for a specific attribute, date, action, or reason.
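As a rough illustration of the first two points, the sketch below contrasts the "at least one correct item" rule with an "all required items, no conflicting extras" rule. Everything in it is an assumption for discussion: the function names, the exact-string matching after normalization, and the `conflicting` argument (a caller-supplied set of details known to contradict the gold answer or evidence) are hypothetical, and a real judge would still rely on the LLM to match paraphrases.

```python
def _norm(items):
    """Normalize items for exact matching (a simplification of LLM judging)."""
    return {item.strip().lower() for item in items}


def lenient_label(gold, predicted):
    """CORRECT if at least one gold item appears in the prediction."""
    return "CORRECT" if _norm(gold) & _norm(predicted) else "WRONG"


def strict_label(gold, predicted, conflicting=()):
    """CORRECT only if every gold item appears and no conflicting detail does."""
    pred = _norm(predicted)
    covers_all = _norm(gold) <= pred
    has_conflict = bool(_norm(conflicting) & pred)
    return "CORRECT" if covers_all and not has_conflict else "WRONG"


# Made-up example, not taken from LoCoMo.
gold = ["hiking", "painting", "yoga"]
predicted = ["Hiking", "chess"]

print(lenient_label(gold, predicted))  # CORRECT: one of three items matched
print(strict_label(gold, predicted))   # WRONG: two required items are missing
```

A strict mode along these lines could sit next to the current prompt as an opt-in variant, so the reported scores stay reproducible while stricter numbers are available for comparison.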

Happy to provide concrete false-positive examples if useful.
