
Fix HELM EEE instance metric rows #132

Open

Erotemic wants to merge 8 commits into evaleval:main from Erotemic:dev/helm_instance_fixes

Conversation

Collaborator

@Erotemic Erotemic commented May 8, 2026

A current problem is that the HELM converter collapses multi-metric samples into a single instance-level row, losing metric-specific results and sometimes marking non-correctness metrics as is_correct.

This PR refactors the converter to emit one InstanceLevelEvaluationLog row per (sample, metric) and only sets is_correct for metrics that are true binary correctness signals.

This preserves information from multi-metric HELM runs and prevents graded or bookkeeping metrics from being counted as binary correctness.

Changes:

  • Emit one row per non-empty core HELM metric in inst_stats.stats, excluding bookkeeping/diagnostic stats from metric rows.
  • Set evaluation_result_id = metric_name.
  • Add _score_from_stat() to extract scalar scores from Stat.mean or sum/count.
  • Preserve the legacy exact-match fallback when inst_stats is missing.
  • Add _is_correct_for_metric() with an allowlist for correctness metrics such as exact_match*, ifeval_strict_accuracy, chain_of_thought_correctness, and math_equiv*.

Tests:

  • Updated legacy per-sample assertions for per-metric rows.
  • Added coverage for per-(sample, metric) emission, is_correct behavior, helper logic, and reasoning_traces=None.
  • Updated adapter row-count expectations from total_rows == N to total_rows >= N.

@Erotemic Erotemic force-pushed the dev/helm_instance_fixes branch from 953c2eb to 4353b80 Compare May 13, 2026 19:34
@Erotemic Erotemic marked this pull request as draft May 13, 2026 19:58
@Erotemic
Collaborator Author

Working on a few more tests before this is ready for merge.

@Erotemic
Collaborator Author

I've updated this PR to narrow the scope to core HELM evaluation metrics only.

Originally this PR converted all HELM metrics into EEE metrics; on review, I don't think that was right. I restricted it to the "core" HELM metrics that measure benchmark or instance quality.

The existing token/performance bookkeeping fields remain where they already belong, but bookkeeping stats like num_prompt_tokens, inference_runtime, logprob, batch_size, and num_bytes are not promoted into evaluation_results or per-metric sample rows in this PR.
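For concreteness, that scope decision amounts to a filter along these lines. This is a hypothetical sketch: the set below just lists the stats named above, and the converter's real exclusion logic may be structured differently.

```python
# Bookkeeping/diagnostic stats named in the PR discussion; these stay in
# their existing token/performance fields and are NOT promoted into
# evaluation_results or per-metric sample rows.
BOOKKEEPING_STATS = {
    "num_prompt_tokens",
    "inference_runtime",
    "logprob",
    "batch_size",
    "num_bytes",
}


def core_metric_names(stats: dict[str, float]) -> list[str]:
    """Return only the metric names eligible for per-metric rows."""
    return [name for name in stats if name not in BOOKKEEPING_STATS]
```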

The new behavior does add stable evaluation_result_id values to both aggregate and instance-level rows so detail rows can join back to aggregate results. Details on this:

Details

Before this PR, the aggregate JSON and the sample-level JSONL did not have a stable shared key for "which metric is this row about?"

For example, on main, if the aggregate file had rows like:

aggregate evaluation result:
  evaluation_name: "exact_match test on mmlu"
  evaluation_result_id: null
  score: 0.0

and the instance-level file had a sample row like:

sample row:
  sample_id: "id147"
  evaluation_result_id: null
  score: 0.0
  is_correct: false

That sample row was implicitly the exact_match:test result, but nothing in the row said that. A downstream consumer could not reliably join it to the aggregate exact_match test on mmlu result. This becomes especially problematic once a sample contributes to multiple metrics. A score of 0.0 could mean exact_match, exact_match@5, quasi_exact_match, prefix_exact_match, or another metric.

After this PR, both sides use the same deterministic key:

aggregate evaluation result:
  evaluation_name: "exact_match test on mmlu"
  evaluation_result_id: "exact_match:test"
  score: 0.0

and:

sample row:
  sample_id: "id147"
  evaluation_result_id: "exact_match:test"
  score: 0.0
  is_correct: false

So downstream code can now do:

sample_row.evaluation_result_id == aggregate_result.evaluation_result_id

and know exactly which aggregate metric the sample row belongs to.
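A minimal illustration of that join, using simplified dicts rather than the actual EEE schema classes (the row contents mirror the example above; field access details are assumptions):

```python
# Aggregate and sample rows after this PR, sharing a deterministic key.
aggregates = [
    {"evaluation_name": "exact_match test on mmlu",
     "evaluation_result_id": "exact_match:test", "score": 0.0},
]
samples = [
    {"sample_id": "id147", "evaluation_result_id": "exact_match:test",
     "score": 0.0, "is_correct": False},
]

# Index aggregates by the shared key, then resolve each sample row.
by_id = {agg["evaluation_result_id"]: agg for agg in aggregates}
for row in samples:
    agg = by_id[row["evaluation_result_id"]]
    # row now unambiguously belongs to this aggregate metric
    print(row["sample_id"], "->", agg["evaluation_name"])
```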

A high-level report showing what new JSONL rows are emitted and what new items are added in the aggregate JSON metric rows:

@Erotemic Erotemic marked this pull request as ready for review May 13, 2026 23:41
@Erotemic
Collaborator Author

I also cleaned up the typing in the files I touched to reduce the ty check errors.

@Erotemic Erotemic requested review from j-chim and nelaturuharsha May 13, 2026 23:59