
Fix HELM EEE instance metric rows #132

Open

Erotemic wants to merge 8 commits into evaleval:main from Erotemic:dev/helm_instance_fixes

Conversation

Collaborator

@Erotemic Erotemic commented May 8, 2026

A current problem is that the HELM converter collapses multi-metric samples into a single instance-level row, losing metric-specific results and sometimes marking non-correctness metrics as is_correct.

This PR refactors the converter to emit one InstanceLevelEvaluationLog row per (sample, metric) and only sets is_correct for metrics that are true binary correctness signals.

This preserves information from multi-metric HELM runs and prevents graded or bookkeeping metrics from being counted as binary correctness.

Changes:

  • Emit one row per non-empty core HELM metric in inst_stats.stats, excluding bookkeeping/diagnostic stats from metric rows.
  • Set evaluation_result_id = metric_name.
  • Add _score_from_stat() to extract scalar scores from Stat.mean or sum/count.
  • Preserve the legacy exact-match fallback when inst_stats is missing.
  • Add _is_correct_for_metric() with an allowlist for correctness metrics such as exact_match*, ifeval_strict_accuracy, chain_of_thought_correctness, and math_equiv*.

Tests:

  • Updated legacy per-sample assertions for per-metric rows.
  • Added coverage for per-(sample, metric) emission, is_correct behavior, helper logic, and reasoning_traces=None.
  • Updated adapter row-count expectations from total_rows == N to total_rows >= N.

@Erotemic Erotemic force-pushed the dev/helm_instance_fixes branch from 953c2eb to 4353b80 Compare May 13, 2026 19:34
@Erotemic Erotemic marked this pull request as draft May 13, 2026 19:58
@Erotemic
Collaborator Author

Working on a few more tests before this is ready for merge.

@Erotemic
Collaborator Author

I've updated this PR to narrow the scope to core HELM evaluation metrics only.

Originally this PR converted all HELM metrics into EEE metrics; on review, I don't think that was right. I restricted it to the "core" HELM metrics that measure benchmark or instance quality.

The existing token/performance bookkeeping fields remain where they already belong, but bookkeeping stats like num_prompt_tokens, inference_runtime, logprob, batch_size, and num_bytes are not promoted into evaluation_results or per-metric sample rows in this PR.
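For concreteness, that scope decision amounts to a filter along these lines. This is a hypothetical sketch: the set below just lists the stats named above, and the converter's real exclusion logic may be structured differently.

```python
# Bookkeeping/diagnostic stats named in the PR discussion; these stay in
# their existing token/performance fields and are NOT promoted into
# evaluation_results or per-metric sample rows.
BOOKKEEPING_STATS = {
    "num_prompt_tokens",
    "inference_runtime",
    "logprob",
    "batch_size",
    "num_bytes",
}


def core_metric_names(stats: dict[str, float]) -> list[str]:
    """Return only the metric names eligible for per-metric rows."""
    return [name for name in stats if name not in BOOKKEEPING_STATS]
```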

The new behavior does add stable evaluation_result_id values to both aggregate and instance-level rows so detail rows can join back to aggregate results. Details on this:

Details

Before this PR, the aggregate JSON and the sample-level JSONL did not have a stable shared key for "which metric is this row about?"

For example, on main, if the aggregate file had rows like:

aggregate evaluation result:
  evaluation_name: "exact_match test on mmlu"
  evaluation_result_id: null
  score: 0.0

and the instance-level file had a sample row like:

sample row:
  sample_id: "id147"
  evaluation_result_id: null
  score: 0.0
  is_correct: false

That sample row was implicitly the exact_match:test result, but nothing in the row said that. A downstream consumer could not reliably join it to the aggregate exact_match test on mmlu result. This becomes especially problematic once a sample contributes to multiple metrics. A score of 0.0 could mean exact_match, exact_match@5, quasi_exact_match, prefix_exact_match, or another metric.

After this PR, both sides use the same deterministic key:

aggregate evaluation result:
  evaluation_name: "exact_match test on mmlu"
  evaluation_result_id: "exact_match:test"
  score: 0.0

and:

sample row:
  sample_id: "id147"
  evaluation_result_id: "exact_match:test"
  score: 0.0
  is_correct: false

So downstream code can now do:

sample_row.evaluation_result_id == aggregate_result.evaluation_result_id

and know exactly which aggregate metric the sample row belongs to.
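A minimal illustration of that join, using simplified dicts rather than the actual EEE schema classes (the row contents mirror the example above; field access details are assumptions):

```python
# Aggregate and sample rows after this PR, sharing a deterministic key.
aggregates = [
    {"evaluation_name": "exact_match test on mmlu",
     "evaluation_result_id": "exact_match:test", "score": 0.0},
]
samples = [
    {"sample_id": "id147", "evaluation_result_id": "exact_match:test",
     "score": 0.0, "is_correct": False},
]

# Index aggregates by the shared key, then resolve each sample row.
by_id = {agg["evaluation_result_id"]: agg for agg in aggregates}
for row in samples:
    agg = by_id[row["evaluation_result_id"]]
    # row now unambiguously belongs to this aggregate metric
    print(row["sample_id"], "->", agg["evaluation_name"])
```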

A high-level report showing what new JSONL rows are emitted and what new items are added in the aggregate JSON metric rows:

@Erotemic Erotemic marked this pull request as ready for review May 13, 2026 23:41
@Erotemic
Collaborator Author

I also cleaned up the typing in the files I touched to reduce the ty check errors.

@Erotemic Erotemic requested review from j-chim and nelaturuharsha May 13, 2026 23:59