Fix HELM EEE instance metric rows #132
Working on a few more tests before this is ready for merge.
I've updated this PR to narrow the scope to core HELM evaluation metrics only. Originally this PR converted all HELM metrics into EEE metrics, and on review, I don't think that was right. I restricted it to convert only the "core" HELM metrics having to do with benchmark or instance quality. The existing token/performance bookkeeping fields remain where they already belong, but bookkeeping stats like

The new behavior does add stable

Details

Before this PR, the aggregate JSON and the sample-level JSONL did not have a stable shared key for "which metric is this row about?" For example, on

and the instance-level file had a sample row like:

That sample row was implicitly the

After this PR, both sides use the same deterministic key:

and:

So downstream code can now do:

and know exactly which aggregate metric the sample row belongs to.

A high-level report showing what new JSONL rows are emitted and what new items are added in the aggregate JSON metric rows:
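Separately from the report, the shared-key lookup described in this comment could look roughly like the sketch below. It assumes both the aggregate metric rows and the instance-level rows are plain dicts carrying an evaluation_result_id field. The function name, the other field names, and the sample values are hypothetical and are not taken from the converter.

```python
from collections import defaultdict
from typing import Dict, List


def group_samples_by_metric(
    aggregate_rows: List[dict], instance_rows: List[dict]
) -> Dict[str, List[dict]]:
    """Group instance-level rows under the aggregate metric they belong to.

    Hypothetical helper: assumes every row has an "evaluation_result_id" key.
    """
    known_ids = {row["evaluation_result_id"] for row in aggregate_rows}
    grouped: Dict[str, List[dict]] = defaultdict(list)
    for row in instance_rows:
        key = row["evaluation_result_id"]
        if key not in known_ids:
            # A sample row with no matching aggregate metric is a data error.
            raise KeyError(f"no aggregate metric row for {key!r}")
        grouped[key].append(row)
    return dict(grouped)


# Illustrative data only; real rows carry more fields.
aggregate_rows = [
    {"evaluation_result_id": "exact_match"},
    {"evaluation_result_id": "f1_score"},
]
instance_rows = [
    {"sample_id": "s1", "evaluation_result_id": "exact_match", "score": 1.0},
    {"sample_id": "s1", "evaluation_result_id": "f1_score", "score": 0.8},
    {"sample_id": "s2", "evaluation_result_id": "exact_match", "score": 0.0},
]
grouped = group_samples_by_metric(aggregate_rows, instance_rows)
```

Raising on an unknown key is a design choice for the sketch; a consumer could equally log and skip such rows.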
I also cleaned up the typing in the files I touched to reduce the
A current problem is that the HELM converter collapses multi-metric samples into a single instance-level row, losing metric-specific results and sometimes marking non-correctness metrics as is_correct.

This PR refactors the converter to emit one InstanceLevelEvaluationLog row per (sample, metric) and only sets is_correct for metrics that are true binary correctness signals. This preserves information from multi-metric HELM runs and prevents graded or bookkeeping metrics from being counted as binary correctness.
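As a rough illustration of the "one row per (sample, metric)" idea, the sketch below expands one sample's stats into separate rows. It is not the converter's code: the function name, the dict-based row shape, the sample_id and score fields, and the example metric names are all assumptions made for illustration. Which metrics count as "core" is passed in explicitly because the actual rule is not spelled out here.

```python
from typing import Dict, List, Optional, Set


def rows_for_sample(
    sample_id: str,
    stats: Dict[str, Optional[float]],
    core_metrics: Set[str],
) -> List[dict]:
    """Emit one row per non-empty core metric for a single sample (hypothetical)."""
    rows = []
    for metric_name, score in stats.items():
        if metric_name not in core_metrics:
            continue  # bookkeeping/diagnostic stat: no metric row
        if score is None:
            continue  # empty metric: nothing to report
        rows.append(
            {
                "sample_id": sample_id,
                "evaluation_result_id": metric_name,  # shared key
                "score": score,
            }
        )
    return rows


# Illustrative input: two scored core metrics, one empty, one bookkeeping stat.
stats = {
    "exact_match": 1.0,
    "f1_score": 0.8,
    "quasi_exact_match": None,
    "num_output_tokens": 12.0,
}
rows = rows_for_sample(
    "sample-1",
    stats,
    core_metrics={"exact_match", "f1_score", "quasi_exact_match"},
)
```

With this shape, a sample scored on several metrics yields several rows instead of being collapsed into one.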
Changes:
- Emit one row per non-empty core HELM metric, excluding bookkeeping/diagnostic stats from metric rows.
- evaluation_result_id = metric_name.
- _score_from_stat() to extract scalar scores from Stat.mean or sum/count.
- inst_stats is missing.
- _is_correct_for_metric() with an allowlist for correctness metrics such as exact_match*, ifeval_strict_accuracy, chain_of_thought_correctness, and math_equiv*.

Tests:
- is_correct behavior, helper logic, and reasoning_traces=None.
- total_rows == N to total_rows >= N.
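The two helpers listed under Changes can be sketched as follows. This is a minimal approximation, not the PR's implementation: StatStandIn is a hypothetical stand-in holding only the mean, sum, and count fields mentioned above, the function names drop the leading underscore to mark them as illustrative, and both the "score of 1.0 means correct" rule and returning None for non-allowlisted metrics are assumptions. The allowlist patterns are the ones named in the list, matched here with the standard-library fnmatch module.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Optional


@dataclass
class StatStandIn:
    """Hypothetical stand-in for a HELM stat; only the fields used below."""
    name: str
    mean: Optional[float] = None
    sum: float = 0.0
    count: int = 0


# Patterns taken from the change list; "*" matches any suffix (including none).
CORRECTNESS_PATTERNS = (
    "exact_match*",
    "ifeval_strict_accuracy",
    "chain_of_thought_correctness",
    "math_equiv*",
)


def score_from_stat(stat: StatStandIn) -> Optional[float]:
    """Prefer the precomputed mean; otherwise fall back to sum / count."""
    if stat.mean is not None:
        return float(stat.mean)
    if stat.count:
        return stat.sum / stat.count
    return None  # empty stat: no scalar score available


def is_correct_for_metric(metric_name: str, score: Optional[float]) -> Optional[bool]:
    """Return a boolean only for allowlisted binary correctness metrics."""
    if score is None:
        return None
    if not any(fnmatchcase(metric_name, p) for p in CORRECTNESS_PATTERNS):
        return None  # graded or bookkeeping metric: leave is_correct unset
    return score >= 1.0  # assumption: a score of 1.0 means "correct"
```

For example, under these assumptions a num_output_tokens stat never produces an is_correct value, while exact_match with a score of 0.0 produces False.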