Collect results all metrics #77
- Replace _resolve_metric (single-metric selector) with _extract_all_metrics
which returns every numeric key in a result dict as a (metric_name, value)
pair. Both primary metrics and their _stderr counterparts are included.
- lm-eval harness keys like 'acc,none' are normalised to 'acc' by stripping
the ',{filter}' suffix; when two raw keys collapse to the same name the
first encountered wins.
- Output CSV now contains one row per (model, task, n_shot, metric_name) tuple
instead of one row per task, so no pre-computed metric is silently discarded.
- The task_metrics section in task-groups.yaml is no longer consulted and the
yaml/importlib.resources loading block is removed from collect_results.
- pivot_results.py: task_label now includes the metric name so the pivot stays
unambiguous (e.g. 'hellaswag (10-shot) [acc_norm]').
- Fix pre-existing falsy-zero bug in n_shot detection: replace the 'or'-chain
with explicit None checks so that n_shot=0 is preserved rather than falling
through to 'unknown'.
- Add tests/test_collect_results.py with 17 tests covering all new behaviour.
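The extraction rule described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the function name `extract_all_metrics`, the explicit exclusion of booleans, and the sample dict (including the `acc,strict` key) are assumptions made for the example.

```python
from numbers import Real


def extract_all_metrics(result_dict: dict) -> list[tuple[str, float]]:
    """Return (metric_name, value) for every numeric entry; first name wins."""
    seen: set[str] = set()
    metrics: list[tuple[str, float]] = []
    for raw_key, value in result_dict.items():
        # Assumption: booleans are not metrics, although bool subclasses int.
        if isinstance(value, bool) or not isinstance(value, Real):
            continue
        # Strip the ',{filter}' suffix: 'acc,none' -> 'acc'.
        name = raw_key.split(",", 1)[0]
        if name in seen:
            # A later raw key collapsed to an already-seen name: skip it.
            continue
        seen.add(name)
        metrics.append((name, float(value)))
    return metrics


# Hypothetical result dict in the shape of an lm-eval task entry.
example = {
    "alias": "hellaswag",      # non-numeric, excluded
    "acc,none": 0.5,
    "acc_stderr,none": 0.01,   # kept as an ordinary metric
    "acc_norm,none": 0.75,
    "acc,strict": 0.25,        # collapses to 'acc', dropped (first wins)
}
print(extract_all_metrics(example))
# [('acc', 0.5), ('acc_stderr', 0.01), ('acc_norm', 0.75)]
```

Relying on dict insertion order keeps "first encountered wins" deterministic for a given JSON file.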
Thanks @haideraltahan! I like the option to fetch all metrics that are computed, but at the same time, I feel it is good to provide only the "standardized metric" per task in the output. For example, if I wanted to do a basic comparison between two models, I would care most about which metric is most commonly used for the task and just pick that.
@swag2198 fair point! Will add that adjustment :)
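One way the two views could coexist, sketched under assumptions: keep the all-metrics rows and select one "standardized" metric per task downstream. The mapping `PRIMARY_METRIC`, the helper `select_primary`, and the `value` column name are hypothetical and not taken from the PR.

```python
# Hypothetical per-task choice of the "standardized" metric.
PRIMARY_METRIC = {"hellaswag": "acc_norm", "flores200": "bleu"}


def select_primary(rows: list[dict], primary: dict[str, str]) -> list[dict]:
    """Keep only rows whose metric_name is the primary metric of their task."""
    return [row for row in rows if primary.get(row["task"]) == row["metric_name"]]


# Rows in the long format: one per (model, task, n_shot, metric_name).
rows = [
    {"model": "m1", "task": "hellaswag", "n_shot": 10, "metric_name": "acc", "value": 0.5},
    {"model": "m1", "task": "hellaswag", "n_shot": 10, "metric_name": "acc_norm", "value": 0.75},
    {"model": "m1", "task": "flores200", "n_shot": 0, "metric_name": "chrf++", "value": 40.0},
]
primary_rows = select_primary(rows, PRIMARY_METRIC)
print([(r["task"], r["metric_name"]) for r in primary_rows])
# [('hellaswag', 'acc_norm')]
```

Tasks with no row for their primary metric simply drop out, so a caller may want to check for missing tasks after filtering.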
def _extract_all_metrics(result_dict: dict) -> list[tuple[str, float]]:
    """Return (metric_name, value) for every numeric entry in result_dict.

    lm-eval harness stores keys as ``{metric},{filter}`` (e.g. ``acc,none``
Just a small question. When filter is not none, what does it indicate and is it still safe to remove?
Great question: when the filter is not none, it indicates that a post-processing step was applied to the model outputs before computing the metric (e.g., normalization, strict matching, or task-specific transformations).
In the current implementation, we reduce each {metric},{filter} key to {metric} by stripping the ,{filter} suffix, which can lead to collisions if multiple filters are present (e.g., acc,none and acc,strict both becoming acc). However, for the current evaluation suite, all tasks emit metrics with the default none filter (e.g., acc,none, acc_norm,none, f1,none). Therefore, stripping the filter suffix does not lead to collisions in practice.
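The claim that no collisions occur in practice could be checked mechanically. The sketch below groups raw keys by their stripped name and reports any name reached by more than one key; `find_filter_collisions` is a hypothetical helper, not part of the PR.

```python
from collections import defaultdict


def find_filter_collisions(result_dict: dict) -> dict[str, list[str]]:
    """Map each stripped metric name to its raw keys when more than one key shares it."""
    groups: dict[str, list[str]] = defaultdict(list)
    for raw_key, value in result_dict.items():
        # Only numeric entries become metrics; skip strings and booleans.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue
        groups[raw_key.split(",", 1)[0]].append(raw_key)
    return {name: keys for name, keys in groups.items() if len(keys) > 1}


# Default filter only: nothing collides.
print(find_filter_collisions({"acc,none": 0.5, "acc_norm,none": 0.75}))
# {}

# Two filters for the same metric: 'acc' is reached twice.
print(find_filter_collisions({"acc,none": 0.5, "acc,strict": 0.25, "f1,none": 0.3}))
# {'acc': ['acc,none', 'acc,strict']}
```

Running such a check over the result JSONs would surface a collision as soon as a task with a non-default filter is added to the suite.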
collect-results: emit all metrics instead of picking one per task
Closes #75
Problem
collect_results extracted a single metric per task using a hardcoded priority list ('acc,none' → acc → acc_norm → f1 → ...) or a per-task override in the task_metrics map in task-groups.yaml. This silently discarded metrics that were already computed and sitting in the result JSONs:
- A flores200 result already contains bleu, bleu_1, bleu_4, chrf++, and all their _stderr variants; only one made it to the CSV
- Tasks with both acc and acc_norm (e.g. hellaswag, arc_challenge) only reported one
- task_metrics had to be manually extended every time a new task was added, and could easily drift out of sync
Solution
Replace _resolve_metric with _extract_all_metrics: iterate over every key in a result dict and emit one CSV row per numeric value. The lm-eval harness ',{filter}' suffix is stripped for clean metric names (e.g. 'acc,none' → acc); when two raw keys collapse to the same stripped name the first wins. _stderr keys are treated as ordinary metrics and get their own rows; they are not filtered out.
The output schema gains a metric_name column and produces one row per (model, task, n_shot, metric_name) tuple.
Users can filter on metric_name downstream and no pre-computed metric is silently discarded.
Changes
- oellm/main.py: replace _resolve_metric with _extract_all_metrics; remove the yaml/importlib.resources block that loaded task_metrics; fix a pre-existing falsy-zero bug where n_shot=0 fell through to "unknown" in an or-chain
- oellm/resources/task-groups.yaml: remove the entire task_metrics section (172 lines); it is no longer consulted by any code
- scripts/pivot_results.py: add metric_name to required columns; include it in task_label (e.g. "hellaswag (10-shot) [acc_norm]") so pivot columns stay unambiguous when multiple metrics exist for the same task
- tests/test_collect_results.py (new): 17 tests covering multiple metrics per task, correct values, suffix stripping, _stderr as a separate row, non-numeric exclusion, duplicate name deduplication, full flores200 metric set, output schema validation, multi-file merge, and edge cases
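The falsy-zero fix and the new label format from the list above can be illustrated together. This is a sketch under assumptions: `resolve_n_shot` and `task_label` are hypothetical helpers, and where the real code reads its n_shot candidates from is not shown in this PR text.

```python
def resolve_n_shot(*candidates, default="unknown"):
    """Return the first candidate that is not None, so n_shot=0 is preserved."""
    for candidate in candidates:
        if candidate is not None:
            return candidate
    return default


def task_label(task: str, n_shot, metric_name: str) -> str:
    """Build a pivot label in the form shown in the PR description."""
    return f"{task} ({n_shot}-shot) [{metric_name}]"


# An 'or'-chain treats 0 as missing because 0 is falsy.
buggy = 0 or None or "unknown"
fixed = resolve_n_shot(0, None)
print(buggy, fixed)
# unknown 0

print(task_label("hellaswag", resolve_n_shot(None, 10), "acc_norm"))
# hellaswag (10-shot) [acc_norm]
```

Checking `is not None` instead of truthiness is what lets zero-shot runs keep their own label rather than being grouped under "unknown".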