Collect results all metrics by haideraltahan · Pull Request #77 · OpenEuroLLM/oellm-eval

haideraltahan · 2026-05-27T12:30:55Z

collect-results: emit all metrics instead of picking one per task

Closes #75

Problem

collect_results extracted a single metric per task using a hardcoded priority list (acc,none → acc → acc_norm → f1 → ...) or a per-task override in the task_metrics map in task-groups.yaml. This silently discarded metrics that were already computed and sitting in the result JSONs:
- A flores200 result already contains bleu, bleu_1, bleu_4, chrf++, and all their _stderr variants, but only one made it to the CSV
- Tasks with both acc and acc_norm (e.g. hellaswag, arc_challenge) only reported one
- task_metrics had to be manually extended every time a new task was added, and could easily drift out of sync

Solution

Replace _resolve_metric with _extract_all_metrics: iterate over every key in a result dict and emit one CSV row per numeric value. The ,{filter} suffix that the lm-eval harness appends to metric keys is stripped for clean metric names (e.g. acc,none → acc); when two raw keys collapse to the same stripped name, the first wins. _stderr keys are treated as ordinary metrics and get their own rows; they are not filtered out.
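The extraction rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in oellm/main.py: the function name extract_all_metrics_sketch and the sample dict are hypothetical, and skipping booleans is an assumption about what "numeric" means here.

```python
def extract_all_metrics_sketch(result_dict: dict) -> list[tuple[str, float]]:
    """Return (metric_name, value) for every numeric entry in result_dict."""
    seen: set[str] = set()
    metrics: list[tuple[str, float]] = []
    for raw_key, value in result_dict.items():
        # Assumption: bool is not treated as a metric even though it subclasses int.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue
        # "acc,none" -> "acc"; keys without a comma are kept unchanged.
        name = raw_key.split(",", 1)[0]
        # First raw key wins when two keys collapse to the same name.
        if name in seen:
            continue
        seen.add(name)
        metrics.append((name, float(value)))
    return metrics


# Hypothetical result dict in the shape lm-eval writes per task.
sample = {
    "alias": "hellaswag",          # non-numeric, skipped
    "acc,none": 0.45,
    "acc_stderr,none": 0.005,      # kept as its own metric
    "acc_norm,none": 0.63,
    "acc,strict": 0.40,            # collapses to "acc", dropped (first wins)
    "bleu_stderr,none": "N/A",     # non-numeric, skipped
}
rows = extract_all_metrics_sketch(sample)
```

Here rows holds three pairs: acc, acc_stderr, and acc_norm, in insertion order.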

The output schema gains a metric_name column and produces one row per (model, task, n_shot, metric_name) tuple:

model_name	task	n_shot	metric_name	performance
model-a	flores200:eng_Latn-ita_Latn	0	chrf++	57.676
model-a	flores200:eng_Latn-ita_Latn	0	chrf++_stderr	0.422
model-a	flores200:eng_Latn-ita_Latn	0	bleu	21.160
model-a	flores200:eng_Latn-ita_Latn	0	bleu_1	0.512
model-a	hellaswag	10	acc	0.450
model-a	hellaswag	10	acc_norm	0.630
model-a	hellaswag	10	acc_norm_stderr	0.004

Users can filter on metric_name downstream and no pre-computed metric is silently discarded.
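As an illustration of that downstream filtering, the sketch below reads a CSV with the new schema and keeps only the rows for one metric, or drops the _stderr rows. The comma delimiter, the helper names, and the inline sample (a subset of the rows shown above) are assumptions for the example.

```python
import csv
import io

# Subset of the example rows above, written as comma-separated text (assumed delimiter).
SAMPLE_CSV = """\
model_name,task,n_shot,metric_name,performance
model-a,flores200:eng_Latn-ita_Latn,0,chrf++,57.676
model-a,flores200:eng_Latn-ita_Latn,0,chrf++_stderr,0.422
model-a,hellaswag,10,acc,0.450
model-a,hellaswag,10,acc_norm,0.630
model-a,hellaswag,10,acc_norm_stderr,0.004
"""


def rows_for_metric(csv_text: str, metric_name: str) -> list[dict]:
    """Keep only the rows whose metric_name matches exactly."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row["metric_name"] == metric_name]


def drop_stderr_rows(csv_text: str) -> list[dict]:
    """Keep every row except the *_stderr ones."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if not row["metric_name"].endswith("_stderr")]


acc_norm_rows = rows_for_metric(SAMPLE_CSV, "acc_norm")
point_estimates = drop_stderr_rows(SAMPLE_CSV)
```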

Changes

- oellm/main.py: replace _resolve_metric with _extract_all_metrics; remove the yaml/importlib.resources block that loaded task_metrics; fix a pre-existing falsy-zero bug where n_shot=0 fell through to "unknown" in an or-chain
- oellm/resources/task-groups.yaml: remove the entire task_metrics section (172 lines), which is no longer consulted by any code
- scripts/pivot_results.py: add metric_name to required columns; include it in task_label (e.g. "hellaswag (10-shot) [acc_norm]") so pivot columns stay unambiguous when multiple metrics exist for the same task
- tests/test_collect_results.py (new): 17 tests covering multiple metrics per task, correct values, suffix stripping, _stderr as a separate row, non-numeric exclusion, duplicate name deduplication, full flores200 metric set, output schema validation, multi-file merge, and edge cases
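The falsy-zero bug mentioned for oellm/main.py is a common Python pitfall, shown below in isolation. The function and parameter names are hypothetical; the real code's candidate sources for n_shot are not shown in this PR description.

```python
def pick_n_shot_buggy(primary, fallback):
    # `or` skips every falsy value, so a legitimate n_shot of 0 is lost.
    return primary or fallback or "unknown"


def pick_n_shot_fixed(primary, fallback):
    # Explicit None checks treat 0 as a real value.
    if primary is not None:
        return primary
    if fallback is not None:
        return fallback
    return "unknown"


buggy = pick_n_shot_buggy(0, None)   # "unknown"
fixed = pick_n_shot_fixed(0, None)   # 0
```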

- Replace _resolve_metric (single-metric selector) with _extract_all_metrics which returns every numeric key in a result dict as a (metric_name, value) pair. Both primary metrics and their _stderr counterparts are included.
- lm-eval harness keys like 'acc,none' are normalised to 'acc' by stripping the ',{filter}' suffix; when two raw keys collapse to the same name the first encountered wins.
- Output CSV now contains one row per (model, task, n_shot, metric_name) tuple instead of one row per task, so no pre-computed metric is silently discarded.
- The task_metrics section in task-groups.yaml is no longer consulted and the yaml/importlib.resources loading block is removed from collect_results.
- pivot_results.py: task_label now includes the metric name so the pivot stays unambiguous (e.g. 'hellaswag (10-shot) [acc_norm]').
- Fix pre-existing falsy-zero bug in n_shot detection: replace the 'or'-chain with explicit None checks so that n_shot=0 is preserved rather than falling through to 'unknown'.
- Add tests/test_collect_results.py with 17 tests covering all new behaviour.
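For the task_label change, a label in the format quoted above could be built as in this small sketch. The helper name make_task_label is hypothetical and the actual code in scripts/pivot_results.py may differ; only the output format is taken from the example in this PR.

```python
def make_task_label(task: str, n_shot: int, metric_name: str) -> str:
    # Including the metric keeps pivot columns distinct when one task
    # reports several metrics.
    return f"{task} ({n_shot}-shot) [{metric_name}]"


label = make_task_label("hellaswag", 10, "acc_norm")
```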

swag2198 · 2026-06-03T19:09:43Z

Thanks @haideraltahan! I like the option to fetch all computed metrics, but at the same time I feel it is good to provide only the "standardized metric" per task in the output. For example, if I wanted to do a basic comparison between two models, I would care most about the metric most commonly used for the task and just pick that.
Would it make sense to add a --fetch-all-metrics flag to the collect-results utility for the behaviour intended by this PR?
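One way the proposed switch could behave is sketched below: by default only each task's primary metric is kept, and all rows pass through when the flag is set. The PRIMARY_METRIC mapping, the select_rows helper, and the choice to drop tasks without a mapping entry are all assumptions for illustration, not the PR's implementation.

```python
# Hypothetical task -> primary metric mapping (illustrative values only).
PRIMARY_METRIC = {
    "hellaswag": "acc_norm",
    "flores200:eng_Latn-ita_Latn": "chrf++",
}


def select_rows(rows, fetch_all_metrics=False, primary=PRIMARY_METRIC):
    """Return all rows, or only each task's primary-metric row."""
    if fetch_all_metrics:
        return list(rows)
    # Assumption: tasks missing from the mapping produce no rows by default.
    return [r for r in rows if primary.get(r["task"]) == r["metric_name"]]


rows = [
    {"task": "hellaswag", "metric_name": "acc", "performance": 0.450},
    {"task": "hellaswag", "metric_name": "acc_norm", "performance": 0.630},
    {"task": "hellaswag", "metric_name": "acc_norm_stderr", "performance": 0.004},
]
default_rows = select_rows(rows)
all_rows = select_rows(rows, fetch_all_metrics=True)
```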

haideraltahan · 2026-06-03T19:34:16Z

@swag2198 fair point! Will add that adjustment :)

kerkathy · 2026-06-05T09:11:11Z

+    def _extract_all_metrics(result_dict: dict) -> list[tuple[str, float]]:
+        """Return (metric_name, value) for every numeric entry in result_dict.
+
+        lm-eval harness stores keys as ``{metric},{filter}`` (e.g. ``acc,none``


Just a small question. When the filter is not none, what does it indicate, and is it still safe to remove?

haideraltahan · 2026-06-16

Great question. When the filter is not none, it indicates that a post-processing step was applied to the model outputs before computing the metric (e.g., normalization, strict matching, or task-specific transformations).

In the current implementation, we reduce {metric},{filter} to {metric}, which can lead to collisions if multiple filters are present (e.g., acc,none and acc,strict both becoming acc). However, in the current evaluation suite, all tasks emit metrics with the default none filter (e.g., acc,none, acc_norm,none, f1,none), so stripping the filter suffix does not cause collisions in practice.
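If tasks with non-default filters are added later, the collision case could be detected before stripping. The sketch below groups raw keys by stripped name and reports names that appear under more than one filter; the helper name find_filter_collisions and the sample dicts are hypothetical.

```python
from collections import defaultdict


def find_filter_collisions(result_dict: dict) -> dict[str, list[str]]:
    """Map each stripped metric name to its filters when there is more than one."""
    filters_by_name: dict[str, list[str]] = defaultdict(list)
    for raw_key in result_dict:
        name, sep, filter_name = raw_key.partition(",")
        if sep:  # only keys of the form "{metric},{filter}"
            filters_by_name[name].append(filter_name)
    return {n: f for n, f in filters_by_name.items() if len(f) > 1}


collisions = find_filter_collisions(
    {"acc,none": 0.45, "acc,strict": 0.40, "f1,none": 0.50, "alias": "task"}
)
no_collisions = find_filter_collisions({"acc,none": 0.45, "acc_norm,none": 0.63})
```

With the first sample, collisions reports acc under the filters none and strict; the second sample yields an empty dict.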

Haider Al-Tahan and others added 5 commits May 27, 2026 14:12

task-groups: remove task_metrics section (no longer used)

0e209ac

Merge branch 'main' into collect-results-all-metrics

efaa1ac

tests: remove test_collect_results.py

7c0d13f

Merge branch 'main' into collect-results-all-metrics

fda7ec1

haideraltahan commented Jun 3, 2026

Comment thread oellm/resources/task-groups.yaml

Haider Al-Tahan added 2 commits June 3, 2026 23:24

collect: add --fetch-all-metrics flag and primary-metrics.yaml

53edbd2

style: ruff format fixes

332af95

haideraltahan requested a review from geoalgo June 4, 2026 17:57

kerkathy reviewed Jun 5, 2026

Comment thread oellm/main.py

Merge branch 'main' into collect-results-all-metrics

5d54800

haideraltahan requested a review from kerkathy June 9, 2026 17:37

Merge branch 'main' into collect-results-all-metrics

9a3f8d8

kerkathy reviewed Jun 11, 2026

haideraltahan requested a review from kerkathy June 23, 2026 02:36

haideraltahan added the enhancement label (New feature or request) Jul 2, 2026

haideraltahan wants to merge 9 commits into main from collect-results-all-metrics
