Add PolyMath by haideraltahan · Pull Request #94 · OpenEuroLLM/oellm-eval

haideraltahan · 2026-06-25T05:18:43Z

Summary

I added PolyMath (Qwen/PolyMath) multilingual math reasoning as four lm-evaluation-harness groups — polymath-eu-low / -medium / -high / -top, one per difficulty tier so a tier can be run in isolation — at 0-shot generative, with the final answer parsed from \boxed{} and scored by exact_match. PolyMath isn't in lm-eval-harness or lighteval, so I added a custom task under custom_lm_eval_tasks/polymath/ and vendored the official Qwen/PolyMath math_equal judge verbatim (polymath_eval.py, ruff-excluded to keep it byte-for-byte). One of the Evals from #89.

Language coverage — EU subset only

PolyMath ships 18 languages; 6 of them are in our EU set, and I include all 6 (each across all 4 tiers):

German (de), English (en), Spanish (es), French (fr), Italian (it), Portuguese (pt)

The other 12 PolyMath languages (ar, bn, id, ja, ko, ms, ru, sw, te, th, vi, zh) are non-EU, so they're omitted. Each group is a {lang} template with valid_langs, and each task resolves to its lang_Scri code via subset (de→deu_Latn, …), so bracket scoping (polymath-eu-top[deu_Latn]) resolves correctly.
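As a rough illustration of the scoping described above, the sketch below enumerates the per-(language, difficulty) task names and filters them by a bracketed subset code. Only the `de` to `deu_Latn` mapping and the `polymath_de_low` naming pattern appear in this PR; the other five codes are assumed to follow the usual ISO 639-3 plus script convention, and `LANG_TO_SUBSET` and `tasks_for_group` are hypothetical names, not the registry's actual code.

```python
# Hypothetical mapping from PolyMath language configs to subset codes.
# Only de -> deu_Latn is stated in the PR; the rest are assumed.
LANG_TO_SUBSET = {
    "de": "deu_Latn",
    "en": "eng_Latn",
    "es": "spa_Latn",
    "fr": "fra_Latn",
    "it": "ita_Latn",
    "pt": "por_Latn",
}
TIERS = ("low", "medium", "high", "top")


def tasks_for_group(tier, subset=None):
    """Return the task names in one tier group, optionally scoped to one subset code."""
    if tier not in TIERS:
        raise ValueError(f"unknown tier: {tier!r}")
    langs = [
        lang
        for lang, code in LANG_TO_SUBSET.items()
        if subset is None or code == subset
    ]
    if not langs:
        raise ValueError(f"subset not in valid_langs: {subset!r}")
    return [f"polymath_{lang}_{tier}" for lang in langs]


# All tasks across the four tier groups: 6 languages x 4 tiers.
all_tasks = [task for tier in TIERS for task in tasks_for_group(tier)]
```

Under these assumptions, `tasks_for_group("top", "deu_Latn")` corresponds to what `polymath-eu-top[deu_Latn]` would select.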

Metric

exact_match on the \boxed{} answer, judged by the vendored official math_equal comparator.
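To make the metric concrete, here is a minimal sketch of the two steps: pull the last `\boxed{}` span out of a generation, then compare it with the gold answer numerically. It is a simplified stand-in, not the vendored `extract_boxed_content` or `math_equal`; `last_boxed` and `numeric_equal` are hypothetical names, and only the relative tolerance of 1e-4 and the percentage variants are taken from the commit message further down. The real judge also handles LaTeX, symbolic, matrix, list and equation matching, which this sketch omits.

```python
import math


def last_boxed(text):
    """Return the contents of the last \\boxed{...} span in text, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    # Walk forward, tracking brace depth so nested groups stay intact.
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # unbalanced braces


def _to_number(s):
    """Parse a plain number; a trailing % divides the value by 100."""
    s = s.strip().replace(",", "")
    if s.endswith("%"):
        return float(s[:-1]) / 100
    return float(s)


def numeric_equal(pred, gold, rel_tol=1e-4):
    """Compare two answer strings as numbers, also trying percentage variants."""
    try:
        p, g = _to_number(pred), _to_number(gold)
    except ValueError:
        return False
    return any(math.isclose(p, cand, rel_tol=rel_tol) for cand in (g, g / 100, g * 100))
```

For example, `numeric_equal(last_boxed(r"... so \boxed{18}"), "18.0")` treats the generation as correct.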

Implements the Qwen/PolyMath benchmark as a generative lm-eval-harness task, scoped to EU languages per the dataset's language configs: de, en, es, fr, it, pt. PolyMath ships one config per language and four difficulty splits per config (low/medium/high/top), so this adds one task per (language, difficulty), 24 tasks in total, each tagged `polymath` and `polymath_<lang>` for aggregate reporting.

Approach (mirrors lm-eval's minerva_math, but self-contained):
- Prompt asks the model to reason and box its final answer in \boxed{}.
- A custom process_results extracts the last \boxed{} span (falling back to the last number), normalises gold + prediction with the Minerva/Lewkowycz string normalisation, and scores exact_match with a light sympy numeric fallback (e.g. 18 == 18.0, 1/2 == 0.5). Deliberately avoids math_verify / parse_latex, which are unavailable in this venv (no math_verify; antlr4 is 4.13 not 4.11).

Registry:
- New `polymath-eu` task group (0-shot) with a {lang} valid_langs template, so `polymath-eu[deu_Latn]` etc. language brackets work out of the box.
- task_metrics entries (exact_match) for all 24 tasks; README group list updated.

Validated: lm-eval discovers all 24 tasks + tags (template not leaked), YAML include + !function resolution works, the task instantiates against the real `de` config (125 docs/split, fields id/question/answer), and process_results scores correct/incorrect generations as expected.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Replace the single polymath-eu group with four per-tier groups (polymath-eu-low / -medium / -high / -top) so a difficulty tier can be run in isolation. The 24 per-(language, difficulty) task YAMLs and their task_metrics entries are unchanged. 'top' is PolyMath's hardest tier (Olympiad-level). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Pass --batch_size auto to lm_eval so batch size is tuned per task instead of defaulting to 1. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Replace the bespoke Minerva-normalisation + narrow sympy exact_match with the upstream PolyMath equivalence judge so our scores reproduce the benchmark instead of being a strict lower bound.

- Add polymath_eval.py: the official `math_equal` (numeric isclose rel_tol=1e-4, percentage variants, LaTeX/symbolic + matrix/list/equation matching), vendored verbatim from QwenLM/PolyMath eval/scripts.py. Two justified deviations:
  * latex2sympy comes from latex2sympy2_extended (already in the venv via lighteval). Classic latex2sympy2 pins antlr 4.7.2 and would break lighteval; the extended fork exposes the same `latex2sympy` and only feeds the symbolic_equal fallback, so the decision logic is unchanged.
  * call_with_timeout runs the symbolic check in a daemon thread, not a forked process: forking corrupts latex2sympy2_extended's parser state here, which silently scored LaTeX answers wrong. A one-time warm-up at import avoids a cold-start timeout on the first comparison. Verified deterministic (50x).
- utils.py: extract the answer with run_eval.py's `extract_boxed_content` (verbatim) and score with `math_equal`, mirroring upstream run_eval.py. Drops the old last-number fallback (upstream uses boxed-only).
- pyproject.toml: exclude the vendored scorer from ruff to keep it verbatim.

Validated end-to-end through lm-eval TaskManager: polymath_de_low loads, 125 docs, and process_results scores boxed gold/fraction/percentage/wrong correctly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
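The daemon-thread timeout mentioned in this commit can be sketched roughly as follows. This is an assumed shape, not the vendored code: the signature, the `default` parameter and the error handling are illustrative choices, and the real `call_with_timeout` may differ.

```python
import threading


def call_with_timeout(func, *args, timeout=1.0, default=False):
    """Run func(*args) in a daemon thread; return default on timeout or error."""
    result = [default]

    def worker():
        try:
            result[0] = func(*args)
        except Exception:
            # A failed comparison counts as "not equal" in this sketch.
            result[0] = default

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)
    if thread.is_alive():
        # The thread cannot be killed; being a daemon, it will not block exit.
        return default
    return result[0]
```

The trade-off of this design is that a check which times out keeps running in the background until it finishes, whereas a forked process could be terminated; the commit message explains why forking was not an option here.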

Point MIOPEN_USER_DB_PATH / MIOPEN_CUSTOM_CACHE_DIR at a job-unique writable dir (and create it) so the conv1d in linear-attention models stops failing on ROCm with "Cannot open database file:/tmp/gfx90a*.ukdb". Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
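A minimal sketch of the same idea, written in Python rather than the job script: it builds a job-unique directory, creates it, and points both variables at it. `configure_miopen_cache` is a hypothetical helper, and deriving the directory name from `SLURM_JOB_ID` (falling back to the process ID) is an assumption; the PR only says the directory is job-unique and writable.

```python
import os
import tempfile


def configure_miopen_cache(base_dir=None, job_id=None):
    """Create a job-unique cache dir and point MIOpen's env vars at it."""
    base = base_dir or tempfile.gettempdir()
    # Assumption: use the Slurm job ID when present, else the process ID.
    job = job_id or os.environ.get("SLURM_JOB_ID") or str(os.getpid())
    path = os.path.join(base, f"miopen-{job}")
    os.makedirs(path, exist_ok=True)
    os.environ["MIOPEN_USER_DB_PATH"] = path
    os.environ["MIOPEN_CUSTOM_CACHE_DIR"] = path
    return path
```

For the variables to take effect, a helper like this would need to run before the process first initialises MIOpen, for example before the model is loaded.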

haideraltahan mentioned this pull request Jun 26, 2026

Create a multilingual eval suite #89

Open

34 tasks

haideraltahan changed the title ~~Add PolyMath multilingual (EU)~~ Add PolyMath Jul 1, 2026

haideraltahan added the new_benchmark label Jul 1, 2026

Haider Altahan and others added 6 commits July 1, 2026 04:08

fix: use --batch_size auto in eval sbatch

6e81468

style: keep polymath task-group descriptions under 80 chars, sort uti…

b4889cb

…ls imports

haideraltahan force-pushed the PolyMath branch from dd7389c to b4889cb Compare July 1, 2026 01:12

haideraltahan wants to merge 6 commits into main from PolyMath
