Add PolyMath #94

Open: haideraltahan wants to merge 6 commits into main from PolyMath

haideraltahan (Collaborator) commented Jun 25, 2026

Summary

I added PolyMath (Qwen/PolyMath), a multilingual math reasoning benchmark, as four lm-evaluation-harness groups: polymath-eu-low, polymath-eu-medium, polymath-eu-high, and polymath-eu-top. There is one group per difficulty tier so that a tier can be run in isolation. All tasks are 0-shot generative; the final answer is parsed from \boxed{} and scored by exact_match. PolyMath isn't in lm-eval-harness or lighteval, so I added a custom task under custom_lm_eval_tasks/polymath/ and vendored the official Qwen/PolyMath math_equal judge (polymath_eval.py). The vendored file is excluded from ruff so it stays as close to upstream as possible; its only changes are the two deviations documented in the commit that adds it. This is one of the evals from #89.

Language coverage — EU subset only

PolyMath ships 18 languages; 6 of them are in our EU set, and I include all 6 (each across all 4 tiers):

German (de), English (en), Spanish (es), French (fr), Italian (it), Portuguese (pt)

The other 12 PolyMath languages (ar, bn, id, ja, ko, ms, ru, sw, te, th, vi, zh) are non-EU, so they're omitted. Each group is a {lang} template with valid_langs, and each task resolves to its lang_Script code via subset (de → deu_Latn, …), so bracket scoping (polymath-eu-top[deu_Latn]) resolves correctly.
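
To make the group and bracket scoping concrete, here is a minimal sketch of how the six configs and four tiers expand into task names. Only `deu_Latn` and the `polymath_de_low` naming pattern appear in this PR; the other five codes are assumed from ISO 639-3 plus Latin script, and `tasks_for_group` is a hypothetical helper, not part of the registry.

```python
# PolyMath config name -> lang_Script code.
# Only "de" -> "deu_Latn" is confirmed by the PR; the rest are assumed
# from ISO 639-3 language codes with Latin script.
POLYMATH_EU_LANGS = {
    "de": "deu_Latn",
    "en": "eng_Latn",
    "es": "spa_Latn",
    "fr": "fra_Latn",
    "it": "ita_Latn",
    "pt": "por_Latn",
}
TIERS = ("low", "medium", "high", "top")


def tasks_for_group(tier, lang_script=None):
    """Hypothetical helper: list the tasks behind polymath-eu-<tier>.

    Passing lang_script mimics bracket scoping, e.g. polymath-eu-top[deu_Latn].
    """
    if tier not in TIERS:
        raise ValueError(f"unknown tier: {tier!r}")
    if lang_script is not None and lang_script not in POLYMATH_EU_LANGS.values():
        raise ValueError(f"language not in valid_langs: {lang_script!r}")
    return [
        f"polymath_{lang}_{tier}"
        for lang, code in POLYMATH_EU_LANGS.items()
        if lang_script in (None, code)
    ]


all_tasks = [name for tier in TIERS for name in tasks_for_group(tier)]
```

With these assumptions, `tasks_for_group("top", "deu_Latn")` yields the single task `polymath_de_top`, and `all_tasks` has 24 entries (6 languages × 4 tiers).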

Metric

exact_match on the \boxed{} answer, judged by the vendored official math_equal comparator.

haideraltahan changed the title from "Add PolyMath multilingual (EU)" to "Add PolyMath" on Jul 1, 2026
Haider Altahan and others added 6 commits July 1, 2026 04:08
Implements the Qwen/PolyMath benchmark as a generative lm-eval-harness task,
scoped to EU languages per the dataset's language configs: de, en, es, fr, it, pt.

PolyMath ships one config per language and four difficulty splits per config
(low/medium/high/top), so this adds one task per (language, difficulty) — 24
tasks — each tagged `polymath` and `polymath_<lang>` for aggregate reporting.

Approach (mirrors lm-eval's minerva_math, but self-contained):
- Prompt asks the model to reason and box its final answer in \boxed{}.
- A custom process_results extracts the last \boxed{} span (falling back to the
  last number), normalises gold + prediction with the Minerva/Lewkowycz string
  normalisation, and scores exact_match with a light sympy numeric fallback
  (e.g. 18 == 18.0, 1/2 == 0.5). Deliberately avoids math_verify / parse_latex,
  which are unavailable in this venv (no math_verify; antlr4 is 4.13 not 4.11).
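
The extraction and scoring steps above can be sketched roughly as follows. This is an illustration, not the code in this commit: the function names are hypothetical, the Minerva string normalisation is omitted, and the standard-library `fractions.Fraction` stands in for the sympy numeric fallback.

```python
import re
from fractions import Fraction

# Integers, decimals, and simple a/b fractions (used only for the fallback).
NUMBER = re.compile(r"-?\d+(?:\.\d+)?(?:/\d+)?")


def last_boxed(text):
    # Return the contents of the last \boxed{...} span, or None if absent.
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    i = start + len("\\boxed{")
    depth = 1
    out = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
        i += 1
    return None  # braces never closed


def extract_answer(generation):
    # Prefer the boxed answer; otherwise fall back to the last number.
    boxed = last_boxed(generation)
    if boxed is not None:
        return boxed.strip()
    numbers = NUMBER.findall(generation)
    return numbers[-1] if numbers else ""


def is_match(gold, pred):
    # Exact string match first, then an exact rational comparison
    # (Fraction replaces the sympy check described above).
    gold, pred = gold.strip(), pred.strip()
    if gold == pred:
        return True
    try:
        return Fraction(gold) == Fraction(pred)
    except (ValueError, ZeroDivisionError):
        return False
```

Under this sketch `is_match("18", "18.0")` and `is_match("1/2", "0.5")` both hold, matching the two examples in the commit message, while LaTeX answers such as `\frac{1}{2}` only match by exact string.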

Registry:
- New `polymath-eu` task group (0-shot) with a {lang} valid_langs template, so
  `polymath-eu[deu_Latn]` etc. language brackets work out of the box.
- task_metrics entries (exact_match) for all 24 tasks; README group list updated.

Validated: lm-eval discovers all 24 tasks + tags (template not leaked), YAML
include + !function resolution works, the task instantiates against the real
`de` config (125 docs/split, fields id/question/answer), and process_results
scores correct/incorrect generations as expected.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replace the single polymath-eu group with four per-tier groups
(polymath-eu-low / -medium / -high / -top) so a difficulty tier can be run in
isolation. The 24 per-(language, difficulty) task YAMLs and their task_metrics
entries are unchanged. 'top' is PolyMath's hardest tier (Olympiad-level).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Pass --batch_size auto to lm_eval so batch size is tuned per task instead of
defaulting to 1.
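
For reference, a sketch of an lm_eval invocation with this flag. The `--model`, `--model_args`, `--tasks`, `--include_path`, and `--batch_size` options are standard lm_eval CLI flags; the model name and the `build_lm_eval_command` helper are hypothetical, and the repository's own launcher may assemble the command differently.

```python
import shlex


def build_lm_eval_command(tasks, model_args, include_path="custom_lm_eval_tasks"):
    """Hypothetical helper: assemble an lm_eval argv with automatic batch sizing."""
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", model_args,
        "--tasks", ",".join(tasks),
        "--include_path", include_path,  # assumed location of the custom task YAMLs
        "--batch_size", "auto",          # let lm_eval pick the batch size
    ]


cmd = build_lm_eval_command(
    ["polymath_de_low", "polymath_de_top"],
    "pretrained=my-org/my-model",  # placeholder model id
)
print(shlex.join(cmd))
```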

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replace the bespoke Minerva-normalisation + narrow sympy exact_match with the
upstream PolyMath equivalence judge so our scores reproduce the benchmark
instead of being a strict lower bound.

- Add polymath_eval.py: the official `math_equal` (numeric isclose rel_tol=1e-4,
  percentage variants, LaTeX/symbolic + matrix/list/equation matching), vendored
  verbatim from QwenLM/PolyMath eval/scripts.py. Two justified deviations:
  * latex2sympy comes from latex2sympy2_extended (already in the venv via
    lighteval). Classic latex2sympy2 pins antlr 4.7.2 and would break
    lighteval; the extended fork exposes the same `latex2sympy` and only feeds
    the symbolic_equal fallback, so the decision logic is unchanged.
  * call_with_timeout runs the symbolic check in a daemon thread, not a forked
    process: forking corrupts latex2sympy2_extended's parser state here, which
    silently scored LaTeX answers wrong. A one-time warm-up at import avoids a
    cold-start timeout on the first comparison. Verified deterministic (50x).
- utils.py: extract the answer with run_eval.py's `extract_boxed_content`
  (verbatim) and score with `math_equal`, mirroring upstream run_eval.py.
  Drops the old last-number fallback (upstream uses boxed-only).
- pyproject.toml: exclude the vendored scorer from ruff to keep it verbatim.
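
The thread-based timeout described above could look roughly like this. It is a sketch under assumptions: the signature and the `default` parameter are illustrative and may not match the vendored `call_with_timeout`.

```python
import threading


def call_with_timeout(func, *args, timeout=5.0, default=False):
    """Run func(*args) in a daemon thread; return default on timeout or error.

    Assumed signature, for illustration only.
    """
    result = {}

    def worker():
        try:
            result["value"] = func(*args)
        except Exception:
            result["value"] = default

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)
    if thread.is_alive():
        # Timed out. A thread cannot be killed, so it is abandoned; being a
        # daemon, it will not keep the interpreter alive at exit.
        return default
    return result.get("value", default)
```

The trade-off versus a forked process is that a timed-out check keeps running in the background until it finishes, but the parser state stays in the same process. A one-time warm-up call at import, with a generous timeout, would pay the parser's start-up cost before the first scored comparison.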

Validated end-to-end through lm-eval TaskManager: polymath_de_low loads, 125
docs, and process_results scores boxed gold/fraction/percentage/wrong correctly.
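
To illustrate why percentage and decimal variants of the same answer score as correct, here is a simplified sketch of the numeric branch only (isclose with rel_tol=1e-4 plus percentage variants). It is not the vendored `math_equal`: the helper names are hypothetical, and fractions, LaTeX, matrices, and equations go through the symbolic path, which is not shown.

```python
from math import isclose


def parse_number(text):
    # Parse "1,234.5" or "50%" into a float; return None if not numeric.
    cleaned = text.strip().replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        pass
    if cleaned.endswith("%"):
        try:
            return float(cleaned[:-1]) / 100
        except ValueError:
            pass
    return None


def numeric_equal(pred, gold, rel_tol=1e-4, include_percentage=True):
    # Compare the prediction against the gold value and, optionally,
    # against the gold value read as a percentage in either direction.
    p, g = parse_number(pred), parse_number(gold)
    if p is None or g is None:
        return False
    candidates = [g / 100, g, g * 100] if include_percentage else [g]
    return any(isclose(c, p, rel_tol=rel_tol) for c in candidates)
```

With this sketch a gold answer of `50%` accepts `0.5`, and a gold answer of `0.5` accepts `50`, whereas a strict string comparison would reject both.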

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Point MIOPEN_USER_DB_PATH / MIOPEN_CUSTOM_CACHE_DIR at a job-unique writable
dir (and create it) so the conv1d in linear-attention models stops failing on
ROCm with "Cannot open database file:/tmp/gfx90a*.ukdb".
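
A minimal sketch of the same idea in Python, assuming the job id comes from `SLURM_JOB_ID` (with the process id as a fallback) and that the variables are set before MIOpen is first used. `configure_miopen_cache` and the directory naming are hypothetical; the change in this commit may do this in the launch script instead.

```python
import os
from pathlib import Path


def configure_miopen_cache(base_dir, job_id=None):
    """Hypothetical helper: point MIOpen at a job-unique, writable directory."""
    if job_id is None:
        # Assumption: a Slurm job; otherwise fall back to the process id.
        job_id = os.environ.get("SLURM_JOB_ID", str(os.getpid()))
    cache_dir = Path(base_dir) / f"miopen-{job_id}"
    cache_dir.mkdir(parents=True, exist_ok=True)  # create it up front
    os.environ["MIOPEN_USER_DB_PATH"] = str(cache_dir)
    os.environ["MIOPEN_CUSTOM_CACHE_DIR"] = str(cache_dir)
    return cache_dir
```

Using one directory per job avoids concurrent jobs contending for the same database files under a shared /tmp.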

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>