Add PolyMath #94
Open
haideraltahan wants to merge 6 commits into
34 tasks
Implements the Qwen/PolyMath benchmark as a generative lm-eval-harness task,
scoped to EU languages per the dataset's language configs: de, en, es, fr, it, pt.
PolyMath ships one config per language and four difficulty splits per config
(low/medium/high/top), so this adds one task per (language, difficulty) — 24
tasks — each tagged `polymath` and `polymath_<lang>` for aggregate reporting.
Approach (mirrors lm-eval's minerva_math, but self-contained):
- Prompt asks the model to reason and box its final answer in \boxed{}.
- A custom process_results extracts the last \boxed{} span (falling back to the
last number), normalises gold + prediction with the Minerva/Lewkowycz string
normalisation, and scores exact_match with a light sympy numeric fallback
(e.g. 18 == 18.0, 1/2 == 0.5). Deliberately avoids math_verify / parse_latex,
which are unavailable in this venv (no math_verify; antlr4 is 4.13 not 4.11).
Registry:
- New `polymath-eu` task group (0-shot) with a {lang} valid_langs template, so
`polymath-eu[deu_Latn]` etc. language brackets work out of the box.
- task_metrics entries (exact_match) for all 24 tasks; README group list updated.
Validated: lm-eval discovers all 24 tasks + tags (template not leaked), YAML
include + !function resolution works, the task instantiates against the real
`de` config (125 docs/split, fields id/question/answer), and process_results
scores correct/incorrect generations as expected.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
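The scoring path described in this commit (last `\boxed{}` span, last-number fallback, exact match with a numeric fallback) can be sketched roughly as below. This is an illustrative sketch, not the code in `utils.py`: the function names are hypothetical, the Minerva string normalisation is omitted, and the standard library's `fractions.Fraction` stands in for the sympy numeric check.

```python
import re
from fractions import Fraction
from typing import Optional


def last_boxed(text: str) -> Optional[str]:
    # Return the contents of the last \boxed{...} span, tracking nested braces.
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    i = start + len("\\boxed{")
    depth, out = 1, []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
        i += 1
    return None  # unbalanced braces: treat as "no boxed answer"


def extract_answer(generation: str) -> str:
    # Prefer the boxed answer; otherwise fall back to the last number in the text.
    boxed = last_boxed(generation)
    if boxed is not None:
        return boxed.strip()
    numbers = re.findall(r"-?\d+(?:\.\d+)?(?:/\d+)?", generation)
    return numbers[-1] if numbers else ""


def to_number(text: str) -> Optional[Fraction]:
    # Fraction parses "18", "18.0" and "1/2" exactly (stand-in for sympy here).
    try:
        return Fraction(text.strip().replace(",", ""))
    except (ValueError, ZeroDivisionError):
        return None


def is_match(gold: str, pred: str) -> bool:
    # Exact string match first, then the numeric fallback.
    if gold.strip() == pred.strip():
        return True
    g, p = to_number(gold), to_number(pred)
    return g is not None and g == p
```

With these definitions, `is_match("18", "18.0")` and `is_match("1/2", "0.5")` both hold, which matches the two examples given in the commit message.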
Replace the single polymath-eu group with four per-tier groups
(polymath-eu-low / -medium / -high / -top) so a difficulty tier can be run in
isolation. The 24 per-(language, difficulty) task YAMLs and their task_metrics
entries are unchanged. 'top' is PolyMath's hardest tier (Olympiad-level).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Pass --batch_size auto to lm_eval so batch size is tuned per task instead of
defaulting to 1.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replace the bespoke Minerva-normalisation + narrow sympy exact_match with the
upstream PolyMath equivalence judge so our scores reproduce the benchmark
instead of being a strict lower bound.
- Add polymath_eval.py: the official `math_equal` (numeric isclose rel_tol=1e-4,
percentage variants, LaTeX/symbolic + matrix/list/equation matching), vendored
verbatim from QwenLM/PolyMath eval/scripts.py. Two justified deviations:
* latex2sympy comes from latex2sympy2_extended (already in the venv via
lighteval). Classic latex2sympy2 pins antlr 4.7.2 and would break
lighteval; the extended fork exposes the same `latex2sympy` and only feeds
the symbolic_equal fallback, so the decision logic is unchanged.
* call_with_timeout runs the symbolic check in a daemon thread, not a forked
process: forking corrupts latex2sympy2_extended's parser state here, which
silently scored LaTeX answers wrong. A one-time warm-up at import avoids a
cold-start timeout on the first comparison. Verified deterministic (50x).
- utils.py: extract the answer with run_eval.py's `extract_boxed_content`
(verbatim) and score with `math_equal`, mirroring upstream run_eval.py.
Drops the old last-number fallback (upstream uses boxed-only).
- pyproject.toml: exclude the vendored scorer from ruff to keep it verbatim.
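For readers unfamiliar with the numeric branch mentioned in the first bullet (isclose with rel_tol=1e-4 plus percentage variants), the idea can be sketched as follows. This is a simplified illustration under my own assumptions, not the vendored `math_equal`: `parse_number` and `numeric_equal` are hypothetical names, and the real judge also handles LaTeX, matrices, lists and equations.

```python
from math import isclose
from typing import Optional


def parse_number(text: str) -> Optional[float]:
    # Parse a plain or percentage number; return None if it is not numeric.
    cleaned = text.strip().replace(",", "")
    is_percent = cleaned.endswith("%")
    if is_percent:
        # Drop the percent sign and a possible LaTeX escape ("50\%").
        cleaned = cleaned.rstrip("%").rstrip("\\")
    try:
        value = float(cleaned)
    except ValueError:
        return None
    return value / 100 if is_percent else value


def numeric_equal(pred: str, gold: str, rel_tol: float = 1e-4) -> bool:
    # Accept the gold value itself and its percentage variants (x/100, x*100).
    p, g = parse_number(pred), parse_number(gold)
    if p is None or g is None:
        return False
    return any(isclose(p, candidate, rel_tol=rel_tol)
               for candidate in (g / 100, g, g * 100))
```

Note that accepting percentage variants is deliberately lenient: under this sketch a prediction of `50` matches a gold answer of `0.5`.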
Validated end-to-end through lm-eval TaskManager: polymath_de_low loads, 125
docs, and process_results scores boxed gold/fraction/percentage/wrong correctly.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
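The thread-based timeout described in this commit can be sketched as below. The name `call_with_timeout` comes from the commit message, but the signature, the `default` argument and the body are assumptions made for illustration; the vendored version may differ.

```python
import threading


def call_with_timeout(func, *args, timeout=3.0, default=False):
    # Run func(*args) in a daemon thread and give up after `timeout` seconds.
    result = {}

    def worker():
        try:
            result["value"] = func(*args)
        except Exception:
            # Treat a crashing comparison as "not equal" rather than propagating.
            result["value"] = default

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)
    if thread.is_alive():
        # The thread cannot be killed; it keeps running in the background,
        # but the daemon flag means it will not block interpreter exit.
        return default
    return result.get("value", default)
```

The trade-off versus a forked process is that a runaway comparison is abandoned rather than terminated, so it can keep consuming CPU until it finishes on its own.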
Point MIOPEN_USER_DB_PATH / MIOPEN_CUSTOM_CACHE_DIR at a job-unique writable
dir (and create it) so the conv1d in linear-attention models stops failing on
ROCm with "Cannot open database file:/tmp/gfx90a*.ukdb".
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
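A minimal sketch of the same idea in Python, for anyone who needs it outside the launch script. The two environment variable names come from the commit message; the helper name, the directory layout and the use of `SLURM_JOB_ID` as the job identifier are assumptions, and the variables need to be set before the framework first initialises MIOpen.

```python
import os
import tempfile


def configure_miopen_cache(base_dir=None):
    # Create a job-unique writable directory and point MIOpen at it.
    base = base_dir or tempfile.gettempdir()
    # Assumption: a scheduler job id is available; fall back to the process id.
    job_id = os.environ.get("SLURM_JOB_ID") or str(os.getpid())
    cache_dir = os.path.join(base, f"miopen-{job_id}")
    os.makedirs(cache_dir, exist_ok=True)
    os.environ["MIOPEN_USER_DB_PATH"] = cache_dir
    os.environ["MIOPEN_CUSTOM_CACHE_DIR"] = cache_dir
    return cache_dir
```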
Summary
I added PolyMath (Qwen/PolyMath) multilingual math reasoning as four lm-evaluation-harness groups (`polymath-eu-low` / `-medium` / `-high` / `-top`, one per difficulty tier so a tier can be run in isolation) at 0-shot generative, with the final answer parsed from `\boxed{}` and scored by `exact_match`. PolyMath isn't in lm-eval-harness or lighteval, so I added a custom task under `custom_lm_eval_tasks/polymath/` and vendored the official Qwen/PolyMath `math_equal` judge (`polymath_eval.py`, ruff-excluded so it stays verbatim apart from the two deviations documented in the commit message: the `latex2sympy` import and the thread-based timeout). One of the Evals from #89.
Language coverage: EU subset only
PolyMath ships 18 languages; 6 of them are in our EU set, and I include all 6 (each across all 4 tiers): de, en, es, fr, it, pt.
The other 12 PolyMath languages (ar, bn, id, ja, ko, ms, ru, sw, te, th, vi, zh) are non-EU, so they're omitted. Each group is a `{lang}` template with `valid_langs`, and each task resolves to its `lang_Scri` code via `subset` (de→deu_Latn, …), so bracket scoping (`polymath-eu-top[deu_Latn]`) resolves correctly.
Metric
`exact_match` on the `\boxed{}` answer, judged by the vendored official `math_equal` comparator.
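To make the task layout concrete, here is a small sketch that enumerates the 24 tasks and groups them by tier. The group names and the `polymath_de_low` task name appear in this PR; the general `polymath_<lang>_<tier>` pattern is inferred from that one example, and the helper names are hypothetical.

```python
# The six EU languages and four difficulty tiers covered by this PR.
LANGS = ["de", "en", "es", "fr", "it", "pt"]
TIERS = ["low", "medium", "high", "top"]


def task_name(lang: str, tier: str) -> str:
    # Assumed naming pattern, inferred from "polymath_de_low".
    return f"polymath_{lang}_{tier}"


def tasks_by_group() -> dict:
    # One group per tier, each holding one task per language.
    return {
        f"polymath-eu-{tier}": [task_name(lang, tier) for lang in LANGS]
        for tier in TIERS
    }


if __name__ == "__main__":
    for group, tasks in tasks_by_group().items():
        print(group, tasks)
```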