amao0o0/awesome-AI-Math-Datasets

AI Math Datasets

This repo contains recent open-source math datasets (mainly English) for training and evaluating Math Large Language Models (LLMs).

Note

This repo is currently under development and updated regularly.

🗓️ Last updated: 2026-04-26


Pre-training

📝 Text Only

| Dataset | Descriptions | References |
| --- | --- | --- |
| Open-Web-Math | An open dataset inspired by these works containing 14.7B tokens of mathematical webpages from Common Crawl. | 📄 Paper, 🐙 Repo, 🤗 Dataset |
| Open-Web-Math-Pro | Refined from open-web-math using the ProX refining framework. It contains about 5B high-quality math-related tokens, ready for pre-training. | 📄 Paper, 🐙 Repo, 🤗 Dataset |
| AMPS | Auxiliary Mathematics Problems and Solutions. A collection of mathematical problems and step-by-step solutions, comprising over 100,000 problems from Khan Academy and approximately 5 million problems generated using Mathematica scripts. | 🐙 Repo |
| NaturalProofs | A dataset designed to study mathematical reasoning in natural language, comprising approximately 32,000 theorem statements and proofs, 14,000 definitions, and 2,000 additional pages sourced from diverse mathematical domains. | 📄 Paper, 🐙 Repo |
| MathPile | A math-centric corpus comprising about 9.5 billion tokens. | 📄 Paper, 🐙 Repo, 🤗 Dataset |
| AlgebraicStack | A dataset of 11B tokens of code specifically related to mathematics. | 📄 Paper, 🐙 Repo, 🤗 Dataset |
| MathCode-Pile | Containing 19.2B tokens, with math-related data covering web pages, textbooks, model-synthesized text, and math-related code. | 📄 Paper, 🐙 Repo, 🤗 Dataset |
| FineMath | Consisting of 34B tokens (FineMath-3+) and 54B tokens (FineMath-3+ with InfiMM-WebMath-3+) of mathematical educational content filtered from CommonCrawl. | 📄 Paper, 🤗 Dataset |
| Proof-Pile-2 | A 55 billion token dataset of mathematical and scientific documents from arXiv, open-web-math and algebraic-stack. | 📄 Paper, 🐙 Repo, 🤗 Dataset |
| AutoMathText | A dataset encompassing around 200 GB of mathematical texts. It's a compilation sourced from a diverse range of platforms including various websites, arXiv, and GitHub (OpenWebMath, RedPajama, Algebraic Stack). | 📄 Paper, 🐙 Repo, 🤗 Dataset |
| MegaMath | An open math pretraining dataset curated from diverse, math-focused sources, with over 300B tokens. | 📄 Paper, 🐙 Repo, 🤗 Dataset |
| Nemotron-CC-Math | A 133 billion token large-scale math corpus extracted from Common Crawl. | 📄 Paper, 🔗 Blog, 🤗 Dataset |
| SwallowMath | A ~2.3B token math pre-training corpus refined from FineMath-4+ via LLM-driven rewriting (Llama-3.3-70B-Instruct) that removes boilerplate, restores missing context, and reformats solutions into concise step-by-step explanations. | 📄 Paper, 🐙 Repo, 🤗 Dataset |
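
The token counts listed above allow a quick consistency check. As a rough sketch, the snippet below subtracts the Open-Web-Math and AlgebraicStack sizes from the Proof-Pile-2 total to estimate how many tokens come from arXiv. It assumes the three components are disjoint and that all counts use comparable tokenization; neither assumption is stated in this list, and the helper name `implied_remainder` is purely illustrative.

```python
# Token counts in billions, copied from the entries above.
PROOF_PILE_2_TOTAL = 55.0
OPEN_WEB_MATH = 14.7
ALGEBRAIC_STACK = 11.0


def implied_remainder(total, *components):
    """Tokens left once the listed components are subtracted from the total."""
    return round(total - sum(components), 1)


# Assumption: whatever is not open-web-math or algebraic-stack is arXiv text.
arxiv_estimate = implied_remainder(PROOF_PILE_2_TOTAL, OPEN_WEB_MATH, ALGEBRAIC_STACK)
print(f"Implied arXiv share: ~{arxiv_estimate}B tokens")
```

The result is only an estimate derived from the rounded figures in this list; the dataset cards remain the authoritative source for per-component sizes.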

🖼️ Vision-Text Modality

| Dataset | Descriptions | References |
| --- | --- | --- |
| InfiMM-WebMath-40B | A dataset of interleaved image-text documents. It comprises 24 million web pages, 85 million associated image URLs, and 40 billion text tokens, all meticulously extracted and filtered from CommonCrawl. | 📄 Paper, 🤗 Dataset |

Supervised Fine-Tuning

📝 Text Only

| Dataset | Descriptions | References |
| --- | --- | --- |
| SVAMP | A collection of 1,000 elementary-level math word problems. | 📄 Paper, 🐙 Repo |
| GSM8K | A dataset consisting of 8.5K high-quality grade school math word problems. Each problem takes between 2 and 8 steps to solve, and solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − × ÷) to reach the final answer. | 📄 Paper, 🔗 Project, 🐙 Repo |
| MathQA | A dataset of 37k English multiple-choice math word problems covering multiple math domain categories, built by modeling operation programs corresponding to word problems in the AQuA dataset. | 🔗 Project |
| MATH | A challenging dataset that extends beyond the high school level and covers diverse topics, including algebra, precalculus, and number theory. Each problem in MATH has a full step-by-step solution. | 🔗 Project |
| NuminaMath | A comprehensive collection of 860,000 problem-solution pairs ranging from high-school level to advanced-competition level. The dataset has both CoT and PoT rationales: NuminaMath-CoT and NuminaMath-TIR (tool-integrated reasoning). | 📄 Paper, 🐙 Repo, 🤗 Dataset |
| MetaMath | A dataset with 395K samples created by bootstrapping questions from MATH and GSM8K. | 📄 Paper, 🔗 Project, 🐙 Repo, 🤗 Dataset |
| MathInstruct | An instruction tuning dataset that combines data from 13 mathematical rationale datasets, uniquely focusing on the hybrid use of chain-of-thought (CoT) and program-of-thought (PoT) rationales. | 📄 Paper, 🔗 Project, 🐙 Repo, 🤗 Dataset |
| CoinMath | A dataset designed to enhance mathematical reasoning in large language models by incorporating diverse coding styles into code-based rationales. It includes math questions annotated with code-based solutions that feature concise comments, descriptive naming conventions, and hardcoded solutions. | 📄 Paper, 🐙 Repo, 🤗 Dataset |
| OpenMathInstruct-2 | A math instruction tuning dataset with 14M problem-solution pairs generated using the Llama3.1-405B-Instruct model. | 📄 Paper, 🤗 Dataset |
| CAMEL Math | Containing 50K problem-solution pairs obtained using GPT-4. The pairs were generated from 25 math topics, with 25 subtopics for each topic. | 📄 Paper, 🤗 Dataset |
| AoPS-Instruct | A large-scale dataset of over 650,000 Olympiad-level math question-answer pairs mined from the AoPS forum, accompanied by LiveAoPSBench, a dynamic, contamination-resistant benchmark. | 📄 Paper, 🔗 Project, 🤗 Dataset |
| OpenMathReasoning | A large-scale dataset of 306K unique math problems sourced from AoPS forums, with solutions generated by DeepSeek-R1 and QwQ-32B in both CoT and TIR (tool-integrated reasoning) formats. Formed the foundation of the AIMO-2 Kaggle competition winning solution. | 📄 Paper, 🤗 Dataset |
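
Several entries above distinguish chain-of-thought (CoT) from program-of-thought (PoT) rationales. The sketch below illustrates the difference on a toy problem: a CoT rationale states the answer in prose, while a PoT rationale is a program whose execution produces it. The question, the rationale strings, the "The answer is" marker, and the `answer` variable convention are all made up for illustration and are not taken from any of the datasets listed here.

```python
# Hypothetical example; none of these strings come from a real dataset.
question = "A shop sells pencils at 3 dollars each. How much do 7 pencils cost?"

# Chain-of-thought (CoT): reasoning in natural language, answer stated in the text.
cot_rationale = (
    "Each pencil costs 3 dollars, so 7 pencils cost 7 * 3 = 21 dollars. "
    "The answer is 21."
)

# Program-of-thought (PoT): reasoning as a program; running it yields the answer.
pot_rationale = """
price_per_pencil = 3
num_pencils = 7
answer = price_per_pencil * num_pencils
"""


def answer_from_cot(rationale: str) -> str:
    """Read the answer after the (assumed) 'The answer is' marker."""
    tail = rationale.rsplit("The answer is", 1)[1]
    return tail.strip().rstrip(".")


def answer_from_pot(program: str) -> str:
    """Execute the program and read its (assumed) `answer` variable."""
    namespace = {}
    # Emptying __builtins__ is NOT a real sandbox; never run untrusted code this way.
    exec(program, {"__builtins__": {}}, namespace)
    return str(namespace["answer"])


print(answer_from_cot(cot_rationale), answer_from_pot(pot_rationale))
```

In practice, PoT and tool-integrated rationales are usually executed in an isolated interpreter process, and each dataset defines its own answer format, so the two helper functions would need to be adapted per dataset.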

🖼️ Vision-Text Modality

| Dataset | Descriptions | References |
| --- | --- | --- |
| GeoQA | Containing 4,998 Chinese geometric multiple-choice questions with rich domain-specific program annotations. | 📄 Paper, 🐙 Repo |
| UniGeo | Containing 4,998 calculation problems and 9,543 proving problems. | 📄 Paper, 🐙 Repo |
| Geo170K | A synthesized dataset which contains around 60,000 geometric image-caption pairs and more than 110,000 question-answer pairs. | 📄 Paper, 🐙 Repo, 🤗 Dataset |
| MAVIS | Containing two datasets: (1) MAVIS-Caption: 588K high-quality caption-diagram pairs, spanning geometry and function; (2) MAVIS-Instruct: 834K instruction-tuning data with CoT rationales in a text-lite version. | 📄 Paper, 🐙 Repo |
| Geometry3K | Consisting of 3,002 geometry problems with dense annotation in formal language. | 📄 Paper, 🔗 Project, 🐙 Repo |
| MathV360K | Consisting of 40K images from 24 datasets and 360K question-answer pairs. | 🔗 Project, 🤗 Dataset |
| MultiMath300K | A multimodal, multilingual, multi-level, and multistep mathematical reasoning dataset that encompasses a wide range of K-12 level mathematical problems. | 🔗 Project |
| MathCoder-VL | A multimodal math instruction-tuning suite from the MathCoder-VL paper. Contains ImgCode-8.6M (the image-to-code dataset used for vision-code alignment via the FigCodifier model) and MM-MathInstruct-3M (a 2.87M-sample multimodal math instruction set synthesized using FigCodifier). | 📄 Paper, 🐙 Repo, 🤗 ImgCode-8.6M / MM-MathInstruct |

Reinforcement Learning

While many datasets listed under Supervised Fine-Tuning can be adapted for reinforcement learning, this section highlights datasets that are explicitly designed for RL, as indicated in their respective references.

📝 Text Only

| Dataset | Descriptions | References |
| --- | --- | --- |
| PRM800K | A process supervision dataset containing 800,000 step-level correctness labels for model-generated solutions to problems from the MATH dataset. | 📄 Paper, 🔗 Project, 🐙 Repo |
| Big-Math | A dataset of over 250,000 high-quality math questions with verifiable answers, purposefully made for reinforcement learning (RL). Extracted questions satisfy three desiderata: (1) problems with uniquely verifiable solutions, (2) problems that are open-ended, and (3) problems with a closed-form solution. | 🐙 Repo, 🤗 Dataset |
| Math-Shepherd | Problems and step-by-step solutions with automatically generated step-level labels, requiring no human annotation. | 📄 Paper, 🔗 Project, 🤗 Dataset |
| OpenR1-Math-220k | Consisting of 220k math problems with two to four reasoning traces generated by DeepSeek R1 for problems from NuminaMath 1.5. | 🤗 Dataset |
| DeepMath-103K | A dataset of ~103K highly challenging math problems (primarily difficulty levels 5–9) designed for RL training, with rigorous decontamination against common benchmarks to prevent eval leakage. Each problem includes verifiable answers for rule-based reward and three distinct R1-generated solutions for SFT/distillation. | 📄 Paper, 🐙 Repo, 🤗 Dataset |
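
Several of the datasets above emphasize verifiable answers for rule-based reward. A minimal sketch of such a reward is shown below, assuming the model writes its final answer inside `\boxed{...}` and that answers are either plain numbers or strings that can be compared exactly. Both assumptions, as well as the function names, are illustrative; none of the listed datasets is claimed to use this exact checker.

```python
from fractions import Fraction


def extract_boxed(text):
    # Return the content of the last \boxed{...} in text, or None if absent.
    marker = r"\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # unbalanced braces


def normalize(answer):
    # Numbers become exact fractions; anything else is compared as a string.
    compact = answer.replace(" ", "")
    try:
        return Fraction(compact)
    except (ValueError, ZeroDivisionError):
        return compact


def rule_based_reward(completion, reference):
    """Return 1.0 if the boxed answer matches the reference, else 0.0."""
    predicted = extract_boxed(completion)
    if predicted is None:
        return 0.0
    return 1.0 if normalize(predicted) == normalize(reference) else 0.0


print(rule_based_reward(r"Half of one is \boxed{0.5}", "1/2"))
```

Real checkers typically go further, for example by comparing symbolic expressions for mathematical equivalence; exact matching as above will reject answers that are correct but written differently (such as `\frac{1}{2}` against `0.5`).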

Benchmark

📝 Text Only

| Dataset | Descriptions | References |
| --- | --- | --- |
| Lila | A mathematical reasoning benchmark consisting of over 140K natural language questions from 23 diverse tasks. | 🔗 Project, 🤗 Dataset |
| MathBench | A benchmark that tests large language models on math, covering five-level difficulty mechanisms. It evaluates both theory and problem-solving skills in English and Chinese. | 📄 Paper, 🐙 Repo |
| MathOdyssey | A collection of 387 mathematical problems for evaluating the general mathematical capacities of LLMs, featuring a spectrum of questions from Olympiad-level competitions, advanced high school curricula, and university-level mathematics. | 📄 Paper, 🔗 Project, 🐙 Repo |
| Omni-MATH | A challenging benchmark specifically designed to assess LLMs' mathematical reasoning at the Olympiad level. | 📄 Paper, 🔗 Project, 🐙 Repo, 🤗 Dataset |
| HARP | A math reasoning dataset consisting of 4,780 short answer questions from US national math competitions. | 📄 Paper, 🐙 Repo |
| PolyMath | A multilingual mathematical reasoning benchmark covering 18 languages and 4 easy-to-hard difficulty levels. | 📄 Paper, 🐙 Repo, 🤗 Dataset |
| MathMist | A parallel multilingual benchmark for math problem solving and reasoning, containing ~30K aligned question-answer pairs across 13 typologically diverse languages (high-, medium-, and low-resource), built from 2,890 Bangla-English gold-standard artifacts. Supports zero-shot, CoT, perturbed reasoning, and code-switched reasoning evaluation. | 📄 Paper, 🐙 Repo |
| U-MATH | A benchmark of 1,100 open-ended university-level math problems across 6 subjects (Precalculus, Algebra, Differential/Integral/Multivariable Calculus, Sequences & Series), sourced from real coursework with 20% multimodal problems. Also releases μ-MATH, a meta-evaluation dataset of 1,084 labeled solutions for assessing LLM-as-judge quality on free-form math grading. | 📄 Paper, 🐙 Repo, 🤗 Dataset |
| FrontierMath | A benchmark of 350 original, expert-vetted research-level math problems split into difficulty Tiers 1–4 (undergraduate to research-frontier), with automated verification via Python-submitted answers. Designed to be uncontaminated and to resist memorization. Frontier models initially scored <2%, with Gemini 3 Deep Think reaching ~40% on Tiers 1–3 by early 2026. | 📄 Paper, 🔗 Project |
| OlymMATH | A bilingual (English/Chinese) Olympiad-level benchmark of 200 problems split into EASY (AIME-level) and HARD (significantly above AIME) tiers across algebra, geometry, number theory, and combinatorics, manually sourced from printed publications to prevent leakage. The accompanying OlymMATH-eval release includes 582K reasoning trajectories from 28 models for fine-grained analysis. | 📄 Paper, 🐙 Repo, 🤗 Dataset |
| Open Proof Corpus (OPC) | A large-scale, human-validated dataset of 5,000+ LLM-generated proofs across 1,000+ problems from 20+ elite competitions (IMO, USAMO, BMOSL, etc.), annotated as correct/incorrect by expert judges with optional sentence-level error highlighting. Includes a finetuned 8B judge model (OPC-R1-8B) that matches Gemini-2.5-Pro on proof correctness evaluation. | 📄 Paper, 🔗 Project, 🐙 Repo, 🤗 Dataset |
| MathArena | A continuously updated benchmark (NeurIPS D&B 2025) that evaluates LLMs on freshly released math competitions (AIME, HMMT, BRUMO, SMT, CMIMC 2025; USAMO/IMO/Putnam 2025; AIME/HMMT 2026), eliminating contamination by post-release evaluation. Notably the first benchmark to include proof-writing evaluation; top models reach <40% on IMO 2025. So far covers 162+ problems across 7+ competitions and 50+ models. | 📄 Paper, 🔗 Project, 🐙 Repo, 🤗 Dataset |
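
Some benchmarks above avoid contamination by evaluating only on problems released after a model could have seen them. The idea can be sketched as a simple date filter. The `Problem` class, the contest names, and all dates below are hypothetical placeholders, and this is not the actual pipeline of any benchmark in the list.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    competition: str   # hypothetical contest name
    released: date     # date the problem became public
    statement: str


def post_release_subset(problems, training_cutoff):
    """Keep only problems published strictly after the model's training cutoff."""
    return [p for p in problems if p.released > training_cutoff]


# Placeholder data: names and dates are invented for illustration.
pool = [
    Problem("Contest A", date(2024, 2, 1), "..."),
    Problem("Contest B", date(2025, 3, 15), "..."),
    Problem("Contest C", date(2025, 11, 30), "..."),
]

model_cutoff = date(2025, 1, 1)  # assumed training cutoff of the model under test
eligible = post_release_subset(pool, model_cutoff)
print([p.competition for p in eligible])
```

A date filter only helps when the cutoff is known and trustworthy; models with undisclosed or continually updated training data need a more conservative margin.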

🖼️ Vision-Text Modality

| Dataset | Descriptions | References |
| --- | --- | --- |
| MathVerse | A collection of 2,612 high-quality, multi-subject math problems with diagrams from publicly available sources. | 🔗 Project, 🤗 Dataset |
| MathVista | A benchmark designed to combine challenges from diverse mathematical and visual tasks. It consists of 6,141 examples, derived from 28 existing multimodal datasets involving mathematics and 3 newly created datasets. | 🔗 Project, 🤗 Dataset |
| MATH-Vision | A collection of 3,040 mathematical problems with visual contexts sourced from real math competitions, spanning 16 distinct mathematical disciplines and graded across 5 levels of difficulty. | 🔗 Project, 🤗 Dataset |
| We-Math | A collection of 6.5K visual math problems, spanning 67 hierarchical knowledge concepts and 5 layers of knowledge granularity. | 📄 Paper, 🔗 Project, 🐙 Repo, 🤗 Dataset |
| OlympiadBench | An Olympiad-level bilingual (EN/ZH) multimodal scientific benchmark of 8,476 math and physics problems sourced from International Olympiads, the Chinese Olympiad, and the Chinese College Entrance Exam (GaoKao), each with expert-level step-by-step solution annotations. | 📄 Paper, 🐙 Repo, 🤗 Dataset |
| VCBench | A benchmark from Alibaba DAMO Academy containing 1,720 elementary-level (grades 1–6) multimodal math problems across 6 cognitive domains, with an average of 3.9 images per question to enforce multi-image reasoning. Evaluates five competencies (temporal, geometric, logical, and spatial reasoning, plus pattern recognition), with even the best LVLMs failing to exceed 50% accuracy. | 📄 Paper, 🐙 Repo, 🤗 Dataset |
| MV-MATH | A multi-image multimodal math benchmark (CVPR 2025) of 2,009 K-12 problems with interleaved text and multiple images (some up to 8). Covers 11 subjects across 3 difficulty tiers and 3 question types (MC / free-form / multi-step), and uniquely splits problems into ID (independent images) vs MD (mutually dependent images) to probe cross-image reasoning. | 📄 Paper, 🔗 Project, 🤗 Dataset |

🎀 Speech Modality

Dataset Descriptions References
Spoken-MQA A benchmark designed to evaluate large language models’ mathematical reasoning ability from spoken input. It features math problems covering arithmetic, contextual reasoning, and knowledge-based reasoning. πŸ“„ Paper
πŸ™ Repo
πŸ€— Dataset

Related Repo

https://github.com/tongyx361/Awesome-LLM4Math

https://github.com/huggingface/evaluation-guidebook/blob/main/contents/automated-benchmarks/some-evaluation-datasets.md
