AI Math Datasets

This repo contains recent open-source math datasets (mainly English) for training and evaluating Math Large Language Models (LLMs).

Note

This repo is currently under development and updated regularly.

🗓️ Last updated: 2026-04-26

Pre-training

📝 Text Only

Dataset	Descriptions	References
Open-Web-Math	An open dataset of 14.7B tokens of mathematical web pages extracted from Common Crawl.	📄 Paper 🐙 Repo 🤗 Dataset
Open-Web-Math-Pro	Refined from open-web-math using the ProX refining framework. It contains about 5B high-quality math-related tokens, ready for pre-training.	📄 Paper 🐙 Repo 🤗 Dataset
AMPS	Auxiliary Mathematics Problems and Solutions. A collection of mathematical problems and step-by-step solutions, comprising over 100,000 problems from Khan Academy and approximately 5 million problems generated using Mathematica scripts.	🐙 Repo
NaturalProofs	A dataset designed to study mathematical reasoning in natural language, comprising approximately 32,000 theorem statements and proofs, 14,000 definitions, and 2,000 additional pages sourced from diverse mathematical domains.	📄 Paper 🐙 Repo
MathPile	A math-centric corpus comprising about 9.5 billion tokens.	📄 Paper 🐙 Repo 🤗 Dataset
AlgebraicStack	A dataset of 11B tokens of code specifically related to mathematics.	📄 Paper 🐙 Repo 🤗 Dataset
MathCode-Pile	Containing 19.2B tokens, with math-related data covering web pages, textbooks, model-synthesized text, and math-related code.	📄 Paper 🐙 Repo 🤗 Dataset
FineMath	Consisting of 34B tokens (FineMath-3+) and 54B tokens (FineMath-3+ with InfiMM-WebMath-3+) of mathematical educational content filtered from CommonCrawl.	📄 Paper 🤗 Dataset
Proof-Pile-2	A 55-billion-token dataset of mathematical and scientific documents drawn from arXiv, OpenWebMath, and AlgebraicStack.	📄 Paper 🐙 Repo 🤗 Dataset
AutoMathText	A dataset of around 200 GB of mathematical text, compiled from websites, arXiv, and GitHub (OpenWebMath, RedPajama, Algebraic Stack).	📄 Paper 🐙 Repo 🤗 Dataset
MegaMath	An open math pretraining dataset curated from diverse, math-focused sources, with over 300B tokens.	📄 Paper 🐙 Repo 🤗 Dataset
Nemotron-CC-Math	A large-scale, 133-billion-token math corpus extracted from Common Crawl.	📄 Paper 🔗 Blog 🤗 Dataset
SwallowMath	A ~2.3B token math pre-training corpus refined from FineMath-4+ via LLM-driven rewriting (Llama-3.3-70B-Instruct) that removes boilerplate, restores missing context, and reformats solutions into concise step-by-step explanations.	📄 Paper 🐙 Repo 🤗 Dataset

🖼️ Vision-Text Modality

Dataset	Descriptions	References
InfiMM-WebMath-40B	A dataset of interleaved image-text documents. It comprises 24 million web pages, 85 million associated image URLs, and 40 billion text tokens, all meticulously extracted and filtered from CommonCrawl.	📄 Paper 🤗 Dataset

Supervised Fine-Tuning

📝 Text Only

Dataset	Descriptions	References
SVAMP	A collection of 1,000 elementary-level math word problems.	📄 Paper 🐙 Repo
GSM8K	A dataset of 8.5K high-quality grade school math word problems. Each problem takes between 2 and 8 steps to solve, and solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − × ÷) to reach the final answer.	📄 Paper 🔗 Project 🐙 Repo
MathQA	A dataset of 37k English multiple-choice math word problems spanning multiple math domains, built by annotating word problems from the AQuA dataset with operation programs.	🔗 Project
MATH	A challenging dataset that extends beyond the high school level and covers diverse topics, including algebra, precalculus, and number theory. Each problem in MATH has a full step-by-step solution.	🔗 Project
NuminaMath	A collection of 860,000 problem-solution pairs ranging from high-school level to advanced competition level. It is released with both CoT rationales (NuminaMath-CoT) and PoT-style tool-integrated reasoning rationales (NuminaMath-TIR).	📄 Paper 🐙 Repo 🤗 Dataset
MetaMath	A dataset with 395K samples created by bootstrapping questions from MATH and GSM8K.	📄 Paper 🔗 Project 🐙 Repo 🤗 Dataset
MathInstruct	An instruction tuning dataset that combines data from 13 mathematical rationale datasets, uniquely focusing on the hybrid use of chain-of-thought (CoT) and program-of-thought (PoT) rationales.	📄 Paper 🔗 Project 🐙 Repo 🤗 Dataset
CoinMath	A dataset designed to enhance mathematical reasoning in large language models by incorporating diverse coding styles into code-based rationales. It includes math questions annotated with code-based solutions that feature concise comments, descriptive naming conventions, and hardcoded solutions.	📄 Paper 🐙 Repo 🤗 Dataset
OpenMathInstruct-2	A math instruction tuning dataset with 14M problem-solution pairs generated using the Llama3.1-405B-Instruct model.	📄 Paper 🤗 Dataset
CAMEL Math	Contains 50K problem-solution pairs generated with GPT-4, covering 25 math topics with 25 subtopics each.	📄 Paper 🤗 Dataset
AoPS-Instruct	A large-scale dataset of over 650,000 Olympiad-level math question-answer pairs mined from the AoPS forum, accompanied by LiveAoPSBench, a dynamic, contamination-resistant benchmark.	📄 Paper 🔗 Project 🤗 Dataset
OpenMathReasoning	A large-scale dataset of 306K unique math problems sourced from AoPS forums, with solutions generated by DeepSeek-R1 and QwQ-32B in both CoT and TIR (tool-integrated reasoning) formats. Formed the foundation of the AIMO-2 Kaggle competition winning solution.	📄 Paper 🤗 Dataset
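
Several entries above (MathInstruct, NuminaMath) distinguish chain-of-thought (CoT) rationales, written in natural language, from program-of-thought (PoT) rationales, written as code whose execution yields the answer. The sketch below illustrates the difference on a made-up record. The field names (`question`, `cot`, `pot`), the "The answer is X." ending, and the convention that the program stores its result in a variable called `answer` are all assumptions for illustration, not the schema of any dataset listed here.

```python
# Hypothetical record: field names and contents are illustrative only.
record = {
    "question": "A box holds 12 pencils. How many pencils are in 7 boxes?",
    "cot": "Each box holds 12 pencils, so 7 boxes hold 7 * 12 = 84 pencils. "
           "The answer is 84.",
    "pot": "pencils_per_box = 12\nboxes = 7\nanswer = pencils_per_box * boxes\n",
}


def run_pot(program: str):
    # Execute a PoT rationale and return whatever it binds to `answer`.
    # Assumption: the program is trusted. Real pipelines should run
    # model-generated code in a sandbox instead of a bare exec().
    namespace = {}
    exec(program, {"__builtins__": {}}, namespace)
    return namespace.get("answer")


def cot_final_answer(rationale: str) -> str:
    # Read the answer from a CoT rationale ending in "The answer is X."
    # If the marker is missing, the whole rationale is returned unchanged.
    marker = "The answer is"
    tail = rationale.rsplit(marker, 1)[-1]
    return tail.strip().rstrip(".")


pot_answer = run_pot(record["pot"])
cot_answer = cot_final_answer(record["cot"])
print(pot_answer, cot_answer)
```

With a PoT rationale the arithmetic is delegated to the interpreter, while a CoT rationale has to be parsed as text; hybrid datasets keep both so a model can learn either style.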

🖼️ Vision-Text Modality

Dataset	Descriptions	References
GeoQA	Containing 4,998 Chinese geometric multiple-choice questions with rich domain-specific program annotations.	📄 Paper 🐙 Repo
UniGeo	Containing 4,998 calculation problems and 9,543 proving problems.	📄 Paper 🐙 Repo
Geo170K	A synthetic dataset containing around 60,000 geometric image-caption pairs and more than 110,000 question-answer pairs.	📄 Paper 🐙 Repo 🤗 Dataset
MAVIS	Containing two datasets: 1. MAVIS-Caption: 588K high-quality caption-diagram pairs, spanning geometry and function, 2. MAVIS-Instruct: 834K instruction-tuning data with CoT rationales in a text-lite version.	📄 Paper 🐙 Repo
Geometry3K	Consisting of 3,002 geometry problems with dense annotation in formal language.	📄 Paper 🔗 Project 🐙 Repo
MathV360K	Consisting of 40K images from 24 datasets and 360K question-answer pairs.	🔗 Project 🤗 Dataset
MultiMath300K	A multimodal, multilingual, multi-level, and multistep mathematical reasoning dataset that encompasses a wide range of K-12 level mathematical problems.	🔗 Project
MathCoder-VL	A multimodal math instruction-tuning suite from the MathCoder-VL paper. Contains ImgCode-8.6M (the image-to-code dataset used for vision-code alignment via the FigCodifier model) and MM-MathInstruct-3M (a 2.87M-sample multimodal math instruction set synthesized using FigCodifier).	📄 Paper 🐙 Repo 🤗 ImgCode-8.6M / MM-MathInstruct

Reinforcement Learning

While many datasets listed in Supervised Fine-Tuning can be adapted for reinforcement learning, we specifically highlight datasets explicitly designed for RL as indicated in their respective references.
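
As a rough illustration of such an adaptation, the sketch below turns a GSM8K-style SFT example into a prompt plus a verifiable target answer. It assumes the example has `question` and `answer` fields and that the reference solution ends with a `#### <final answer>` line. The sample record and the output keys (`prompt`, `ground_truth`) are made up for illustration.

```python
import re


def sft_to_rl(example: dict) -> dict:
    # Keep the question as the prompt and reduce the worked solution
    # to its final answer, which an RL reward can check automatically.
    match = re.search(r"####\s*(.+?)\s*$", example["answer"])
    if match is None:
        raise ValueError("no '#### <answer>' line found in the solution")
    final = match.group(1).replace(",", "")  # "1,200" -> "1200"
    return {"prompt": example["question"], "ground_truth": final}


# Illustrative record written in the GSM8K answer style (not real data).
sample = {
    "question": "Tom reads 30 pages a day. How many pages does he read in 40 days?",
    "answer": "Tom reads 30 * 40 = 1,200 pages.\n#### 1,200",
}

rl_example = sft_to_rl(sample)
print(rl_example)
```

The step-by-step rationale is discarded here on purpose: in this setup the policy generates its own reasoning, and only the final answer is used for scoring.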

📝 Text Only

Dataset	Descriptions	References
PRM800K	A process supervision dataset containing 800,000 step-level correctness labels for model-generated solutions to problems from the MATH dataset.	📄 Paper 🔗 Project 🐙 Repo
Big-Math	A dataset of over 250,000 high-quality math questions with verifiable answers, purposefully made for reinforcement learning (RL). Extracted questions satisfy three desiderata: (1) problems with uniquely verifiable solutions, (2) problems that are open-ended, and (3) problems with a closed-form solution.	🐙 Repo 🤗 Dataset
Math-Shepherd	Math problems with step-by-step solutions whose individual steps carry automatically generated process-supervision labels, with no human annotation.	📄 Paper 🔗 Project 🤗 Dataset
OpenR1-Math-220k	Consisting of 220k math problems with two to four reasoning traces generated by DeepSeek R1 for problems from NuminaMath 1.5.	🤗 Dataset
DeepMath-103K	A dataset of ~103K highly challenging math problems (primarily difficulty levels 5–9) designed for RL training, with rigorous decontamination against common benchmarks to prevent eval leakage. Each problem includes verifiable answers for rule-based reward and three distinct R1-generated solutions for SFT/distillation.	📄 Paper 🐙 Repo 🤗 Dataset
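
Verifiable answers, as emphasized by Big-Math and DeepMath-103K, make a rule-based reward possible: the model's final answer is compared with the reference and scored 1 or 0. The following is a minimal sketch, assuming the model is prompted to put its final answer inside `\boxed{...}` and that answers are plain numbers or short strings. Neither assumption comes from the datasets themselves, and production verifiers typically use a symbolic math checker rather than this simple normalization.

```python
from fractions import Fraction


def extract_boxed(text: str):
    # Return the content of the last \boxed{...}, allowing nested braces.
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    i = start + len("\\boxed{")
    depth = 1
    chars = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
        i += 1
    return None  # unbalanced braces


def normalize(answer: str):
    # Strip formatting noise; map numeric strings to exact fractions
    # so that "0.5" and "1/2" compare equal.
    cleaned = answer.strip().replace(",", "").replace("$", "").replace(" ", "")
    try:
        return Fraction(cleaned)
    except (ValueError, ZeroDivisionError):
        return cleaned


def reward(completion: str, ground_truth: str) -> float:
    # 1.0 if the boxed answer matches the reference, else 0.0.
    predicted = extract_boxed(completion)
    if predicted is None:
        return 0.0
    return 1.0 if normalize(predicted) == normalize(ground_truth) else 0.0


print(reward("Half of one is \\boxed{0.5}", "1/2"))
print(reward("I am not sure.", "1/2"))
```

A binary reward like this is only as reliable as the answer format allows, which is why RL-oriented datasets filter for problems with a single, closed-form, checkable answer.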

Benchmark

📝 Text Only

Dataset	Descriptions	References
Lila	A mathematical reasoning benchmark consisting of over 140K natural language questions from 23 diverse tasks.	🔗 Project 🤗 Dataset
MathBench	A benchmark that tests large language models on math across five difficulty levels. It evaluates both theory and problem-solving skills in English and Chinese.	📄 Paper 🐙 Repo
MathOdyssey	A collection of 387 mathematical problems for evaluating the general mathematical capacities of LLMs. It features questions from Olympiad-level competitions, advanced high school curricula, and university-level mathematics.	📄 Paper 🔗 Project 🐙 Repo
Omni-MATH	A challenging benchmark specifically designed to assess LLMs' mathematical reasoning at the Olympiad level.	📄 Paper 🔗 Project 🐙 Repo 🤗 Dataset
HARP	A math reasoning dataset consisting of 4,780 short answer questions from US national math competitions.	📄 Paper 🐙 Repo
PolyMath	A multilingual mathematical reasoning benchmark covering 18 languages and 4 easy-to-hard difficulty levels.	📄 Paper 🐙 Repo 🤗 Dataset
MathMist	A parallel multilingual benchmark for math problem solving and reasoning, containing ~30K aligned question–answer pairs across 13 typologically diverse languages (high-, medium-, and low-resource), built from 2,890 Bangla-English gold-standard artifacts. Supports zero-shot, CoT, perturbed reasoning, and code-switched reasoning evaluation.	📄 Paper 🐙 Repo
U-MATH	A benchmark of 1,100 open-ended university-level math problems across 6 subjects (Precalculus, Algebra, Differential/Integral/Multivariable Calculus, Sequences & Series), sourced from real coursework with 20% multimodal problems. Also releases μ-MATH, a meta-evaluation dataset of 1,084 labeled solutions for assessing LLM-as-judge quality on free-form math grading.	📄 Paper 🐙 Repo 🤗 Dataset
FrontierMath	A benchmark of 350 original, expert-vetted research-level math problems split into difficulty Tiers 1–4 (undergraduate to research-frontier), with automated verification via Python-submitted answers. Designed to be uncontaminated and resist memorization—frontier models initially scored <2%, with Gemini 3 Deep Think reaching ~40% on Tiers 1–3 by early 2026.	📄 Paper 🔗 Project
OlymMATH	A bilingual (English/Chinese) Olympiad-level benchmark of 200 problems split into EASY (AIME-level) and HARD (significantly above AIME) tiers across algebra, geometry, number theory, and combinatorics, manually sourced from printed publications to prevent leakage. The accompanying OlymMATH-eval release includes 582K reasoning trajectories from 28 models for fine-grained analysis.	📄 Paper 🐙 Repo 🤗 Dataset
Open Proof Corpus (OPC)	A large-scale, human-validated dataset of 5,000+ LLM-generated proofs across 1,000+ problems from 20+ elite competitions (IMO, USAMO, BMOSL, etc.), annotated as correct/incorrect by expert judges with optional sentence-level error highlighting. Includes a finetuned 8B judge model (OPC-R1-8B) that matches Gemini-2.5-Pro on proof correctness evaluation.	📄 Paper 🔗 Project 🐙 Repo 🤗 Dataset
MathArena	A continuously updated benchmark (NeurIPS D&B 2025) that evaluates LLMs on freshly released math competitions (AIME, HMMT, BRUMO, SMT, CMIMC 2025; USAMO/IMO/Putnam 2025; AIME/HMMT 2026), eliminating contamination by post-release evaluation. Notably the first benchmark to include proof-writing eval—top models reach <40% on IMO 2025. So far covers 162+ problems across 7+ competitions and 50+ models.	📄 Paper 🔗 Project 🐙 Repo 🤗 Dataset

🖼️ Vision-Text Modality

Dataset	Descriptions	References
MathVerse	A collection of 2,612 high-quality, multi-subject math problems with diagrams from publicly available sources.	🔗 Project 🤗 Dataset
MathVista	A benchmark designed to combine challenges from diverse mathematical and visual tasks. It consists of 6,141 examples, derived from 28 existing multimodal datasets involving mathematics and 3 newly created datasets.	🔗 Project 🤗 Dataset
MATH-Vision	A collection of 3,040 mathematical problems with visual contexts sourced from real math competitions. Spanning 16 distinct mathematical disciplines and graded across 5 levels of difficulty.	🔗 Project 🤗 Dataset
We-Math	A collection of 6.5K visual math problems, spanning 67 hierarchical knowledge concepts and 5 layers of knowledge granularity.	📄 Paper 🔗 Project 🐙 Repo 🤗 Dataset
OlympiadBench	An Olympiad-level bilingual (EN/ZH) multimodal scientific benchmark of 8,476 math and physics problems sourced from International Olympiads, the Chinese Olympiad, and the Chinese College Entrance Exam (GaoKao), each with expert-level step-by-step solution annotations.	📄 Paper 🐙 Repo 🤗 Dataset
VCBench	A benchmark from Alibaba DAMO Academy containing 1,720 elementary-level (grades 1–6) multimodal math problems across 6 cognitive domains, with an average of 3.9 images per question to enforce multi-image reasoning. Evaluates five competencies — temporal, geometric, logical, spatial reasoning, and pattern recognition — with even the best LVLMs failing to exceed 50% accuracy.	📄 Paper 🐙 Repo 🤗 Dataset
MV-MATH	A multi-image multimodal math benchmark (CVPR 2025) of 2,009 K-12 problems with interleaved text and multiple images (some up to 8). Covers 11 subjects across 3 difficulty tiers and 3 question types (MC / free-form / multi-step), and uniquely splits problems into ID (independent images) vs MD (mutually dependent images) to probe cross-image reasoning.	📄 Paper 🔗 Project 🤗 Dataset

🎤 Speech Modality

Dataset	Descriptions	References
Spoken-MQA	A benchmark designed to evaluate large language models’ mathematical reasoning ability from spoken input. It features math problems covering arithmetic, contextual reasoning, and knowledge-based reasoning.	📄 Paper 🐙 Repo 🤗 Dataset

Related Repo

https://github.com/tongyx361/Awesome-LLM4Math

https://github.com/huggingface/evaluation-guidebook/blob/main/contents/automated-benchmarks/some-evaluation-datasets.md
