Datasets

This document gives an overview of the datasets used in the LLM uncertainty evaluation project. Each dataset was selected to assess a different aspect of model performance and uncertainty estimation.

Overview

Dataset	Domain	Format	Source
MMLU	Multi-domain	Multiple-choice	cais/mmlu
MedMCQA	Medical	Multiple-choice	openlifescienceai/medmcqa
PubMedQA	Biomedical Research	Yes/No/Maybe	pubmed_qa
AI2 ARC	Science	Multiple-choice	allenai/ai2_arc
MathQA	Mathematics	Multiple-choice	allenai/math_qa
CommonsenseQA	General knowledge	Multiple-choice	tau/commonsense_qa
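As a rough sketch, the sources listed above could be loaded with the Hugging Face `datasets` library. The `SOURCES` mapping and the `load_split` helper are hypothetical names, and the configuration names (`all`, `pqa_labeled`, `ARC-Challenge`) are assumptions about each Hub repository rather than settings taken from this project. Split names differ between datasets, and some repositories may need extra arguments depending on the installed `datasets` version.

```python
# Hypothetical registry: dataset name -> (Hub path, config name or None).
# Config names are assumptions; check each dataset card before relying on them.
SOURCES = {
    "MMLU": ("cais/mmlu", "all"),
    "MedMCQA": ("openlifescienceai/medmcqa", None),
    "PubMedQA": ("pubmed_qa", "pqa_labeled"),
    "AI2 ARC": ("allenai/ai2_arc", "ARC-Challenge"),
    "MathQA": ("allenai/math_qa", None),
    "CommonsenseQA": ("tau/commonsense_qa", None),
}


def load_split(name, split):
    """Load one split of a dataset listed in SOURCES (needs network access)."""
    # Imported lazily so the registry can be inspected without `datasets` installed.
    from datasets import load_dataset

    path, config = SOURCES[name]
    return load_dataset(path, config, split=split)
```

Keeping the identifiers in one mapping means the evaluation code can refer to datasets by the names used in this document.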

Detailed Descriptions

📘 Massive Multitask Language Understanding (MMLU)

Source: cais/mmlu

MMLU is a benchmark that evaluates models across 57 subjects spanning STEM, the humanities, the social sciences, and more. It tests both world knowledge and problem-solving ability. Questions are multiple-choice with four answer options and are typically evaluated in zero-shot or few-shot settings.
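A minimal sketch of turning one multiple-choice record into a zero-shot prompt is shown below. It assumes each record has a `question` string, a `choices` list of four strings, and an integer `answer` index; that layout is an assumption about the Hub copy, not something this document specifies. The helper names and the example record are invented for illustration.

```python
LETTERS = "ABCD"


def format_question(record):
    """Render a multiple-choice record as a zero-shot prompt ending in 'Answer:'."""
    lines = [record["question"].strip()]
    for letter, choice in zip(LETTERS, record["choices"]):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)


def gold_letter(record):
    """Convert the integer answer index to its option letter."""
    return LETTERS[record["answer"]]


# Made-up record for illustration, not taken from MMLU.
example = {
    "question": "What is 2 + 3?",
    "choices": ["4", "5", "6", "7"],
    "answer": 1,
}
print(format_question(example))
print(gold_letter(example))  # B
```

A few-shot variant would prepend solved examples in the same layout, each followed by its gold letter.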

🧬 MedMCQA

Source: openlifescienceai/medmcqa

MedMCQA is a large-scale multiple-choice question answering dataset built from real-world medical entrance exam questions. It covers a wide range of topics in medicine, clinical practice, and biomedical sciences, making it challenging even for specialized models.

📄 PubMedQA

Source: pubmed_qa

PubMedQA is a biomedical question answering dataset built from PubMed abstracts. Each question is answered with "yes," "no," or "maybe" based on the accompanying context from the medical literature, testing reasoning in a specialized domain.
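Because the answer space is the three labels above, free-form model output usually has to be mapped back onto them before scoring. The following is one possible sketch; `normalize_decision` is a hypothetical helper, and the rule of reading only the first word is an assumption, not this project's method.

```python
LABELS = ("yes", "no", "maybe")


def normalize_decision(text):
    """Map free-form model output to 'yes', 'no', or 'maybe'; None if no match."""
    words = text.strip().lower().split()
    if not words:
        return None
    # Drop punctuation attached to the first word, e.g. "Yes," -> "yes".
    first = words[0].strip(".,;:!\"'")
    return first if first in LABELS else None
```

Returning `None` for unmatched output keeps unparseable answers separate from wrong ones, which matters when error rates feed into uncertainty metrics.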

🔬 AI2 ARC (AI2 Reasoning Challenge)

Source: allenai/ai2_arc

The ARC dataset consists of genuine grade-school-level, multiple-choice science questions. Its Challenge Set contains only questions that both a retrieval-based algorithm and a word co-occurrence algorithm answered incorrectly, so it tests a model's ability to apply scientific knowledge rather than just retrieve facts.
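When scoring ARC, the gold answer has to be located among the options. The sketch below assumes each record stores its options as `choices = {"label": [...], "text": [...]}` with the gold label in `answerKey`; that layout is an assumption about the Hub copy. Looking the key up in the record's own label list avoids hard-coding letters, which helps if some records use numeric labels instead. Both example records are made up.

```python
def answer_index(record):
    """Return the position of the gold answer within the record's choices."""
    labels = record["choices"]["label"]
    return labels.index(record["answerKey"])


# Made-up records for illustration, not taken from ARC.
lettered = {
    "choices": {"label": ["A", "B", "C", "D"],
                "text": ["rock", "water", "air", "sand"]},
    "answerKey": "C",
}
numbered = {
    "choices": {"label": ["1", "2", "3", "4"],
                "text": ["red", "green", "blue", "white"]},
    "answerKey": "2",
}
print(answer_index(lettered))  # 2
print(answer_index(numbered))  # 1
```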

🧮 MathQA

Source: allenai/math_qa

MathQA is a dataset of math word problems, each with step-by-step annotations for solving it. It covers multiple mathematical domains, including arithmetic, algebra, probability, and geometry, and tests a model's quantitative reasoning.

🧠 CommonsenseQA

Source: tau/commonsense_qa

CommonsenseQA is a multiple-choice question answering dataset that specifically targets commonsense knowledge. Questions are derived from ConceptNet and require models to understand everyday concepts and relationships that humans typically take for granted.
