This document provides an overview of the datasets used in the LLM uncertainty evaluation project. Each dataset has been selected to assess different aspects of model performance and uncertainty estimation.
| Dataset | Domain | Format | Source |
|---|---|---|---|
| MMLU | Multi-domain | Multiple-choice | cais/mmlu |
| MedMCQA | Medical | Multiple-choice | openlifescienceai/medmcqa |
| PubMedQA | Biomedical Research | Yes/No/Maybe | pubmed_qa |
| AI2 ARC | Science | Multiple-choice | allenai/ai2_arc |
| MathQA | Mathematics | Multiple-choice | allenai/math_qa |
| CommonsenseQA | General knowledge | Multiple-choice | tau/commonsense_qa |
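
The identifiers in the Source column can be passed to the Hugging Face `datasets` library. The following is a minimal sketch of a loader registry, not the project's actual loading code: the `DatasetSpec` class, the `DATASETS` dictionary, and the `load` helper are hypothetical names, and the configuration names (`all`, `pqa_labeled`, `ARC-Challenge`) are assumptions that should be checked against each dataset card before use.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DatasetSpec:
    """Hypothetical container for one row of the table above."""
    path: str                     # Hub identifier, copied from the Source column
    config: Optional[str] = None  # assumed configuration name; verify on the Hub


# Keys are the dataset names from the table; configs are assumptions.
DATASETS = {
    "MMLU": DatasetSpec("cais/mmlu", "all"),
    "MedMCQA": DatasetSpec("openlifescienceai/medmcqa"),
    "PubMedQA": DatasetSpec("pubmed_qa", "pqa_labeled"),
    "AI2 ARC": DatasetSpec("allenai/ai2_arc", "ARC-Challenge"),
    "MathQA": DatasetSpec("allenai/math_qa"),
    "CommonsenseQA": DatasetSpec("tau/commonsense_qa"),
}


def load(name: str, split: str):
    """Load one dataset by its table name.

    Requires the `datasets` package and network access. Available splits
    differ between datasets, so the caller must choose one explicitly.
    """
    from datasets import load_dataset  # lazy import keeps the registry usable offline

    spec = DATASETS[name]
    if spec.config is None:
        return load_dataset(spec.path, split=split)
    return load_dataset(spec.path, spec.config, split=split)
```

A call such as `load("AI2 ARC", "test")` would then download the corresponding split, provided that split exists for the chosen configuration.
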
Source: cais/mmlu
MMLU is a benchmark that evaluates models across 57 subjects spanning STEM, the humanities, the social sciences, and other areas. It tests both world knowledge and problem-solving ability. Questions are in multiple-choice format, and models are typically evaluated in zero-shot or few-shot settings.
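
The difference between zero-shot and few-shot prompting can be illustrated with a small formatter. This is a sketch under stated assumptions rather than the project's prompt template: it assumes each record has a `question` string, a `choices` list, and an integer `answer` index, and the helper names `format_question` and `build_prompt` are hypothetical.

```python
from typing import Mapping, Optional, Sequence

LETTERS = "ABCDEFGH"  # option labels; MMLU questions use the first four


def format_question(question: str, choices: Sequence[str],
                    answer: Optional[int] = None) -> str:
    """Render one question; include the answer letter only for exemplars."""
    lines = [question.strip()]
    for letter, choice in zip(LETTERS, choices):
        lines.append(f"{letter}. {choice}")
    suffix = f" {LETTERS[answer]}" if answer is not None else ""
    lines.append("Answer:" + suffix)
    return "\n".join(lines)


def build_prompt(exemplars: Sequence[Mapping], target: Mapping) -> str:
    """Build a k-shot prompt; an empty exemplar list gives a zero-shot prompt."""
    blocks = [
        format_question(e["question"], e["choices"], e["answer"])
        for e in exemplars
    ]
    # The target question is left unanswered so the model completes it.
    blocks.append(format_question(target["question"], target["choices"]))
    return "\n\n".join(blocks)


if __name__ == "__main__":
    # Toy records in the assumed schema, not taken from the dataset.
    shot = {"question": "2 + 2 = ?", "choices": ["3", "4", "5", "6"], "answer": 1}
    query = {"question": "3 + 3 = ?", "choices": ["5", "6", "7", "8"], "answer": 1}
    print(build_prompt([shot], query))
```

Passing an empty list of exemplars yields the zero-shot prompt, while passing k answered records yields a k-shot prompt that ends with the unanswered target question.
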
Source: openlifescienceai/medmcqa
MedMCQA is a large-scale multiple-choice question answering dataset built from real-world medical entrance exam questions. It covers a range of topics in medicine, clinical practice, and the biomedical sciences, which makes it challenging even for specialized models.
Source: pubmed_qa
PubMedQA is a biomedical question answering dataset collected from PubMed abstracts. It features questions that can be answered with "yes," "no," or "maybe" based on the provided context from medical literature, testing reasoning abilities in specialized domains.
Source: allenai/ai2_arc
The ARC dataset consists of genuine grade-school-level, multiple-choice science questions. Its Challenge set contains only questions that both a retrieval-based algorithm and a word co-occurrence algorithm answered incorrectly, so it tests a model's ability to apply scientific knowledge rather than just retrieve facts.
Source: allenai/math_qa
MathQA is a dataset of math word problems with step-by-step annotations for solving each question. It covers several mathematical domains, including arithmetic, algebra, probability, and geometry, and tests a model's quantitative reasoning capabilities.
Source: tau/commonsense_qa
CommonsenseQA is a multiple-choice question answering dataset that specifically targets commonsense knowledge. Questions are derived from ConceptNet and require models to understand everyday concepts and relationships that humans typically take for granted.