Datasets

This document gives an overview of the datasets used in the LLM uncertainty evaluation project. Each dataset was selected to assess a different aspect of model performance and uncertainty estimation.

Overview

| Dataset | Domain | Format | Source |
|---|---|---|---|
| MMLU | Multi-domain | Multiple-choice | cais/mmlu |
| MedMCQA | Medical | Multiple-choice | openlifescienceai/medmcqa |
| PubMedQA | Biomedical Research | Yes/No/Maybe | pubmed_qa |
| AI2 ARC | Science | Multiple-choice | allenai/ai2_arc |
| MathQA | Mathematics | Multiple-choice | allenai/math_qa |
| CommonsenseQA | General knowledge | Multiple-choice | tau/commonsense_qa |
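
The Source values above look like Hugging Face Hub dataset identifiers, so one way to keep them in a single place is a small registry. The following is a minimal sketch, not the project's actual loading code: the `DatasetSpec` class, the `DATASETS` mapping, and the `load` helper are hypothetical names, and the configuration names (`all`, `pqa_labeled`, `ARC-Challenge`) are assumptions that do not come from this document.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DatasetSpec:
    hub_id: str                   # identifier from the Source column
    config: Optional[str] = None  # assumed configuration name, not from this document


# Hub identifiers are copied from the table; configuration names are assumptions.
DATASETS = {
    "MMLU": DatasetSpec("cais/mmlu", "all"),
    "MedMCQA": DatasetSpec("openlifescienceai/medmcqa"),
    "PubMedQA": DatasetSpec("pubmed_qa", "pqa_labeled"),
    "AI2 ARC": DatasetSpec("allenai/ai2_arc", "ARC-Challenge"),
    "MathQA": DatasetSpec("allenai/math_qa"),
    "CommonsenseQA": DatasetSpec("tau/commonsense_qa"),
}


def load(name: str, split: str):
    """Load one dataset by display name (requires the `datasets` package)."""
    # Imported lazily so the registry itself works without the library installed.
    from datasets import load_dataset

    spec = DATASETS[name]
    if spec.config is None:
        return load_dataset(spec.hub_id, split=split)
    return load_dataset(spec.hub_id, spec.config, split=split)
```

The split is a required argument because split names can differ between datasets; check each dataset card before choosing one.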

Detailed Descriptions

📘 Massive Multitask Language Understanding (MMLU)

Source: cais/mmlu

MMLU is a benchmark that evaluates models across 57 subjects spanning STEM, humanities, social sciences, and more. It tests both world knowledge and problem-solving ability. Questions are in multiple-choice format and are typically answered in a zero-shot or few-shot setting.
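
To make the multiple-choice format concrete, here is a rough sketch of how one question could be turned into a zero-shot prompt. It assumes each record provides a question string, exactly four answer options, and a zero-based index of the correct option; the function names and the prompt layout are illustrative choices, not the project's actual template.

```python
from typing import Sequence

LETTERS = "ABCD"


def format_mmlu_prompt(question: str, choices: Sequence[str]) -> str:
    """Render one four-option question as a zero-shot prompt."""
    if len(choices) != len(LETTERS):
        raise ValueError("expected exactly four answer options")
    lines = [question.strip()]
    lines += [f"{letter}. {choice}" for letter, choice in zip(LETTERS, choices)]
    lines.append("Answer:")
    return "\n".join(lines)


def answer_letter(answer_index: int) -> str:
    """Map a zero-based answer index to its option letter."""
    if not 0 <= answer_index < len(LETTERS):
        raise ValueError("answer index out of range")
    return LETTERS[answer_index]
```

For a few-shot prompt, the same function could be applied to each example, with the example's answer letter appended after `Answer:`, before the final unanswered question.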

🧬 MedMCQA

Source: openlifescienceai/medmcqa

MedMCQA is a large-scale multiple-choice question answering dataset built from real-world medical entrance exam questions. It covers a wide range of topics in medicine, clinical practice, and biomedical sciences, which makes it challenging even for specialized models.

📄 PubMedQA

Source: pubmed_qa

PubMedQA is a biomedical question answering dataset collected from PubMed abstracts. It features questions that can be answered with "yes," "no," or "maybe" based on the provided context from medical literature, testing reasoning abilities in specialized domains.
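
Because the answer space here is three labels rather than lettered options, prompting and scoring differ slightly from the multiple-choice datasets. The sketch below shows one possible approach, assuming the question and its context passages are already available as plain strings; `format_pubmedqa_prompt` and `parse_decision` are hypothetical helpers, and the prompt wording is only an illustration.

```python
from typing import Optional, Sequence

LABELS = ("yes", "no", "maybe")


def format_pubmedqa_prompt(question: str, contexts: Sequence[str]) -> str:
    """Build a prompt from the context passages followed by the question."""
    context_block = "\n".join(passage.strip() for passage in contexts)
    return (
        f"Context:\n{context_block}\n\n"
        f"Question: {question.strip()}\n"
        "Answer (yes, no, or maybe):"
    )


def parse_decision(model_output: str) -> Optional[str]:
    """Return the label the output starts with, or None if it is not one of the three."""
    words = model_output.strip().lower().split()
    if not words:
        return None
    first = words[0].strip(".,:;!\"'")
    return first if first in LABELS else None
```

Returning `None` for unparseable outputs keeps them distinguishable from a genuine "maybe", which matters when counting how often a model declines to commit.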

🔬 AI2 ARC (AI2 Reasoning Challenge)

Source: allenai/ai2_arc

The ARC dataset consists of genuine grade-school-level, multiple-choice science questions. Its Challenge set contains questions that require reasoning, which makes them difficult for retrieval-based and word co-occurrence methods. It tests a model's ability to apply scientific knowledge rather than just retrieve facts.

🧮 MathQA

Source: allenai/math_qa

MathQA is a dataset of math word problems, each with step-by-step annotations for solving the question. It covers multiple mathematical domains, including arithmetic, algebra, probability, and geometry, and tests a model's quantitative reasoning capabilities.

🧠 CommonsenseQA

Source: tau/commonsense_qa

CommonsenseQA is a multiple-choice question answering dataset that specifically targets commonsense knowledge. Questions are derived from ConceptNet and require models to understand everyday concepts and relationships that humans typically take for granted.