📊 Results

We present hierarchical Bayesian posterior summaries of LLM performance on four benchmarks (MMLU, MATHQA, MEDMCQA, and CommonsenseQA), comparing the Exact and Rephrased question conditions.

The parameters reported are:

  • Mean accuracy (μ)
  • Inter-question homogeneity (κ)
  • Mean expected total intra-question variance (γ̄), which reflects response stochasticity

Each value is a posterior mean, with the 95% Highest Density Interval (HDI) shown in brackets.
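
For readers who want to reproduce this style of summary from their own posterior draws, the snippet below is a minimal sketch of how a posterior mean and a 95% HDI can be computed from a list of samples. It is not taken from this repository: the function name `summarize_posterior` and the synthetic Beta draws are assumptions made for illustration, and the HDI is computed with the common "narrowest window over sorted samples" approach, which assumes a unimodal posterior.

```python
import math
import random


def summarize_posterior(samples, mass=0.95):
    """Return (mean, hdi_low, hdi_high) for 1-D posterior samples.

    Hypothetical helper, not part of this repository. The HDI is taken as
    the narrowest interval containing `mass` of the sorted draws, which is
    appropriate for unimodal posteriors.
    """
    draws = sorted(float(s) for s in samples)
    n = len(draws)
    if n == 0:
        raise ValueError("need at least one sample")
    # Number of draws each candidate interval must contain.
    k = min(n, max(1, math.ceil(mass * n)))
    # Slide a window of k consecutive sorted draws and keep the narrowest.
    start = min(range(n - k + 1), key=lambda i: draws[i + k - 1] - draws[i])
    mean = sum(draws) / n
    return mean, draws[start], draws[start + k - 1]


if __name__ == "__main__":
    # Synthetic draws standing in for posterior samples of a mean accuracy.
    rng = random.Random(0)
    fake_mu_draws = [rng.betavariate(80, 20) for _ in range(4000)]
    mean, low, high = summarize_posterior(fake_mu_draws)
    print(f"{mean:.3f} [{low:.3f}, {high:.3f}]")
```

The printed line follows the same "posterior mean [HDI low, HDI high]" layout described above. Unlike an equal-tailed interval, the HDI is not necessarily symmetric around the mean when the posterior is skewed.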

Each table is divided into:

🧠 Reasoning models (top half)

🤖 Non-reasoning models (bottom half)

MMLU

[Image: posterior summary table for MMLU]

MATHQA

[Image: posterior summary table for MATHQA]

MEDMCQA

[Image: posterior summary table for MEDMCQA]

CommonsenseQA

[Image: posterior summary table for CommonsenseQA]