📊 Results
We present hierarchical Bayesian posterior summaries of LLM performance on the benchmarks, comparing the Exact and Rephrased question conditions.
The parameters reported are:
- Mean accuracy (μ)
- Inter-question homogeneity (κ)
- Mean expected total intra-question variance (γ̄), which reflects response stochasticity

Each value is a posterior mean, with its 95% Highest Density Interval (HDI) shown in brackets.
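
For readers unfamiliar with HDIs, the sketch below shows one common way to get a posterior mean and a 95% HDI from a set of posterior draws: take the narrowest interval that contains 95% of the sorted samples. It is an illustration only, not necessarily the procedure used to produce the reported values. The function name `summarize_posterior` and the synthetic Beta draws are assumptions introduced here, and the approach presumes a unimodal posterior.

```python
import numpy as np


def summarize_posterior(samples, mass=0.95):
    """Return (mean, hdi_low, hdi_high) for 1-D posterior draws.

    Hypothetical helper: the HDI is taken as the narrowest interval
    containing `mass` of the sorted draws (assumes a unimodal posterior).
    """
    draws = np.sort(np.asarray(samples, dtype=float).ravel())
    n = draws.size
    if n < 2:
        raise ValueError("need at least two posterior draws")
    if not 0.0 < mass < 1.0:
        raise ValueError("mass must be strictly between 0 and 1")

    # Number of draws each candidate interval must cover.
    window = min(max(int(np.ceil(mass * n)), 2), n)

    # Width of every interval spanning `window` consecutive sorted draws.
    widths = draws[window - 1:] - draws[: n - window + 1]
    start = int(np.argmin(widths))

    return float(draws.mean()), float(draws[start]), float(draws[start + window - 1])


if __name__ == "__main__":
    # Synthetic stand-in for posterior draws of a mean accuracy (μ);
    # these numbers are illustrative, not results from the benchmarks.
    rng = np.random.default_rng(0)
    fake_mu_draws = rng.beta(80, 20, size=4000)
    mean, low, high = summarize_posterior(fake_mu_draws)
    print(f"{mean:.3f} [{low:.3f}, {high:.3f}]")
```

For a skewed posterior, this interval is never wider than the equal-tailed 2.5%–97.5% percentile interval, which is one reason to report HDIs for bounded quantities such as accuracy.
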
The table is divided into:
- 🧠 Reasoning models (top half)
- 🤖 Non-reasoning models (bottom half)



