
📊 Results

We present posterior summaries from a hierarchical Bayesian model of LLM performance on the benchmarks, comparing the Exact and Rephrased question conditions.

The parameters reported are:

Mean accuracy (μ)
Inter-question homogeneity (κ)
Mean expected total intra-question variance (γ̄), which reflects response stochasticity

Each value is a posterior mean, with the 95% Highest Density Interval (HDI) shown in brackets.
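For readers less familiar with the notation, here is a minimal sketch of how a "posterior mean [95% HDI]" entry can be computed from a set of posterior draws. It is not the code used to produce the table: the function names `hdi` and `summarize` are hypothetical, the draws are synthetic Beta samples rather than results from the study, and the interval search assumes a unimodal posterior.

```python
import numpy as np


def hdi(samples, prob=0.95):
    """Return the narrowest interval containing `prob` of the samples.

    Assumes a unimodal distribution, so the HDI is a single interval.
    """
    x = np.sort(np.asarray(samples, dtype=float))
    n = x.size
    if n == 0:
        raise ValueError("samples must not be empty")
    # Number of sorted samples each candidate interval must span.
    k = min(n, max(1, int(np.ceil(prob * n))))
    # Width of every window of k consecutive sorted samples.
    widths = x[k - 1:] - x[: n - k + 1]
    i = int(np.argmin(widths))
    return x[i], x[i + k - 1]


def summarize(samples, prob=0.95):
    """Format draws as 'mean [lower, upper]', matching the table layout."""
    lo, hi = hdi(samples, prob)
    return f"{np.mean(samples):.3f} [{lo:.3f}, {hi:.3f}]"


# Synthetic draws standing in for a posterior over mean accuracy (mu).
rng = np.random.default_rng(0)
mu_draws = rng.beta(80, 20, size=4000)
print(summarize(mu_draws))
```

Unlike an equal-tailed interval, the HDI is the shortest interval with the requested coverage, so for skewed posteriors (for example, accuracies close to 1) its endpoints can differ noticeably from the 2.5% and 97.5% quantiles.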

The table is divided into:

🧠 Reasoning models (top half)

🤖 Non-reasoning models (bottom half)
