stan-dev
diff --git a/R/loo-glossary.R b/R/loo-glossary.R
Lines changed: 94 additions & 0 deletions
diff --git a/R/loo_compare.R b/R/loo_compare.R
Lines changed: 1 addition & 1 deletion
diff --git a/man/loo-glossary.Rd b/man/loo-glossary.Rd
Lines changed: 104 additions & 0 deletions
diff --git a/man/loo_compare.Rd b/man/loo_compare.Rd
Lines changed: 9 additions & 18 deletions
@@ -153,4 +153,98 @@
 #' individual models due to correlation (i.e., if some observations are easier
 #' and some more difficult to predict for all models).
 #'
+#' @section `p_worse` (probability of worse predictive performance):
+#'
+#' `p_worse` is the estimated probability that a model has worse predictive
+#' performance than the best-ranked model in the comparison, based on the normal
+#' approximation to the uncertainty in `elpd_diff`. It is computed as
+#'
+#'     p_worse = pnorm(0, elpd_diff, se_diff).
+#'
+#' The best-ranked model (the first row in the `loo_compare()` output, where
+#' `elpd_diff = 0`) always receives `NA`, since the comparison is defined
+#' relative to that model.
+#'
+#' Because models are ordered by `elpd_loo` before computing `p_worse`, all
+#' reported values are at least 0.5 by construction. A value close to 0.5
+#' indicates that the models are nearly indistinguishable in predictive
+#' performance and that the ranking could easily be reversed with different
+#' data. A value close to 1 indicates that the lower-ranked model is almost
+#' certainly worse. `p_worse` inherits all the limitations of `se_diff` and the
+#' normal approximation on which it is based. In particular, when `se_diff` is
+#' underestimated, `p_worse` will be estimated too close to 1, making a model
+#' appear more clearly worse than the data actually support. Conversely, when
+#' `elpd_diff` is biased due to an unreliable LOO approximation, `p_worse` can
+#' point in the wrong direction entirely. When either of these conditions is
+#' present, `diag_diff` or `diag_elpd` will be flagged in the `loo_compare()`
+#' output. See those sections below for further guidance.
+#'
+#' @section `diag_diff` (pairwise comparison diagnostics):
+#'
+#' `diag_diff` is a diagnostic column in the `loo_compare()` output for each
+#' model comparison against the current reference model. It flags conditions
+#' under which the normal approximation behind `se_diff` and `p_worse` is likely
+#' to be poorly calibrated. The column contains a short label when a condition
+#' is detected, and is empty otherwise.
+#'
+#' The column `diag_diff` currently flags two problems:
+#'
+#' ### `N < 100`
+#'
+#' When the number of observations is small, `se_diff` is likely to be
+#' underestimated. As a rough heuristic, one can multiply `se_diff` by 2 to
+#' make a more conservative estimate (Bengio and Grandvalet, 2004).
+#'
+#' ### `|elpd_diff| < 4`
+#'
+#' When `|elpd_diff|` is below 4, the models have very similar predictive
+#' performance. In this setting, Sivula et al. (2025) show that skewness in
+#' the error distribution can make the normal approximation for `se_diff`
+#' and `p_worse` miscalibrated, even for large N. In practice, this usually
+#' supports treating the models as predictively similar.
+#'
+#' ### Relation between `N < 100` and `|elpd_diff| < 4`
+#'
+#' The conditions flagged by `diag_diff` are not independent: they tend to
+#' co-occur, and when they do, some flags carry more information than others.
+#' `loo_compare()` therefore follows a priority hierarchy and shows only the
+#' most critical flag in the table output.
+#'
+#' The hierarchy is as follows:
+#'
+#' * **`N < 100` takes highest priority.** With a small sample size, `se_diff`
+#'   tends to underestimate the uncertainty. Because of this, even if
+#'   `|elpd_diff| < 4` is also true for a comparison, the table will only show
+#'   `N < 100`. The small sample size renders the `|elpd_diff| < 4` diagnostic
+#'   less meaningful.
+#'
+#' * **`|elpd_diff| < 4` takes second priority.** When N >= 100 and the
+#'   difference is small, the normal approximation can be miscalibrated due to
+#'   the skewness of the error distribution (Sivula et al., 2025). In this
+#'   situation, `se_diff` exists and is not heavily biased in scale, but the
+#'   shape of the approximation may be wrong, making `p_worse` unreliable.
+#'
+#' @section `diag_elpd`:
+#'
+#' `diag_elpd` is a diagnostic column in the `loo_compare()` output that flags
+#' when the PSIS-LOO approximation for an individual model is unreliable. Unlike
+#' `diag_diff`, which concerns the *comparison* between models, `diag_elpd`
+#' concerns the quality of the `elpd_loo` estimate for each model individually.
+#' It contains a short text label when a problem is detected, and is empty
+#' otherwise.
+#'
+#' ### `K k_psis > t` (K observations with Pareto-k values > t)
+#'
+#' This label indicates that K observations for this model have Pareto-k values
+#' above the PSIS reliability threshold `t` used by `loo` for that fit. The
+#' threshold is sample-size dependent, and in many practical cases is close to
+#' 0.7. When this flag appears, the PSIS approximation can be unreliable for
+#' those observations, and the resulting `elpd_loo` may be biased. Because
+#' `elpd_diff` is a direct difference of two models' `elpd_loo` values, bias in
+#' either model's estimate propagates directly into `elpd_diff` and `p_worse`.
+#' This is qualitatively different from the calibration issues flagged by
+#' `diag_diff`: here the estimate itself may be wrong, not just uncertain.
+#'
+#' For further information on Pareto-k values, see the "Pareto k estimates"
+#' section.
 NULL
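Reviewer note: to make the `p_worse` formula and the `diag_diff` priority rule from the glossary hunk above concrete, here is a minimal R sketch. The helper `sketch_compare_columns()` and its arguments are hypothetical and are not part of the loo package. It assumes that rows are already ordered best-first, that `elpd_diff` is zero for the reference row and negative otherwise, and that the reference row gets no `diag_diff` label. The numbers are made up for illustration.

```r
# Hypothetical helper for illustration only; not the loo implementation.
# Assumes rows are ordered best-first, so row 1 is the reference model.
sketch_compare_columns <- function(elpd_diff, se_diff, n_obs) {
  is_ref <- seq_along(elpd_diff) == 1

  # Normal approximation from the glossary:
  #   p_worse = pnorm(0, elpd_diff, se_diff)
  # The reference row is left as NA.
  p_worse <- rep(NA_real_, length(elpd_diff))
  p_worse[!is_ref] <- pnorm(0, mean = elpd_diff[!is_ref], sd = se_diff[!is_ref])

  # Priority rule: "N < 100" wins; "|elpd_diff| < 4" is shown only if N >= 100.
  diag_diff <- rep("", length(elpd_diff))
  small_diff <- !is_ref & abs(elpd_diff) < 4
  if (n_obs < 100) {
    diag_diff[!is_ref] <- "N < 100"
  } else {
    diag_diff[small_diff] <- "|elpd_diff| < 4"
  }

  data.frame(elpd_diff, se_diff, p_worse, diag_diff, stringsAsFactors = FALSE)
}

# Made-up numbers, not results from any real model comparison.
out <- sketch_compare_columns(
  elpd_diff = c(0, -1.2, -9.5),
  se_diff = c(0, 2.0, 3.1),
  n_obs = 250
)
print(out)

# Rough N < 100 heuristic from the glossary: doubling se_diff pulls p_worse
# toward 0.5 for the same elpd_diff.
p_default <- pnorm(0, mean = -1.2, sd = 2.0)
p_conservative <- pnorm(0, mean = -1.2, sd = 2 * 2.0)
```

In this sketch the second row is flagged `|elpd_diff| < 4` because N >= 100 and the difference is small. With `n_obs = 50`, every non-reference row would show `N < 100` instead.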
@@ -63,7 +63,7 @@
 #'
 #'   The column `diag_elpd` shows the PSIS-LOO Pareto k diagnostic for the
 #'   pointwise ELPD computations for each model. If `K k_psis > 0.7` is shown,
-#'   where `K` is the number of high high Pareto k values in the PSIS
+#'   where `K` is the number of high Pareto k values in the PSIS
 #'   computation, then there may be significant bias in `elpd_diff` favoring
 #'   models with a large number of high Pareto k values.
 #'
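Reviewer note: the `K k_psis > t` label described for `diag_elpd` can be sketched as below. The helper `sketch_diag_elpd()` is hypothetical and not part of the loo package. The threshold formula `min(1 - 1 / log10(S), 0.7)`, with `S` the number of posterior draws, is an assumption here, since the glossary text only says the threshold is sample-size dependent and often close to 0.7. The Pareto-k values are made up.

```r
# Hypothetical helper for illustration only; not the loo implementation.
sketch_diag_elpd <- function(pareto_k, n_draws) {
  # Assumed sample-size dependent threshold; equals 0.7 once n_draws is
  # large enough (a few thousand draws).
  threshold <- min(1 - 1 / log10(n_draws), 0.7)

  # K = number of observations whose Pareto-k exceeds the threshold.
  n_high <- sum(pareto_k > threshold)

  # Empty string when nothing is flagged, otherwise "K k_psis > t".
  if (n_high == 0) {
    ""
  } else {
    sprintf("%d k_psis > %s", n_high, format(round(threshold, 2)))
  }
}

# Made-up Pareto-k values, not output from a real fit.
label_many <- sketch_diag_elpd(c(0.1, 0.5, 0.75, 0.9), n_draws = 4000)
label_none <- sketch_diag_elpd(c(0.1, 0.2), n_draws = 4000)
```

With fewer draws the assumed threshold drops below 0.7, so the same Pareto-k values can produce a label with a larger `K` and a smaller `t`.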