Commit b8d0782

Author: Florence Bockting
Commit message: docs: update loo-glossary and documentation wrt diag_diff and diag_elpd
Parent: b0b6ef8

4 files changed

Lines changed: 208 additions & 19 deletions

R/loo-glossary.R

Lines changed: 94 additions & 0 deletions
@@ -153,4 +153,98 @@
 #' individual models due to correlation (i.e., if some observations are easier
 #' and some more difficult to predict for all models).
 #'
+#' @section `p_worse` (probability of worse predictive performance):
+#'
+#' `p_worse` is the estimated probability that a model has worse predictive
+#' performance than the best-ranked model in the comparison, based on the normal
+#' approximation to the uncertainty in `elpd_diff`. It is computed as
+#'
+#' p_worse = pnorm(0, elpd_diff, se_diff).
+#'
+#' The best-ranked model (the first row in the `loo_compare()` output, where
+#' `elpd_diff = 0`) always receives `NA`, since the comparison is defined
+#' relative to that model.
+#'
+#' Because models are ordered by `elpd_loo` before computing `p_worse`, all
+#' reported values are at least 0.5 by construction. A value close to 0.5
+#' indicates that the models are nearly indistinguishable in predictive
+#' performance and that the ranking could easily be reversed with different
+#' data. A value close to 1 indicates that the lower-ranked model is almost
+#' certainly worse. `p_worse` inherits all the limitations of `se_diff` and the
+#' normal approximation on which it is based. In particular, when `se_diff` is
+#' underestimated, `p_worse` will be estimated too close to 1, making a model
+#' appear more clearly worse than the data actually support. Conversely, when
+#' `elpd_diff` is biased due to an unreliable LOO approximation, `p_worse` can
+#' point in the wrong direction entirely. When any of these conditions are
+#' present, `diag_diff` or `diag_elpd` will be flagged in the `loo_compare()`
+#' output. See those sections below for further guidance.
+#'
+#' @section `diag_diff` (pairwise comparison diagnostics):
+#'
+#' `diag_diff` is a diagnostic column in the `loo_compare()` output for each
+#' model comparison against the current reference model. It flags conditions
+#' under which the normal approximation behind `se_diff` and `p_worse` is likely
+#' to be poorly calibrated. The column contains a short label when a condition
+#' is detected, and is empty otherwise.
+#'
+#' The column `diag_diff` currently flags two problems:
+#'
+#' ### `N < 100`
+#'
+#' When the number of observations is small, we may assume `se_diff` to be
+#' underestimated. As a rough heuristic one can multiply `se_diff` by 2 to
+#' make a more conservative estimate (Bengio and Grandvalet, 2004).
+#'
+#' ### `|elpd_diff| < 4`
+#'
+#' When `|elpd_diff|` is below 4, the models have very similar predictive
+#' performance. In this setting, Sivula et al. (2025) show that skewness in
+#' the error distribution can make the normal approximation for `se_diff`
+#' and `p_worse` miscalibrated, even for large N. In practice, this usually
+#' supports treating the models as predictively similar.
+#'
+#' ### Relation between `N < 100` and `|elpd_diff| < 4`
+#'
+#' The conditions flagged by `diag_diff` are not independent: they tend to
+#' co-occur, and when they do, some flags carry more information than others.
+#' `loo_compare()` therefore follows a priority hierarchy and shows only the
+#' most critical flag in the table output.
+#'
+#' The hierarchy is as follows:
+#'
+#' * **`N < 100` takes highest priority.** A small sample size undermines the
+#' reliability of `se_diff` by underestimating uncertainty. Because of this,
+#' even if `|elpd_diff| < 4` is also true for a comparison, the table will only
+#' show `N < 100`. The small sample size renders the `|elpd_diff| < 4`
+#' diagnostic less meaningful.
+#'
+#' * **`|elpd_diff| < 4` takes second priority.** When N >= 100 and the
+#' difference is small, the normal approximation is miscalibrated due to the
+#' skewness of the error distribution (Sivula et al., 2025). In this
+#' situation, `se_diff` exists and is not heavily biased in scale, but the
+#' shape of the approximation is wrong, making `p_worse` unreliable.
+#'
+#' @section `diag_elpd`:
+#'
+#' `diag_elpd` is a diagnostic column in the `loo_compare()` output that flags
+#' when the PSIS-LOO approximation for an individual model is unreliable. Unlike
+#' `diag_diff`, which concerns the *comparison* between models, `diag_elpd`
+#' concerns the quality of the `elpd_loo` estimate for each model individually.
+#' It contains a short text label when a problem is detected, and is empty
+#' otherwise.
+#'
+#' ### `K k_psis > t` (K observations with Pareto-k values > t)
+#'
+#' This label indicates that K observations for this model have Pareto-k values
+#' above the PSIS reliability threshold `t` used by `loo` for that fit. The
+#' threshold is sample-size dependent, and in many practical cases is close to
+#' 0.7. When this flag appears, the PSIS approximation can be unreliable for
+#' those observations, and the resulting `elpd_loo` may be biased. Because
+#' `elpd_diff` is a direct difference of two models' `elpd_loo` values, bias in
+#' either model's estimate propagates directly into `elpd_diff` and `p_worse`.
+#' This is qualitatively different from the calibration issues flagged by
+#' `diag_diff`: here the estimate itself may be wrong, not just uncertain.
+#'
+#' For further information on Pareto-k values, see the "Pareto k estimates"
+#' section.
 NULL
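
Reading aid for the glossary text in this hunk: a minimal sketch of the documented `p_worse` formula and of the `diag_diff` priority rule (`N < 100` before `|elpd_diff| < 4`). This is not the package's implementation. The helper names `p_worse_sketch()` and `diag_diff_sketch()` and the example numbers are hypothetical, and leaving the reference row unflagged is an assumption.

```r
# Hypothetical helpers illustrating the documented rules; not code from loo.

# p_worse = pnorm(0, elpd_diff, se_diff). The reference model (first row,
# elpd_diff = 0) receives NA, as the glossary text states.
p_worse_sketch <- function(elpd_diff, se_diff) {
  p <- pnorm(0, mean = elpd_diff, sd = se_diff)
  p[1] <- NA_real_
  p
}

# Priority rule: "N < 100" takes precedence over "|elpd_diff| < 4";
# the label is empty when neither condition holds.
diag_diff_sketch <- function(elpd_diff, n_obs) {
  flag <- rep("", length(elpd_diff))
  if (n_obs < 100) {
    flag[] <- "N < 100"
  } else {
    flag[abs(elpd_diff) < 4] <- "|elpd_diff| < 4"
  }
  flag[1] <- ""  # assumption: the reference row carries no flag
  flag
}

# Made-up values, ordered so that elpd_diff <= 0 as in loo_compare() output.
elpd_diff <- c(0, -1.5, -12)
se_diff   <- c(0, 2.0, 3.0)

p_worse_sketch(elpd_diff, se_diff)        # all non-NA values are >= 0.5
diag_diff_sketch(elpd_diff, n_obs = 250)  # only the small difference is flagged
diag_diff_sketch(elpd_diff, n_obs = 60)   # "N < 100" overrides the other flag
```

Because every non-reference `elpd_diff` is at most 0, the normal CDF evaluated at 0 cannot fall below 0.5, which is the "by construction" property the documentation describes.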

R/loo_compare.R

Lines changed: 1 addition & 1 deletion
@@ -63,7 +63,7 @@
 #'
 #' The column `diag_elpd` shows the PSIS-LOO Pareto k diagnostic for the
 #' pointwise ELPD computations for each model. If `K k_psis > 0.7` is shown,
-#' where `K` is the number of high high Pareto k values in the PSIS
+#' where `K` is the number of high Pareto k values in the PSIS
 #' computation, then there may be significant bias in `elpd_diff` favoring
 #' models with a large number of high Pareto k values.
 #'
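
To make the `K k_psis > t` label concrete, here is a minimal sketch of how such a label could be built from a vector of Pareto k values. The helper `diag_elpd_sketch()` and the example values are hypothetical and not taken from the package. The threshold is passed in explicitly, with 0.7 as an assumed default, because the documentation describes it as sample-size dependent.

```r
# Hypothetical helper; not code from loo.
# Counts the Pareto k values above `threshold` and formats the label
# "K k_psis > t"; returns an empty string when no value exceeds it.
diag_elpd_sketch <- function(pareto_k, threshold = 0.7) {
  n_high <- sum(pareto_k > threshold)
  if (n_high > 0) paste0(n_high, " k_psis > ", threshold) else ""
}

k_values <- c(0.12, 0.45, 0.81, 0.30, 1.05)  # made-up Pareto k values
diag_elpd_sketch(k_values)                   # two values exceed 0.7
diag_elpd_sketch(c(0.1, 0.2))                # no flag
```

For a fitted `loo` object, the k values could be extracted with `loo::pareto_k_values()` before being passed to a helper like this one.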

man/loo-glossary.Rd

Lines changed: 104 additions & 0 deletions
Some generated files are not rendered by default.

man/loo_compare.Rd

Lines changed: 9 additions & 18 deletions
Some generated files are not rendered by default.
