153 | 153 | #' individual models due to correlation (i.e., if some observations are easier |
154 | 154 | #' and some more difficult to predict for all models). |
155 | 155 | #' |
| 156 | +#' @section `p_worse` (probability of worse predictive performance): |
| 157 | +#' |
| 158 | +#' `p_worse` is the estimated probability that a model has worse predictive |
| 159 | +#' performance than the best-ranked model in the comparison, based on the normal |
| 160 | +#' approximation to the uncertainty in `elpd_diff`. It is computed as |
| 161 | +#' |
| 162 | +#' p_worse = pnorm(0, elpd_diff, se_diff). |
| 163 | +#' |
| 164 | +#' The best-ranked model (the first row in the `loo_compare()` output, where |
| 165 | +#' `elpd_diff = 0`) always receives `NA`, since the comparison is defined |
| 166 | +#' relative to that model. |
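| | +#' |
| | +#' As a minimal sketch of this computation, assuming illustrative values for |
| | +#' `elpd_diff` and `se_diff` (hypothetical numbers, not output from a real |
| | +#' fit): |
| | +#' |
| | +#' ```r |
| | +#' # Hypothetical three-model comparison; row 1 is the best-ranked model |
| | +#' elpd_diff <- c(0, -1.5, -12.0) |
| | +#' se_diff <- c(0, 2.0, 4.0) |
| | +#' # P(difference < 0) under a normal approximation centered at elpd_diff |
| | +#' p_worse <- pnorm(0, mean = elpd_diff, sd = se_diff) |
| | +#' p_worse[1] <- NA # the reference model is not compared with itself |
| | +#' round(p_worse, 2) |
| | +#' ``` |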
| 167 | +#' |
| 168 | +#' Because models are ordered by `elpd_loo` before computing `p_worse`, all |
| 169 | +#' reported values are at least 0.5 by construction. A value close to 0.5 |
| 170 | +#' indicates that the models are nearly indistinguishable in predictive |
| 171 | +#' performance and that the ranking could easily be reversed with different |
| 172 | +#' data. A value close to 1 indicates that the lower-ranked model is almost |
| 173 | +#' certainly worse. `p_worse` inherits all the limitations of `se_diff` and the |
| 174 | +#' normal approximation on which it is based. In particular, when `se_diff` is |
| 175 | +#' underestimated, `p_worse` will be estimated too close to 1, making a model |
| 176 | +#' appear more clearly worse than the data actually support. Conversely, when |
| 177 | +#' `elpd_diff` is biased due to an unreliable LOO approximation, `p_worse` can |
| 178 | +#' point in the wrong direction entirely. When any of these conditions are |
| 179 | +#' present, `diag_diff` or `diag_elpd` will be flagged in the `loo_compare()` |
| 180 | +#' output. See those sections below for further guidance. |
| 181 | +#' |
| 182 | +#' @section `diag_diff` (pairwise comparison diagnostics): |
| 183 | +#' |
| 184 | +#' `diag_diff` is a diagnostic column in the `loo_compare()` output for each |
| 185 | +#' model comparison against the current reference model. It flags conditions |
| 186 | +#' under which the normal approximation behind `se_diff` and `p_worse` is likely |
| 187 | +#' to be poorly calibrated. The column contains a short label when a condition |
| 188 | +#' is detected, and is empty otherwise. |
| 189 | +#' |
| 190 | +#' The column `diag_diff` currently flags two problems: |
| 191 | +#' |
| 192 | +#' ### `N < 100` |
| 193 | +#' |
| 194 | +#' When the number of observations is small, `se_diff` is likely to be |
| 195 | +#' underestimated. As a rough heuristic, one can multiply `se_diff` by 2 to |
| 196 | +#' obtain a more conservative estimate (Bengio and Grandvalet, 2004). |
| 197 | +#' |
| 198 | +#' ### `|elpd_diff| < 4` |
| 199 | +#' |
| 200 | +#' When `|elpd_diff|` is below 4, the models have very similar predictive |
| 201 | +#' performance. In this setting, Sivula et al. (2025) show that skewness in |
| 202 | +#' the error distribution can make the normal approximation for `se_diff` |
| 203 | +#' and `p_worse` miscalibrated, even for large N. In practice, this usually |
| 204 | +#' supports treating the models as predictively similar. |
| 205 | +#' |
| 206 | +#' ### Relation between `N < 100` and `|elpd_diff| < 4` |
| 207 | +#' |
| 208 | +#' The conditions flagged by `diag_diff` are not independent: they tend to |
| 209 | +#' co-occur, and when they do, some flags carry more information than others. |
| 210 | +#' `loo_compare()` therefore follows a priority hierarchy and shows only the |
| 211 | +#' most critical flag in the table output. |
| 212 | +#' |
| 213 | +#' The hierarchy is as follows: |
| 214 | +#' |
| 215 | +#' * **`N < 100` takes highest priority.** A small sample size undermines the |
| 216 | +#' reliability of `se_diff` by underestimating uncertainty. Because of this, |
| 217 | +#' even if `|elpd_diff| < 4` is also true for a comparison, the table will only |
| 218 | +#' show `N < 100`. The small sample size renders the `|elpd_diff| < 4` |
| 219 | +#' diagnostic less meaningful. |
| 220 | +#' |
| 221 | +#' * **`|elpd_diff| < 4` takes second priority.** When N >= 100 and the |
| 222 | +#' difference is small, the normal approximation can be miscalibrated due to |
| 223 | +#' the skewness of the error distribution (Sivula et al., 2025). In this |
| 224 | +#' situation, `se_diff` is not heavily biased in scale, but the shape of the |
| 225 | +#' approximation is wrong, which makes `p_worse` unreliable. |
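| | +#' |
| | +#' The priority order above can be sketched as follows. The helper |
| | +#' `diag_diff_label()` and its arguments are hypothetical and only illustrate |
| | +#' the hierarchy; this is not the internal implementation of `loo_compare()`: |
| | +#' |
| | +#' ```r |
| | +#' # Hypothetical helper: return the single highest-priority flag, if any |
| | +#' diag_diff_label <- function(n_obs, elpd_diff) { |
| | +#'   if (n_obs < 100) { |
| | +#'     "N < 100" # checked first, so it masks the small-difference flag |
| | +#'   } else if (abs(elpd_diff) < 4) { |
| | +#'     "|elpd_diff| < 4" |
| | +#'   } else { |
| | +#'     "" # no condition detected |
| | +#'   } |
| | +#' } |
| | +#' diag_diff_label(n_obs = 50, elpd_diff = -1) # both hold, small N is shown |
| | +#' diag_diff_label(n_obs = 500, elpd_diff = -1) # only the small difference |
| | +#' ``` |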
| 226 | +#' |
| 227 | +#' @section `diag_elpd`: |
| 228 | +#' |
| 229 | +#' `diag_elpd` is a diagnostic column in the `loo_compare()` output that flags |
| 230 | +#' when the PSIS-LOO approximation for an individual model is unreliable. Unlike |
| 231 | +#' `diag_diff`, which concerns the *comparison* between models, `diag_elpd` |
| 232 | +#' concerns the quality of the `elpd_loo` estimate for each model individually. |
| 233 | +#' It contains a short text label when a problem is detected, and is empty |
| 234 | +#' otherwise. |
| 235 | +#' |
| 236 | +#' ### `K k_psis > t` (K observations with Pareto-k values > t) |
| 237 | +#' |
| 238 | +#' This label indicates that K observations for this model have Pareto-k values |
| 239 | +#' above the PSIS reliability threshold `t` used by `loo` for that fit. The |
| 240 | +#' threshold is sample-size dependent, and in many practical cases is close to |
| 241 | +#' 0.7. When this flag appears, the PSIS approximation can be unreliable for |
| 242 | +#' those observations, and the resulting `elpd_loo` may be biased. Because |
| 243 | +#' `elpd_diff` is a direct difference of two models' `elpd_loo` values, bias in |
| 244 | +#' either model's estimate propagates directly into `elpd_diff` and `p_worse`. |
| 245 | +#' This is qualitatively different from the calibration issues flagged by |
| 246 | +#' `diag_diff`: here the estimate itself may be wrong, not just uncertain. |
| 247 | +#' |
| 248 | +#' For further information on Pareto-k values, see the "Pareto k estimates" |
| 249 | +#' section. |
156 | 250 | NULL |