50 commits
- `a909cea` add diff diagnostics (avehtari, Aug 31, 2025)
- `b940192` add comments and option whether the probabilities are printed (avehtari, Oct 16, 2025)
- `6fe41bb` Apply Jonah's suggestions from code review (avehtari, Oct 17, 2025)
- `fa0d79d` fix argument name (avehtari, Oct 17, 2025)
- `65ef286` add more documentation (+ khat threshold 0.5 as we don't yet smooth) (avehtari, Oct 17, 2025)
- `b0ebac7` oops, fix khat comparison (avehtari, Oct 17, 2025)
- `7410b81` skip pareto-k check for low number of unique values (avehtari, Oct 17, 2025)
- `b993b44` remove `simplify` argument (need to check reverse dependencies) (jgabry, Oct 17, 2025)
- `b118da2` remove simplify argument from print.compare.loo_ss (jgabry, Oct 17, 2025)
- `25f93fd` start fixing tests (jgabry, Oct 17, 2025)
- `62027d1` fix issues in preliminary reverse dependency checks (jgabry, Oct 17, 2025)
- `cc3edfe` Update loo_compare.R (jgabry, Oct 17, 2025)
- `22f442f` make sure p_worse is available (jgabry, Oct 17, 2025)
- `ca843f4` use x instead of xcopy (jgabry, Oct 17, 2025)
- `d579c06` Update loo_compare.R (jgabry, Oct 17, 2025)
- `cf20250` Merge branch 'master' into diff-diagnostics (jgabry, Oct 18, 2025)
- `846a891` unify diagnostic messages (avehtari, Oct 18, 2025)
- `64b365c` improved loo_compare documentation (avehtari, Oct 18, 2025)
- `185e570` add subsections to loo_compare doc and put diagnostic messages in bul… (jgabry, Oct 18, 2025)
- `8625e10` minor cleanup (jgabry, Oct 18, 2025)
- `64a19c3` Add `model` column to `loo_compare()` output (jgabry, Oct 18, 2025)
- `a84154e` remove old loo::compare() (jgabry, Oct 18, 2025)
- `16f67d4` improve backwards compatibility (jgabry, Oct 18, 2025)
- `9b43e06` Merge branch 'diff-diagnostics' into model-names-as-column (jgabry, Oct 18, 2025)
- `4e16d2f` Update loo_compare.R (jgabry, Oct 18, 2025)
- `23a79c0` fix failing test (jgabry, Oct 18, 2025)
- `cff3c2c` Revert "remove old loo::compare()" (jgabry, Oct 18, 2025)
- `8f521a5` update tests (jgabry, Oct 18, 2025)
- `dc9db69` cleanup print method (jgabry, Oct 19, 2025)
- `3ed1c0c` improve backwards compatibility (jgabry, Oct 19, 2025)
- `89d39f5` change diag_pnorm to diag_diff (avehtari, Oct 21, 2025)
- `6d73537` change diag_pnorm to diag_diff in tests (avehtari, Oct 21, 2025)
- `abcf209` update test snapshots (jgabry, Oct 21, 2025)
- `794ffb0` add `model` column instead of row names (jgabry, Oct 21, 2025)
- `a375c93` remove row numbers when printing (jgabry, Oct 21, 2025)
- `f3fcb28` add diag_elpd (avehtari, Oct 22, 2025)
- `359853a` improve loo_compare doc (avehtari, Oct 22, 2025)
- `b7db7dd` clarifiy loo_compare diag_diff khat (avehtari, Oct 22, 2025)
- `77e753f` yet another small doc improvement (avehtari, Oct 22, 2025)
- `c59414e` Use function() (avehtari, Oct 22, 2025)
- `93fdbff` another loo_compare doc edit (avehtari, Oct 22, 2025)
- `ae48d69` adjust some diagnostic messages and documentation (avehtari, Oct 22, 2025)
- `4f5872b` edit doc, fix tests, move diagnostics to internal functions (jgabry, Oct 22, 2025)
- `574af4f` Merge branch 'master' into diff-diagnostics (jgabry, Oct 22, 2025)
- `3d008e9` Merge branch 'master' into diff-diagnostics (jgabry, Oct 28, 2025)
- `57150b7` Merge branch 'master' of github.com:stan-dev/loo into diff-diagnostics (avehtari, Feb 15, 2026)
- `b0b6ef8` drop k_diff (avehtari, Feb 15, 2026)
- `b8d0782` docs: update loo-glossary and documentation wrt diag_diff and diag_elpd (Apr 2, 2026)
- `c3ce77a` style: fix typos and grammar (Apr 2, 2026)
- `a49c641` chore: add user message with link to loo-glossary in loo_compare outputx (Apr 2, 2026)
94 changes: 94 additions & 0 deletions R/loo-glossary.R
@@ -153,4 +153,98 @@
#' individual models due to correlation (i.e., if some observations are easier
#' and some more difficult to predict for all models).
#'
#' @section `p_worse` (probability of worse predictive performance):
#'
#' `p_worse` is the estimated probability that a model has worse predictive
#' performance than the best-ranked model in the comparison, based on the normal
#' approximation to the uncertainty in `elpd_diff`. It is computed as
#'
#' p_worse = pnorm(0, elpd_diff, se_diff).
#'
#' The best-ranked model (the first row in the `loo_compare()` output, where
#' `elpd_diff = 0`) always receives `NA`, since the comparison is defined
#' relative to that model.
#'
#' Because models are ordered by `elpd_loo` before computing `p_worse`, all
#' reported values are at least 0.5 by construction. A value close to 0.5
#' indicates that the models are nearly indistinguishable in predictive
#' performance and that the ranking could easily be reversed with different
#' data. A value close to 1 indicates that the lower-ranked model is almost
#' certainly worse. `p_worse` inherits all the limitations of `se_diff` and the
#' normal approximation on which it is based. In particular, when `se_diff` is
#' underestimated, `p_worse` will be estimated too close to 1, making a model
#' appear more clearly worse than the data actually support. Conversely, when
#' `elpd_diff` is biased due to an unreliable LOO approximation, `p_worse` can
#' point in the wrong direction entirely. When any of these conditions is
#' present, `diag_diff` or `diag_elpd` will be flagged in the `loo_compare()`
#' output. See those sections below for further guidance.
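As a minimal sketch of the formula above, with made-up `elpd_diff` and `se_diff` values standing in for a lower-ranked model's row in the `loo_compare()` output:

```r
# Hypothetical values as they might appear for a lower-ranked model
elpd_diff <- -6.2  # elpd_diff <= 0 for all models except the best-ranked one
se_diff <- 3.1

# Probability that this model has worse predictive performance than the
# best-ranked model, under the normal approximation to elpd_diff
p_worse <- pnorm(0, mean = elpd_diff, sd = se_diff)

# Because elpd_diff <= 0 after ordering, p_worse is always >= 0.5;
# this is equivalent to pnorm(-elpd_diff / se_diff)
```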
#'
#' @section `diag_diff` (pairwise comparison diagnostics):
#'
#' `diag_diff` is a diagnostic column in the `loo_compare()` output for each
#' model comparison against the current reference model. It flags conditions
#' under which the normal approximation behind `se_diff` and `p_worse` is likely
#' to be poorly calibrated. The column contains a short label when a condition
#' is detected, and is empty otherwise.
#'
#' The column `diag_diff` currently flags two problems:
#'
#' ### `N < 100`
#'
#' When the number of observations is small, `se_diff` is likely to be
#' underestimated. As a rough heuristic, one can multiply `se_diff` by 2 to
#' obtain a more conservative estimate (Bengio and Grandvalet, 2004).
> **Review comment on lines +195 to +196 (avehtari, Member, Author):** Although the LOO-uncertainty paper says the results by Bengio and Grandvalet (2004) can be used to justify the multiplier 2, they do not state it directly, and thus citing them is likely to cause confusion.
#'
#' ### `|elpd_diff| < 4`
#'
#' When `|elpd_diff|` is below 4, the models have very similar predictive
#' performance. In this setting, Sivula et al. (2025) show that skewness in
#' the error distribution can make the normal approximation for `se_diff`
#' and `p_worse` miscalibrated, even for large N. In practice, this usually
#' supports treating the models as predictively similar.
#'
#' ### Relation between `N < 100` and `|elpd_diff| < 4`
#'
#' The conditions flagged by `diag_diff` are not independent: they tend to
#' co-occur, and when they do, some flags carry more information than others.
#' `loo_compare()` therefore follows a priority hierarchy and shows only the
#' most critical flag in the table output.
#'
#' The hierarchy is as follows:
#'
#' * **`N < 100` takes highest priority.** A small sample size undermines the
#' reliability of `se_diff` by underestimating uncertainty. Because of this,
#' even if `|elpd_diff| < 4` is also true for a comparison, the table will only
#' show `N < 100`. The small sample size renders the `|elpd_diff| < 4`
#' diagnostic less meaningful.
#'
#' * **`|elpd_diff| < 4` takes second priority.** When N >= 100 and the
#' difference is small, the normal approximation is miscalibrated due to the
#' skewness of the error distribution (Sivula et al., 2025). In this
#' situation, `se_diff` exists and is not heavily biased in scale, but the
#' shape of the approximation is wrong, making `p_worse` unreliable.
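The priority hierarchy above can be sketched as a small helper; `diag_diff_label()` here is a hypothetical function mirroring the described behavior, not the package's internal implementation:

```r
# Hypothetical helper illustrating the flag priority described above
diag_diff_label <- function(n, elpd_diff) {
  if (n < 100) {
    "N < 100"             # highest priority: se_diff itself is unreliable
  } else if (abs(elpd_diff) < 4) {
    "|elpd_diff| < 4"     # second priority: shape of the approximation
  } else {
    ""                    # empty: no condition detected
  }
}

diag_diff_label(80, -2)    # "N < 100" wins even though |elpd_diff| < 4 also holds
diag_diff_label(500, -2)   # "|elpd_diff| < 4"
diag_diff_label(500, -10)  # "" (no flag)
```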
#'
#' @section `diag_elpd`:
#'
#' `diag_elpd` is a diagnostic column in the `loo_compare()` output that flags
#' when the PSIS-LOO approximation for an individual model is unreliable. Unlike
#' `diag_diff`, which concerns the *comparison* between models, `diag_elpd`
#' concerns the quality of the `elpd_loo` estimate for each model individually.
#' It contains a short text label when a problem is detected, and is empty
#' otherwise.
#'
#' ### `K k_psis > t` (K observations with Pareto-k values > t)
#'
#' This label indicates that K observations for this model have Pareto-k values
#' above the PSIS reliability threshold `t` used by `loo` for that fit. The
#' threshold is sample-size dependent, and in many practical cases close to
#' 0.7. When this flag appears, the PSIS approximation can be unreliable for
#' those observations, and the resulting `elpd_loo` may be biased. Because
#' `elpd_diff` is a direct difference of two models' `elpd_loo` values, bias in
#' either model's estimate propagates directly into `elpd_diff` and `p_worse`.
#' This is qualitatively different from the calibration issues flagged by
#' `diag_diff`: here the estimate itself may be wrong, not just uncertain.
#'
#' For further information on Pareto-k values, see the "Pareto k estimates"
#' section.
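The flag can be reproduced from the PSIS diagnostics. A sketch with made-up Pareto-k values (in practice one would extract them with `pareto_k_values()` from a fitted `loo` object), assuming the sample-size-dependent threshold `min(1 - 1/log10(S), 0.7)`:

```r
S <- 4000                                # number of posterior draws (hypothetical)
t <- min(1 - 1 / log10(S), 0.7)          # sample-size-dependent threshold

# Hypothetical Pareto-k values; in practice: pareto_k_values(loo_fit)
khat <- c(0.31, 0.74, 0.55, 0.82, 0.12)

K <- sum(khat > t)                       # observations exceeding the threshold
label <- if (K > 0) sprintf("%d k_psis > %.1f", K, t) else ""
```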
NULL