diff --git a/.Rbuildignore b/.Rbuildignore index bdddba20..65caed95 100644 --- a/.Rbuildignore +++ b/.Rbuildignore @@ -29,6 +29,7 @@ framed.sty ^CRAN-SUBMISSION$ ^\.github$ ^\.claude$ +^CLAUDE\.md$ ^\.git$ ^\.vscode$ ^doc$ @@ -54,6 +55,6 @@ framed.sty ^vignettes/\.quarto$ ^vignettes/.*_files$ ^vignettes/.*\.html$ -# Stray default-device plot dumps from running examples/tests locally -# (already in .gitignore; this keeps them out of the build tarball too) +# Stray default-device plot artefact (created by plotting with no open device), +# at the root or under tests/ — never ship it. Rplots\.pdf$ diff --git a/.gitignore b/.gitignore index 23cadad1..2090f12b 100644 --- a/.gitignore +++ b/.gitignore @@ -49,9 +49,20 @@ vignettes/ggRandomForests.html vignettes/varpro_cache/ vignettes/varpro_files/ vignettes/varpro.html +vignettes/uvarpro_cache/ +vignettes/uvarpro_files/ +vignettes/uvarpro.html .claude +CLAUDE.md .positai -# Local dev tool state (brainstorm/superpowers) +# Brainstorm / visual-companion scratch .superpowers/ +# Rendered vignette artifacts (current quarto vignette names) +vignettes/ggRandomForests-regression.html +vignettes/ggRandomForests-survival.html +vignettes/ggRandomForests-regression_files/ +vignettes/ggRandomForests-survival_files/ +# Built package tarballs +*.tar.gz diff --git a/CRAN-SUBMISSION b/CRAN-SUBMISSION index 4f414c9f..7051de84 100644 --- a/CRAN-SUBMISSION +++ b/CRAN-SUBMISSION @@ -1,3 +1,3 @@ -Version: 3.2.0 -Date: 2026-06-23 16:43:46 UTC -SHA: 9892d5abc8c9278acd19835059bbe78ff887c7bc +Version: 3.4.0 +Date: 2026-07-02 16:27:59 UTC +SHA: 16c8b4d791827785b68e78279b89c183b9d35d09 diff --git a/DESCRIPTION b/DESCRIPTION index c3921a83..2d6dc3e6 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -2,7 +2,7 @@ Package: ggRandomForests Type: Package Title: Visually Exploring Random Forests Version: 4.0.0.9000 -Date: 2026-06-23 +Date: 2026-07-02 Authors@R: person("John", "Ehrlinger", role = c("aut", "cre"), email = "john.ehrlinger@gmail.com") diff --git a/NAMESPACE b/NAMESPACE index b4680c50..94af30d3 100644 --- a/NAMESPACE +++ b/NAMESPACE @@ -25,13 +25,19 @@ S3method(calc_roc,rfsrc) S3method(gg_auct,rhf) S3method(gg_beta_uvarpro,default) S3method(gg_beta_uvarpro,uvarpro) +S3method(gg_beta_varpro,default) S3method(gg_beta_varpro,varpro) +S3method(gg_brier,default) S3method(gg_brier,rfsrc) +S3method(gg_error,default) S3method(gg_error,randomForest) S3method(gg_error,randomForest.formula) S3method(gg_error,rfsrc) +S3method(gg_isopro,default) S3method(gg_isopro,isopro) +S3method(gg_ivarpro,default) S3method(gg_ivarpro,varpro) +S3method(gg_rfsrc,default) S3method(gg_rfsrc,randomForest) S3method(gg_rfsrc,rfsrc) S3method(gg_rhf,rhf) @@ -42,8 +48,10 @@ S3method(gg_sdependent,default) S3method(gg_sdependent,uvarpro) S3method(gg_survival,default) S3method(gg_survival,rfsrc) +S3method(gg_variable,default) S3method(gg_variable,randomForest) S3method(gg_variable,rfsrc) +S3method(gg_vimp,default) S3method(gg_vimp,randomForest) S3method(gg_vimp,rfsrc) S3method(plot,gg_auct) diff --git a/NEWS.md b/NEWS.md index 4e9f48d0..f0ead785 100644 --- a/NEWS.md +++ b/NEWS.md @@ -17,6 +17,37 @@ ggRandomForests v4.0.0 (development) AUC(t) with a bootstrap CI ribbon when available and a 0.5 reference line. `gg_auct.rhf(object, marker, auct_fit = NULL)` computes `auct.rhf()` internally or reuses a cached fit. + +ggRandomForests v3.4.1 +====================== +* The remaining `rfsrc`/`randomForest` wrappers -- `gg_error()`, `gg_vimp()`, + `gg_variable()`, `gg_rfsrc()`, and `gg_brier()` -- now have `default` S3 + methods, so a wrong-class input gives a clear "expected an 'rfsrc' or + 'randomForest' object" error (naming the class it got) instead of R's generic + "no applicable method". This finishes the dispatch-consistency pass started + for the varPro family in 3.4.0. (`gg_roc()` keeps its existing + `gg_roc.rfsrc` default, which accepts rfsrc-shaped objects.) + +ggRandomForests v3.4.0 +====================== +* `gg_isopro()`, `gg_beta_varpro()`, and `gg_ivarpro()` now have `default` S3 + methods, so a wrong-class input gives a clear "expected a '' object" + error (naming the class it got) instead of R's generic "no applicable + method". This makes the varPro-family wrappers consistent with + `gg_beta_uvarpro()` / `gg_sdependent()`; the previously-unreachable inner + class checks were removed. +* Fix: `gg_partial_rfsrc()` now computes partial dependence correctly for + `factor` predictors. It was passing factor *labels* as + `partial.values` to `randomForestSRC::partial.rfsrc()`, which imposes a + level by its integer code (internally `as.numeric(partial.values)`). + Character labels ("No"/"Yes") became `NA` and numeric-looking labels + ("4"/"6"/"8") became out-of-range codes, so every level collapsed to a + single value (a flat categorical partial plot). The wrapper now passes the + integer codes and relabels the output, matching `plot.variable(partial = + TRUE)` and the ground-truth partial dependence. The categorical `x` is now + returned as a `factor` in the model's level order, so the plot keeps that + order instead of re-sorting alphabetically. Continuous and numeric + low-cardinality predictors are unaffected. * `gg_beta_uvarpro()` / `plot.gg_beta_uvarpro()`: tidy wrapper and bar chart for `varPro::get.beta.entropy()` -- the unsupervised analogue of `gg_beta_varpro()`. From a `uvarpro()` fit it aggregates the per-region @@ -31,8 +62,41 @@ ggRandomForests v4.0.0 (development) graph) with the "which variables are signal" ranking; shares the `beta_fit` entropy matrix. Follows the `get.beta.entropy` + `sdependent` workflow from the `varPro::uvarpro()` help (iowa-housing example). -* Fixes the intro vignette's placeholder `\VignetteIndexEntry` - ("Vignette's Title" -> "Exploring Random Forests with ggRandomForests"). +* New `uvarpro` vignette: a short, focused walk-through of the unsupervised + varPro wrappers (`gg_udependent()`, `gg_beta_uvarpro()`, `gg_sdependent()`) + on a single `uvarpro()` fit, using the shared `beta_fit` matrix. The three + unsupervised sections were lifted out of the `varpro` vignette, which now + points to the new one and covers the five supervised wrappers. +* Fixed the main vignette's `\VignetteIndexEntry`, which still carried the + template placeholder "Vignette's Title" -- it now reads "Exploring Random + Forests with ggRandomForests" (the index entry CRAN lists, not the document + title, was the stale one). + +ggRandomForests v3.3.0 +====================== +* `gg_partial_varpro()`: **classification partial plots now default to + probability.** `scale = "auto"` on a classification fit resolves to `"prob"` + (P(Y = target class)) instead of raw log-odds; `"odds"` and `"logodds"` are + options. The back-transform is applied before averaging (mean predicted + probability). The `causal` contrast is shown only on `"logodds"`. +* `gg_partial_varpro()`: **survival partial plots now default to survival + probability.** `scale = "auto"` on a survival fit resolves to `"surv"` + (S(tau | x), bounded 0-1) via a new partialpro learner, instead of the + unbounded ensemble-mortality score (still available via + `scale = "mortality"`). `"surv"` and `"rmst"` default `tau` to the median + follow-up time when `time` is omitted -- a units-safe, data-driven horizon + (v3.2.0's `rmst` required `time`; this is a loosening). The resolved `tau` is + reported in a message and the axis label. +* `plot.gg_partial_varpro()`: documents what the `causal` (virtual-twins) + estimator is and when to use it, and explains why it is hidden on the bounded + probability scales. +* Documentation: `plot.gg_partial_varpro()` gains a "Reading an RMST curve" + section explaining how to interpret the `scale = "rmst"` y-axis -- RMST(tau) + is the expected event-free time within the first tau time-units (area under + S(t) out to tau), read in the model's own time units, bounded by tau, and + higher-is-better (the opposite direction from ensemble mortality). It also + notes that tau must be supplied in the fit's time units, since a tau beyond + the largest event time truncates to the full restricted mean. No code change. ggRandomForests v3.2.0 ====================== diff --git a/R/gg_beta_uvarpro.R b/R/gg_beta_uvarpro.R index 7090a3ff..4803e134 100644 --- a/R/gg_beta_uvarpro.R +++ b/R/gg_beta_uvarpro.R @@ -77,10 +77,6 @@ gg_beta_uvarpro.default <- function(object, ..., cutoff = NULL, #' @export gg_beta_uvarpro.uvarpro <- function(object, ..., cutoff = NULL, beta_fit = NULL) { - if (!inherits(object, "uvarpro")) { - stop("gg_beta_uvarpro: expected a 'uvarpro' object from varPro::uvarpro().", - call. = FALSE) - } .assert_scalar_numeric_or_null(cutoff, "cutoff", "gg_beta_uvarpro") # Resolve the beta matrix (cache path) diff --git a/R/gg_beta_varpro.R b/R/gg_beta_varpro.R index 4648eb07..eca03bd6 100644 --- a/R/gg_beta_varpro.R +++ b/R/gg_beta_varpro.R @@ -183,13 +183,17 @@ gg_beta_varpro <- function(object, ..., cutoff = NULL, beta_fit = NULL, UseMethod("gg_beta_varpro", object) } +#' @export +gg_beta_varpro.default <- function(object, ..., cutoff = NULL, + beta_fit = NULL, which_class = NULL) { + stop("gg_beta_varpro: expected a 'varpro' object from varPro::varpro(); ", + "got an object of class ", paste(class(object), collapse = "/"), ".", + call. = FALSE) +} + #' @export gg_beta_varpro.varpro <- function(object, ..., cutoff = NULL, beta_fit = NULL, which_class = NULL) { - if (!inherits(object, "varpro")) { - stop("gg_beta_varpro: expected a 'varpro' object from varPro::varpro().", - call. = FALSE) - } fam <- object$family if (!fam %in% c("regr", "class")) { stop(sprintf( diff --git a/R/gg_brier.R b/R/gg_brier.R index 7559df61..fde3398d 100644 --- a/R/gg_brier.R +++ b/R/gg_brier.R @@ -119,6 +119,13 @@ gg_brier <- function(object, ...) { UseMethod("gg_brier", object) } +#' @export +gg_brier.default <- function(object, ...) { + stop("gg_brier: expected an 'rfsrc' survival object from ", + "randomForestSRC::rfsrc(); got an object of class ", + paste(class(object), collapse = "/"), ".", call. = FALSE) +} + #' @export gg_brier.rfsrc <- function(object, subset = NULL, diff --git a/R/gg_error.R b/R/gg_error.R index e8c8948f..4283cd63 100644 --- a/R/gg_error.R +++ b/R/gg_error.R @@ -204,6 +204,13 @@ gg_error <- function(object, ...) { UseMethod("gg_error", object) } + +#' @export +gg_error.default <- function(object, ...) { + stop("gg_error: expected an 'rfsrc' or 'randomForest' object; ", + "got an object of class ", paste(class(object), collapse = "/"), ".", + call. = FALSE) +} #' @export gg_error.rfsrc <- function(object, ...) { ## Check that the input object is of the correct type. diff --git a/R/gg_isopro.R b/R/gg_isopro.R index 59384e08..5ec2e30f 100644 --- a/R/gg_isopro.R +++ b/R/gg_isopro.R @@ -172,12 +172,14 @@ gg_isopro <- function(object, ..., newdata = NULL) { } #' @export -gg_isopro.isopro <- function(object, ..., newdata = NULL) { - if (!inherits(object, "isopro")) { - stop("gg_isopro expects a 'isopro' object from varPro::isopro().", - call. = FALSE) - } +gg_isopro.default <- function(object, ..., newdata = NULL) { + stop("gg_isopro: expected an 'isopro' object from varPro::isopro(); ", + "got an object of class ", paste(class(object), collapse = "/"), ".", + call. = FALSE) +} +#' @export +gg_isopro.isopro <- function(object, ..., newdata = NULL) { ntree <- tryCatch( as.integer(object$isoforest$ntree), error = function(e) NA_integer_ diff --git a/R/gg_ivarpro.R b/R/gg_ivarpro.R index b8501705..2981c180 100644 --- a/R/gg_ivarpro.R +++ b/R/gg_ivarpro.R @@ -158,14 +158,19 @@ gg_ivarpro <- function(object, ..., which_obs = NULL, which_class = NULL, UseMethod("gg_ivarpro", object) } +#' @export +gg_ivarpro.default <- function(object, ..., which_obs = NULL, + which_class = NULL, cutoff = NULL, + ivarpro_fit = NULL) { + stop("gg_ivarpro: expected a 'varpro' object from varPro::varpro(); ", + "got an object of class ", paste(class(object), collapse = "/"), ".", + call. = FALSE) +} + #' @export gg_ivarpro.varpro <- function(object, ..., which_obs = NULL, which_class = NULL, cutoff = NULL, ivarpro_fit = NULL) { - if (!inherits(object, "varpro")) { - stop("gg_ivarpro: expected a 'varpro' object from varPro::varpro().", - call. = FALSE) - } fam <- object$family if (!fam %in% c("regr", "class")) { stop(sprintf( diff --git a/R/gg_partial_rfsrc.R b/R/gg_partial_rfsrc.R index 9b70302d..0b7c338f 100644 --- a/R/gg_partial_rfsrc.R +++ b/R/gg_partial_rfsrc.R @@ -85,7 +85,8 @@ #' (the level of \code{xvar2.name}) and \code{time} (survival forests #' only) for all continuous predictors.} #' \item{categorical}{A \code{data.frame} with the same columns but -#' \code{x} kept as character, for low-cardinality predictors.} +#' \code{x} kept as a \code{factor} (levels in the model's level order, +#' not alphabetical), for low-cardinality predictors.} #' } #' #' @seealso \code{\link{gg_partial}}, \code{\link[randomForestSRC]{partial.rfsrc}}, @@ -195,8 +196,9 @@ snap_partial_time <- function(rf_model, partial.time) { ## Build the evaluation grid (xval vector + categorical flag) for one variable. make_eval_grid <- function(xname, newx, cat_limit, n_eval) { # Use `[[` to preserve the column's class (factor, character, numeric, …). - # unlist(dplyr::select(...)) would coerce factors to integer codes, breaking - # both the cat_limit check and the partial.values passed to partial.rfsrc(). + # unlist(dplyr::select(...)) would coerce factors to their integer codes, + # losing the level labels we need for the cat_limit check and for mapping + # partial.rfsrc()'s numeric output back to labels (see partial_one_var()). xval <- newx[[xname]] xval <- xval[!is.na(xval)] if (length(xval) == 0L) { @@ -207,14 +209,22 @@ make_eval_grid <- function(xname, newx, cat_limit, n_eval) { return(NULL) } gr <- is.factor(xval) || is.character(xval) || length(unique(xval)) < cat_limit - if (!gr && length(unique(xval)) > n_eval) { + flevels <- NULL + if (is.factor(xval)) { + # partial.rfsrc() imposes a factor level by its INTEGER CODE -- internally + # it does as.numeric(partial.values). Passing the labels coerces character + # levels ("No"/"Yes") to NA, or numeric-looking labels ("4"/"6"/"8") to + # out-of-range codes, and every level collapses to a single value. Pass the + # codes here and relabel get.partial.plot.data()'s output in partial_one_var(). + flevels <- levels(xval) + present <- levels(droplevels(xval)) # levels actually present in newx + xval <- match(present, flevels) # codes in the model's factor coding + } else if (!gr && length(unique(xval)) > n_eval) { xval <- quantile_pts(xval, groups = n_eval) - } else if (is.factor(xval)) { - xval <- levels(droplevels(xval)) # preserve factor ordering; drop unused levels } else { xval <- sort(unique(xval)) } - list(xval = xval, categorical = gr) + list(xval = xval, categorical = gr, flevels = flevels) } ## Thin wrapper around partial.rfsrc that builds the argument list. @@ -250,6 +260,11 @@ partial_one_var <- function(xname, newx, rf_model, is_surv, partial.time, partial.type, xvar2.name, x2val) pout <- randomForestSRC::get.partial.plot.data(partial.obj, granule = gr) + # Factor levels were passed to partial.rfsrc() as integer codes; map the + # numeric x that comes back to the original level labels. + if (!is.null(eg$flevels)) { + pout$x <- eg$flevels[as.integer(pout$x)] + } # Survival forests with >1 partial.time return yhat as an # [length(partial.values) x length(partial.time)] matrix; expand to long form # so each (x, time) pair is its own row. For non-survival or single-time @@ -320,6 +335,13 @@ split_partial_result <- function(pdta) { continuous$type <- NULL categorical <- pdta[!cont_idx, , drop = FALSE] categorical$type <- NULL + # Keep the model's factor-level order. The rows arrive blocked by level in + # ascending code order (i.e. the model's level order), so first-appearance is + # that order; make x a factor so plot.gg_partial_rfsrc()'s factor(x) does not + # re-sort the levels alphabetically. + if (nrow(categorical) > 0L) { + categorical$x <- factor(categorical$x, levels = unique(categorical$x)) + } result <- list(continuous = continuous, categorical = categorical) class(result) <- "gg_partial_rfsrc" result diff --git a/R/gg_partial_varpro.R b/R/gg_partial_varpro.R index c0605de9..f11ad267 100644 --- a/R/gg_partial_varpro.R +++ b/R/gg_partial_varpro.R @@ -74,9 +74,13 @@ #' \code{part_dta} is \code{NULL} it is passed to #' \code{varPro::partialpro(object)} for you. Required when #' \code{scale \%in\% c("surv","chf")}. -#' @param scale Character; sets the y-axis label and, for survival forests, -#' the output type. One of \code{"auto"} (default), \code{"mortality"}, -#' \code{"rmst"}, \code{"surv"}, or \code{"chf"}. +#' @param scale Character; the y-axis scale. One of \code{"auto"} (default), +#' the classification scales \code{"prob"} / \code{"odds"} / \code{"logodds"}, +#' or the survival scales \code{"rmst"} / \code{"surv"} / \code{"mortality"} / +#' \code{"chf"}. With \code{"auto"}: classification fits resolve to +#' \code{"prob"} (probability of the target class) and survival fits to +#' \code{"surv"} (survival probability at a default horizon \eqn{\tau}); see +#' \strong{Details}. #' @param time Numeric; the evaluation time point. Required when #' \code{scale = "rmst"} (the RMST horizon \eqn{\tau}), where it now #' \emph{drives} the partial computation through an RMST(\eqn{\tau}) @@ -124,6 +128,27 @@ #' A \eqn{\tau} beyond the model's largest event time is truncated there #' (with a warning), since \eqn{S(t)} cannot be extrapolated. #' +#' **Classification scale (scale = "prob"/"odds"/"logodds"):** +#' \code{varPro::partialpro} returns classification effects as \emph{log-odds} +#' of the target class. \code{scale = "prob"} (the classification default) +#' back-transforms to probability \eqn{P(Y = \mathrm{target})}, \code{"odds"} to +#' the odds, and \code{"logodds"} keeps the raw scale. The back-transform is +#' applied per observation \emph{before} averaging, so the curve is the mean +#' predicted probability, not the probability of the mean log-odds. The +#' \code{causal} contrast is shown only on \code{"logodds"} (see +#' \code{\link{plot.gg_partial_varpro}}). +#' +#' **Survival probability (scale = "surv"):** \code{scale = "surv"} (the +#' survival default) computes \eqn{S(\tau \mid x)} through \code{partialpro} +#' (the same UVT engine as mortality and RMST), bounded in \eqn{[0, 1]}. When +#' \code{time} is not supplied, \eqn{\tau} defaults to the \strong{median +#' follow-up time} of the fit -- a data-driven horizon that is always in the +#' model's own time units, so it cannot be mis-specified the way a hand-typed +#' \eqn{\tau} can. The resolved \eqn{\tau} is reported in a message and the +#' axis label; pass \code{time = tau} to choose another. \code{scale = +#' "mortality"} keeps the unbounded ensemble-mortality score as an explicit +#' opt-in. +#' #' **Ensemble mortality (scale = "mortality"):** here the y-axis is #' \emph{ensemble mortality}, the expected number of events a subject would #' see if they were exposed to the study-average cumulative hazard. It is @@ -152,6 +177,11 @@ #' Random survival forests. \emph{The Annals of Applied Statistics}, #' \bold{2}(3), 841--860. \doi{10.1214/08-AOAS169}. #' +#' Ishwaran H, Blackstone EH (2025). +#' Harnessing the power of virtual (digital) twins: Graphical causal tools for +#' understanding patient and hospital differences. +#' \emph{Computational and Structural Biotechnology Journal}, \bold{28}, 312. +#' #' @seealso \code{\link{plot.gg_partial_varpro}}, #' \code{\link{gg_varpro}}, \code{\link{gg_vimp}}, #' \code{\link{gg_partialpro}} (deprecated), @@ -184,8 +214,8 @@ #' @export gg_partial_varpro <- function(part_dta = NULL, object = NULL, - scale = c("auto", "rmst", "mortality", - "surv", "chf"), + scale = c("auto", "prob", "odds", "logodds", + "rmst", "surv", "mortality", "chf"), time = NULL, nvars = NULL, cat_limit = 10, @@ -200,11 +230,21 @@ gg_partial_varpro <- function(part_dta = NULL, "'part_dta' is supplied (nothing is recomputed).", call. = FALSE) } - ## ---- C-path: route through gg_partial_rfsrc ---------------------------- - if (!is.null(object) && scale %in% c("surv", "chf")) { + ## Resolve 'auto' to a concrete scale once; all routing, conversion, labels + ## and provenance below use the resolved value. (Validation above used the + ## raw scale, which must still distinguish 'auto'.) + scale <- .resolve_varpro_scale( + scale, if (!is.null(object)) object$family else NA_character_) + + ## ---- C-path: route CHF through gg_partial_rfsrc ------------------------ + ## (surv now uses the partialpro S(t) learner on path A, below.) + if (!is.null(object) && scale == "chf") { return(.gg_partial_varpro_cpath(object, scale, time, model)) } + ## ---- Survival default horizon: surv/rmst fill tau from the data -------- + time <- .resolve_default_tau(part_dta, object, scale, time) + ## ---- RMST guardrails (A-path) ------------------------------------------ .warn_varpro_rmst(part_dta, object, scale, time) @@ -217,19 +257,24 @@ gg_partial_varpro <- function(part_dta = NULL, ## object-driven path can select/limit variables the same way an explicit ## partialpro() call would -- otherwise it falls back to get.topvars(object). if (is.null(part_dta)) { - part_dta <- if (scale == "rmst") { - varPro::partialpro(object, learner = .rmst_learner(object, time), ...) - } else { + learner <- switch(scale, + rmst = .rmst_learner(object, time), + surv = .surv_learner(object, time), + NULL) + part_dta <- if (is.null(learner)) { varPro::partialpro(object, ...) + } else { + varPro::partialpro(object, learner = learner, ...) } } if (is.null(nvars)) { nvars <- length(part_dta) } - prov <- .varpro_provenance(object, scale, time, path = "A") + prov <- .varpro_provenance(object, scale, time, path = "A", + target = .varpro_target(object, list(...))) - dfs <- .build_varpro_dfs(part_dta, nvars, cat_limit) + dfs <- .build_varpro_dfs(part_dta, nvars, cat_limit, scale) continuous <- dfs$continuous categorical <- dfs$categorical @@ -258,7 +303,8 @@ gg_partial_varpro <- function(part_dta = NULL, call. = FALSE) } .validate_partial_time(time) - if (scale == "rmst") .validate_rmst_inputs(part_dta, object, time) + if (scale %in% c("rmst", "surv")) + .validate_surv_scale_inputs(part_dta, object, scale) invisible(NULL) } @@ -274,45 +320,42 @@ gg_partial_varpro <- function(part_dta = NULL, invisible(NULL) } -## RMST needs a horizon, and (when we recompute from 'object', part_dta = NULL) -## a survival fit. +## Survival learner scales (rmst/surv) need a survival fit when recomputing +## from 'object'. tau is optional now (defaults to median follow-up). #' @keywords internal -.validate_rmst_inputs <- function(part_dta, object, time) { - if (is.null(time)) { - stop("scale = 'rmst' requires 'time' (the RMST horizon tau)", - call. = FALSE) - } +.validate_surv_scale_inputs <- function(part_dta, object, scale) { if (is.null(part_dta) && !is.null(object) && !identical(object$family, "surv")) { - stop("scale = 'rmst' requires a survival varpro fit ", + stop("scale = '", scale, "' requires a survival varpro fit ", "(object$family == \"surv\")", call. = FALSE) } invisible(NULL) } -## Surface the two RMST traps: a precomputed part_dta can't be driven by tau -## (curve is mortality with an RMST label only), and a tau past the model's -## largest event time is truncated. Also flag a 'time' a scale ignores. +## Surface the two horizon traps for the learner scales (rmst/surv): a +## precomputed part_dta can't be driven by tau (it carries only the scale +## label), and a tau past the model's largest event time is truncated there. +## Also flag a 'time' a scale ignores. #' @keywords internal .warn_varpro_rmst <- function(part_dta, object, scale, time) { - if (scale == "rmst") { + if (scale %in% c("rmst", "surv")) { if (!is.null(part_dta)) { - warning("gg_partial_varpro: scale = 'rmst' cannot drive the partial ", - "computation from a precomputed 'part_dta'; the curves are ", - "ensemble mortality carrying an RMST(tau) label only. Supply ", - "'object' with part_dta = NULL for a genuine RMST(tau) curve.", - call. = FALSE) + warning("gg_partial_varpro: scale = '", scale, "' cannot drive the ", + "partial computation from a precomputed 'part_dta' (the values ", + "are whatever was passed in, carrying a '", scale, "' label ", + "only). Supply 'object' with part_dta = NULL for a genuine ", + scale, " curve.", call. = FALSE) } else if (!is.null(object)) { ## Only tau beyond the largest event time is truncated: S(t) cannot be - ## extrapolated past max(ti). A small tau is fine -- the integration - ## assumes S(t) = 1 on [0, ti[1]) -- so it is not flagged. + ## extrapolated past max(ti). A small tau is fine. ti <- object$rf$time.interest if (!is.null(ti) && time > max(ti)) { warning(sprintf( - paste0("gg_partial_varpro: RMST horizon tau = %g exceeds the ", - "model's largest event time (%g); RMST is truncated there, ", - "since S(t) cannot be extrapolated beyond it."), - time, max(ti)), call. = FALSE) + paste0("gg_partial_varpro: horizon tau = %g for scale = '%s' ", + "exceeds the model's largest event time (%g); it is ", + "truncated there, since S(t) cannot be extrapolated beyond ", + "it."), + time, scale, max(ti)), call. = FALSE) } } } else if (!is.null(time)) { @@ -378,8 +421,89 @@ gg_partial_varpro <- function(part_dta = NULL, as.numeric(level %*% width) } +## Survival probability at horizon tau from a survival matrix: the S(tau) +## column, snapped to the nearest event time. `surv` is n x J with column k = +## S(times[k]) (randomForestSRC layout). +#' @keywords internal +.surv_at_tau <- function(surv, times, tau) { + if (is.null(dim(surv))) surv <- matrix(surv, nrow = 1L) + if (ncol(surv) != length(times)) { + stop(sprintf(paste0(".surv_at_tau: survival matrix has %d column(s) but ", + "%d time point(s); the grids must match."), + ncol(surv), length(times)), call. = FALSE) + } + surv[, which.min(abs(times - tau))] +} + +## S(tau) learner for varPro::partialpro: maps feature rows to S(tau | x) from +## the survival forest in object$rf. Same prediction machinery as +## .rmst_learner; pulls the S(tau) column instead of integrating. +#' @keywords internal +.surv_learner <- function(object, tau) { + rf <- object$rf + function(newx) { + if (missing(newx)) { + pr <- randomForestSRC::predict.rfsrc(rf, perf.type = "none") + surv <- pr$survival.oob + if (is.null(surv)) surv <- pr$survival + } else { + pr <- randomForestSRC::predict.rfsrc(rf, newx, perf.type = "none") + surv <- pr$survival + } + times <- pr$time.interest + if (is.null(times)) times <- rf$time.interest + .surv_at_tau(surv, times, tau) + } +} + +## Fill a missing horizon for surv/rmst with the median-follow-up default -- +## but only when recomputing from 'object' (part_dta = NULL). A precomputed +## part_dta is label-only, so picking/announcing a horizon there would mislead, +## and .default_surv_tau() assumes a survival rf. +#' @keywords internal +.resolve_default_tau <- function(part_dta, object, scale, time) { + if (is.null(part_dta) && !is.null(object) && + scale %in% c("surv", "rmst") && is.null(time)) { + time <- .default_surv_tau(object) + message("gg_partial_varpro: using default horizon tau = ", signif(time, 4), + " (median follow-up). Set 'time' to choose another.") + } + time +} + +## Data-driven, units-safe default horizon for survival scales: the median +## observed follow-up time (the survival response's time column). Always in the +## model's own units, so it cannot mismatch them. Falls back to the median of +## the distinct event times if the raw response is not reachable. +#' @keywords internal +.default_surv_tau <- function(object) { + rf <- object$rf + yv <- rf$yvar + tms <- NULL + if (!is.null(yv)) { + yv <- as.data.frame(yv) + tcol <- if (!is.null(rf$yvar.names)) rf$yvar.names[1] else names(yv)[1] + tms <- suppressWarnings(as.numeric(yv[[tcol]])) + } + if (is.null(tms) || !any(is.finite(tms))) tms <- rf$time.interest + stats::median(tms, na.rm = TRUE) +} + +## Classification target class label: the `target` passed through ... if any, +## else the last factor level of the response (partialpro's default target). +## NA for non-classification fits or when only part_dta is supplied. +#' @keywords internal +.varpro_target <- function(object, dots) { + if (is.null(object) || !identical(object$family, "class")) + return(NA_character_) + if (!is.null(dots$target)) return(as.character(dots$target)) + lv <- levels(object$y.org) + if (is.null(lv)) lv <- levels(as.factor(object$y)) + if (is.null(lv)) NA_character_ else lv[length(lv)] +} + #' @keywords internal -.varpro_provenance <- function(object, scale, time, path = "A") { +.varpro_provenance <- function(object, scale, time, path = "A", target = NA) { prov_family <- if (!is.null(object)) object$family else NA_character_ prov_xvars <- if (!is.null(object)) object$xvar.names else NA_character_ prov_n <- if (!is.null(object)) nrow(object$x) else NA_integer_ @@ -392,13 +516,15 @@ gg_partial_varpro <- function(part_dta = NULL, n = prov_n, scale = scale_used, rmst_tau = time, + target = target, xvar.names = prov_xvars, path = path ) } #' @keywords internal -.build_varpro_dfs <- function(part_dta, nvars, cat_limit) { +.build_varpro_dfs <- function(part_dta, nvars, cat_limit, scale = "generic") { + bounded <- .is_bounded_scale(scale) cont_list <- list() cat_list <- list() for (feature in seq(nvars)) { @@ -407,14 +533,18 @@ gg_partial_varpro <- function(part_dta = NULL, if (length(feat$xvirtual) > cat_limit) { plt.df <- dplyr::bind_cols( variable = feat$xvirtual, - parametric = colMeans(feat$yhat.par, na.rm = TRUE), - nonparametric = colMeans(feat$yhat.nonpar, na.rm = TRUE), - causal = colMeans(feat$yhat.causal, na.rm = TRUE) + parametric = colMeans(.scale_transform(feat$yhat.par, scale), + na.rm = TRUE), + nonparametric = colMeans(.scale_transform(feat$yhat.nonpar, scale), + na.rm = TRUE), + # `causal` is a centered contrast: not shown on bounded scales + causal = if (bounded) NA_real_ else + colMeans(feat$yhat.causal, na.rm = TRUE) ) plt.df$name <- feat_name cont_list[[feature]] <- plt.df } else { - cat_list[[feature]] <- .process_cat_var(feat, feat_name) + cat_list[[feature]] <- .process_cat_var(feat, feat_name, scale) } } list( @@ -427,24 +557,44 @@ gg_partial_varpro <- function(part_dta = NULL, .resolve_varpro_scale <- function(scale, family) { if (scale != "auto") return(scale) if (is.na(family) || is.null(family)) return("generic") - if (family == "surv") return("mortality") - "generic" # regr, class, or unknown + if (family == "surv") return("surv") # bounded survival default (3.3.0) + if (family == "class") return("prob") # probability default (3.3.0) + "generic" # regr or unknown +} + +## Transform partialpro's (log-odds) values to the requested classification +## scale. Identity for everything except prob/odds, because the survival +## learners already return their own scale and the additive scales are raw. +#' @keywords internal +.scale_transform <- function(z, scale) { + switch(scale, + prob = stats::plogis(z), + odds = exp(z), + z) +} + +## Bounded scales: probability (class), odds (class), survival probability. +## On these the absolute level curves convert and the centered `causal` +## contrast is not shown (it cannot share the axis). +#' @keywords internal +.is_bounded_scale <- function(scale) { + scale %in% c("prob", "odds", "surv") } #' @keywords internal -.process_cat_var <- function(feat, feat_name) { - n_cats <- length(unique(feat$xorg)) +.process_cat_var <- function(feat, feat_name, scale = "generic") { + bounded <- .is_bounded_scale(scale) + cats <- unique(feat$xorg) cat_feat <- list() - for (ind in seq(n_cats)) { + for (ind in seq_along(cats)) { cat_feat[[ind]] <- dplyr::bind_cols( - parametric = feat$yhat.par[, ind], - nonparametric = feat$yhat.nonpar[, ind], - causal = feat$yhat.causal[, ind] + parametric = .scale_transform(feat$yhat.par[, ind], scale), + nonparametric = .scale_transform(feat$yhat.nonpar[, ind], scale), + causal = if (bounded) NA_real_ else feat$yhat.causal[, ind] ) - cat_feat[[ind]]$variable <- unique(feat$xorg)[ind] - plt.df <- if (ind == 1L) cat_feat[[ind]] else - dplyr::bind_rows(plt.df, cat_feat[[ind]]) + cat_feat[[ind]]$variable <- cats[ind] } + plt.df <- dplyr::bind_rows(cat_feat) plt.df$name <- feat_name plt.df } diff --git a/R/gg_partialpro.R b/R/gg_partialpro.R index 4fdaecd6..e58ea049 100644 --- a/R/gg_partialpro.R +++ b/R/gg_partialpro.R @@ -19,8 +19,8 @@ #' @export gg_partialpro <- function(part_dta, object = NULL, - scale = c("auto", "rmst", "mortality", - "surv", "chf"), + scale = c("auto", "prob", "odds", "logodds", + "rmst", "surv", "mortality", "chf"), time = NULL, nvars = NULL, cat_limit = 10, diff --git a/R/gg_rfsrc.R b/R/gg_rfsrc.R index 802bfa02..b73940af 100644 --- a/R/gg_rfsrc.R +++ b/R/gg_rfsrc.R @@ -454,6 +454,13 @@ gg_rfsrc <- function(object, UseMethod("gg_rfsrc", object) } +#' @export +gg_rfsrc.default <- function(object, oob = TRUE, by, ...) { + stop("gg_rfsrc: expected an 'rfsrc' or 'randomForest' object; ", + "got an object of class ", paste(class(object), collapse = "/"), ".", + call. = FALSE) +} + #' @export gg_rfsrc.randomForest <- function(object, oob = TRUE, diff --git a/R/gg_variable.R b/R/gg_variable.R index 7981c015..e9d3d6f1 100644 --- a/R/gg_variable.R +++ b/R/gg_variable.R @@ -159,6 +159,13 @@ gg_variable <- function(object, ...) { UseMethod("gg_variable", object) } + +#' @export +gg_variable.default <- function(object, ...) { + stop("gg_variable: expected an 'rfsrc' or 'randomForest' object; ", + "got an object of class ", paste(class(object), collapse = "/"), ".", + call. = FALSE) +} #' @export gg_variable.rfsrc <- function(object, ...) { diff --git a/R/gg_vimp.R b/R/gg_vimp.R index b53309d0..5dcffc99 100644 --- a/R/gg_vimp.R +++ b/R/gg_vimp.R @@ -190,6 +190,13 @@ gg_vimp <- function(object, nvar, ...) { UseMethod("gg_vimp", object) } + +#' @export +gg_vimp.default <- function(object, nvar, ...) { + stop("gg_vimp: expected an 'rfsrc' or 'randomForest' object; ", + "got an object of class ", paste(class(object), collapse = "/"), ".", + call. = FALSE) +} #' @export gg_vimp.rfsrc <- function(object, nvar, ...) { # Validate that the object is an rfsrc grow or predict result. diff --git a/R/plot.gg_partial_varpro.R b/R/plot.gg_partial_varpro.R index 97623cef..78ad3c80 100644 --- a/R/plot.gg_partial_varpro.R +++ b/R/plot.gg_partial_varpro.R @@ -31,9 +31,54 @@ #' to claim how the world works. They are a window into the fitted #' relationship; they do not by themselves establish that intervening #' on the variable would move the outcome. For survival path-C -#' (\code{scale = "surv"} or \code{"chf"}), the y-axis is on the -#' probability or cumulative-hazard scale, which is usually the scale -#' you want to report to a clinical audience. +#' (\code{scale = "chf"}), the y-axis is on the cumulative-hazard scale. +#' +#' @section Reading a probability curve (scale = "prob"): +#' The y-axis is \eqn{P(Y = \mathrm{target})}, the model's predicted probability +#' of the target class as the focal variable varies (others held at their +#' UVT-plausible average). \code{"odds"} and \code{"logodds"} are the same +#' curve on the odds and log-odds scales. The \code{causal} curve is a +#' contrast (below) and is \emph{not} shown on \code{"prob"}/\code{"odds"}; +#' use \code{"logodds"} to see it. +#' +#' @section Reading a survival-probability curve (scale = "surv"): +#' The y-axis is \eqn{S(\tau \mid x)}, the predicted probability of surviving +#' past \eqn{\tau}, bounded in \eqn{[0, 1]} and read in the model's time units. +#' Higher is better (more survival). \eqn{\tau} defaults to the median +#' follow-up time when not supplied. +#' +#' @section What the causal curve is, and when to use it: +#' \code{causal} is the \strong{baseline-subtracted local effect} -- varPro's +#' virtual- ("digital-") twins estimator (Ishwaran & Blackstone, 2025). It +#' shows how the prediction shifts as the focal variable moves away from the +#' reference grid point, with the other covariates held at on-manifold +#' (UVT-plausible) values; it is a \strong{contrast} (it starts at 0), not a +#' level. Use it when you want the local effect (change-from-baseline) rather +#' than the absolute predicted level, and as a cross-check on the parametric +#' and nonparametric curves. It is varpro's local estimator \emph{within the +#' fitted model}, \strong{not a structural causal claim} about the +#' data-generating process. Because it is a contrast it cannot share a +#' probability/odds axis with the absolute curves, so it is shown only on the +#' additive scales (\code{"logodds"}, \code{"mortality"}, \code{"rmst"}). +#' +#' @section Reading an RMST curve (scale = "rmst"): +#' The y-axis is restricted mean survival time at horizon \eqn{\tau}, +#' \eqn{\mathrm{RMST}(\tau)=\int_0^\tau S(t)\,dt}: the \strong{expected +#' event-free time during the first \eqn{\tau} time-units}, the area under the +#' survival curve out to \eqn{\tau}. Read it in the \strong{model's own time +#' units}, where it is bounded by \eqn{0 \le \mathrm{RMST}(\tau) \le \tau}. +#' +#' Two things follow. First, \eqn{\tau} must be given in the fit's time units; +#' a \eqn{\tau} past the largest event time just truncates to the full +#' restricted mean and stops changing. Second, higher is better here -- more +#' time event-free -- which is the opposite of the ensemble-mortality scale. +#' +#' A continuous variable's curve sloping \emph{up} means higher values of that +#' covariate buy you \emph{more} restricted-mean event-free time within \eqn{\tau} +#' (with the other covariates held at their UVT-plausible average); a flat curve +#' means the covariate does not move it. Unlike ensemble mortality, RMST reads +#' on a directly clinical scale, "so many event-free time-units within +#' \eqn{\tau}", which is usually the one you want to report. #' #' @param x A \code{\link{gg_partial_varpro}} object. #' @param type Character vector; one or more of \code{"parametric"}, @@ -57,6 +102,11 @@ #' Random survival forests. \emph{The Annals of Applied Statistics}, #' \bold{2}(3), 841--860. \doi{10.1214/08-AOAS169}. #' +#' Ishwaran H, Blackstone EH (2025). +#' Harnessing the power of virtual (digital) twins: Graphical causal tools for +#' understanding patient and hospital differences. +#' \emph{Computational and Structural Biotechnology Journal}, \bold{28}, 312. +#' #' @seealso \code{\link{gg_partial_varpro}} #' #' @examples @@ -91,6 +141,8 @@ plot.gg_partial_varpro <- function(x, type = c("parametric", "nonparametric", "causal"), ...) { + type_user <- !missing(type) # was 'causal' asked for, or is it the default? + ## C-path: delegate to plot.gg_partial_rfsrc via NextMethod(). prov <- attr(x, "provenance") if (!is.null(prov) && identical(prov$path, "C")) { @@ -101,6 +153,9 @@ plot.gg_partial_varpro <- function(x, type <- match.arg(type, several.ok = TRUE) ylabel <- .partial_varpro_ylabel(prov) + ## On bounded scales (prob/odds/surv) the causal contrast is not shown. + type <- .partial_varpro_plot_type(type, type_user, prov) + gg_cont <- NULL if (!is.null(x$continuous) && nrow(x$continuous) > 0) { cont_long <- tidyr::pivot_longer( @@ -154,38 +209,51 @@ plot.gg_partial_varpro <- function(x, } } +## Drop the `causal` contrast on bounded scales (prob/odds/surv) -- it cannot +## share the level axis. Warn only when the user explicitly asked for it; fall +## back to the level curves if causal was the only requested type. +#' @keywords internal +.partial_varpro_plot_type <- function(type, type_user, prov) { + if (is.null(prov) || !.is_bounded_scale(prov$scale %||% "generic")) + return(type) + if (type_user && "causal" %in% type) { + warning("plot.gg_partial_varpro: 'causal' is not shown on the ", + prov$scale, " scale (it is a contrast, not a level). ", + "Use scale = 'logodds' (classification) or 'mortality'/'rmst' ", + "(survival) to see it.", call. = FALSE) + } + type <- setdiff(type, "causal") + if (length(type) == 0L) c("parametric", "nonparametric") else type +} + ## --------------------------------------------------------------------------- ## Internal: build honest y-axis label from provenance. #' @keywords internal .partial_varpro_ylabel <- function(prov) { if (is.null(prov)) return("Partial Effect") scale <- prov$scale %||% "generic" + tgt <- prov$target + has_tgt <- !is.null(tgt) && !is.na(tgt) switch(scale, + prob = if (has_tgt) sprintf("P(Y = %s)", tgt) else "Probability", + odds = if (has_tgt) sprintf("Odds(Y = %s)", tgt) else "Odds", + logodds = if (has_tgt) sprintf("Log-odds(Y = %s)", tgt) else "Log-odds", mortality = "Ensemble mortality (expected events)", rmst = { tau <- prov$rmst_tau - if (!is.null(tau) && !is.na(tau)) { - sprintf("RMST (\u03c4 = %g)", tau) - } else { - "RMST" - } + if (!is.null(tau) && !is.na(tau)) sprintf("RMST (\u03c4 = %g)", tau) + else "RMST" }, surv = { t <- prov$rmst_tau - if (!is.null(t) && !is.na(t)) { - sprintf("Survival probability at t = %g", t) - } else { - "Survival probability" - } + if (!is.null(t) && !is.na(t)) sprintf("Survival probability at t = %g", t) + else "Survival probability" }, chf = { t <- prov$rmst_tau - if (!is.null(t) && !is.na(t)) { - sprintf("Cumulative hazard at t = %g", t) - } else { - "Cumulative hazard" - } + if (!is.null(t) && !is.na(t)) sprintf("Cumulative hazard at t = %g", t) + else "Cumulative hazard" }, - "Partial Effect" # generic / auto-regr / auto-class / unknown + "Partial Effect" # generic / regr / unknown ) } diff --git a/README.md b/README.md index f24d8472..8047ca78 100644 --- a/README.md +++ b/README.md @@ -21,8 +21,7 @@ `ggRandomForests` provides `ggplot2`-based diagnostic and exploration plots for random forests fit with [randomForestSRC](https://cran.r-project.org/package=randomForestSRC) (>= 3.4.0) or [randomForest](https://cran.r-project.org/package=randomForest). -It separates data extraction from plotting so the intermediate tidy objects can be inspected, saved, or used -for custom analyses. +It keeps the data step apart from the figure step, so you can inspect, save, or reuse the tidy object on its own. Listed in the [ggplot2 extensions gallery](https://exts.ggplot2.tidyverse.org/). ## Installation @@ -74,6 +73,13 @@ forests — see the dedicated vignette: vignette("varpro", package = "ggRandomForests") ``` +The unsupervised varPro tools — `gg_udependent()`, `gg_beta_uvarpro()`, and +`gg_sdependent()`, which read structure off a `uvarpro()` fit with no outcome — +have their own short vignette: +```r +vignette("uvarpro", package = "ggRandomForests") +``` + ## Function reference | Function | Input | What you get | @@ -88,8 +94,8 @@ vignette("varpro", package = "ggRandomForests") | `gg_roc()` | `rfsrc` / `randomForest` (class) | ROC curve data | | `gg_brier()` | `rfsrc` (survival) | Time-resolved Brier score and CRPS | -Each `gg_*` function has a matching `plot()` S3 method that hands back a single plottable object — a `ggplot`, -or a `patchwork` composite when the method lays out multiple panels — so you can keep adding layers, scales, or a theme. Every `gg_*` object also has `print()` and `summary()` methods: `print()` +Each `gg_*` function has a matching `plot()` S3 method that hands back a single plottable object: a `ggplot` +you extend with `+`, or a `patchwork` composite for the multi-panel methods. Every `gg_*` object also has `print()` and `summary()` methods: `print()` shows a short header at the REPL rather than dumping every row (use `head()` when you want the rows), and `summary()` gives you a diagnostics object you can print or keep. @@ -110,11 +116,11 @@ entirely and build the figure from the tidy data yourself. ## Recent changes -See [NEWS.md](NEWS.md) for the full changelog. Highlights since v2.4.0: +See [NEWS.md](NEWS.md) for the full changelog. Recent highlights: -- **v2.6.1** Fix factor-level assignment in `gg_partial` for categorical variables. -- **v2.6.0** New plotting functions exported; test coverage raised to 83%; removed internal dependency on `hvtiRutilities`. -- **v2.5.0** New `gg_partial_rfsrc()` computes partial dependence directly from an `rfsrc` model without a separate `plot.variable` call; supports a grouping variable via `xvar2.name`. +- **v3.4.0** Unsupervised varPro wrappers (`gg_beta_uvarpro()`, `gg_sdependent()`) with their own vignette; `gg_partial_rfsrc()` now handles factor predictors correctly. +- **v3.3.0** varPro partial plots default to interpretable scales — probability for classification, survival S(τ) for survival. +- **v3.1.0** varPro integration: release-rule importance, partial dependence, local importance, anomaly scores, and the dependency graph. ## References diff --git a/_pkgdown.yml b/_pkgdown.yml index 2272e340..4d3440e3 100644 --- a/_pkgdown.yml +++ b/_pkgdown.yml @@ -46,6 +46,8 @@ navbar: href: articles/ggRandomForests-regression.html - text: "Exploring variable importance with varPro" href: articles/varpro.html + - text: "Variable selection without an outcome: unsupervised varPro" + href: articles/uvarpro.html reference: - title: "Forest Objects" @@ -148,6 +150,7 @@ articles: - ggRandomForests-survival - ggRandomForests-regression - varpro + - uvarpro news: releases: diff --git a/cran-comments.md b/cran-comments.md index 28431561..1800d635 100644 --- a/cran-comments.md +++ b/cran-comments.md @@ -1,48 +1,60 @@ -## v3.2.0 — minor release (RMST partial dependence; bug fixes) +## v3.4.0 — minor release (interpretable varPro partial scales; unsupervised varPro wrappers; factor partial-dependence fix) -This is a minor feature-and-fix release for the varPro survival workflow. +This is a minor feature-and-fix release. It consolidates the work developed +since the CRAN 3.2.0 release (the 3.3.0 and 3.4.0 development cycles) into a +single submission. ### What's new / fixed -* `gg_partial_varpro(scale = "rmst", time = tau)` now *drives* the survival - partial-dependence computation through an RMST(tau) learner - (RMST(tau) = integral_0^tau S(t) dt), rather than only relabeling the - default ensemble-mortality curve. `varPro::partialpro()` exposes no time - argument, so multi-horizon RMST plots built the old way differed only by - Monte-Carlo noise, not by tau; the learner makes the curve genuinely - tau-dependent. Guardrails warn when a precomputed `part_dta` cannot be - driven by tau, when tau exceeds the model's largest event time (RMST is - truncated there), and when `time` is supplied to a scale that ignores it. -* `gg_partial_varpro()` (and the deprecated `gg_partialpro()` alias) now - forward `...` to `varPro::partialpro()` on the object-driven path, restoring - control over which variables are computed (`xvar.names`, `nvar`) and the - isolation-forest step (`cut`, `nsmp`, ...). -* `gg_varpro()` now fails with a clear message instead of a cryptic - "differing number of rows" error when `varPro::importance()` returns a - degenerate importance table (issue #118). -* `Imports` requires `varPro (>= 3.1.0)` (the version exposing the - `partialpro()` `learner` argument the RMST path relies on). +* **Interpretable varPro partial-plot scales.** `gg_partial_varpro()` now + defaults to bounded, interpretable y-axes: probability P(Y = target class) + for classification, and survival probability S(tau) for survival (via a + partialpro survival learner), with tau defaulting to the median follow-up + time when omitted. The unbounded ensemble-mortality scale remains available + via `scale = "mortality"`. The `causal` (virtual-twins) contrast is hidden on + the bounded scales and documented. +* **Unsupervised varPro wrappers.** New `gg_beta_uvarpro()` (lasso-importance + ranking from `varPro::get.beta.entropy()`) and `gg_sdependent()` + (signal-variable detection from `varPro::sdependent()`), each with + `plot`/`print`/`summary`/`autoplot` methods, complementing the existing + `gg_udependent()` dependency graph. A new short "uvarpro" vignette walks the + three unsupervised views on a single `uvarpro()` fit. +* **Bug fix: factor partial dependence in `gg_partial_rfsrc()`.** The wrapper + passed factor *labels* to `randomForestSRC::partial.rfsrc()`, which imposes a + level by its integer code; character labels became `NA` and numeric-looking + labels went out of range, collapsing every level to a single value (a flat + categorical partial plot). It now passes the integer codes and relabels the + output, matching `plot.variable(partial = TRUE)`. The categorical `x` is + returned as a factor in the model's level order. +* Documentation: fixed the main vignette's `\VignetteIndexEntry` (it carried a + template placeholder); split the unsupervised varPro material out of the + varPro vignette into the new companion vignette. ### Test environments * **Local:** R 4.6.0 on macOS (aarch64-apple-darwin23). `R CMD check --as-cran` (with the manual) returns 0 errors, 0 warnings, 0 notes. -* **win-builder:** R-devel, R-release, and R-oldrelease (Windows Server 2022, - x86_64) — all Status: OK (0 errors, 0 warnings, 0 notes). -* **mac-builder:** R 4.6.0 (aarch64-apple-darwin23, macOS) — Status: OK - (0 errors, 0 warnings, 0 notes). +* **win-builder:** R-devel (R 90199), R-release (4.6.1), and R-oldrelease + (4.5.3), Windows Server 2022, x86_64 — all Status: OK. +* **macOS:** covered by the local aarch64-apple-darwin23 check above. The + mac.r-project.org macOS builder was unavailable at submission time + (HTTP 502), so no mac-builder result is included; the macOS platform is + also exercised by the macos-latest GitHub Actions job below. * **GitHub Actions matrix:** ubuntu-latest (R-devel / R-release / R-oldrel-1), windows-latest (R-release), macos-latest (R-release). * **Reverse-dependency check:** 0 reverse dependencies on CRAN. ### NOTE disposition -`R CMD check --as-cran` is clean (0/0/0) locally. No notes expected beyond the -usual incoming-feasibility items (e.g. maintainer/timestamp), if any. +`R CMD check --as-cran` is clean (0/0/0) locally. The only note expected at +incoming feasibility is the usual "days since last update" item (3.2.0 was +published 2026-06-23). The gcc-UBSAN guard from v3.1.1/v3.1.2 is unchanged: the single unsupervised `varPro::isopro(method = "unsupv")` test still calls `skip_on_cran()` to avoid an upstream `randomForestSRC` sanitizer report (`rfsrcGrow`, `entry.c:184`); -all other varPro tests run. ggRandomForests remains a pure-R package -(`NeedsCompilation: no`). +all other varPro tests run. This is the only grow that trips the report (its +`yvar.wt` is length-0); `uvarpro()` and the other varPro grows are +synthetic-supervised and sanitizer-clean. ggRandomForests remains a pure-R +package (`NeedsCompilation: no`). diff --git a/dev/plans/2026-06-23-varpro-partial-scales-design.md b/dev/plans/2026-06-23-varpro-partial-scales-design.md new file mode 100644 index 00000000..ea6d074c --- /dev/null +++ b/dev/plans/2026-06-23-varpro-partial-scales-design.md @@ -0,0 +1,305 @@ +# Design: Interpretable y-axis scales for varPro partial plots (v3.3.0) + +- **Date:** 2026-06-23 +- **Status:** Approved design (pre-implementation) +- **Version:** 3.2.0 → 3.3.0 (CRAN minor, off `main`) +- **Branch:** `feat/varpro-figures-3.3.0` +- **Functions:** `gg_partial_varpro()`, `plot.gg_partial_varpro()` (+ internals) + +--- + +## 1. Motivation + +`varPro::partialpro()` returns classification partial effects on the **log-odds** +scale and survival partial effects as **ensemble mortality** (an unbounded +relative-risk score). Neither is the scale a clinical reader wants: + +- Classification curves should be readable as **probability** P(Y = target). +- Survival curves should be readable as a **bounded** quantity — survival + probability S(τ) or restricted mean survival time RMST(τ) — not the unbounded + mortality score. + +v3.2.0 already added the RMST(τ) learner and a "Reading an RMST curve" doc +section. v3.3.0 extends the same idea to **classification probability** and +**survival probability**, and makes the *default* scales interpretable. + +### Goals + +1. Classification partial plots return **probability by default**, with `odds` + and `logodds` as options. +2. Survival partial plots can return **survival probability S(τ)** on the same + partialpro/UVT engine as mortality and RMST. +3. Never silently hand back the misleading unbounded mortality score for a + survival fit: `scale = "auto"` defaults to bounded **survival probability** + S(τ) at a units-safe, data-driven default τ (mortality stays an explicit + opt-in). +4. Document every scale's interpretation honestly. + +### Non-goals + +- No change to the survival `chf` (cumulative hazard) path. +- No change to regression partial plots (already on the response scale). +- No multi-τ faceting in one call (one τ per call; users loop, as in ST1267). +- No new exported functions — this is scale/default work on existing entry + points. + +--- + +## 2. Current behaviour (baseline) + +`scale` argument: `c("auto", "rmst", "mortality", "surv", "chf")`. + +- `.resolve_varpro_scale(scale, family)` (in `R/gg_partial_varpro.R`): `auto` → + `"mortality"` for survival, `"generic"` for regr/class/unknown. +- Classification: `partialpro()` applies `mylogodds()` internally, so + `yhat.par`/`yhat.nonpar`/`yhat.causal` are log-odds of the target class. + `gg_partial_varpro()` averages them (`colMeans`) into the + continuous/categorical frames; the resolved scale is `"generic"`, y-label + `"Partial Effect"`. **The values are log-odds but unlabelled as such.** +- Survival: `auto` → `mortality` (path A, partialpro default learner). `rmst` + (path A, RMST learner, requires `time`). `surv`/`chf` route through **path C** + (`gg_partial_rfsrc` on `object$rf`) — a *different* engine, time-point S(t)/ + CHF curves, no UVT, no par/nonpar/causal structure. +- The three curves' scales (from partialpro internals): `yhat.par` and + `yhat.nonpar` are **absolute** (global mean intercept added back); + `yhat.causal` is a **centered contrast** (deviation from the reference grid + point — a log odds-ratio for classification). + +--- + +## 3. Design + +### 3a. New scale vocabulary + +`scale` gains three classification values: `"prob"`, `"odds"`, `"logodds"`. +Full set: + +``` +scale = c("auto", "prob", "odds", "logodds", # classification (+ generic) + "rmst", "surv", "mortality", "chf") # survival +``` + +`match.arg` accepts all; validity is family-dependent (already true today). + +### 3b. Scale resolution (`.resolve_varpro_scale`) + +| `scale` | family | resolves to | notes | +|---|---|---|---| +| `auto` | class | `prob` | **changed** (was `generic`) | +| `auto` | regr | `generic` | unchanged | +| `auto` | surv | `surv` | **changed** (was `mortality`); τ defaults to median follow-up (3e) | +| `auto` | unknown (part_dta only) | `generic` | unchanged — backward compatible | +| explicit | any | itself | unchanged | + +Backward compatibility: when only `part_dta` is supplied (no `object`), family is +unknown → `generic`, log-odds values as before. The probability default only +engages when an `object` identifies the fit as classification. + +### 3c. Classification conversion (where & how) + +partialpro returns log-odds. Conversion happens in the **extractor**, on the +per-observation matrices **before averaging** — so the partial value is the +*mean of probabilities*, not the probability of the mean log-odds +(`mean(plogis(z))`, not `plogis(mean(z))`). + +In `.build_varpro_dfs()` (and `.process_cat_var()`), apply a per-scale transform +to `yhat.par`/`yhat.nonpar` matrices before `colMeans`: + +- `prob`: `plogis(z)` +- `odds`: `exp(z)` +- `logodds` / `generic`: identity + +The transform applies to the **absolute** curves only (`par`, `nonpar`). + +### 3d. The `causal` curve + +`yhat.causal` is a centered contrast (log odds-ratio for classification; a +centered contrast in whatever units the learner returns for survival). A +contrast cannot share a probability/odds axis with the absolute level curves. + +Rule (consistent across families): + +- **Additive/unbounded scales** — `logodds` (class), `mortality`, `rmst` (surv): + all three curves shown (level + contrast commensurable). Current behaviour. +- **Bounded scales** — `prob`, `odds` (class), `surv` (surv): `par`/`nonpar` + convert; **`causal` is not shown.** The extractor sets the `causal` column to + `NA` for bounded scales; the plot drops it from `type` and **warns** if the + user explicitly passed `type = "causal"`. + +This rule must be accompanied by an explicit explanation of *what* the `causal` +curve is and *when* to use it — see §4. The warning message should point the +user to `scale = "logodds"` (or `mortality`/`rmst` for survival) to see it. + +### 3e. Survival: S(t) learner + data-driven default τ + +**New internal `.surv_learner(object, tau)`** — a near-clone of `.rmst_learner`, +replacing the integral with a single-column pull: + +```r +pr <- randomForestSRC::predict.rfsrc(rf, newx, perf.type = "none") +surv <- pr$survival # (or survival.oob for the OOB/no-newx branch) +times <- pr$time.interest +surv[, which.min(abs(times - tau))] # S(tau | x), snapped to nearest event time +``` + +Returns a bounded 0–1 vector; partialpro fits par/nonpar on it directly (no +extractor-level transform needed — the learner already yields probability). + +**Default τ (the key safety property).** When `time` is `NULL` for a survival +scale that needs one (`surv`, `rmst`), τ defaults to the **median follow-up +time** — `median()` of the model's observed event/censoring times. This is +*units-safe by construction*: it is computed from the fit, so it is always in +the model's own time units and cannot mismatch them — unlike a hand-typed τ, +which is exactly how the ST1267 days-vs-years bug arose. A derived default is +therefore *safer* than requiring the user to type a number. The resolved τ +appears in the y-label (`"... at t = X"`) and a one-time message; `time = τ` +overrides it. + +Routing & defaults: + +- `scale = "auto"` + survival → **`surv`** at the default τ — the bounded, + interpretable default (retires the old unbounded `mortality` default). +- `scale = "surv"` → **path A** via `.surv_learner` (retires the old path-C + `surv` route); τ defaults to median follow-up if not supplied. +- `scale = "rmst"` → path A via `.rmst_learner`; τ defaults to median follow-up + if not supplied (**loosened** from v3.2.0, which *required* `time`). +- `scale = "mortality"` → path A, default learner, no τ (explicit opt-in; the + unbounded score is never the silent default now). +- `scale = "chf"` → **path C** unchanged (no learner; cumulative hazard). + +No survival scale errors on a missing `time` any more — the default τ fills in. +`.validate_*` drop the `surv`/`rmst` "requires time" stops; a *supplied* `time` +still gets the scalar-finite check. + +### 3f. Y-axis labels (`.partial_varpro_ylabel`) + +| scale | label | +|---|---| +| `prob` | `P(Y = )` (or `Probability` if target unknown) | +| `odds` | `Odds(Y = )` | +| `logodds` | `Log-odds(Y = )` | +| `surv` | `Survival probability at t = <τ>` | +| `rmst` | `RMST (τ = <τ>)` (unchanged) | +| `mortality` | `Ensemble mortality (expected events)` (unchanged) | +| `chf` | `Cumulative hazard at t = ` (unchanged) | + +Target class for the classification labels: resolved from `object` — the +`target` passed through `...` if present, else the **last factor level** of the +response (partialpro's default target). If only `part_dta` is supplied, omit the +class name (`Probability`/`Odds`/`Log-odds`). + +### 3g. Provenance + +`provenance$scale` records the resolved scale (already present). Add +`provenance$target` (the classification target class label, or `NA`) so the plot +can build the label without re-deriving it. + +--- + +## 4. Documentation + +- `plot.gg_partial_varpro` gains `@section`s mirroring the existing + ensemble-mortality and "Reading an RMST curve" sections: + - **Reading a probability curve (scale = "prob")** — P(Y = target); odds/ + log-odds variants; why `causal` is hidden here. + - **Reading a survival-probability curve (scale = "surv")** — S(τ | x), + bounded 0–1, τ in the model's time units, higher = more survival. + - **What the `causal` curve is, and when to use it** — a dedicated section + (it is the most easily-misread of the three curves): + - *What it is:* the **baseline-subtracted local effect** — varPro's + **virtual- ("digital-") twins** estimator (Ishwaran & Blackstone, 2025). + It shows how the prediction shifts as the focal variable moves *away from + the reference grid point*, with the other covariates held at on-manifold + (UVT-plausible) values. It is a **contrast** (starts at 0), not a level. + - *When to use it:* when you want the local **effect / change-from-baseline** + rather than the absolute predicted level — and as a cross-check against the + parametric/nonparametric curves (agreement = a trustworthy effect; + divergence = inspect). + - *Caveat (keep the existing wording):* it is varpro's local estimator + *within the fitted model*, **not a structural causal claim** about the + data-generating process (no confounding adjustment). + - *Why it's hidden on bounded scales (ties to §3d):* a baseline-subtracted + contrast cannot share a probability/odds axis with the absolute level + curves; it lives on the additive scales (`logodds`/`mortality`/`rmst`). + - Add the Ishwaran & Blackstone (2025) virtual-twins reference to + `@references` on both `gg_partial_varpro` and `plot.gg_partial_varpro`. +- `gg_partial_varpro` `@details`: the classification probability default; the + survival `surv` default at median-follow-up τ; the `surv` learner; the + units-safe data-driven τ (and how to override with `time`). +- NEWS v3.3.0 bullets: classification probability default (behaviour change); + survival `auto` → `surv` default + `surv`/`rmst` τ defaulting to median + follow-up (behaviour change); the `surv` learner; RMST interpretation docs + (already present). + +--- + +## 5. Testing + +CRAN-safe unit tests (no varpro grow) where possible: + +- `.surv_learner`: against a small `rfsrc` survival fit (like the existing + `make_mock_cpath` helper) — returns length-n, bounded 0–1, equals the snapped + S(τ) column; OOB and newdata branches. +- Scale resolution: `.resolve_varpro_scale("auto","class") == "prob"`, etc. +- Classification conversion: `prob`/`odds`/`logodds` produce expected + transforms of a mock log-odds `part_dta`; mean-of-probabilities (not + plogis-of-mean) verified on a hand-computable case. +- `causal` hidden on bounded scales; warning when `type = "causal"` requested + with `prob`/`odds`/`surv`. +- Default τ: `scale = "surv"`/`"rmst"` with `time = NULL` resolve τ to the + median follow-up (verify the resolved τ equals `median()` of the observed + times and is surfaced in the label/message); `scale = "auto"` + a survival + object resolves to `surv` (not mortality, no error). +- Labels: y-label strings for each scale (incl. target-class name). +- End-to-end (skip_on_cran): a real classification varpro fit → + `scale = "prob"` populated, values in [0, 1]; a survival fit → + `scale = "surv", time = τ` populated, values in [0, 1]. + +Backward-compat tests: `part_dta`-only classification call still yields +log-odds/`generic` (no error, no conversion). + +--- + +## 6. Backward compatibility / breaking changes + +Two deliberate behaviour changes, documented prominently in NEWS: + +1. **Classification `scale = "auto"` default changes** from generic log-odds to + probability. Existing object-driven classification calls now return + probability values + a `P(Y=…)` label. (`part_dta`-only calls are unchanged.) +2. **Survival `scale = "auto"` now returns survival probability** S(τ) at the + median-follow-up default τ (was ensemble mortality). The old mortality + behaviour remains available via `scale = "mortality"`. Also: `scale = "rmst"` + and `"surv"` now **default τ to median follow-up** when `time` is omitted + (v3.2.0's `rmst` required `time` — this is a loosening, not a break); a + supplied `time` behaves as before. + +The old path-C `surv` route is retired (replaced by the learner); `chf` is +unchanged. + +--- + +## 7. Open implementation detail + +- **Target-class extraction.** Confirm where partialpro's effective `target` + lands when passed via `...`, and the exact object slot for the response factor + levels (`object$y` vs `object$yvar.org`), so the label names the correct + class. Resolve during implementation; fall back to `Probability` (no class + name) if it can't be determined. +- **Median-follow-up source.** Confirm the object slot for the observed survival + times to compute the default τ = `median(follow-up)` — likely the time column + of `object$rf$yvar` (the survival response), *not* `time.interest` (which is + the distinct event times, unweighted by subjects). Fall back to + `median(object$rf$time.interest)` if the raw times aren't cleanly reachable. + +--- + +## 8. Touched files + +- `R/gg_partial_varpro.R` — `scale` arg values, `.resolve_varpro_scale`, + `.build_varpro_dfs`/`.process_cat_var` (conversion), `.surv_learner`, + validation, provenance `target`, routing for `surv`. +- `R/plot.gg_partial_varpro.R` — `.partial_varpro_ylabel` (new labels), + `causal` handling/warning, `@section`s. +- `R/gg_partialpro.R` — alias inherits new args automatically. +- `NEWS.md`, `man/` (regenerated), `tests/testthat/test_gg_partial_varpro.R`. diff --git a/dev/plans/2026-06-23-varpro-partial-scales-plan.md b/dev/plans/2026-06-23-varpro-partial-scales-plan.md new file mode 100644 index 00000000..1fbbba35 --- /dev/null +++ b/dev/plans/2026-06-23-varpro-partial-scales-plan.md @@ -0,0 +1,1222 @@ +# varPro Partial-Plot Scales (v3.3.0) Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Make `gg_partial_varpro()` return interpretable y-axis scales — probability by default for classification, survival probability S(τ) by default for survival — with odds/log-odds and mortality/RMST as options, while keeping the `causal` contrast honest. + +**Architecture:** All changes are in two R files plus tests/docs. The extractor (`gg_partial_varpro` + internals in `R/gg_partial_varpro.R`) resolves a per-family scale, converts the partialpro log-odds matrices to the chosen scale *before* averaging, blanks the `causal` contrast on bounded scales, and adds a survival S(t) learner with a median-follow-up default τ. The plot method (`R/plot.gg_partial_varpro.R`) builds honest y-labels and drops/warns on `causal` for bounded scales. + +**Tech Stack:** R, roxygen2, testthat (+ `devtools::load_all`/`document`), `randomForestSRC::predict.rfsrc`, `varPro::partialpro`. + +**Design spec:** `dev/plans/2026-06-23-varpro-partial-scales-design.md` + +**Branch:** `feat/varpro-figures-3.3.0` (already at version 3.3.0). + +--- + +## File Structure + +- `R/gg_partial_varpro.R` — extractor + internals. New helpers: `.scale_transform`, `.is_bounded_scale`, `.surv_at_tau`, `.surv_learner`, `.default_surv_tau`, `.varpro_target`. Modified: `gg_partial_varpro` signature/routing, `.resolve_varpro_scale`, `.build_varpro_dfs`, `.process_cat_var`, `.validate_varpro_inputs`/`.validate_rmst_inputs`, `.varpro_provenance`. +- `R/plot.gg_partial_varpro.R` — `.partial_varpro_ylabel` (new labels), `plot.gg_partial_varpro` (causal drop/warn), roxygen `@section`s/`@references`. +- `R/gg_partialpro.R` — no code change (alias inherits via `...`); roxygen `@references` only. +- `tests/testthat/test_gg_partial_varpro.R` — new tests appended. +- `NEWS.md`, `man/` — version bullets + regenerated docs. + +**Test command used throughout** (the repo skips varpro grows on CRAN, so set `NOT_CRAN`): + +```bash +NOT_CRAN=true Rscript -e 'suppressMessages(devtools::load_all(quiet=TRUE)); testthat::test_file("tests/testthat/test_gg_partial_varpro.R", reporter="summary")' +``` + +Single internal helpers are exercised via `ggRandomForests:::name(...)` inside tests (the file already uses this pattern). + +--- + +### Task 1: Scale vocabulary + resolution + +Add the classification scale values and change `auto` resolution: classification → `prob`, survival → `surv`. + +**Files:** +- Modify: `R/gg_partial_varpro.R` — `gg_partial_varpro` signature `scale = c(...)`, and `.resolve_varpro_scale`. +- Test: `tests/testthat/test_gg_partial_varpro.R` + +- [ ] **Step 1: Write the failing test** + +Append to `tests/testthat/test_gg_partial_varpro.R`: + +```r +## ── v3.3.0 scale resolution ────────────────────────────────────────────────── +test_that(".resolve_varpro_scale: auto resolves by family (3.3.0)", { + expect_equal(ggRandomForests:::.resolve_varpro_scale("auto", "class"), "prob") + expect_equal(ggRandomForests:::.resolve_varpro_scale("auto", "surv"), "surv") + expect_equal(ggRandomForests:::.resolve_varpro_scale("auto", "regr"), "generic") + expect_equal(ggRandomForests:::.resolve_varpro_scale("auto", NA_character_), + "generic") + # explicit scales pass through unchanged + expect_equal(ggRandomForests:::.resolve_varpro_scale("prob", "class"), "prob") + expect_equal(ggRandomForests:::.resolve_varpro_scale("odds", "class"), "odds") +}) +``` + +- [ ] **Step 2: Run test to verify it fails** + +Run the test command above. +Expected: FAIL — current `.resolve_varpro_scale("auto","class")` returns `"generic"`, `"auto"/"surv"` returns `"mortality"`. + +- [ ] **Step 3: Update the `scale` argument and resolver** + +In `R/gg_partial_varpro.R`, change the `gg_partial_varpro` signature default for `scale` from: + +```r + scale = c("auto", "rmst", "mortality", + "surv", "chf"), +``` + +to: + +```r + scale = c("auto", "prob", "odds", "logodds", + "rmst", "surv", "mortality", "chf"), +``` + +Replace the body of `.resolve_varpro_scale`: + +```r +#' @keywords internal +.resolve_varpro_scale <- function(scale, family) { + if (scale != "auto") return(scale) + if (is.na(family) || is.null(family)) return("generic") + if (family == "surv") return("surv") # bounded survival default (3.3.0) + if (family == "class") return("prob") # probability default (3.3.0) + "generic" # regr or unknown +} +``` + +- [ ] **Step 4: Run test to verify it passes** + +Run the test command. Expected: the new resolution test PASSES. (Other tests may now fail — that is expected and fixed in later tasks; note any failures and continue.) + +- [ ] **Step 5: Commit** + +```bash +git add R/gg_partial_varpro.R tests/testthat/test_gg_partial_varpro.R +git commit -m "feat(gg_partial_varpro): scale vocabulary + auto->prob/surv resolution (3.3.0)" +``` + +--- + +### Task 2: Scale-transform + bounded-scale helpers + +Two pure helpers: the value transform (log-odds → prob/odds) and the bounded-scale predicate. + +**Files:** +- Modify: `R/gg_partial_varpro.R` (add helpers near the other internals). +- Test: `tests/testthat/test_gg_partial_varpro.R` + +- [ ] **Step 1: Write the failing test** + +```r +## ── v3.3.0 scale transform + bounded predicate ─────────────────────────────── +test_that(".scale_transform applies prob/odds/identity", { + z <- matrix(c(0, log(3)), nrow = 1) # log-odds: 0 -> p=.5, log(3) -> p=.75 + expect_equal(ggRandomForests:::.scale_transform(z, "prob"), + matrix(c(0.5, 0.75), nrow = 1)) + expect_equal(ggRandomForests:::.scale_transform(z, "odds"), + matrix(c(1, 3), nrow = 1)) + # identity for the additive / non-classification scales + for (s in c("logodds", "generic", "surv", "rmst", "mortality")) + expect_equal(ggRandomForests:::.scale_transform(z, s), z) +}) + +test_that(".is_bounded_scale flags prob/odds/surv only", { + for (s in c("prob", "odds", "surv")) + expect_true(ggRandomForests:::.is_bounded_scale(s)) + for (s in c("logodds", "generic", "mortality", "rmst", "chf")) + expect_false(ggRandomForests:::.is_bounded_scale(s)) +}) +``` + +- [ ] **Step 2: Run test to verify it fails** + +Run the test command. Expected: FAIL — `.scale_transform`/`.is_bounded_scale` not found. + +- [ ] **Step 3: Add the helpers** + +In `R/gg_partial_varpro.R`, add near `.resolve_varpro_scale`: + +```r +## Transform partialpro's (log-odds) values to the requested classification +## scale. Identity for everything except prob/odds, because the survival +## learners already return their own scale and the additive scales are raw. +#' @keywords internal +.scale_transform <- function(z, scale) { + switch(scale, + prob = stats::plogis(z), + odds = exp(z), + z) +} + +## Bounded scales: probability (class), odds (class), survival probability. +## On these the absolute level curves convert and the centered `causal` +## contrast is not shown (it cannot share the axis). +#' @keywords internal +.is_bounded_scale <- function(scale) { + scale %in% c("prob", "odds", "surv") +} +``` + +- [ ] **Step 4: Run test to verify it passes** + +Run the test command. Expected: both new tests PASS. + +- [ ] **Step 5: Commit** + +```bash +git add R/gg_partial_varpro.R tests/testthat/test_gg_partial_varpro.R +git commit -m "feat(gg_partial_varpro): .scale_transform + .is_bounded_scale helpers" +``` + +--- + +### Task 3: Convert + blank causal in the data-frame builders + +Apply `.scale_transform` to the `par`/`nonpar` matrices **before** `colMeans` (mean of probabilities), and set `causal = NA` on bounded scales. Thread `scale` through `.build_varpro_dfs` and `.process_cat_var`. + +**Files:** +- Modify: `R/gg_partial_varpro.R` — `.build_varpro_dfs`, `.process_cat_var`, and the one call site in `gg_partial_varpro`. +- Test: `tests/testthat/test_gg_partial_varpro.R` + +- [ ] **Step 1: Write the failing test** + +```r +## ── v3.3.0 conversion in the extractor (mean of probabilities) ─────────────── +test_that("gg_partial_varpro: scale='prob' is mean of plogis, causal NA", { + d <- make_mock_vpro_data() + res <- gg_partial_varpro(d, scale = "prob") # part_dta path, explicit scale + age <- res$continuous[res$continuous$name == "age", ] + expected <- colMeans(stats::plogis(d$age$yhat.par), na.rm = TRUE) + expect_equal(age$parametric, expected) # mean of probabilities + expect_true(all(age$parametric >= 0 & age$parametric <= 1)) + expect_true(all(is.na(res$continuous$causal))) # causal blanked on bounded +}) + +test_that("gg_partial_varpro: scale='logodds' keeps raw values + causal", { + d <- make_mock_vpro_data() + res <- gg_partial_varpro(d, scale = "logodds") + age <- res$continuous[res$continuous$name == "age", ] + expect_equal(age$parametric, colMeans(d$age$yhat.par, na.rm = TRUE)) + expect_false(all(is.na(res$continuous$causal))) # causal shown on additive +}) +``` + +(`make_mock_vpro_data()` already exists at the top of the test file.) + +- [ ] **Step 2: Run test to verify it fails** + +Run the test command. Expected: FAIL — `gg_partial_varpro` does not accept/apply the scale to the values yet (`parametric` is raw log-odds, `causal` not NA). + +- [ ] **Step 3: Thread `scale` and apply the transform** + +In `R/gg_partial_varpro.R`, change the `.build_varpro_dfs` signature and body. Replace: + +```r +#' @keywords internal +.build_varpro_dfs <- function(part_dta, nvars, cat_limit) { + cont_list <- list() + cat_list <- list() + for (feature in seq(nvars)) { + feat <- part_dta[[feature]] + feat_name <- names(part_dta)[[feature]] + if (length(feat$xvirtual) > cat_limit) { + plt.df <- dplyr::bind_cols( + variable = feat$xvirtual, + parametric = colMeans(feat$yhat.par, na.rm = TRUE), + nonparametric = colMeans(feat$yhat.nonpar, na.rm = TRUE), + causal = colMeans(feat$yhat.causal, na.rm = TRUE) + ) + plt.df$name <- feat_name + cont_list[[feature]] <- plt.df + } else { + cat_list[[feature]] <- .process_cat_var(feat, feat_name) + } + } + list( + continuous = dplyr::bind_rows(cont_list), + categorical = dplyr::bind_rows(cat_list) + ) +} +``` + +with: + +```r +#' @keywords internal +.build_varpro_dfs <- function(part_dta, nvars, cat_limit, scale = "generic") { + bounded <- .is_bounded_scale(scale) + cont_list <- list() + cat_list <- list() + for (feature in seq(nvars)) { + feat <- part_dta[[feature]] + feat_name <- names(part_dta)[[feature]] + if (length(feat$xvirtual) > cat_limit) { + plt.df <- dplyr::bind_cols( + variable = feat$xvirtual, + parametric = colMeans(.scale_transform(feat$yhat.par, scale), + na.rm = TRUE), + nonparametric = colMeans(.scale_transform(feat$yhat.nonpar, scale), + na.rm = TRUE), + # `causal` is a centered contrast: not shown on bounded scales + causal = if (bounded) NA_real_ else + colMeans(feat$yhat.causal, na.rm = TRUE) + ) + plt.df$name <- feat_name + cont_list[[feature]] <- plt.df + } else { + cat_list[[feature]] <- .process_cat_var(feat, feat_name, scale) + } + } + list( + continuous = dplyr::bind_rows(cont_list), + categorical = dplyr::bind_rows(cat_list) + ) +} +``` + +Replace `.process_cat_var`: + +```r +#' @keywords internal +.process_cat_var <- function(feat, feat_name) { + n_cats <- length(unique(feat$xorg)) + cat_feat <- list() + for (ind in seq(n_cats)) { + cat_feat[[ind]] <- dplyr::bind_cols( + parametric = feat$yhat.par[, ind], + nonparametric = feat$yhat.nonpar[, ind], + causal = feat$yhat.causal[, ind] + ) + cat_feat[[ind]]$variable <- unique(feat$xorg)[ind] + plt.df <- if (ind == 1L) cat_feat[[ind]] else + dplyr::bind_rows(plt.df, cat_feat[[ind]]) + } + plt.df$name <- feat_name + plt.df +} +``` + +with: + +```r +#' @keywords internal +.process_cat_var <- function(feat, feat_name, scale = "generic") { + bounded <- .is_bounded_scale(scale) + n_cats <- length(unique(feat$xorg)) + cat_feat <- list() + for (ind in seq(n_cats)) { + cat_feat[[ind]] <- dplyr::bind_cols( + parametric = .scale_transform(feat$yhat.par[, ind], scale), + nonparametric = .scale_transform(feat$yhat.nonpar[, ind], scale), + causal = if (bounded) NA_real_ else feat$yhat.causal[, ind] + ) + cat_feat[[ind]]$variable <- unique(feat$xorg)[ind] + plt.df <- if (ind == 1L) cat_feat[[ind]] else + dplyr::bind_rows(plt.df, cat_feat[[ind]]) + } + plt.df$name <- feat_name + plt.df +} +``` + +In `gg_partial_varpro`, **resolve `scale` once, early** — immediately after the +input-validation block (after the `...`-length warning, *before* the C-path +block). Validation needs the raw `"auto"`; everything downstream (routing, +conversion, labels, provenance) needs the concrete scale. Insert: + +```r + ## Resolve 'auto' to a concrete scale once; all routing, conversion, labels + ## and provenance below use the resolved value. (Validation above used the + ## raw scale, which must still distinguish 'auto'.) + scale <- .resolve_varpro_scale( + scale, if (!is.null(object)) object$family else NA_character_) +``` + +Then change the dfs call site from: + +```r + dfs <- .build_varpro_dfs(part_dta, nvars, cat_limit) +``` + +to: + +```r + dfs <- .build_varpro_dfs(part_dta, nvars, cat_limit, scale) +``` + +Because `scale` is now the resolved value, the existing C-path check +(`scale %in% c("surv", "chf")`) still behaves correctly in this task (explicit +or auto-resolved `surv` routes to path C for now); Task 6 narrows it to `chf` +and adds the `surv` learner. + +- [ ] **Step 4: Run test to verify it passes** + +Run the test command. Expected: the two new conversion tests PASS. The earlier numeric test "continuous parametric equals colMeans(yhat.par)" still passes because the default `part_dta`-only call resolves to `generic` (identity transform). + +- [ ] **Step 5: Commit** + +```bash +git add R/gg_partial_varpro.R tests/testthat/test_gg_partial_varpro.R +git commit -m "feat(gg_partial_varpro): convert par/nonpar before averaging; blank causal on bounded scales" +``` + +--- + +### Task 4: Survival S(t) learner + +`.surv_at_tau` pulls the S(τ) column; `.surv_learner` is the partialpro learner (mirrors `.rmst_learner`). + +**Files:** +- Modify: `R/gg_partial_varpro.R` (add both helpers near `.rmst_learner`). +- Test: `tests/testthat/test_gg_partial_varpro.R` + +- [ ] **Step 1: Write the failing test** + +```r +## ── v3.3.0 survival S(t) learner ───────────────────────────────────────────── +test_that(".surv_at_tau pulls S(tau) snapped to nearest event time", { + surv <- matrix(c(0.9, 0.6, 0.3), nrow = 1) # S(1), S(2), S(3) + times <- c(1, 2, 3) + expect_equal(ggRandomForests:::.surv_at_tau(surv, times, 2), 0.6) + expect_equal(ggRandomForests:::.surv_at_tau(surv, times, 2.1), 0.6) # nearest + expect_equal(ggRandomForests:::.surv_at_tau(surv, times, 3), 0.3) +}) + +test_that(".surv_learner returns S(tau) in [0,1] for OOB and newdata", { + skip_if_not_installed("randomForestSRC") + m <- make_mock_cpath() # defined later in this file + tau <- stats::median(m$rf$time.interest) + learner <- ggRandomForests:::.surv_learner(list(rf = m$rf), tau) + oob <- learner() # missing(newx) -> OOB + nd <- learner(survival::veteran[1:5, ]) # newdata branch + expect_length(oob, nrow(m$rf$xvar)) + expect_length(nd, 5L) + expect_true(all(oob >= 0 & oob <= 1)) + expect_true(all(nd >= 0 & nd <= 1)) +}) +``` + +`make_mock_cpath()` is defined lower in the test file (testthat sources the whole file before running, but `test_that` blocks execute top-to-bottom). **Place these two tests AFTER the `make_mock_cpath` definition** (search for `make_mock_cpath <- function` and add them just below it, alongside the existing `.rmst_learner` test). + +- [ ] **Step 2: Run test to verify it fails** + +Run the test command. Expected: FAIL — `.surv_at_tau`/`.surv_learner` not found. + +- [ ] **Step 3: Add the helpers** + +In `R/gg_partial_varpro.R`, add directly after `.rmst_learner`/`.rmst_from_survival`: + +```r +## Survival probability at horizon tau from a survival matrix: the S(tau) +## column, snapped to the nearest event time. `surv` is n x J with column k = +## S(times[k]) (randomForestSRC layout). +#' @keywords internal +.surv_at_tau <- function(surv, times, tau) { + if (is.null(dim(surv))) surv <- matrix(surv, nrow = 1L) + if (ncol(surv) != length(times)) { + stop(sprintf(paste0(".surv_at_tau: survival matrix has %d column(s) but ", + "%d time point(s); the grids must match."), + ncol(surv), length(times)), call. = FALSE) + } + surv[, which.min(abs(times - tau))] +} + +## S(tau) learner for varPro::partialpro: maps feature rows to S(tau | x) from +## the survival forest in object$rf. Same prediction machinery as +## .rmst_learner; pulls the S(tau) column instead of integrating. +#' @keywords internal +.surv_learner <- function(object, tau) { + rf <- object$rf + function(newx) { + if (missing(newx)) { + pr <- randomForestSRC::predict.rfsrc(rf, perf.type = "none") + surv <- pr$survival.oob + if (is.null(surv)) surv <- pr$survival + } else { + pr <- randomForestSRC::predict.rfsrc(rf, newx, perf.type = "none") + surv <- pr$survival + } + times <- pr$time.interest + if (is.null(times)) times <- rf$time.interest + .surv_at_tau(surv, times, tau) + } +} +``` + +- [ ] **Step 4: Run test to verify it passes** + +Run the test command. Expected: both new survival-learner tests PASS. + +- [ ] **Step 5: Commit** + +```bash +git add R/gg_partial_varpro.R tests/testthat/test_gg_partial_varpro.R +git commit -m "feat(gg_partial_varpro): .surv_learner / .surv_at_tau (survival probability)" +``` + +--- + +### Task 5: Default τ = median follow-up + +`.default_surv_tau` returns the median observed survival time, with a fallback to `median(time.interest)`. + +**Files:** +- Modify: `R/gg_partial_varpro.R` (add helper). +- Test: `tests/testthat/test_gg_partial_varpro.R` + +- [ ] **Step 1: Write the failing test** + +```r +## ── v3.3.0 default tau = median follow-up ──────────────────────────────────── +test_that(".default_surv_tau is the median observed survival time", { + skip_if_not_installed("randomForestSRC") + m <- make_mock_cpath() # veteran survival rfsrc + tau <- ggRandomForests:::.default_surv_tau(list(rf = m$rf)) + expect_equal(tau, stats::median(survival::veteran$time)) +}) + +test_that(".default_surv_tau falls back to median(time.interest)", { + fake <- list(rf = list(time.interest = c(2, 4, 6, 8))) # no yvar + expect_equal(ggRandomForests:::.default_surv_tau(fake), 5) # median(2,4,6,8) +}) +``` + +- [ ] **Step 2: Run test to verify it fails** + +Run the test command. Expected: FAIL — `.default_surv_tau` not found. + +- [ ] **Step 3: Add the helper** + +In `R/gg_partial_varpro.R`, add near the survival learner: + +```r +## Data-driven, units-safe default horizon for survival scales: the median +## observed follow-up time (the survival response's time column). Always in the +## model's own units, so it cannot mismatch them. Falls back to the median of +## the distinct event times if the raw response is not reachable. +#' @keywords internal +.default_surv_tau <- function(object) { + rf <- object$rf + yv <- rf$yvar + tms <- NULL + if (!is.null(yv)) { + yv <- as.data.frame(yv) + # survival response time column is the first non-status numeric column; + # randomForestSRC names it via yvar.names (time first, status second). + tcol <- if (!is.null(rf$yvar.names)) rf$yvar.names[1] else names(yv)[1] + tms <- suppressWarnings(as.numeric(yv[[tcol]])) + } + if (is.null(tms) || !any(is.finite(tms))) tms <- rf$time.interest + stats::median(tms, na.rm = TRUE) +} +``` + +- [ ] **Step 4: Run test to verify it passes** + +Run the test command. Expected: both `.default_surv_tau` tests PASS. + +- [ ] **Step 5: Commit** + +```bash +git add R/gg_partial_varpro.R tests/testthat/test_gg_partial_varpro.R +git commit -m "feat(gg_partial_varpro): .default_surv_tau (median follow-up, units-safe)" +``` + +--- + +### Task 6: Routing + validation (survival surv learner, default τ, no errors) + +Wire `surv` to the new learner on path A, default τ for `surv`/`rmst`, resolve `auto`+survival to `surv`, and drop the "requires time" stops. + +**Files:** +- Modify: `R/gg_partial_varpro.R` — `gg_partial_varpro` (A-path routing), `.validate_varpro_inputs`, `.validate_rmst_inputs`, `.warn_varpro_rmst`. +- Test: `tests/testthat/test_gg_partial_varpro.R` + +- [ ] **Step 1: Write the failing test** + +```r +## ── v3.3.0 survival routing + default tau ──────────────────────────────────── +test_that("gg_partial_varpro: scale='surv' computes via the learner, in [0,1]", { + skip_on_cran() # varpro fit + partialpro (~15s) + skip_if_not_installed("randomForestSRC"); skip_if_not_installed("varPro") + set.seed(13) + pbc <- get(utils::data("pbc", package = "randomForestSRC", + envir = environment())) + pbc <- pbc[stats::complete.cases(pbc), ] + vp <- varPro::varpro(Surv(days, status) ~ ., pbc, ntree = 60, nvar = 2) + + # explicit tau + r <- gg_partial_varpro(object = vp, scale = "surv", time = 1000, nvars = 1) + expect_equal(attr(r, "provenance")$scale, "surv") + expect_equal(attr(r, "provenance")$path, "A") # learner path, not C + expect_true(all(r$continuous$parametric >= 0 & + r$continuous$parametric <= 1)) + expect_true(all(is.na(r$continuous$causal))) # bounded -> causal NA + + # default tau (no time): provenance records the median-follow-up tau + rd <- suppressMessages( + gg_partial_varpro(object = vp, scale = "surv", nvars = 1)) + expect_equal(attr(rd, "provenance")$rmst_tau, + ggRandomForests:::.default_surv_tau(vp)) + + # auto resolves to surv (not mortality, no error) + ra <- suppressMessages(gg_partial_varpro(object = vp, nvars = 1)) + expect_equal(attr(ra, "provenance")$scale, "surv") +}) + +test_that("gg_partial_varpro: scale='surv' with precomputed part_dta still errors w/o object", { + expect_error( + gg_partial_varpro(part_dta = make_mock_vpro_data(), scale = "surv"), + regexp = "requires 'object'" + ) +}) +``` + +- [ ] **Step 2: Run test to verify it fails** + +Run the test command. Expected: FAIL — `surv` currently routes through path C (provenance path `"C"`), no learner, no default τ. + +- [ ] **Step 3: Re-route `surv`, add default τ, relax validation** + +`scale` is already the resolved concrete scale here (Task 3 added the early +`.resolve_varpro_scale` overwrite), so every `scale == ...` / `scale %in% ...` +check below operates on `surv`/`rmst`/`chf`/`mortality`/`prob`/... never +`"auto"`. + +In `R/gg_partial_varpro.R`: + +(a) The C-path block currently catches `surv` and `chf`. Narrow it to `chf` only. Replace: + +```r + ## ---- C-path: route through gg_partial_rfsrc ---------------------------- + if (!is.null(object) && scale %in% c("surv", "chf")) { + return(.gg_partial_varpro_cpath(object, scale, time, model)) + } +``` + +with: + +```r + ## ---- C-path: route CHF through gg_partial_rfsrc ------------------------ + ## (surv now uses the partialpro S(t) learner on path A, below.) + if (!is.null(object) && scale == "chf") { + return(.gg_partial_varpro_cpath(object, scale, time, model)) + } + + ## ---- Survival default horizon: surv/rmst fill tau from the data -------- + if (!is.null(object) && scale %in% c("surv", "rmst") && is.null(time)) { + time <- .default_surv_tau(object) + message("gg_partial_varpro: using default horizon tau = ", signif(time, 4), + " (median follow-up). Set 'time' to choose another.") + } +``` + +(b) The A-path recompute currently only branches `rmst`. Replace: + +```r + if (is.null(part_dta)) { + part_dta <- if (scale == "rmst") { + varPro::partialpro(object, learner = .rmst_learner(object, time), ...) + } else { + varPro::partialpro(object, ...) + } + } +``` + +with: + +```r + if (is.null(part_dta)) { + learner <- switch(scale, + rmst = .rmst_learner(object, time), + surv = .surv_learner(object, time), + NULL) + part_dta <- if (is.null(learner)) { + varPro::partialpro(object, ...) + } else { + varPro::partialpro(object, learner = learner, ...) + } + } +``` + +(c) In `.validate_varpro_inputs`, the `surv`/`chf` "requires object" check stays (we still need `object` for the learner). Change the message set unchanged but keep `surv` requiring `object`: + +The existing check is already correct: + +```r + if (scale %in% c("surv", "chf") && is.null(object)) { + stop("scale = '", scale, "' requires 'object' (the varpro fit)", + call. = FALSE) + } +``` + +Leave it as-is. (This is what the second new test asserts.) + +(d) In `.validate_rmst_inputs`, **remove** the "requires time" stop (τ now defaults). Replace: + +```r +#' @keywords internal +.validate_rmst_inputs <- function(part_dta, object, time) { + if (is.null(time)) { + stop("scale = 'rmst' requires 'time' (the RMST horizon tau)", + call. = FALSE) + } + if (is.null(part_dta) && !is.null(object) && + !identical(object$family, "surv")) { + stop("scale = 'rmst' requires a survival varpro fit ", + "(object$family == \"surv\")", call. = FALSE) + } + invisible(NULL) +} +``` + +with: + +```r +#' @keywords internal +.validate_rmst_inputs <- function(part_dta, object, time) { + # tau is optional now (defaults to median follow-up when recomputing from + # an object); only the survival-family requirement remains. + if (is.null(part_dta) && !is.null(object) && + !identical(object$family, "surv")) { + stop("scale = 'rmst' requires a survival varpro fit ", + "(object$family == \"surv\")", call. = FALSE) + } + invisible(NULL) +} +``` + +(e) `.validate_varpro_inputs` currently calls `.validate_rmst_inputs` only for `scale == "rmst"`. Extend it to also guard `surv` recompute needs a survival fit. Replace: + +```r + if (scale == "rmst") .validate_rmst_inputs(part_dta, object, time) +``` + +with: + +```r + if (scale %in% c("rmst", "surv")) .validate_rmst_inputs(part_dta, object, time) +``` + +(f) `.warn_varpro_rmst` references `scale == "rmst"` for the precomputed-part_dta and out-of-range warnings. Generalise its guard to both survival learner scales. Change the opening: + +```r +.warn_varpro_rmst <- function(part_dta, object, scale, time) { + if (scale == "rmst") { +``` + +to: + +```r +.warn_varpro_rmst <- function(part_dta, object, scale, time) { + if (scale %in% c("rmst", "surv")) { +``` + +(The "tau exceeds largest event time" branch is harmless for `surv` — S(τ) just snaps to the last column; the warning still informs.) + +- [ ] **Step 4: Run test to verify it passes** + +Run the test command. Expected: the routing tests PASS (the e2e one runs because `NOT_CRAN=true`). Re-run the *whole* file; the previously-added `scale='rmst'` tests that asserted "requires 'time'" must be reconciled — see Step 4b. + +- [ ] **Step 4b: Fix the now-obsolete rmst "requires time" test** + +The file has an older test: + +```r +test_that("gg_partial_varpro: scale='rmst' without time → stop", { + expect_error( + gg_partial_varpro(part_dta = make_mock_vpro_data(), scale = "rmst"), + regexp = "requires 'time'" + ) +}) +``` + +RMST no longer errors on missing `time` when recomputing — but this call passes only `part_dta` (no `object`), so nothing recomputes and τ is irrelevant; it should now succeed as a label-only path. Replace that test with: + +```r +test_that("gg_partial_varpro: scale='rmst' part_dta-only no longer needs time", { + # tau defaults when recomputing from object; with part_dta only there is + # nothing to recompute, so this is a label-only call and must not error. + expect_no_error(suppressWarnings( + gg_partial_varpro(part_dta = make_mock_vpro_data(), scale = "rmst"))) +}) +``` + +Re-run the file; expected: PASS. + +- [ ] **Step 5: Commit** + +```bash +git add R/gg_partial_varpro.R tests/testthat/test_gg_partial_varpro.R +git commit -m "feat(gg_partial_varpro): surv via learner on path A; surv/rmst default tau; relax validation" +``` + +--- + +### Task 7: Target class, provenance, and y-axis labels + +Resolve the classification target class, record it in provenance, and build the new y-labels. + +**Files:** +- Modify: `R/gg_partial_varpro.R` — add `.varpro_target`, extend `.varpro_provenance`. +- Modify: `R/plot.gg_partial_varpro.R` — `.partial_varpro_ylabel`. +- Test: `tests/testthat/test_gg_partial_varpro.R` + +- [ ] **Step 1: Write the failing test** + +```r +## ── v3.3.0 labels + provenance target ──────────────────────────────────────── +test_that(".partial_varpro_ylabel: prob/odds/logodds with target class", { + lab <- ggRandomForests:::.partial_varpro_ylabel + expect_match(lab(list(scale = "prob", target = "yes")), "P\\(Y = yes\\)") + expect_match(lab(list(scale = "odds", target = "yes")), "Odds\\(Y = yes\\)") + expect_match(lab(list(scale = "logodds", target = "yes")), "Log-odds") + expect_equal(lab(list(scale = "prob", target = NA)), "Probability") + expect_match(lab(list(scale = "surv", rmst_tau = 1000)), + "Survival probability at t = 1000") +}) + +test_that("gg_partial_varpro: classification provenance records target class", { + skip_on_cran(); skip_if_not_installed("varPro") + set.seed(5) + dat <- data.frame(y = factor(rep(c("a", "b"), 60)), + x1 = rnorm(120), x2 = rnorm(120)) + vp <- varPro::varpro(y ~ ., dat, ntree = 40, nvar = 2) + r <- gg_partial_varpro(object = vp, scale = "prob", nvars = 1) + expect_equal(attr(r, "provenance")$scale, "prob") + expect_equal(attr(r, "provenance")$target, "b") # last factor level default +}) +``` + +- [ ] **Step 2: Run test to verify it fails** + +Run the test command. Expected: FAIL — `.partial_varpro_ylabel` has no `prob/odds/logodds/surv` arms; provenance has no `target`. + +- [ ] **Step 3a: Add `.varpro_target` and extend provenance** + +In `R/gg_partial_varpro.R`, add: + +```r +## Classification target class label: the `target` passed through ... if any, +## else the last factor level of the response (partialpro's default target). +## NA for non-classification fits or when only part_dta is supplied. +#' @keywords internal +.varpro_target <- function(object, dots) { + if (is.null(object) || !identical(object$family, "class")) return(NA_character_) + if (!is.null(dots$target)) return(as.character(dots$target)) + lv <- levels(object$y.org) + if (is.null(lv)) lv <- levels(as.factor(object$y)) + if (is.null(lv)) NA_character_ else lv[length(lv)] +} +``` + +In `.varpro_provenance`, add a `target` slot. Change the signature and body. Replace: + +```r +#' @keywords internal +.varpro_provenance <- function(object, scale, time, path = "A") { + prov_family <- if (!is.null(object)) object$family else NA_character_ + prov_xvars <- if (!is.null(object)) object$xvar.names else NA_character_ + prov_n <- if (!is.null(object)) nrow(object$x) else NA_integer_ + prov_ntree <- if (!is.null(object)) object$max.tree else NA_integer_ + scale_used <- .resolve_varpro_scale(scale, prov_family) + list( + source = "varPro", + family = prov_family, + ntree = prov_ntree, + n = prov_n, + scale = scale_used, + rmst_tau = time, + xvar.names = prov_xvars, + path = path + ) +} +``` + +with: + +```r +#' @keywords internal +.varpro_provenance <- function(object, scale, time, path = "A", target = NA) { + prov_family <- if (!is.null(object)) object$family else NA_character_ + prov_xvars <- if (!is.null(object)) object$xvar.names else NA_character_ + prov_n <- if (!is.null(object)) nrow(object$x) else NA_integer_ + prov_ntree <- if (!is.null(object)) object$max.tree else NA_integer_ + scale_used <- .resolve_varpro_scale(scale, prov_family) + list( + source = "varPro", + family = prov_family, + ntree = prov_ntree, + n = prov_n, + scale = scale_used, + rmst_tau = time, + target = target, + xvar.names = prov_xvars, + path = path + ) +} +``` + +In `gg_partial_varpro`, the A-path builds provenance. Find: + +```r + prov <- .varpro_provenance(object, scale, time, path = "A") +``` + +and replace with (capture target from `...`, after the `time` default is set in Task 6 so `rmst_tau` records the resolved τ): + +```r + prov <- .varpro_provenance(object, scale, time, path = "A", + target = .varpro_target(object, list(...))) +``` + +- [ ] **Step 3b: New y-axis labels** + +In `R/plot.gg_partial_varpro.R`, replace the `.partial_varpro_ylabel` `switch` body to add the classification + survival arms. Replace: + +```r + switch(scale, + mortality = "Ensemble mortality (expected events)", + rmst = { + tau <- prov$rmst_tau + if (!is.null(tau) && !is.na(tau)) { + sprintf("RMST (τ = %g)", tau) + } else { + "RMST" + } + }, + surv = { + t <- prov$rmst_tau + if (!is.null(t) && !is.na(t)) { + sprintf("Survival probability at t = %g", t) + } else { + "Survival probability" + } + }, + chf = { + t <- prov$rmst_tau + if (!is.null(t) && !is.na(t)) { + sprintf("Cumulative hazard at t = %g", t) + } else { + "Cumulative hazard" + } + }, + "Partial Effect" # generic / auto-regr / auto-class / unknown + ) +``` + +with: + +```r + tgt <- prov$target + has_tgt <- !is.null(tgt) && !is.na(tgt) + switch(scale, + prob = if (has_tgt) sprintf("P(Y = %s)", tgt) else "Probability", + odds = if (has_tgt) sprintf("Odds(Y = %s)", tgt) else "Odds", + logodds = if (has_tgt) sprintf("Log-odds(Y = %s)", tgt) else "Log-odds", + mortality = "Ensemble mortality (expected events)", + rmst = { + tau <- prov$rmst_tau + if (!is.null(tau) && !is.na(tau)) sprintf("RMST (τ = %g)", tau) + else "RMST" + }, + surv = { + t <- prov$rmst_tau + if (!is.null(t) && !is.na(t)) sprintf("Survival probability at t = %g", t) + else "Survival probability" + }, + chf = { + t <- prov$rmst_tau + if (!is.null(t) && !is.na(t)) sprintf("Cumulative hazard at t = %g", t) + else "Cumulative hazard" + }, + "Partial Effect" # generic / regr / unknown + ) +``` + +- [ ] **Step 4: Run test to verify it passes** + +Run the test command. Expected: the label + provenance-target tests PASS. + +- [ ] **Step 5: Commit** + +```bash +git add R/gg_partial_varpro.R R/plot.gg_partial_varpro.R tests/testthat/test_gg_partial_varpro.R +git commit -m "feat(gg_partial_varpro): target-class provenance + prob/odds/surv y-labels" +``` + +--- + +### Task 8: Plot drops + warns on causal for bounded scales + +When the scale is bounded, `causal` is all-`NA`; the plot must not draw it and must warn if the user explicitly asked for `type = "causal"`. + +**Files:** +- Modify: `R/plot.gg_partial_varpro.R` — `plot.gg_partial_varpro` (A-path, after `match.arg(type)`). +- Test: `tests/testthat/test_gg_partial_varpro.R` + +- [ ] **Step 1: Write the failing test** + +```r +## ── v3.3.0 plot: causal hidden on bounded scales ───────────────────────────── +test_that("plot.gg_partial_varpro: bounded scale drops causal, warns if asked", { + d <- make_mock_vpro_data() + res <- gg_partial_varpro(d, nvars = 1, scale = "prob") + # default type: no error, causal silently absent + gg <- plot(res) + expect_s3_class(gg, "ggplot") + # explicit causal -> warning + still returns a ggplot (par/nonpar only) + expect_warning(plot(res, type = "causal"), regexp = "causal") +}) +``` + +- [ ] **Step 2: Run test to verify it fails** + +Run the test command. Expected: FAIL — `plot` currently passes `type` straight to `pivot_longer`; `type = "causal"` on an all-NA column produces an all-NA line, no warning (and `type` default still includes causal, which would draw an all-NA layer). + +- [ ] **Step 3: Drop/warn causal on bounded scales** + +In `R/plot.gg_partial_varpro.R`, in `plot.gg_partial_varpro`, just after: + +```r + ## A-path rendering. + type <- match.arg(type, several.ok = TRUE) + ylabel <- .partial_varpro_ylabel(prov) +``` + +insert: + +```r + ## On bounded scales (prob/odds/surv) the causal contrast is not shown + ## (it cannot share the level axis). Drop it; warn only if asked explicitly. + if (!is.null(prov) && .is_bounded_scale(prov$scale %||% "generic")) { + if ("causal" %in% type) { + warning("plot.gg_partial_varpro: 'causal' is not shown on the ", + prov$scale, " scale (it is a contrast, not a level). ", + "Use scale = 'logodds' (classification) or 'mortality'/'rmst' ", + "(survival) to see it.", call. = FALSE) + } + type <- setdiff(type, "causal") + } +``` + +(`.is_bounded_scale` is defined in `R/gg_partial_varpro.R`; both files are in the same package namespace, so no import is needed.) + +- [ ] **Step 4: Run test to verify it passes** + +Run the test command. Expected: the plot causal test PASSES. + +- [ ] **Step 5: Commit** + +```bash +git add R/plot.gg_partial_varpro.R tests/testthat/test_gg_partial_varpro.R +git commit -m "feat(plot.gg_partial_varpro): drop/warn causal on bounded scales" +``` + +--- + +### Task 9: Documentation, references, NEWS + +Roxygen `@section`s, `@details`, `@param scale` update, the virtual-twins reference, and NEWS bullets. + +**Files:** +- Modify: `R/gg_partial_varpro.R` (roxygen header), `R/plot.gg_partial_varpro.R` (roxygen header), `NEWS.md`. +- Regenerate: `man/` via `devtools::document()`. + +- [ ] **Step 1: Update `@param scale` and `@details` on `gg_partial_varpro`** + +In the `R/gg_partial_varpro.R` roxygen header, replace the `@param scale` block: + +```r +#' @param scale Character; sets the y-axis label and, for survival forests, +#' the output type. One of \code{"auto"} (default), \code{"mortality"}, +#' \code{"rmst"}, \code{"surv"}, or \code{"chf"}. +``` + +with: + +```r +#' @param scale Character; the y-axis scale. One of \code{"auto"} (default), +#' the classification scales \code{"prob"} / \code{"odds"} / \code{"logodds"}, +#' or the survival scales \code{"rmst"} / \code{"surv"} / \code{"mortality"} / +#' \code{"chf"}. With \code{"auto"}: classification fits resolve to +#' \code{"prob"} (probability of the target class) and survival fits to +#' \code{"surv"} (survival probability at a default horizon \eqn{\tau}); see +#' \strong{Details}. +``` + +Add a `@details` block (after the existing RMST details paragraph): + +```r +#' **Classification scale (scale = "prob"/"odds"/"logodds"):** +#' \code{varPro::partialpro} returns classification effects as \emph{log-odds} +#' of the target class. \code{scale = "prob"} (the classification default) +#' back-transforms to probability \eqn{P(Y = \mathrm{target})}, \code{"odds"} to +#' the odds, and \code{"logodds"} keeps the raw scale. The back-transform is +#' applied per observation \emph{before} averaging, so the curve is the mean +#' predicted probability, not the probability of the mean log-odds. The +#' \code{causal} contrast is shown only on \code{"logodds"} (see +#' \code{\link{plot.gg_partial_varpro}}). +#' +#' **Survival scale (scale = "surv"):** \code{scale = "surv"} computes survival +#' probability \eqn{S(\tau \mid x)} through \code{partialpro} (the same UVT +#' engine as mortality and RMST), bounded in [0, 1]. When \code{time} is not +#' supplied, \eqn{\tau} defaults to the \strong{median follow-up time} of the +#' fit -- a data-driven horizon that is always in the model's own time units, +#' so it cannot be mis-specified the way a hand-typed \eqn{\tau} can. The +#' resolved \eqn{\tau} is reported in a message and the axis label; pass +#' \code{time = tau} to choose another. \code{scale = "mortality"} keeps the +#' unbounded ensemble-mortality score as an explicit opt-in. +``` + +- [ ] **Step 2: Add the `@section`s and reference on `plot.gg_partial_varpro`** + +In `R/plot.gg_partial_varpro.R` roxygen header, after the existing +"What this tells you" section, add (the prose was approved in the spec): + +```r +#' @section Reading a probability curve (scale = "prob"): +#' The y-axis is \eqn{P(Y = \mathrm{target})}, the model's predicted probability +#' of the target class as the focal variable varies (others held at their +#' UVT-plausible average). \code{"odds"} and \code{"logodds"} are the same +#' curve on the odds and log-odds scales. The \code{causal} curve is a +#' contrast (below) and is \emph{not} shown on \code{"prob"}/\code{"odds"}; +#' use \code{"logodds"} to see it. +#' +#' @section Reading a survival-probability curve (scale = "surv"): +#' The y-axis is \eqn{S(\tau \mid x)}, the predicted probability of surviving +#' past \eqn{\tau}, bounded in [0, 1] and read in the model's time units. +#' Higher is better (more survival). \eqn{\tau} defaults to the median +#' follow-up time when not supplied. +#' +#' @section What the causal curve is, and when to use it: +#' \code{causal} is the \strong{baseline-subtracted local effect} -- varPro's +#' virtual- ("digital-") twins estimator (Ishwaran & Blackstone, 2025). It +#' shows how the prediction shifts as the focal variable moves away from the +#' reference grid point, with the other covariates held at on-manifold +#' (UVT-plausible) values; it is a \strong{contrast} (it starts at 0), not a +#' level. Use it when you want the local effect (change-from-baseline) rather +#' than the absolute predicted level, and as a cross-check on the parametric +#' and nonparametric curves. It is varpro's local estimator \emph{within the +#' fitted model}, \strong{not a structural causal claim} about the +#' data-generating process. Because it is a contrast it cannot share a +#' probability/odds axis with the absolute curves, so it is shown only on the +#' additive scales (\code{"logodds"}, \code{"mortality"}, \code{"rmst"}). +``` + +Add to the `@references` of BOTH `R/gg_partial_varpro.R` and +`R/plot.gg_partial_varpro.R` (alongside the existing Ishwaran et al. 2008 ref): + +```r +#' Ishwaran H, Blackstone EH (2025). +#' Harnessing the power of virtual (digital) twins: Graphical causal tools for +#' understanding patient and hospital differences. +#' \emph{Computational and Structural Biotechnology Journal}, \bold{28}, 312. +``` + +- [ ] **Step 3: NEWS bullets** + +In `NEWS.md`, under the `ggRandomForests v3.3.0` heading, add above the existing +RMST-docs bullet: + +``` +* `gg_partial_varpro()`: **classification partial plots now default to + probability.** `scale = "auto"` on a classification fit resolves to `"prob"` + (P(Y = target class)) instead of raw log-odds; `"odds"` and `"logodds"` are + options. The back-transform is applied before averaging (mean predicted + probability). The `causal` contrast is shown only on `"logodds"`. +* `gg_partial_varpro()`: **survival partial plots now default to survival + probability.** `scale = "auto"` on a survival fit resolves to `"surv"` + (S(tau | x), bounded 0-1) via a new partialpro learner, instead of the + unbounded ensemble-mortality score (still available via + `scale = "mortality"`). `"surv"` and `"rmst"` default `tau` to the median + follow-up time when `time` is omitted -- a units-safe, data-driven horizon + (v3.2.0's `rmst` required `time`; this is a loosening). The resolved `tau` is + reported in a message and the axis label. +* `plot.gg_partial_varpro()`: documents what the `causal` (virtual-twins) + estimator is and when to use it, and explains why it is hidden on the bounded + probability scales. +``` + +- [ ] **Step 4: Regenerate docs and verify clean** + +Run: + +```bash +Rscript -e 'devtools::document()' +``` + +Expected: writes `man/gg_partial_varpro.Rd` and `man/plot.gg_partial_varpro.Rd`, no warnings. + +- [ ] **Step 5: Commit** + +```bash +git add R/gg_partial_varpro.R R/plot.gg_partial_varpro.R NEWS.md man/ +git commit -m "docs(gg_partial_varpro): scale @details, prob/surv/causal @sections, NEWS, virtual-twins ref" +``` + +--- + +### Task 10: Full verification (self-review, suite, lint, R CMD check) + +**Files:** none new — verification only. + +- [ ] **Step 1: Run the full test file** + +```bash +NOT_CRAN=true Rscript -e 'suppressMessages(devtools::load_all(quiet=TRUE)); testthat::test_file("tests/testthat/test_gg_partial_varpro.R", reporter="summary")' +``` + +Expected: all PASS / SKIP, 0 FAIL. Investigate any failure before proceeding. + +- [ ] **Step 2: Reconcile sibling test files** + +```bash +NOT_CRAN=true Rscript -e 'suppressMessages(devtools::load_all(quiet=TRUE)); testthat::test_file("tests/testthat/test_gg_partialpro.R", reporter="minimal"); testthat::test_file("tests/testthat/test_ggrandomforests_news.R", reporter="minimal")' +``` + +Expected: PASS. The news test greps `NEWS.md` for the DESCRIPTION version (3.3.0) — already aligned. + +- [ ] **Step 3: Lint the changed files** + +```bash +Rscript -e 'cat(length(lintr::lint("R/gg_partial_varpro.R")), length(lintr::lint("R/plot.gg_partial_varpro.R")), "issues\n")' +``` + +Expected: `0 0 issues`. If `cyclocomp` fires on `gg_partial_varpro` (it grew), extract the new routing block into a small `.resolve_partial_part_dta(object, scale, time, ...)` helper and re-run. + +- [ ] **Step 4: Spec coverage self-check** + +Confirm each spec section maps to a task: 3a/3b → T1; 3c → T2,T3; 3d → T3 (NA) + T8 (warn) + T9 (docs); 3e → T4,T5,T6; 3f → T7; 3g → T7; §4 docs → T9; §5 tests → distributed. No gaps. + +- [ ] **Step 5: Full CRAN check (no submission)** + +```bash +R CMD build . && R CMD check --as-cran ggRandomForests_3.3.0.tar.gz +``` + +Expected: `Status: OK` (0/0/0), manual builds. (The tarball is gitignored.) Note check time stays < 10 min. + +- [ ] **Step 6: Commit any verification fixes** + +```bash +git add -A +git commit -m "test/chore: 3.3.0 scale work — full suite + check green" +``` + +--- + +## Notes for the implementer + +- The repo skips varpro forest grows on CRAN (gcc-UBSAN upstream issue); keep the heavy end-to-end fits behind `skip_on_cran()`, and exercise the pure helpers (`.scale_transform`, `.surv_at_tau`, `.default_surv_tau`, `.resolve_varpro_scale`, `.partial_varpro_ylabel`) with mock inputs that run on CRAN. +- `%||%` is defined in the package (`R/print_helpers.R`); usable in both files. +- Do NOT `git add -A` blind — the working tree has gitignored build artifacts (`.superpowers/`, vignette HTML, `*.tar.gz`); stage explicit paths. +- Two deliberate behaviour changes ship here (classification default → prob; survival default → surv, no more mortality default). They are documented in NEWS and are the headline of 3.3.0. diff --git a/man/gg_partial_rfsrc.Rd b/man/gg_partial_rfsrc.Rd index 29f73bf0..401753ab 100644 --- a/man/gg_partial_rfsrc.Rd +++ b/man/gg_partial_rfsrc.Rd @@ -60,7 +60,8 @@ A named list with two elements: (the level of \code{xvar2.name}) and \code{time} (survival forests only) for all continuous predictors.} \item{categorical}{A \code{data.frame} with the same columns but -\code{x} kept as character, for low-cardinality predictors.} +\code{x} kept as a \code{factor} (levels in the model's level order, +not alphabetical), for low-cardinality predictors.} } } \description{ diff --git a/man/gg_partial_varpro.Rd b/man/gg_partial_varpro.Rd index c41085e7..97f9ced2 100644 --- a/man/gg_partial_varpro.Rd +++ b/man/gg_partial_varpro.Rd @@ -8,7 +8,7 @@ gg_partial_varpro( part_dta = NULL, object = NULL, - scale = c("auto", "rmst", "mortality", "surv", "chf"), + scale = c("auto", "prob", "odds", "logodds", "rmst", "surv", "mortality", "chf"), time = NULL, nvars = NULL, cat_limit = 10, @@ -19,7 +19,7 @@ gg_partial_varpro( gg_partialpro( part_dta, object = NULL, - scale = c("auto", "rmst", "mortality", "surv", "chf"), + scale = c("auto", "prob", "odds", "logodds", "rmst", "surv", "mortality", "chf"), time = NULL, nvars = NULL, cat_limit = 10, @@ -39,9 +39,13 @@ came from. When supplied it provides the provenance metadata, and when \code{varPro::partialpro(object)} for you. Required when \code{scale \%in\% c("surv","chf")}.} -\item{scale}{Character; sets the y-axis label and, for survival forests, -the output type. One of \code{"auto"} (default), \code{"mortality"}, -\code{"rmst"}, \code{"surv"}, or \code{"chf"}.} +\item{scale}{Character; the y-axis scale. One of \code{"auto"} (default), +the classification scales \code{"prob"} / \code{"odds"} / \code{"logodds"}, +or the survival scales \code{"rmst"} / \code{"surv"} / \code{"mortality"} / +\code{"chf"}. With \code{"auto"}: classification fits resolve to +\code{"prob"} (probability of the target class) and survival fits to +\code{"surv"} (survival probability at a default horizon \eqn{\tau}); see +\strong{Details}.} \item{time}{Numeric; the evaluation time point. Required when \code{scale = "rmst"} (the RMST horizon \eqn{\tau}), where it now @@ -121,6 +125,27 @@ can only be relabeled, and \code{gg_partial_varpro} warns when you try. A \eqn{\tau} beyond the model's largest event time is truncated there (with a warning), since \eqn{S(t)} cannot be extrapolated. +\strong{Classification scale (scale = "prob"/"odds"/"logodds"):} +\code{varPro::partialpro} returns classification effects as \emph{log-odds} +of the target class. \code{scale = "prob"} (the classification default) +back-transforms to probability \eqn{P(Y = \mathrm{target})}, \code{"odds"} to +the odds, and \code{"logodds"} keeps the raw scale. The back-transform is +applied per observation \emph{before} averaging, so the curve is the mean +predicted probability, not the probability of the mean log-odds. The +\code{causal} contrast is shown only on \code{"logodds"} (see +\code{\link{plot.gg_partial_varpro}}). + +\strong{Survival probability (scale = "surv"):} \code{scale = "surv"} (the +survival default) computes \eqn{S(\tau \mid x)} through \code{partialpro} +(the same UVT engine as mortality and RMST), bounded in \eqn{[0, 1]}. When +\code{time} is not supplied, \eqn{\tau} defaults to the \strong{median +follow-up time} of the fit -- a data-driven horizon that is always in the +model's own time units, so it cannot be mis-specified the way a hand-typed +\eqn{\tau} can. The resolved \eqn{\tau} is reported in a message and the +axis label; pass \code{time = tau} to choose another. \code{scale = +"mortality"} keeps the unbounded ensemble-mortality score as an explicit +opt-in. + \strong{Ensemble mortality (scale = "mortality"):} here the y-axis is \emph{ensemble mortality}, the expected number of events a subject would see if they were exposed to the study-average cumulative hazard. It is @@ -228,6 +253,11 @@ head(result$categorical) Ishwaran H, Kogalur UB, Blackstone EH, Lauer MS (2008). Random survival forests. \emph{The Annals of Applied Statistics}, \bold{2}(3), 841--860. \doi{10.1214/08-AOAS169}. + +Ishwaran H, Blackstone EH (2025). +Harnessing the power of virtual (digital) twins: Graphical causal tools for +understanding patient and hospital differences. +\emph{Computational and Structural Biotechnology Journal}, \bold{28}, 312. } \seealso{ \code{\link{plot.gg_partial_varpro}}, diff --git a/man/plot.gg_partial_varpro.Rd b/man/plot.gg_partial_varpro.Rd index 910de5d6..916400e8 100644 --- a/man/plot.gg_partial_varpro.Rd +++ b/man/plot.gg_partial_varpro.Rd @@ -64,9 +64,62 @@ Use these curves to describe how the model uses each variable, not to claim how the world works. They are a window into the fitted relationship; they do not by themselves establish that intervening on the variable would move the outcome. For survival path-C -(\code{scale = "surv"} or \code{"chf"}), the y-axis is on the -probability or cumulative-hazard scale, which is usually the scale -you want to report to a clinical audience. +(\code{scale = "chf"}), the y-axis is on the cumulative-hazard scale. +} + +\section{Reading a probability curve (scale = "prob")}{ + +The y-axis is \eqn{P(Y = \mathrm{target})}, the model's predicted probability +of the target class as the focal variable varies (others held at their +UVT-plausible average). \code{"odds"} and \code{"logodds"} are the same +curve on the odds and log-odds scales. The \code{causal} curve is a +contrast (below) and is \emph{not} shown on \code{"prob"}/\code{"odds"}; +use \code{"logodds"} to see it. +} + +\section{Reading a survival-probability curve (scale = "surv")}{ + +The y-axis is \eqn{S(\tau \mid x)}, the predicted probability of surviving +past \eqn{\tau}, bounded in \eqn{[0, 1]} and read in the model's time units. +Higher is better (more survival). \eqn{\tau} defaults to the median +follow-up time when not supplied. +} + +\section{What the causal curve is, and when to use it}{ + +\code{causal} is the \strong{baseline-subtracted local effect} -- varPro's +virtual- ("digital-") twins estimator (Ishwaran & Blackstone, 2025). It +shows how the prediction shifts as the focal variable moves away from the +reference grid point, with the other covariates held at on-manifold +(UVT-plausible) values; it is a \strong{contrast} (it starts at 0), not a +level. Use it when you want the local effect (change-from-baseline) rather +than the absolute predicted level, and as a cross-check on the parametric +and nonparametric curves. It is varpro's local estimator \emph{within the +fitted model}, \strong{not a structural causal claim} about the +data-generating process. Because it is a contrast it cannot share a +probability/odds axis with the absolute curves, so it is shown only on the +additive scales (\code{"logodds"}, \code{"mortality"}, \code{"rmst"}). +} + +\section{Reading an RMST curve (scale = "rmst")}{ + +The y-axis is restricted mean survival time at horizon \eqn{\tau}, +\eqn{\mathrm{RMST}(\tau)=\int_0^\tau S(t)\,dt}: the \strong{expected +event-free time during the first \eqn{\tau} time-units}, the area under the +survival curve out to \eqn{\tau}. Read it in the \strong{model's own time +units}, where it is bounded by \eqn{0 \le \mathrm{RMST}(\tau) \le \tau}. + +Two things follow. First, \eqn{\tau} must be given in the fit's time units; +a \eqn{\tau} past the largest event time just truncates to the full +restricted mean and stops changing. Second, higher is better here -- more +time event-free -- which is the opposite of the ensemble-mortality scale. + +A continuous variable's curve sloping \emph{up} means higher values of that +covariate buy you \emph{more} restricted-mean event-free time within \eqn{\tau} +(with the other covariates held at their UVT-plausible average); a flat curve +means the covariate does not move it. Unlike ensemble mortality, RMST reads +on a directly clinical scale, "so many event-free time-units within +\eqn{\tau}", which is usually the one you want to report. } \examples{ @@ -97,6 +150,11 @@ plot(pp, type = "parametric") Ishwaran H, Kogalur UB, Blackstone EH, Lauer MS (2008). Random survival forests. \emph{The Annals of Applied Statistics}, \bold{2}(3), 841--860. \doi{10.1214/08-AOAS169}. + +Ishwaran H, Blackstone EH (2025). +Harnessing the power of virtual (digital) twins: Graphical causal tools for +understanding patient and hospital differences. +\emph{Computational and Structural Biotechnology Journal}, \bold{28}, 312. } \seealso{ \code{\link{gg_partial_varpro}} diff --git a/tests/testthat/test_default_dispatch.R b/tests/testthat/test_default_dispatch.R new file mode 100644 index 00000000..e087300e --- /dev/null +++ b/tests/testthat/test_default_dispatch.R @@ -0,0 +1,15 @@ +# Wrong-class inputs to the rfsrc/randomForest wrappers must error via their +# default S3 methods with a clear message. Fit-free, so these run on CRAN. + +test_that("rfsrc/randomForest wrappers error on non-model input", { + bad <- data.frame(a = 1:3) + + expect_error(gg_error(bad), "expected an 'rfsrc' or 'randomForest' object") + expect_error(gg_vimp(bad), "expected an 'rfsrc' or 'randomForest' object") + expect_error(gg_variable(bad), "expected an 'rfsrc' or 'randomForest' object") + expect_error(gg_rfsrc(bad), "expected an 'rfsrc' or 'randomForest' object") + expect_error(gg_brier(bad), "expected an 'rfsrc' survival object") + + # The error names the class it actually got. + expect_error(gg_error(1:3), "got an object of class integer") +}) diff --git a/tests/testthat/test_gg_beta_varpro.R b/tests/testthat/test_gg_beta_varpro.R index e2034275..97e49d1f 100644 --- a/tests/testthat/test_gg_beta_varpro.R +++ b/tests/testthat/test_gg_beta_varpro.R @@ -1,3 +1,10 @@ +test_that("gg_beta_varpro: non-varpro input errors via the default method", { + # No varPro fit needed — the default method fires before any varpro work. + expect_error(gg_beta_varpro(data.frame(a = 1:3)), + "expected a 'varpro' object") + expect_error(gg_beta_varpro(1:3), "got an object of class integer") +}) + test_that("gg_beta_varpro returns the expected tidy shape", { b <- .beta_fit_mtcars() v <- .varpro_mtcars() diff --git a/tests/testthat/test_gg_isopro.R b/tests/testthat/test_gg_isopro.R index acac8479..cef92cf3 100644 --- a/tests/testthat/test_gg_isopro.R +++ b/tests/testthat/test_gg_isopro.R @@ -19,6 +19,15 @@ make_iso_fit <- function(seed = 1L, method = "rnd", ntree = 25, sampsize = 16) { ) } +test_that("gg_isopro: non-isopro input errors via the default method", { + # No varPro fit needed — the default method fires before any isopro work, + # so this runs everywhere (incl. CRAN). + expect_error(gg_isopro(data.frame(a = 1:3)), + "expected an 'isopro' object") + expect_error(gg_isopro(1:3), + "got an object of class integer") +}) + test_that("gg_isopro: returns gg_isopro data.frame with correct columns", { fit <- make_iso_fit() gg <- gg_isopro(fit) diff --git a/tests/testthat/test_gg_ivarpro.R b/tests/testthat/test_gg_ivarpro.R index 4b26fbfd..7732f8d8 100644 --- a/tests/testthat/test_gg_ivarpro.R +++ b/tests/testthat/test_gg_ivarpro.R @@ -1,3 +1,12 @@ +# ---- Dispatch -------------------------------------------------------------- + +test_that("gg_ivarpro: non-varpro input errors via the default method", { + # No varPro fit needed — the default method fires before any varpro work. + expect_error(gg_ivarpro(data.frame(a = 1:3)), + "expected a 'varpro' object") + expect_error(gg_ivarpro(1:3), "got an object of class integer") +}) + # ---- Shape ---------------------------------------------------------------- test_that("gg_ivarpro regression returns long-format tidy frame", { diff --git a/tests/testthat/test_gg_partial_rfsrc.R b/tests/testthat/test_gg_partial_rfsrc.R new file mode 100644 index 00000000..a6ef7833 --- /dev/null +++ b/tests/testthat/test_gg_partial_rfsrc.R @@ -0,0 +1,55 @@ +test_that("gg_partial_rfsrc imposes factor levels correctly (not collapsed)", { + skip_if_not_installed("randomForestSRC") + + # A factor with a strong, monotone per-level effect. partial.rfsrc() imposes + # a factor level by its integer code; passing the *labels* collapses every + # level to one value (the bug this test guards). Ground truth: A < B < C. + set.seed(101) + n <- 200 + grp <- factor(sample(c("A", "B", "C"), n, TRUE)) + y <- c(A = 0, B = 5, C = 10)[as.character(grp)] + stats::rnorm(n, 0, 0.5) + d <- data.frame(y = y, grp = grp, noise = stats::rnorm(n)) + rf <- randomForestSRC::rfsrc(y ~ ., data = d, ntree = 200) + + g <- gg_partial_rfsrc(rf, xvar.names = "grp") + means <- tapply(g$categorical$yhat, g$categorical$x, mean) + + # All three levels are present and labelled (not integer codes). + expect_setequal(names(means), c("A", "B", "C")) + + # The levels must NOT collapse to a single value. + expect_gt(diff(range(means)), 3) + + # The A < B < C ordering of the imposed effect is recovered. + expect_lt(means[["A"]], means[["B"]]) + expect_lt(means[["B"]], means[["C"]]) +}) + +test_that("gg_partial_rfsrc factor partial dependence matches ground truth", { + skip_if_not_installed("randomForestSRC") + + set.seed(202) + n <- 200 + grp <- factor(sample(c("lo", "mid", "hi"), n, TRUE)) + y <- c(lo = 1, mid = 4, hi = 7)[as.character(grp)] + stats::rnorm(n, 0, 0.4) + d <- data.frame(y = y, grp = grp, noise = stats::rnorm(n)) + rf <- randomForestSRC::rfsrc(y ~ ., data = d, ntree = 300) + + lv <- levels(grp) + truth <- vapply(lv, function(lev) { + dd <- d + dd$grp <- factor(lev, levels = lv) + mean(predict(rf, newdata = dd)$predicted) + }, numeric(1)) + + g <- gg_partial_rfsrc(rf, xvar.names = "grp") + means <- tapply(g$categorical$yhat, g$categorical$x, mean)[lv] + + # gg_partial_rfsrc should track the ground-truth partial dependence. + expect_equal(as.numeric(means), as.numeric(truth), tolerance = 0.15) + + # x is returned as a factor in the model's level order (lo, mid, hi), not + # alphabetical (hi, lo, mid), so downstream factor(x) does not re-sort it. + expect_s3_class(g$categorical$x, "factor") + expect_identical(levels(g$categorical$x), lv) +}) diff --git a/tests/testthat/test_gg_partial_varpro.R b/tests/testthat/test_gg_partial_varpro.R index 1d097aab..38376090 100644 --- a/tests/testthat/test_gg_partial_varpro.R +++ b/tests/testthat/test_gg_partial_varpro.R @@ -41,11 +41,28 @@ test_that("gg_partial_varpro: scale='chf' without object → stop", { ) }) -test_that("gg_partial_varpro: scale='rmst' without time → stop", { - expect_error( - gg_partial_varpro(part_dta = make_mock_vpro_data(), scale = "rmst"), - regexp = "requires 'time'" - ) +test_that("gg_partial_varpro: scale='rmst' part_dta-only no longer needs time", { + # 3.3.0: tau defaults when recomputing from object; with part_dta only there + # is nothing to recompute, so this is a label-only call and must not error. + expect_no_error(suppressWarnings( + gg_partial_varpro(part_dta = make_mock_vpro_data(), scale = "rmst"))) +}) + +test_that("gg_partial_varpro: precomputed part_dta skips the default-tau path", { + # With part_dta supplied the call is label-only: the default-tau logic must + # NOT run (no 'default horizon' message, and no call to .default_surv_tau, + # which would error on a non-survival rf). + fake <- structure(list(family = "surv", x = data.frame(a = 1), + xvar.names = "a", max.tree = 1L, + rf = list(time.interest = c(1, 2, 3))), + class = "varpro") + expect_no_error(suppressWarnings( + gg_partial_varpro(part_dta = make_mock_vpro_data(), object = fake, + scale = "rmst"))) + expect_message( + suppressWarnings(gg_partial_varpro(part_dta = make_mock_vpro_data(), + object = fake, scale = "rmst")), + NA) # no 'using default horizon' message }) ## ── Class & structure ──────────────────────────────────────────────────────── @@ -240,14 +257,86 @@ test_that(".rmst_from_survival errors on a time-grid / column mismatch", { }) test_that(".resolve_varpro_scale maps 'auto' by family, passes others through", { - expect_equal(ggRandomForests:::.resolve_varpro_scale("auto", "surv"), - "mortality") + # 3.3.0: bounded/interpretable defaults + expect_equal(ggRandomForests:::.resolve_varpro_scale("auto", "surv"), "surv") + expect_equal(ggRandomForests:::.resolve_varpro_scale("auto", "class"), "prob") expect_equal(ggRandomForests:::.resolve_varpro_scale("auto", "regr"), "generic") expect_equal(ggRandomForests:::.resolve_varpro_scale("auto", NA_character_), "generic") - expect_equal(ggRandomForests:::.resolve_varpro_scale("rmst", "surv"), - "rmst") # explicit scale returned unchanged + # explicit scales returned unchanged + expect_equal(ggRandomForests:::.resolve_varpro_scale("rmst", "surv"), "rmst") + expect_equal(ggRandomForests:::.resolve_varpro_scale("odds", "class"), "odds") +}) + +## ── v3.3.0 scale transform + bounded predicate ─────────────────────────────── +test_that(".scale_transform applies prob/odds/identity", { + z <- matrix(c(0, log(3)), nrow = 1) # log-odds: 0->.5, log(3)->.75 + expect_equal(ggRandomForests:::.scale_transform(z, "prob"), + matrix(c(0.5, 0.75), nrow = 1)) + expect_equal(ggRandomForests:::.scale_transform(z, "odds"), + matrix(c(1, 3), nrow = 1)) + for (s in c("logodds", "generic", "surv", "rmst", "mortality")) + expect_equal(ggRandomForests:::.scale_transform(z, s), z) +}) + +test_that(".is_bounded_scale flags prob/odds/surv only", { + for (s in c("prob", "odds", "surv")) + expect_true(ggRandomForests:::.is_bounded_scale(s)) + for (s in c("logodds", "generic", "mortality", "rmst", "chf")) + expect_false(ggRandomForests:::.is_bounded_scale(s)) +}) + +## ── v3.3.0 conversion in the extractor (mean of probabilities) ─────────────── +test_that("gg_partial_varpro: scale='prob' is mean of plogis, causal NA", { + d <- make_mock_vpro_data() + res <- gg_partial_varpro(d, scale = "prob") + age <- res$continuous[res$continuous$name == "age", ] + expected <- colMeans(stats::plogis(d$age$yhat.par), na.rm = TRUE) + expect_equal(age$parametric, expected) + expect_true(all(age$parametric >= 0 & age$parametric <= 1)) + expect_true(all(is.na(res$continuous$causal))) +}) + +test_that("gg_partial_varpro: scale='logodds' keeps raw values + causal", { + d <- make_mock_vpro_data() + res <- gg_partial_varpro(d, scale = "logodds") + age <- res$continuous[res$continuous$name == "age", ] + expect_equal(age$parametric, colMeans(d$age$yhat.par, na.rm = TRUE)) + expect_false(all(is.na(res$continuous$causal))) +}) + +## ── v3.3.0 survival probability at tau (pure) + default-tau fallback ────────── +test_that(".surv_at_tau pulls S(tau) snapped to nearest event time", { + surv <- matrix(c(0.9, 0.6, 0.3), nrow = 1) + times <- c(1, 2, 3) + expect_equal(ggRandomForests:::.surv_at_tau(surv, times, 2), 0.6) + expect_equal(ggRandomForests:::.surv_at_tau(surv, times, 2.1), 0.6) + expect_equal(ggRandomForests:::.surv_at_tau(surv, times, 3), 0.3) +}) + +test_that(".default_surv_tau falls back to median(time.interest)", { + fake <- list(rf = list(time.interest = c(2, 4, 6, 8))) + expect_equal(ggRandomForests:::.default_surv_tau(fake), 5) +}) + +## ── v3.3.0 y-axis labels ───────────────────────────────────────────────────── +test_that(".partial_varpro_ylabel: prob/odds/logodds/surv labels", { + lab <- ggRandomForests:::.partial_varpro_ylabel + expect_match(lab(list(scale = "prob", target = "yes")), "P\\(Y = yes\\)") + expect_match(lab(list(scale = "odds", target = "yes")), "Odds\\(Y = yes\\)") + expect_match(lab(list(scale = "logodds", target = "yes")), "Log-odds") + expect_equal(lab(list(scale = "prob", target = NA)), "Probability") + expect_match(lab(list(scale = "surv", rmst_tau = 1000)), + "Survival probability at t = 1000") +}) + +## ── v3.3.0 plot: causal hidden on bounded scales ───────────────────────────── +test_that("plot.gg_partial_varpro: bounded scale drops causal, warns if asked", { + d <- make_mock_vpro_data() + res <- gg_partial_varpro(d, nvars = 1, scale = "prob") + expect_s3_class(plot(res), "ggplot") + expect_warning(plot(res, type = "causal"), regexp = "causal") }) test_that(".rmst_from_survival is vectorised over rows and bounded by tau", { @@ -392,12 +481,73 @@ test_that(".rmst_learner returns RMST(tau) for OOB and newdata calls", { expect_true(all(oob >= 0 & oob <= tau + 1e-6)) }) -## ── C-path (scale = surv / chf) ────────────────────────────────────────────── +## ── v3.3.0 survival S(t) learner + default tau (rfsrc fit; no varpro grow) ──── +test_that(".surv_learner returns S(tau) in [0,1] for OOB and newdata", { + skip_if_not_installed("randomForestSRC") + m <- make_mock_cpath() + tau <- stats::median(m$rf$time.interest) + learner <- ggRandomForests:::.surv_learner(list(rf = m$rf), tau) + oob <- learner() + nd <- learner(survival::veteran[1:5, ]) + expect_length(oob, nrow(m$rf$xvar)) + expect_length(nd, 5L) + expect_true(all(oob >= 0 & oob <= 1)) + expect_true(all(nd >= 0 & nd <= 1)) +}) + +test_that(".default_surv_tau is the median observed survival time", { + skip_if_not_installed("randomForestSRC") + m <- make_mock_cpath() + tau <- ggRandomForests:::.default_surv_tau(list(rf = m$rf)) + expect_equal(tau, stats::median(survival::veteran$time)) +}) + +## ── v3.3.0 survival routing + classification provenance (real varpro fits) ─── +test_that("gg_partial_varpro: scale='surv' via learner, in [0,1], default tau", { + skip_on_cran() + skip_if_not_installed("randomForestSRC") + skip_if_not_installed("varPro") + set.seed(13) + pbc <- get(utils::data("pbc", package = "randomForestSRC", + envir = environment())) + pbc <- pbc[stats::complete.cases(pbc), ] + vp <- varPro::varpro(Surv(days, status) ~ ., pbc, ntree = 60, nvar = 2) + + r <- gg_partial_varpro(object = vp, scale = "surv", time = 1000, nvars = 1) + expect_equal(attr(r, "provenance")$scale, "surv") + expect_equal(attr(r, "provenance")$path, "A") + expect_true(all(r$continuous$parametric >= 0 & + r$continuous$parametric <= 1)) + expect_true(all(is.na(r$continuous$causal))) + + rd <- suppressMessages(gg_partial_varpro(object = vp, scale = "surv", + nvars = 1)) + expect_equal(attr(rd, "provenance")$rmst_tau, + ggRandomForests:::.default_surv_tau(vp)) + + ra <- suppressMessages(gg_partial_varpro(object = vp, nvars = 1)) + expect_equal(attr(ra, "provenance")$scale, "surv") +}) + +test_that("gg_partial_varpro: classification provenance records target class", { + skip_on_cran() + skip_if_not_installed("varPro") + set.seed(5) + dat <- data.frame(y = factor(rep(c("a", "b"), 60)), + x1 = rnorm(120), x2 = rnorm(120)) + vp <- varPro::varpro(y ~ ., dat, ntree = 40, nvar = 2) + r <- suppressMessages( + gg_partial_varpro(object = vp, scale = "prob", nvars = 1)) + expect_equal(attr(r, "provenance")$scale, "prob") + expect_equal(attr(r, "provenance")$target, "b") +}) + +## ── C-path (scale = chf; surv now uses the path-A learner) ─────────────────── test_that("gg_partial_varpro: C-path returns gg_partial_varpro class", { skip_if_not_installed("randomForestSRC") m <- make_mock_cpath() result <- suppressWarnings( - gg_partial_varpro(object = m$vp, scale = "surv", + gg_partial_varpro(object = m$vp, scale = "chf", time = median(m$rf$time.interest)) ) expect_s3_class(result, "gg_partial_varpro") @@ -407,18 +557,18 @@ test_that("gg_partial_varpro: C-path provenance path='C'", { skip_if_not_installed("randomForestSRC") m <- make_mock_cpath() result <- suppressWarnings( - gg_partial_varpro(object = m$vp, scale = "surv", + gg_partial_varpro(object = m$vp, scale = "chf", time = median(m$rf$time.interest)) ) expect_equal(attr(result, "provenance")$path, "C") - expect_equal(attr(result, "provenance")$scale, "surv") + expect_equal(attr(result, "provenance")$scale, "chf") }) test_that("gg_partial_varpro: plot C-path returns ggplot", { skip_if_not_installed("randomForestSRC") m <- make_mock_cpath(xvar_subset = 1L) result <- suppressWarnings( - gg_partial_varpro(object = m$vp, scale = "surv", + gg_partial_varpro(object = m$vp, scale = "chf", time = median(m$rf$time.interest)) ) gg <- plot(result) @@ -431,7 +581,7 @@ test_that("gg_partial_varpro: C-path 'model' label is attached to the frames", { # frames populate, so the model column is attached to each. m <- make_mock_cpath() result <- suppressWarnings( - gg_partial_varpro(object = m$vp, scale = "surv", + gg_partial_varpro(object = m$vp, scale = "chf", time = median(m$rf$time.interest), model = "forestC") ) got_cont <- is.data.frame(result$continuous) && diff --git a/vignettes/ggRandomForests-regression.qmd b/vignettes/ggRandomForests-regression.qmd index 6b8215c3..b88f39ad 100644 --- a/vignettes/ggRandomForests-regression.qmd +++ b/vignettes/ggRandomForests-regression.qmd @@ -41,7 +41,7 @@ Random forests [@Breiman:2001] are a non-parametric ensemble method that requires no distributional or functional assumptions on how covariates relate to the response. The method builds a large collection of de-correlated decision trees via bootstrap aggregation (bagging) and random feature selection, then -averages their predictions to produce a stable, low-variance estimator. The +averages their predictions to smooth out the noise any single tree carries. The **randomForestSRC** package [@Ishwaran:RFSRC:2014] provides a unified implementation for survival, regression, and classification forests. diff --git a/vignettes/ggRandomForests-survival.qmd b/vignettes/ggRandomForests-survival.qmd index 446ebfc0..aa3a110a 100644 --- a/vignettes/ggRandomForests-survival.qmd +++ b/vignettes/ggRandomForests-survival.qmd @@ -50,17 +50,16 @@ bounded between 0 and 1. Second, the cumulative hazard function (CHF): $\hat{H}(t) = -\log \hat{S}(t)$, an unbounded, monotone-increasing summary of accumulated risk. Third, mortality: the expected number of events a subject would accumulate if followed indefinitely under their covariate profile, -computed by summing the CHF over the observed time grid. Mortality is an -unbounded relative-risk score, not a probability, and works well as a single -scalar for ranking patients by risk. +computed by summing the CHF over the observed time grid. Mortality is not a +probability; it is an unbounded relative-risk score. Read it as a single number +for ranking patients from low to high risk, not as a survival chance. The **[randomForestSRC](https://www.randomforestsrc.org)** package [@Ishwaran:RFSRC:2014] provides a unified implementation for survival, regression, and classification forests. **ggRandomForests** extracts tidy data objects from `rfsrc` fits and renders -them with **ggplot2** [@Wickham:2009], making it straightforward to explore how -a forest is constructed, which variables matter, and how the response depends on -individual predictors. +them with **ggplot2** [@Wickham:2009], so you can see how a survival forest is +built, which predictors carry the risk, and how survival shifts across each one. This vignette demonstrates a complete random survival forest workflow on the primary biliary cirrhosis (PBC) data set [@fleming:1991]: diff --git a/vignettes/precompute_varpro.R b/vignettes/precompute_varpro.R index 7bf60d1e..6bb56503 100644 --- a/vignettes/precompute_varpro.R +++ b/vignettes/precompute_varpro.R @@ -48,8 +48,6 @@ pd_boston <- gg_partial_varpro(object = v_boston) b_boston <- varPro::beta.varpro(v_boston) iv_boston <- varPro::ivarpro(v_boston) set.seed(20260527L) -u_boston <- varPro::uvarpro(boston_x, ntree = 50) -set.seed(20260527L) iso_boston <- varPro::isopro(data = boston_x, method = "rnd", sampsize = 256, ntree = 50) @@ -83,9 +81,9 @@ iso_pbc <- varPro::isopro(data = pbc_small[, c("age", "albumin", "bili", # vignette never invokes those on a stripped object (pd_pbc is cached and # gg_isopro() is training-path only). Dropping those heavy # slots takes the file from ~1.6 MB to ~0.4 MB (validated: every vignette -# wrapper call returns output identical to the un-stripped object). Two -# exceptions keep their forest: v_boston is printed in the vignette -# (print.varpro reads $rf), and u_boston feeds gg_udependent(), which uses it. +# wrapper call returns output identical to the un-stripped object). One +# exception keeps its forest: v_boston is printed in the vignette +# (print.varpro reads $rf). .strip_varpro <- function(v) { # $rf: unused on the cached path v$rf <- NULL v @@ -113,7 +111,6 @@ saveRDS( pd_boston = pd_boston, b_boston = b_boston, iv_boston = iv_boston, - u_boston = u_boston, iso_boston = iso_boston, # iris (class) v_iris_binary = v_iris_binary, diff --git a/vignettes/uvarpro.qmd b/vignettes/uvarpro.qmd new file mode 100644 index 00000000..5936faf9 --- /dev/null +++ b/vignettes/uvarpro.qmd @@ -0,0 +1,159 @@ +--- +title: "Variable selection without an outcome: unsupervised varPro" +author: "John Ehrlinger" +date: today +format: + html: + fig-format: png + fig-dpi: 96 + toc: true + toc-depth: 2 + html-math-method: mathjax + code-fold: false +bibliography: ggRandomForests.bib +vignette: > + %\VignetteIndexEntry{Variable selection without an outcome: unsupervised varPro} + %\VignetteEngine{quarto::html} + %\VignetteEncoding{UTF-8} +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set( + message = FALSE, + warning = FALSE, + fig.width = 7, + fig.height = 4.5, + cache = TRUE, + cache.path = "uvarpro_cache/" +) +options(mc.cores = 1, rf.cores = 1) +``` + +```{r libs} +library(ggplot2) + +# Match the pattern used by the other vignettes: try the installed +# package first, fall back to pkgload::load_all() for the R CMD check +# vignette rebuild where the package isn't yet on .libPaths(). All varPro +# calls below are ::-qualified, so no library(varPro) is needed. +if (requireNamespace("ggRandomForests", quietly = TRUE)) { + library(ggRandomForests) +} else if (requireNamespace("pkgload", quietly = TRUE)) { + pkgload::load_all(export_all = FALSE, helpers = FALSE, + attach_testthat = FALSE) +} else { + stop("Install ggRandomForests (or pkgload for dev builds) to render this vignette.") +} +``` + +# Importance without a response + +Most measures of variable importance start from a question: which predictors +help explain *this* outcome? Permutation VIMP, the varPro release rules, the +per-rule lasso weights behind `gg_beta_varpro()` — all of them score a variable +by how much it moves a response `y`. The companion +[varPro vignette](varpro.html) walks that supervised toolkit end to end. + +But sometimes there is no `y`. You have a matrix of predictors and you want to +know which columns carry the structure of the data: which variables define its +shape, which travel together, and which are close to noise. That is the job of +unsupervised varPro. `varPro::uvarpro()` grows a forest on the predictor matrix +alone, with no response, and scores each variable by how much entropy it +contributes to the regions the forest carves out [@Lu2024varpro]. "Important" +here +does not mean useful for prediction. It means a variable helps reconstruct the +feature space that the others cannot. + +This is a short vignette. It walks the three `gg_*` views of a single +`uvarpro()` fit, and each one answers a different question about that fit: what +depends on what, what ranks highest, and where to draw the line between signal +and noise. + +# One fit, three views + +We'll use `mtcars`. Its columns are all numeric, and several of them measure +closely related things — displacement, horsepower, weight, and cylinder count +all track engine size — so the unsupervised structure is easy to read. + +```{r fit} +set.seed(1) +u <- varPro::uvarpro(mtcars, ntree = 50) +``` + +Notice what `uvarpro()` was handed: the data frame, and nothing else. No +formula, no outcome column singled out. All three views below read off this one +fit. Two of them rest on the same `get.beta.entropy()` matrix, which is the only +part worth caching, so we compute it once and pass it to both rather than paying +for it twice: + +```{r beta_fit} +beta_fit <- varPro::get.beta.entropy(u) +``` + +## What depends on what: `gg_udependent()` + +`gg_udependent()` reads cross-variable structure off the fit and draws it as a +network: nodes are variables, edges are dependencies above a configurable +threshold. The picture is built with `ggraph`, which lives in `Suggests` rather +than `Imports`, so install it if you want this view. + +```{r udep, eval = requireNamespace("ggraph", quietly = TRUE)} +plot(gg_udependent(u)) +``` + +Clusters of mutually connected variables are worth a second look. Because the +fit has no response, a cluster tells you these columns are correlated in feature +space regardless of whether any of them would matter for a prediction. When you +do have a downstream model, that is useful on its own: a variable can be +important for prediction and still sit in a tight cluster with a near-duplicate, +and in that case parsimony may favour dropping one member of the cluster without +losing much. + +## What ranks highest: `gg_beta_uvarpro()` + +The network shows *structure*; `gg_beta_uvarpro()` turns the same fit into a +*ranking*. It is the unsupervised analogue of `gg_beta_varpro()`: from the +entropy matrix it aggregates the per-region lasso coefficients into a mean +absolute weight per variable (`beta_mean`), orders them most-important first, +and flags the ones above a selection cutoff. Here we hand it the `beta_fit` we +already computed. + +```{r beta} +plot(gg_beta_uvarpro(u, beta_fit = beta_fit)) +``` + +Read it as the unsupervised counterpart of a VIMP bar chart. The tall bars are +the variables that most define the structure of the predictor space. Paired with +the network above, you get both halves of the story: *which* variables carry the +most unsupervised signal, and *how* they group. + +## Where to draw the line: `gg_sdependent()` + +A ranking invites the obvious next question: where is the cut? Which variables +are signal, and which are noise? `gg_sdependent()` answers that narrower +question off the same fit. It wraps `varPro::sdependent()` and returns one row +per candidate variable — an importance score, its degree in the dependency +graph, and a `signal` flag — drawn as a ranked lollipop. + +```{r sdep} +plot(gg_sdependent(u, beta_fit = beta_fit)) +``` + +Where `gg_beta_uvarpro()` ranks *all* the variables, `gg_sdependent()` makes the +cut explicit, separating the columns the unsupervised analysis treats as signal +from the ones it treats as noise. + +# Reading the three together + +The three views are one workflow, not three unrelated plots. Start with +`gg_udependent()` to see the structure, rank it with `gg_beta_uvarpro()`, then +let `gg_sdependent()` draw the signal-versus-noise line. They all derive from the +one `uvarpro()` fit, and two of them from the one `get.beta.entropy()` matrix, so +computing that matrix once (as we did above) and passing it in keeps the whole +sequence cheap. + +For the supervised side of varPro — VIMP, partial dependence, per-rule lasso +refinement, local importance, and anomaly scoring — see the companion +[varPro vignette](varpro.html). + +# References diff --git a/vignettes/varpro.qmd b/vignettes/varpro.qmd index 0bd7bb44..b53724d6 100644 --- a/vignettes/varpro.qmd +++ b/vignettes/varpro.qmd @@ -122,16 +122,18 @@ is large; the conditional view catches that. One-hot consolidation via `get.orgvimp()` and `get.topvars()` maps the per-level scores back to the original factor, so downstream reporting stays on familiar ground. -That core machinery feeds six ggRandomForests wrappers. **`gg_varpro()`** -summarises the per-tree importance distribution. **`gg_beta_varpro()`** -refines those release-rule contrasts with a per-rule lasso. **`gg_partial_varpro()`** -turns the release machinery into partial-dependence curves. -**`gg_udependent()`** reads cross-variable dependency off a `uvarpro()` -fit. **`gg_isopro()`** scores observations for anomaly using an -isolation-forest variant. **`gg_ivarpro()`** computes per-observation -local importance. - -This vignette walks all six wrappers on three worked examples: a +That core machinery feeds five supervised ggRandomForests wrappers. +**`gg_varpro()`** summarises the per-tree importance distribution. +**`gg_beta_varpro()`** refines those release-rule contrasts with a per-rule +lasso. **`gg_partial_varpro()`** turns the release machinery into +partial-dependence curves. **`gg_isopro()`** scores observations for anomaly +using an isolation-forest variant. **`gg_ivarpro()`** computes per-observation +local importance. A separate set of *unsupervised* wrappers — +`gg_udependent()`, `gg_beta_uvarpro()`, and `gg_sdependent()` — reads structure +off a `uvarpro()` fit that has no response at all; those get their own +walk-through in the companion [uvarpro vignette](uvarpro.html). + +This vignette walks the five supervised wrappers on three worked examples: a regression problem (Boston housing), a classification problem (iris, binary and multi-class), and a survival problem (PBC). The closing section is a one-page reference matrix mapping each wrapper to the @@ -167,8 +169,10 @@ optimise toward. The first question varPro answers is the same one permutation importance answers: which variables matter, ranked. `gg_varpro()` extracts the per-tree importance scores and draws their distribution as a horizontal -boxplot, one row per variable, sorted by median z-score. Variables above -a configurable cutoff (default 0.79) are highlighted. +boxplot, one row per variable, sorted by median z-score, with the +variables above a configurable cutoff (default 0.79) highlighted. Read the +spread, not just the position: a tight box means every tree agrees on that +variable. ```{r boston-gg-varpro} plot(gg_varpro(v_boston)) @@ -260,37 +264,15 @@ signal: a variable that ranks high here but low there is one whose local linear effect inside many rules is real even when the release contrast is modest. -## Cross-variable dependency with `gg_udependent()` +## Unsupervised views: see the uvarpro vignette -The three views so far take one variable at a time. `gg_udependent()` -reads cross-variable structure off a `uvarpro()` fit and draws the -result as a network: nodes are variables, edges are dependencies above a -configurable threshold. The visual is built with `ggraph`, which is in -`Suggests` rather than `Imports`; install it if you want this view. - -```{r boston-uvarpro, cache=TRUE} -# Precomputed offline (see precompute_varpro.R); falls back to a live fit. -u_boston <- if (is.null(.vp$u_boston)) { - varPro::uvarpro(Boston[, setdiff(names(Boston), "medv")], ntree = 50) -} else { - .vp$u_boston -} -``` - -```{r boston-gg-udependent, eval = requireNamespace("ggraph", quietly = TRUE)} -plot(gg_udependent(u_boston)) -``` - -Clusters of mutually-connected variables are worth checking for -redundancy: they may be several views of the same underlying quantity. -`uvarpro()` operates on the predictor matrix alone; there is no response -in the fit, which is what makes the network genuinely unsupervised. -Variables that cluster here are correlated in feature space regardless of -whether any of them matters for prediction. That information is -complementary to the ranking from `gg_varpro()`: a variable can be -important for prediction and still sit in a tight cluster with a -near-duplicate; in that situation, model parsimony may favour dropping -one of the cluster members without losing much predictive accuracy. +The wrappers so far all score variables against a response. varPro also has an +unsupervised mode: `varPro::uvarpro()` grows a forest on the predictor matrix +alone, and three more wrappers read off that fit — `gg_udependent()` draws the +cross-variable dependency network, `gg_beta_uvarpro()` ranks the variables by +their entropy contribution, and `gg_sdependent()` marks the cut between signal +and noise. Because there is no outcome in play, they get their own short +walk-through in the companion [uvarpro vignette](uvarpro.html). ## Anomaly scoring with `gg_isopro()` @@ -432,10 +414,19 @@ plot(gg_varpro(v_iris_multi, conditional = TRUE)) ## Partial dependence: `gg_partial_varpro()` on classification -`gg_partial_varpro()` works the same way on a classification fit, but -each predictor's curve becomes per-class probabilities. The plot -panels are organised by predictor with class encoded as colour or -facet (the wrapper defaults handle this). +On a classification fit `gg_partial_varpro()` defaults to the +**probability scale** (`scale = "auto"` resolves to `"prob"`): each +predictor's curve is the predicted probability of the *target class* — +by default the last factor level (here *virginica*), selectable with +`target =`. `varPro::partialpro()` works internally on the log-odds of +that class; the wrapper back-transforms each observation to a +probability *before* averaging, so the curve is the mean predicted +probability (not the probability of the mean log-odds). On this bounded +$[0, 1]$ scale only the parametric and non-parametric curves are shown — +the `causal` contrast is a log odds-ratio, not a level, so it cannot +share the probability axis (use `scale = "logodds"` to see it). +`scale = "odds"` and `"logodds"` give the same relationship on the odds +and log-odds scales. ```{r iris-multi-gg-partial-varpro} # Precomputed offline (see precompute_varpro.R); falls back to a live fit. @@ -448,8 +439,8 @@ plot(gg_pd_iris) ``` Read each panel as: "as predictor X changes from low to high, how does -the probability of each class shift?" Patterns that match the -underlying biology (e.g. petal length separating setosa from the +the probability of the target class shift?" Patterns that match the +underlying biology (e.g. petal length separating *virginica* from the others) act as a sanity check on the forest. ## Per-class lasso refinement with `gg_beta_varpro()` @@ -508,9 +499,10 @@ censored response requires a local partial-likelihood or Nelson-Aalen estimator in place of the regression/classification local estimator; the design work for that is tracked but not yet landed. -What does work is the core release-rule importance, the C-path partial -dependence (routed through the embedded `$rf` random survival forest), -and anomaly scoring on the predictor matrix. For many applied problems, +What does work is the core release-rule importance, partial dependence +(survival probability $S(\tau)$ and RMST through the release-rule engine, +plus cumulative hazard via the embedded `$rf` survival forest), and +anomaly scoring on the predictor matrix. For many applied problems, those three views cover the questions you actually want to answer. The PBC (primary biliary cirrhosis) dataset from `randomForestSRC` has @@ -542,18 +534,30 @@ v_pbc <- if (is.null(.vp$v_pbc)) { plot(gg_varpro(v_pbc)) ``` -## Partial dependence: `gg_partial_varpro()` (C-path) - -For survival, `gg_partial_varpro()` dispatches into the -`randomForestSRC`-backed partial-dependence machinery via the underlying -`$rf` survival forest (what the package calls the *C-path* because it -routes through the `randomForestSRC` C-code layer rather than the varPro -release-rule engine). The output shape is the same as the regression case -(parametric / non-parametric / causal panels); the y-axis carries an -"ensemble mortality" interpretation [@Ishwaran:2007a], not a survival -probability. Higher values mean higher predicted mortality at a given -predictor value, integrated over the event-time horizon in the training -data. +## Partial dependence: `gg_partial_varpro()` on survival + +For survival `gg_partial_varpro()` defaults to **survival probability** +(`scale = "auto"` resolves to `"surv"`): each curve is +$S(\tau \mid x)$, the predicted probability of surviving past a horizon +$\tau$, computed through `partialpro()` on the same release-rule (UVT) +engine as the regression and classification fits — bounded in $[0, 1]$ +and read in the model's own time units. + +When you do not supply `time`, $\tau$ defaults to the **median follow-up +time** of the fit. Because that horizon is derived from the data it is +always in the model's units and cannot be mis-specified the way a +hand-typed number can — a units mismatch (days vs. years) is the classic +survival partial-plot trap, and a data-driven default sidesteps it. The +resolved $\tau$ is shown in the axis label and reported with a message; +pass `time = tau` to choose another horizon. As on the classification +probability scale, the `causal` contrast is hidden here (use +`scale = "rmst"` or `"mortality"` to see it). + +Other survival scales are explicit opt-ins: `scale = "rmst"` gives +restricted mean survival time RMST$(\tau)$ (also on the release-rule +engine, with the same median-follow-up default $\tau$), and +`scale = "mortality"` keeps the unbounded ensemble-mortality score +[@Ishwaran:2007a] — a relative-risk index, *not* a survival probability. ```{r pbc-gg-partial-varpro} # Precomputed offline (see precompute_varpro.R); falls back to a live fit. @@ -590,9 +594,9 @@ story. Both are tracked for v3.1.0. If you call either on a survival fit you'll get a clear error message pointing at the deferred work, not a silent miscalculation. The -family-support matrix in §6 records this; the rest of the toolkit -that *does* work on survival (`gg_varpro`, `gg_partial_varpro`, -`gg_isopro` above; `gg_udependent` shown in §3 on the X-matrix). +family-support matrix in the closing reference section records this; the +rest of the toolkit that *does* work on survival (`gg_varpro`, +`gg_partial_varpro`, and `gg_isopro` above). # Cross-cutting reference @@ -600,9 +604,8 @@ that *does* work on survival (`gg_varpro`, `gg_partial_varpro`, | Wrapper | regr | class | surv | regr+ | |---|---|---|---|---| -| `gg_partial_varpro` | ✓ | ✓ | ✓ (C-path via `$rf`) | ✗ (not audited) | +| `gg_partial_varpro` | ✓ | ✓ (prob) | ✓ (S(τ); rmst/mortality/chf) | ✗ (not audited) | | `gg_varpro` | ✓ | ✓ (`conditional = TRUE`) | ✓ | ✗ (errors) | -| `gg_udependent` | ✓ (uvarpro on X) | ✓ (X) | ✓ (X) | ✓ (X) | | `gg_isopro` | ✓ (X) | ✓ (X) | ✓ (X) | ✓ (X) | | `gg_beta_varpro` | ✓ | ✓ | ✗ (upstream stop) | ✗ (deferred) | | `gg_ivarpro` | ✓ | ✓ | ✗ (deferred) | ✗ (deferred) | @@ -652,16 +655,17 @@ The release-rule framework is laid out in @Lu2024varpro. Worked implementation examples and the full API are at the [varPro tools reference site](https://www.varprotools.org/articles/getstarted.html). -The ensemble mortality scale used by survival partial dependence is -introduced in @Ishwaran:2007a (the Annals of Applied Statistics methods -paper for random survival forests); the R-package side is described in -@Ishwaran:2008. The boosted nonparametric hazard framework that informs +The ensemble-mortality scale (`scale = "mortality"`), one of the +survival partial-dependence options alongside the default survival +probability, is introduced in @Ishwaran:2007a (the Annals of Applied +Statistics methods paper for random survival forests); the R-package +side is described in @Ishwaran:2008. The boosted nonparametric hazard framework that informs varPro's survival path (including the treatment of time-dependent covariates) is in @Lee:2021. Each wrapper's help page carries a "What this is doing" section that -goes one level deeper than this vignette. The cross-cutting reference -in §6 maps each wrapper to the forest families it supports and notes -which capabilities are deferred to v3.1.0. +goes one level deeper than this vignette. The cross-cutting reference at +the end of this vignette maps each wrapper to the forest families it +supports and notes which capabilities are deferred. # References diff --git a/vignettes/varpro_precomputed.rds b/vignettes/varpro_precomputed.rds index 82eea83b..f80d7ffa 100644 Binary files a/vignettes/varpro_precomputed.rds and b/vignettes/varpro_precomputed.rds differ