
Commit 60d97cf

ehrlinger and claude authored
chore: forward-merge main (v3.1.0) into dev + reconcile to 3.1.0.9000 (#117)
* docs: v3.1.0 documentation sweep + gg_vimp fix (CRAN release) (#109) * docs: v3.1.0 documentation-sweep design spec * docs: v3.1.0 spec — fix bugs surfaced by canonical-source reconciliation (with severity triage) * docs: v3.1.0 doc-sweep implementation plan * docs: deepen varPro-family roxygen (release-rules framing, vimp-vs-varpro) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * docs: add the gg_vimp-vs-gg_varpro distinction to gg_vimp (Task 2 fix) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * docs: deepen rfsrc partial/survival/rfsrc roxygen (ensemble + partial-dependence framing) * docs: address Task 2 review (drop invented first-person in gg_survival; non-positive VIMP wording) * docs: voice/drift cleanup on remaining roxygen topics Remove stale yvar @return item from gg_roc — the function returns sens/spec/pct (from calc_roc), not a yvar column per observation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * docs(vignette): deepen varpro — release-rules framing, refs Deepens all prose sections of varpro.qmd with release-rules/guided-splitting framing; adds Lee:2021 bib key (AOS 49:4) cited in PBC section; adds varProtools URL in Further Reading. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * docs(vignette): regression — vimp-vs-varpro contrast, rfsrc ref Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * docs(vignette): survival — rfsrc ensemble framing, ref Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * docs(vignette): intro — voice/drift pass * docs(comments): correctness + gap pass on R/ source Fix a misleading AUC trapezoidal-rule comment in calc_roc.R (the old text introduced Δ(FPR) but the code uses Δspec; reworded to state the equivalence plainly). Remove a stale varPro-specific note from the categorical branch of gg_partial.R (plot.variable output has no connection to varPro one-hot encoding). 
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * docs(NEWS): open v3.1.0 development heading (version unchanged) * docs(vignette): trim em-dashes in varpro per voice standard Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(PR#109 review): gg_vimp positive flag (VIMP vs vimp case), Brier 0.25 precision * chore(release): prepare v3.1.0 for CRAN Bump DESCRIPTION + NEWS to 3.1.0 (CRAN never saw v3.0.0; jump from 2.7.3 is intentional) and finalize the v3.1.0 NEWS heading. Trim em-dashes and right-arrows from roxygen and code comments per the package voice standard (68 replacements across R/), then re-document so man/*.Rd carries no raw non-ASCII into the PDF manual build. Rewrite cran-comments.md for the 2.7.3 -> 3.1.0 submission: fold in the v3.0.0 feature layer, correct the local test env (R 4.6.0/darwin23). R CMD check --as-cran (with manual build, ggraph present): 0/0/0. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * docs(cran-comments): correct v3.0.0 history v3.0.0 was submitted but did not complete the CRAN review cycle; 3.1.0 supersedes it. Prior wording said it was never submitted. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * docs(cran-comments): note v3.0.0 pretests were clean, hold was heuristic Per the release handoff: tell the CRAN reviewer the 2026-05-28 v3.0.0 submission cleared incoming pretests on Windows + Debian (0/0/0) and the auto-hold looked like a version-jump/Depends-to-Imports heuristic, not a defect, in case the same heuristic flags v3.1.0. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * docs(dev/plans): mark v3.0.0-held release mechanics as superseded The plan/design docs described the held workflow (keep Version 3.0.0, merge only after CRAN accepts v3.0.0, cut 3.1.0 at a post-acceptance RC). v3.0.0 lapsed un-reviewed, so we ship 3.1.0 directly. Banners note this; the documentation-content plan is unchanged. Addresses Copilot review. 
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(vignettes): static PD surfaces + 96-dpi figures to cut install size (#110) The regression and survival partial-dependence surfaces were interactive plotly widgets; self-contained quarto inlined plotly.js (~3.5 MB) into each vignette HTML, and figures rendered at retina 2x. Installed size was 17.1 MB (doc 16.3 MB), well over CRAN's 5 MB guideline. Replace both surfaces with static ggplot2 heat maps, set fig-format png / fig-dpi 96 in all four vignettes, and drop the now-unused plotly Suggests. Installed size drops to ~5.5 MB (doc 4.7 MB); source tarball 9.0 -> 3.7 MB. R CMD check --as-cran (with manual, ggraph present): pending confirmation. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> * docs(examples): \donttest the slow plot.gg_variable example sections (#113) win-builder R-oldrelease flagged the plot.gg_variable example at 10.33s elapsed (just over CRAN's 10s; under 10s on release/devel). Wrap the loess-heavy regression panel plot and the full survival section (veteran forest + multi-time variable/panel plots) in \donttest so they are excluded from the timed example run; the fast classification + basic regression plots still run. No behaviour change. R CMD check --as-cran: examples [14s] OK, examples --run-donttest [28s] OK, Status: OK (0/0/0). Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> * Cut CRAN overall check time below 10 min (#114) * perf(check): cut CRAN overall check time below 10 min CRAN flagged the 3.1.0 submission's overall check time (13 min > 10 min), driven by the vignette rebuild (331s) and tests (209s). Reduce both per Uwe Ligges' suggested levers (toy data / fewer iterations / precomputed results), with no change to test coverage or vignette content. 
Vignettes: - regression: Boston forest ntree 200, PD-surface grid 25 -> 10 - survival: impute ntree 100, forest ntree 150, PD-surface grid 25 -> 8 - varpro: the three gg_partial_varpro() calls (11-17s each) and the Boston beta.varpro() fit (~3s) -- the bulk of that vignette -- are precomputed offline by precompute_varpro.R and loaded from varpro_precomputed.rds (167 KB, xz), with an automatic live-computation fallback if absent. Tests: - test_gg_udependent memoised varPro::get.beta.entropy() (~1.5s, a pure function of the fit) per argument signature instead of recomputing it once per test (this file was ~24s of the suite, now ~9s). Verified: R CMD check --as-cran with manual is OK (0/0/0); local vignette rebuild 33s and tests 28s (were 331s/209s on CRAN's r-devel-windows). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * review: address Copilot feedback on the check-time PR - varpro.qmd: load the precompute via tryCatch(readRDS) so a missing OR unreadable .rds falls back to live computation instead of erroring. - precompute_varpro.R: mirror the vignette's requireNamespace/pkgload fallback instead of bare library(ggRandomForests), so the script runs in a fresh clone before the package is installed. - test_gg_udependent.R: make make_ggu() warning suppression opt-in (.quiet = FALSE by default); pass .quiet = TRUE only to the empty-graph (threshold = 999) cases that legitimately warn, so an unexpected warning on any other call still fails the test. Verified: test_gg_udependent 19 tests, 0 fail / 0 warn; varpro vignette renders in 20s with 0 errors (precompute still loaded). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * review(#114): search both paths for varpro_precomputed.rds The precomputed-load chunk read only the cwd-relative 'varpro_precomputed.rds'; depending on how Quarto sets the working directory during R CMD check this could miss the file and silently fall back to (slower) live computation. 
Search both the vignette-dir and package-root locations before the live fallback. Addresses Copilot review. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> * chore: open 3.1.0.9000 dev cycle after the CRAN release (#115) v3.1.0 accepted to CRAN (2026-06-11). Bump main to the post-release .9000 dev version (DESCRIPTION + NEWS, dual update so the news-version test sees the DESCRIPTION version), and record the release submission in CRAN-SUBMISSION (SHA a7d8052). Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> * review(#117): deterministic gg_vimp positive-flag test; MASS guard in precompute Copilot review on the forward-merge: - gg_vimp single-outcome regression test was non-deterministic (no seed) and only checked the invariant when a non-positive VIMP happened to exist. Add set.seed + a zero-variance `const` predictor (VIMP exactly 0 on every platform) and assert positive == (VIMP > 0) for all rows, plus any(!positive) to guarantee the bug condition is exercised. - vignettes/precompute_varpro.R now checks requireNamespace("MASS") with a clear message before loading the Boston data. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
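The precomputed-results fallback described in the #114 review notes could look roughly like the sketch below. It is not the vignette's actual chunk: `load_precomputed()`, its `candidates` argument, and `compute_live` are hypothetical names introduced for illustration, and only `file.exists()`, `readRDS()`, and `tryCatch()` are real base R calls.

```r
# Hypothetical helper (not from the package): try each candidate path in
# order and fall back to live computation if none can be read.
load_precomputed <- function(candidates, compute_live) {
  for (path in candidates) {
    if (!file.exists(path)) next
    # A corrupt or truncated file makes readRDS() error; treat that as a miss.
    obj <- tryCatch(readRDS(path), error = function(e) NULL)
    if (!is.null(obj)) return(obj)
  }
  compute_live()
}

# Usage: look next to the vignette first, then under the package root.
res <- load_precomputed(
  candidates = c("varpro_precomputed.rds",
                 file.path("vignettes", "varpro_precomputed.rds")),
  compute_live = function() list(source = "live")
)
```

Wrapping `readRDS()` in `tryCatch()` means a missing file and an unreadable file take the same fallback path, which is the behaviour the review notes ask for.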
1 parent 147a211 commit 60d97cf

74 files changed

Lines changed: 1815 additions & 514 deletions

CRAN-SUBMISSION

Lines changed: 3 additions & 3 deletions
@@ -1,3 +1,3 @@
-Version: 2.7.3
-Date: 2026-05-12 13:23:24 UTC
-SHA: dd8e66f248a91e943c1c6dd1ffc2356058ac652b
+Version: 3.1.0
+Date: 2026-06-11 15:26:24 UTC
+SHA: a7d805290e69ae517d04846bf13fae6a01062fce

DESCRIPTION

Lines changed: 2 additions & 3 deletions
@@ -1,8 +1,8 @@
 Package: ggRandomForests
 Type: Package
 Title: Visually Exploring Random Forests
-Version: 3.0.0.9001
-Date: 2026-05-29
+Version: 3.1.0.9000
+Date: 2026-06-11
 Authors@R: person("John", "Ehrlinger",
 role = c("aut", "cre"),
 email = "john.ehrlinger@gmail.com")
@@ -44,7 +44,6 @@ Suggests:
 pkgdown,
 pkgload,
 knitr,
-plotly,
 ggraph,
 callr,
 randomForestRHF

NEWS.md

Lines changed: 36 additions & 1 deletion
@@ -1,8 +1,9 @@
 Package: ggRandomForests
-Version: 3.0.0.9001
+Version: 3.1.0.9000

 ggRandomForests v4.0.0 (development)
 ====================================
+* Development version 3.1.0.9000, opened after the v3.1.0 CRAN release.
 * `gg_auct()` / `plot.gg_auct()`: tidy wrapper and plot for time-varying
 AUC from `randomForestRHF::auct.rhf()` (RHF Phase 2). Returns a long
 frame `time / auc / se / lower / upper / marker` with an `iauc`
@@ -16,6 +17,40 @@ ggRandomForests v4.0.0 (development)
 `requireNamespace("randomForestRHF")`. No change for users who do not
 install it.

+ggRandomForests v3.1.0
+======================
+* Fix: `gg_vimp()` for single-outcome rfsrc forests now correctly flags
+variables with non-positive VIMP in the `positive` column (affecting
+plot coloring). The column was named `VIMP` (uppercase) in single-outcome
+fits but the flag check accessed `$vimp` (lowercase), leaving `positive`
+stuck at `TRUE` for all variables. Surfaced by the Copilot review on
+PR #109.
+* Documentation pass. Deepened the varPro-family and rfsrc
+importance/partial/survival help pages against the upstream
+randomForestSRC and varPro documentation, and made the line between
+`gg_vimp()` (permutation, Breiman-Cutler importance) and `gg_varpro()`
+(varPro release-rule importance) explicit and cross-linked. Vignette
+prose deepened with the same framing; one-line code-comment fixes;
+fixed a stale `@return` in `gg_roc()` (documented a `yvar` column the
+function does not return). No user-facing behaviour change.
+* Vignettes: the regression and survival partial-dependence surfaces are
+now rendered as static `ggplot2` heat maps instead of interactive
+`plotly` widgets, and figures render at 96 dpi. This cuts the installed
+size from ~17 MB to ~5 MB (the `plotly` library is no longer bundled into
+the vignette HTML). `plotly` is dropped from `Suggests`.
+* Check time: reduced the `R CMD check` vignette-rebuild and test timings to
+bring the overall CRAN check comfortably under budget (CRAN flagged the
+overall check time on the 3.1.0 submission). The regression and survival
+vignettes use lighter forests (`ntree` 200 / 150, imputation `ntree` 100)
+and coarser partial-dependence grids. The varpro vignette's three
+`gg_partial_varpro()` calls and the Boston `beta.varpro()` fit (~34 s
+combined) are precomputed offline by `vignettes/precompute_varpro.R` and
+loaded from `vignettes/varpro_precomputed.rds`, with an automatic
+live-computation fallback if the file is absent. The `gg_udependent()`
+tests memoise the per-fit entropy matrix (`varPro::get.beta.entropy()`,
+~1.5 s and a pure function of the fit) instead of recomputing it once per
+test. No user-facing behaviour change.
+
 ggRandomForests v3.0.0
 ======================
 * **Version jump to 3.0.0.** The varPro integration is a major scope
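The `gg_vimp()` fix noted in the v3.1.0 NEWS entry comes down to a case-sensitive column lookup. The sketch below reproduces the failure mode on a toy data frame. It is an illustrative reconstruction, not the package's code: the `imp` table, the `vars` column, and both `positive_*` vectors are hypothetical.

```r
# Toy stand-in for a single-outcome importance table; the column is `VIMP`.
imp <- data.frame(vars = c("x1", "x2", "const"),
                  VIMP = c(0.42, -0.03, 0))

# Buggy pattern (hypothetical reconstruction): `$vimp` does not match `VIMP`
# because `$` is case-sensitive, so it returns NULL and nothing is flagged.
positive_buggy <- rep(TRUE, nrow(imp))
positive_buggy[imp$vimp <= 0] <- FALSE

# Fixed pattern: read the column that actually exists.
positive_fixed <- imp$VIMP > 0

# The invariant the regression test asserts, plus a check that at least one
# non-positive value is present so the bug condition is exercised.
stopifnot(all(positive_fixed == (imp$VIMP > 0)), any(!positive_fixed))
```

The zero-valued `const` row plays the role of the zero-variance predictor described in the #117 review: it guarantees a non-positive VIMP exists, so the test cannot pass vacuously.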

R/calc_roc.R

Lines changed: 4 additions & 3 deletions
@@ -206,7 +206,7 @@ calc_roc <- function(object,
 }

 # Build the sensitivity/specificity table for a single class index k.
-# Plain lapply (not mclapply) per-threshold work is a single table()
+# Plain lapply (not mclapply): per-threshold work is a single table()
 # + a few arithmetic ops (microseconds); fork overhead would dominate,
 # and the closure-scope fragility caused the earlier xtabs/Windows
 # failure. Returns a data.frame with columns sens, spec, pct.
@@ -320,8 +320,9 @@ calc_auc <- function(x) {
 # Sort in decreasing specificity so FPR = 1-spec increases monotonically
 x <- x[order(x$spec, decreasing = TRUE), ]

-# Δ(FPR) = -(Δspec) — spec decreases, so (spec[i] - spec[i+1]) > 0
-# Average height of trapezoid = (sens[i] + sens[i+1]) / 2
+# Trapezoid area = sens_avg * Δspec, where Δspec = spec[i] - spec[i+1] > 0
+# (spec decreases left-to-right). This equals sens_avg * Δ(1-FPR), which
+# gives the standard AUC = ∫ sens d(FPR) with a positive sign.
 auc <- (x$sens + shift(x$sens)) / 2 * (x$spec - shift(x$spec)) # nolint: object_usage_linter
 sum(auc, na.rm = TRUE)
 }
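The trapezoid rule in the rewritten comment can be checked on a three-point ROC curve. This is a minimal base-R sketch, assuming the `sens`/`spec` column layout described above. `calc_auc_sketch()` is a hypothetical stand-in that replaces the package's internal `shift()` helper with plain vector indexing.

```r
# Hypothetical re-implementation of the trapezoid sum using base R only.
calc_auc_sketch <- function(x) {
  # Decreasing specificity, so FPR = 1 - spec increases along the rows.
  x <- x[order(x$spec, decreasing = TRUE), ]
  n <- nrow(x)
  sens_avg <- (x$sens[-n] + x$sens[-1]) / 2  # average height of each trapezoid
  d_spec   <- x$spec[-n] - x$spec[-1]        # positive width of each trapezoid
  sum(sens_avg * d_spec)
}

roc <- data.frame(sens = c(0, 0.8, 1), spec = c(1, 0.7, 0))
auc <- calc_auc_sketch(roc)
# Two trapezoids: 0.4 * 0.3 + 0.9 * 0.7 = 0.75
```

Because each width is taken as `spec[i] - spec[i+1]`, every term is non-negative and the sum needs no sign flip.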

R/gg_beta_varpro.R

Lines changed: 27 additions & 16 deletions
@@ -11,11 +11,22 @@
 #' `beta.varpro()` step once and reuse the result.
 #'
 #' @section What this is doing:
-#' For each rule (a tree-branch pair) in the forest, [varPro::beta.varpro()]
-#' fits a one-predictor lasso regression of the response on the released
-#' variable's values, restricted to the OOB observations inside the rule's
-#' region. The wrapper aggregates those per-rule coefficients into one
-#' number per variable.
+#' Think of the varPro release-rule mechanism as asking: "given a region of
+#' the feature space that the forest carved out, what changes when I remove
+#' the constraint on this one variable and let observations leave?" The
+#' standard importance answer (from [gg_varpro()]) measures that change as a
+#' z-scored contrast between local estimators: no synthetic data, no
+#' permutation. \code{beta.varpro()} asks the same question with a different
+#' ruler: for each rule (a tree-branch pair), it fits a one-predictor lasso
+#' regression of the response on the released variable's values, restricted
+#' to the OOB observations inside the rule's region. The wrapper aggregates
+#' those per-rule coefficients into one number per variable.
+#'
+#' The key distinction from [gg_vimp()], which measures Breiman-Cutler
+#' permutation importance by perturbing a variable's values and watching OOB
+#' error climb, is that neither [gg_varpro()] nor \code{gg_beta_varpro()}
+#' touches the data synthetically: all contrasts are between real subsets
+#' defined by the forest's rules.
 #'
 #' @section What `imp` actually is (pedantic, because the column name is misleading):
 #' The `imp` column on `beta.varpro()`'s `$results` is **not** a
@@ -63,7 +74,7 @@
 #'
 #' @section What you use this for:
 #' Picking variables when local effects matter more than aggregate
-#' split-strength contribution. Compare side-by-side with [gg_varpro()]
+#' split-strength contribution. Compare side-by-side with [gg_varpro()]:
 #' a variable that scores high here but low in `gg_varpro` is one whose
 #' local linear effect inside many rules is real even though its
 #' release-rule contrast is modest.
@@ -92,7 +103,7 @@
 #' class.
 #'
 #' **Binary default**: `which_class = NULL` resolves to the *last*
-#' factor level of the response the positive-class convention used
+#' factor level of the response, the positive-class convention used
 #' by `glm` and `gg_roc`. For a 30-day-mortality outcome with levels
 #' `c("no", "yes")`, that means the wrapper shows you `"yes"` (the
 #' event) by default.
@@ -118,7 +129,7 @@
 #' @section Reproducibility:
 #' Byte-for-byte agreement between cached (`beta_fit = b`) and uncached
 #' (`beta_fit = NULL`) outputs requires that `b` was computed by
-#' `beta.varpro(object, ...)` on the same `object` `set.seed()` alone is
+#' `beta.varpro(object, ...)` on the same `object`; `set.seed()` alone is
 #' not sufficient, because `beta.varpro`'s internal `cv.glmnet` fits can
 #' pick slightly different folds across separate calls. Reuse `beta_fit`
 #' when reproducibility matters.
@@ -132,15 +143,15 @@
 #' @param ... Forwarded to [varPro::beta.varpro()] when `beta_fit = NULL`;
 #' ignored otherwise (with a warning). Documented forwardables: `use.cv`,
 #' `use.1se`, `nfolds`, `maxit`, `thresh`, `max.rules.tree`, `max.tree`.
-#' @param cutoff Selection threshold on `beta_mean`. `NULL` (default)
+#' @param cutoff Selection threshold on `beta_mean`. `NULL` (default) means
 #' `mean(beta_mean)` across released variables. Numeric scalar otherwise.
 #' @param beta_fit Optional pre-computed [varPro::beta.varpro()] result for
-#' the same `object`. `NULL` (default) the wrapper runs `beta.varpro()`
+#' the same `object`. `NULL` (default) means the wrapper runs `beta.varpro()`
 #' itself. When supplied, must be a `varpro`-class object whose `$results`
 #' has columns `tree / branch / variable / n.oob / imp`.
 #' @param which_class For a classification fit, name of a single response
 #' level to subset on. `NULL` (default) returns all classes (binary fits
-#' resolve to the *last* factor level the positive-class convention
+#' resolve to the *last* factor level, the positive-class convention
 #' used by `glm` and `gg_roc`). Ignored with a warning on regression
 #' fits.
 #'
@@ -153,7 +164,7 @@
 #' the same row order. `which_class` (or the binary default
 #' last-factor-level) collapses the output to a single class.
 #'
-#' @seealso [gg_varpro()], [plot.gg_beta_varpro()], [varPro::beta.varpro()].
+#' @seealso [gg_varpro()], [gg_vimp()], [plot.gg_beta_varpro()], [varPro::beta.varpro()].
 #'
 #' @examples
 #' \donttest{
@@ -208,7 +219,7 @@ gg_beta_varpro.varpro <- function(object, ..., cutoff = NULL,
 which_class <- NULL
 }

-# Capture use.cv from `...` here (NOT inside the internals the dots
+# Capture use.cv from `...` here (NOT inside the internals; the dots
 # don't pass through to the internal frame).
 dots_use_cv <- if (is.null(beta_fit)) isTRUE(list(...)$use.cv) else NA

@@ -372,7 +383,7 @@ gg_beta_varpro.varpro <- function(object, ..., cutoff = NULL,
 ord_names <- names(sort(beta_mean_total, decreasing = TRUE))
 lvl <- rev(ord_names)

-# Per-class aggregation long format
+# Per-class aggregation: long format
 rows <- list()
 for (k in seq_len(n_classes)) {
 col <- imp_cols[k]
@@ -452,8 +463,8 @@ gg_beta_varpro.varpro <- function(object, ..., cutoff = NULL,
 class(base) <- c("gg_beta_varpro", "data.frame")

 # Build provenance with shape-stable cutoff:
-# regr c("regr" = NA_real_)
-# class named NA_real_ vector, one entry per class level
+# regr gives c("regr" = NA_real_)
+# class gives named NA_real_ vector, one entry per class level
 if (fam == "class") {
 class_levels <- .class_levels_from_varpro(object)
 cutoff_empty <- stats::setNames(rep(NA_real_, length(class_levels)),
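Two defaults documented in this file, the binary `which_class` resolution and the `cutoff = NULL` rule, can be illustrated with toy values. This sketch does not call varPro. The response `y`, the `beta_mean` values, and the strict `>` comparison used for selection are all assumptions made for illustration.

```r
# Binary default: the *last* factor level is treated as the positive class.
y <- factor(c("no", "yes", "no", "yes"), levels = c("no", "yes"))
default_class <- levels(y)[nlevels(y)]

# cutoff = NULL: threshold at the mean of beta_mean across released variables.
beta_mean <- c(age = 0.8, bili = 0.3, sex = 0.1)   # made-up values
cutoff <- mean(beta_mean)

# Assumed selection rule: keep variables strictly above the cutoff.
selected <- names(beta_mean)[beta_mean > cutoff]
```

With these toy numbers the cutoff is 0.4, so only `age` clears it. Whether the package's comparison is strict or inclusive is not stated in the text above.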

R/gg_brier.R

Lines changed: 30 additions & 17 deletions
@@ -16,24 +16,37 @@
 #'
 #' The Brier score asks a familiar question of any probabilistic forecast:
 #' how far did the predicted probability sit from what actually happened?
-#' For a survival forest the forecast is the predicted survival probability,
-#' and the score is computed at each event time, so the result is a curve
-#' rather than a single number -- lower is better, at every time.
+#' For a survival forest the forecast is the predicted survival probability
+#' at a given moment, and the "what happened" is whether the subject was
+#' still alive at that moment. The score is computed at every event time,
+#' so you get a curve rather than a single number -- lower is better
+#' everywhere. A perfectly calibrated forest that predicts \code{0} for
+#' every subject who died and \code{1} for every subject who survived would
+#' score \code{0}; a forest that predicts \code{0.5} for everyone scores
+#' roughly \code{0.25} regardless of the true outcome -- that is the
+#' "uninformative" ceiling.
 #'
-#' This function extracts that time-resolved Brier score for a survival
-#' forest grown with \code{randomForestSRC}, both overall and split by
-#' mortality-risk quartile. It also returns the continuous ranked
-#' probability score (CRPS), which is the Brier score integrated over time
-#' and divided by elapsed time -- a running average of the curve so far.
+#' This function extracts the time-resolved Brier score for a survival
+#' forest grown with \code{randomForestSRC}, both overall and broken down
+#' by mortality-risk quartile (lowest-risk to highest-risk subjects). It
+#' also returns the continuous ranked probability score (CRPS) -- the Brier
+#' score integrated over time and divided by elapsed time, a running average
+#' that summarises calibration up to each point on the time axis.
 #'
-#' @details This wraps \code{\link[randomForestSRC]{get.brier.survival}} and
-#' rebuilds the quartile decomposition and running CRPS from the returned
-#' \code{brier.matx} and \code{mort} components, following the computation
-#' in the internal \code{plot.survival} function of \pkg{randomForestSRC}.
-#' Right-censored data make a plain Brier score biased, so the score uses
-#' inverse-probability-of-censoring weighting. The censoring distribution
-#' is estimated either by Kaplan-Meier (\code{cens.model = "km"}, the
-#' default) or by a separate censoring forest (\code{cens.model = "rfsrc"}).
+#' @details
+#' Because subjects are right-censored, a plain Brier score is biased:
+#' censored subjects contribute no outcome information yet still inflate the
+#' denominator. The score here uses inverse-probability-of-censoring
+#' weighting (IPCW), which up-weights uncensored observations to compensate.
+#' The censoring distribution is estimated either by Kaplan-Meier
+#' (\code{cens.model = "km"}, the default) or by a separate censoring
+#' forest (\code{cens.model = "rfsrc"}) when the censoring mechanism is
+#' itself covariate-dependent.
+#'
+#' Internally, this wraps \code{\link[randomForestSRC]{get.brier.survival}}
+#' and rebuilds the quartile decomposition and running CRPS from the returned
+#' \code{brier.matx} and \code{mort} components, following the approach in
+#' the internal \code{plot.survival} of \pkg{randomForestSRC}.
 #'
 #' @param object A fitted \code{\link[randomForestSRC]{rfsrc}} survival
 #' forest (\code{object$family == "surv"}).
@@ -143,7 +156,7 @@ gg_brier.rfsrc <- function(object,
 bs_quartile <- vapply(seq_len(4), function(k) {
 in_bin <- mort > mort_breaks[k] & mort <= mort_breaks[k + 1]
 if (!any(in_bin, na.rm = TRUE)) {
-# Empty bin can occur when mort has ties at a quantile boundary.
+# Empty bin: can occur when mort has ties at a quantile boundary.
 return(rep(NA_real_, nrow(bs_df)))
 }
 colMeans(brier_matx[in_bin, , drop = FALSE], na.rm = TRUE)
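The two reference points in the rewritten Brier description (a score of 0 for perfect forecasts, about 0.25 for a constant 0.5 forecast) follow from the squared-error definition at a single time point. This is a minimal sketch that ignores censoring, so no IPCW weights appear. `brier_at_t()` and the `alive` vector are hypothetical.

```r
# Unweighted Brier score at one time point: mean squared gap between the
# predicted survival probability and the 0/1 "still alive" indicator.
brier_at_t <- function(pred_surv, alive) mean((pred_surv - alive)^2)

alive <- c(1, 1, 0, 1, 0, 0)          # made-up status at time t

perfect       <- brier_at_t(alive, alive)         # forecasts match outcomes
uninformative <- brier_at_t(rep(0.5, 6), alive)   # 0.5 for everyone
```

Without censoring weights the constant-0.5 forecast scores exactly 0.25, since every squared gap is 0.25. The "roughly" in the documentation text is consistent with the IPCW weights shifting that value slightly.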

R/gg_isopro.R

Lines changed: 4 additions & 4 deletions
@@ -20,7 +20,7 @@
 #' a typical observation sits in the dense middle of the feature cloud and
 #' takes many splits to isolate, while an unusual observation sits out
 #' near an edge and gets cut off after only a few. So \strong{the depth at
-#' which an observation is isolated is a proxy for how typical it is}
+#' which an observation is isolated is a proxy for how typical it is}:
 #' shallow depth means anomalous, deep depth means ordinary. Average a
 #' single observation's depth across many trees and the noise washes out,
 #' leaving a stable per-observation rank.
@@ -68,7 +68,7 @@
 #' against a fitted model and compare the test scores to the training
 #' distribution.
 #' }
-#' The score is a \emph{rank}, not a probability of being an outlier two
+#' The score is a \emph{rank}, not a probability of being an outlier: two
 #' observations with \code{howbad = 0.92} are both unusual, not "92\%
 #' likely to be anomalous". Pick a cutoff by looking at where the elbow
 #' rises; \code{\link{plot.gg_isopro}} can annotate either a score
@@ -86,7 +86,7 @@
 #' \code{howbad} (where \emph{higher} is more anomalous). The wrapper
 #' exposes both conventions so nothing is hidden:
 #' \itemize{
-#' \item \code{case.depth} carries varPro's native polarity \emph{lower
+#' \item \code{case.depth} carries varPro's native polarity, \emph{lower
 #' = more anomalous}. This is the unmodified output of
 #' \code{predict(object, newdata, quantiles = FALSE)}. Use it to
 #' cross-reference against raw varPro output.
@@ -128,7 +128,7 @@
 #' order as the rows of the data passed to
 #' \code{\link[varPro]{isopro}}.}
 #' \item{case.depth}{Numeric; mean isolation depth across the forest.
-#' Lower means the observation was isolated quickly more
+#' Lower means the observation was isolated quickly, so more
 #' anomalous.}
 #' \item{howbad}{Numeric in \code{[0, 1]}; the \code{case.depth}
 #' values pushed through their own empirical CDF and flipped so
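The polarity flip between `case.depth` and `howbad` can be pictured with `stats::ecdf()`. The sketch assumes the flip is `1 - ECDF(depth)`. The excerpt above is cut off before it says exactly how the flip is done, so the formula and the toy `depth` values are illustrative only.

```r
# Made-up mean isolation depths: the first observation is isolated quickly.
depth <- c(2, 8, 9, 10)

# Push the depths through their own empirical CDF, then flip so that
# shallow (anomalous) observations end up with the highest score.
# Assumed formula; the package's exact transform may differ.
howbad <- 1 - ecdf(depth)(depth)
```

The result is a rank-like score in `[0, 1]` whose ordering is the reverse of `case.depth`, which is why two observations with similar `howbad` values are comparably unusual rather than comparably "probable" outliers.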

R/gg_ivarpro.R

Lines changed: 23 additions & 11 deletions
@@ -10,15 +10,27 @@
 #' `ivarpro()` call.
 #'
 #' @section What this is doing:
-#' `ivarpro()` walks the varPro forest's rules and, for each
-#' (observation, variable) pair, computes a scaled per-rule
-#' contribution to predicting that observation. Per-rule LOO removes
-#' the observation from its own rule before scoring. Per-region
-#' scaling (`scale = "local"`, default) standardises the contribution
-#' by the rule's local response standard deviation so values are
-#' comparable across rules of different size. Aggregating those
-#' per-rule scores into one number per (obs, variable) pair gives the
-#' `local_imp` cell.
+#' The varPro framework builds importance from release rules: for a given
+#' rule region, it compares a local estimator inside that region to what
+#' the estimator becomes after the constraint on the tested variable is
+#' removed ("released"). That contrast is summed over many rules and trees
+#' to get a global z-score: the quantity [gg_varpro()] shows. What
+#' `ivarpro()` adds is a per-observation view of the same mechanism.
+#'
+#' Concretely: `ivarpro()` walks the forest's rules and, for each
+#' (observation, variable) pair, computes a scaled per-rule contribution
+#' to predicting that observation. Per-rule LOO removes the observation
+#' from its own rule before scoring, so the contribution is not inflated
+#' by the observation having helped define the region. Per-region scaling
+#' (`scale = "local"`, default) standardises the contribution by the
+#' rule's local response standard deviation so values are comparable
+#' across rules of different size. Aggregating those per-rule scores into
+#' one number per (obs, variable) pair gives the `local_imp` cell.
+#'
+#' No permutation, no synthetic data: the contrast is always between real
+#' subsets of the observed data, defined by the forest's own rules. This
+#' is the same no-synthetic-features property that distinguishes
+#' [gg_varpro()] from [gg_vimp()]'s Breiman-Cutler permutation importance.
 #'
 #' @section What `local_imp` actually is (pedantic):
 #' `local_imp[i, v]` is the **scaled aggregated rule contribution** of
@@ -126,7 +138,7 @@
 #' `mean(|local_imp|)` descending across all rows (the unified
 #' ranking axis shared across facets / panels).
 #'
-#' @seealso [gg_varpro()], [gg_beta_varpro()], [varPro::ivarpro()].
+#' @seealso [gg_varpro()], [gg_vimp()], [gg_beta_varpro()], [varPro::ivarpro()].
 #'
 #' @examples
 #' \donttest{
@@ -415,7 +427,7 @@ gg_ivarpro.varpro <- function(object, ..., which_obs = NULL,
 }

 # Unified factor-level ordering across all (obs, class), REVERSED so the
-# most-important variable lands at the TOP after coord_flip shared
+# most-important variable lands at the TOP after coord_flip; shared
 # across every class facet for alignment.
 agg <- tapply(abs(long$local_imp), long$variable, mean, na.rm = TRUE)
 ord_names <- names(sort(agg, decreasing = TRUE))
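The ranking step in this hunk can be run on a toy long-format frame to see how the reversed factor levels come out. The `long` data below is made up, and applying `rev()` to the ordered names mirrors the `lvl <- rev(ord_names)` line in `gg_beta_varpro.R`. That the same step follows here is an assumption based on the comment.

```r
# Toy long-format frame: one row per (observation, variable) cell.
long <- data.frame(
  variable  = c("age", "bili", "age", "bili", "sex", "sex"),
  local_imp = c(0.2, -0.9, 0.4, 0.7, -0.1, 0.05)
)

# Rank variables by mean absolute local importance, largest first.
agg <- tapply(abs(long$local_imp), long$variable, mean, na.rm = TRUE)
ord_names <- names(sort(agg, decreasing = TRUE))

# Reverse the levels so the top-ranked variable is the LAST level, which a
# flipped discrete axis draws at the top.
long$variable <- factor(long$variable, levels = rev(ord_names))
```

Setting the levels once on the full frame, rather than per facet, is what keeps the variable order aligned across every class panel.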
