Commit 5c78a66

ehrlinger and claude authored
docs: pedagogical audit of varPro wrappers (gg_partial_varpro, gg_varpro, gg_udependent) (#95)
* docs(gg_partial_varpro): teach what varPro partialpro is doing
* docs(gg_varpro): teach what varpro variable priority is doing
* docs(gg_udependent): teach what cross-variable dependency is doing
* chore: open v2.7.3.9009 + NEWS for varPro pedagogical doc audit
* docs: enable roxygen2 markdown package-wide

  Add Roxygen: list(markdown = TRUE) to DESCRIPTION so devtools::document() auto-converts backticks / [fn()] / [pkg::fn()] in source roxygen to \code{} / \link{} / \link[pkg]{} in the generated Rd. Existing Rd-style markup keeps working; both styles now coexist. Saves the manual conversion work the Copilot review on PR #94 flagged.

  Two source-roxygen edits needed to keep R CMD check clean under markdown:
  - R/help.R: randomForest[SRC] -> randomForestSRC (markdown read [SRC] as an unfinished link reference, producing a missing-link warning).
  - R/gg_rfsrc.R::bootstrap_survival: 95\% -> 95% (markdown over-escaped the backslash, producing a malformed Rd with shifted section order).

  Regenerates all 31 Rd files. No functional or rendered-content change.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: address PR #95 Copilot review
  - plot.gg_udependent: clarify that truly isolated nodes are dropped by gg_udependent() before plotting; reword 'Isolated' as 'low-degree'.
  - gg_partial_varpro: fix varpro::partialpro -> varPro::partialpro (six instances) so \link{} renders correctly.
  - plot.gg_varpro: clarify the cutoff line lives in z-units on the default axis and in raw-importance units when type='raw'; the numeric is the same, the scale is not.
  - gg_varpro reference: complete the dangling 'arXiv 2409.' with the full arXiv:2409.09003 identifier and an https://arxiv.org link.
* docs: regenerate gg_isopro Rd under markdown mode post-rebase

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 6f4d649 commit 5c78a66

44 files changed

Lines changed: 884 additions & 282 deletions

DESCRIPTION

Lines changed: 2 additions & 1 deletion
@@ -1,7 +1,7 @@
 Package: ggRandomForests
 Type: Package
 Title: Visually Exploring Random Forests
-Version: 2.7.3.9008
+Version: 2.7.3.9009
 Date: 2026-05-21
 Authors@R: person("John", "Ehrlinger",
 role = c("aut", "cre"),
@@ -50,3 +50,4 @@ Suggests:
 callr
 VignetteBuilder: quarto
 Config/roxygen2/version: 8.0.0
+Roxygen: list(markdown = TRUE)
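
With `Roxygen: list(markdown = TRUE)` in `DESCRIPTION`, a single roxygen block can mix markdown and Rd-style markup. The sketch below shows both styles side by side on a hypothetical helper, `clamp01()`, which is not part of ggRandomForests and exists only to illustrate the syntax.

```r
#' Clamp a numeric vector to the unit interval
#'
#' Markdown style: `x` is clamped using [pmin()] and [pmax()].
#' Rd style still works: \code{x} is clamped using \code{\link{pmin}}.
#'
#' @param x A numeric vector.
#' @return A numeric vector the same length as `x`, with values between
#'   0 and 1.
clamp01 <- function(x) {
  # pmin() caps values at 1, pmax() then lifts anything below 0
  pmax(0, pmin(1, x))
}

clamp01(c(-0.5, 0.25, 1.5))
```

Running `devtools::document()` on a block like this should produce `\code{}` and `\link{}` markup in the generated Rd for both spellings; the rendered help page is the place to check.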

NEWS.md

Lines changed: 18 additions & 1 deletion
@@ -1,8 +1,25 @@
 Package: ggRandomForests
-Version: 2.7.3.9008
+Version: 2.7.3.9009
 
 ggRandomForests v2.8.0 (development) — continued
 =================================================
+* Documentation: pedagogical pass over the varPro wrappers
+(`gg_partial_varpro`, `gg_varpro`, `gg_udependent` and their `plot.*`
+methods). Each help page now has explicit "What X is doing", "What's
+in the output", and "What you use this for" sections so a reader new
+to varPro can learn the underlying method (release rules, beta-entropy
+dependency, parametric / nonparametric / causal partial estimators)
+from the help page alone, not just the wrapper mechanics. No API or
+behavioural change.
+* Documentation: enable roxygen2 markdown package-wide via
+`Roxygen: list(markdown = TRUE)` in `DESCRIPTION`. New roxygen blocks
+can use backticks and `[fn()]` link syntax; existing `\code{}` /
+`\link{}` markup keeps working. Two source-roxygen edits to keep
+R CMD check clean: `randomForest[SRC]` in `R/help.R` (markdown read
+it as an unfinished link) becomes plain `randomForestSRC`; the `95\%`
+escape in `R/gg_rfsrc.R::bootstrap_survival` becomes a literal `95%`.
+No API or rendered-doc behavioural change beyond the conventions
+switch.
 * New `gg_isopro()` and `plot.gg_isopro()`: tidy wrapper and ranked-elbow +
 density visualisation for `varPro::isopro` isolation-forest anomaly
 scores. `plot.gg_isopro()` takes `panel = c("both", "elbow", "density")`

R/gg_partial_varpro.R

Lines changed: 62 additions & 3 deletions
@@ -1,19 +1,78 @@
 ##=============================================================================
 #' Partial dependence data from a varPro model
 #'
-#' \code{varpro::partialpro} returns one list, with continuous and
+#' \code{varPro::partialpro} returns one list, with continuous and
 #' categorical predictors mixed together. This function splits that list into
 #' two tidy data frames, one for each kind, and resolves the y-axis label the
 #' plot method will use.
 #'
-#' @param part_dta Partial plot data from \code{varpro::partialpro}. Each
+#' @section What partialpro is doing:
+#' A partial dependence curve answers the question, "if I hold a single
+#' variable at a grid of values and average out everything else, how does
+#' the model's prediction move?" That is the same question \code{rfsrc}
+#' partial dependence answers. What \code{varPro::partialpro} adds is two
+#' wrinkles that are worth understanding before you read the curves.
+#'
+#' First, \code{partialpro} filters the partial grid through an isolation
+#' forest (Unlimited Virtual Twins, or UVT) so that unlikely combinations
+#' of the focal variable with the rest of the data are downweighted. The
+#' \code{rfsrc} version, by contrast, averages over the full marginal grid
+#' regardless of plausibility. So when a covariate is highly correlated
+#' with others, the two methods can disagree, and \code{partialpro}'s
+#' curve is the one restricted to the data manifold.
+#'
+#' Second, \code{partialpro} fits a local polynomial model to the
+#' predicted values rather than just plotting their mean. That gives
+#' three parallel curves per variable, stored as \code{yhat.par},
+#' \code{yhat.nonpar}, and \code{yhat.causal}, which the plot method
+#' overlays so you can see whether a smooth parametric story and the
+#' raw forest predictions are telling you the same thing.
+#'
+#' Interpretation of the y-axis depends on the outcome (per
+#' \code{varPro::partialpro}): response scale for regression, log-odds of
+#' the target class for classification, and either ensemble mortality
+#' (default) or RMST (if the original \code{varpro} call set
+#' \code{rmst}) for survival.
+#'
+#' @section What's in the output:
+#' We split \code{partialpro}'s mixed list into two tidy data frames so
+#' the plot method does not have to. A variable with more than
+#' \code{cat_limit} distinct grid points goes into \code{$continuous},
+#' one row per grid point with the column means of \code{yhat.par},
+#' \code{yhat.nonpar}, and \code{yhat.causal} stored as
+#' \code{parametric}, \code{nonparametric}, and \code{causal}. A
+#' variable at or below \code{cat_limit} goes into \code{$categorical},
+#' one row per observation per category level, carrying the same three
+#' columns unaveraged so the plot method can draw boxplots. Path C
+#' (\code{scale \%in\% c("surv","chf")}) takes a different route: we
+#' hand the underlying \code{rfsrc} forest to \code{gg_partial_rfsrc} so
+#' you get a survival-probability or cumulative-hazard curve on the
+#' usual rfsrc scale instead.
+#'
+#' @section What you use this for:
+#' \itemize{
+#' \item read the marginal shape of a relationship the varpro model
+#' found important — monotone, threshold, U-shape, flat;
+#' \item compare the three partialpro estimators on the same variable
+#' and flag the ones where parametric and nonparametric disagree —
+#' those are the candidates for closer inspection;
+#' \item report a survival partial dependence on the probability or
+#' cumulative-hazard scale (\code{scale = "surv"} or \code{"chf"})
+#' rather than the unbounded mortality scale.
+#' }
+#' A varpro partial dependence curve is a description of the model, not
+#' a causal effect. The \code{causal} column is varpro's local
+#' estimator, not a structural causal claim about the data-generating
+#' process.
+#'
+#' @param part_dta Partial plot data from \code{varPro::partialpro}. Each
 #' element must contain \code{xvirtual}, \code{xorg}, \code{yhat.par},
 #' \code{yhat.nonpar}, and \code{yhat.causal}. Supply at least one of
 #' \code{part_dta} or \code{object}.
 #' @param object A fitted \code{varpro} object, the forest the partial data
 #' came from. When supplied it provides the provenance metadata, and when
 #' \code{part_dta} is \code{NULL} it is passed to
-#' \code{varpro::partialpro(object)} for you. Required when
+#' \code{varPro::partialpro(object)} for you. Required when
 #' \code{scale \%in\% c("surv","chf")}.
 #' @param scale Character; sets the y-axis label and, for survival forests,
 #' the output type. One of \code{"auto"} (default), \code{"mortality"},
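
The `cat_limit` routing described in the "What's in the output" section of this file can be sketched in base R. This is an illustration on toy data, not the package's implementation: `make_toy()` and `tidy_one()` are hypothetical helpers, the `cat_limit = 10` default is arbitrary, and the assumed shape of `yhat.par` (observations in rows, grid points in columns) is inferred from the "column means" wording rather than taken from `varPro`.

```r
# Toy stand-in for one element of a partialpro-style list.
# Assumed shape: rows of yhat.par are observations, columns are grid points.
make_toy <- function(grid, n = 4) {
  list(
    xvirtual = grid,
    # every observation's prediction equals the grid value, for easy checking
    yhat.par = matrix(rep(grid, each = n), nrow = n)
  )
}

# Hypothetical helper: route one variable by its number of distinct grid points.
tidy_one <- function(el, cat_limit = 10) {
  n <- nrow(el$yhat.par)
  if (length(unique(el$xvirtual)) > cat_limit) {
    # continuous: one row per grid point, averaged over observations
    data.frame(kind = "continuous",
               x = el$xvirtual,
               parametric = colMeans(el$yhat.par))
  } else {
    # categorical: one row per observation per level, left unaveraged
    data.frame(kind = "categorical",
               x = rep(el$xvirtual, each = n),
               parametric = as.vector(el$yhat.par))
  }
}

cont  <- tidy_one(make_toy(seq(0, 1, length.out = 25)))
categ <- tidy_one(make_toy(c(1, 2, 3)))
nrow(cont)   # one row per grid point
nrow(categ)  # levels times observations
```

Keeping the categorical rows unaveraged is what lets a plot method draw a boxplot per level; the continuous branch only needs one summary value per grid point to draw a line.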

R/gg_rfsrc.R

Lines changed: 1 addition & 1 deletion
@@ -383,7 +383,7 @@ gg_rfsrc.rfsrc <- function(object, # nolint: cyclocomp_linter
 #' @param bs_samples Integer; number of bootstrap resamples.
 #' @param level_set Numeric vector of length 2 giving the lower and upper
 #' quantile probabilities for the confidence band (e.g. \code{c(0.025, 0.975)}
-#' for a 95\% CI).
+#' for a 95% CI).
 #'
 #' @return A \code{data.frame} with one row per unique event time and columns
 #' \code{value} (time), \code{lower}, \code{upper}, \code{median}, and

R/gg_udependent.R

Lines changed: 61 additions & 0 deletions
@@ -7,6 +7,67 @@
 #' \code{\link[varPro]{sdependent}}, and returns them as a tidy list that
 #' \code{plot.gg_udependent} can draw as a network.
 #'
+#' @section What cross-variable dependency is doing:
+#' UVarPro (Zhou, Lu and Ishwaran, 2026) extends the varpro framework to
+#' the unsupervised setting: grow a forest without a response, then use
+#' the same region-release contrasts varpro uses for supervised
+#' importance to ask, "which variables explain the structure in the
+#' data?" The lasso-driven variant frames each region-release contrast
+#' as a classification task — does an observation belong to the region
+#' or to its release? — and fits a lasso logistic regression with the
+#' other variables as predictors. The coefficient on variable \eqn{j}
+#' in the model for variable \eqn{i}'s region-release contrast is the
+#' entry \eqn{I[i, j]} of the matrix \code{varPro::get.beta.entropy()}
+#' returns.
+#'
+#' Read that entry as "how much does knowing \eqn{j} help separate
+#' \eqn{i}'s region from its release". A large \eqn{I[i, j]} says
+#' \eqn{j} carries information about the structure varpro picked up in
+#' \eqn{i}. \code{varPro::sdependent} thresholds that matrix at a
+#' user-chosen cut and returns the set of "signal" variables — the
+#' nodes with high enough out-degree to be worth keeping. We pass the
+#' threshold through to \code{sdependent} and use the same matrix to
+#' weight the edges of the resulting graph.
+#'
+#' The graph is directed by default because \eqn{I[i, j]} and
+#' \eqn{I[j, i]} are separate lasso coefficients and need not agree;
+#' setting \code{directed = FALSE} collapses each pair by taking the
+#' larger of the two, which is appropriate when you only want to see
+#' that two variables are dependent, not which way the dependency
+#' reads.
+#'
+#' @section What's in the output:
+#' \code{$edges} has one row per surviving edge with the raw weight
+#' \code{I[i, j]} (or, for undirected graphs, the max of the two
+#' directions). \code{$nodes} has one row per surviving variable with
+#' its degree (out-degree for directed, total degree for undirected)
+#' and a \code{selected} flag for membership in the \code{sdependent}
+#' signal set. \code{$graph} is the same information packaged as an
+#' \code{igraph} object, with \code{weight}, \code{degree}, and
+#' \code{selected} attached so \code{plot.gg_udependent} can render it
+#' without recomputing anything.
+#'
+#' @section What you use this for:
+#' \itemize{
+#' \item screen a wide unsupervised dataset for the small set of
+#' variables UVarPro thinks are carrying the signal — the nodes
+#' with high degree, or those flagged \code{selected = TRUE};
+#' \item spot clusters of mutually dependent variables (hubs and the
+#' spokes around them) that may be measuring the same underlying
+#' construct;
+#' \item compare two datasets, or two preprocessing pipelines, by
+#' looking at how their dependency graphs change.
+#' }
+#' An edge in this graph is a statistical dependency in the unsupervised
+#' decomposition of the data — it is not a causal arrow. A high
+#' \eqn{I[i, j]} says \eqn{j} predicts \eqn{i}'s region membership,
+#' not that \eqn{j} causes \eqn{i}.
+#'
+#' @references
+#' Zhou, L., Lu, M. and Ishwaran, H. (2026). Variable priority for
+#' unsupervised variable selection. \emph{Pattern Recognition},
+#' 172:112727.
+#'
 #' @param object A fitted \code{uvarpro} object (required).
 #' @param threshold Numeric; the positive dependency threshold passed on to
 #' \code{sdependent()}. An edge \eqn{i \to j} is drawn when
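
The directed versus undirected behaviour documented in this file can be illustrated on a toy matrix. The sketch below is a base-R illustration under stated assumptions, not the package's code: `edges_from()` is a hypothetical helper, the matrix values are made up, and the strict `>` comparison against the threshold is an assumption.

```r
# Toy dependency matrix; entry I[i, j] plays the role described in the docs.
vars <- c("a", "b", "c")
I <- matrix(c(0.0, 0.9, 0.1,
              0.3, 0.0, 0.0,
              0.6, 0.2, 0.0),
            nrow = 3, byrow = TRUE, dimnames = list(vars, vars))

# Hypothetical helper: threshold the matrix into an edge list.
edges_from <- function(I, threshold, directed = TRUE) {
  if (!directed) {
    I <- pmax(I, t(I))        # keep the larger of the two directions
    I[lower.tri(I)] <- 0      # report each unordered pair once
  }
  diag(I) <- 0                # no self-loops
  idx <- which(I > threshold, arr.ind = TRUE)
  data.frame(from = rownames(I)[idx[, "row"]],
             to = colnames(I)[idx[, "col"]],
             weight = I[idx],
             stringsAsFactors = FALSE)
}

directed_edges   <- edges_from(I, threshold = 0.25, directed = TRUE)
undirected_edges <- edges_from(I, threshold = 0.25, directed = FALSE)
directed_edges
undirected_edges
```

In the directed case the pair `a`/`b` contributes two edges with different weights; collapsing by the maximum leaves a single edge carrying the larger weight, which matches the "dependent, but not which way" reading.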

R/gg_varpro.R

Lines changed: 64 additions & 0 deletions
@@ -8,6 +8,70 @@
 #' For a classification forest you can also keep the class-conditional
 #' importances.
 #'
+#' @section What varpro is doing:
+#' Permutation importance asks "what happens to OOB accuracy when I scramble
+#' this variable?" That works, but it leans on artificial data — the
+#' permuted column — and the answer can be unstable when variables are
+#' correlated. The varpro framework (Lu and Ishwaran, 2024) replaces
+#' permutation with \emph{release rules}. The forest is grown with guided
+#' splitting; from a subset of trees varpro samples a collection of
+#' decision-rule branches; for each variable it then compares the
+#' response inside the rule's region to the response after the rule's
+#' constraint on that variable is "released". The size of that change,
+#' aggregated over many rules and trees, is the variable's importance.
+#' No synthetic covariates, no permutation — the contrast is between two
+#' real subsets of the data.
+#'
+#' Because varpro builds importance from rules sampled over trees, every
+#' tree contributes its own importance value for each variable. Those are
+#' the per-tree scores we summarise here. With \code{local.std = TRUE}
+#' (the default) the per-tree values are standardised by their column
+#' standard deviation so the column mean equals the aggregate z-score
+#' returned by \code{varPro::importance()}; that z-score is the canonical
+#' "is this variable in or out?" statistic, and \code{cutoff = 0.79} is
+#' varpro's default selection threshold.
+#'
+#' For a classification forest, varpro also returns a class-conditional
+#' z table: the same importance computed restricting attention to rules
+#' relevant to each class. \code{conditional = TRUE} keeps that table so
+#' the plot method can show which variables matter for which class
+#' rather than only in aggregate.
+#'
+#' @section What's in the output:
+#' \code{$imp} is the one-row-per-variable summary: aggregate z from
+#' \code{varPro::importance()}, plus a \code{selected} flag for
+#' \code{z > cutoff}. \code{$stats} holds the box quantiles
+#' (5/15/50/85/95 percentiles, plus the raw mean) computed from the
+#' per-tree matrix; these are what the boxplot draws. \code{$imp.tree}
+#' is the per-tree matrix itself, kept only when \code{faithful = TRUE}
+#' so the plot method can scatter individual tree values over the box.
+#' \code{$conditional} is the tidy class x variable z table, present
+#' only when \code{conditional = TRUE} and the family is
+#' classification.
+#'
+#' @section What you use this for:
+#' \itemize{
+#' \item rank candidate variables by importance and pick a working set
+#' above varpro's z cutoff;
+#' \item see, via the boxplot's spread and the per-tree points
+#' (\code{faithful = TRUE}), how stable each variable's importance
+#' is across trees — a high median with a wide box is a different
+#' story from a high median with a tight box;
+#' \item for a classification forest, ask which variables drive which
+#' class (\code{conditional = TRUE}) rather than just which
+#' variables drive the model overall.
+#' }
+#' The z-score is a standardised ranking statistic, not a p-value or a
+#' probability. Two variables with the same z are "similarly important
+#' by this method", not "equally likely to be true signal". For a
+#' data-driven cutoff rather than the 0.79 default, see
+#' \code{varPro::cv.varpro}.
+#'
+#' @references
+#' Lu, M. and Ishwaran, H. (2024). Model-independent variable selection
+#' via the rule-based variable priority framework. \emph{arXiv preprint}
+#' \href{https://arxiv.org/abs/2409.09003}{arXiv:2409.09003}.
+#'
 #' @param object A fitted \code{varpro} object (required).
 #' @param local.std Logical; default \code{TRUE}. When \code{TRUE} the
 #' per-tree importances are put on the z-scale before the box statistics
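
The `$stats` and `$imp` summaries described in this file can be mimicked on simulated data. The sketch below is a rough base-R illustration only: the per-tree matrix is random noise rather than varPro output, and the column mean is used as a stand-in for the aggregate z that the package takes from `varPro::importance()`.

```r
set.seed(1)

# Toy per-tree importance matrix: 200 trees (rows) by 3 variables (columns).
imp_tree <- cbind(x1 = rnorm(200, mean = 2.0),
                  x2 = rnorm(200, mean = 0.5),
                  x3 = rnorm(200, mean = 0.0))

# Box statistics: 5/15/50/85/95 percentiles plus the raw mean, per variable.
probs <- c(0.05, 0.15, 0.50, 0.85, 0.95)
stats <- t(apply(imp_tree, 2, quantile, probs = probs))
stats <- cbind(stats, mean = colMeans(imp_tree))

# Selection flag against the documented default cutoff.
# Assumption: the column mean stands in for the aggregate z here.
cutoff <- 0.79
imp <- data.frame(variable = colnames(imp_tree),
                  z = colMeans(imp_tree),
                  selected = colMeans(imp_tree) > cutoff)

stats
imp
```

Comparing the 15th to 85th percentile spread across rows of `stats` is the numeric counterpart of reading box widths in the plot: two variables with similar medians can still differ a great deal in stability across trees.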

R/help.R

Lines changed: 2 additions & 2 deletions
@@ -58,8 +58,8 @@
 #'
 #' The \code{ggRandomForests} package contains the following data functions:
 #' \itemize{
-#' \item \code{\link{gg_rfsrc}}: randomForest[SRC] predictions.
-#' \item \code{\link{gg_error}}: randomForest[SRC] convergence rate based on
+#' \item \code{\link{gg_rfsrc}}: randomForestSRC predictions.
+#' \item \code{\link{gg_error}}: randomForestSRC convergence rate based on
 #' the OOB error rate.
 #' \item \code{\link{gg_roc}}: ROC curves for randomForest classification
 #' models.

R/plot.gg_partial_varpro.R

Lines changed: 27 additions & 0 deletions
@@ -8,6 +8,33 @@
 #' \code{scale \%in\% c("surv","chf")} was passed to the extractor) are
 #' handed off to \code{\link{plot.gg_partial_rfsrc}} for drawing.
 #'
+#' @section Reading the partial dependence:
+#' For a continuous variable the x-axis is the variable's grid of values
+#' and the y-axis is the partial prediction; each of the three effect
+#' types (\code{parametric}, \code{nonparametric}, \code{causal}) is
+#' drawn as its own line. The shape of the line is the story: a clear
+#' slope says the model uses the variable, a flat line says it
+#' essentially does not, and a U-shape or a threshold says the effect
+#' is nonlinear in a way a single coefficient would miss. For a
+#' categorical variable the picture is a boxplot per level; here the
+#' eye is looking at level-to-level shifts in the centre of each box.
+#'
+#' Where the three effect types track each other, the parametric story
+#' is a fair summary of what the forest is doing. Where they fan
+#' apart — typically the parametric curve smoother than the
+#' nonparametric, or the causal curve flatter than either — the
+#' variable is one to inspect more carefully before reading a single
+#' effect off the plot.
+#'
+#' @section What this tells you:
+#' Use these curves to describe how the model uses each variable, not
+#' to claim how the world works. They are a window into the fitted
+#' relationship; they do not by themselves establish that intervening
+#' on the variable would move the outcome. For survival path-C
+#' (\code{scale = "surv"} or \code{"chf"}), the y-axis is on the
+#' probability or cumulative-hazard scale, which is usually the scale
+#' you want to report to a clinical audience.
+#'
 #' @param x A \code{\link{gg_partial_varpro}} object.
 #' @param type Character vector; one or more of \code{"parametric"},
 #' \code{"nonparametric"}, \code{"causal"}. Defaults to all three.

R/plot.gg_udependent.R

Lines changed: 28 additions & 0 deletions
@@ -6,6 +6,34 @@
 #' set, and the width and opacity of an edge tell you how strong the
 #' dependency between its two variables is.
 #'
+#' @section Reading the network:
+#' Each node is a variable; each edge is a cross-variable dependency
+#' that cleared the threshold passed to \code{gg_udependent}. The
+#' Fruchterman-Reingold layout (the default) places mutually connected
+#' variables near each other, so the picture tends to show hubs and
+#' the clusters around them rather than a tidy ring. The eye usually
+#' goes first to the largest blue node — a variable that is both in
+#' the signal set and connects to many others is a hub of the
+#' dependency structure. Edges with wider, more opaque strokes are
+#' stronger dependencies; thin, faint edges sit near the threshold and
+#' are the ones that would disappear first if you raised it.
+#'
+#' Grey, low-degree nodes are the ones UVarPro thinks are not
+#' contributing much to the structure. (Truly isolated nodes are
+#' dropped by `gg_udependent()` before the graph is drawn — what you
+#' see is the connected component.) A cluster of mutually
+#' connected variables is worth checking for redundancy — they may be
+#' several views of the same underlying quantity.
+#'
+#' @section What this tells you:
+#' Use the figure to pick a working set of variables: the hubs and
+#' their immediate neighbours are the candidates UVarPro flags as
+#' carrying structure. If a cluster of high-degree variables looks
+#' like it might be measuring the same thing, that is a cue to look at
+#' their pairwise correlations or fit them as a block rather than
+#' individually. The threshold and layout are recorded in the caption
+#' so a different choice is easy to spot in a later figure.
+#'
 #' @param x A \code{gg_udependent} object from \code{\link{gg_udependent}}.
 #' @param layout Character; the igraph/ggraph layout algorithm. Common
 #' choices are \code{"fr"} (Fruchterman-Reingold, the default),
