Commit 6f4d649

ehrlinger and claude authored
feat: gg_isopro — varPro Phase 4 anomaly-score wrapper (#94)

* docs: design spec for varPro Phase 4 — gg_isopro

  First of three Phase 4 sub-projects (isopro -> beta.varpro -> ivarpro).
  Tidy-data wrapper + plot method for varPro::isopro anomaly scores.
  Single fit per call (Phase 1-3 pattern), patchwork of elbow + density
  with panel=c('both','elbow','density'); threshold/top_n_pct annotation
  with threshold-wins precedence; ground-truth evaluation deferred.

* docs: implementation plan for varPro Phase 4 gg_isopro

* chore: open v2.7.3.9008 dev cycle (varPro Phase 4 gg_isopro)

* feat(gg_isopro): tidy extractor for varPro::isopro anomaly scores

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(plot.gg_isopro): elbow + density patchwork with panel arg

* test(plot.gg_isopro): threshold and top_n_pct precedence

* test(plot.gg_isopro): method column triggers colour grouping

* feat(gg_isopro): print / summary / autoplot S3 companions

* test: vdiffr snapshots for gg_isopro (Phase 4)

* docs: NEWS entry for varPro Phase 4 gg_isopro

* docs(gg_isopro): expand roxygen to teach what isopro does and when to use it

  A core goal of the v2.8.0 varPro integration is to make the package
  self-teaching: a reader who has not used varPro before should learn what
  each function is doing and what they would use it for, just from the help
  page. The first pass on gg_isopro was correctly written in voice but too
  thin; it described the mechanics of the wrapper without explaining the
  method.

  Add to gg_isopro:
  - "What isopro is doing" — isolation forests, geometric intuition for why
    shallow depth means anomalous, what the three methods (rnd / unsupv /
    auto) actually do and how they differ.
  - "What's in the output" — case.depth vs howbad: raw depth and its
    [0,1]-rescaled cousin, why both are kept.
  - "What you use this for" — screening for data-entry errors,
    cohort-distribution checks, ranked review lists. The score is a rank,
    not a probability.
  - Liu/Ting/Zhou 2008 reference.

  Add to plot.gg_isopro:
  - "Reading the elbow" — the bend is the cutoff; the plot is for seeing
    where it is, not reading single scores.
  - "Reading the density" — single mode + thin right tail is the picture;
    bimodal means two populations.
  - "Comparing methods" — agreement vs divergence across rnd/unsupv/auto is
    the actual signal.

  Voice-only expansion; no API or behavioural change.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: register gg_isopro / plot.gg_isopro in pkgdown reference index

  pkgdown's build_reference_index() requires every exported topic to be
  listed in _pkgdown.yml. Added an 'Anomaly Detection' section after
  Variable Importance so gg_isopro and plot.gg_isopro are indexed; this
  unblocks the pkgdown CI job on PR #94.

* fix(gg_isopro): address PR #94 Copilot review

  - autoplot.gg_isopro uses plot() generic for S3 dispatch
  - roxygen converted from markdown to Rd-style (\code{} / \link{})
  - .resolve_isopro_threshold validates threshold in [0,1] and top_n_pct
    in (0,100)
  - panel='both' uses patchwork::wrap_plots() for consistency

* refactor(plot.gg_isopro): extract .check_threshold_arg to satisfy cyclocomp lint

  The validation Copilot asked for in the previous round pushed
  .resolve_isopro_threshold to cyclomatic complexity 38, well over the
  project's 20-line budget. Factor the per-argument checks into a small
  helper (.check_threshold_arg) parameterised by name/lo/hi/closure;
  .resolve_isopro_threshold now reads as the three-branch decision it
  actually is. Same external behaviour; 43 tests still pass.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent af0405c commit 6f4d649

18 files changed

Lines changed: 1876 additions & 2 deletions
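
The commit message describes a three-branch threshold decision (`threshold` wins, else `top_n_pct`, else no line) backed by a small range-checking helper. The file containing `.resolve_isopro_threshold` and `.check_threshold_arg` is not shown on this page, so the following is only a minimal sketch of that idea. The names `check_range_arg` and `resolve_cutoff`, the error wording, and the mapping of `top_n_pct` to a quantile are all assumptions made for illustration, not the package's code.

```r
# Hypothetical helper: validate one numeric argument against a range.
# closed = TRUE allows the endpoints ([lo, hi]); closed = FALSE excludes
# them ((lo, hi)).
check_range_arg <- function(value, name, lo, hi, closed = TRUE) {
  if (!is.numeric(value) || length(value) != 1L || is.na(value)) {
    stop(sprintf("`%s` must be a single non-missing number.", name),
         call. = FALSE)
  }
  ok <- if (closed) {
    value >= lo && value <= hi
  } else {
    value > lo && value < hi
  }
  if (!ok) {
    stop(sprintf("`%s` must lie in %s%g, %g%s.", name,
                 if (closed) "[" else "(", lo, hi,
                 if (closed) "]" else ")"),
         call. = FALSE)
  }
  invisible(value)
}

# Hypothetical resolver: returns a score-space cutoff, or NULL for no line.
resolve_cutoff <- function(howbad, threshold = NULL, top_n_pct = NULL) {
  if (!is.null(threshold)) {
    check_range_arg(threshold, "threshold", 0, 1, closed = TRUE)
    if (!is.null(top_n_pct)) {
      message("Both `threshold` and `top_n_pct` supplied; using `threshold`.")
    }
    return(threshold)
  }
  if (!is.null(top_n_pct)) {
    check_range_arg(top_n_pct, "top_n_pct", 0, 100, closed = FALSE)
    # Assumption: "top p percent" maps to the (1 - p/100) score quantile.
    return(unname(stats::quantile(howbad, probs = 1 - top_n_pct / 100)))
  }
  NULL
}

scores <- seq(0, 1, length.out = 101)
resolve_cutoff(scores, top_n_pct = 10)
resolve_cutoff(scores, threshold = 0.8, top_n_pct = 10)
```

Keeping the range check in one parameterised helper leaves the resolver with three plain branches, which is the shape the refactor commit says it was aiming for.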

DESCRIPTION

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 Package: ggRandomForests
 Type: Package
 Title: Visually Exploring Random Forests
-Version: 2.7.3.9007
+Version: 2.7.3.9008
 Date: 2026-05-21
 Authors@R: person("John", "Ehrlinger",
     role = c("aut", "cre"),

NAMESPACE

Lines changed: 6 additions & 0 deletions
@@ -2,6 +2,7 @@
 
 S3method(autoplot,gg_brier)
 S3method(autoplot,gg_error)
+S3method(autoplot,gg_isopro)
 S3method(autoplot,gg_partial)
 S3method(autoplot,gg_partial_rfsrc)
 S3method(autoplot,gg_partial_varpro)
@@ -19,6 +20,7 @@ S3method(gg_brier,rfsrc)
 S3method(gg_error,randomForest)
 S3method(gg_error,randomForest.formula)
 S3method(gg_error,rfsrc)
+S3method(gg_isopro,isopro)
 S3method(gg_rfsrc,randomForest)
 S3method(gg_rfsrc,rfsrc)
 S3method(gg_roc,default)
@@ -32,6 +34,7 @@ S3method(gg_vimp,randomForest)
 S3method(gg_vimp,rfsrc)
 S3method(plot,gg_brier)
 S3method(plot,gg_error)
+S3method(plot,gg_isopro)
 S3method(plot,gg_partial)
 S3method(plot,gg_partial_rfsrc)
 S3method(plot,gg_partial_varpro)
@@ -45,6 +48,7 @@ S3method(plot,gg_varpro)
 S3method(plot,gg_vimp)
 S3method(print,gg_brier)
 S3method(print,gg_error)
+S3method(print,gg_isopro)
 S3method(print,gg_partial)
 S3method(print,gg_partial_rfsrc)
 S3method(print,gg_partial_varpro)
@@ -60,6 +64,7 @@ S3method(print,summary.gg)
 S3method(print,summary.gg_udependent)
 S3method(summary,gg_brier)
 S3method(summary,gg_error)
+S3method(summary,gg_isopro)
 S3method(summary,gg_partial)
 S3method(summary,gg_partial_rfsrc)
 S3method(summary,gg_partial_varpro)
@@ -75,6 +80,7 @@ export(calc_auc)
 export(calc_roc)
 export(gg_brier)
 export(gg_error)
+export(gg_isopro)
 export(gg_partial)
 export(gg_partial_rfsrc)
 export(gg_partial_varpro)

NEWS.md

Lines changed: 10 additions & 1 deletion
@@ -1,8 +1,17 @@
 Package: ggRandomForests
-Version: 2.7.3.9007
+Version: 2.7.3.9008
 
 ggRandomForests v2.8.0 (development) — continued
 =================================================
+* New `gg_isopro()` and `plot.gg_isopro()`: tidy wrapper and ranked-elbow +
+  density visualisation for `varPro::isopro` isolation-forest anomaly
+  scores. `plot.gg_isopro()` takes `panel = c("both", "elbow", "density")`
+  and optional `threshold` (score-space) or `top_n_pct` (quantile-space)
+  to draw a reference line; if both are set, `threshold` wins with a
+  message. A `method` column auto-triggers colour grouping for multi-method
+  comparisons (use `dplyr::bind_rows()` on three `gg_isopro()` calls).
+  `print` / `summary` / `autoplot` S3 companions follow the existing `gg_*`
+  conventions. First of three Phase 4 sub-projects.
 * `plot.gg_variable()`: fix render error on the default multi-class
   classification plot. The default-xvar selection was treating `yvar` (the
   observed-class column) and `outcome` (the multi-class pivot facet) as
R/autoplot_methods.R

Lines changed: 6 additions & 0 deletions
@@ -135,3 +135,9 @@ autoplot.gg_varpro <- function(object, ...) {
 autoplot.gg_udependent <- function(object, ...) {
   plot(object, ...)
 }
+
+#' @rdname autoplot.gg
+#' @export
+autoplot.gg_isopro <- function(object, ...) {
+  plot(object, ...)
+}
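
The new `autoplot.gg_isopro()` calls the `plot()` generic rather than naming `plot.gg_isopro()` directly, which the commit message attributes to a review request about S3 dispatch. A minimal sketch of why that works follows, using a toy class so it runs without the package. The names `toy_gg`, `plot.toy_gg`, and `autoplot_toy` are hypothetical.

```r
# Toy object standing in for a gg_isopro data frame.
toy <- structure(data.frame(howbad = c(0.1, 0.9)),
                 class = c("toy_gg", "data.frame"))

# A plot method for the toy class. It returns a label instead of drawing,
# so the dispatch result is easy to inspect.
plot.toy_gg <- function(x, ...) {
  "plot.toy_gg"
}

# Thin wrapper in the style of autoplot.gg_isopro(): call the generic and
# let S3 dispatch pick the method from class(object).
autoplot_toy <- function(object, ...) {
  plot(object, ...)
}

dispatched <- autoplot_toy(toy)
dispatched
```

Going through the generic means a subclass with its own `plot` method would be picked up without changing the wrapper.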

R/gg_isopro.R

Lines changed: 160 additions & 0 deletions
@@ -0,0 +1,160 @@
+####**********************************************************************
+#### gg_isopro: tidy extractor for varPro::isopro anomaly scores.
+####
+#### varPro::isopro returns a list with $howbad (per-observation anomaly
+#### score in [0,1]) and $case.depth (average isolation depth, lower =
+#### more anomalous). gg_isopro() reshapes these into a tidy data.frame
+#### the plot/print/summary methods can consume.
+####**********************************************************************
+
+#' Tidy data from a varPro isolation-forest fit
+#'
+#' Pulls per-observation anomaly scores out of a \code{\link[varPro]{isopro}}
+#' fit so you can plot them, sort them, or write them to disk without having
+#' to know the internal shape of the fit.
+#'
+#' @section What isopro is doing:
+#' An isolation forest (Liu, Ting and Zhou 2008) is a random forest grown
+#' on very small subsamples of the data and asked to split until each
+#' observation lands in its own terminal node. The intuition is geometric:
+#' a typical observation sits in the dense middle of the feature cloud and
+#' takes many splits to isolate, while an unusual observation sits out
+#' near an edge and gets cut off after only a few. So \strong{the depth at
+#' which an observation is isolated is a proxy for how typical it is} —
+#' shallow depth means anomalous, deep depth means ordinary. Average a
+#' single observation's depth across many trees and the noise washes out,
+#' leaving a stable per-observation rank.
+#'
+#' \code{\link[varPro]{isopro}} supports three flavours of isolation
+#' forest, which differ in how the splits are chosen:
+#' \describe{
+#'   \item{\code{"rnd"}}{The original Liu/Ting/Zhou method: each tree node
+#'     picks a variable at random and a split point uniformly at random
+#'     in the variable's range. Fast, no model, surprisingly effective.}
+#'   \item{\code{"unsupv"}}{Unsupervised splitting from
+#'     \code{randomForestSRC}: splits are chosen to separate the data
+#'     along the directions of highest variance. More structured than
+#'     \code{"rnd"}; sometimes more accurate, especially when the
+#'     anomalies follow a coherent direction.}
+#'   \item{\code{"auto"}}{An auto-encoder formulation that grows a
+#'     multivariate forest predicting each feature from the others. Most
+#'     expressive, slowest, best suited to low-dimensional data.}
+#' }
+#' No method is universally best. The varPro authors recommend trying at
+#' least two and comparing the score distributions; the plot method here
+#' colours per-method curves automatically when you stack the results.
+#'
+#' @section What's in the output:
+#' The fit gives back two parallel per-observation vectors:
+#' \code{case.depth} is the raw mean isolation depth (units of "splits",
+#' lower = more anomalous) and \code{howbad} is the same information
+#' transformed onto a \code{[0, 1]} scale via the empirical CDF of
+#' \code{case.depth}, flipped so that higher = more anomalous. Both columns
+#' are kept so you can plot in either space and have the raw depth on hand
+#' for diagnostics; \code{howbad} is the canonical score and is what the
+#' plot method uses by default.
+#'
+#' @section What you use this for:
+#' This is screening, not inference. Reach for it when you want to:
+#' \itemize{
+#'   \item flag observations that may be data-entry errors, out-of-range
+#'     measurements, or distinct subpopulations before fitting a primary
+#'     model;
+#'   \item check whether a held-out cohort sits inside the training
+#'     distribution before scoring with a model trained elsewhere;
+#'   \item give the analyst a ranked list of "look at these first" cases
+#'     for a manual review.
+#' }
+#' The score is a \emph{rank}, not a probability of being an outlier — two
+#' observations with \code{howbad = 0.92} are both unusual, not "92\%
+#' likely to be anomalous". Pick a cutoff by looking at where the elbow
+#' rises; \code{\link{plot.gg_isopro}} can annotate either a score
+#' (\code{threshold}) or a top-percent (\code{top_n_pct}) for you.
+#'
+#' @param object An \code{isopro} fit returned by
+#'   \code{\link[varPro]{isopro}}.
+#' @param ... Currently unused.
+#'
+#' @return A \code{data.frame} of class \code{c("gg_isopro", "data.frame")},
+#'   one row per observation. Columns:
+#'   \describe{
+#'     \item{obs}{Integer; observation index \code{1..n}, in the same
+#'       order as the rows of the data passed to
+#'       \code{\link[varPro]{isopro}}.}
+#'     \item{case.depth}{Numeric; mean isolation depth across the forest.
+#'       Lower means the observation was isolated quickly — more
+#'       anomalous.}
+#'     \item{howbad}{Numeric in \code{[0, 1]}; the \code{case.depth}
+#'       values pushed through their own empirical CDF and flipped so
+#'       higher means more anomalous. This is the score the plot method
+#'       draws by default.}
+#'   }
+#'   A \code{provenance} attribute records
+#'   \code{source = "varPro::isopro"}, the observation count \code{n}, and
+#'   the number of trees \code{ntree}.
+#'
+#' @section Comparing methods:
+#' To compare methods (\code{"rnd"}, \code{"unsupv"}, \code{"auto"}), call
+#' \code{\link{gg_isopro}} on each fit and \code{dplyr::bind_rows()} the
+#' results with a \code{method} label column. The plot method auto-detects
+#' \code{method} and colours the curves.
+#'
+#' @references
+#' Liu, F. T., Ting, K. M., and Zhou, Z. H. (2008). Isolation Forest.
+#' \emph{Eighth IEEE International Conference on Data Mining}, 413-422.
+#'
+#' Ishwaran, H., Mantero, A., and Lu, M. (2025). varPro: Model-Independent
+#' Variable Selection via the Rule-Based Variable Priority Framework.
+#' \emph{R package version 3.x}.
+#'
+#' @seealso \code{\link{plot.gg_isopro}}, \code{\link[varPro]{isopro}}
+#'
+#' @examples
+#' \donttest{
+#' if (requireNamespace("varPro", quietly = TRUE)) {
+#'   set.seed(1)
+#'   fit <- varPro::isopro(data = iris[, 1:4], method = "rnd",
+#'                         sampsize = 32, ntree = 50)
+#'   gg <- gg_isopro(fit)
+#'   plot(gg)
+#' }
+#' }
+#'
+#' @export
+gg_isopro <- function(object, ...) {
+  UseMethod("gg_isopro", object)
+}
+
+#' @export
+gg_isopro.isopro <- function(object, ...) {
+  if (!inherits(object, "isopro")) {
+    stop("gg_isopro expects an 'isopro' object from varPro::isopro().",
+         call. = FALSE)
+  }
+
+  howbad <- as.numeric(object$howbad)
+  depth <- as.numeric(object$case.depth)
+  n <- length(howbad)
+
+  gg_dta <- data.frame(
+    obs = seq_len(n),
+    case.depth = depth,
+    howbad = howbad
+  )
+
+  class(gg_dta) <- c("gg_isopro", class(gg_dta))
+
+  # isopro-specific provenance (the shared .gg_provenance helper only knows
+  # about rfsrc / randomForest objects, so build the list inline).
+  ntree <- tryCatch(
+    as.integer(object$isoforest$ntree),
+    error = function(e) NA_integer_
+  )
+  attr(gg_dta, "provenance") <- list(
+    source = "varPro::isopro",
+    n = n,
+    ntree = if (length(ntree) == 1 && !is.na(ntree)) ntree else NA_integer_
+  )
+
+  invisible(gg_dta)
+}
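
The roxygen text in this file explains the geometric intuition (unusual points are isolated after few random splits) and describes `howbad` as the depth pushed through its own empirical CDF and flipped. The base-R sketch below illustrates both ideas on toy data. It is not how `varPro::isopro` is implemented: there is no subsampling, the function `isolation_depth()` and all variable names are hypothetical, and the flipped-ECDF line is an assumption that follows the documented description rather than the library's exact formula.

```r
set.seed(1)

# Toy data: 50 ordinary points near the origin plus one planted outlier
# in row 51. Illustrative only.
x <- rbind(matrix(rnorm(100), ncol = 2), c(6, 6))

# Depth at which row `i` of `x` ends up alone under purely random splits
# (random variable, uniform split point in its range), as in the "rnd"
# description above.
isolation_depth <- function(x, i, depth = 0L, max_depth = 50L) {
  if (nrow(x) <= 1L || depth >= max_depth) return(depth)
  j   <- sample.int(ncol(x), 1L)              # pick a variable at random
  rng <- range(x[, j])
  if (rng[1] == rng[2]) return(depth)         # constant column: stop early
  cut  <- runif(1L, rng[1], rng[2])           # random split point
  keep <- (x[, j] < cut) == (x[i, j] < cut)   # rows on the target's side
  # Recurse into the target's side; sum(keep[1:i]) is its new row index.
  isolation_depth(x[keep, , drop = FALSE], sum(keep[seq_len(i)]),
                  depth + 1L, max_depth)
}

# Average the depth over many random trees to wash out the noise.
ntree <- 200L
mean_depth <- vapply(seq_len(nrow(x)), function(i) {
  mean(replicate(ntree, isolation_depth(x, i)))
}, numeric(1))

# Assumed score: empirical CDF of depth, flipped so higher = more anomalous.
score <- 1 - stats::ecdf(mean_depth)(mean_depth)

# Ranked review list: shallowest (most anomalous) observations first.
review <- data.frame(obs = seq_len(nrow(x)), case.depth = mean_depth,
                     howbad_sketch = score)
head(review[order(-review$howbad_sketch), ])
```

The planted outlier in row 51 should sort near the top of `review`, since a large share of random cuts separate it from the cloud in one or two splits.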
