| 1 | +####********************************************************************** |
| 2 | +#### gg_isopro: tidy extractor for varPro::isopro anomaly scores. |
| 3 | +#### |
| 4 | +#### varPro::isopro returns a list with $howbad (per-observation anomaly |
| 5 | +#### score in [0,1]) and $case.depth (average isolation depth, lower = |
| 6 | +#### more anomalous). gg_isopro() reshapes these into a tidy data.frame |
| 7 | +#### the plot/print/summary methods can consume. |
| 8 | +####********************************************************************** |
| 9 | + |
| 10 | +#' Tidy data from a varPro isolation-forest fit |
| 11 | +#' |
| 12 | +#' Pulls per-observation anomaly scores out of a \code{\link[varPro]{isopro}} |
| 13 | +#' fit so you can plot them, sort them, or write them to disk without having |
| 14 | +#' to know the internal shape of the fit. |
| 15 | +#' |
| 16 | +#' @section What isopro is doing: |
| 17 | +#' An isolation forest (Liu, Ting and Zhou 2008) is a random forest grown |
| 18 | +#' on very small subsamples of the data and asked to split until each |
| 19 | +#' observation lands in its own terminal node. The intuition is geometric: |
| 20 | +#' a typical observation sits in the dense middle of the feature cloud and |
| 21 | +#' takes many splits to isolate, while an unusual observation sits out |
| 22 | +#' near an edge and gets cut off after only a few. So \strong{the depth at |
| 23 | +#' which an observation is isolated is a proxy for how typical it is} — |
| 24 | +#' shallow depth means anomalous, deep depth means ordinary. Average a |
| 25 | +#' single observation's depth across many trees and the noise washes out, |
| 26 | +#' leaving a stable per-observation rank. |
| 27 | +#' |
| 28 | +#' \code{\link[varPro]{isopro}} supports three flavours of isolation |
| 29 | +#' forest, which differ in how the splits are chosen: |
| 30 | +#' \describe{ |
| 31 | +#' \item{\code{"rnd"}}{The original Liu/Ting/Zhou method: each tree node |
| 32 | +#' picks a variable at random and a split point uniformly at random |
| 33 | +#' in the variable's range. Fast, no model, surprisingly effective.} |
| 34 | +#' \item{\code{"unsupv"}}{Unsupervised splitting from |
| 35 | +#' \code{randomForestSRC}: at each node, candidate variables are |
| 36 | +#' split against randomly chosen pseudo-response features. More structured than |
| 37 | +#' \code{"rnd"}; sometimes more accurate, especially when the |
| 38 | +#' anomalies follow a coherent direction.} |
| 39 | +#' \item{\code{"auto"}}{An auto-encoder formulation that grows a |
| 40 | +#' multivariate forest predicting each feature from the others. Most |
| 41 | +#' expressive, slowest, best suited to low-dimensional data.} |
| 42 | +#' } |
| 43 | +#' No method is universally best. The varPro authors recommend trying at |
| 44 | +#' least two and comparing the score distributions; the plot method here |
| 45 | +#' colours per-method curves automatically when you stack the results. |
| 46 | +#' |
| 47 | +#' @section What's in the output: |
| 48 | +#' The fit gives back two parallel per-observation vectors: |
| 49 | +#' \code{case.depth} is the raw mean isolation depth (units of "splits", |
| 50 | +#' lower = more anomalous) and \code{howbad} is the same information |
| 51 | +#' mapped onto a \code{[0, 1]} scale via the empirical CDF of |
| 52 | +#' \code{case.depth}, then flipped (higher = more anomalous). Both columns are kept so |
| 53 | +#' you can plot in either space and have the raw depth on hand for |
| 54 | +#' diagnostics; \code{howbad} is the canonical score and is what the plot |
| 55 | +#' method uses by default. |
| 56 | +#' |
| 57 | +#' @section What you use this for: |
| 58 | +#' This is screening, not inference. Reach for it when you want to: |
| 59 | +#' \itemize{ |
| 60 | +#' \item flag observations that may be data-entry errors, out-of-range |
| 61 | +#' measurements, or distinct subpopulations before fitting a primary |
| 62 | +#' model; |
| 63 | +#' \item check whether a held-out cohort sits inside the training |
| 64 | +#' distribution before scoring with a model trained elsewhere; |
| 65 | +#' \item give the analyst a ranked list of "look at these first" cases |
| 66 | +#' for a manual review. |
| 67 | +#' } |
| 68 | +#' The score is a \emph{rank}, not a probability of being an outlier: two |
| 69 | +#' observations with \code{howbad = 0.92} are both unusual, not "92\% |
| 70 | +#' likely to be anomalous". Pick a cutoff by looking for the elbow in the |
| 71 | +#' sorted scores; \code{\link{plot.gg_isopro}} can annotate either a score |
| 72 | +#' (\code{threshold}) or a top-percent (\code{top_n_pct}) for you. |
| 73 | +#' |
| 74 | +#' @param object An \code{isopro} fit returned by |
| 75 | +#' \code{\link[varPro]{isopro}}. |
| 76 | +#' @param ... Currently unused. |
| 77 | +#' |
| 78 | +#' @return A \code{data.frame} of class \code{c("gg_isopro", "data.frame")}, |
| 79 | +#' one row per observation. Columns: |
| 80 | +#' \describe{ |
| 81 | +#' \item{obs}{Integer; observation index \code{1..n}, in the same |
| 82 | +#' order as the rows of the data passed to |
| 83 | +#' \code{\link[varPro]{isopro}}.} |
| 84 | +#' \item{case.depth}{Numeric; mean isolation depth across the forest. |
| 85 | +#' Lower means the observation was isolated quickly — more |
| 86 | +#' anomalous.} |
| 87 | +#' \item{howbad}{Numeric in \code{[0, 1]}; the \code{case.depth} |
| 88 | +#' values pushed through their own empirical CDF and flipped so |
| 89 | +#' higher means more anomalous. This is the score the plot method |
| 90 | +#' draws by default.} |
| 91 | +#' } |
| 92 | +#' A \code{provenance} attribute records |
| 93 | +#' \code{source = "varPro::isopro"}, the observation count \code{n}, and |
| 94 | +#' the number of trees \code{ntree}. |
| 95 | +#' |
| 96 | +#' @section Comparing methods: |
| 97 | +#' To compare methods (\code{"rnd"}, \code{"unsupv"}, \code{"auto"}), call |
| 98 | +#' \code{\link{gg_isopro}} on each fit and \code{dplyr::bind_rows()} the |
| 99 | +#' results with a \code{method} label column. The plot method auto-detects |
| 100 | +#' \code{method} and colours the curves. |
| 101 | +#' |
| 102 | +#' @references |
| 103 | +#' Liu, F. T., Ting, K. M., and Zhou, Z. H. (2008). Isolation Forest. |
| 104 | +#' \emph{Eighth IEEE International Conference on Data Mining}, 413-422. |
| 105 | +#' |
| 106 | +#' Ishwaran, H., Mantero, A., and Lu, M. (2025). varPro: Model-Independent |
| 107 | +#' Variable Selection via the Rule-Based Variable Priority Framework. |
| 108 | +#' \emph{R package version 3.x}. |
| 109 | +#' |
| 110 | +#' @seealso \code{\link{plot.gg_isopro}}, \code{\link[varPro]{isopro}} |
| 111 | +#' |
| 112 | +#' @examples |
| 113 | +#' \donttest{ |
| 114 | +#' if (requireNamespace("varPro", quietly = TRUE)) { |
| 115 | +#' set.seed(1) |
| 116 | +#' fit <- varPro::isopro(data = iris[, 1:4], method = "rnd", |
| 117 | +#' sampsize = 32, ntree = 50) |
| 118 | +#' gg <- gg_isopro(fit) |
| 119 | +#' plot(gg) |
| 120 | +#' } |
| 121 | +#' } |
| 122 | +#' |
| 123 | +#' @export |
| 124 | +gg_isopro <- function(object, ...) { |
| 125 | + UseMethod("gg_isopro", object) |
| 126 | +} |
| 127 | + |
| 128 | +#' @export |
| 129 | +gg_isopro.isopro <- function(object, ...) { |
| 130 | + if (!inherits(object, "isopro")) { |
| 131 | + stop("gg_isopro expects an 'isopro' object from varPro::isopro().", |
| 132 | + call. = FALSE) |
| 133 | + } |
| 134 | + |
| 135 | + howbad <- as.numeric(object$howbad) |
| 136 | + depth <- as.numeric(object$case.depth) |
| 137 | + n <- length(howbad) |
| 138 | + |
| 139 | + gg_dta <- data.frame( |
| 140 | + obs = seq_len(n), |
| 141 | + case.depth = depth, |
| 142 | + howbad = howbad |
| 143 | + ) |
| 144 | + |
| 145 | + class(gg_dta) <- c("gg_isopro", class(gg_dta)) |
| 146 | + |
| 147 | + # isopro-specific provenance (the shared .gg_provenance helper only knows |
| 148 | + # about rfsrc / randomForest objects, so build the list inline). |
| 149 | + ntree <- tryCatch( |
| 150 | + as.integer(object$isoforest$ntree), |
| 151 | + error = function(e) NA_integer_ |
| 152 | + ) |
| 153 | + attr(gg_dta, "provenance") <- list( |
| 154 | + source = "varPro::isopro", |
| 155 | + n = n, |
| 156 | + ntree = if (length(ntree) == 1 && !is.na(ntree)) ntree else NA_integer_ |
| 157 | + ) |
| 158 | + |
| 159 | + invisible(gg_dta) |
| 160 | +} |
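
The roxygen text describes `howbad` as `case.depth` pushed through its own empirical CDF and flipped. A minimal sketch of that relationship on made-up depths follows. It assumes the transform is `1 - ecdf(depth)(depth)`, which matches the description in the docs but has not been checked against varPro's source. The depth values and the 0.6 cutoff are hypothetical, chosen only for illustration.

```r
# Toy mean isolation depths for six observations (made-up numbers).
depth <- c(7.9, 8.3, 2.1, 8.0, 7.6, 3.4)

# Push the depths through their own empirical CDF, then flip the result so
# that shallow (quickly isolated) observations get high scores.
# Assumption: this mirrors the documented description, not varPro internals.
howbad <- 1 - stats::ecdf(depth)(depth)

# Same column layout as the data.frame built by gg_isopro.isopro().
scores <- data.frame(
  obs        = seq_along(depth),
  case.depth = depth,
  howbad     = howbad
)

# Rank observations from most to least anomalous.
ranked <- scores[order(-scores$howbad), ]

# Flag everything at or above a hypothetical score cutoff.
flagged <- scores$obs[scores$howbad >= 0.6]

print(ranked)
print(flagged)
```

Under this assumed transform the shallowest observation scores `1 - 1/n` rather than exactly 1, and the deepest scores 0, which is one more reason to read `howbad` as a rank rather than a probability.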