Commit edbc9b6

ehrlinger and claude authored
feat: gg_isopro newdata arg — varPro Phase 4b predict.isopro wrapper (+ training-path polarity fix) (#96)
* docs: design spec for varPro Phase 4 — predict.isopro wrapper

Second sub-project of Phase 4 (gg_beta_varpro and gg_ivarpro come after). Adds a newdata argument to gg_isopro() so a fitted isopro model can score new observations into the same tidy gg_isopro frame. The polarity flip between varPro's predict.isopro (smaller = anomalous) and the package's howbad (higher = anomalous) is hidden inside the wrapper; the column is semantically the same whether you score training or test data. Train/test overlay reuses the existing method-column auto-detect in plot.gg_isopro, explicitly documented.

* docs: sharpen polarity language in predict.isopro spec

After review discussion: rename the 'Polarity reminder' section to 'Polarity: how the wrapper presents both conventions' and rewrite it so it explicitly names that case.depth keeps varPro's native polarity while howbad carries the flipped version. Documentation section gains a concrete-code-form requirement so the implementer writes the transformation as 'howbad = 1 - predict(fit, newdata, quantiles=TRUE)' in the roxygen. Same design (Option A), clearer framing.

* docs: implementation plan for varPro Phase 4b predict.isopro wrapper

* chore: open v2.7.3.9010 dev cycle (varPro Phase 4b predict.isopro)

* feat(gg_isopro): newdata argument for predict.isopro scoring

* test(gg_isopro): newdata validation and polarity-flip sanity checks

Adds three sanity tests for the predict.isopro path: newdata type validation, training-as-newdata top-5 ordering agreement, and the howbad = 1 - quantile relationship.

The top-5 ordering test caught a real polarity bug in the training path: gg_isopro.isopro was returning howbad = object$howbad directly, but varPro's $howbad uses "lower = more anomalous" polarity (it is the quantile of case.depth, low depth = anomalous). The wrapper convention is "higher = more anomalous". Flip the training path the same way the prediction path does (1 - quantile) so train and test scores live on the same polarity.

Also drop backticks from the newdata validation error so the regex match in the new tests is unambiguous.

* test(gg_isopro): train + test overlay via the method-column path

* docs(gg_isopro): document newdata arg and the polarity flip

* test: vdiffr snapshot for gg_isopro train+test overlay

* docs: NEWS entry for varPro Phase 4b predict.isopro wrapper + training-path polarity fix

* refactor(gg_isopro): move newdata after ... for back-compat

Address Copilot review on PR #96: placing newdata as the 2nd positional argument would change positional matching for any caller of the PR #94 signature gg_isopro(object, ...). Moving newdata after ... means it can only be matched by name, so existing positional calls are unaffected. All tests already pass newdata by name; no test changes needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 5c78a66 commit edbc9b6

8 files changed

Lines changed: 1097 additions & 24 deletions

DESCRIPTION

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
11
Package: ggRandomForests
22
Type: Package
33
Title: Visually Exploring Random Forests
4-
Version: 2.7.3.9009
4+
Version: 2.7.3.9010
55
Date: 2026-05-21
66
Authors@R: person("John", "Ehrlinger",
77
role = c("aut", "cre"),

NEWS.md

Lines changed: 27 additions & 1 deletion
@@ -1,8 +1,34 @@
11
Package: ggRandomForests
2-
Version: 2.7.3.9009
2+
Version: 2.7.3.9010
33

44
ggRandomForests v2.8.0 (development) — continued
55
=================================================
6+
* `gg_isopro()` gains a `newdata` argument so a fitted `varPro::isopro`
7+
model can score new observations into the same tidy `gg_isopro` frame.
8+
Internally the wrapper calls `predict.isopro()` twice: with
9+
`quantiles = FALSE` to populate the `case.depth` column (varPro's native
10+
polarity, lower = more anomalous) and with `quantiles = TRUE` to compute
11+
`howbad = 1 - quantile` (the wrapper convention, higher = more anomalous).
12+
Both polarities are visible in the returned data frame, and the
13+
relationship is named in the roxygen. The `plot` / `print` / `summary` /
14+
`autoplot` S3 companions work unchanged on the new tidy frame; to overlay
15+
training and test scores, bind the two extractor calls with a `method`
16+
label column and pass the result to `plot()`. Second of three Phase 4
17+
sub-projects.
18+
* **Fix (gg_isopro training-path polarity).** Bug in the original
19+
`gg_isopro` (PR #94): varPro's `$howbad` on an `isopro` fit uses
20+
"lower = more anomalous" polarity (it is the quantile of `case.depth`),
21+
but the wrapper's plot method and documentation both assume "higher =
22+
more anomalous". Train scores and the new test-data scores were
23+
anti-correlated until this PR's training-path flip
24+
(`howbad = 1 - object$howbad`) brought them into agreement. The fix
25+
surfaced because the test-data sanity check (training-as-newdata top-5
26+
overlap) failed at 0/5 instead of 5/5 before the flip. Note: the two
27+
vdiffr baselines recorded in PR #94 (`gg-isopro-default` and
28+
`gg-isopro-threshold`) were recorded under the inverted polarity; they
29+
are visually flipped relative to the new behaviour but CI skips
30+
snapshots (`VDIFFR_RUN_TESTS = false`) so no failure surfaces. Re-record
31+
with `VDIFFR_RUN_TESTS = true` when convenient.
632
* Documentation: pedagogical pass over the varPro wrappers
733
(`gg_partial_varpro`, `gg_varpro`, `gg_udependent` and their `plot.*`
834
methods). Each help page now has explicit "What X is doing", "What's

R/gg_isopro.R

Lines changed: 101 additions & 19 deletions
@@ -63,17 +63,63 @@
6363
#' \item check whether a held-out cohort sits inside the training
6464
#' distribution before scoring with a model trained elsewhere;
6565
#' \item give the analyst a ranked list of "look at these first" cases
66-
#' for a manual review.
66+
#' for a manual review;
67+
#' \item score a held-out cohort or a fresh batch of incoming data
68+
#' against a fitted model and compare the test scores to the training
69+
#' distribution.
6770
#' }
6871
#' The score is a \emph{rank}, not a probability of being an outlier — two
6972
#' observations with \code{howbad = 0.92} are both unusual, not "92\%
7073
#' likely to be anomalous". Pick a cutoff by looking at where the elbow
7174
#' rises; \code{\link{plot.gg_isopro}} can annotate either a score
7275
#' (\code{threshold}) or a top-percent (\code{top_n_pct}) for you.
7376
#'
77+
#' @section Scoring new data:
78+
#' Pass a \code{data.frame} as \code{newdata} and the extractor calls
79+
#' \code{\link[varPro]{predict.isopro}} twice: once with
80+
#' \code{quantiles = FALSE} to get the raw mean case depth per row, and once
81+
#' with \code{quantiles = TRUE} to get the per-row quantile of that depth
82+
#' against the training-data depth distribution.
83+
#'
84+
#' varPro's \code{predict.isopro} returns quantiles where \emph{smaller is
85+
#' more anomalous}, which is the opposite polarity of the wrapper's
86+
#' \code{howbad} (where \emph{higher} is more anomalous). The wrapper
87+
#' exposes both conventions so nothing is hidden:
88+
#' \itemize{
89+
#' \item \code{case.depth} carries varPro's native polarity — \emph{lower
90+
#' = more anomalous}. This is the unmodified output of
91+
#' \code{predict(object, newdata, quantiles = FALSE)}. Use it to
92+
#' cross-reference against raw varPro output.
93+
#' \item \code{howbad} is the flipped, wrapper-convention version. The
94+
#' relationship is \code{howbad = 1 - predict(object, newdata, quantiles = TRUE)}.
95+
#' }
96+
#'
97+
#' To overlay training and test scores in one plot, bind the two extractor
98+
#' calls with a \code{method} label column (the same column
99+
#' \code{\link{plot.gg_isopro}} uses to colour rnd / unsupv / auto
100+
#' comparisons):
101+
#'
102+
#' \preformatted{
103+
#' gg_train <- gg_isopro(fit)
104+
#' gg_test <- gg_isopro(fit, newdata = test_df)
105+
#' gg_both <- rbind(cbind(gg_train, method = "train"),
106+
#' cbind(gg_test, method = "test"))
107+
#' class(gg_both) <- c("gg_isopro", "data.frame")
108+
#' plot(gg_both)
109+
#' }
110+
#'
74111
#' @param object An \code{isopro} fit returned by
75112
#' \code{\link[varPro]{isopro}}.
76-
#' @param ... Currently unused.
113+
#' @param ... Currently unused. Present before \code{newdata} so that
114+
#' \code{newdata} is only matched by name, preserving backward
115+
#' compatibility with callers of the PR #94 signature
116+
#' \code{gg_isopro(object, ...)}.
117+
#' @param newdata Optional \code{data.frame} of new observations to score
118+
#' against the fit. Must be passed by name. When \code{NULL} (default)
119+
#' the extractor returns the in-sample tidy frame from the fit's stored
120+
#' \code{$case.depth} and \code{$howbad}. When supplied, each row is
121+
#' scored via \code{\link[varPro]{predict.isopro}} and the same tidy
122+
#' shape is returned for the test data.
77123
#'
78124
#' @return A \code{data.frame} of class \code{c("gg_isopro", "data.frame")},
79125
#' one row per observation. Columns:
@@ -121,40 +167,76 @@
121167
#' }
122168
#'
123169
#' @export
124-
gg_isopro <- function(object, ...) {
170+
gg_isopro <- function(object, ..., newdata = NULL) {
125171
UseMethod("gg_isopro", object)
126172
}
127173

128174
#' @export
129-
gg_isopro.isopro <- function(object, ...) {
175+
gg_isopro.isopro <- function(object, ..., newdata = NULL) {
130176
if (!inherits(object, "isopro")) {
131177
stop("gg_isopro expects a 'isopro' object from varPro::isopro().",
132178
call. = FALSE)
133179
}
134180

135-
howbad <- as.numeric(object$howbad)
136-
depth <- as.numeric(object$case.depth)
137-
n <- length(howbad)
181+
ntree <- tryCatch(
182+
as.integer(object$isoforest$ntree),
183+
error = function(e) NA_integer_
184+
)
185+
ntree <- if (length(ntree) == 1L && !is.na(ntree)) ntree else NA_integer_
186+
187+
## ---- Training path (newdata = NULL) ------------------------------------
188+
if (is.null(newdata)) {
189+
# varPro's $howbad uses "lower = more anomalous" polarity (it is the
190+
# quantile of case.depth, low depth = anomalous). The wrapper convention
191+
# is "higher = more anomalous", so flip the polarity here the same way
192+
# the prediction path does (howbad = 1 - quantile).
193+
howbad <- 1 - as.numeric(object$howbad)
194+
depth <- as.numeric(object$case.depth)
195+
n <- length(howbad)
196+
197+
gg_dta <- data.frame(
198+
obs = seq_len(n),
199+
case.depth = depth,
200+
howbad = howbad
201+
)
202+
class(gg_dta) <- c("gg_isopro", class(gg_dta))
203+
attr(gg_dta, "provenance") <- list(
204+
source = "varPro::isopro",
205+
n = n,
206+
ntree = ntree,
207+
prediction = FALSE
208+
)
209+
return(invisible(gg_dta))
210+
}
211+
212+
## ---- Prediction path (newdata supplied) -------------------------------
213+
if (!is.data.frame(newdata)) {
214+
stop("newdata must be a data.frame.", call. = FALSE)
215+
}
216+
217+
# Two calls to predict.isopro: raw depth and quantile-against-training.
218+
# The wrapper polarity is "higher = more anomalous", so we flip the quantile:
219+
# howbad = 1 - predict(object, newdata, quantiles = TRUE)
220+
# case.depth keeps varPro's native scale (lower = more anomalous), giving
221+
# the user a varPro-polarity number for cross-reference.
222+
depth <- as.numeric(stats::predict(object, newdata = newdata,
223+
quantiles = FALSE))
224+
q <- as.numeric(stats::predict(object, newdata = newdata,
225+
quantiles = TRUE))
226+
howbad <- 1 - q
227+
n <- nrow(newdata)
138228

139229
gg_dta <- data.frame(
140230
obs = seq_len(n),
141231
case.depth = depth,
142232
howbad = howbad
143233
)
144-
145234
class(gg_dta) <- c("gg_isopro", class(gg_dta))
146-
147-
# isopro-specific provenance (the shared .gg_provenance helper only knows
148-
# about rfsrc / randomForest objects, so build the list inline).
149-
ntree <- tryCatch(
150-
as.integer(object$isoforest$ntree),
151-
error = function(e) NA_integer_
152-
)
153235
attr(gg_dta, "provenance") <- list(
154-
source = "varPro::isopro",
155-
n = n,
156-
ntree = if (length(ntree) == 1 && !is.na(ntree)) ntree else NA_integer_
236+
source = "varPro::isopro",
237+
n = n,
238+
ntree = ntree,
239+
prediction = TRUE
157240
)
158-
159241
invisible(gg_dta)
160242
}
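The back-compat argument for placing `newdata` after `...` rests on R's argument-matching rules: formals that follow `...` are matched only by their full name, never by position or partial name. A minimal base-R sketch of that behaviour follows. The two functions are hypothetical stand-ins that only report whether `newdata` was matched; they are not the package code.

```r
# Hypothetical stand-ins for the two candidate signatures. Neither is the
# package function; each just reports whether `newdata` received a value.
sig_before_dots <- function(object, newdata = NULL, ...) !is.null(newdata)
sig_after_dots  <- function(object, ..., newdata = NULL) !is.null(newdata)

extra <- data.frame(x = 1:3)

# A second positional argument binds to `newdata` when it precedes `...`
positional_before <- sig_before_dots("fit", extra)

# ... but is absorbed by `...` when `newdata` follows it.
positional_after <- sig_after_dots("fit", extra)

# Formals after `...` also need the exact name; a partial name goes to `...`.
partial_after <- sig_after_dots("fit", new = extra)

# The full name always works.
named_after <- sig_after_dots("fit", newdata = extra)
```

Under the second signature, an existing call such as `gg_isopro(fit, something)` keeps sending `something` to `...`, which is the behaviour the refactor commit describes.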
Lines changed: 146 additions & 0 deletions
@@ -0,0 +1,146 @@
1+
# ggRandomForests v2.8.0 — varPro Phase 4: predict.isopro Wrapper Design
2+
3+
**Date:** 2026-05-26
4+
**Author:** John Ehrlinger (design via Claude brainstorming)
5+
**Status:** Approved — ready for implementation planning
6+
**Sequencing:** Second of the Phase 4 sub-projects. Builds on PR #94 (gg_isopro for the in-sample case). `gg_beta_varpro` and `gg_ivarpro` come after. Lands as one PR before the v2.8.0 release candidate.
7+
8+
---
9+
10+
## Goal
11+
12+
Let users score new observations against a fitted `varPro::isopro` model with the same tidy-data ergonomics as the in-sample `gg_isopro()` call: same return shape, same plot method, same threshold semantics.
13+
14+
## Scope
15+
16+
A single sub-project. Implemented as one new argument on the existing `gg_isopro()` extractor — no new exported function, no new plot method. Other Phase 4 functions (`gg_beta_varpro`, `gg_ivarpro`) are tracked separately.
17+
18+
---
19+
20+
## Architecture
21+
22+
```
23+
varPro::isopro fit ──┐
24+
├──► gg_isopro(object, newdata = NULL)
25+
data.frame (newdata) ─┘ │
26+
└── tidy data.frame
27+
class: c("gg_isopro", "data.frame")
28+
cols : obs, case.depth, howbad
29+
attr : provenance (prediction flag)
30+
31+
plot / print / summary / autoplot
32+
(unchanged from PR #94)
33+
```
34+
35+
The two input paths produce the same return shape. The plot/print/summary methods do not care which path produced the object.
36+
37+
---
38+
39+
## Extractor signature
40+
41+
```r
42+
gg_isopro(object, newdata = NULL, ...)
43+
```
44+
45+
- **`object`** — an `isopro` fit from `varPro::isopro`. Method dispatch via `gg_isopro.isopro`.
46+
- **`newdata`** — `NULL` (default) or a `data.frame`. When `NULL`, returns the training-data tidy frame (PR #94 behaviour). When a `data.frame`, scores each row against the fit and returns the same tidy shape for the test data.
47+
- **`...`** — currently unused.
48+
49+
## Internal flow when `newdata` is supplied
50+
51+
1. Validate: `newdata` must be a `data.frame`. Otherwise `stop()` with `"newdata must be a data.frame."`.
52+
2. Call `predict(object, newdata = newdata, quantiles = FALSE)` → raw mean case-depth per row.
53+
3. Call `predict(object, newdata = newdata, quantiles = TRUE)` → quantile per row (smaller = more anomalous, per varPro's convention).
54+
4. **Flip polarity** for column consistency:
55+
```r
56+
howbad <- 1 - quantile
57+
```
58+
With the flip, `howbad` always means "higher = more anomalous", whether the row came from the training set or `newdata`. The plot method and any `threshold = ...` value the user picks from the training elbow apply unchanged.
59+
5. Assemble the tidy frame:
60+
```r
61+
data.frame(obs = seq_len(nrow(newdata)),
62+
case.depth = case_depth_vec,
63+
howbad = howbad_vec)
64+
```
65+
6. Set class `c("gg_isopro", "data.frame")` and attach a provenance attribute:
66+
- `source = "varPro::isopro"`
67+
- `n = nrow(newdata)`
68+
- `ntree` (carried from `object$isoforest$ntree`)
69+
- `prediction = TRUE` (new — distinguishes test-data extractor from training)
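The six steps above can be sketched end to end in base R. This is an illustration under stated assumptions, not the package implementation: `toy_predict()` is a made-up scorer standing in for `predict.isopro()`, its depth formula is arbitrary, and the quantile is assumed to be the empirical CDF of the training depths.

```r
# Hypothetical scorer standing in for predict.isopro(); NOT varPro's code.
# Rows far from zero get a shallow (anomalous) toy depth.
toy_predict <- function(train_depth, newdata, quantiles = TRUE) {
  depth <- 10 / (1 + abs(newdata$x))
  # Assumption: the quantile is the ECDF of the training depths at `depth`.
  if (quantiles) stats::ecdf(train_depth)(depth) else depth
}

score_newdata <- function(train_depth, newdata) {
  # Step 1: validate the input type.
  if (!is.data.frame(newdata)) {
    stop("newdata must be a data.frame.", call. = FALSE)
  }
  # Steps 2-3: raw depth, then quantile against the training depths.
  depth <- as.numeric(toy_predict(train_depth, newdata, quantiles = FALSE))
  q     <- as.numeric(toy_predict(train_depth, newdata, quantiles = TRUE))
  # Steps 4-5: flip polarity and assemble the tidy frame.
  out <- data.frame(obs        = seq_len(nrow(newdata)),
                    case.depth = depth,
                    howbad     = 1 - q)
  # Step 6: class and provenance (toy values for `source`).
  class(out) <- c("gg_isopro", class(out))
  attr(out, "provenance") <- list(source = "toy scorer",
                                  n = nrow(newdata),
                                  prediction = TRUE)
  out
}

set.seed(42)
train_depth <- 10 / (1 + abs(rnorm(200)))
scored <- score_newdata(train_depth, data.frame(x = c(0.1, 6)))
bad_input <- try(score_newdata(train_depth, "not a df"), silent = TRUE)
```

The second row (`x = 6`) is far outside the simulated training range, so it gets the shallower `case.depth` and the larger `howbad`, which is the ordering the flip is meant to guarantee.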
70+
71+
## Plot / print / summary
72+
73+
Unchanged. The new tidy frame has the same class and columns as the training case, so every S3 companion from PR #94 works as-is.
74+
75+
## Overlay train + test (caller pattern)
76+
77+
No new machinery in the package; the existing `method`-column auto-detect in `plot.gg_isopro` is overloaded:
78+
79+
```r
80+
gg_train <- gg_isopro(fit)
81+
gg_test <- gg_isopro(fit, newdata = test_df)
82+
gg_both <- rbind(cbind(gg_train, method = "train"),
83+
cbind(gg_test, method = "test"))
84+
class(gg_both) <- c("gg_isopro", "data.frame")
85+
plot(gg_both)
86+
```
87+
88+
`method` is the existing special column used to colour-group rnd / unsupv / auto curves; reusing it for `train` / `test` works because the plot only cares about the column's existence, not its semantics. A `@section` in the `gg_isopro` roxygen documents this overload so it isn't a hidden trick.
89+
90+
## Polarity: how the wrapper presents both conventions
91+
92+
`varPro::predict.isopro(quantiles = TRUE)` returns quantiles where *smaller is more anomalous* (a row whose case depth sits in the lower tail of the training depth distribution). `gg_isopro`'s `howbad` is the opposite: *higher is more anomalous*. The wrapper is **not** trying to hide the conflict — it shows both polarities by keeping both columns:
93+
94+
- `case.depth` carries the raw mean depth from `predict(quantiles = FALSE)`. **Lower = more anomalous.** This is varPro's native scale, exposed directly, with no transformation. A user who wants to cross-reference against `varPro::predict.isopro()` output can do it on this column.
95+
- `howbad` carries `1 - predict(quantiles = TRUE)`. **Higher = more anomalous.** This is the wrapper convention, and it matches what the training-path `gg_isopro()` already produces. The plot method's elbow shape, the `threshold` annotation, and the `top_n_pct` quantile all assume this polarity.
96+
97+
The roxygen must name this transformation explicitly. A user who reads only the `howbad` column should still come away understanding: (i) it isn't byte-identical to `predict.isopro(quantiles = TRUE)`, (ii) the relationship is `howbad = 1 - quantile`, and (iii) `case.depth` is the unmodified varPro number if you need the raw measure.
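A small numeric illustration of the two polarities, using simulated depths rather than a real `isopro` fit. The assumption here (not confirmed by the spec) is that the quantile behaves like the empirical CDF of depth; the variable names are illustrative only.

```r
set.seed(1)
# Simulated depths: 95 typical rows plus 5 shallow (anomalous) rows.
depth <- c(rnorm(95, mean = 10, sd = 1), rnorm(5, mean = 4, sd = 0.5))

# Assumption: the quantile is the empirical CDF of depth at each row.
q      <- stats::ecdf(depth)(depth)  # native polarity: lower = more anomalous
howbad <- 1 - q                      # wrapper polarity: higher = more anomalous

shallowest   <- which.min(depth)     # most anomalous row on the raw scale
worst_howbad <- which.max(howbad)    # most anomalous row on the wrapper scale

# The two columns rank rows in exactly opposite order.
rank_cor <- cor(depth, howbad, method = "spearman")
```

Both views pick out the same row as most anomalous; only the direction of the scale differs, which is why keeping `case.depth` alongside `howbad` loses no information.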
98+
99+
## Validation
100+
101+
- `newdata` is supplied but isn't a `data.frame` → `stop("newdata must be a data.frame.")`.
102+
- `nrow(newdata) == 0` → empty `gg_isopro` frame with the same columns; downstream plot handles zero rows by erroring with a clear ggplot message. No special-case in the extractor.
103+
- Unknown columns / NAs in `newdata` → pass through to `predict.isopro`; varPro decides.
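The zero-row case needs no special handling because the frame-assembly idiom already degrades gracefully. The sketch below shows only the assembly step in base R; it assumes the scorer returns zero-length vectors for zero-row input, which is not verified against varPro here.

```r
# A zero-row input with a hypothetical column `x`.
empty_new <- data.frame(x = numeric(0))
n <- nrow(empty_new)

# Assumption: scoring zero rows yields zero-length depth and quantile vectors.
depth <- numeric(n)
q     <- numeric(n)

# Same assembly as the non-empty case; seq_len(0) is an empty integer vector.
gg_empty <- data.frame(obs = seq_len(n), case.depth = depth, howbad = 1 - q)
class(gg_empty) <- c("gg_isopro", class(gg_empty))
```

Using `seq_len(n)` rather than `1:n` is what keeps this safe: `1:0` would produce two rows' worth of indices.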
104+
105+
## Tests (mirroring the Phase 1–4a coverage)
106+
107+
1. **Shape**: `gg_isopro(fit, newdata = test_df)` returns `c("gg_isopro", "data.frame")` with columns `obs / case.depth / howbad`, `nrow == nrow(newdata)`.
108+
2. **Polarity flip**: synthetic check that `howbad` is in `[0, 1]` and corresponds to `1 - predict(..., quantiles = TRUE)` for the same rows.
109+
3. **Sanity check**: scoring the training set as newdata produces `howbad` values close to (but not necessarily identical to) `fit$howbad`. Tolerance is loose because varPro may use a slightly different code path for `predict` vs the in-bag scoring; the test asserts the same range and the same per-row ordering for the top-5 most anomalous rows.
110+
4. **Provenance**: returned object's provenance attribute has `prediction = TRUE` and `n == nrow(newdata)`.
111+
5. **Validation error**: `gg_isopro(fit, newdata = "not a df")` errors with `"newdata must be a data.frame"`.
112+
6. **Overlay smoke test**: rbind of train + test extractor outputs with a `method` label column plots without error; every patchwork sub-plot builds.
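The top-5 ordering check in test 3 can be illustrated without a fit. The sketch below uses made-up quantiles and a hypothetical helper, `top_k_overlap()`, to show why comparing an unflipped score against a flipped one gives no overlap at all, while two scores on the same polarity agree fully.

```r
# Hypothetical helper: how many of the k highest-scoring rows two score
# vectors have in common.
top_k_overlap <- function(a, b, k = 5L) {
  top_a <- order(a, decreasing = TRUE)[seq_len(k)]
  top_b <- order(b, decreasing = TRUE)[seq_len(k)]
  length(intersect(top_a, top_b))
}

set.seed(7)
# Made-up, tie-free quantiles in native polarity (lower = more anomalous).
q <- sample(100) / 100

# Training path returns the native quantile unflipped; prediction path flips.
overlap_mismatched <- top_k_overlap(q, 1 - q)

# Both paths flip, so both use "higher = more anomalous".
overlap_matched <- top_k_overlap(1 - q, 1 - q)
```

With mismatched polarity the "top 5" of one vector are the bottom 5 of the other, so the overlap is 0 of 5; with matched polarity it is 5 of 5.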
113+
114+
## Snapshots
115+
116+
One new `vdiffr::expect_doppelganger` inside the existing `VDIFFR_RUN_TESTS` guard: `gg-isopro-predict-overlay` — train + test bound with a `method` column, default `plot()`. Skip cleanly without the env var.
117+
118+
## Documentation
119+
120+
- Extend the existing `gg_isopro` roxygen with:
121+
- A new `@param newdata` line in the terse register.
122+
- A short `@section Scoring new data` block in the narrative register, written to make the polarity transformation explicit:
123+
- What `newdata` does.
124+
- The two `predict.isopro` calls and how their outputs map to the two columns (raw depth → `case.depth`, `1 - quantile` → `howbad`).
125+
- One sentence naming the transformation in code form (e.g. "`howbad = 1 - predict(fit, newdata, quantiles = TRUE)`") so a user diffing against raw `predict()` output sees exactly where the difference comes from.
126+
- The train/test overlay caller pattern.
127+
- Update the existing "What you use this for" section to mention the new-data use case (a held-out cohort, a production scoring scenario).
128+
129+
## Files
130+
131+
- **Modify**: `R/gg_isopro.R` (signature + new internal path), `tests/testthat/test_gg_isopro.R` (six new tests), `tests/testthat/test_snapshots.R` (one snapshot), `NEWS.md`, `DESCRIPTION` (version bump to the next available `2.7.3.900x` increment — `.9010` if PR #95 has landed by then, otherwise the implementer picks the next free slot above `.9008`).
132+
- **New**: none.
133+
134+
## Acceptance criteria
135+
136+
- `R CMD check --as-cran`: 0 errors / 0 warnings / 0 notes.
137+
- Full `devtools::test()`: 0 failures. New tests pass; gg_isopro coverage from PR #94 (43 expectations) still green.
138+
- Roxygen produced under markdown mode (PR #95 enables this; if #95 hasn't merged, write in Rd-style and document() will produce valid Rd either way).
139+
- One PR before the v2.8.0 release candidate.
140+
141+
## Out of scope
142+
143+
- A new function (e.g. `gg_isopro_predict()`) — rejected in favour of one optional argument.
144+
- An S3 `predict.gg_isopro()` method — rejected because gg_isopro is a data frame and doesn't carry the fit.
145+
- Generalising the `method`-column auto-detect to *any* grouping column. Today we reuse `method`; if real friction emerges, generalise in a later release.
146+
- Exposing `quantiles = FALSE` to the caller. Internally we call both; externally the user gets the unified columns.
