Contributing to ggRandomForests

Thank you for your interest in contributing! This guide is written for R programmers who are new to package development. It covers everything from setting up your environment to opening a pull request.


Table of contents

  1. What you will need
  2. Getting the code
  3. Package structure
  4. The gg_* design pattern
  5. Making a change
  6. Writing tests
  7. Documentation standards
  8. Code style
  9. Running the full check suite
  10. Opening a pull request
  11. Getting help

1. What you will need

Tool                  Why                      Install
R >= 4.4.0            Runtime                  https://cloud.r-project.org
RStudio or Positron   IDE with R-aware tools   https://posit.co/downloads
Git                   Version control          https://git-scm.com
Quarto                Builds the vignette      https://quarto.org/docs/get-started

Install the development helper packages in R:

install.packages(c(
  "devtools",    # load, test, check, document in one place
  "testthat",    # testing framework
  "roxygen2",    # builds man/ pages from inline comments
  "lintr",       # style linter
  "covr",        # test coverage measurement
  "pkgdown"      # builds the website
))

Install the package dependencies (randomForestSRC >= 3.4.0 is required):

install.packages(c("randomForestSRC", "randomForest", "ggplot2",
                   "dplyr", "tidyr", "survival"))

2. Getting the code

# Fork the repo on GitHub first, then:
git clone https://github.com/YOUR-USERNAME/ggRandomForests.git
cd ggRandomForests

# Add the upstream remote so you can pull future changes
git remote add upstream https://github.com/ehrlinger/ggRandomForests.git

Open ggRandomForests.Rproj in RStudio/Positron. Then load the package in development mode — this makes all functions available without installing:

devtools::load_all()   # shortcut: Ctrl+Shift+L

Confirm it works:

library(randomForestSRC)
rf <- rfsrc(Species ~ ., data = iris)
plot(gg_error(rf))

3. Package structure

R/                   All source code — one file per function family
  gg_*.R             Data extraction functions (return gg_* objects)
  plot.gg_*.R        S3 plot methods (return ggplot objects)
  help.R             Package-level ?ggRandomForests documentation
  zzz.R              .onAttach startup message

man/                 Auto-generated — never edit by hand
  *.Rd               Built from Roxygen comments by devtools::document()

tests/
  testthat/
    test_*.R         One test file per R/ source file

vignettes/
  ggRandomForests.qmd  Main package vignette (Quarto)

DESCRIPTION          Package metadata, dependencies, version
NAMESPACE            Exports/imports — auto-generated by roxygen2
NEWS.md              Changelog — add an entry for every user-visible change

Key rule: the man/ and NAMESPACE files are always auto-generated. Run devtools::document() after any Roxygen change and commit the updated files.


4. The gg_* design pattern

Every feature in this package follows the same two-step pattern:

forest object
    │
    ▼
gg_*(forest)          ← R/gg_*.R   — data extraction, returns a gg_* data.frame
    │
    ▼
plot(gg_object)       ← R/plot.gg_*.R — builds and returns a ggplot2 object

Why two steps? Keeping data and plotting separate means users can inspect, save, transform, or combine the intermediate data before plotting, and apply any ggplot2 layers they want on top of the returned object.

The gg_* object

A gg_* object is just a data.frame with extra class attributes:

# Example: what gg_vimp returns
class(gg_vimp(rf))
# [1] "gg_vimp"     "data.frame"

The extra class ("gg_vimp") lets R dispatch plot(gg_dta) to plot.gg_vimp automatically through R's S3 system.
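
To see the dispatch in isolation, here is a minimal base-R sketch. The class name gg_demo and its toy plot method are hypothetical and exist only for illustration; the package's real plot methods return ggplot objects.

```r
# A plain data.frame tagged with a hypothetical extra class.
gg_dta <- data.frame(vars = c("x1", "x2", "x3"), vimp = c(0.40, 0.25, 0.10))
class(gg_dta) <- c("gg_demo", class(gg_dta))

# Toy plot method: reports the dispatch and returns its input invisibly.
plot.gg_demo <- function(x, ...) {
  message("dispatched to plot.gg_demo")
  invisible(x)
}

class(gg_dta)                    # "gg_demo" "data.frame"
result <- plot(gg_dta)           # "gg_demo" comes first, so plot.gg_demo runs
inherits(gg_dta, "data.frame")   # TRUE: data.frame tools such as nrow() still work
```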

S3 dispatch for multiple forest packages

Most gg_* functions support both randomForestSRC and randomForest objects. The pattern is:

# 1. Generic — dispatches based on class of `object`
gg_vimp <- function(object, ...) {
  UseMethod("gg_vimp", object)
}

# 2. rfsrc method
gg_vimp.rfsrc <- function(object, ...) { ... }

# 3. randomForest method
gg_vimp.randomForest <- function(object, ...) { ... }

Both methods should return an identically structured gg_* object so that plot.gg_vimp works for either.
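
One way to keep the two methods in step is to route both through a shared constructor. The sketch below uses only base R; the names gg_toy, new_gg_toy, and the toy_rfsrc and toy_rf classes are hypothetical stand-ins, not part of this package or of either forest package.

```r
# Shared constructor: every method builds its result through this function,
# so the column names, column order, and class vector always match.
new_gg_toy <- function(vars, vimp) {
  gg_dta <- data.frame(vars = vars, vimp = vimp)
  class(gg_dta) <- c("gg_toy", class(gg_dta))
  gg_dta
}

gg_toy <- function(object, ...) {
  UseMethod("gg_toy", object)
}

# Each method only knows where its own forest type stores the values.
gg_toy.toy_rfsrc <- function(object, ...) {
  new_gg_toy(vars = names(object$importance), vimp = unname(object$importance))
}

gg_toy.toy_rf <- function(object, ...) {
  new_gg_toy(
    vars = rownames(object$importance),
    vimp = unname(object$importance[, 1])
  )
}

# Stand-in "forests" that store importance in different shapes.
forest_a <- structure(list(importance = c(x1 = 0.4, x2 = 0.1)), class = "toy_rfsrc")
forest_b <- structure(
  list(importance = matrix(c(0.3, 0.2), ncol = 1,
                           dimnames = list(c("x1", "x2"), "imp"))),
  class = "toy_rf"
)

identical(names(gg_toy(forest_a)), names(gg_toy(forest_b)))  # TRUE
identical(class(gg_toy(forest_a)), class(gg_toy(forest_b)))  # TRUE
```

With this layout, a single plot method for the shared class can serve both forest types, because it never sees the difference between them.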

Example: adding a new gg_* function

Suppose you want to add gg_depth() to plot average tree depth. Here is the skeleton:

# R/gg_depth.R

#' Tree depth data object
#'
#' Extracts average depth statistics per tree from a random forest.
#'
#' @param object A fitted \code{\link[randomForestSRC]{rfsrc}} or
#'   \code{\link[randomForest]{randomForest}} object.
#' @param ... Optional arguments passed to methods.
#'
#' @return A \code{gg_depth} \code{data.frame} with columns \code{ntree}
#'   and \code{depth}.
#'
#' @seealso \code{\link{plot.gg_depth}}
#'
#' @examples
#' rf <- rfsrc(Species ~ ., data = iris, ntree = 50)
#' plot(gg_depth(rf))
#'
#' @export
gg_depth <- function(object, ...) {
  UseMethod("gg_depth", object)
}

#' @export
gg_depth.rfsrc <- function(object, ...) {
  # ... extract one depth value per tree into a numeric vector `depths` ...
  gg_dta <- data.frame(ntree = seq_len(object$ntree), depth = depths)
  class(gg_dta) <- c("gg_depth", class(gg_dta))
  invisible(gg_dta)
}

Then create R/plot.gg_depth.R following the same pattern as plot.gg_error.R.
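
A minimal sketch of what such a plot method might look like. The geom, axis labels, and input check are assumptions chosen for illustration and are not copied from plot.gg_error.R; the stand-in data at the end only exists so the sketch runs without a fitted forest.

```r
# R/plot.gg_depth.R (hypothetical sketch)

plot.gg_depth <- function(x, ...) {
  # Guard against being called directly on the wrong kind of object.
  if (!inherits(x, "gg_depth")) {
    stop("plot.gg_depth expects a gg_depth object")
  }

  # Build the plot and return it, so callers can add further layers.
  ggplot2::ggplot(x, ggplot2::aes(x = .data$ntree, y = .data$depth)) +
    ggplot2::geom_line() +
    ggplot2::labs(x = "Tree", y = "Average depth")
}

# Stand-in data shaped like the gg_depth object described above.
gg_dta <- data.frame(ntree = seq_len(5), depth = c(3.1, 2.8, 3.4, 3.0, 2.9))
class(gg_dta) <- c("gg_depth", class(gg_dta))

gg_plt <- plot(gg_dta)
```

Because the method returns a ggplot object, a caller can extend it, for example with plot(gg_dta) + ggplot2::theme_minimal().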


5. Making a change

Always work on a branch — never commit directly to main:

git checkout -b my-feature-name

The development cycle is:

devtools::load_all()       # reload after editing source
devtools::test()           # run tests
devtools::document()       # rebuild man/ from Roxygen comments
devtools::check()          # full R CMD check (slow — run before PR)

6. Writing tests

Tests live in tests/testthat/ and are named test_<source_file>.R to match the file they cover. The framework is testthat.

Basic structure

# tests/testthat/test_gg_depth.R
test_that("gg_depth returns correct class for rfsrc", {
  rf <- randomForestSRC::rfsrc(Species ~ ., data = iris, ntree = 50)

  gg_dta <- gg_depth(rf)

  expect_s3_class(gg_dta, "gg_depth")
  expect_s3_class(gg_dta, "data.frame")
  expect_true(all(c("ntree", "depth") %in% names(gg_dta)))
  expect_equal(nrow(gg_dta), rf$ntree)
})

test_that("plot.gg_depth returns a ggplot", {
  rf <- randomForestSRC::rfsrc(Species ~ ., data = iris, ntree = 50)

  gg_plt <- plot(gg_depth(rf))

  expect_s3_class(gg_plt, "ggplot")
})

test_that("gg_depth throws on wrong input", {
  expect_error(gg_depth("not a forest"))
})

Practical tips

  • Keep forests small in tests: ntree = 50 is plenty and much faster than the default of 500.
  • Test the error path as well as the happy path (expect_error, expect_warning).
  • Use expect_s3_class() rather than the older expect_is().
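  • Avoid exact-value assertions: forest results are stochastic, and exact numbers break across randomForestSRC versions. Test structure (class, column names, dimensions) instead, and use set.seed() only when a test explicitly depends on a reproducible random draw.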
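
The tips above can be sketched with a toy example. toy_depth() is a hypothetical stand-in for a stochastic extraction function, used here so the tests run without fitting a forest; the assertions check structure and bounds, never exact values.

```r
library(testthat)

# Hypothetical stand-in for a stochastic extraction function.
toy_depth <- function(ntree) {
  if (!is.numeric(ntree) || length(ntree) != 1) {
    stop("`ntree` must be a single number")
  }
  data.frame(ntree = seq_len(ntree), depth = runif(ntree, min = 1, max = 10))
}

test_that("toy_depth has a stable structure", {
  gg_dta <- toy_depth(ntree = 50)

  # These checks hold for every random draw.
  expect_named(gg_dta, c("ntree", "depth"))
  expect_equal(nrow(gg_dta), 50)
  expect_true(all(gg_dta$depth >= 1 & gg_dta$depth <= 10))
})

test_that("toy_depth rejects invalid input", {
  # Error path: a non-numeric tree count should fail loudly.
  expect_error(toy_depth("fifty"))
})
```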

Run tests for a single file during development:

testthat::test_file("tests/testthat/test_gg_depth.R")

Check coverage (aim for > 80%):

covr::package_coverage()

7. Documentation standards

Documentation is written in Roxygen2 comments (lines starting with #') immediately above each function.

Required sections for every exported function

#' Short one-line title
#'
#' One or two paragraphs describing what the function does and why.
#'
#' @param arg1 Type and meaning. Include the default value if there is one.
#' @param arg2 ...
#'
#' @return Describe what is returned: the class, the columns in any
#'   data.frame, and any class attributes set on the object.
#'
#' @seealso \code{\link{related_function}}
#'
#' @examples
#' # A runnable example — must complete in < 10 seconds for CRAN
#' rf <- rfsrc(Species ~ ., data = iris, ntree = 50)
#' plot(gg_something(rf))
#'
#' @export

Rules

  • @param for every argument, including ... (R CMD check warns about arguments that appear in the usage but are not documented).
  • @return must describe the shape of the output — not just the class name.
  • @seealso links to the paired plot.* function (or the gg_* function from a plot.* file).
  • @examples must be runnable without error by R CMD check. Wrap slow examples in \donttest{}. Never wrap in \dontrun{} unless they literally cannot run on CRAN (network, credentials, etc.).
  • Internal helpers (not exported) get @keywords internal instead of @export.
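
As an illustration of the last rule, an internal helper might be documented as follows. check_forest() is a hypothetical function invented for this sketch; only the Roxygen tags are the point.

```r
#' Check that an object is a supported forest
#'
#' Hypothetical internal helper, shown only to illustrate the tags.
#'
#' @param object Any R object.
#'
#' @return \code{TRUE}, invisibly, if \code{object} inherits from
#'   \code{"rfsrc"} or \code{"randomForest"}; otherwise an error is thrown.
#'
#' @keywords internal
check_forest <- function(object) {
  if (!inherits(object, c("rfsrc", "randomForest"))) {
    stop("`object` must be an rfsrc or randomForest object")
  }
  invisible(TRUE)
}
```

The block has no @export tag, so the function stays out of NAMESPACE, and @keywords internal keeps its help page out of the package index.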

Rebuild the docs after any change:

devtools::document()

Then spot-check the result:

?gg_depth

Updating NEWS.md

Every user-visible change needs a bullet in NEWS.md under the appropriate version heading:

ggRandomForests v2.7.0
=====================
* Add `gg_depth()` to visualise average tree depth per forest (#42)

8. Code style

The package follows the tidyverse style guide. Key points:

Rule                         Good                                 Bad
Spacing around operators     x <- x + 1                           x<-x+1
Spaces after commas          f(x, y)                              f(x,y)
Indentation                  2 spaces                             tabs
Object names                 snake_case                           camelCase, dotted.name
Boolean checks               !inherits(x, "foo")                  inherits(x, "foo") == FALSE
Safe sequences               seq_len(n)                           1:n
Column references in aes()   .data$col or .data[[var]]            bare col or string "col"
dplyr column selection       dplyr::select(tidyr::all_of(vars))   dplyr::select(vars)

Check your code with lintr before opening a PR:

lintr::lint_package()

Common issues lintr flags:

  • Lines > 120 characters.
  • T / F instead of TRUE / FALSE.
  • Trailing whitespace.
  • 1:n instead of seq_len(n).
  • inherits(x, "cls") == FALSE instead of !inherits(x, "cls").
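
The seq_len() rule is more than cosmetic. A short base-R sketch of the edge case it protects against, using a hypothetical helper count_iterations():

```r
n <- 0

length(1:n)         # 2: 1:0 counts down and yields c(1, 0)
length(seq_len(n))  # 0: an empty sequence

# Hypothetical helper that counts how often a loop body runs.
count_iterations <- function(index) {
  iterations <- 0
  for (i in index) {
    iterations <- iterations + 1
  }
  iterations
}

count_iterations(1:n)         # 2: the loop runs twice on an empty input
count_iterations(seq_len(n))  # 0: the loop is skipped, as intended
```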

9. Running the full check suite

Before opening a PR, run the same checks CI runs:

# Quick: just tests
devtools::test()

# Thorough: full R CMD check (builds vignette, checks examples, etc.)
devtools::check()

A clean check means:

0 errors ✔ | 0 warnings ✔ | 0 notes ✔

One note about the package size or installed path is acceptable. Errors or warnings must be fixed before a PR can be merged.

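To run the checks on additional platforms, you can use rhub (version 2). It runs them on GitHub Actions rather than on your machine and needs a one-time rhub::rhub_setup() in the repository: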

rhub::rhub_check()

10. Opening a pull request

  1. Commit your changes with a clear, imperative message ("Add", "Fix", "Improve"):

    git add R/gg_depth.R R/plot.gg_depth.R tests/testthat/test_gg_depth.R
    git commit -m "Add gg_depth() for average tree depth visualisation"
  2. Push to your fork:

    git push origin my-feature-name
  3. Open a PR on GitHub against the main branch of ehrlinger/ggRandomForests.

  4. PR description checklist — include in the description:

    • What problem does this solve or what feature does it add?
    • Which functions are new or changed?
    • Did you add or update tests?
    • Did you add a NEWS.md entry?
    • Does devtools::check() pass cleanly?
  5. CI will run automatically across macOS, Windows, and Linux on R release, devel, and oldrel-1. All checks must pass before merge.

Commit message conventions

Add gg_depth() for average tree depth               ← new feature
Fix factor ordering in gg_partial categorical branch ← bug fix
Improve @return docs for gg_rfsrc                   ← documentation
Refactor bootstrap_survival to utils.R              ← refactor

Avoid "WIP", "fix", or "update" with no context.


11. Getting help

When filing a bug, always include:

# Minimum reproducible example
library(ggRandomForests)
library(randomForestSRC)

rf <- rfsrc(Species ~ ., data = iris, ntree = 50)
# ... the code that triggers the error ...

sessionInfo()  # paste this output into the issue

Thank you for helping improve ggRandomForests!