121 changes: 121 additions & 0 deletions .github/copilot-instructions.md
@@ -0,0 +1,121 @@
# ssddata — Copilot Instructions

## Project Overview
`ssddata` is a data-only R package providing Species Sensitivity Distribution (SSD) benchmark datasets for evaluating SSD software (`ssdtools`, `Burrlioz`). It packages chemical toxicity datasets from CCME, AIMS, CSIRO, ANZG, and the US EPA ECOTOX database.

## Architecture
- `R/` — Roxygen2 documentation files only; no logic except three utility functions (`get_ssddata()`, `gm_mean()`, `ssd_data_sets()`)
- `data/` — Pre-built `.rda` files (one per dataset); loaded via `LazyData: true`
- `data-raw/` — Source CSVs by org + build scripts; **not part of package build**; maintainer-only
- `tests/testthat/` — testthat edition 3 tests
- `vignettes/` — Quarto-based vignettes
- `inst/REFERENCES.bib` — Rdpack bibliography for `\insertRef{}` citations in roxygen docs

## Dataset Conventions
- Individual datasets: `{prefix}_{chemical}` (e.g., `ccme_boron`, `aims_ddt`)
- Combined datasets: `{prefix}_data` (e.g., `ccme_data`, `aims_data`)
- Prefixes: `aims`, `ccme`, `csiro`, `anzg`, `anon`, `wqbench`
- All datasets stored as `tbl_df`
- Individual dataset `.R` doc files end with `NULL`; aggregate `*_data.R` files end with `"dataset_name"`

## Code Style
- **Pipe**: `%>%` (magrittr) in `data-raw/` scripts; no pipe in `R/` package code
- **Imports**: Only `chk`, `dplyr`, `Rdpack`, `utils` — keep imports minimal
- **Input validation**: Use `chk` package for all user-facing function arguments (e.g., `chk::chk_string()`, `chk::chk_flag()`)
- **Side-effects**: Use `message()` only — no `print()` or `cat()` in exported functions
- **Docs**: Roxygen2 with markdown; use `Rdpack::reprompt()` and `\insertRef{key}{ssddata}` for literature citations
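
A minimal sketch of the validation and side-effect conventions above. The function `describe_dataset()` and its arguments are hypothetical and not part of `ssddata`; only `chk::chk_string()`, `chk::chk_flag()` and `message()` come from the conventions listed here.

```r
# Hypothetical helper (not in ssddata): shows the chk + message() style.
describe_dataset <- function(name, quiet = FALSE) {
  chk::chk_string(name) # errors unless `name` is a single non-missing string
  chk::chk_flag(quiet)  # errors unless `quiet` is TRUE or FALSE
  if (!quiet) {
    # Side-effects go through message(), never print() or cat()
    message("Describing dataset '", name, "'")
  }
  invisible(name)
}

describe_dataset("ccme_boron")
```

Because the output goes through `message()`, callers can silence it with `suppressMessages()` and tests can assert on it with `expect_message()`.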

## Build and Test
```r
devtools::load_all()
devtools::document() # regenerates roxygen2 + Rdpack docs
devtools::test() # runs testthat suite
devtools::check() # full R CMD check

# Rebuild all datasets from source
source("data-raw/source_all.R")
```
CI runs `R-CMD-check` and `pkgdown` via GitHub Actions.

## Testing Conventions
- Use `chk::check_data()` to validate dataset structure and value ranges
- Assert `message()` side-effects with `expect_message()`
- Test files mirror `R/` structure: `tests/testthat/test-{function}.R`
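
The conventions above might look roughly like this in a test file. The data frame `toy` and the function `notify()` are invented stand-ins for a packaged dataset and an exported function; the column names and value ranges are assumptions.

```r
library(testthat)

# Toy stand-in for a packaged dataset; values are illustrative only.
toy <- data.frame(
  Conc = c(1.2, 3.4, 5.6),
  Species = c("Species a", "Species b", "Species c"),
  stringsAsFactors = FALSE
)

# Hypothetical function with a message() side-effect.
notify <- function() {
  message("rebuilt")
  invisible(TRUE)
}

test_that("toy dataset has the expected structure", {
  expect_silent(
    chk::check_data(
      toy,
      # Conc: non-missing numbers in [0, Inf]; Species: non-missing character
      values = list(Conc = c(0, Inf), Species = ""),
      nrow = TRUE,    # at least one row
      key = "Species" # Species values must be unique
    )
  )
})

test_that("notify() reports via message()", {
  expect_message(notify(), "rebuilt")
})
```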

## Adding a New Dataset
1. Add source CSV(s) to `data-raw/{prefix}/`
2. Add build code to `data-raw/create_data.R` using `usethis::use_data(..., overwrite = TRUE)`
3. Create `R/{prefix}_{chemical}.R` with roxygen2 docs ending in `NULL`
4. Update the relevant `{prefix}_data` aggregate and its `.R` doc file
5. Run `source("data-raw/source_all.R")` then `devtools::document()`

---

## envirotox Integration

### Source Package
[`poissonconsulting/envirotox`](https://github.com/poissonconsulting/envirotox) (v0.0.0.9003) is a separate data-only R package providing SSD datasets derived from the **EnviroTox database 2.0.0** (Connors et al. 2019, doi:10.1002/etc.4382). Its data will be integrated into `ssddata` under the `envirotox` prefix.

### envirotox Datasets
Three datasets are exported from the package:

#### `envirotox_acute` — 14,949 rows × 6 columns
Acute toxicity records (EC50/LC50) aggregated to one geometric mean concentration per species per chemical.

| Column | Type | Description |
|---|---|---|
| `Chemical` | chr | Chemical name (short name before first `;`) |
| `Conc` | dbl | Geometric mean concentration (µg/L) |
| `Species` | chr | Latin species name |
| `Group` | chr | Taxonomic group (sentence case): `Fish`, `Invertebrate`, `Algae`, `Amphibian`, `Plant`, etc. |
| `Yanagihara24` | lgl | Meets Yanagihara et al. (2024) criteria: ≥10 species, ≥3 trophic groups, bimodality coefficient ≤ 0.555 |
| `Iwasaki25` | lgl | Meets Iwasaki et al. (2025) criteria: >50 species, ≥3 trophic groups (excludes certain metals) |

Key constraint: `chk::check_key(envirotox_acute, c("Chemical", "Species"))` — unique per chemical × species.

#### `envirotox_chronic` — 1,721 rows × 5 columns
Chronic toxicity records (NOEC/NOEL) aggregated the same way: one geometric mean concentration per species per chemical.

| Column | Type | Description |
|---|---|---|
| `Chemical` | chr | Chemical name |
| `Conc` | dbl | Geometric mean concentration (µg/L) |
| `Species` | chr | Latin species name |
| `Group` | chr | Taxonomic group (sentence case) |
| `Yanagihara24` | lgl | Meets Yanagihara et al. (2024) criteria |

Key constraint: `chk::check_key(envirotox_chronic, c("Chemical", "Species"))`.

#### `envirotox_chemical` — 744 rows × 2 columns
Chemical lookup table joining the two datasets above.

| Column | Type | Description |
|---|---|---|
| `Chemical` | chr | Chemical name (primary key) |
| `OriginalCAS` | int | Original CAS Registry Number (also a key) |

### Data Processing Pipeline (data-raw/envirotox.R)
The build script (modified from Yanagihara et al. 2024 code) processes `envirotox.xlsx` (three sheets: `test`, `substance`, `taxonomy`):

1. **Filter** records: acute = EC50/LC50 with `Test.type == "A"`; chronic = NOEC/NOEL with `Test.type == "C"`; exclude records where `Effect.is.5X.above.water.solubility == "1"`
2. **Unit conversion**: mg/L → µg/L (× 1000)
3. **Geometric mean** per `original.CAS × Latin.name` using `EnvStats::geoMean()`
4. **Minimum species/group thresholds**: ≥6 species, ≥2 trophic groups per chemical
5. **Bimodality coefficient** (BC) via `mousetrap::bimodality_coefficient(log10(Conc))` — used for `Yanagihara24` flag
6. **Group normalisation**: `"Invert"` → `"Invertebrate"`, `"Amphib"` → `"Amphibian"`, all sentence case
7. **`OriginalCAS` dropped** from acute/chronic before saving; kept only in `envirotox_chemical`
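
Steps 2, 3, 4 and 6 can be sketched in base R on toy data. This is not the actual build script: the column names (`Conc_mg_L` in particular) and values are assumptions, `gm()` is a base-R stand-in for `EnvStats::geoMean()`, and the record filter and bimodality coefficient are omitted.

```r
# Toy records; all column names and values here are illustrative assumptions.
records <- data.frame(
  Chemical  = c(rep("A", 7), rep("B", 2)),
  Species   = c("Sp1", "Sp1", "Sp2", "Sp3", "Sp4", "Sp5", "Sp6", "Sp1", "Sp2"),
  Group     = c("FISH", "FISH", "Invert", "Invert", "Amphib", "Algae", "Algae",
                "FISH", "Invert"),
  Conc_mg_L = c(1, 4, 0.5, 2, 3, 0.1, 0.2, 5, 6),
  stringsAsFactors = FALSE
)

# Step 2: unit conversion, mg/L to ug/L
records$Conc <- records$Conc_mg_L * 1000

# Step 3: geometric mean per chemical x species
gm <- function(x) exp(mean(log(x)))
agg <- aggregate(Conc ~ Chemical + Species + Group, data = records, FUN = gm)

# Step 6: expand abbreviations, then convert to sentence case
normalise_group <- function(g) {
  full <- c(Invert = "Invertebrate", Amphib = "Amphibian")
  hit <- g %in% names(full)
  g[hit] <- unname(full[g[hit]])
  paste0(toupper(substring(g, 1, 1)), tolower(substring(g, 2)))
}
agg$Group <- normalise_group(agg$Group)

# Step 4: keep chemicals with >= 6 species and >= 2 groups
count_unique <- function(x, by) tapply(x, by, function(v) length(unique(v)))
n_species <- count_unique(agg$Species, agg$Chemical)
n_groups <- count_unique(agg$Group, agg$Chemical)
keep <- names(n_species)[n_species >= 6 & n_groups >= 2]
result <- agg[agg$Chemical %in% keep, ]
```

In this toy input, chemical `B` has only two species and is dropped by the threshold step, while the two `Sp1` records for chemical `A` collapse to a single geometric mean.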

### Integration into ssddata
When adding envirotox data to `ssddata`:

- **Prefix**: `envirotox` (e.g., `envirotox_acute`, `envirotox_chronic`, `envirotox_chemical`)
- **Column mapping**: `ssddata` uses `Conc` (µg/L), `Species`, and `Group`, which match the envirotox columns directly. Individual per-chemical `ssddata` datasets have no `Chemical` column because the chemical is encoded in the dataset name.
- **Aggregate datasets**: `envirotox_acute` and `envirotox_chronic` are already multi-chemical aggregates; treat like `ccme_data` / `aims_data` — doc files end with `"envirotox_acute"` / `"envirotox_chronic"`
- **Source file**: place `envirotox.xlsx` in `data-raw/envirotox/` and the build script at `data-raw/envirotox/create_envirotox.R`
- **References to add to `inst/REFERENCES.bib`**:
  - `Connors2019` — EnviroTox database paper (doi:10.1002/etc.4382)
  - `Yanagihara2024` — distribution comparison paper (doi:10.1016/j.ecoenv.2024.116379)
  - `Iwasaki2025` — model-averaging paper (doi:10.1093/etojnl/vgae060)

### Key Difference from Other ssddata Sources
Unlike CCME/AIMS/CSIRO datasets (one dataset per chemical), envirotox datasets are pre-aggregated multi-chemical tables with a `Chemical` column acting as the grouping variable. The `Yanagihara24` and `Iwasaki25` logical flags allow subsetting to published benchmark subsets without creating separate datasets.
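
As a rough illustration of that workflow, the flag can be used to pull out a benchmark subset and then split it into per-chemical tables. The data frame `toy_acute` is an invented stand-in with the same column names as `envirotox_acute`; the values are made up.

```r
# Toy stand-in for envirotox_acute; values are illustrative only.
toy_acute <- data.frame(
  Chemical     = c("A", "A", "B", "B"),
  Species      = c("Sp1", "Sp2", "Sp1", "Sp3"),
  Conc         = c(10, 250, 3, 40),
  Yanagihara24 = c(TRUE, TRUE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Base-R equivalent of the key constraint: unique per chemical x species
stopifnot(anyDuplicated(toy_acute[c("Chemical", "Species")]) == 0)

# Subset to the flagged benchmark rows, then split by the grouping variable
benchmark <- toy_acute[toy_acute$Yanagihara24, ]
by_chemical <- split(benchmark, benchmark$Chemical)
names(by_chemical)
```

Each element of `by_chemical` then has the same shape as a per-chemical dataset from the other sources, apart from the extra `Chemical` and flag columns.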
8 changes: 5 additions & 3 deletions DESCRIPTION
@@ -1,6 +1,6 @@
Package: ssddata
Title: Species Sensitivity Distribution Data
Version: 1.0.0
Version: 1.0.0.9000
Authors@R: c(
person("Rebecca", "Fisher", , "R.Fisher@aims.gov.au", role = c("aut", "cre")),
person("Joe", "Thorley", , "joe@poissonconsulting.ca", role = "aut",
@@ -32,10 +32,12 @@ Suggests:
knitr,
readr,
kableExtra,
testthat,
here,
tidyverse,
rprojroot
rprojroot,
covr,
pkgdown,
testthat (>= 3.0.0)
VignetteBuilder: quarto
RdMacros:
Rdpack
6 changes: 3 additions & 3 deletions R/aims_aluminium_marine.R
@@ -1,7 +1,7 @@
#' Species Sensitivity Data for aluminium_marine
#'
#' Species Sensitivity Data provided by the Australian Institute of Marine
#' Science for ***aluminium*** in marine water.
#' Science for \strong{\emph{aluminium}} in marine water.
#'
#' These data were sourced from:
#'\insertRef{VanDam2018}{ssddata}
@@ -23,8 +23,8 @@
#'
#' @name aims_aluminium_marine
#' @docType data
#' @format An object of class `tbl_df` (inherits from `tbl`,
#' `data.frame`) with 20 rows and 9 columns.
#' @format An object of class \code{tbl_df} (inherits from \code{tbl},
#' \code{data.frame}) with 20 rows and 9 columns.
#' @keywords datasets
#' @examples
#'
4 changes: 2 additions & 2 deletions R/aims_data.R
@@ -31,8 +31,8 @@
#'
#' @name aims_data
#' @docType data
#' @format An object of class `tbl_df` (inherits from `tbl`,
#' `data.frame`) with 40 rows and 11 columns.
#' @format An object of class \code{tbl_df} (inherits from \code{tbl},
#' \code{data.frame}) with 40 rows and 11 columns.
#' @keywords datasets
#' @examples
#'
6 changes: 3 additions & 3 deletions R/aims_gallium_marine.R
@@ -1,7 +1,7 @@
#' Species Sensitivity Data for gallium_marine
#'
#' Species Sensitivity Data provided by the Australian Institute of Marine
#' Science for ***gallium*** in marine water.
#' Science for \strong{\emph{gallium}} in marine water.
#'
#' These data were sourced from:
#'\insertRef{VanDam2018}{ssddata}
@@ -23,8 +23,8 @@
#'
#' @name aims_gallium_marine
#' @docType data
#' @format An object of class `tbl_df` (inherits from `tbl`,
#' `data.frame`) with 6 rows and 9 columns.
#' @format An object of class \code{tbl_df} (inherits from \code{tbl},
#' \code{data.frame}) with 6 rows and 9 columns.
#' @keywords datasets
#' @examples
#'
6 changes: 3 additions & 3 deletions R/aims_molybdenum_marine.R
@@ -1,7 +1,7 @@
#' Species Sensitivity Data for molybdenum_marine
#'
#' Species Sensitivity Data provided by the Australian Institute of Marine
#' Science for ***molybdenum*** in marine water.
#' Science for \strong{\emph{molybdenum}} in marine water.
#'
#' These data were sourced from:
#'\insertRef{VanDam2018}{ssddata}
@@ -23,8 +23,8 @@
#'
#' @name aims_molybdenum_marine
#' @docType data
#' @format An object of class `tbl_df` (inherits from `tbl`,
#' `data.frame`) with 14 rows and 9 columns.
#' @format An object of class \code{tbl_df} (inherits from \code{tbl},
#' \code{data.frame}) with 14 rows and 9 columns.
#' @keywords datasets
#' @examples
#'
4 changes: 2 additions & 2 deletions R/anon_a.R
@@ -17,8 +17,8 @@
#'
#' @name anon_a
#' @docType data
#' @format An object of class `tbl_df` (inherits from `tbl`,
#' `data.frame`) with 18 rows and 2 columns.
#' @format An object of class \code{tbl_df} (inherits from \code{tbl},
#' \code{data.frame}) with 18 rows and 2 columns.
#' @keywords datasets
#' @examples
#'
4 changes: 2 additions & 2 deletions R/anon_b.R
@@ -17,8 +17,8 @@
#'
#' @name anon_b
#' @docType data
#' @format An object of class `tbl_df` (inherits from `tbl`,
#' `data.frame`) with 10 rows and 2 columns.
#' @format An object of class \code{tbl_df} (inherits from \code{tbl},
#' \code{data.frame}) with 10 rows and 2 columns.
#' @keywords datasets
#' @examples
#'
4 changes: 2 additions & 2 deletions R/anon_c.R
@@ -17,8 +17,8 @@
#'
#' @name anon_c
#' @docType data
#' @format An object of class `tbl_df` (inherits from `tbl`,
#' `data.frame`) with 16 rows and 2 columns.
#' @format An object of class \code{tbl_df} (inherits from \code{tbl},
#' \code{data.frame}) with 16 rows and 2 columns.
#' @keywords datasets
#' @examples
#'
4 changes: 2 additions & 2 deletions R/anon_d.R
@@ -17,8 +17,8 @@
#'
#' @name anon_d
#' @docType data
#' @format An object of class `tbl_df` (inherits from `tbl`,
#' `data.frame`) with 12 rows and 2 columns.
#' @format An object of class \code{tbl_df} (inherits from \code{tbl},
#' \code{data.frame}) with 12 rows and 2 columns.
#' @keywords datasets
#' @examples
#'
4 changes: 2 additions & 2 deletions R/anon_data.R
@@ -16,8 +16,8 @@
#'
#' @name anon_data
#' @docType data
#' @format An object of class `tbl_df` (inherits from `tbl`,
#' `data.frame`) with 73 rows and 2 columns.
#' @format An object of class \code{tbl_df} (inherits from \code{tbl},
#' \code{data.frame}) with 73 rows and 2 columns.
#' @keywords datasets
#' @examples
#'
4 changes: 2 additions & 2 deletions R/anon_e.R
@@ -17,8 +17,8 @@
#'
#' @name anon_e
#' @docType data
#' @format An object of class `tbl_df` (inherits from `tbl`,
#' `data.frame`) with 17 rows and 2 columns.
#' @format An object of class \code{tbl_df} (inherits from \code{tbl},
#' \code{data.frame}) with 17 rows and 2 columns.
#' @keywords datasets
#' @examples
#'
10 changes: 5 additions & 5 deletions R/anzg_alpha_cypermethrin_fresh.R
@@ -2,12 +2,12 @@
#'
#' Species Sensitivity Data provided by the Department of Agriculture Water and
#' the Environment, Australia. This data underpins the ANZG default guideline
#' for ***alpha cypermethrin*** in freshwater.
#' for \strong{\emph{alpha cypermethrin}} in freshwater.
#'
#' These data are licensed under CC BY 4.0 (summary of terms provided here:
#' <https://creativecommons.org/licenses/by/4.0/>) Additional information
#' \url{https://creativecommons.org/licenses/by/4.0/}) Additional information
#' is available from the Water Quality website at
#' <https://www.waterquality.gov.au/>
#' \url{https://www.waterquality.gov.au/}
#'
#' Please cite these data as:
#'\insertRef{Alpha-cypermethrin2023}{ssddata}
@@ -33,8 +33,8 @@
#'
#' @name anzg_alpha_cypermethrin_fresh
#' @docType data
#' @format An object of class `tbl_df` (inherits from `tbl`,
#' `data.frame`) with 14 rows and 7 columns.
#' @format An object of class \code{tbl_df} (inherits from \code{tbl},
#' \code{data.frame}) with 14 rows and 7 columns.
#' @keywords datasets
#' @examples
#'
10 changes: 5 additions & 5 deletions R/anzg_aluminium_marine.R
@@ -2,12 +2,12 @@
#'
#' Species Sensitivity Data provided by the Department of Agriculture Water and
#' the Environment, Australia. This data underpins the ANZG default guideline
#' for ***aluminium*** in marine water.
#' for \strong{\emph{aluminium}} in marine water.
#'
#' These data are licensed under CC BY 4.0 (summary of terms provided here:
#' <https://creativecommons.org/licenses/by/4.0/>) Additional information
#' \url{https://creativecommons.org/licenses/by/4.0/}) Additional information
#' is available from the Water Quality website at
#' <https://www.waterquality.gov.au/>
#' \url{https://www.waterquality.gov.au/}
#'
#' Please cite these data as:
#'\insertRef{aluminium-marine2025}{ssddata}
@@ -32,8 +32,8 @@
#'
#' @name anzg_aluminium_marine
#' @docType data
#' @format An object of class `tbl_df` (inherits from `tbl`,
#' `data.frame`) with 18 rows and 6 columns.
#' @format An object of class \code{tbl_df} (inherits from \code{tbl},
#' \code{data.frame}) with 18 rows and 6 columns.
#' @keywords datasets
#' @examples
#'
10 changes: 5 additions & 5 deletions R/anzg_ametryn_fresh.R
@@ -2,12 +2,12 @@
#'
#' Species Sensitivity Data provided by the Department of Agriculture Water and
#' the Environment, Australia. This data underpins the ANZG default guideline
#' for ***ametryn*** in freshwater.
#' for \strong{\emph{ametryn}} in freshwater.
#'
#' These data are licensed under CC BY 4.0 (summary of terms provided here:
#' <https://creativecommons.org/licenses/by/4.0/>) Additional information
#' \url{https://creativecommons.org/licenses/by/4.0/}) Additional information
#' is available from the Water Quality website at
#' <https://www.waterquality.gov.au/>
#' \url{https://www.waterquality.gov.au/}
#'
#' Please cite these data as:
#'\insertRef{ametryn-fresh2025}{ssddata}
@@ -32,8 +32,8 @@
#'
#' @name anzg_ametryn_fresh
#' @docType data
#' @format An object of class `tbl_df` (inherits from `tbl`,
#' `data.frame`) with 8 rows and 6 columns.
#' @format An object of class \code{tbl_df} (inherits from \code{tbl},
#' \code{data.frame}) with 8 rows and 6 columns.
#' @keywords datasets
#' @examples
#'
10 changes: 5 additions & 5 deletions R/anzg_ammonia_fresh.R
@@ -2,12 +2,12 @@
#'
#' Species Sensitivity Data provided by the Department of Agriculture Water and
#' the Environment, Australia. This data underpins the ANZG default guideline
#' for ***ammonia*** in freshwater.
#' for \strong{\emph{ammonia}} in freshwater.
#'
#' These data are licensed under CC BY 4.0 (summary of terms provided here:
#' <https://creativecommons.org/licenses/by/4.0/>) Additional information
#' \url{https://creativecommons.org/licenses/by/4.0/}) Additional information
#' is available from the Water Quality website at
#' <https://www.waterquality.gov.au/>
#' \url{https://www.waterquality.gov.au/}
#'
#' Please cite these data as:
#'\insertRef{ammonia-fresh2026}{ssddata}
@@ -32,8 +32,8 @@
#'
#' @name anzg_ammonia_fresh
#' @docType data
#' @format An object of class `tbl_df` (inherits from `tbl`,
#' `data.frame`) with 40 rows and 6 columns.
#' @format An object of class \code{tbl_df} (inherits from \code{tbl},
#' \code{data.frame}) with 40 rows and 6 columns.
#' @keywords datasets
#' @examples
#'