Merged
62 commits
b1ec296
adding script to create database
aylapear Oct 31, 2025
5c7df81
git ignore wqbench additional files
aylapear Oct 31, 2025
e5ed5b2
aggregate dataset
aylapear Nov 4, 2025
26db6ff
updating messages and simplifying code
aylapear Nov 4, 2025
4f94316
adding species and filtering to at least 4 classes
aylapear Nov 7, 2025
396ee78
Added WQ bench .rda workflow
beckyfisher Jan 16, 2026
8192b09
Test document workflow still runs
beckyfisher Jan 16, 2026
e7cdbbd
update to include wq bench
beckyfisher Jan 16, 2026
7057ae3
added license page
beckyfisher Jan 19, 2026
8014cb9
adding documentation files
aylapear Jan 20, 2026
f4e6bbb
add latin name and medium
beckyfisher Jan 21, 2026
8038798
Merge branch 'wqbench-data' of github.com:open-AIMS/ssddata into wqbe…
beckyfisher Jan 21, 2026
f463916
Renamed data_wqbench to wqbench_data to be consistent with the other …
beckyfisher Jan 23, 2026
fcb991a
create data not required as .R file generated manuall for wqbench
beckyfisher Jan 23, 2026
c7545b4
Ensure wqbench DATASET.R not automatically sourced
beckyfisher Jan 23, 2026
2d09e59
Revert to original wqbench workflow with only minor modifications
beckyfisher Jan 23, 2026
b9e1868
Added further detail to documentation
beckyfisher Jan 23, 2026
ee683aa
Minor updates to readme
beckyfisher Jan 23, 2026
d08009a
Add Ayla as an author
beckyfisher Jan 23, 2026
1f6f6f1
Add anzg master table
beckyfisher Feb 22, 2026
20abbff
Added raw 2021-2026 anzg data and refs
beckyfisher Feb 23, 2026
1799049
corrections to documentation generating files and minor edits
beckyfisher May 8, 2026
25e62d8
Updated datasets and documentation files
beckyfisher May 8, 2026
fb1df56
added additional datasets
beckyfisher May 8, 2026
6e227b0
Merge pull request #15 from open-AIMS/wqbench-data
beckyfisher May 8, 2026
6251834
Run pkgdown GitHub Action on dev branch
beckyfisher May 8, 2026
abf7cdf
Ignore large upstream SQL dumps in data-raw
beckyfisher May 8, 2026
47425a6
Ignore raw folder for anztox
beckyfisher May 8, 2026
c1da726
Create folder for the anztox database
beckyfisher May 8, 2026
a73ce09
moved exploratory files to raw so untracked
beckyfisher May 13, 2026
9cf1754
update rmcd badge
beckyfisher May 13, 2026
1279d03
ai ignores
beckyfisher May 15, 2026
8229706
minor updates and additional ignores
beckyfisher May 19, 2026
c12c36b
Minor fix to prevent pkgdown error
beckyfisher May 19, 2026
6e00f1b
Added anztox_data including relevant code for building the dataset an…
beckyfisher May 19, 2026
921a2b0
Moved md for endpoint lookup to vignettes
beckyfisher May 19, 2026
e78de9e
Minor corrections to pass checks
beckyfisher May 19, 2026
6b012a3
added research builder md
beckyfisher May 19, 2026
c194acb
additional suggests packages for vignette building
beckyfisher May 19, 2026
0f0c5a6
moved research builder to notes folder
beckyfisher May 19, 2026
303039f
Fixes to sow unmapped endpoints dynamically in vignette
beckyfisher May 19, 2026
a203112
Add R script to build a dynamic pkgdown.yml and minro tweaks to ensur…
beckyfisher May 20, 2026
d712997
Add github copilot instructions
beckyfisher May 20, 2026
72ba3ae
Changed documentation scripts to write markdown syntax
beckyfisher May 20, 2026
0fb084f
updated documentation following minor edit to anzg
beckyfisher May 20, 2026
5ae288e
updates to ensure envirotox_data and children appear in references pa…
beckyfisher May 20, 2026
28fe905
Add packages required for envirotox to suggests
beckyfisher May 20, 2026
63d8bb4
Add information on envirotox to readme
beckyfisher May 20, 2026
fca5157
Add ne envirotox datasets to data
beckyfisher May 20, 2026
3ab3fe9
add envirotox data details to readme
beckyfisher May 20, 2026
dd8b079
Add envirotox DATASET.R workflow to source_all.R
beckyfisher May 20, 2026
3f77984
Add envirotox Dataset.R and raw data to the data-raw folder
beckyfisher May 20, 2026
73e0ac9
Add required references for envirotox data documentation to bib
beckyfisher May 20, 2026
717dadb
Add .R files for envirotox documentation
beckyfisher May 20, 2026
75e6131
updated man pages following devtoools::document()
beckyfisher May 20, 2026
4ef95ed
update details to reflect this function will NOT return envirotox data
beckyfisher May 20, 2026
6082f90
Minor styling fixes
beckyfisher May 20, 2026
ff183ac
Add test for envirotox
beckyfisher May 20, 2026
5166e45
Minor styling fixes
beckyfisher May 20, 2026
28df04a
added new package function to pkgdown to extract envirotox datasets
beckyfisher May 20, 2026
bf2df28
Updates to readme
beckyfisher May 20, 2026
868adac
added the original envirotox list_datasets function as a deprecated f…
beckyfisher May 20, 2026
5 changes: 5 additions & 0 deletions .Rbuildignore
@@ -9,3 +9,8 @@
 ^codecov\.yml$
 ^scripts$
 ^ignore$
+^\.vscode$
+^\.positai$
+^\.claude$
+^vignettes/\.quarto$
+ssddata_builder_research.md
11 changes: 11 additions & 0 deletions .gitattributes
@@ -0,0 +1,11 @@
# Normalise line endings to LF in the repository
* text=auto eol=lf

# Force binary treatment for R data files
*.rda binary
*.rds binary
*.RData binary

# Windows-specific
*.bat text eol=crlf
*.cmd text eol=crlf
121 changes: 121 additions & 0 deletions .github/copilot-instructions.md
@@ -0,0 +1,121 @@
# ssddata — Copilot Instructions

## Project Overview
`ssddata` is a data-only R package providing Species Sensitivity Distribution (SSD) benchmark datasets for evaluating SSD software (`ssdtools`, `Burrlioz`). It packages chemical toxicity datasets from CCME, AIMS, CSIRO, ANZG, and the US EPA ECOTOX database.

## Architecture
- `R/` — Roxygen2 documentation files only; no logic except three utility functions (`get_ssddata()`, `gm_mean()`, `ssd_data_sets()`)
- `data/` — Pre-built `.rda` files (one per dataset); loaded via `LazyData: true`
- `data-raw/` — Source CSVs by org + build scripts; **not part of package build**; maintainer-only
- `tests/testthat/` — testthat edition 3 tests
- `vignettes/` — Quarto-based vignettes
- `inst/REFERENCES.bib` — Rdpack bibliography for `\insertRef{}` citations in roxygen docs

## Dataset Conventions
- Individual datasets: `{prefix}_{chemical}` (e.g., `ccme_boron`, `aims_ddt`)
- Combined datasets: `{prefix}_data` (e.g., `ccme_data`, `aims_data`)
- Prefixes: `aims`, `ccme`, `csiro`, `anzg`, `anon`, `wqbench`
- All datasets stored as `tbl_df`
- Individual dataset `.R` doc files end with `NULL`; aggregate `*_data.R` files end with `"dataset_name"`

## Code Style
- **Pipe**: `%>%` (magrittr) in `data-raw/` scripts; no pipe in `R/` package code
- **Imports**: Only `chk`, `dplyr`, `Rdpack`, `utils` — keep imports minimal
- **Input validation**: Use `chk` package for all user-facing function arguments (e.g., `chk::chk_string()`, `chk::chk_flag()`)
- **Side-effects**: Use `message()` only — no `print()` or `cat()` in exported functions
- **Docs**: Roxygen2 with markdown; use `Rdpack::reprompt()` and `\insertRef{key}{ssddata}` for literature citations

## Build and Test
```r
devtools::load_all()
devtools::document() # regenerates roxygen2 + Rdpack docs
devtools::test() # runs testthat suite
devtools::check() # full R CMD check

# Rebuild all datasets from source
source("data-raw/source_all.R")
```
CI runs `R-CMD-check` and `pkgdown` via GitHub Actions.

## Testing Conventions
- Use `chk::check_data()` to validate dataset structure and value ranges
- Assert `message()` side-effects with `expect_message()`
- Test files mirror `R/` structure: `tests/testthat/test-{function}.R`

## Adding a New Dataset
1. Add source CSV(s) to `data-raw/{prefix}/`
2. Add build code to `data-raw/create_data.R` using `usethis::use_data(..., overwrite = TRUE)`
3. Create `R/{prefix}_{chemical}.R` with roxygen2 docs ending in `NULL`
4. Update the relevant `{prefix}_data` aggregate and its `.R` doc file
5. Run `source("data-raw/source_all.R")` then `devtools::document()`
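A minimal sketch of steps 1 and 2 above, using base R only so it runs outside the package. The CSV contents, the `demo_boron` name, and the temporary paths are hypothetical; inside the package the save step would be `usethis::use_data(demo_boron, overwrite = TRUE)` run from the package root, and the object would first be converted to a `tbl_df` (for example with `tibble::as_tibble()`).

```r
# Hypothetical source CSV standing in for a file under data-raw/{prefix}/.
csv_path <- file.path(tempdir(), "boron.csv")
writeLines(
  c(
    "Species,Conc,Group",
    "Daphnia magna,1200,Invertebrate",
    "Danio rerio,5600,Fish"
  ),
  csv_path
)

# Read the source data and confirm the expected columns are present.
demo_boron <- utils::read.csv(csv_path, stringsAsFactors = FALSE)
stopifnot(
  all(c("Species", "Conc", "Group") %in% names(demo_boron)),
  is.numeric(demo_boron$Conc)
)

# Stand-in for usethis::use_data(): write a single-object .rda file.
rda_path <- file.path(tempdir(), "demo_boron.rda")
save(demo_boron, file = rda_path)
```

Writing to `tempdir()` keeps the sketch from touching a real `data/` folder.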

---

## envirotox Integration

### Source Package
[`poissonconsulting/envirotox`](https://github.com/poissonconsulting/envirotox) (v0.0.0.9003) is a separate data-only R package providing SSD datasets derived from the **EnviroTox database 2.0.0** (Connors et al. 2019, doi:10.1002/etc.4382). Its data will be integrated into `ssddata` under the `envirotox` prefix.

### envirotox Datasets
Three datasets are exported from the package:

#### `envirotox_acute` — 14,949 rows × 6 columns
Acute toxicity records (EC50/LC50) aggregated to one geometric mean concentration per species per chemical.

| Column | Type | Description |
|---|---|---|
| `Chemical` | chr | Chemical name (short name before first `;`) |
| `Conc` | dbl | Geometric mean concentration (µg/L) |
| `Species` | chr | Latin species name |
| `Group` | chr | Taxonomic group (sentence case): `Fish`, `Invertebrate`, `Algae`, `Amphibian`, `Plant`, etc. |
| `Yanagihara24` | lgl | Meets Yanagihara et al. (2024) criteria: ≥10 species, ≥3 trophic groups, bimodality coefficient ≤ 0.555 |
| `Iwasaki25` | lgl | Meets Iwasaki et al. (2025) criteria: >50 species, ≥3 trophic groups (excludes certain metals) |

Key constraint: `chk::check_key(envirotox_acute, c("Chemical", "Species"))` — unique per chemical × species.
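As a rough illustration of what this key constraint means, the base-R sketch below tests uniqueness of the `Chemical` × `Species` pair on a made-up table. `has_unique_key()` and `toy` are hypothetical names introduced here; they are not part of `chk` or `envirotox`.

```r
# Made-up rows with the two key columns only.
toy <- data.frame(
  Chemical = c("A", "A", "B"),
  Species = c("Daphnia magna", "Danio rerio", "Daphnia magna"),
  stringsAsFactors = FALSE
)

# TRUE when no combination of the key columns is repeated.
has_unique_key <- function(data, key) {
  anyDuplicated(data[key]) == 0L
}

ok <- has_unique_key(toy, c("Chemical", "Species"))

# Repeating the first row breaks the constraint.
broken <- has_unique_key(rbind(toy, toy[1, ]), c("Chemical", "Species"))
```

Here `ok` is `TRUE` and `broken` is `FALSE`; `chk::check_key()` signals an error in the second situation rather than returning a flag.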

#### `envirotox_chronic` — 1,721 rows × 5 columns
Chronic toxicity records (NOEC/NOEL) aggregated similarly.

| Column | Type | Description |
|---|---|---|
| `Chemical` | chr | Chemical name |
| `Conc` | dbl | Geometric mean concentration (µg/L) |
| `Species` | chr | Latin species name |
| `Group` | chr | Taxonomic group (sentence case) |
| `Yanagihara24` | lgl | Meets Yanagihara et al. (2024) criteria |

Key constraint: `chk::check_key(envirotox_chronic, c("Chemical", "Species"))`.

#### `envirotox_chemical` — 744 rows × 2 columns
Chemical lookup table joining the two datasets above.

| Column | Type | Description |
|---|---|---|
| `Chemical` | chr | Chemical name (primary key) |
| `OriginalCAS` | int | Original CAS Registry Number (also a key) |

### Data Processing Pipeline (data-raw/envirotox.R)
The build script (modified from Yanagihara et al. 2024 code) processes `envirotox.xlsx` (three sheets: `test`, `substance`, `taxonomy`):

1. **Filter** records: acute = EC50/LC50 with `Test.type == "A"`; chronic = NOEC/NOEL with `Test.type == "C"`; exclude records where `Effect.is.5X.above.water.solubility == "1"`
2. **Unit conversion**: mg/L → µg/L (× 1000)
3. **Geometric mean** per `original.CAS × Latin.name` using `EnvStats::geoMean()`
4. **Minimum species/group thresholds**: ≥6 species, ≥2 trophic groups per chemical
5. **Bimodality coefficient** (BC) via `mousetrap::bimodality_coefficient(log10(Conc))` — used for `Yanagihara24` flag
6. **Group normalisation**: `"Invert"` → `"Invertebrate"`, `"Amphib"` → `"Amphibian"`, all sentence case
7. **`OriginalCAS` dropped** from acute/chronic before saving; kept only in `envirotox_chemical`
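
Steps 1 to 3 and 6 can be sketched on toy data as follows. This is an illustration under stated assumptions, not the actual build script: the column names are simplified stand-ins for those in `envirotox.xlsx`, `geo_mean()` is a base-R substitute for `EnvStats::geoMean()`, and the species/group thresholds and bimodality coefficient (steps 4 and 5) are left out.

```r
# Toy records; column names are hypothetical simplifications.
records <- data.frame(
  cas = c(1L, 1L, 1L, 1L, 2L),
  species = c("Daphnia magna", "Daphnia magna", "Danio rerio",
              "Danio rerio", "Daphnia magna"),
  group = c("Invert", "Invert", "Fish", "Fish", "Invert"),
  endpoint = c("EC50", "EC50", "LC50", "NOEC", "EC50"),
  type = c("A", "A", "A", "C", "A"),
  conc_mg_l = c(1, 4, 10, 0.5, 2),
  stringsAsFactors = FALSE
)

# Step 1: keep acute EC50/LC50 records only.
acute <- records[records$endpoint %in% c("EC50", "LC50") & records$type == "A", ]

# Step 2: convert mg/L to ug/L.
acute$Conc <- acute$conc_mg_l * 1000

# Step 3: geometric mean per chemical x species
# (base-R stand-in for EnvStats::geoMean()).
geo_mean <- function(x) exp(mean(log(x)))
agg <- aggregate(Conc ~ cas + species + group, data = acute, FUN = geo_mean)

# Step 6: expand abbreviated group labels.
agg$group <- as.character(agg$group)
lookup <- c(Invert = "Invertebrate", Amphib = "Amphibian")
hit <- agg$group %in% names(lookup)
agg$group[hit] <- unname(lookup[agg$group[hit]])
```

The two *Daphnia magna* records for chemical 1 (1000 and 4000 µg/L) collapse to a single row with a geometric mean of 2000 µg/L.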

### Integration into ssddata
When adding envirotox data to `ssddata`:

- **Prefix**: `envirotox` (e.g., `envirotox_acute`, `envirotox_chronic`, `envirotox_chemical`)
- **Column mapping**: `ssddata` uses `Conc` (µg/L), `Species`, `Group` — matches envirotox columns directly; no `Chemical` column in individual `ssddata` datasets (chemical is encoded in dataset name for per-chemical sets)
- **Aggregate datasets**: `envirotox_acute` and `envirotox_chronic` are already multi-chemical aggregates; treat like `ccme_data` / `aims_data` — doc files end with `"envirotox_acute"` / `"envirotox_chronic"`
- **Source file**: place `envirotox.xlsx` in `data-raw/envirotox/` and the build script at `data-raw/envirotox/create_envirotox.R`
- **References to add to `inst/REFERENCES.bib`**:
  - `Connors2019` — EnviroTox database paper (doi:10.1002/etc.4382)
  - `Yanagihara2024` — distribution comparison paper (doi:10.1016/j.ecoenv.2024.116379)
  - `Iwasaki2025` — model-averaging paper (doi:10.1093/etojnl/vgae060)

### Key Difference from Other ssddata Sources
Unlike CCME/AIMS/CSIRO datasets (one dataset per chemical), envirotox datasets are pre-aggregated multi-chemical tables with a `Chemical` column acting as the grouping variable. The `Yanagihara24` and `Iwasaki25` logical flags allow subsetting to published benchmark subsets without creating separate datasets.
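
A small sketch of how the logical flags could be used to recover per-chemical tables in the `Conc`/`Species`/`Group` layout of the other sources. `toy_acute` is a made-up stand-in with the documented columns; its values are invented for illustration only.

```r
# Made-up table with the documented envirotox_acute columns.
toy_acute <- data.frame(
  Chemical = c("A", "A", "B", "B"),
  Conc = c(10, 250, 3, 40),
  Species = c("Daphnia magna", "Danio rerio", "Daphnia magna", "Danio rerio"),
  Group = c("Invertebrate", "Fish", "Invertebrate", "Fish"),
  Yanagihara24 = c(TRUE, TRUE, FALSE, FALSE),
  Iwasaki25 = c(FALSE, FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Keep only rows flagged as meeting the Yanagihara et al. (2024) criteria.
benchmark <- toy_acute[toy_acute$Yanagihara24, ]

# One table per chemical, without the Chemical and flag columns.
per_chemical <- lapply(
  split(benchmark, benchmark$Chemical),
  function(d) d[, c("Conc", "Species", "Group")]
)
```

In this toy case `per_chemical` holds a single element named `"A"` with two rows.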
4 changes: 1 addition & 3 deletions .github/workflows/pkgdown.yaml
@@ -1,8 +1,6 @@
# Workflow derived from https://github.com/r-lib/actions/tree/v2/examples
# Need help debugging build failures? Start at https://github.com/r-lib/actions#where-to-find-help
on:
push:
branches: [main, master]
branches: [main, master, dev]
pull_request:
release:
types: [published]
61 changes: 37 additions & 24 deletions .gitignore
@@ -1,24 +1,37 @@
-.Rproj.user
-.Rhistory
-.RData
-.Ruserdata
-
-.DS_Store
-.Renviron
-.RProfile
-.httr-oauth
-
-*.html
-*.Rproj
-
-inst/docs/
-
-ignore/
-
-output/
-output[0-9]/
-output-20[0-9][0-9]-[0-9][0-9]*
-
-report/
-ignore/
-docs
+.Rproj.user
+.Rhistory
+.RData
+.Ruserdata
+
+
+.DS_Store
+.Renviron
+.RProfile
+.httr-oauth
+
+*.html
+*.Rproj
+
+inst/docs/
+
+ignore/
+
+output/
+output[0-9]/
+output-20[0-9][0-9]-[0-9][0-9]*
+
+report/
+ignore/
+docs
+
+data-raw/wqbench
+
+data-raw/**/raw/
+data-raw/**/*.sql
+data-raw/**/*.sqlite
+data-raw/**/*.db
+
+data-raw/**/raw/
+.positai
+
+data-raw/anztox/toxicityvalue_combined_clean.csv
25 changes: 21 additions & 4 deletions DESCRIPTION
@@ -1,10 +1,12 @@
 Package: ssddata
 Title: Species Sensitivity Distribution Data
-Version: 1.0.0
+Version: 1.0.0.9000
 Authors@R: c(
 person("Rebecca", "Fisher", , "R.Fisher@aims.gov.au", role = c("aut", "cre")),
 person("Joe", "Thorley", , "joe@poissonconsulting.ca", role = "aut",
 comment = c(ORCID = "0000-0002-7683-4592")),
+person("Ayla", "Pearson", , "ayla@poissonconsulting.ca", role = "aut",
+comment = c(ORCID = "0000-0001-7388-1222")),
 person("Carl", "Schwarz", role = "ctb"),
 person("David", "Fox", role = "ctb")
 )
@@ -15,6 +17,7 @@ Description: Reference data sets of species sensitivities to compare the
 as five datasets from anonymous sources. It also includes a data set
 of the results of fitting various distributions using different
 software.
+URL: https://open-aims.github.io/ssddata/, https://github.com/open-AIMS/ssddata
 License: Apache License (== 2.0)
 Depends:
 R (>= 3.5)
@@ -23,13 +26,27 @@ Imports:
 dplyr,
 Rdpack,
 utils
-Suggests:
+Suggests:
+quarto,
+tibble,
+knitr,
+readr,
+kableExtra,
+here,
+tidyverse,
+rprojroot,
 covr,
-testthat (>= 3.0.0)
+pkgdown,
+testthat (>= 3.0.0),
+EnvStats,
+mousetrap,
+openxlsx,
+stringr
+VignetteBuilder: quarto
 RdMacros:
 Rdpack
 Config/testthat/edition: 3
 Encoding: UTF-8
 LazyData: true
 Roxygen: list(markdown = TRUE)
-RoxygenNote: 7.3.2
+RoxygenNote: 7.3.3
2 changes: 2 additions & 0 deletions NAMESPACE
@@ -1,7 +1,9 @@
 # Generated by roxygen2: do not edit by hand

+export(envirotox_data_sets)
 export(get_ssddata)
 export(gm_mean)
+export(list_datasets)
 export(ssd_data_sets)
 import(chk)
 importFrom(Rdpack,reprompt)
67 changes: 34 additions & 33 deletions R/aims_aluminium_marine.R
@@ -1,33 +1,34 @@
-#' Species Sensitivity Data for aluminium_marine
-#'
-#' Species Sensitivity Data provided by the Australian Institute of Marine
-#' Science for aluminium in marine water.
-#'
-#' These data were sourced from:
-#'\insertRef{VanDam2018}{ssddata}
-#'
-#'
-#' The columns are as follows:
-#'
-#' \describe{
-#'\item{Common}{The species common name (chr).}
-#'\item{Conc}{The chemical concentration in micrograms per Litre (dbl).}
-#'\item{Domain}{Tropical, temperate or other filter (chr).}
-#'\item{Life_stage}{Life stage of the test organism (chr).}
-#'\item{Phylum}{The Phylum name (chr).}
-#'\item{Source}{The endpoint primary data source (chr).}
-#'\item{Species}{The species names name (chr).}
-#'\item{Test_endpoint}{Endpoint statistic, EC10, NEC etc (chr).}
-#'\item{Toxicity_measure}{Type of toxicity measure used (chr).}
-#' }
-#'
-#' @name aims_aluminium_marine
-#' @docType data
-#' @format An object of class `tbl_df` (inherits from `tbl`,
-#' `data.frame`) with 20 rows and 9 columns.
-#' @keywords datasets
-#' @examples
-#'
-#' print(aims_aluminium_marine, n=Inf)
-#'
-"aims_aluminium_marine"
+#' Species Sensitivity Data for aluminium_marine
+#'
+#' Species Sensitivity Data provided by the Australian Institute of Marine
+#' Science for \strong{\emph{aluminium}} in marine water.
+#'
+#' These data were sourced from:
+#'\insertRef{VanDam2018}{ssddata}
+#'
+#'
+#' The columns are as follows:
+#'
+#' \describe{
+#'\item{Common}{The species common name (chr).}
+#'\item{Conc}{The chemical concentration in micrograms per Litre (dbl).}
+#'\item{Domain}{Tropical, temperate or other filter (chr).}
+#'\item{Life_stage}{Life stage of the test organism (chr).}
+#'\item{Phylum}{The Phylum name (chr).}
+#'\item{Source}{The endpoint primary data source (chr).}
+#'\item{Species}{The species names name (chr).}
+#'\item{Test_endpoint}{Endpoint statistic, EC10, NEC etc (chr).}
+#'\item{Toxicity_measure}{Type of toxicity measure used (chr).}
+#' }
+#'
+#' @name aims_aluminium_marine
+#' @docType data
+#' @format An object of class \code{tbl_df} (inherits from \code{tbl},
+#' \code{data.frame}) with 20 rows and 9 columns.
+#' @keywords datasets
+#' @examples
+#'
+#' data(aims_aluminium_marine)
+#' print(aims_aluminium_marine, n=Inf)
+#'
+NULL