open-AIMS · beckyfisher · May 25, 2026 · Oct 31, 2025 · Oct 31, 2025 · Nov 4, 2025
diff --git a/.Rbuildignore b/.Rbuildignore
@@ -9,3 +9,8 @@
 ^codecov\.yml$
 ^scripts$
 ^ignore$
+^\.vscode$
+^\.positai$
+^\.claude$
+^vignettes/\.quarto$
+ssddata_builder_research.md
diff --git a/.gitattributes b/.gitattributes
@@ -0,0 +1,11 @@
+# Normalise line endings to LF in the repository
+* text=auto eol=lf
+
+# Force binary treatment for R data files
+*.rda binary
+*.rds binary
+*.RData binary
+
+# Windows-specific
+*.bat text eol=crlf
+*.cmd text eol=crlf
diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md
@@ -0,0 +1,121 @@
+# ssddata — Copilot Instructions
+
+## Project Overview
+`ssddata` is a data-only R package providing Species Sensitivity Distribution (SSD) benchmark datasets for evaluating SSD software (`ssdtools`, `Burrlioz`). It packages chemical toxicity datasets from CCME, AIMS, CSIRO, ANZG, and the US EPA ECOTOX database.
+
+## Architecture
+- `R/` — Roxygen2 documentation files only; no logic except five exported utility functions (`get_ssddata()`, `gm_mean()`, `ssd_data_sets()`, `envirotox_data_sets()`, `list_datasets()`)
+- `data/` — Pre-built `.rda` files (one per dataset); loaded via `LazyData: true`
+- `data-raw/` — Source CSVs by org + build scripts; **not part of package build**; maintainer-only
+- `tests/testthat/` — testthat edition 3 tests
+- `vignettes/` — Quarto-based vignettes
+- `inst/REFERENCES.bib` — Rdpack bibliography for `\insertRef{}` citations in roxygen docs
+
+## Dataset Conventions
+- Individual datasets: `{prefix}_{chemical}` (e.g., `ccme_boron`, `aims_ddt`)
+- Combined datasets: `{prefix}_data` (e.g., `ccme_data`, `aims_data`)
+- Prefixes: `aims`, `ccme`, `csiro`, `anzg`, `anon`, `wqbench`
+- All datasets stored as `tbl_df`
+- Individual dataset `.R` doc files end with `NULL`; aggregate `*_data.R` files end with `"dataset_name"`
+
+## Code Style
+- **Pipe**: `%>%` (magrittr) in `data-raw/` scripts; no pipe in `R/` package code
+- **Imports**: Only `chk`, `dplyr`, `Rdpack`, `utils` — keep imports minimal
+- **Input validation**: Use `chk` package for all user-facing function arguments (e.g., `chk::chk_string()`, `chk::chk_flag()`)
+- **Side-effects**: Use `message()` only — no `print()` or `cat()` in exported functions
+- **Docs**: Roxygen2 with markdown; use `Rdpack::reprompt()` and `\insertRef{key}{ssddata}` for literature citations
+
+## Build and Test
+```r
+devtools::load_all()
+devtools::document()   # regenerates roxygen2 + Rdpack docs
+devtools::test()       # runs testthat suite
+devtools::check()      # full R CMD check
+
+# Rebuild all datasets from source
+source("data-raw/source_all.R")
+```
+CI runs `R-CMD-check` and `pkgdown` via GitHub Actions.
+
+## Testing Conventions
+- Use `chk::check_data()` to validate dataset structure and value ranges
+- Assert `message()` side-effects with `expect_message()`
+- Test files mirror `R/` structure: `tests/testthat/test-{function}.R`
+
+## Adding a New Dataset
+1. Add source CSV(s) to `data-raw/{prefix}/`
+2. Add build code to `data-raw/create_data.R` using `usethis::use_data(..., overwrite = TRUE)`
+3. Create `R/{prefix}_{chemical}.R` with roxygen2 docs ending in `NULL`
+4. Update the relevant `{prefix}_data` aggregate and its `.R` doc file
+5. Run `source("data-raw/source_all.R")` then `devtools::document()`
+
+---
+
+## envirotox Integration
+
+### Source Package
+[`poissonconsulting/envirotox`](https://github.com/poissonconsulting/envirotox) (v0.0.0.9003) is a separate data-only R package providing SSD datasets derived from the **EnviroTox database 2.0.0** (Connors et al. 2019, doi:10.1002/etc.4382). Its data will be integrated into `ssddata` under the `envirotox` prefix.
+
+### envirotox Datasets
+Three datasets are exported from the package:
+
+#### `envirotox_acute` — 14,949 rows × 6 columns
+Acute toxicity records (EC50/LC50) aggregated to one geometric mean concentration per species per chemical.
+
+| Column | Type | Description |
+|---|---|---|
+| `Chemical` | chr | Chemical name (short name before first `;`) |
+| `Conc` | dbl | Geometric mean concentration (µg/L) |
+| `Species` | chr | Latin species name |
+| `Group` | chr | Taxonomic group (sentence case): `Fish`, `Invertebrate`, `Algae`, `Amphibian`, `Plant`, etc. |
+| `Yanagihara24` | lgl | Meets Yanagihara et al. (2024) criteria: ≥10 species, ≥3 trophic groups, bimodality coefficient ≤ 0.555 |
+| `Iwasaki25` | lgl | Meets Iwasaki et al. (2025) criteria: >50 species, ≥3 trophic groups (excludes certain metals) |
+
+Key constraint: `chk::check_key(envirotox_acute, c("Chemical", "Species"))` — unique per chemical × species.
+
+#### `envirotox_chronic` — 1,721 rows × 5 columns
+Chronic toxicity records (NOEC/NOEL) aggregated similarly.
+
+| Column | Type | Description |
+|---|---|---|
+| `Chemical` | chr | Chemical name |
+| `Conc` | dbl | Geometric mean concentration (µg/L) |
+| `Species` | chr | Latin species name |
+| `Group` | chr | Taxonomic group (sentence case) |
+| `Yanagihara24` | lgl | Meets Yanagihara et al. (2024) criteria |
+
+Key constraint: `chk::check_key(envirotox_chronic, c("Chemical", "Species"))`.
+
+#### `envirotox_chemical` — 744 rows × 2 columns
+Chemical lookup table joining the two datasets above.
+
+| Column | Type | Description |
+|---|---|---|
+| `Chemical` | chr | Chemical name (primary key) |
+| `OriginalCAS` | int | Original CAS Registry Number (also a key) |
+
+### Data Processing Pipeline (data-raw/envirotox.R)
+The build script (modified from Yanagihara et al. 2024 code) processes `envirotox.xlsx` (three sheets: `test`, `substance`, `taxonomy`):
+
+1. **Filter** records: acute = EC50/LC50 with `Test.type == "A"`; chronic = NOEC/NOEL with `Test.type == "C"`; exclude records where `Effect.is.5X.above.water.solubility == "1"`
+2. **Unit conversion**: mg/L → µg/L (× 1000)
+3. **Geometric mean** per `original.CAS × Latin.name` using `EnvStats::geoMean()`
+4. **Minimum species/group thresholds**: ≥6 species, ≥2 trophic groups per chemical
+5. **Bimodality coefficient** (BC) via `mousetrap::bimodality_coefficient(log10(Conc))` — used for `Yanagihara24` flag
+6. **Group normalisation**: `"Invert"` → `"Invertebrate"`, `"Amphib"` → `"Amphibian"`, all sentence case
+7. **`OriginalCAS` dropped** from acute/chronic before saving; kept only in `envirotox_chemical`
+
+### Integration into ssddata
+When adding envirotox data to `ssddata`:
+
+- **Prefix**: `envirotox` (e.g., `envirotox_acute`, `envirotox_chronic`, `envirotox_chemical`)
+- **Column mapping**: `ssddata` uses `Conc` (µg/L), `Species`, `Group` — matches envirotox columns directly; no `Chemical` column in individual `ssddata` datasets (chemical is encoded in dataset name for per-chemical sets)
+- **Aggregate datasets**: `envirotox_acute` and `envirotox_chronic` are already multi-chemical aggregates; treat like `ccme_data` / `aims_data` — doc files end with `"envirotox_acute"` / `"envirotox_chronic"`
+- **Source file**: place `envirotox.xlsx` in `data-raw/envirotox/` and the build script at `data-raw/envirotox/create_envirotox.R`
+- **References to add to `inst/REFERENCES.bib`**:
+  - `Connors2019` — EnviroTox database paper (doi:10.1002/etc.4382)
+  - `Yanagihara2024` — distribution comparison paper (doi:10.1016/j.ecoenv.2024.116379)
+  - `Iwasaki2025` — model-averaging paper (doi:10.1093/etojnl/vgae060)
+
+### Key Difference from Other ssddata Sources
+Unlike CCME/AIMS/CSIRO datasets (one dataset per chemical), envirotox datasets are pre-aggregated multi-chemical tables with a `Chemical` column acting as the grouping variable. The `Yanagihara24` and `Iwasaki25` logical flags allow subsetting to published benchmark subsets without creating separate datasets.
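
Review note on the pipeline described in `copilot-instructions.md`: steps 2 to 4 and 6 (unit conversion, geometric mean per chemical × species, minimum species/group thresholds, group normalisation) can be sketched in base R as below. This is a minimal illustration, not the actual `data-raw/envirotox.R` script. The toy records are invented, `exp(mean(log(x)))` stands in for `EnvStats::geoMean()`, the species threshold is lowered from 6 to 3 so the toy data can pass, and the record filter (step 1) and bimodality coefficient (step 5) are left out. The names `records`, `Conc_mg_L`, `gm`, `min_species` and `min_groups` are hypothetical.

```r
# Toy records standing in for the filtered EnviroTox `test` sheet.
# Values are invented for illustration only.
records <- data.frame(
  original.CAS = c(50293, 50293, 50293, 50293, 7440508, 7440508),
  Latin.name   = c("Daphnia magna", "Daphnia magna", "Danio rerio",
                   "Lemna minor", "Daphnia magna", "Danio rerio"),
  Group        = c("Invert", "Invert", "Fish", "Plant", "Invert", "Fish"),
  Conc_mg_L    = c(0.001, 0.004, 0.010, 0.500, 0.020, 0.100),
  stringsAsFactors = FALSE
)

# Step 2: convert mg/L to ug/L
records$Conc <- records$Conc_mg_L * 1000

# Step 3: geometric mean per chemical x species
# (base R stand-in for EnvStats::geoMean())
gm <- function(x) exp(mean(log(x)))
agg <- aggregate(Conc ~ original.CAS + Latin.name + Group,
                 data = records, FUN = gm)

# Step 4: keep chemicals with enough species and trophic groups
# (assumption: thresholds lowered from >= 6 species for this toy data)
min_species <- 3
min_groups  <- 2
n_species <- tapply(agg$Latin.name, agg$original.CAS,
                    function(x) length(unique(x)))
n_groups  <- tapply(agg$Group, agg$original.CAS,
                    function(x) length(unique(x)))
keep <- names(n_species)[n_species >= min_species &
                           n_groups[names(n_species)] >= min_groups]
result <- agg[as.character(agg$original.CAS) %in% keep, ]

# Step 6: normalise abbreviated group labels
result$Group[result$Group == "Invert"] <- "Invertebrate"
result$Group[result$Group == "Amphib"] <- "Amphibian"

# Key constraint: one row per chemical x species
stopifnot(!anyDuplicated(result[c("original.CAS", "Latin.name")]))
```

In the toy data the second chemical has only two species, so it is dropped at step 4. The final `stopifnot()` mirrors the `chk::check_key()` constraint quoted in the instructions without depending on `chk`.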
diff --git a/.github/workflows/pkgdown.yaml b/.github/workflows/pkgdown.yaml
@@ -1,8 +1,6 @@
-# Workflow derived from https://github.com/r-lib/actions/tree/v2/examples
-# Need help debugging build failures? Start at https://github.com/r-lib/actions#where-to-find-help
 on:
   push:
-    branches: [main, master]
+    branches: [main, master, dev]
   pull_request:
   release:
     types: [published]

diff --git a/.gitignore b/.gitignore
@@ -1,24 +1,37 @@
-.Rproj.user
-.Rhistory
-.RData
-.Ruserdata
-
-.DS_Store
-.Renviron
-.RProfile
-.httr-oauth
-
-*.html
-*.Rproj
-
-inst/docs/
-
-ignore/
-
-output/
-output[0-9]/
-output-20[0-9][0-9]-[0-9][0-9]*
-
-report/
-ignore/
-docs
+.Rproj.user
+.Rhistory
+.RData
+.Ruserdata
+
+
+.DS_Store
+.Renviron
+.RProfile
+.httr-oauth
+
+*.html
+*.Rproj
+
+inst/docs/
+
+ignore/
+
+output/
+output[0-9]/
+output-20[0-9][0-9]-[0-9][0-9]*
+
+report/
+ignore/
+docs
+
+data-raw/wqbench
+
+data-raw/**/raw/
+data-raw/**/*.sql
+data-raw/**/*.sqlite
+data-raw/**/*.db
+
+data-raw/**/raw/
+.positai
+
+data-raw/anztox/toxicityvalue_combined_clean.csv
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,10 +1,12 @@
 Package: ssddata
 Title: Species Sensitivity Distribution Data
-Version: 1.0.0
+Version: 1.0.0.9000
 Authors@R: c(
     person("Rebecca", "Fisher", , "R.Fisher@aims.gov.au", role = c("aut", "cre")),
     person("Joe", "Thorley", , "joe@poissonconsulting.ca", role = "aut",
            comment = c(ORCID = "0000-0002-7683-4592")),
+    person("Ayla", "Pearson", , "ayla@poissonconsulting.ca", role = "aut",
+           comment = c(ORCID = "0000-0001-7388-1222")),
     person("Carl", "Schwarz", role = "ctb"),
     person("David", "Fox", role = "ctb")
   )
@@ -15,6 +17,7 @@ Description: Reference data sets of species sensitivities to compare the
     as five datasets from anonymous sources. It also includes a data set
     of the results of fitting various distributions using different
     software.
+URL: https://open-aims.github.io/ssddata/, https://github.com/open-AIMS/ssddata
 License: Apache License (== 2.0)
 Depends: 
     R (>= 3.5)
@@ -23,13 +26,27 @@ Imports:
     dplyr,
     Rdpack,
     utils
-Suggests: 
+Suggests:
+    quarto,
+    tibble,
+    knitr,
+    readr,
+    kableExtra,
+    here,
+    tidyverse,
+    rprojroot,
     covr,
-    testthat (>= 3.0.0)
+    pkgdown,
+    testthat (>= 3.0.0),
+    EnvStats,
+    mousetrap,
+    openxlsx,
+    stringr
+VignetteBuilder: quarto
 RdMacros: 
     Rdpack
 Config/testthat/edition: 3
 Encoding: UTF-8
 LazyData: true
 Roxygen: list(markdown = TRUE)
-RoxygenNote: 7.3.2
+RoxygenNote: 7.3.3
diff --git a/NAMESPACE b/NAMESPACE
@@ -1,7 +1,9 @@
 # Generated by roxygen2: do not edit by hand
 
+export(envirotox_data_sets)
 export(get_ssddata)
 export(gm_mean)
+export(list_datasets)
 export(ssd_data_sets)
 import(chk)
 importFrom(Rdpack,reprompt)

diff --git a/R/aims_aluminium_marine.R b/R/aims_aluminium_marine.R
@@ -1,33 +1,34 @@
-#' Species Sensitivity Data for aluminium_marine
-#' 
-#' Species Sensitivity Data provided by the Australian Institute of Marine
-#' Science for aluminium in marine water.
-#' 
-#' These data were sourced from: 
-#'\insertRef{VanDam2018}{ssddata} 
-#'
-#' 
-#' The columns are as follows:
-#' 
-#' \describe{ 
-#'\item{Common}{The species common name (chr).}
-#'\item{Conc}{The chemical concentration in micrograms per Litre (dbl).}
-#'\item{Domain}{Tropical, temperate or other filter (chr).}
-#'\item{Life_stage}{Life stage of the test organism (chr).}
-#'\item{Phylum}{The Phylum name (chr).}
-#'\item{Source}{The endpoint primary data source (chr).}
-#'\item{Species}{The species names name (chr).}
-#'\item{Test_endpoint}{Endpoint statistic, EC10, NEC etc (chr).}
-#'\item{Toxicity_measure}{Type of toxicity measure used (chr).} 
-#' }
-#' 
-#' @name aims_aluminium_marine
-#' @docType data
-#' @format An object of class `tbl_df` (inherits from `tbl`,
-#' `data.frame`) with 20 rows and 9 columns.
-#' @keywords datasets
-#' @examples
-#' 
-#' print(aims_aluminium_marine, n=Inf)
-#' 
-"aims_aluminium_marine"
+#' Species Sensitivity Data for aluminium_marine
+#' 
+#' Species Sensitivity Data provided by the Australian Institute of Marine
+#' Science for \strong{\emph{aluminium}} in marine water.
+#' 
+#' These data were sourced from: 
+#'\insertRef{VanDam2018}{ssddata} 
+#'
+#' 
+#' The columns are as follows:
+#' 
+#' \describe{ 
+#'\item{Common}{The species common name (chr).}
+#'\item{Conc}{The chemical concentration in micrograms per Litre (dbl).}
+#'\item{Domain}{Tropical, temperate or other filter (chr).}
+#'\item{Life_stage}{Life stage of the test organism (chr).}
+#'\item{Phylum}{The Phylum name (chr).}
+#'\item{Source}{The endpoint primary data source (chr).}
+#'\item{Species}{The species name (chr).}
+#'\item{Test_endpoint}{Endpoint statistic, EC10, NEC etc (chr).}
+#'\item{Toxicity_measure}{Type of toxicity measure used (chr).} 
+#' }
+#' 
+#' @name aims_aluminium_marine
+#' @docType data
+#' @format An object of class \code{tbl_df} (inherits from \code{tbl},
+#' \code{data.frame}) with 20 rows and 9 columns.
+#' @keywords datasets
+#' @examples
+#' 
+#' data(aims_aluminium_marine)
+#' print(aims_aluminium_marine, n=Inf)
+#' 
+NULL