nf-core · pinin4fjords · May 21, 2026 · May 22, 2026
diff --git a/README.md b/README.md
@@ -458,6 +458,12 @@ The earth sciences folder contain subfolders for different data formats encounte
         - SRX11780888_chr20.bam.bai index for filtered and trimmed reads from SRX11780888, aligned to human Chr20
         - SRX11780887.Aligned.toTranscriptome.out.bam filtered and trimmed reads from SRX11780887, aligned to human Chr20, transcriptomic coordinates
         - SRX11780888.Aligned.toTranscriptome.out.bam filtered and trimmed reads from SRX11780888, aligned to human Chr20, transcriptomic coordinates
+      - dotseq
+        - counts.tsv.gz: per-ORF count matrix (6642 ORFs x 12 samples; 6 Ribo + 6 RNA) for the chx subset of the GSE231096 cell-cycle cohort. First column is `orf_id`, remaining columns are SRR accessions. Derived from the DOTSeq package's bundled `featureCounts.cell_cycle_subset.txt.gz` by stripping the featureCounts annotation columns and BAM-path sample names.
+        - annotation.tsv.gz: per-ORF annotation (6642 rows). Columns: `orf_id, gene_id, chrom, start, end, strand, orf_type`. ORF ids match `counts.tsv.gz`; `orf_type` is `mORF`, `uORF`, or `dORF`. Joined from the DOTSeq package's bundled flattened GENCODE v47 GTF + BED.
+        - metadata.txt.gz: DOTSeq's headerless 24-sample metadata covering both chx and har treatment arms, columns `run strategy replicate treatment condition`. Kept for upstream traceability; not consumed by the nf-test directly.
+        - samplesheet.csv: chx-only subset of `metadata.txt.gz` with column headers added and the `treatment` column dropped (12 rows: 6 Ribo + 6 RNA). This is the file the nf-test consumes directly.
+        - README.md: file descriptions, derivation recipe, and source/licence details.
       - plastid
         - Homo_sapiens.GRCh38.111_chr20_rois.txt: metagene generated from Homo_sapiens.GRCh38.111_chr20.gtf using plastid `metagene generate` command
         - SRX11780887_p_offsets.txt: p-site offsets genereated from SRX11780887_chr20.bam and Homo_sapiens.GRCh38.111_chr20.gtf using plastid `psite` command

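The `counts.tsv.gz` clean-up described in the entry above (drop the featureCounts annotation columns, reduce BAM-path sample names to SRR accessions) could look roughly like the sketch below. This is not the script used for the PR: the function name `clean_featurecounts`, the example BAM paths, and the `gene1` row are made up for illustration, and the within-gene `:O<###>` suffix on `orf_id` is deliberately left out because its exact construction is not shown here.

```python
import csv
import re

# The five annotation columns featureCounts writes between Geneid and the samples.
ANNOTATION_COLS = {"Chr", "Start", "End", "Strand", "Length"}


def clean_featurecounts(text):
    """Return (header, rows) with annotation columns dropped and samples renamed."""
    # featureCounts starts its output with a '#' banner line; skip it.
    lines = [ln for ln in text.splitlines() if ln and not ln.startswith("#")]
    reader = csv.reader(lines, delimiter="\t")
    header = next(reader)
    keep = [i for i, name in enumerate(header) if name not in ANNOTATION_COLS]

    def rename(name):
        if name == "Geneid":
            return "orf_id"
        # Assumption: each sample column is a BAM path containing the SRR accession.
        match = re.search(r"SRR\d+", name)
        return match.group(0) if match else name

    new_header = [rename(header[i]) for i in keep]
    rows = [[row[i] for i in keep] for row in reader]
    return new_header, rows


# Illustrative input only: the paths and the count row are hypothetical.
EXAMPLE = (
    "# featureCounts banner line\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength"
    "\t/path/to/SRR24230462.bam\t/path/to/SRR24230465.bam\n"
    "gene1\tchr20\t100\t200\t+\t101\t5\t7\n"
)

header, rows = clean_featurecounts(EXAMPLE)
print(header)
print(rows)
```

Matching on `SRR\d+` rather than trimming a fixed prefix and suffix keeps the sketch independent of the directory layout in the original BAM paths.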
diff --git a/data/genomics/homo_sapiens/riboseq_expression/dotseq/README.md b/data/genomics/homo_sapiens/riboseq_expression/dotseq/README.md
@@ -0,0 +1,30 @@
+# Test data for `dotseq/dotseq`
+
+A small cohort of ORF-level Ribo-seq + RNA-seq counts with matched ORF annotation, derived from the `cell_cycle_subset` example data shipped in the [DOTSeq Bioconductor package](https://bioconductor.org/packages/release/bioc/html/DOTSeq.html) (`inst/extdata`).
+
+## Why this set?
+
+The `dotseq/dotseq` module wraps `DOTSeqDataSetsFromSummarizeOverlaps()`, which takes a per-ORF count matrix plus a `GRanges` ORF annotation. These fixtures provide both as tidy TSVs (one ORF per row) keyed on a stable `orf_id`, so the module's R template can rebuild the `GRanges` and run end-to-end. The cohort is the package's own example data, which the package author has already validated against the API.
+
+## Files
+
+| File | Size | Description |
+|---|---|---|
+| `counts.tsv.gz` | 73 KB | Per-ORF count matrix (6642 ORFs x 12 samples). First column is `orf_id`, remaining columns are sample IDs (6 Ribo-seq + 6 RNA-seq). Sample IDs match the `run` column of `samplesheet.csv`. |
+| `annotation.tsv.gz` | 72 KB | Per-ORF annotation (6642 rows). Columns: `orf_id, gene_id, chrom, start, end, strand, orf_type`. `orf_type` is one of `mORF`, `uORF`, `dORF`. |
+| `samplesheet.csv` | 423 B | 12-sample condition table covering 3 conditions (Mitotic_Cycling, Mitotic_Arrest, Interphase) x 2 strategies (ribo, rna) x 2 replicates. Columns: `run, strategy, replicate, condition`. |
+| `metadata.txt.gz` | 211 B | DOTSeq's headerless 24-sample metadata covering both chx + har treatment arms, columns: `run strategy replicate treatment condition`. Kept verbatim from the upstream package for traceability; not consumed by the nf-test directly. |
+
+## How they were derived
+
+The fixtures are derived from `system.file("extdata", package = "DOTSeq")` in the Bioconductor 3.23 release of DOTSeq (v1.0.0), itself sourced from the `compgenom/DOTSeq` GitHub repository at the same tag. Upstream cohort: GSE231096 cell-cycle Ribo-seq + RNA-seq (Ly et al. 2024), restricted to the chx (cycloheximide) treatment arm.
+
+- `counts.tsv.gz`: extracted from `featureCounts.cell_cycle_subset.txt.gz` by dropping the featureCounts header banner and the 5 annotation columns (`Chr, Start, End, Strand, Length`), renaming `Geneid` to `orf_id` (with a within-gene `:O<###>` suffix that matches DOTSeq's internal ORF naming), and stripping the BAM-path sample columns down to their SRR accessions.
+- `annotation.tsv.gz`: joined from `gencode.v47.orf_flattened_subset.gtf.gz` (genomic coordinates + `gene_id`) and `gencode.v47.orf_flattened_subset.bed.gz` (`orf_type` per ORF), keyed on `chrom + start + end + strand + gene_id`. ORF ids match `counts.tsv.gz`.
+- `samplesheet.csv`: built from `metadata.txt.gz` by adding column headers (`run, strategy, replicate, treatment, condition`) and filtering to `treatment == "chx"` (matching the 12 sample columns the count table actually contains). The redundant `treatment` column is then dropped.
+
+## Source & licence
+
+[compgenom/DOTSeq](https://github.com/compgenom/DOTSeq), MIT licence. Upstream sample accessions: `SRR24230462` to `SRR24230485` (GSE231096, Ly et al. 2024).
+
+Used by `modules/nf-core/dotseq/dotseq/tests/main.nf.test`.
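The `samplesheet.csv` recipe in the README above (add headers, keep `treatment == "chx"`, drop `treatment`) can be sketched in a few lines of standard-library Python. This is an illustration, not the script used for the PR: `build_samplesheet` is a hypothetical name, the metadata is assumed to be whitespace-delimited, and the `har` row in the example is invented to show the filter.

```python
import csv
import io

# Column order given in the README for the headerless metadata file.
COLUMNS = ["run", "strategy", "replicate", "treatment", "condition"]


def build_samplesheet(metadata_text, treatment="chx"):
    """Return CSV text for one treatment arm, without the treatment column."""
    kept = [c for c in COLUMNS if c != "treatment"]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(kept)
    for line in metadata_text.splitlines():
        # Assumption: fields are separated by whitespace (tabs or spaces).
        fields = line.split()
        if len(fields) != len(COLUMNS):
            continue  # skip blank or malformed lines
        record = dict(zip(COLUMNS, fields))
        if record["treatment"] == treatment:
            writer.writerow([record[c] for c in kept])
    return out.getvalue()


# The first row matches samplesheet.csv; the second is a made-up har sample.
EXAMPLE = (
    "SRR24230462 rna 2 chx Mitotic_Cycling\n"
    "SRR00000001 ribo 1 har Interphase\n"
)

sheet = build_samplesheet(EXAMPLE)
print(sheet)
```

To run it on the real file, the text could be read with `gzip.open("metadata.txt.gz", "rt").read()` and passed to the same function.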
diff --git a/data/genomics/homo_sapiens/riboseq_expression/dotseq/annotation.tsv.gz b/data/genomics/homo_sapiens/riboseq_expression/dotseq/annotation.tsv.gz
diff --git a/data/genomics/homo_sapiens/riboseq_expression/dotseq/annotation_non_morf.tsv.gz b/data/genomics/homo_sapiens/riboseq_expression/dotseq/annotation_non_morf.tsv.gz
diff --git a/data/genomics/homo_sapiens/riboseq_expression/dotseq/counts.tsv.gz b/data/genomics/homo_sapiens/riboseq_expression/dotseq/counts.tsv.gz
diff --git a/data/genomics/homo_sapiens/riboseq_expression/dotseq/metadata.txt.gz b/data/genomics/homo_sapiens/riboseq_expression/dotseq/metadata.txt.gz
diff --git a/data/genomics/homo_sapiens/riboseq_expression/dotseq/samplesheet.csv b/data/genomics/homo_sapiens/riboseq_expression/dotseq/samplesheet.csv
@@ -0,0 +1,13 @@
+run,strategy,replicate,condition
+SRR24230462,rna,2,Mitotic_Cycling
+SRR24230465,ribo,2,Mitotic_Cycling
+SRR24230466,rna,1,Mitotic_Cycling
+SRR24230467,ribo,1,Mitotic_Cycling
+SRR24230469,ribo,2,Mitotic_Arrest
+SRR24230471,ribo,1,Mitotic_Arrest
+SRR24230472,rna,1,Mitotic_Arrest
+SRR24230474,rna,2,Mitotic_Arrest
+SRR24230477,rna,2,Interphase
+SRR24230479,rna,1,Interphase
+SRR24230480,ribo,1,Interphase
+SRR24230482,ribo,2,Interphase
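A quick sanity check of the design encoded in `samplesheet.csv` is sketched below, with the twelve rows copied from the hunk above. The helper names `summarise` and `runs_match` are hypothetical, and the check against the count-table header only assumes that `orf_id` is the first column, as the README states.

```python
import csv
import io
from collections import Counter, defaultdict

# Rows copied verbatim from samplesheet.csv in this diff.
SAMPLESHEET = """run,strategy,replicate,condition
SRR24230462,rna,2,Mitotic_Cycling
SRR24230465,ribo,2,Mitotic_Cycling
SRR24230466,rna,1,Mitotic_Cycling
SRR24230467,ribo,1,Mitotic_Cycling
SRR24230469,ribo,2,Mitotic_Arrest
SRR24230471,ribo,1,Mitotic_Arrest
SRR24230472,rna,1,Mitotic_Arrest
SRR24230474,rna,2,Mitotic_Arrest
SRR24230477,rna,2,Interphase
SRR24230479,rna,1,Interphase
SRR24230480,ribo,1,Interphase
SRR24230482,ribo,2,Interphase
"""


def summarise(text):
    """Count samples per strategy and collect replicates per design cell."""
    rows = list(csv.DictReader(io.StringIO(text)))
    per_strategy = Counter(r["strategy"] for r in rows)
    replicates = defaultdict(set)
    for r in rows:
        replicates[(r["condition"], r["strategy"])].add(r["replicate"])
    return rows, per_strategy, dict(replicates)


def runs_match(counts_header, rows):
    """True if the count-table sample columns equal the samplesheet runs."""
    # counts_header[0] is expected to be 'orf_id'; the rest are sample IDs.
    return set(counts_header[1:]) == {r["run"] for r in rows}


rows, per_strategy, replicates = summarise(SAMPLESHEET)
print(len(rows), dict(per_strategy))
print(all(reps == {"1", "2"} for reps in replicates.values()))
```

`runs_match` would take the header row of `counts.tsv.gz` (for example, the first line split on tabs) together with `rows`.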