Add DOTSeq cell_cycle subset test data for modules by pinin4fjords · Pull Request #2072 · nf-core/test-datasets

pinin4fjords · 2026-05-21T22:16:14Z

Summary

Test fixtures for nf-core/modules#11742 (dotseq/dotseq) under data/genomics/homo_sapiens/riboseq_expression/dotseq/.

DOTSeq is a Bioconductor package for differential ORF usage (DOU) + ORF-level differential translation efficiency (DTE). Its API takes a featureCounts-format ORF count table plus a flattened ORF GTF/BED pair, so the existing riboseq_expression/ salmon-gene-level fixtures don't satisfy the input contract. Rather than build new featureCounts + ORF annotations from the existing BAMs (which would require an ORF caller + featureCounts pass), this PR drops in the small cell_cycle_subset bundled with the DOTSeq package - DOTSeq's own example data, already in the exact shape its functions expect.

Files

| File | Size | Notes |
| --- | --- | --- |
| `featureCounts.cell_cycle_subset.txt.gz` | 117 KB | ORF-level featureCounts output, 6644 ORF rows × 12 sample columns (chx-treatment subset of Ly et al. 2024) |
| `gencode.v47.orf_flattened_subset.gtf.gz` | 81 KB | Flattened GENCODE v47 ORF annotation matching the count table, 6945 lines |
| `gencode.v47.orf_flattened_subset.bed.gz` | 53 KB | Matching BED (6642 lines) |
| `metadata.txt.gz` | 211 B | Headerless 24-sample metadata: `run strategy replicate treatment condition` |
| `samplesheet.csv` | 423 B | Headered, chx-only subset of `metadata.txt.gz` (12 rows), derived in this PR so the module can consume it without an extra header step |

Total: 252 KB across 5 files.

Why these specific files

DOTSeq's DOTSeqDataSetsFromFeatureCounts() requires four inputs:

- count_table with featureCounts annotation columns (Geneid, Chr, Start, End, Strand, Length) followed by per-sample counts.
- flattened_gtf with gene_id + exon_number attributes (DOTSeq uses these to name ORFs as gene_id:O###).
- flattened_bed matching the GTF.
- condition_table with run, strategy, replicate, condition columns.
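The input contract above can be checked before anything reaches DOTSeq. The following is a minimal sketch in Python, not part of the module or of DOTSeq: the function names are hypothetical, and it assumes that the sample columns of the count table are named after the `run` values in the condition table, which the PR does not state explicitly.

```python
# Hypothetical pre-flight check for the featureCounts-style inputs.
FEATURECOUNTS_ANNOTATION = ["Geneid", "Chr", "Start", "End", "Strand", "Length"]
CONDITION_COLUMNS = ["run", "strategy", "replicate", "condition"]


def split_count_header(header_line):
    """Return the per-sample column names that follow the annotation columns."""
    columns = header_line.rstrip("\n").split("\t")
    n = len(FEATURECOUNTS_ANNOTATION)
    if columns[:n] != FEATURECOUNTS_ANNOTATION:
        raise ValueError(f"expected leading columns {FEATURECOUNTS_ANNOTATION}, got {columns[:n]}")
    samples = columns[n:]
    if not samples:
        raise ValueError("count table has no sample columns")
    return samples


def unmatched_samples(condition_header, runs, sample_columns):
    """Return sample columns with no matching run (assumes columns are named by run)."""
    missing = [c for c in CONDITION_COLUMNS if c not in condition_header]
    if missing:
        raise ValueError(f"condition table is missing columns: {missing}")
    return sorted(set(sample_columns) - set(runs))


# Illustrative values only; these are not taken from the fixture files.
samples = split_count_header("Geneid\tChr\tStart\tEnd\tStrand\tLength\ts1\ts2\n")
leftover = unmatched_samples(["run", "strategy", "replicate", "condition"], ["s1"], samples)
```

Here `leftover` lists the count columns that the condition table does not describe, which is the mismatch the chx-only samplesheet is meant to avoid.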

The first four files are reused verbatim from DOTSeq's inst/extdata/. samplesheet.csv is the only derived file: metadata.txt.gz ships as a 5-column whitespace table with no header, so the module test would need to add a header before consuming it. To keep the nf-test minimal we ship the headered, chx-filtered version (12 rows matching the 12 sample columns in the featureCounts file) directly.
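The derivation of samplesheet.csv can be sketched as below. This is an illustration in Python rather than the command actually used in the PR; it assumes the column order `run strategy replicate treatment condition` given in the file table and that the treatment label is the literal string `chx`. The function name and example rows are hypothetical.

```python
import csv
import io

# Column order of the headerless metadata table, as listed in the Files table.
METADATA_COLUMNS = ["run", "strategy", "replicate", "treatment", "condition"]
# Header written to the derived samplesheet.
SAMPLESHEET_COLUMNS = ["run", "strategy", "replicate", "condition"]


def derive_samplesheet(metadata_text, treatment="chx"):
    """Filter whitespace-delimited metadata to one treatment and emit a headered CSV."""
    rows = []
    for line in metadata_text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(METADATA_COLUMNS):
            raise ValueError(f"expected {len(METADATA_COLUMNS)} fields, got {len(fields)}: {line!r}")
        record = dict(zip(METADATA_COLUMNS, fields))
        if record["treatment"] == treatment:
            rows.append({c: record[c] for c in SAMPLESHEET_COLUMNS})
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=SAMPLESHEET_COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Made-up rows for illustration; the real file would be read with gzip.open(..., "rt").
example = "runA ribo 1 chx G1\nrunB rna 1 chx G1\nrunC ribo 1 other G1\n"
samplesheet = derive_samplesheet(example)
```

Dropping the `treatment` column after filtering keeps the output aligned with the four condition_table columns listed above.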

Source & licence

Bundled inst/extdata of compgenom/DOTSeq (Bioconductor 3.23 release). Upstream data is from Ly et al. 2024 (GSE231096 / SRR242304XX cell-cycle Ribo-seq + RNA-seq cohort), captured as the DOTSeq author's example dataset. Package licence: MIT.

Test plan

- modules/nf-core/dotseq/dotseq/tests/main.nf.test in nf-core/modules#11742 consumes these fixtures via raw.githubusercontent.com URLs pinned to this PR's branch; the URLs will be repointed to the modules branch after this PR merges.
- nf-core modules test --profile docker dotseq/dotseq runs DOTSeq's DOU + DTE + per-strategy DESeq2 fits + plots end-to-end in ~4 min on a c5.9xlarge against this fixture set.

Bundled inst/extdata files from the DOTSeq Bioconductor package, copied into homo_sapiens/riboseq_expression/dotseq for use by the nf-core/modules dotseq/dotseq nf-test:

- featureCounts.cell_cycle_subset.txt.gz: ORF-level featureCounts output
- gencode.v47.orf_flattened_subset.gtf.gz: flattened ORF annotation
- gencode.v47.orf_flattened_subset.bed.gz: matching BED
- metadata.txt.gz: sample condition table (run/strategy/replicate/treatment/condition)

Source: https://github.com/compgenom/DOTSeq, MIT license.

pinin4fjords · 2026-05-22T07:40:39Z

Companion to nf-core/modules#11742 - the dotseq module nf-test 404s on this samplesheet URL until this PR merges, otherwise green.

Lets CI verify the module is actually green; revert this commit once nf-core/test-datasets#2072 merges and the canonical modules-branch URL resolves.

Replaces the upstream featureCounts table and flattened GTF/BED with a tidy per-ORF count matrix (counts.tsv.gz) and per-ORF annotation TSV (annotation.tsv.gz) so the dotseq module can call DOTSeqDataSetsFromSummarizeOverlaps() directly. ORF identifiers and sample columns are preserved; both files are derived from the same DOTSeq cell_cycle_subset cohort. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Aligns the module's input contract with deltate / anota2seq so that consumers can dispatch between the three ORF-DTE methods without maintaining a separate prep step for dotseq. The four featureCounts/GTF/BED inputs collapse to a per-ORF count matrix (orf_id + sample columns) plus a per-ORF annotation TSV (orf_id + gene_id + optional orf_type/coords). The R template now calls DOTSeqDataSetsFromSummarizeOverlaps() and builds the required GRanges in-process from the annotation TSV; the model fit, contrast tables, and plotDOT outputs are unchanged. Test fixtures updated alongside in nf-core/test-datasets#2072 (commit 8c9b27c). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
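The count-matrix contract described above (a per-ORF count matrix keyed by `orf_id` plus a per-ORF annotation TSV carrying `orf_id` and `gene_id`) lends itself to a simple consistency check. The sketch below is Python written for illustration only; the function names and the ORF identifiers in the example are hypothetical, and the module's real validation lives in its R template.

```python
import csv
import io


def read_tsv(text):
    """Parse tab-separated text into (column names, list of row dicts)."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = list(reader)
    return list(reader.fieldnames or []), rows


def check_orf_inputs(counts_text, annotation_text):
    """Report sample columns and ORF ids present in one table but not the other."""
    count_cols, counts = read_tsv(counts_text)
    annot_cols, annotation = read_tsv(annotation_text)
    if "orf_id" not in count_cols:
        raise ValueError("count matrix needs an orf_id column")
    for required in ("orf_id", "gene_id"):
        if required not in annot_cols:
            raise ValueError(f"annotation table needs a {required} column")
    count_ids = {row["orf_id"] for row in counts}
    annot_ids = {row["orf_id"] for row in annotation}
    return {
        "samples": [c for c in count_cols if c != "orf_id"],
        "missing_annotation": sorted(count_ids - annot_ids),
        "missing_counts": sorted(annot_ids - count_ids),
    }


# Illustrative tables; identifiers are invented, not taken from the fixtures.
counts_text = "orf_id\ts1\ts2\nG1:O001\t5\t7\nG1:O002\t0\t3\n"
annotation_text = "orf_id\tgene_id\nG1:O001\tG1\nG2:O001\tG2\n"
report = check_orf_inputs(counts_text, annotation_text)
```

An empty `missing_annotation` and `missing_counts` would indicate that the two files describe the same set of ORFs.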

* Add dotseq/dotseq module

  DOTSeq is a Bioconductor package for detecting differential ORF usage (DOU) and ORF-level differential translation efficiency (DTE) from Ribo-seq with matched RNA-seq. Module wraps DOTSeqDataSetsFromFeatureCounts + DOTSeq() + getContrasts() and emits per-ORF TSVs for the DOU and DTE interaction contrasts plus the serialised DOTSeqDataSets object.

  Pre-requisites (in flight):
  - Bioconda recipe: bioconda/bioconda-recipes#65677
  - Test data: nf-core/test-datasets#2072

* Use Wave community container for bioconductor-dotseq 1.0.0

  Bioconda recipe (bioconda/bioconda-recipes#65677) merged; biocontainer image is not yet built so swap the placeholder quay.io/depot URLs for a Wave community container built from the now-merged bioconda package. Also widen the singularity guard to include 'apptainer' and add the versions topic block in meta.yml (via nf-core modules lint --fix).

* Add native plotDOT outputs, simplify template, tidyverse syntax

  - Restructure the R template around optparse + readr + dplyr + purrr + ggplot2; drop the homemade parse_args / read_delim_flexible helpers in favour of the standard package idioms and native pipe.
  - Output set is now what DOTSeq itself emits natively: per-ORF DTE contrasts (translation.dotseq.results.tsv), DOU contrasts (dou.dotseq.results.tsv), optional dou_strategy / dte_strategy per-condition Ribo-vs-RNA contrasts, plus the four plotDOT() PNGs (volcano / composite / venn / heatmap) and a DTE p-value distribution histogram drawn directly from DOTSeq's padj column.
  - Container picks up r-eulerr + r-ggsignif (required for plotDOT venn) and explicit r-ggplot2 so the histogram has a stable ggplot version.
  - plotDOT() default of force_new_device=TRUE was killing our png() device on each call; pass FALSE so the PNGs land where Nextflow expects them.

* Simplify R template helpers, add heatmap sorf_type fallback

  - Drop the homemade read_delim_flexible() and write_results_tsv() wrappers in favour of read_tsv() / read_csv() / write_tsv() directly. The earlier to_orf_tibble() conditional is also gone now that we know getContrasts() always returns a frame with orf_id as a column (per the DOTSeq source in posthoc.R + main.R).
  - plotDOT(heatmap) requires gene-paired mORF + sorf entries; try uORF first (the package default) and fall back to dORF when no significant gene has both. tryCatch in safe_plot_dot still makes either a no-op when neither succeeds.

* Address code-review feedback: stub block, validation hardening, plot fallback robustness

  - Add stub: block to main.nf matching the proteus/readproteingroups precedent.
  - Read sample sheet with read_delim() picking comma/tab from the file extension so the meta.yml-advertised TSV variant actually works.
  - Refuse to clobber an existing canonical column (e.g. an existing 'condition' column when --contrast_variable=treatment is supplied).
  - Dedupe multi-lane sample sheets and validate that both Ribo and RNA strategies are present (DOTSeq's interaction design is unestimable otherwise).
  - Add an is_set() predicate that catches NULL / empty stringent + required options before the tri-state switch silently returns NULL.
  - safe_plot_dot now unlinks the partially-written PNG on plotDOT error and returns success so the heatmap fallback (uORF then dORF) keys off whether the first call actually drew, not file.exists() of a stale handle.
  - getContrasts(type='interaction') errors propagate (headline outputs); type='strategy' stays tryCatch'd because absence is legitimate.
  - Cache getDOU(d) / getDTE(d) once and share across contrasts + plotDOT.
  - Drop redundant file.exists() walk - Nextflow's path staging already guarantees the inputs exist.
  - Expand the test to assert volcano / composite / venn plot emission and add a -stub test.

* TEMPORARY: point test at the pending test-datasets PR fork branch

  Lets CI verify the module is actually green; revert this commit once nf-core/test-datasets#2072 merges and the canonical modules-branch URL resolves.

* refactor(dotseq/dotseq): take a count-matrix shape for consumer parity

  Aligns the module's input contract with deltate / anota2seq so that consumers can dispatch between the three ORF-DTE methods without maintaining a separate prep step for dotseq. The four featureCounts/GTF/BED inputs collapse to a per-ORF count matrix (orf_id + sample columns) plus a per-ORF annotation TSV (orf_id + gene_id + optional orf_type/coords). The R template now calls DOTSeqDataSetsFromSummarizeOverlaps() and builds the required GRanges in-process from the annotation TSV; the model fit, contrast tables, and plotDOT outputs are unchanged. Test fixtures updated alongside in nf-core/test-datasets#2072 (commit 8c9b27c).

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(dotseq/dotseq): synthesize `replicate` column when absent

  DOTSeq's parse_condition_table() requires a `replicate` column for stable ordering of samples within strategy+condition. Pipeline samplesheets often have a `pair` column (or none at all), so the R template now treats the column as optional: when present it is renamed to `replicate` as before; when absent the template assigns a per-(strategy, condition) row counter so the model fit is unaffected. This matches how anota2seq/deltate consume the same samplesheets.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(dotseq/dotseq): support running a single module

  Dropping a module from --modules left DOTSeq()'s skipped slot unfitted (a bare DESeqDataSet for DTE), and getContrasts() has no method for it, so a DOU-only run crashed when extracting the DTE interaction table. Gate interaction and strategy contrast extraction on the selected modules, and write each module's interaction table only when that module ran. Mark the translation and dou outputs optional to match, and add a DOU-only regression test.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
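The `replicate` synthesis described in the commit message above can be sketched as follows. This is a Python illustration of the idea, not the module's R template: the function name is hypothetical, and it assumes that an existing `replicate` column is kept, that a `pair` column is renamed when `replicate` is absent, and that the counter starts at 1.

```python
from collections import defaultdict


def ensure_replicate(rows):
    """Return copies of the samplesheet rows, each carrying a replicate value."""
    counters = defaultdict(int)  # running count per (strategy, condition)
    result = []
    for row in rows:
        row = dict(row)  # copy so the caller's rows are left untouched
        if "replicate" not in row:
            if "pair" in row:
                # Assumed behaviour: reuse the pairing column as the replicate.
                row["replicate"] = row.pop("pair")
            else:
                key = (row["strategy"], row["condition"])
                counters[key] += 1
                row["replicate"] = counters[key]
        result.append(row)
    return result


# Made-up samplesheet rows with neither replicate nor pair.
sheet = [
    {"run": "a", "strategy": "ribo", "condition": "x"},
    {"run": "b", "strategy": "ribo", "condition": "x"},
    {"run": "c", "strategy": "rna", "condition": "x"},
]
with_replicates = ensure_replicate(sheet)
```

Because the counter follows row order within each (strategy, condition) group, the assigned values are stable only as long as the samplesheet order is.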

pinin4fjords added 2 commits May 21, 2026 23:15

Add headered samplesheet.csv (chx subset) for DOTSeq tests

1779c8c

Derived from metadata.txt.gz, filtered to treatment=chx samples (which is what the bundled featureCounts table contains) and headered as run,strategy,replicate,condition for direct consumption by the dotseq/dotseq nf-test.

pinin4fjords mentioned this pull request May 21, 2026

Add dotseq/dotseq nf-core/modules#11742

Merged

7 tasks

pinin4fjords added 2 commits May 22, 2026 09:51

Add README for DOTSeq cell_cycle test data

a59fbdb

Document the file set, derivation, and verification of the DOTSeq cell_cycle_subset fixtures, matching the gedi/price test data PR convention.

Document dotseq fixture set in top-level README

e25da10

pinin4fjords added the Ready to review label May 22, 2026

pinin4fjords wants to merge 5 commits into nf-core:modules from pinin4fjords:add-dotseq-testdata
