6 changes: 6 additions & 0 deletions README.md
@@ -458,6 +458,12 @@ The earth sciences folder contain subfolders for different data formats encounte
- SRX11780888_chr20.bam.bai index for filtered and trimmed reads from SRX11780888, aligned to human Chr20
- SRX11780887.Aligned.toTranscriptome.out.bam filtered and trimmed reads from SRX11780887, aligned to human Chr20, transcriptomic coordinates
- SRX11780888.Aligned.toTranscriptome.out.bam filtered and trimmed reads from SRX11780888, aligned to human Chr20, transcriptomic coordinates
- dotseq
- counts.tsv.gz: per-ORF count matrix (6642 ORFs x 12 samples; 6 Ribo + 6 RNA) for the chx subset of the GSE231096 cell-cycle cohort. First column is `orf_id`, remaining columns are SRR accessions. Derived from the DOTSeq package's bundled `featureCounts.cell_cycle_subset.txt.gz` by stripping the featureCounts annotation columns and BAM-path sample names.
- annotation.tsv.gz: per-ORF annotation (6642 rows). Columns: `orf_id, gene_id, chrom, start, end, strand, orf_type`. ORF ids match `counts.tsv.gz`; `orf_type` is `mORF`, `uORF`, or `dORF`. Joined from the DOTSeq package's bundled flattened GENCODE v47 GTF + BED.
- metadata.txt.gz: DOTSeq's headerless 24-sample metadata covering both chx and har treatment arms, columns `run strategy replicate treatment condition`. Kept for upstream traceability; not consumed by the nf-test directly.
- samplesheet.csv: chx-only subset of `metadata.txt.gz` with a header row added and the `treatment` column dropped (12 rows: 6 Ribo + 6 RNA). This is the file the nf-test consumes directly.
- README.md: full derivation recipe + verification.
- plastid
- Homo_sapiens.GRCh38.111_chr20_rois.txt: metagene generated from Homo_sapiens.GRCh38.111_chr20.gtf using plastid `metagene generate` command
- SRX11780887_p_offsets.txt: p-site offsets generated from SRX11780887_chr20.bam and Homo_sapiens.GRCh38.111_chr20.gtf using plastid `psite` command
30 changes: 30 additions & 0 deletions data/genomics/homo_sapiens/riboseq_expression/dotseq/README.md
@@ -0,0 +1,30 @@
# Test data for `dotseq/dotseq`

A small cohort of ORF-level Ribo-seq + RNA-seq counts with matched ORF annotation, derived from the `cell_cycle_subset` example data shipped in the [DOTSeq Bioconductor package](https://bioconductor.org/packages/release/bioc/html/DOTSeq.html) (`inst/extdata`).

## Why this set?

The `dotseq/dotseq` module wraps `DOTSeqDataSetsFromSummarizeOverlaps()`, which takes a per-ORF count matrix plus a `GRanges` ORF annotation. These fixtures provide both in tidy TSV form (one ORF per row), keyed on a stable `orf_id`, so the module's R template can rebuild the `GRanges` and run end-to-end. The cohort is known-good: the package author already validated it against the API.

## Files

| File | Size | Description |
|---|---|---|
| `counts.tsv.gz` | 73 KB | Per-ORF count matrix (6642 ORFs x 12 samples). First column is `orf_id`, remaining columns are sample IDs (6 Ribo-seq + 6 RNA-seq). Sample IDs match the `run` column of `samplesheet.csv`. |
| `annotation.tsv.gz` | 72 KB | Per-ORF annotation (6642 rows). Columns: `orf_id, gene_id, chrom, start, end, strand, orf_type`. `orf_type` is one of `mORF`, `uORF`, `dORF`. |
| `samplesheet.csv` | 423 B | 12-sample condition table covering 3 conditions (Mitotic_Cycling, Mitotic_Arrest, Interphase) x 2 strategies (ribo, rna) x 2 replicates. Columns: `run, strategy, replicate, condition`. |
| `metadata.txt.gz` | 211 B | DOTSeq's headerless 24-sample metadata covering both chx + har treatment arms, columns: `run strategy replicate treatment condition`. Kept verbatim from the upstream package for traceability; not consumed by the nf-test directly. |
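
The relationships described in the table (sample IDs in `counts.tsv.gz` matching the `run` column of `samplesheet.csv`, and a balanced conditions x strategies layout) can be checked with a few lines of Python. This is only a sketch, not part of the module or its nf-test: `check_samples` is a hypothetical helper, the samplesheet text is copied from `samplesheet.csv`, and the counts header is rebuilt from the `run` column for illustration because the real column order is not stated here.

```python
import csv
import io
from collections import Counter

# Contents of samplesheet.csv, copied verbatim.
SAMPLESHEET = """run,strategy,replicate,condition
SRR24230462,rna,2,Mitotic_Cycling
SRR24230465,ribo,2,Mitotic_Cycling
SRR24230466,rna,1,Mitotic_Cycling
SRR24230467,ribo,1,Mitotic_Cycling
SRR24230469,ribo,2,Mitotic_Arrest
SRR24230471,ribo,1,Mitotic_Arrest
SRR24230472,rna,1,Mitotic_Arrest
SRR24230474,rna,2,Mitotic_Arrest
SRR24230477,rna,2,Interphase
SRR24230479,rna,1,Interphase
SRR24230480,ribo,1,Interphase
SRR24230482,ribo,2,Interphase
"""


def check_samples(counts_header, samplesheet_rows):
    """Return a list of problems; an empty list means the files agree.

    Hypothetical helper: counts_header is the first row of the count
    matrix, samplesheet_rows is a list of dicts from csv.DictReader.
    """
    problems = []
    if not counts_header or counts_header[0] != "orf_id":
        problems.append("first counts column is not orf_id")
    count_samples = set(counts_header[1:])
    runs = {row["run"] for row in samplesheet_rows}
    if count_samples != runs:
        mismatched = ", ".join(sorted(count_samples ^ runs))
        problems.append("sample IDs differ: " + mismatched)
    # Every (condition, strategy) cell should hold the same number of runs.
    design = Counter((r["condition"], r["strategy"]) for r in samplesheet_rows)
    if len(set(design.values())) > 1:
        problems.append("unbalanced design")
    return problems


rows = list(csv.DictReader(io.StringIO(SAMPLESHEET)))
# Illustrative header only: built from the samplesheet, not read from counts.tsv.gz.
header = ["orf_id"] + [row["run"] for row in rows]
print(check_samples(header, rows))  # an empty list means no problems were found
```

To run the check against the real file, read the first row of `counts.tsv.gz` with `gzip.open(path, "rt")` and `csv.reader(handle, delimiter="\t")` and pass it in place of `header`.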

## How they were derived

The fixtures are derived from `system.file("extdata", package = "DOTSeq")` in the Bioconductor 3.23 release of DOTSeq (v1.0.0), itself sourced from the `compgenom/DOTSeq` GitHub repository at the same tag. Upstream cohort: GSE231096 cell-cycle Ribo-seq + RNA-seq (Ly et al. 2024), restricted to the chx (cycloheximide) treatment arm.

- `counts.tsv.gz`: extracted from `featureCounts.cell_cycle_subset.txt.gz` by dropping the featureCounts header banner and the 5 annotation columns (`Chr, Start, End, Strand, Length`), renaming `Geneid` to `orf_id` (with a within-gene `:O<###>` suffix that matches DOTSeq's internal ORF naming), and stripping the BAM-path sample columns down to their SRR accessions.
- `annotation.tsv.gz`: joined from `gencode.v47.orf_flattened_subset.gtf.gz` (genomic coordinates + `gene_id`) and `gencode.v47.orf_flattened_subset.bed.gz` (`orf_type` per ORF), keyed on `chrom + start + end + strand + gene_id`. ORF ids match `counts.tsv.gz`.
- `samplesheet.csv`: built from `metadata.txt.gz` by adding column headers (`run, strategy, replicate, treatment, condition`) and filtering to `treatment == "chx"` (matching the 12 sample columns the count table actually contains). The redundant `treatment` column is then dropped.
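
The `counts.tsv.gz` and `samplesheet.csv` steps above can be sketched roughly as follows. This is a minimal illustration, not the recipe actually used: the function names are hypothetical, the metadata is assumed to be whitespace-delimited, each BAM path is assumed to contain exactly one `SRR` accession, the `:O<###>` suffix step is omitted, and the demo rows are placeholders rather than real DOTSeq data.

```python
import csv
import io
import re

METADATA_COLUMNS = ["run", "strategy", "replicate", "treatment", "condition"]
FEATURECOUNTS_ANNOTATION = ["Chr", "Start", "End", "Strand", "Length"]


def build_samplesheet(metadata_text, treatment="chx"):
    """Add headers, keep one treatment arm, then drop the `treatment` column."""
    kept = [c for c in METADATA_COLUMNS if c != "treatment"]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(kept)
    for line in metadata_text.splitlines():
        if not line.strip():
            continue
        # Assumption: fields are separated by whitespace (tabs or spaces).
        row = dict(zip(METADATA_COLUMNS, line.split()))
        if row["treatment"] == treatment:
            writer.writerow([row[c] for c in kept])
    return out.getvalue()


def clean_counts(featurecounts_text):
    """Drop the banner and annotation columns, rename Geneid, shorten sample names."""
    lines = [l for l in featurecounts_text.splitlines() if l and not l.startswith("#")]
    reader = csv.DictReader(lines, delimiter="\t")
    samples = [
        c for c in reader.fieldnames
        if c != "Geneid" and c not in FEATURECOUNTS_ANNOTATION
    ]
    # Assumption: every sample column is a BAM path containing one SRR accession.
    short = {c: re.search(r"SRR\d+", c).group(0) for c in samples}
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["orf_id"] + [short[c] for c in samples])
    for row in reader:
        writer.writerow([row["Geneid"]] + [row[c] for c in samples])
    return out.getvalue()


# Placeholder inputs, invented for illustration only.
demo_metadata = "RUN_A rna 1 chx Interphase\nRUN_B ribo 1 har Interphase\n"
demo_counts = (
    "# Program:featureCounts (placeholder banner)\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\t/path/to/SRR0000001.bam\n"
    "GENE1\tchr1\t1\t90\t+\t90\t7\n"
)
print(build_samplesheet(demo_metadata))
print(clean_counts(demo_counts))
```

Keeping the two steps as pure text-to-text functions makes them easy to test on small inputs before pointing them at the gzipped upstream files.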

## Source & licence

[compgenom/DOTSeq](https://github.com/compgenom/DOTSeq), MIT licence. Upstream sample accessions: `SRR24230462` to `SRR24230485` (GSE231096, Ly et al. 2024).

Used by `modules/nf-core/dotseq/dotseq/tests/main.nf.test`.
Binary file not shown.
@@ -0,0 +1,13 @@
run,strategy,replicate,condition
SRR24230462,rna,2,Mitotic_Cycling
SRR24230465,ribo,2,Mitotic_Cycling
SRR24230466,rna,1,Mitotic_Cycling
SRR24230467,ribo,1,Mitotic_Cycling
SRR24230469,ribo,2,Mitotic_Arrest
SRR24230471,ribo,1,Mitotic_Arrest
SRR24230472,rna,1,Mitotic_Arrest
SRR24230474,rna,2,Mitotic_Arrest
SRR24230477,rna,2,Interphase
SRR24230479,rna,1,Interphase
SRR24230480,ribo,1,Interphase
SRR24230482,ribo,2,Interphase