Add ONT BAM and bedMethyl test data: HG002 GIAB 10-read subset (PAW70337) #1969
sahuno wants to merge 1 commit into nf-core:modules
…337)

Adds 5 files derived from the GIAB HG002 ONT run PAW70337 (5kHz, R10.4.1), companion to the pod5 file added in nf-core#1968.

New files:
- nanopore/bam/HG002_PAW70337_giab_10reads.unaligned.bam (178 KB)
  Raw dorado basecaller output, no reference alignment.
  Used by: dorado/summary, dorado/trim, dorado/correct
- nanopore/bam/HG002_PAW70337_giab_10reads.aligned.sorted.bam (314 KB)
- nanopore/bam/HG002_PAW70337_giab_10reads.aligned.sorted.bam.bai (566 KB)
  Coordinate-sorted alignment to hg38 via minimap2.
  Used by: modkit/pileup, modkit/dmr, dorado/basecaller functional test
- nanopore/methylation/HG002_PAW70337_giab_10reads.aligned.sorted.bedmethyl.gz (69 KB)
- nanopore/methylation/HG002_PAW70337_giab_10reads.aligned.sorted.bedmethyl.gz.tbi (1.7 KB)
  modkit pileup output (5mCG+5hmCG), bgzipped + tabix indexed.
  Used by: modkit/localize, modkit/localize/plot, modkit/pileup/plot

Source: s3://ont-open-data/giab_2025.01/flowcells/HG002/PAW70337/pod5/
Sample: HG002 (NIST/Genome in a Bottle), public domain data.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Hey @dialvarezs 👋, when you have a moment, could you please review this data-inclusion PR? It adds the unaligned and aligned HG002 BAMs plus the bedMethyl test data (the same GIAB 10-read subset as #1968, which you approved). It's the test-data dependency for two downstream module PRs:
Thanks!
dialvarezs left a comment
Hi @sahuno, sorry for the delay on this review.
Looking at the individual files, I think the unaligned BAM and aligned BAM + BAI make sense as reusable fixtures, especially if we want to avoid chaining through GPU-heavy dorado/basecaller steps.
The one I’m less sure about is the pre-generated bedmethyl.gz + .tbi, since that is already a downstream modkit pileup output. Would it be feasible to generate this via setup() from the aligned BAM for the other modkit tests instead? That way, we could preserve some chaining between dependent modules and help catch breaking changes more effectively over time.
Also, what do you mean by "dorado/basecaller functional test"?
Summary
Adds 5 test data files derived from the GIAB HG002 ONT run PAW70337 (5kHz, R10.4.1), companion to the pod5 file merged in #1968.
New files
Unaligned BAM (raw dorado basecaller output, no reference):
- data/genomics/homo_sapiens/nanopore/bam/HG002_PAW70337_giab_10reads.unaligned.bam (178 KB)

Aligned sorted BAM + index (coordinate-sorted alignment to hg38 via minimap2):
- data/genomics/homo_sapiens/nanopore/bam/HG002_PAW70337_giab_10reads.aligned.sorted.bam (314 KB)
- data/genomics/homo_sapiens/nanopore/bam/HG002_PAW70337_giab_10reads.aligned.sorted.bam.bai (566 KB)

bedMethyl + tabix index (modkit pileup output for 5mCG+5hmCG, bgzipped):
- data/genomics/homo_sapiens/nanopore/methylation/HG002_PAW70337_giab_10reads.aligned.sorted.bedmethyl.gz (69 KB)
- data/genomics/homo_sapiens/nanopore/methylation/HG002_PAW70337_giab_10reads.aligned.sorted.bedmethyl.gz.tbi (1.7 KB)

Modules that will use these files
| File | Modules |
| --- | --- |
| unaligned.bam | dorado/summary, dorado/trim, dorado/correct |
| aligned.sorted.bam + .bai | modkit/pileup, dorado/basecaller functional test |
| bedmethyl.gz + .tbi | modkit/localize, modkit/localize/plot, modkit/pileup/plot |

Source
Test plan
- samtools view
- samtools idxstats

🤖 Generated with Claude Code
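The samtools checks in the test plan cover the BAM files only. As a possible complement for the bedMethyl file, here is a minimal Python sketch of a sanity check. It is not part of this PR, modkit, or nf-core: the function name `summarize_bedmethyl` is hypothetical, and it assumes only that bgzip output can be read as gzip, that the first four whitespace-separated columns are chrom, start, end, and modification code, and that records are sorted by start within each contiguous chromosome block (as tabix indexing requires).

```python
import gzip
from collections import Counter


def summarize_bedmethyl(path):
    """Return (record_count, mod_code_counts) for a bgzipped bedMethyl file.

    Hypothetical helper for illustration only. Raises ValueError on records
    that are malformed or out of coordinate order.
    """
    counts = Counter()
    current_chrom = None  # chromosome of the previous record
    last_start = 0        # start of the previous record on that chromosome
    n_records = 0
    # bgzip writes concatenated gzip members, which the gzip module can read.
    with gzip.open(path, "rt") as handle:
        for line_no, line in enumerate(handle, start=1):
            if not line.strip() or line.startswith("#"):
                continue  # skip blank lines and header/comment lines
            # split() tolerates both tab- and space-delimited columns
            fields = line.split()
            if len(fields) < 4:
                raise ValueError(f"line {line_no}: expected at least 4 columns")
            chrom, code = fields[0], fields[3]
            start, end = int(fields[1]), int(fields[2])
            if start < 0 or end <= start:
                raise ValueError(f"line {line_no}: invalid interval {start}-{end}")
            if chrom != current_chrom:
                current_chrom, last_start = chrom, 0  # new chromosome block
            if start < last_start:
                raise ValueError(f"line {line_no}: records not sorted by start")
            last_start = start
            counts[code] += 1
            n_records += 1
    return n_records, dict(counts)
```

For a 5mCG+5hmCG pileup one would expect the returned codes to be `m` and `h`, e.g. `summarize_bedmethyl("HG002_PAW70337_giab_10reads.aligned.sorted.bedmethyl.gz")`; the exact counts depend on the file and are not stated here.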