Add ONT BAM and bedMethyl test data: HG002 GIAB 10-read subset (PAW70337) #1969

Open
sahuno wants to merge 1 commit into nf-core:modules from sahuno:add-giab-ont-bam-testdata

Conversation

@sahuno sahuno commented Apr 6, 2026

Summary

Adds 5 test data files derived from the GIAB HG002 ONT run PAW70337 (5kHz, R10.4.1), companion to the pod5 file merged in #1968.

New files

Unaligned BAM — raw dorado basecaller output, no reference:

  • data/genomics/homo_sapiens/nanopore/bam/HG002_PAW70337_giab_10reads.unaligned.bam (178 KB)

Aligned sorted BAM + index — coordinate-sorted alignment to hg38 via minimap2:

  • data/genomics/homo_sapiens/nanopore/bam/HG002_PAW70337_giab_10reads.aligned.sorted.bam (314 KB)
  • data/genomics/homo_sapiens/nanopore/bam/HG002_PAW70337_giab_10reads.aligned.sorted.bam.bai (566 KB)

bedMethyl + tabix index — modkit pileup output (5mCG+5hmCG), bgzipped:

  • data/genomics/homo_sapiens/nanopore/methylation/HG002_PAW70337_giab_10reads.aligned.sorted.bedmethyl.gz (69 KB)
  • data/genomics/homo_sapiens/nanopore/methylation/HG002_PAW70337_giab_10reads.aligned.sorted.bedmethyl.gz.tbi (1.7 KB)
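
As a quick way to confirm that the bedMethyl file carries both modification types, the following is a minimal Python sketch that tallies records per modification code. It assumes the usual modkit bedMethyl layout, in which column 4 holds the modification code (for example `m` or `h`), optionally followed by comma-separated motif details; the function name `count_mod_codes` is hypothetical and not part of this PR.

```python
import gzip
from collections import Counter


def count_mod_codes(path):
    """Count bedMethyl records per modification code.

    Assumption: column 4 holds the code, optionally followed by
    comma-separated motif details (e.g. "m,CG,0").
    """
    counts = Counter()
    # bgzip output is gzip-compatible, so the stdlib gzip module can read it.
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if not line.strip() or line.startswith("#"):
                continue  # skip blank lines and any header lines
            fields = line.split()  # tolerate tabs or spaces between columns
            counts[fields[3].split(",")[0]] += 1
    return counts
```

Calling `count_mod_codes` on the `.bedmethyl.gz` file would be expected to return counts under both `m` and `h` if the pileup covers 5mCG and 5hmCG as described.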

Modules that will use these files

| File | Module(s) |
| --- | --- |
| unaligned.bam | dorado/summary, dorado/trim, dorado/correct |
| aligned.sorted.bam + .bai | modkit/pileup, dorado/basecaller functional test |
| bedmethyl.gz + .tbi | modkit/localize, modkit/localize/plot, modkit/pileup/plot |

Source

s3://ont-open-data/giab_2025.01/flowcells/HG002/PAW70337/pod5/PAW70337_66b2eea5_de8117b1_33.pod5
  • Sample: HG002 (NIST/Genome in a Bottle reference sample, public domain)
  • Chemistry: 5kHz, R10.4.1 (PAW70337, released 2025-01-14)
  • Subset: 10 reads — basecalled with dorado 1.4.0, aligned with minimap2, methylation called with modkit 0.6.1

Test plan

  • BAM opens with samtools view
  • BAI index validates with samtools idxstats
  • bedMethyl decompresses and tabix queries correctly
  • All files used in passing nf-test stub tests for the modules above
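
The checks above rely on samtools and tabix. As a complement (not a replacement), here is a minimal stdlib-only Python sketch that verifies the magic bytes of each fixture type. The magic values come from the SAM/BAM and tabix format specifications; the helper names `read_magic` and `check_fixture` are hypothetical, and the sketch only confirms that a file looks like the right format, not that its contents are valid.

```python
import gzip
from pathlib import Path

BAM_MAGIC = b"BAM\x01"  # first bytes of a BAM after BGZF decompression
BAI_MAGIC = b"BAI\x01"  # BAI indexes are stored uncompressed
TBI_MAGIC = b"TBI\x01"  # tabix indexes are BGZF-compressed


def read_magic(path, compressed):
    """Return the first four bytes, decompressing first if requested."""
    opener = gzip.open if compressed else open
    with opener(path, "rb") as handle:
        return handle.read(4)


def check_fixture(path):
    """Return True if the file's magic bytes match its extension."""
    name = Path(path).name
    try:
        if name.endswith(".bam"):
            return read_magic(path, compressed=True) == BAM_MAGIC
        if name.endswith(".bai"):
            return read_magic(path, compressed=False) == BAI_MAGIC
        if name.endswith(".tbi"):
            return read_magic(path, compressed=True) == TBI_MAGIC
        if name.endswith(".gz"):
            # bgzipped text: confirm that at least one line decompresses.
            with gzip.open(path, "rt") as handle:
                return bool(handle.readline())
    except (gzip.BadGzipFile, EOFError):
        return False  # not gzip-compatible, or truncated
    raise ValueError(f"unrecognised fixture type: {name}")
```

Running `check_fixture` over the five files in this PR gives a fast pre-check before the samtools and tabix steps, which remain the authoritative validation.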

🤖 Generated with Claude Code

…337)

Adds 5 files derived from the GIAB HG002 ONT run PAW70337 (5kHz, R10.4.1),
companion to the pod5 file added in nf-core#1968.

New files:
- nanopore/bam/HG002_PAW70337_giab_10reads.unaligned.bam (178 KB)
  Raw dorado basecaller output, no reference alignment.
  Used by: dorado/summary, dorado/trim, dorado/correct

- nanopore/bam/HG002_PAW70337_giab_10reads.aligned.sorted.bam (314 KB)
- nanopore/bam/HG002_PAW70337_giab_10reads.aligned.sorted.bam.bai (566 KB)
  Coordinate-sorted alignment to hg38 via minimap2.
  Used by: modkit/pileup, modkit/dmr, dorado/basecaller functional test

- nanopore/methylation/HG002_PAW70337_giab_10reads.aligned.sorted.bedmethyl.gz (69 KB)
- nanopore/methylation/HG002_PAW70337_giab_10reads.aligned.sorted.bedmethyl.gz.tbi (1.7 KB)
  modkit pileup output (5mCG+5hmCG), bgzipped + tabix indexed.
  Used by: modkit/localize, modkit/localize/plot, modkit/pileup/plot

Source: s3://ont-open-data/giab_2025.01/flowcells/HG002/PAW70337/pod5/
Sample: HG002 (NIST/Genome in a Bottle), public domain data.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Author

sahuno commented Apr 18, 2026

Hey @dialvarezs 👋, when you have a moment, could you please review this test-data PR?

This PR adds the unaligned/aligned HG002 BAM + bedMethyl test data (same GIAB 10-read subset as #1968, which you approved). It's the test-data dependency for two downstream module PRs:

Thanks!

Member

@dialvarezs dialvarezs left a comment

Hi @sahuno, sorry for the delay on this review.

Looking at the individual files, I think the unaligned BAM and aligned BAM + BAI make sense as reusable fixtures, especially if we want to avoid chaining through GPU-heavy dorado/basecaller steps.

The one I’m less sure about is the pre-generated bedmethyl.gz + .tbi, since that is already a downstream modkit pileup output. Would it be feasible to generate this via setup() from the aligned BAM for the other modkit tests instead? That way, we could preserve some chaining between dependent modules and help catch breaking changes more effectively over time.

@dialvarezs
Member

Also, what do you mean by "dorado/basecaller functional test"?
