New test data for sourmash #2036

Merged
erikrikarddaniel merged 3 commits into nf-core:magmap from erikrikarddaniel:magmap on May 6, 2026

@erikrikarddaniel (Member) commented May 5, 2026

This adds test files for improved sourmash tests for magmap.

Genomes

Ten species-representative genomes, each representing a phylum, were selected from GTDB release R10-RS226:

GCA_025449695.1_ASM2544969v1_genomic.fna.gz
GCA_036004205.1_ASM3600420v1_genomic.fna.gz
GCA_036380955.1_ASM3638095v1_genomic.fna.gz
GCA_934196075.1_ERR7738170_bin.192_genomic.fna.gz
GCF_010645065.1_ASM1064506v1_genomic.fna.gz
GCA_028291545.1_ASM2829154v1_genomic.fna.gz
GCA_036270055.1_ASM3627005v1_genomic.fna.gz
GCA_041659285.1_ASM4165928v1_genomic.fna.gz
GCA_944323475.1_BRZ_DT_bin83_genomic.fna.gz
GCF_964229035.1_CJ01B_3.20_genomic.fna.gz

Simulated reads

FASTQ files were simulated from the contig files using randomreads.sh from BBMap, with 100,000 reads each.

Three sample files were created from the simulated read files by sampling only 100 read pairs from each.
To make sure the three samples map to overlapping but not identical sets of genomes, they were created from the following genome read files:

sourmash_10genomes_sample00_R{1,2}.fastq.gz:
GCA_025449695.1
GCA_028291545.1
GCA_036004205.1
GCA_036270055.1
GCA_036380955.1
GCA_041659285.1
GCA_934196075.1

sourmash_10genomes_sample01_R{1,2}.fastq.gz:
GCA_025449695.1
GCA_028291545.1
GCA_036004205.1
GCA_036270055.1
GCA_944323475.1
GCF_010645065.1
GCF_964229035.1

sourmash_10genomes_sample02_R{1,2}.fastq.gz:
GCA_025449695.1
GCA_028291545.1
GCA_036380955.1
GCA_041659285.1
GCF_010645065.1
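
As a quick sanity check of the overlap design, the three lists above can be compared as sets. This is only an illustrative sketch: the dictionary keys are shortened sample names and the variable names are made up for this example; the accessions are copied from the lists above.

```python
from itertools import combinations

# Genome accessions per sample, copied from the lists above.
samples = {
    "sample00": {
        "GCA_025449695.1", "GCA_028291545.1", "GCA_036004205.1",
        "GCA_036270055.1", "GCA_036380955.1", "GCA_041659285.1",
        "GCA_934196075.1",
    },
    "sample01": {
        "GCA_025449695.1", "GCA_028291545.1", "GCA_036004205.1",
        "GCA_036270055.1", "GCA_944323475.1", "GCF_010645065.1",
        "GCF_964229035.1",
    },
    "sample02": {
        "GCA_025449695.1", "GCA_028291545.1", "GCA_036380955.1",
        "GCA_041659285.1", "GCF_010645065.1",
    },
}

# Genomes present in every sample, and genomes present in at least one.
shared_by_all = set.intersection(*samples.values())
covered = set.union(*samples.values())

# Number of genomes shared by each pair of samples.
pairwise = {
    (a, b): len(samples[a] & samples[b])
    for a, b in combinations(sorted(samples), 2)
}

print(sorted(shared_by_all))
print(len(covered))
print(pairwise)
```

With these lists, two genomes (GCA_025449695.1 and GCA_028291545.1) are shared by all three samples, each pair of samples shares three or four genomes, and together the samples cover all ten genomes.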

The three pairs of read files are listed in samplesheets/sourmash_10genomes_samples.csv.
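
The subsampling step described in this section (100 read pairs taken from each genome's simulated reads, then combined per sample) could be done along the following lines. This is a minimal sketch, not the script actually used for this PR: the function names, the fixed seed, and the assumption of unwrapped four-line FASTQ records in gzipped R1/R2 files are all hypothetical.

```python
import gzip
import random
from itertools import islice


def read_fastq(path):
    """Yield FASTQ records as 4-line tuples from a gzipped file.

    Assumes every record is exactly four lines (no wrapped sequences).
    """
    with gzip.open(path, "rt") as handle:
        while True:
            record = tuple(islice(handle, 4))
            if len(record) < 4:
                break
            yield record


def subsample_pairs(r1_path, r2_path, n_pairs=100, seed=0):
    """Return n_pairs randomly chosen (R1, R2) record pairs.

    Mates are zipped together before sampling so they stay paired.
    """
    pairs = list(zip(read_fastq(r1_path), read_fastq(r2_path)))
    rng = random.Random(seed)
    return rng.sample(pairs, min(n_pairs, len(pairs)))


def write_pairs(pairs, out_r1, out_r2):
    """Append the selected pairs to gzipped R1/R2 output files."""
    with gzip.open(out_r1, "at") as f1, gzip.open(out_r2, "at") as f2:
        for rec1, rec2 in pairs:
            f1.writelines(rec1)
            f2.writelines(rec2)
```

Calling `subsample_pairs` once per genome read file and passing each result to `write_pairs` with the same output paths would build up one combined sample file pair. Appending works because concatenated gzip members form a valid gzip file.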

Sourmash index

A sourmash index file (sourmash_10genomes.index.sbt.zip) was also built from the 10 genomes.

@erikrikarddaniel changed the title from "Magmap" to "New test data for sourmash" on May 6, 2026

@pinin4fjords (Member) left a comment

Looks great Daniel. File sizes are down ~99% (~71 MB → ~700 KB), naming matches the branch convention, and the PR body fully documents the genome selection, simulation method, and sample-overlap design.

One small nit (non-blocking): the existing index file is sourmash_24genomes.k31.index.sbt.zip and encodes the k-mer size in the filename, while the new one is sourmash_10genomes.index.sbt.zip. Worth keeping .k31. in there for consistency, but happy to leave it to your call.

Could you also close #2035 once this lands so the trail is clear?

@erikrikarddaniel (Member, Author)

> Looks great Daniel. File sizes are down ~99% (~71 MB → ~700 KB), naming matches the branch convention, and the PR body fully documents the genome selection, simulation method, and sample-overlap design.
>
> One small nit (non-blocking): the existing index file is sourmash_24genomes.k31.index.sbt.zip and encodes the k-mer size in the filename, while the new one is sourmash_10genomes.index.sbt.zip. Worth keeping .k31. in there for consistency, but happy to leave it to your call.

Thanks for noticing that!

> Could you also close #2035 once this lands so the trail is clear?

I thought it was closed. Will do.

@erikrikarddaniel erikrikarddaniel merged commit cb41faf into nf-core:magmap May 6, 2026
1 check passed