New test data for sourmash #2036
pinin4fjords
left a comment
Looks great Daniel. File sizes are down ~99% (~71 MB → ~700 KB), naming matches the branch convention, and the PR body fully documents the genome selection, simulation method, and sample-overlap design.
One small nit (non-blocking): the existing index file is sourmash_24genomes.k31.index.sbt.zip and encodes the k-mer size in the filename, while the new one is sourmash_10genomes.index.sbt.zip. Worth keeping .k31. in there for consistency, but happy to leave it to your call.
Could you also close #2035 once this lands so the trail is clear?
Thanks for noticing that!
I thought it was closed. Will do.
This adds test files for improved sourmash tests for magmap.
Genomes
10 species-representative genomes, each representing a phylum, were selected from release R10-RS226 of the GTDB:
GCA_025449695.1_ASM2544969v1_genomic.fna.gz
GCA_036004205.1_ASM3600420v1_genomic.fna.gz
GCA_036380955.1_ASM3638095v1_genomic.fna.gz
GCA_934196075.1_ERR7738170_bin.192_genomic.fna.gz
GCF_010645065.1_ASM1064506v1_genomic.fna.gz
GCA_028291545.1_ASM2829154v1_genomic.fna.gz
GCA_036270055.1_ASM3627005v1_genomic.fna.gz
GCA_041659285.1_ASM4165928v1_genomic.fna.gz
GCA_944323475.1_BRZ_DT_bin83_genomic.fna.gz
GCF_964229035.1_CJ01B_3.20_genomic.fna.gz
Simulated reads
Fastq files were created from the contig files using randomreads.sh from BBMap with 100,000 reads each.
Three sample files were created from the random read files, sampling only 100 read pairs from each.
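The PR does not say which tool did the subsampling, so the following is only a rough sketch of how a fixed number of read pairs could be drawn while keeping mates together. The helper names `read_fastq` and `sample_read_pairs`, the seed, and the tiny in-memory demo reads are all hypothetical and not taken from the actual workflow.

```python
import io
import random
from itertools import islice


def read_fastq(handle):
    """Yield FASTQ records as 4-line tuples from an open text handle."""
    while True:
        record = tuple(islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def sample_read_pairs(r1_handle, r2_handle, n_pairs, seed=0):
    """Pick n_pairs random read pairs; R1 and R2 mates stay together.

    Loads all pairs into memory, which is acceptable for files of
    roughly 100,000 reads but not for large datasets.
    """
    pairs = list(zip(read_fastq(r1_handle), read_fastq(r2_handle)))
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(pairs, n_pairs)


# Demo on ten made-up read pairs (hypothetical data, not from the PR).
r1 = io.StringIO("".join(f"@read{i}/1\nACGT\n+\nIIII\n" for i in range(10)))
r2 = io.StringIO("".join(f"@read{i}/2\nTGCA\n+\nIIII\n" for i in range(10)))
picked = sample_read_pairs(r1, r2, n_pairs=3, seed=42)
```

For the real files, the handles would come from `gzip.open(path, "rt")`, and `n_pairs` would be 100 per genome read file.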
To make sure the three sample files map to partially, but not completely, overlapping sets of genomes, they were created from the following genome read files:
sourmash_10genomes_sample00_R{1,2}.fastq.gz:
GCA_025449695.1
GCA_028291545.1
GCA_036004205.1
GCA_036270055.1
GCA_036380955.1
GCA_041659285.1
GCA_934196075.1
sourmash_10genomes_sample01_R{1,2}.fastq.gz:
GCA_025449695.1
GCA_028291545.1
GCA_036004205.1
GCA_036270055.1
GCA_944323475.1
GCF_010645065.1
GCF_964229035.1
sourmash_10genomes_sample02_R{1,2}.fastq.gz:
GCA_025449695.1
GCA_028291545.1
GCA_036380955.1
GCA_041659285.1
GCF_010645065.1
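As a quick sanity check of the overlap design, the three lists above can be compared as sets. This is just an illustrative sketch: the accessions are copied from the lists, while the short sample labels and variable names are made up for the example.

```python
from itertools import combinations

# Genome accessions per sample, copied from the lists above.
samples = {
    "sample00": {
        "GCA_025449695.1", "GCA_028291545.1", "GCA_036004205.1",
        "GCA_036270055.1", "GCA_036380955.1", "GCA_041659285.1",
        "GCA_934196075.1",
    },
    "sample01": {
        "GCA_025449695.1", "GCA_028291545.1", "GCA_036004205.1",
        "GCA_036270055.1", "GCA_944323475.1", "GCF_010645065.1",
        "GCF_964229035.1",
    },
    "sample02": {
        "GCA_025449695.1", "GCA_028291545.1", "GCA_036380955.1",
        "GCA_041659285.1", "GCF_010645065.1",
    },
}

# Genomes used by at least one sample, and genomes used by every sample.
all_genomes = set().union(*samples.values())
shared_by_all = set.intersection(*samples.values())

# Number of genomes shared by each pair of samples.
pairwise_shared = {
    (a, b): len(samples[a] & samples[b])
    for a, b in combinations(sorted(samples), 2)
}

print(len(all_genomes), sorted(shared_by_all), pairwise_shared)
```

With these lists, every one of the 10 genomes appears in at least one sample, two genomes appear in all three, and no two samples have identical genome sets.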
The three read files are referred to in samplesheets/sourmash_10genomes_samples.csv.
Sourmash index
A Sourmash index file (sourmash_10genomes.index.sbt.zip) was also made from the 10 genomes.