
Commit abc65e8

feat: use multiple reference sequences for minimizer index generation
Background: Some pathogen datasets have significant genetic diversity that a single reference sequence cannot fully represent. This limits the accuracy of dataset auto-detection when query sequences are distant from the chosen reference. By allowing multiple reference sequences per dataset, the minimizer index can capture broader sequence diversity and improve detection rates.

Implementation:
- Add optional `files.minimizerReferences` field in pathogen.json (array of FASTA file paths)
- New `get_minimizer_refs()` function reads sequences from all listed files, falls back to main reference if field is absent
- `make_ref_search_index()` combines minimizers from all references using set union; uses average length for normalization
- Backward compatible: existing datasets work unchanged

Usage: In pathogen.json, add array of FASTA paths containing representative sequences for the dataset:

```json
{
  "files": {
    "reference": "reference.fasta",
    "minimizerReferences": [
      "clade_a.fasta",
      "clade_b.fasta"
    ]
  }
}
```

Each FASTA file can contain one or more sequences. All sequences across all files contribute minimizers to the dataset's index entry.

Co-Authored-By: Claude <noreply@anthropic.com>
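For readers without the diff at hand, the behaviour described in the message can be sketched roughly as follows. This is an illustration only, not the code from this commit: the names `sketch_get_minimizer_refs`, `sketch_combine_refs`, `read_fasta`, and `toy_minimizers` are hypothetical, the CRC32-based k-mer filter is a stand-in for whatever minimizer hashing the real index uses, and the shape of the returned index entry is assumed.

```python
import json
import zlib
from pathlib import Path


def read_fasta(path):
    """Return the sequences in a FASTA file, in order (headers are discarded)."""
    seqs, chunks, in_record = [], [], False
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if in_record:
                seqs.append("".join(chunks))
            chunks, in_record = [], True
        elif in_record:
            chunks.append(line.upper())
    if in_record:
        seqs.append("".join(chunks))
    return seqs


def sketch_get_minimizer_refs(dataset_dir):
    """Hypothetical: read all sequences listed in files.minimizerReferences,
    falling back to files.reference when the field is absent."""
    dataset_dir = Path(dataset_dir)
    files = json.loads((dataset_dir / "pathogen.json").read_text())["files"]
    paths = files.get("minimizerReferences")
    if paths is None:
        paths = [files["reference"]]
    return [seq for p in paths for seq in read_fasta(dataset_dir / p)]


def toy_minimizers(seq, k=5, cutoff=1 << 28):
    """Toy stand-in for minimizer selection: keep k-mer hashes below a cutoff."""
    kept = set()
    for i in range(len(seq) - k + 1):
        h = zlib.crc32(seq[i:i + k].encode())
        if h < cutoff:
            kept.add(h)
    return kept


def sketch_combine_refs(seqs, k=5, cutoff=1 << 28):
    """Hypothetical: set union of minimizers, average length for normalization."""
    combined = set()
    for seq in seqs:
        combined |= toy_minimizers(seq, k=k, cutoff=cutoff)
    avg_len = sum(len(s) for s in seqs) / len(seqs)
    return {"minimizers": sorted(combined), "length": avg_len}
```

Using a set union means a minimizer shared by several references is stored once, so adding closely related sequences grows the index entry only by the k-mers that are actually new.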
1 parent 5214640 commit abc65e8

3 files changed

Lines changed: 45671 additions & 45609 deletions
