Commit abc65e8
feat: use multiple reference sequences for minimizer index generation
Background:
Some pathogen datasets have significant genetic diversity that a single reference sequence cannot fully represent. This limits the accuracy of dataset auto-detection when query sequences are distant from the chosen reference. By allowing multiple reference sequences per dataset, the minimizer index can capture broader sequence diversity and improve detection rates.
Implementation:
- Add optional `files.minimizerReferences` field in pathogen.json (array of FASTA file paths)
- New `get_minimizer_refs()` function reads sequences from all listed files; falls back to the main reference if the field is absent
- `make_ref_search_index()` combines minimizers from all references using set union; uses the average reference length for normalization
- Backward compatible: existing datasets work unchanged
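The combining step can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: `toy_minimizers()`, `combine_refs()`, the `k`/`w` parameters, the CRC32 hash and the returned dict layout are all assumptions made for the example. It only shows the two ideas named above: a set union of minimizers across references, and the mean reference length as the normalization length.

```python
import zlib


def toy_minimizers(seq: str, k: int = 5, w: int = 4) -> set[int]:
    """Hypothetical stand-in for the real minimizer function.

    Hashes every k-mer with CRC32 and keeps the smallest hash in each
    window of w consecutive k-mers.
    """
    hashes = [zlib.crc32(seq[i:i + k].encode()) for i in range(len(seq) - k + 1)]
    if not hashes:
        return set()
    if len(hashes) < w:
        return {min(hashes)}
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}


def combine_refs(seqs: list[str]) -> dict:
    """Build one index entry from several reference sequences (assumed layout)."""
    if not seqs:
        raise ValueError("at least one reference sequence is required")
    minimizers: set[int] = set()
    for seq in seqs:
        # Set union: a minimizer shared by several references is stored once.
        minimizers |= toy_minimizers(seq)
    # Normalize by the mean reference length rather than any single reference.
    avg_length = round(sum(len(s) for s in seqs) / len(seqs))
    return {"minimizers": sorted(minimizers), "length": avg_length}
```

With a single sequence in `seqs`, the result is the same as indexing that sequence alone, which is the backward-compatible case.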
Usage:
In pathogen.json, add an array of FASTA paths containing representative sequences for the dataset:
```json
{
  "files": {
    "reference": "reference.fasta",
    "minimizerReferences": [
      "clade_a.fasta",
      "clade_b.fasta"
    ]
  }
}
```
Each FASTA file can contain one or more sequences. All sequences across all files contribute minimizers to the dataset's index entry.
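As a rough sketch of how such a configuration could be resolved into sequences: the helper names `read_fasta()` and `collect_minimizer_refs()` below are hypothetical, and the assumption that paths are relative to the directory containing pathogen.json is mine, not stated in the commit. The sketch reads every record from every listed file and uses the main reference when the field is missing.

```python
import json
from pathlib import Path


def read_fasta(path: Path) -> list[tuple[str, str]]:
    """Parse a FASTA file into (name, sequence) pairs; multi-record files are fine."""
    records: list[tuple[str, str]] = []
    name, chunks = None, []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def collect_minimizer_refs(dataset_dir: Path) -> list[tuple[str, str]]:
    """Return all sequences that would contribute minimizers for one dataset."""
    dataset_dir = Path(dataset_dir)
    files = json.loads((dataset_dir / "pathogen.json").read_text()).get("files", {})
    # Assumption: an empty list is treated like an absent field.
    paths = files.get("minimizerReferences") or [files["reference"]]
    records: list[tuple[str, str]] = []
    for rel_path in paths:
        records.extend(read_fasta(dataset_dir / rel_path))
    return records
```

For the example configuration above, this would return every record from `clade_a.fasta` followed by every record from `clade_b.fasta`; `reference.fasta` would only be read if `minimizerReferences` were removed.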
Co-Authored-By: Claude <noreply@anthropic.com>
1 parent 5214640
3 files changed
Lines changed: 45671 additions & 45609 deletions