The split-fastq command performs alignment-free genotyping directly from raw FASTQ reads. Given a set of known SNP markers and a reference genome, it builds diagnostic k-mers for each marker and scans reads for evidence of reference vs. alternate alleles.
Marker TSV + Reference FASTA ──→ Build Diagnostic K-mers ──→ Build Index + Bloom Filter
│
FASTQ Reads ──→ Stream in batches of 8192 ──→ 2-bit encode k-mers ──→ Bloom check ──→ Index lookup
│
Aggregate counts
[ref_count, alt_count]
│
Lineage Classification
For each SNP marker at position P with REF allele and ALT allele:
- Extract flanking sequence from the reference genome
- Build two k-mers of length 31 (default):
- REF k-mer: left_flank + REF + right_flank
- ALT k-mer: left_flank + ALT + right_flank
Reference: ...ACTGACTG[C]TAGCTAGC...
↑ SNP position
REF k-mer: ACTGACTG[C]TAGCTAGCGATCGATCGATCG (k=31)
ALT k-mer: ACTGACTG[T]TAGCTAGCGATCGATCGATCG (k=31)
The flanking region is centered on the variant, with separate right-flank lengths for ref and alt k-mers when allele lengths differ (MNVs).
Indels (insertions/deletions) are intentionally skipped in the FASTQ workflow:
- Short reads spanning repetitive regions (e.g., PE/PPE genes in M. tuberculosis) produce k-mers that match indel signatures at incorrect loci
- This generates false positive variant calls
- Indels are reliably detected only via the FASTA assembly
classifyworkflow, which uses full-length alignment context
All diagnostic k-mers are hashed (2-bit encoding, forward + reverse complement) into an FxHashMap<u64, MarkerHitList>:
enum MarkerHitList {
Single((marker_id, is_alt)), // Most common: one marker per k-mer
Multi(Vec<(marker_id, is_alt)>), // Rare: k-mer collision between markers
}Each marker contributes 4 entries: REF forward, REF revcomp, ALT forward, ALT revcomp.
A 128 KB Bloom filter (2^20 = 1,048,576 bits) acts as a fast pre-check:
- 2 hash functions using multiply-shift:
key × c₁ mod BITSandkey × c₂ mod BITS - False positive rate: ~0.004% with ~32K entries and 2 hash functions
- Fits in L2 cache (128 KB vs 512 KB typical L2/core)
The Bloom filter rejects ~99.9% of non-marker k-mers in ~2 nanoseconds, before they reach the hash table lookup.
FASTQ file ──→ needletail parser ──→ ReadBatch (contiguous buffer)
│
┌─────┴─────┐
│ rayon │
│ chunks │
└─────┬─────┘
│
Thread-local counts
│
┌─────┴─────┐
│ reduce │
└─────┬─────┘
│
Global counts
-
ReadBatch: A contiguous byte buffer that stores reads back-to-back, avoiding per-read
Vec<u8>allocations. Memory is reused across batches viaclear(). -
Batch size: 8,192 reads per batch. Chosen to amortize rayon scheduling overhead while keeping memory bounded.
-
Chunk size: Each rayon chunk processes
max(64, batch_size / (2 × num_threads))reads with thread-local count arrays. -
Per-read scanning:
- Extract all k-mers (k=31) via
bit_kmers()— no canonical normalization because the index already contains both forward and reverse complement entries - Bloom filter check:
bloom_may_contain(bloom, kmer.0)— rejects ~99.9% of k-mers - Hash table lookup: if in Bloom, check exact match in
FxHashMap - Increment
counts[marker_id][is_alt]
- Extract all k-mers (k=31) via
-
Reduction: Thread-local counts are summed into global counts via rayon's
fold + reduce.
- Checked every 8,192 records (atomic flag, near-zero overhead)
AtomicBoolshared across threads for fast stop- Parse errors captured via
Arc<Mutex<Option<String>>>with poison recovery
The classify_split_fastq module takes the raw [ref_count, alt_count] per marker and determines lineages:
- For each marker, determine the called allele (REF or ALT) based on count ratios
- Walk the hierarchical lineage columns (e.g., L4 → L4.9 → L4.9.1)
- Report the deepest resolved lineage
#pos ref alt level1 level2 level3 <empty> annotation
761155 C T L4 L4.9 L4.9.1 gene_name
Lineage columns are read until the first empty cell. Columns after the empty cell are treated as annotations (gene name, drug resistance, etc.).
| Factor | pathotypr | fastlin |
|---|---|---|
| K-mer rejection | Bloom filter (2 ns) | Direct hash lookup |
| Read parallelism | rayon batch 8192, chunks per thread | Per-file parallelism |
| Gzip decompression | zlib-ng (SIMD) | miniz_oxide (pure Rust) |
| Memory allocations | Contiguous ReadBatch buffer | Per-read Vec |
| Sample | pathotypr | fastlin | Speedup |
|---|---|---|---|
| ERR551304 (L2, 158 MB PE) | 1.49 s | 3.23 s | 2.2× |
| ERR552797 (A4, 264 MB PE) | 2.30 s | 4.78 s | 2.1× |
pathotypr uses ~28 MB vs fastlin's ~3 MB. The difference comes from:
- Full reference genome in memory (4.4 MB for MTB)
- Larger binary (needletail, rayon, indicatif dependencies)
- mimalloc thread arenas (~8 MB)