Marker-Based Genotyping from FASTQ Reads

Overview

The split-fastq command performs alignment-free genotyping directly from raw FASTQ reads. Given a set of known SNP markers and a reference genome, it builds diagnostic k-mers for each marker and scans reads for evidence of reference vs. alternate alleles.

Architecture

Marker TSV + Reference FASTA ──→ Build Diagnostic K-mers ──→ Build Index + Bloom Filter
                                                                       │
FASTQ Reads ──→ Stream in batches of 8192 ──→ 2-bit encode k-mers ──→ Bloom check ──→ Index lookup
                                                                                           │
                                                                                    Aggregate counts
                                                                                    [ref_count, alt_count]
                                                                                           │
                                                                                    Lineage Classification

Step 1: Diagnostic K-mer Construction

For each SNP marker at position P with REF allele and ALT allele:

Extract flanking sequence from the reference genome
Build two k-mers of length 31 (default):
- REF k-mer: left_flank + REF + right_flank
- ALT k-mer: left_flank + ALT + right_flank

Reference:  ...ACTGACTG[C]TAGCTAGC...
                        ↑ SNP position

REF k-mer:  ACTGACTG[C]TAGCTAGCGATCGATCGATCG   (k=31)
ALT k-mer:  ACTGACTG[T]TAGCTAGCGATCGATCGATCG   (k=31)

The flanking region is centered on the variant, with separate right-flank lengths for ref and alt k-mers when allele lengths differ (MNVs).

Why No Indels?

Indels (insertions/deletions) are intentionally skipped in the FASTQ workflow:

Short reads spanning repetitive regions (e.g., PE/PPE genes in M. tuberculosis) produce k-mers that match indel signatures at incorrect loci
This generates false positive variant calls
Indels are reliably detected only via the FASTA assembly classify workflow, which uses full-length alignment context

Step 2: Hash Index + Bloom Filter

Hash Index

All diagnostic k-mers are hashed (2-bit encoding, forward + reverse complement) into an FxHashMap<u64, MarkerHitList>:

enum MarkerHitList {
    Single((marker_id, is_alt)),        // Most common: one marker per k-mer
    Multi(Vec<(marker_id, is_alt)>),    // Rare: k-mer collision between markers
}

Each marker contributes 4 entries: REF forward, REF revcomp, ALT forward, ALT revcomp.

Bloom Filter

A 128 KB Bloom filter (2^20 = 1,048,576 bits) acts as a fast pre-check:

2 hash functions using multiply-shift: key × c₁ mod BITS and key × c₂ mod BITS
False positive rate: ~0.004% with ~32K entries and 2 hash functions
Fits in L2 cache (128 KB vs 512 KB typical L2/core)

The Bloom filter rejects ~99.9% of non-marker k-mers in ~2 nanoseconds, before they reach the hash table lookup.

Step 3: FASTQ Scanning

Batched Parallel Processing

FASTQ file ──→ needletail parser ──→ ReadBatch (contiguous buffer)
                                          │
                                    ┌─────┴─────┐
                                    │  rayon     │
                                    │  chunks    │
                                    └─────┬─────┘
                                          │
                                    Thread-local counts
                                          │
                                    ┌─────┴─────┐
                                    │  reduce    │
                                    └─────┬─────┘
                                          │
                                    Global counts

ReadBatch: A contiguous byte buffer that stores reads back-to-back, avoiding per-read Vec<u8> allocations. Memory is reused across batches via clear().
Batch size: 8,192 reads per batch. Chosen to amortize rayon scheduling overhead while keeping memory bounded.
Chunk size: Each rayon chunk processes max(64, batch_size / (2 × num_threads)) reads with thread-local count arrays.
Per-read scanning:
- Extract all k-mers (k=31) via bit_kmers() — no canonical normalization because the index already contains both forward and reverse complement entries
- Bloom filter check: bloom_may_contain(bloom, kmer.0) — rejects ~99.9% of k-mers
- Hash table lookup: if in Bloom, check exact match in FxHashMap
- Increment counts[marker_id][is_alt]
Reduction: Thread-local counts are summed into global counts via rayon's fold + reduce.

Cancellation

Checked every 8,192 records (atomic flag, near-zero overhead)
AtomicBool shared across threads for fast stop
Parse errors captured via Arc<Mutex<Option<String>>> with poison recovery

Step 4: Lineage Classification

The classify_split_fastq module takes the raw [ref_count, alt_count] per marker and determines lineages:

For each marker, determine the called allele (REF or ALT) based on count ratios
Walk the hierarchical lineage columns (e.g., L4 → L4.9 → L4.9.1)
Report the deepest resolved lineage

Marker TSV Format

#pos    ref    alt    level1    level2    level3    <empty>    annotation
761155  C      T      L4        L4.9      L4.9.1              gene_name

Lineage columns are read until the first empty cell. Columns after the empty cell are treated as annotations (gene name, drug resistance, etc.).

Performance

Why pathotypr is ~2× faster than fastlin

Factor	pathotypr	fastlin
K-mer rejection	Bloom filter (2 ns)	Direct hash lookup
Read parallelism	rayon batch 8192, chunks per thread	Per-file parallelism
Gzip decompression	zlib-ng (SIMD)	miniz_oxide (pure Rust)
Memory allocations	Contiguous ReadBatch buffer	Per-read Vec

Benchmarks (real MTB data)

Sample	pathotypr	fastlin	Speedup
ERR551304 (L2, 158 MB PE)	1.49 s	3.23 s	2.2×
ERR552797 (A4, 264 MB PE)	2.30 s	4.78 s	2.1×

Memory Trade-off

pathotypr uses ~28 MB vs fastlin's ~3 MB. The difference comes from:

Full reference genome in memory (4.4 MB for MTB)
Larger binary (needletail, rayon, indicatif dependencies)
mimalloc thread arenas (~8 MB)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Marker-Based Genotyping from FASTQ Reads

Overview

Architecture

Step 1: Diagnostic K-mer Construction

Why No Indels?

Step 2: Hash Index + Bloom Filter

Hash Index

Bloom Filter

Step 3: FASTQ Scanning

Batched Parallel Processing

Cancellation

Step 4: Lineage Classification

Marker TSV Format

Performance

Why pathotypr is ~2× faster than fastlin

Benchmarks (real MTB data)

Memory Trade-off

FilesExpand file tree

marker-genotyping.md

Latest commit

History

marker-genotyping.md

File metadata and controls

Marker-Based Genotyping from FASTQ Reads

Overview

Architecture

Step 1: Diagnostic K-mer Construction

Why No Indels?

Step 2: Hash Index + Bloom Filter

Hash Index

Bloom Filter

Step 3: FASTQ Scanning

Batched Parallel Processing

Cancellation

Step 4: Lineage Classification

Marker TSV Format

Performance

Why pathotypr is ~2× faster than fastlin

Benchmarks (real MTB data)

Memory Trade-off