22title : " Using Rust typestates for BCF writing"
33publishDate : " 2026-05-08"
44updatedAt : " 2026-05-08"
5- draft : true
65categories :
76 - rust
87 - bioinformatics
98 - type-system
109---
11-
1210If you've ever needed to produce a typed binary format
1311where the header constrains what the body can contain,
1412you've probably written validation code that runs at runtime
@@ -17,24 +15,23 @@ With Rust, you can turn this into compile errors (my favorite!)
1715using typestates and phantom types.
1816These patterns generalize well beyond any single format.
1917I've previously [ mentioned this] [ elegant-apis-in-rust ] 10 years ago in a post,
20- and I still think its such a cool pattern
21- that it deserves another (proper) post.
18+ and I still think it's such a cool pattern
19+ that it deserves another post.
2220
2321[ elegant-apis-in-rust ] : https://deterministic.space/elegant-apis-in-rust.html#session-types
2422
25- The format I'm working with is BCF,
26- the binary encoding of VCF used in genomics.
27- It's the main output of [ Rastair] , our variant caller.
28- I [ wrote previously about seqair] [ seqair-post ] ,
29- our reimplementation of core ` htslib ` functionality in Rust.
30- Writing BCF records through [ ` rust-htslib ` ] was
31- what originally motivated that work,
32- the bindings worked but were (out of the box) inefficient[ ^ forked ] ,
23+ The format I'm working with is VCF (and its binary version BCF).
24+ It's the main output of [ Rastair] , the variant caller I'm working on.
25+ I [ previously wrote about seqair] [ seqair-post ] ,
26+ my reimplementation of core ` htslib ` functionality in Rust.
27+ Writing VCF/BCF records through [ ` rust-htslib ` ] was
28+ what originally motivated that work:
29+ The bindings worked but were (out of the box) inefficient[ ^ forked ] ,
3330and I wanted to see if I could make it both correct and fast.
3431
3532[ ^ forked ] : In Rastair, I used a fork that replaces some ` CString ` usage and it got much faster.
3633
37- [ Rastair ] : https://www. rastair.com/ " Rastair website "
34+ [ Rastair ] : https://deterministic.space/ rastair.html " Notes on Rastair, a variant and methylation caller "
3835[ `rust-htslib` ] : https://github.com/rust-bio/rust-htslib " HTSlib bindings and a high level Rust API for reading and writing BAM files. "
3936[ seqair-post ] : https://deterministic.space/seqair.html " Seqair, a custom htslib reimplementation "
4037
@@ -92,7 +89,7 @@ let gt: FormatKey<Gt> = header.register_format(>_DEF)?;
9289The type parameter (` Scalar<i32> ` , ` Arr<f32> ` , ` Gt ` )
9390flows from the field definition to the key,
9491and it's what makes the rest of the system type-safe.
95- This is the format's actual invariants expressed in the type system.
92+ These are the format's actual invariants expressed in the type system.
9693But the key also carries the BCF dictionary index (a ` u32 ` )
9794and the VCF field name[ ^ smolstr ] ,
9895resolved once and reused for every record.
@@ -193,9 +190,9 @@ without being generic over the encoder.
193190Writing a record walks another state machine.
194191As a diagram, it looks like this:
195192
196- ``` mermaid
193+ ``` mermaid {.wide}
197194flowchart LR
198- Start --> Begun --> Filtered --> Samples --> Done
195+ Start --> Begun --> Filtered --> WithSamples --> Done
199196```
200197
201198The states are zero-sized marker structs,
@@ -383,14 +380,17 @@ Each width has its own sentinel value for "missing":
383380 Why -120 and not -128?
384381 The BCF 2.2 spec reserves the 8 most-negative values of each integer type
385382 for sentinels.
386- Two are currently defined — "missing" and "end of vector" —
383+ Two are currently defined ( "missing" and "end of vector")
387384 and six are reserved for future use.
388385 So for ` INT8 ` , -128 through -121 are off-limits,
389386 leaving ` [-120, 127] ` as the usable range.
387+ (A great use-case for [ niches] !)
388+
389+ [ niches ] : https://deterministic.space/niche-int-types-in-rust.html " Niches for integer types in Rust "
390390
391391An early version of seqair had a bug
392392where the type selection scanned all values including placeholders,
393- and the missing-value sentinel was always ` i32::MIN ` .
393+ and the missing-value sentinel was always set as ` i32::MIN ` .
394394An array like ` [1, 2, MISSING] ` would
395395correctly pick INT8 as the encoding (` 1 ` and ` 2 ` fit),
396396but then always emit ` i32::MIN ` as the sentinel,
@@ -420,7 +420,7 @@ and the tests immediately tell you whether the generated code is right
420420leading to a quick feedback loop.
421421
422422[ ^ layers ] :
423- We have: property tests, cross-validation against reference implementations, round-trip checks, and fuzz targets.
423+ We have: property tests, cross-validation against reference implementations, round-trip checks, and [ fuzz targets] [ fuzz ] .
424424 And what this doesn't catch will hopefully be caught when we run this in Rastair
425425 against many different inputs.
426426
@@ -458,7 +458,7 @@ That was reassuring!
458458
459459I also wrote some benchmarks to make sure that seqair is not way slower
460460than our comparison points ` htslib ` and ` noodles ` .
461- Since this was my initial complained about ` rust-htslib ` ,
461+ Since this was my initial complaint about ` rust-htslib ` ,
462462I had to make sure to pick this up again.
463463Like the experiment with a columnar storage layout for BAM records
464464(see [ previous post] [ seqair-post ] ),
@@ -495,7 +495,8 @@ because that's the entire point of implementing seqair.
495495 Then I added some more features and asked to extend the benchmark again.
496496 But what I totally missed: The ` htslib ` benchmark wrote to a temp file,
497497 while seqair and ` noodles ` wrote to a buffer!
498- Only when I was writing this post did I review the code again and found this bug.
498+ Only when writing this post did I review the code again and found this bug.
499+ The numbers here reflect the fixed version.
499500
500501## What I'd do differently
501502
0 commit comments