Skip to content

Commit d626bb6

Browse files
committed
seqair bcf edits
1 parent 8a5429e commit d626bb6

1 file changed

Lines changed: 22 additions & 21 deletions

File tree

content/posts/2026-05-08-seqair-bcf.md

Lines changed: 22 additions & 21 deletions
Original file line numberDiff line numberDiff line change
@@ -2,13 +2,11 @@
22
title: "Using Rust typestates for BCF writing"
33
publishDate: "2026-05-08"
44
updatedAt: "2026-05-08"
5-
draft: true
65
categories:
76
- rust
87
- bioinformatics
98
- type-system
109
---
11-
1210
If you've ever needed to produce a typed binary format
1311
where the header constrains what the body can contain,
1412
you've probably written validation code that runs at runtime
@@ -17,24 +15,23 @@ With Rust, you can turn this into compile errors (my favorite!)
1715
using typestates and phantom types.
1816
These patterns generalize well beyond any single format.
1917
I've previously [mentioned this][elegant-apis-in-rust] 10 years ago in a post,
20-
and I still think its such a cool pattern
21-
that it deserves another (proper) post.
18+
and I still think it's such a cool pattern
19+
that it deserves another post.
2220

2321
[elegant-apis-in-rust]: https://deterministic.space/elegant-apis-in-rust.html#session-types
2422

25-
The format I'm working with is BCF,
26-
the binary encoding of VCF used in genomics.
27-
It's the main output of [Rastair], our variant caller.
28-
I [wrote previously about seqair][seqair-post],
29-
our reimplementation of core `htslib` functionality in Rust.
30-
Writing BCF records through [`rust-htslib`] was
31-
what originally motivated that work,
32-
the bindings worked but were (out of the box) inefficient[^forked],
23+
The format I'm working with is VCF (and its binary version BCF).
24+
It's the main output of [Rastair], the variant caller I'm working on.
25+
I [previously wrote about seqair][seqair-post],
26+
my reimplementation of core `htslib` functionality in Rust.
27+
Writing VCF/BCF records through [`rust-htslib`] was
28+
what originally motivated that work:
29+
The bindings worked but were (out of the box) inefficient[^forked],
3330
and I wanted to see if I could make it both correct and fast.
3431

3532
[^forked]: In Rastair, I used a fork that replaces some `CString` usage and it got much faster.
3633

37-
[Rastair]: https://www.rastair.com/ "Rastair website"
34+
[Rastair]: https://deterministic.space/rastair.html "Notes on Rastair, a variant and methylation caller"
3835
[`rust-htslib`]: https://github.com/rust-bio/rust-htslib "HTSlib bindings and a high level Rust API for reading and writing BAM files."
3936
[seqair-post]: https://deterministic.space/seqair.html "Seqair, a custom htslib reimplementation"
4037

@@ -92,7 +89,7 @@ let gt: FormatKey<Gt> = header.register_format(&GT_DEF)?;
9289
The type parameter (`Scalar<i32>`, `Arr<f32>`, `Gt`)
9390
flows from the field definition to the key,
9491
and it's what makes the rest of the system type-safe.
95-
This is the format's actual invariants expressed in the type system.
92+
These are the format's actual invariants expressed in the type system.
9693
But the key also carries the BCF dictionary index (a `u32`)
9794
and the VCF field name[^smolstr],
9895
resolved once and reused for every record.
@@ -193,9 +190,9 @@ without being generic over the encoder.
193190
Writing a record walks another state machine.
194191
As a diagram, it looks like this:
195192

196-
```mermaid
193+
```mermaid {.wide}
197194
flowchart LR
198-
Start --> Begun --> Filtered --> Samples --> Done
195+
Start --> Begun --> Filtered --> WithSamples --> Done
199196
```
200197

201198
The states are zero-sized marker structs,
@@ -383,14 +380,17 @@ Each width has its own sentinel value for "missing":
383380
Why -120 and not -128?
384381
The BCF 2.2 spec reserves the 8 most-negative values of each integer type
385382
for sentinels.
386-
Two are currently defined "missing" and "end of vector"
383+
Two are currently defined ("missing" and "end of vector")
387384
and six are reserved for future use.
388385
So for `INT8`, -128 through -121 are off-limits,
389386
leaving `[-120, 127]` as the usable range.
387+
(A great use-case for [niches]!)
388+
389+
[niches]: https://deterministic.space/niche-int-types-in-rust.html "Niches for integer types in Rust"
390390

391391
An early version of seqair had a bug
392392
where the type selection scanned all values including placeholders,
393-
and the missing-value sentinel was always `i32::MIN`.
393+
and the missing-value sentinel was always set as `i32::MIN`.
394394
An array like `[1, 2, MISSING]` would
395395
correctly pick INT8 as the encoding (`1` and `2` fit),
396396
but then always emit `i32::MIN` as the sentinel,
@@ -420,7 +420,7 @@ and the tests immediately tell you whether the generated code is right
420420
leading to a quick feedback loop.
421421

422422
[^layers]:
423-
We have: property tests, cross-validation against reference implementations, round-trip checks, and fuzz targets.
423+
We have: property tests, cross-validation against reference implementations, round-trip checks, and [fuzz targets][fuzz].
424424
And what this doesn't catch will hopefully be caught when we run this in Rastair
425425
against many different inputs.
426426

@@ -458,7 +458,7 @@ That was reassuring!
458458

459459
I also wrote some benchmarks to make sure that seqair is not way slower
460460
than our comparison points `htslib` and `noodles`.
461-
Since this was my initial complained about `rust-htslib`,
461+
Since this was my initial complaint about `rust-htslib`,
462462
I had to make sure to pick this up again.
463463
Like the experiment with a columnar storage layout for BAM records
464464
(see [previous post][seqair-post]),
@@ -495,7 +495,8 @@ because that's the entire point of implementing seqair.
495495
Then I added some more features and asked to extend the benchmark again.
496496
But what I totally missed: The `htslib` benchmark wrote to a temp file,
497497
while seqair and `noodles` wrote to a buffer!
498-
Only when I was writing this post did I review the code again and found this bug.
498+
Only when writing this post did I review the code again and found this bug.
499+
The numbers here reflect the fixed version.
499500

500501
## What I'd do differently
501502

0 commit comments

Comments
 (0)