Commit 048a0d4

seqair bcf edits
1 parent d626bb6 commit 048a0d4

2 files changed

Lines changed: 16 additions & 15 deletions

content/posts/2026-04-30-seqair.md

Lines changed: 1 addition & 1 deletion
@@ -124,7 +124,7 @@ we first need to write a header,
 which defines all the fields.
 In the binary version, we will refer to them by their ID.
 This header needs to be written in a specific order
-(configs, then filters, then info fields, then format fields, then samples).
+(contigs, then filters, then info fields, then format fields, then samples).
 In the same way,
 records (lines) need to be written in a specific order,
 so that we can stream them directly to an output buffer.

content/posts/2026-05-08-seqair-bcf.md

Lines changed: 15 additions & 14 deletions
@@ -7,6 +7,7 @@ categories:
 - bioinformatics
 - type-system
 ---
+
 If you've ever needed to produce a typed binary format
 where the header constrains what the body can contain,
 you've probably written validation code that runs at runtime
@@ -471,21 +472,22 @@ All these numbers are in "elements per second", so higher is better:
 
 [`criterion`]: https://docs.rs/criterion/0.8.2/criterion/ "A statistics-driven micro-benchmarking library written in Rust."
 
-| format | complexity | rows | htslib | noodles | seqair |
-| ------ | ---------- | ---: | -----: | ------: | -----: |
-| BCF | minimal | 1k | 1.25M | 559k | 2.83M |
-| BCF | minimal | 10k | 1.30M | 564k | 2.82M |
-| BCF | full | 1k | 626k | 321k | 1.47M |
-| BCF | full | 10k | 646k | 321k | 1.50M |
-
-Using [`criterion`], I set up a couple benchmarks[^bench] for different use cases.
-To me, writing semi-complex BCF files was the most interesting one
+| format | complexity | htslib | noodles | seqair |
+| ------ | ---------- | -----: | ------: | -----: |
+| BCF | minimal | 1.30M | 564k | 2.82M |
+| | full | 646k | 321k | 1.50M |
+| VCF | minimal | 2.38M | 793k | 3.99M |
+| | full | 942k | 505k | 1.36M |
+| VCF.GZ | minimal | 1.25M | 667k | 2.41M |
+| | full | 538k | 369k | 760k |
+
+Using [`criterion`], I set up a couple benchmarks for different use cases.
+To me, writing semi-complex VCF and BCF files was the most interesting one
 (that's what Rastair does),
 and that's what the table above shows.
-There are also benchmarks for writing `.vcf` and `.vcf.gz`.
 
 A couple notes for the "all benchmarks are lies" crowd:
-This was run on a MacBook, all implementations write to `/dev/null`,
+This was run on a MacBook, all implementations write to `/dev/null`[^bench],
 and, yes, `htslib` and `noodles` allocate per-row
 because that's the entire point of implementing seqair.
 
@@ -513,9 +515,8 @@ It worked, but the allocation profile was (obviously) worse,
 the API surface was larger,
 and adding the streaming encoder made it redundant.
 I had both implementations for a while before deleting the owned path.
-In hindsight I should have committed to streaming earlier
-and prototyped with real workloads
-before investing in the owned-record API.
+Clearly, I should have committed to streaming
+and prototyped with real workloads earlier.
 
 The phantom type boilerplate is another… choice.
 Every new field type needs a marker enum, an impl block on the key,
