@@ -7,6 +7,7 @@ categories:
 - bioinformatics
 - type-system
 ---
+
 If you've ever needed to produce a typed binary format
 where the header constrains what the body can contain,
 you've probably written validation code that runs at runtime
@@ -471,21 +472,22 @@ All these numbers are in "elements per second", so higher is better:
 
 [`criterion`]: https://docs.rs/criterion/0.8.2/criterion/ "A statistics-driven micro-benchmarking library written in Rust."
 
-| format | complexity | rows | htslib | noodles | seqair |
-| ------ | ---------- | ---: | -----: | ------: | -----: |
-| BCF | minimal | 1k | 1.25M | 559k | 2.83M |
-| BCF | minimal | 10k | 1.30M | 564k | 2.82M |
-| BCF | full | 1k | 626k | 321k | 1.47M |
-| BCF | full | 10k | 646k | 321k | 1.50M |
-
-Using [`criterion`], I set up a couple benchmarks[^bench] for different use cases.
-To me, writing semi-complex BCF files was the most interesting one
+| format | complexity | htslib | noodles | seqair |
+| ------ | ---------- | -----: | ------: | -----: |
+| BCF | minimal | 1.30M | 564k | 2.82M |
+| | full | 646k | 321k | 1.50M |
+| VCF | minimal | 2.38M | 793k | 3.99M |
+| | full | 942k | 505k | 1.36M |
+| VCF.GZ | minimal | 1.25M | 667k | 2.41M |
+| | full | 538k | 369k | 760k |
+
+Using [`criterion`], I set up a couple benchmarks for different use cases.
+To me, writing semi-complex VCF and BCF files was the most interesting one
 (that's what Rastair does),
 and that's what the table above shows.
-There are also benchmarks for writing `.vcf` and `.vcf.gz`.
 
 A couple notes for the "all benchmarks are lies" crowd:
-This was run on a MacBook, all implementations write to `/dev/null`,
+This was run on a MacBook, all implementations write to `/dev/null`[^bench],
 and, yes, `htslib` and `noodles` allocate per-row
 because that's the entire point of implementing seqair.
 
@@ -513,9 +515,8 @@ It worked, but the allocation profile was (obviously) worse,
 the API surface was larger,
 and adding the streaming encoder made it redundant.
 I had both implementations for a while before deleting the owned path.
-In hindsight I should have committed to streaming earlier
-and prototyped with real workloads
-before investing in the owned-record API.
+Clearly, I should have committed to streaming
+and prototyped with real workloads earlier.
 
 The phantom type boilerplate is another… choice.
 Every new field type needs a marker enum, an impl block on the key,