|
| 1 | +# PurlValidator data structure evaluation |
| 2 | + |
| 3 | +This document details the research and evaluation of various efficient data |
| 4 | +structures for compact PURL storage and lookup. |
| 5 | + |
| 6 | +It contains: |
| 7 | + |
| 8 | +- references to the evaluation and benchmark scripts |
| 9 | +- documentation on the various libraries and data structures under consideration |
| 10 | +- the final choice (spoiler: an FST, a.k.a. finite state transducer) |
| 11 | + |
| 12 | + |
| 13 | +## Context and Problem |
| 14 | + |
| 15 | +PurlValidator needs a local queryable dataset of known PURLs to answer one question: |
| 16 | + |
| 17 | +> Does this PURL exist in the reference dataset? |
| 18 | + |
| 19 | +The lookup index should be built for each release and shipped with the library |
| 20 | +for access without a network connection. We want Go, Rust, and Python |
| 21 | +implementations. The PURLs themselves are collected using PurlDB and FederatedCode. |
| 22 | + |
| 23 | + |
| 24 | +## Solution |
| 25 | + |
| 26 | +### High level design |
| 27 | + |
| 28 | +The lookup key is a PURL, cleaned to keep only the type, namespace, and name |
| 29 | +(without version, qualifiers, and subpath). |
| 30 | + |
| 31 | +This keeps validation focused for now. Version validation could come later by |
| 32 | +extending indexed PURLs with versions or by baking in support for VERS version |
| 33 | +parsing for validation. |
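As a rough illustration of the cleaning step, here is a minimal Python sketch. The `base_purl` helper is hypothetical and is not the validator's actual code. It assumes a canonical PURL in which `@`, `?`, `#` and `/` are percent-encoded inside components; a real implementation would more likely rely on a PURL parsing library.

```python
def base_purl(purl: str) -> str:
    """Reduce a PURL to pkg:type/namespace/name (hypothetical helper).

    Assumes a canonical PURL: special characters inside components are
    percent-encoded, so '#', '?' and '@' only appear as separators.
    """
    # Drop the subpath (after '#'), then the qualifiers (after '?').
    remainder = purl.split("#", 1)[0]
    remainder = remainder.split("?", 1)[0]
    # The version separator can only appear in the last path segment.
    head, sep, last = remainder.rpartition("/")
    name = last.split("@", 1)[0]
    return f"{head}{sep}{name}"


print(base_purl("pkg:npm/%40angular/core@16.0.0?arch=x64#src"))
```

Both `pkg:pypi/django@4.2` and `pkg:pypi/django` reduce to the same key under this sketch, which is what makes the lookup version-agnostic.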
| 34 | + |
| 35 | +### Solution elements: Data structures considered |
| 36 | + |
| 37 | +- Built-in set and map |
| 38 | +- FST |
| 39 | +- DAWG |
| 40 | +- Bloom filter |
| 41 | +- SQLite |
| 42 | + |
| 43 | +Considered but not evaluated: |
| 44 | + |
| 45 | +- Minimal perfect hash: no compression |
| 46 | +- Trie or radix tree: DAWG and FST are similar, but are more compact. Suffix |
| 47 | + trees are way too big. |
| 48 | + |
| 49 | +#### Built-in set and map |
| 50 | + |
| 51 | +Built-in sets and maps are the simplest baseline in each language. They are as |
| 52 | +fast as can be, but they have no compression and no built-in serialization or |
| 53 | +memory mapping, and memory use grows quickly for large datasets. |
| 54 | + |
| 55 | +An interesting path could be to use built-in sets in Rust and Go, generating the |
| 56 | +code with all the PURL strings so that no specific deserialization is needed. The |
| 57 | +problem there is the size, as the data is not compressed. |
| 58 | + |
| 59 | +Built-in structures are useful as a benchmark reference but are not suitable |
| 60 | +as the main packaged data structure because they are too big. |
| 61 | + |
| 62 | + |
| 63 | +#### FST: finite state transducer |
| 64 | + |
| 65 | +<https://en.wikipedia.org/wiki/Finite-state_transducer> |
| 66 | + |
| 67 | +An FST stores a sorted set of strings in a compact automaton. PURLs share common |
| 68 | +prefixes such as `pkg:npm/`, `pkg:pypi/`, and `pkg:maven/`. This sharing helps |
| 69 | +reduce stored data. |
| 70 | + |
| 71 | +FST lookup is exact for this use case. The Rust and Go implementations already |
| 72 | +ship an FST file. The library opens or embeds that file and performs membership |
| 73 | +checks without rebuilding the index. |
| 74 | + |
| 75 | +The main cost is build complexity. Input must be prepared, sorted, and encoded |
| 76 | +when the package data is refreshed. |
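The preparation step can be sketched in a few lines of Python. This is only an illustration under the assumption that the FST builder expects unique keys in lexicographic byte order, as the Rust fst crate does; `prepare_keys` is a hypothetical name and the actual build scripts may differ.

```python
def prepare_keys(purls):
    """Return unique, UTF-8 encoded keys in lexicographic byte order.

    Hypothetical helper: strips whitespace, skips empty lines, removes
    duplicates, and sorts on the encoded bytes rather than on str values.
    """
    return sorted({p.strip().encode("utf-8") for p in purls if p.strip()})


lines = ["pkg:pypi/django\n", "pkg:npm/lodash\n", "\n", "pkg:pypi/django\n"]
print(prepare_keys(lines))
```

Sorting the encoded bytes rather than the strings keeps the order identical to what a byte-oriented builder sees.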
| 77 | + |
| 78 | + |
| 79 | +#### DAWG: directed acyclic word graph |
| 80 | + |
| 81 | +See <https://stevehanov.ca/blog/compressing-dictionaries-with-a-dawg> |
| 82 | + |
| 83 | +A DAWG is also known as a DAFSA: |
| 84 | +<https://en.wikipedia.org/wiki/Deterministic_acyclic_finite_state_automaton> |
| 85 | + |
| 86 | +A DAWG is a compact data structure for a set of strings. It can merge repeated |
| 87 | +prefixes and suffixes like an FST. The DAWG is interesting in that it can |
| 88 | +support prefix lookup, but in general the DAWG is bigger and slower than an FST, |
| 89 | +and has fewer mature, maintained libraries. |
| 90 | + |
| 91 | + |
| 92 | +#### Bloom filter |
| 93 | + |
| 94 | +<https://en.wikipedia.org/wiki/Bloom_filter> |
| 95 | + |
| 96 | +A Bloom filter can store a large set in a small space, but it is a probabilistic |
| 97 | +structure: it can only answer that a value is surely absent or maybe present. In |
| 98 | +the latter case, you need an extra full dataset to confirm the "maybe". This |
| 99 | +is the problem of false positives with these filters; hence a Bloom filter |
| 100 | +cannot be used as the only lookup structure, and does not make sense here. |
| 101 | +Instead, a Bloom filter could be used before an exact structure to skip some |
| 102 | +exact lookups as a performance optimization, but outside of the validator. |
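To make the "surely absent or maybe present" behavior concrete, here is a toy Bloom filter in Python. It is a sketch for illustration only: `TinyBloom`, `exists`, and the sizing parameters are made up here and were not part of the evaluation.

```python
import hashlib


class TinyBloom:
    """Toy Bloom filter: no false negatives, possible false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, value):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        # False means surely absent; True only means maybe present.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(value)
        )


def exists(purl, bloom, exact):
    """Two-stage check: cheap filter first, exact structure to confirm."""
    return bloom.might_contain(purl) and purl in exact


known = {"pkg:pypi/django", "pkg:npm/lodash"}
bloom = TinyBloom()
for p in known:
    bloom.add(p)
print(exists("pkg:pypi/django", bloom, known))
```

The `exists` function shows why the filter alone is not enough: a `True` from `might_contain` still has to be confirmed against an exact structure.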
| 103 | + |
| 104 | + |
| 105 | +#### SQLite |
| 106 | + |
| 107 | +<https://sqlite.org/> |
| 108 | + |
| 109 | +SQLite can store PURLs in a SQL table with an index for exact lookup. |
| 110 | + |
| 111 | +The tradeoff is operational weight. Each SQLite language binding adds a |
| 112 | +dependency (though it is built into Python). The validator only needs immutable |
| 113 | +membership checks, not the full power of SQL with queries and update |
| 114 | +transactions. On the other hand, the SQLite DB could be the same across all languages. |
| 115 | + |
| 116 | +SQLite could be useful as a benchmark and debugging format. It is not the first |
| 117 | +choice for a small language library because it is not compressed. But it will |
| 118 | +be a future enhancement for sure. |
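For reference, an exact-lookup table can be sketched with the Python standard library `sqlite3` module. The table name, column name, and helper functions below are assumptions made for this example and do not describe the benchmark script.

```python
import sqlite3


def build_db(purls, path=":memory:"):
    """Create a single-column table whose primary key is the lookup index."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE purls (purl TEXT PRIMARY KEY) WITHOUT ROWID")
    conn.executemany(
        "INSERT OR IGNORE INTO purls (purl) VALUES (?)",
        ((p,) for p in purls),
    )
    conn.commit()
    return conn


def purl_exists(conn, purl):
    """Exact membership check through the primary key index."""
    row = conn.execute(
        "SELECT 1 FROM purls WHERE purl = ? LIMIT 1", (purl,)
    ).fetchone()
    return row is not None


conn = build_db(["pkg:pypi/django", "pkg:npm/lodash"])
print(purl_exists(conn, "pkg:pypi/django"))
```

The same database file could in principle be opened from Rust or Go through their own SQLite bindings, which is the cross-language benefit mentioned above.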
| 119 | + |
| 120 | + |
| 121 | +### Preferred solution: FST |
| 122 | + |
| 123 | +Based on the benchmarks and other criteria, let's use an FST-backed lookup for |
| 124 | +every language. Do not use a Bloom filter (probabilistic). Do not use native |
| 125 | +structures that use too much memory. |
| 126 | + |
| 127 | +For the library selection, we have these high-level requirements: |
| 128 | + |
| 129 | +- We want exact results without false positives, e.g., no Bloom filter. |
| 130 | +- Offline use with no network is a must: the dataset must be bundled in the |
| 131 | + releases. |
| 132 | +- With build-time index construction, the construction time is not critical. |
| 133 | +- The bundled index should be small enough to ship below the crates.io and PyPI |
| 134 | + archive size limits. |
| 135 | +- No rebuild at startup/runtime, and fast enough load time from disk, ideally |
| 136 | + memory-mapped. |
| 137 | +- Fast enough lookup. |
| 138 | +- Libraries should be maintained, active FOSS for Rust/Go/Python. |
| 139 | + |
| 140 | +The final selected FST libraries are: |
| 141 | + |
| 142 | +- Rust: fst crate with a memory-mapped set <https://github.com/BurntSushi/fst/> |
| 143 | +- Python: ducer with a memory-mapped map, dict-like |
| 144 | + <https://github.com/jfolz/ducer> (ducer uses the Rust fst crate inside) |
| 145 | +- Go: vellum "fst" module (originally from |
| 146 | + <https://github.com/couchbase/vellum>, now at |
| 147 | + <https://github.com/blevesearch/vellum>), which is largely inspired by the |
| 148 | + Rust fst crate |
| 149 | + |
| 150 | + |
| 151 | +## Appendix: Benchmarks |
| 152 | + |
| 153 | +This directory contains evaluation and benchmark files for PurlValidator. |
| 154 | + |
| 155 | +It compares structures for offline PURL membership checks. The validator |
| 156 | +implementations use: |
| 157 | + |
| 158 | +- Python: memory-mapped `ducer`. |
| 159 | +- Rust: crate `fst`. |
| 160 | +- Go: embedded Vellum FST. |
| 161 | + |
| 162 | +The comparison also covers the built-in Python set and dict, SQLite, and a Rust DAWG. |
| 163 | + |
| 164 | +### Expected checkout layout |
| 165 | + |
| 166 | +Run the scripts from a directory containing checkouts of these repositories: |
| 167 | + |
| 168 | +- `/purl-validator` |
| 169 | +- `/purl-validator.rs` |
| 170 | +- `/purlvalidator-go` |
| 171 | + |
| 172 | +### Benchmarking FST vs. DAWG |
| 173 | + |
| 174 | +There is a good benchmark in Go comparing FST and DAWG data structures (and |
| 175 | +other structures) that highlights why an FST is a better structure for our case |
| 176 | +than a DAWG: |
| 177 | + |
| 178 | +<https://github.com/timurgarif/go-fsa-trie-bench> |
| 179 | + |
| 180 | +We also ran a simple synthetic benchmark of the Rust fst and dawg crates on |
| 181 | +actual base PURLs, using the data in |
| 182 | +<https://github.com/aboutcode-org/purl-validator.rs/tree/main/fst_builder/data> |
| 183 | + |
| 184 | +The `etc/bench/rust-fst-dawg-bench` code compares these fst and dawg crates. |
| 185 | + |
| 186 | +The dataset has 2,324,119 unique sorted base PURLs. The benchmark |
| 187 | +runs 1M queries, of which 500K are expected to fail. |
| 188 | + |
| 189 | +- The fst crate index was built in 11s, with a 26MB serialized file, and took |
| 190 | + 0.703s for 1M lookups. |
| 191 | +- The dawg crate index was built in 18s, with an 831MB serialized file, and took |
| 192 | + 28s for 1M lookups. |
| 193 | + |
| 194 | +The outcome is that the preferred structure is an FST over a DAWG (at least |
| 195 | +with these implementations). |
| 196 | + |
| 197 | +### Benchmarking FST against built-ins and SQLite |
| 198 | + |
| 199 | +Since we picked the FST as the winner, additional review focused on |
| 200 | +Python, comparing the ducer FST library against other approaches. Since ducer is |
| 201 | +based on the Rust fst crate and Go's vellum is also based on the fst design, we cover |
| 202 | +essentially the three languages at once. |
| 203 | + |
| 204 | +The `etc/scripts/bench/alternative_benchmark.py` script compares Python lookup |
| 205 | +using a text file with one PURL per line for these candidates: |
| 206 | + |
| 207 | +- Python `set`. |
| 208 | +- Python `dict`. |
| 209 | +- Python sorted list plus `bisect`. |
| 210 | +- In-memory SQLite. |
| 211 | +- FST using a `ducer.Map`. |
| 212 | + |
| 213 | +Data is from `purl-validator.rs/fst_builder/data/`. |
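Among the candidates listed above, the sorted list plus `bisect` one is the least obvious, so here is a minimal sketch of that membership test. The `contains` helper is hypothetical; the benchmark script may implement it differently.

```python
from bisect import bisect_left


def contains(sorted_purls, purl):
    """Binary-search membership test on an already sorted list."""
    i = bisect_left(sorted_purls, purl)
    return i < len(sorted_purls) and sorted_purls[i] == purl


purls = sorted(["pkg:pypi/django", "pkg:npm/lodash", "pkg:cargo/fst"])
print(contains(purls, "pkg:npm/lodash"))
```

Each lookup costs O(log n) string comparisons, which helps explain why this candidate is slower than the hash-based `set` and `dict` in the results.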
| 214 | + |
| 215 | +Results with 2,324,119 unique PURLs and 1M lookup queries, 500K existing PURLs: |
| 216 | + |
| 217 | +```text |
| 218 | +structure            build (secs) lookup (secs)  storage size |
| 219 | +-------------------- ------------ -------------- --------------------------- |
| 220 | +python set           0.206540     0.275906       304MB in RAM |
| 221 | +python dict          0.449625     0.429034       298MB in RAM |
| 222 | +ducer FST            3.700943     1.805585       26MB on disk |
| 223 | +sorted list+bisect   0.017540     2.783555       236MB in RAM |
| 224 | +sqlite in memory     4.855480     4.220032       207MB on disk (or 65MB with zstd) |
| 225 | +``` |
| 226 | + |
| 227 | +### Benchmarking FST in Python vs. Go vs. Rust |
| 228 | + |
| 229 | +This benchmark runs each of the three released validator implementations. The |
| 230 | +script is in `etc/scripts/bench/go-rust-py_benchmark.py`. |
| 231 | + |
| 232 | +Data is from `purl-validator.rs/fst_builder/data/`. |
| 233 | + |
| 234 | +Results with 2,324,119 unique PURLs and 1M lookup queries, 500K existing PURLs: |
| 235 | + |
| 236 | +```text |
| 237 | +structure              build (secs) lookup (secs)  storage size (on disk) |
| 238 | +---------------------- ------------ -------------- --------------------------- |
| 239 | +Python purl-validator  16.664847    4.926029       25MB |
| 240 | +Rust purl-validator.rs 11.849877    0.348128       25MB |
| 241 | +Go purlvalidator-go    2.325181     0.704749       25MB |
| 242 | +``` |
| 243 | + |
| 244 | +### Evaluation |
| 245 | + |
| 246 | +The results are consistent with expectations: Rust lookups are faster than Go and Python. |
| 247 | + |
| 248 | +And the Python on-disk FST is the same size as the Rust FST (since this is the |
| 249 | +same backing code). |
| 250 | + |
| 251 | +Some surprises: |
| 252 | + |
| 253 | +- The build of the Go index is the fastest, which is surprising and could be an |
| 254 | + avenue of improvement for the Rust fst crate. |
| 255 | + |
| 256 | +- Leaving aside the roughly 10x larger RAM need, the Python set and dict are competitive |
| 257 | + speed-wise (faster than the on-disk Rust FST) and very fast to build too. |
| 258 | + |