Commit 597f24a

Add eval report+bench for data structures
Reference: #19 Signed-off-by: Philippe Ombredanne <pombredanne@aboutcode.org>
1 parent e0cf712 commit 597f24a

7 files changed

Lines changed: 1252 additions & 0 deletions

etc/bench/README.md

Lines changed: 258 additions & 0 deletions
@@ -0,0 +1,258 @@
# PurlValidator data structure evaluation

This document details the research and evaluation of various efficient data
structures for compact PURL storage and lookup.

It contains:

- references to the evaluation/bench scripts
- documentation on the various libraries and data structures under consideration
- the final choice (spoiler: an FST, aka. finite state transducer)


## Context and Problem

PurlValidator needs a local queryable dataset of known PURLs to answer one question:

> Does this PURL exist in the reference dataset?

The lookup index should be built for each release and shipped with the library
for access without a network connection. We want Go, Rust, and Python
implementations. The PURLs themselves are collected using PurlDB and FederatedCode.


## Solution

### High level design

The lookup key is a PURL, cleaned to keep only the type, namespace, and name
(without version, qualifiers, and subpath).

This keeps validation focused for now. Version validation could come later by
extending the indexed PURLs with versions, or by baking in support for VERS
version parsing for validation.
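
As an illustration of the key cleaning described above, here is a minimal sketch in Python. The `base_purl` helper is hypothetical and is not taken from the validator code. It assumes a well-formed PURL string and does not handle percent-encoding, case normalization, or type-specific rules, so a spec-compliant parser would be the safer choice in real code.

```python
def base_purl(purl: str) -> str:
    """Return the PURL reduced to its type, namespace, and name.

    Hypothetical helper for illustration only: it assumes a well-formed PURL.
    """
    # Drop the subpath (after '#'), then the qualifiers (after '?').
    remainder = purl.split("#", 1)[0].split("?", 1)[0]
    # A version follows '@' in the last path segment only, so an '@' that
    # appears earlier (for example in an npm scope) is left untouched.
    head, slash, last = remainder.rpartition("/")
    name = last.split("@", 1)[0]
    return f"{head}{slash}{name}"


print(base_purl("pkg:maven/org.apache.commons/commons-lang3@3.12.0?type=jar#src"))
print(base_purl("pkg:npm/@angular/core@16.0.0"))
```

Both calls print the PURL without its version, qualifiers, and subpath, which is the form used as the lookup key.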

### Solution elements: Data structures considered

- Built-in set and map
- FST
- DAWG
- Bloom filter
- SQLite

Considered but not evaluated:

- Minimal perfect hash: no compression
- Trie or radix tree: DAWG and FST are similar, but are more compact. Suffix
  trees are way too big.

#### Built-in set and map

Built-in sets and maps are the simplest baseline in each language. They are as
fast as can be, but they have no compression and no built-in serialization or
memory mapping, and memory use grows quickly for large datasets.

An interesting path could be to use built-in sets in Rust and Go, generating the
code with all the PURL strings so that no specific deserialization is needed.
The problem there is the size, as the data is not compressed.

Built-in structures are useful as a benchmark reference but are not suitable
as the main packaged data structure because they are too big.
61+
62+
63+
#### FST: finite state transducer
64+
65+
<https://en.wikipedia.org/wiki/Finite-state_transducer>
66+
67+
An FST stores a sorted set of strings in a compact automaton. PURLs share common
68+
prefixes such as `pkg:npm/`, `pkg:pypi/`, and `pkg:maven/`. This sharing helps
69+
reduce stored data.
70+
71+
FST lookup is exact for this use case. The Rust and Go implementations already
72+
ship an FST file. The library opens or embeds that file and performs membership
73+
checks without rebuilding the index.
74+
75+
The main cost is build complexity. Input must be prepared, sorted, and encoded
76+
when the package data is refreshed.
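
To give a rough feel for the prefix sharing mentioned above, the sketch below counts how many characters each sorted PURL shares with the previous one. This is not an FST and not part of the benchmark scripts: the `shared_prefix_chars` helper and the four sample PURLs are made up for illustration, and an actual FST also shares suffixes.

```python
import os


def shared_prefix_chars(sorted_keys):
    """Return (shared, total): characters shared with the previous key, and total characters."""
    shared = 0
    total = 0
    previous = ""
    for key in sorted_keys:
        total += len(key)
        # os.path.commonprefix works character by character on plain strings.
        shared += len(os.path.commonprefix([previous, key]))
        previous = key
    return shared, total


# The input must be sorted and de-duplicated, as for an FST build.
keys = sorted({
    "pkg:npm/lodash",
    "pkg:npm/left-pad",
    "pkg:pypi/django",
    "pkg:pypi/djangorestframework",
})
shared, total = shared_prefix_chars(keys)
print(f"{shared} of {total} characters repeat the previous key's prefix")
```

On this tiny sample, 28 of 73 characters are repeated prefixes. On millions of sorted PURLs the shared fraction is what a prefix-sharing structure can avoid storing.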


#### DAWG: directed acyclic word graph

See <https://stevehanov.ca/blog/compressing-dictionaries-with-a-dawg>

A DAWG is also known as a DAFSA:
<https://en.wikipedia.org/wiki/Deterministic_acyclic_finite_state_automaton>

A DAWG is a compact data structure for a set of strings. It can merge repeated
prefixes and suffixes like an FST. The DAWG is interesting in that it can
support prefix lookup, but in general the DAWG is bigger and slower than an FST,
and has fewer mature, maintained libraries.


#### Bloom filter

<https://en.wikipedia.org/wiki/Bloom_filter>

A Bloom filter can store a large set in a small space, but it is a probabilistic
structure: it can only answer that a value is surely absent or maybe present. In
the latter case, you need an extra full dataset to validate the "maybe". This
is the problem of false positives with these filters. Hence a Bloom filter
cannot be used as the only lookup structure, and does not make sense here.
Instead, a Bloom filter could be used before an exact structure to skip some
exact lookups as a performance optimization, but outside of the validator.
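
The "surely absent or maybe present" behavior can be sketched with a toy Bloom filter placed in front of an exact set. This is an illustration only: the `BloomFilter` class, its sizes, and the `is_known` helper are hypothetical, and the validator does not use a Bloom filter.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: no false negatives, possible false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value):
        # Derive several bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(value))


known = {"pkg:npm/lodash", "pkg:pypi/django"}
bloom = BloomFilter()
for purl in known:
    bloom.add(purl)


def is_known(purl):
    # "Surely absent": skip the exact lookup entirely.
    if not bloom.might_contain(purl):
        return False
    # "Maybe present": the exact structure still has the final word.
    return purl in known
```

The exact set is still required for the final answer, which is why the filter alone cannot serve as the packaged index.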


#### SQLite

<https://sqlite.org/>

SQLite can store PURLs in a SQL table with an index for exact lookup.

The tradeoff is operational weight. Each SQLite language binding adds a
dependency (though it is built into Python). The validator only needs immutable
membership checks, not the full power of SQL with queries and update
transactions. On the other hand, the SQLite DB could be the same across all
languages.

SQLite could be useful as a benchmark and debugging format. It is not the first
choice for a small language library because the data is not compressed. But it
will be a future enhancement for sure.
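
For reference, an exact-lookup table of this kind can be sketched with Python's built-in `sqlite3` module. The table name `purls`, the `build_index` and `contains` helpers, and the in-memory database are assumptions made for this example, not the layout used by the benchmark script.

```python
import sqlite3


def build_index(purls):
    """Create an in-memory SQLite database with one indexed column of PURLs."""
    conn = sqlite3.connect(":memory:")
    # The primary key provides the index used for exact lookups.
    conn.execute("CREATE TABLE purls (purl TEXT PRIMARY KEY) WITHOUT ROWID")
    conn.executemany(
        "INSERT OR IGNORE INTO purls (purl) VALUES (?)",
        ((purl,) for purl in purls),
    )
    conn.commit()
    return conn


def contains(conn, purl):
    """Return True if the exact PURL string is stored in the table."""
    row = conn.execute(
        "SELECT 1 FROM purls WHERE purl = ? LIMIT 1", (purl,)
    ).fetchone()
    return row is not None


conn = build_index(["pkg:npm/lodash", "pkg:pypi/django"])
print(contains(conn, "pkg:pypi/django"), contains(conn, "pkg:pypi/missing"))
```

The same database file could be opened from Rust or Go through their SQLite bindings, which is the cross-language benefit noted above.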


### Preferred solution: FST

Based on the benchmarks and other criteria, let's use an FST-backed lookup for
every language. Do not use a Bloom filter (probabilistic). Do not use native
structures that use too much memory.

For the library selection, we have these high level requirements:

- We want exact results without false positives, i.e., no Bloom filter.
- Offline use, with no network, is a must: the dataset must be bundled in the
  releases.
- With build time index construction, the construction time is not critical.
- The bundled index should be small enough to stay below the crates.io and PyPI
  archive size limits.
- No rebuild at startup/runtime, and fast enough load time from disk, ideally
  memory-mapped.
- Fast enough lookup.
- Libraries should be maintained, active FOSS for Rust/Go/Python.

The final selected FST libraries are:

- Rust: fst crate with a memory-mapped set <https://github.com/BurntSushi/fst/>
- Python: ducer with a memory-mapped, dict-like map
  <https://github.com/jfolz/ducer> (ducer uses the Rust fst crate inside)
- Go: vellum "fst" module (originally from
  <https://github.com/couchbase/vellum>, now at
  <https://github.com/blevesearch/vellum>), which is mostly inspired by the
  Rust fst crate


## Appendix: Benchmarks

This directory contains evaluation and benchmark files for PurlValidator.

It compares structures for offline PURL membership checks. The implementations
use:

- Python: memory-mapped `ducer`.
- Rust: crate `fst`.
- Go: embedded Vellum FST.

The comparison also covers the built-in Python set and dict, SQLite, and a Rust
DAWG.

### Expected checkout layout

Run the scripts from a directory that contains checkouts of these repositories:

- `/purl-validator`
- `/purl-validator.rs`
- `/purlvalidator-go`

### Benchmarking FST vs. DAWG

There is a good benchmark in Go comparing FST and DAWG data structures (and
other structures) that highlights why an FST is a better structure for our case
than a DAWG:

<https://github.com/timurgarif/go-fsa-trie-bench>

We also ran a simple synthetic benchmark of the Rust fst and dawg crates using
actual base PURLs from the data in
<https://github.com/aboutcode-org/purl-validator.rs/tree/main/fst_builder/data>

The `etc/bench/rust-fst-dawg-bench` code compares these fst and dawg crates.

The dataset has 2,324,119 unique, sorted base PURLs. The benchmark runs 1M
queries, of which 500K are expected to fail.

- The fst crate index was built in 11s, with a 26MB serialized file, and took
  0.703s for 1M lookups.
- The dawg crate index was built in 18s, with a 831MB serialized file, and took
  28s for 1M lookups.

The outcome is that the preferred structure is an FST over a DAWG (at least
with these implementations).
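
To put the reported figures in proportion, the short calculation below derives the ratios and per-lookup costs from the numbers above. It only restates the measurements; the bytes-per-PURL figure assumes that 1MB means 10^6 bytes, which the report does not specify.

```python
LOOKUPS = 1_000_000
UNIQUE_PURLS = 2_324_119

# Figures reported above for the Rust fst and dawg crates.
fst = {"build_s": 11, "size_mb": 26, "lookup_s": 0.703}
dawg = {"build_s": 18, "size_mb": 831, "lookup_s": 28}

size_ratio = dawg["size_mb"] / fst["size_mb"]
lookup_ratio = dawg["lookup_s"] / fst["lookup_s"]
fst_us_per_lookup = fst["lookup_s"] / LOOKUPS * 1e6
dawg_us_per_lookup = dawg["lookup_s"] / LOOKUPS * 1e6
# Assumption: 1MB = 10**6 bytes.
fst_bytes_per_purl = fst["size_mb"] * 1e6 / UNIQUE_PURLS

print(f"DAWG file is about {size_ratio:.0f}x larger than the FST file")
print(f"DAWG lookups are about {lookup_ratio:.0f}x slower")
print(f"FST: {fst_us_per_lookup:.3f} us/lookup, DAWG: {dawg_us_per_lookup:.1f} us/lookup")
print(f"FST: about {fst_bytes_per_purl:.1f} bytes per indexed PURL")
```

With these implementations the DAWG file is roughly 32 times larger and its lookups roughly 40 times slower, while the FST needs on the order of 11 bytes per indexed PURL.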

### Benchmarking FST against builtin and SQLite

Since we picked the FST as the winner, additional review focused on Python,
comparing the ducer fst library against other approaches. Since ducer is based
on the Rust fst crate, and Go's vellum is also based on the fst design, this
essentially covers the three languages at once.

The `etc/scripts/bench/alternative_benchmark.py` script compares Python lookup
using a text file with one PURL per line for these candidates:

- Python `set`.
- Python `dict`.
- Python sorted list plus `bisect`.
- In-memory SQLite.
- FST using a `ducer.Map`.

Data is from `purl-validator.rs/fst_builder/data/`

Results with 2,324,119 unique PURLs and 1M lookup queries, 500K existing PURLs:

```text
structure             build (secs)  lookup (secs)  storage size
--------------------  ------------  -------------  ---------------------------------
python set                0.206540       0.275906  304MB in RAM
python dict               0.449625       0.429034  298MB in RAM
ducer FST                 3.700943       1.805585  26MB on disk
sorted list+bisect        0.017540       2.783555  236MB in RAM
sqlite in memory          4.855480       4.220032  207MB on disk (or 65MB with zstd)
```
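
The shape of such a lookup benchmark can be sketched as follows. This is not the `alternative_benchmark.py` script: the `time_lookups` helper, the synthetic PURLs, and the small sizes are assumptions chosen so that the example runs quickly with the standard library only. It times a `set` and a sorted list with `bisect` on queries where half are expected to miss.

```python
import bisect
import time


def time_lookups(contains, queries):
    """Run every query through `contains` and return (hits, elapsed seconds)."""
    start = time.perf_counter()
    hits = sum(1 for query in queries if contains(query))
    return hits, time.perf_counter() - start


# Synthetic stand-in data; the real benchmark reads one PURL per line from a file.
purls = [f"pkg:npm/package-{i}" for i in range(10_000)]
as_set = set(purls)
as_sorted = sorted(purls)


def in_sorted_list(purl):
    # Binary search for the insertion point, then check for an exact match.
    index = bisect.bisect_left(as_sorted, purl)
    return index < len(as_sorted) and as_sorted[index] == purl


# Half of the queries exist, half do not.
queries = purls[:500] + [f"pkg:npm/missing-{i}" for i in range(500)]

for label, contains in (("python set", as_set.__contains__),
                        ("sorted list+bisect", in_sorted_list)):
    hits, seconds = time_lookups(contains, queries)
    print(f"{label:<20} hits={hits} lookup={seconds:.6f}s")
```

Timings from such a small run are noisy and not comparable to the table above; the sketch only shows how each candidate is reduced to a single "contains" callable that is measured the same way.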

### Benchmarking FST in Python vs. Go vs. Rust

This benchmark runs each of the three released validator implementations. The
script is in `etc/scripts/bench/go-rust-py_benchmark.py`

Data is from `purl-validator.rs/fst_builder/data/`

Results with 2,324,119 unique PURLs and 1M lookup queries, 500K existing PURLs:

```text
structure                build (secs)  lookup (secs)  storage size (on disk)
-----------------------  ------------  -------------  ----------------------
Python purl-validator       16.664847       4.926029  25MB
Rust purl-validator.rs      11.849877       0.348128  25MB
Go purlvalidator-go          2.325181       0.704749  25MB
```

### Evaluation

The lookup results are consistent with expectations: Rust is faster than Go and
Python.

The Python on-disk FST is the same size as the Rust FST (since this is the same
backing code).

Some surprises:

- The build of the Go index is the fastest, which is surprising and could be an
  avenue of improvement for the Rust fst crate.

- Leaving aside the roughly 10x larger RAM need, the Python set and dict are
  competitive speed-wise (both are faster than the ducer FST, and the set is
  faster than the Rust validator lookup) and super fast to build too.
