Commit 9a9656d

Fix typos README.rst
Signed-off-by: Philippe Ombredanne <pombredanne@aboutcode.org>
1 parent 56d15cf commit 9a9656d

1 file changed

Lines changed: 23 additions & 281 deletions

etc/bench/README.rst

@@ -1,84 +1,4 @@
11
PurlValidator data structure evaluation
2-
=======================================
3-
4-
This document details the research and evaluation of various efficient
5-
data structures for compact PURLs storage and lookup.
6-
7-
It contains:
8-
9-
- reference to evaluation/bench scripts
10-
- documentation on the various libraries and data structures under
11-
consideration
12-
- the final choice (spoiler an FST, aka. finite state transducer)
13-
14-
Context and Problem
15-
-------------------
16-
17-
PurlValidator needs a local queryable dataset of known PURLs to answer
18-
one question:
19-
20-
Does this PURL exist in the reference dataset?
21-
22-
The lookup index should be built for each release, and shipped with the
23-
library for access without a network connection. And we want a Go, Rust
24-
and Python implementation. The PURls themselves are collected using
25-
PurlDB and FederatedCode.
26-
27-
Solution
28-
--------
29-
30-
High level design
31-
~~~~~~~~~~~~~~~~~
32-
33-
The lookup key is a PURL, cleaned to only keep type, namespace, and
34-
name, (without version, qualifiers and subpath)
35-
36-
This keeps validation focused for now. Version validation could come
37-
later by extending indexed PURLs with version or baking in support VERS
38-
version parsing for validation
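
As an illustration of the key cleaning described above, here is a minimal sketch that uses plain string handling only. The function name ``base_purl`` is hypothetical and is not taken from the PurlValidator code. It also assumes the version sits in the last path segment; a real implementation would more likely rely on a PURL parsing library to handle percent-encoding and type-specific rules.

```python
def base_purl(purl: str) -> str:
    """Reduce a PURL string to its type, namespace and name.

    Hypothetical helper for illustration only; simplified string handling.
    """
    # Drop the subpath (everything after '#'), then the qualifiers (after '?').
    purl = purl.split("#", 1)[0]
    purl = purl.split("?", 1)[0]
    # Assumption: the version follows the last '@' of the last path segment.
    head, sep, last = purl.rpartition("/")
    if "@" in last:
        last = last.rsplit("@", 1)[0]
    return f"{head}{sep}{last}"


print(base_purl("pkg:pypi/django@4.2?extension=tar.gz#docs"))
```

With this sketch, two PURLs that differ only by version, qualifiers or subpath map to the same lookup key.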
39-
40-
Solution elements: Data structures considered
41-
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
42-
43-
- Built-in set and map
44-
- FST
45-
- DAWG
46-
- Bloom filter
47-
- SQLite
48-
49-
Considered but not evaluated:
50-
51-
- Minimal perfect hash: no compression
52-
- Trie or radix tree: DAWG and FST are similar, but are more compact.
53-
Suffix trees are way too big.
54-
55-
Built-in set and map
56-
^^^^^^^^^^^^^^^^^^^^
57-
58-
Built-in sets and maps are the simplest baseline in each language, they
59-
are as fast as can be, but they have no compression and no built-in
60-
serialization or memory mapping, and memory use grows quickly for large
61-
datasets.
62-
63-
An interesting path could be to use built-in sets in Rust and Go
64-
generating the code with all the PURL strings so that there is no
65-
specific deserialization. The porblem there is the size as the data is
66-
not compressed.
67-
68-
Built-ins structures are useful for benchmarks as reference but are not
69-
suitable as the main packaged data structure because they are too big.
70-
71-
FST: finite state transducer
72-
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
73-
74-
https://en.wikipedia.org/wiki/Finite-state_transducer
75-
76-
An FST stores a sorted set of strings in a compact automaton. PURLs
77-
share common prefixes such as ``pkg:npm/``, ``pkg:pypi/``, and
78-
``pkg:maven/``. This sharing helps reduce stored data.
79-
80-
FST lookup is exact for this use case. The Rust and Go implementations
81-
already ship an FST file. The library opens or embeds that file and
822
performs membership checks without rebuilding the index.
833

844
The main cost is build complexity. Input must be prepared, sorted, and
@@ -103,12 +23,12 @@ Bloom filter
10323
https://en.wikipedia.org/wiki/Bloom_filter
10424

10525
A Bloom filter can store a large set in a small space, but it is a
106-
probalistic structure and can answer that a value is surely absent or
26+
probabilistic structure and can answer that a value is surely absent or
10727
maybe present. In that later case, you need an extra full dataset to
10828
validate further the “maybe”: this is the problem of false positives
10929
with these filters, hence a Bloom filter cannot not be used as the only
11030
lookup structure, and does not make sense here. Instead, a Bloom filter
111-
could be used before an exact structure to skip some exact lookups as
31+
could be used before an exact structure to skip some exact lookup as
11232
performance optimization, but outside of the validator.
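
To make the "surely absent or maybe present" behavior concrete, here is a toy sketch of a Bloom filter placed in front of an exact structure. The ``BloomFilter`` class, its sizes and the ``exists`` helper are illustrative assumptions, not part of the validator or of any benchmarked library. A Python ``set`` stands in for the exact lookup structure.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter for illustration; parameters are arbitrary."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive several bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means surely absent; True only means maybe present.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


known = {"pkg:pypi/django", "pkg:npm/lodash"}  # stand-in exact structure
bloom = BloomFilter()
for purl in known:
    bloom.add(purl)


def exists(purl: str) -> bool:
    # The filter can only skip the exact lookup, never replace it.
    if not bloom.might_contain(purl):
        return False
    return purl in known
```

The exact structure is still required to confirm every "maybe", which is why the filter alone cannot answer the membership question.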
11333

11434
SQLite
@@ -118,7 +38,7 @@ https://sqlite.org/
11838

11939
SQLite can store PURLs in a SQL table with an index for exact lookup.
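
A minimal sketch of such a table, using the Python standard library ``sqlite3`` module and an in-memory database. The table name ``purls``, the column name and the ``purl_exists`` helper are assumptions made for this example. The primary key gives SQLite the index used for the exact lookup.

```python
import sqlite3

# In-memory database for illustration; a shipped index would be a file.
conn = sqlite3.connect(":memory:")
# The PRIMARY KEY constraint creates the index used for exact lookups.
conn.execute("CREATE TABLE purls (purl TEXT PRIMARY KEY)")
conn.executemany(
    "INSERT INTO purls (purl) VALUES (?)",
    [("pkg:pypi/django",), ("pkg:npm/lodash",)],
)
conn.commit()


def purl_exists(conn: sqlite3.Connection, purl: str) -> bool:
    """Return True if the PURL is stored in the table (exact match)."""
    row = conn.execute(
        "SELECT 1 FROM purls WHERE purl = ? LIMIT 1", (purl,)
    ).fetchone()
    return row is not None


print(purl_exists(conn, "pkg:pypi/django"))
```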
12040

121-
The tradeoff is operational weight. Each SQLite language binding adds a
41+
The trade-off is operational weight. Each SQLite language binding adds a
12242
dependency (though this is built in Python). The validator only needs
12343
immutable membership checks, not SQL full power with queries, and update
12444
transactions; but on the other hand the SQLite DB could be the same
@@ -131,208 +51,30 @@ compressed. But it will be a future enhancement for sure.
13151
Preferred solution: FST
13252
~~~~~~~~~~~~~~~~~~~~~~~
13353

134-
Based on the benchmark and otrher criteria, let’s use an FST-backed
135-
lookup for every languages. Do not use a Bloom filter (probalistic). Do
54+
Based on the benchmark and other criteria, let’s use an FST-backed
55+
lookup for every languages. Do not use a Bloom filter (probabilistic). Do
13656
not use native structures that use too much memory.
13757

13858
And for the library selection, we have these high level requirements:
13959

140-
- We want exact result without false positives, e.g., no bloom filter.
141-
- Offline use, with no network is a must: the dataset must be bundled
142-
in the releases.
143-
- With build time index construction, the construction time is not
144-
critical.
145-
- The bundled index should be small enough to ship below crates, and
146-
Pypi archive size limits.
147-
- No rebuild at startup/runtime, and fast enough load time from disk,
148-
ideally memory-mapped.
149-
- Fast enough lookup.
150-
- Libraries should be maintained, active FOSS for Rust/Go/Python.
60+
- We want exact result without false positives, e.g., no bloom filter.
61+
- Offline use, with no network is a must: the dataset must be bundled
62+
in the releases.
63+
- With build time index construction, the construction time is not
64+
critical.
65+
- The bundled index should be small enough to ship below crates, and
66+
Pypi archive size limits.
67+
- No rebuild at startup/runtime, and fast enough load time from disk,
68+
ideally memory-mapped.
69+
- Fast enough lookup.
70+
- Libraries should be maintained, active FOSS for Rust/Go/Python.
15171

15272
The final selected FST libraries are:
15373

154-
- Rust: fst crate with a memory-mapped set
155-
https://github.com/BurntSushi/fst/
156-
- Python: ducer with a memory-mapped map, dict-like
157-
https://github.com/jfolz/ducer (ducer uses the Rust fst crate inside)
158-
- Go: vellum “fst” module (originally from
159-
https://github.com/couchbase/vellum now at
160-
https://github.com/blevesearch/vellum) which is mostly inspired from
161-
the Rust fst crate
162-
163-
Appendix: Benchmarks
164-
--------------------
165-
166-
This directory contains the benchmark Python scripts and mini benchmark
167-
projects used for PurlValidator evaluation in Go and Rust.
168-
169-
The benchmarks compare offline PURL existence checks using:
170-
171-
- Python: memory-mapped ``ducer``.
172-
- Rust: crate ``fst``.
173-
- Go: embedded Vellum FST.
174-
- Python built-in ``set`` and ``dict``,
175-
- Python sorted list,
176-
- Python embedded SQLite,
177-
- and a Rust DAWG.
178-
179-
180-
Expected checkout layout
181-
~~~~~~~~~~~~~~~~~~~~~~~~
182-
183-
We assume you have Python, Go and Rust pre-installed:
184-
``python``, Rust ``cargo`` and Go ``go`` must be on the ``PATH``.
185-
186-
187-
First clone the three repos, in the same ``workspace`` directory:
188-
189-
- ``git clone https://github.com/aboutcode-org/purl-validator``
190-
- ``git clone https://github.com/aboutcode-org/purl-validator.rs``
191-
- ``git clone https://github.com/aboutcode-org/purlvalidator-go``
192-
193-
The scripts derive the workspace path from this ``purl-validator`` clone.
194-
Use ``--workspace`` only when the three clones are not in the same parent
195-
directory.
196-
197-
Install the Python dependencies from the Python repo:
198-
199-
.. code:: sh
200-
201-
cd purl-validator
202-
python3 -m venv venv
203-
. venv/bin/activate
204-
python -m pip install -U pip
205-
python -m pip install -r requirements.txt packageurl-python
206-
207-
208-
The Rust and Go lookup test project code is checked in under:
209-
210-
- ``etc/bench/rust-lookup-bench``
211-
- ``etc/bench/go-lookup-bench``
212-
213-
The Python benchmark driver builds and runs those projects.
214-
215-
216-
217-
Benchmarking FST vs. DAWG
218-
~~~~~~~~~~~~~~~~~~~~~~~~~
219-
220-
Note: there is a Go benchmark comparing FST and DAWG data structures, plus
221-
other structures:
222-
223-
https://github.com/timurgarif/go-fsa-trie-bench
224-
225-
The local Rust benchmark compares the ``fst`` and ``dawg`` crates using
226-
the base PURLs from ``purl-validator.rs/fst_builder/data``.
227-
228-
Run it from the purl-validator/ dir (with activated venv):
229-
230-
.. code:: sh
231-
232-
cargo build --release --manifest-path etc/bench/rust-fst-dawg-bench/Cargo.toml
233-
etc/bench/rust-fst-dawg-bench/target/release/rust-fst-dawg-bench
234-
235-
The dataset profile has 2,324,119 unique sorted base PURL. The benchmark
236-
is to run 1M queries, where 500K are expected to fail.
237-
238-
- The fst crate index was built in 11s, with a 25MB serialized file,
239-
and took 0.703s for 1M lookups.
240-
- The dawg crate index was built in 18s, with a 794MB serialized file,
241-
and took 28s for 1M lookups.
242-
243-
The outcome is that the preferred structure is an FST over a DAWG (at
244-
least with these implementations).
245-
246-
247-
Benchmarking FST against built-ins and SQLite
248-
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
249-
250-
Additional review compares the Python ``ducer`` FST library against
251-
other approaches. ``ducer`` uses the Rust ``fst`` crate, and Go Vellum is
252-
based on the same FST design.
253-
254-
The ``etc/bench/alternative_benchmark.py`` script compares
255-
Python lookup using a list of PURLs (text file with one PURL per line)
256-
for these candidates:
257-
258-
- Python ``set``.
259-
- Python ``dict``.
260-
- Python sorted list plus ``bisect``.
261-
- In-memory SQLite.
262-
- FST using a ``ducer.Map`` (a Python wrapper on the Rust fst crate).
263-
264-
Data (PURL lists) is from ``purl-validator.rs/fst_builder/data/``
265-
266-
Run it from the purl-validator/ dir (with activated venv):
267-
268-
.. code:: sh
269-
270-
python etc/bench/alternative_benchmark.py \
271-
--input ../purl-validator.rs/fst_builder/data \
272-
--limit 0 \
273-
--queries 1000000 \
274-
--report tmp/alternative-structures.txt
275-
276-
Results with 2,324,119 unique PURLs and 1M lookup queries, 500K existing
277-
PURLs:
278-
279-
.. code:: text
280-
281-
structure build (secs) lookup (secs) storage size
282-
-------------------- ------------ -------------- ---------------------------
283-
python set 0.206540 0.275906 304MB in RAM
284-
python dict 0.449625 0.429034 298MB in RAM
285-
ducer FST 3.700943 1.805585 26MB on disk
286-
sorted list+bisect 0.017540 2.783555 236MB in RAM
287-
sqlite in memory 4.855480 4.220032 207MB on disk (or 65MB with zstd)
288-
289-
290-
Benchmarking FST in Python, Go, and Rust
291-
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
292-
293-
This benchmark runs three PURL validators implementations.
294-
The script is ``etc/bench/go-rust-py_benchmark.py``.
295-
296-
The benchmark goes through these steps:
297-
298-
- loads PURL lists from ``purl-validator.rs/fst_builder/data``.
299-
- builds the Python ``ducer`` map (i.e., a Rust FST using the ``fst`` crate).
300-
- builds the Rust FST with ``purl-validator.rs``.
301-
- builds the Go FST with ``purlvalidator-go``.
302-
- runs 1M lookups, with 500K known PURLs and 500K unknown PURLs.
303-
304-
305-
Run it from the purl-validator/ dir (with activated venv):
306-
307-
.. code:: sh
308-
309-
python etc/bench/go-rust-py_benchmark.py \
310-
--queries 1000000 \
311-
--report tmp/go-rust-py-results.txt
312-
313-
PURL data is from ``purl-validator.rs/fst_builder/data/``
314-
315-
Results with 2,324,119 unique PURLs and 1M lookup queries, 500K existing
316-
PURLs:
317-
318-
.. code:: text
319-
320-
structure build (secs) lookup (secs) storage size (ondisk)
321-
-------------------- ------------ -------------- ---------------------------
322-
Python purl-validator 16.664847 4.926029 25MB
323-
Rust purl-validator.rs 11.849877 0.348128 25MB
324-
Go purlvalidator-go 2.325181 0.704749 25MB
325-
326-
327-
Evaluation and final solution
328-
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
329-
330-
The Rust implementation has the fastest lookup in this run. The Python
331-
on-disk FST is about the same size as the Rust FST because both use the
332-
same backing FST implementation.
333-
334-
The Go index build is faster in this run. That may be worth checking
335-
against the Rust FST builder.
336-
337-
The Python ``set`` and ``dict`` are fast baselines, but they use much
338-
more RAM than the on-disk FST.
74+
- Rust: fst crate with a memory-mapped set
75+
https://github.com/BurntSushi/fst/
76+
- Python: ducer with a memory-mapped map, dict-like
77+
https://github.com/jfolz/ducer (ducer uses the Rust fst crate inside)
78+
- Go: vellum “fst” module (originally from
79+
https://github.com/couchbase/vellum now at
80+
https://github.com/blevesearch/vellum) which is mostly inspired from
