| 1 | +.. _data_structure_rationale: |
| 2 | + |
| 3 | +FST Data Structure Rationale |
| 4 | +============================= |
| 5 | + |
| 6 | +PurlValidator needs exact membership lookup for a large list of base PURLs. The |
| 7 | +lookup data index is built before release and bundled with each library. |
| 8 | + |
| 9 | + |
| 10 | +See https://github.com/aboutcode-org/purl-validator/tree/main/etc/bench for |
| 11 | +the detailed rationale and benchmarks behind the choice of an FST. |
| 12 | + |
| 13 | + |
| 14 | +Why are FSTs used? |
| 15 | +------------------ |
| 16 | + |
| 17 | +Finite state transducers store sorted strings in a compact form. PURLs share |
| 18 | +prefixes such as ``pkg:npm/``, ``pkg:pypi/``, and ``pkg:maven/``. This makes an |
| 19 | +FST useful for exact package identity queries. |
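
The effect of shared prefixes can be illustrated with a plain trie. This is only a sketch: it is not how the FST libraries store data (an FST also shares suffixes and uses a compact binary encoding), and the helper name ``count_trie_nodes`` and the sample PURLs are hypothetical.

```python
def count_trie_nodes(keys):
    """Build a nested-dict trie and return its node count (root excluded)."""
    root = {}
    nodes = 0
    for key in keys:
        node = root
        for ch in key:
            if ch not in node:
                # A new node is only created where this key diverges
                # from every key inserted before it.
                node[ch] = {}
                nodes += 1
            node = node[ch]
    return nodes


# Hypothetical sample data, for illustration only.
purls = [
    "pkg:npm/lodash",
    "pkg:npm/left-pad",
    "pkg:pypi/requests",
    "pkg:pypi/regex",
]

total_chars = sum(len(p) for p in purls)
shared_nodes = count_trie_nodes(purls)
print(f"characters stored naively: {total_chars}")
print(f"trie nodes with shared prefixes: {shared_nodes}")
```

The node count is lower than the raw character count because ``pkg:``, ``pkg:npm/`` and ``pkg:pypi/`` are each stored once.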
| 20 | + |
| 21 | +FSTs can be memory-mapped and are very compact. Lookups are not as fast as |
| 22 | +with a native set, but memory consumption is so much lower that FSTs are the |
| 23 | +most attractive solution, even though they take more time to build. |
| 24 | + |
| 25 | + |
| 26 | +Requirements |
| 27 | +--------------- |
| 28 | + |
| 29 | +The index structure and the library selection must meet these high-level |
| 30 | +requirements: |
| 31 | + |
| 32 | + |
| 33 | +- We want exact results without false positives, which rules out Bloom filters. |
| 34 | +- Offline use with no network access is a must: the dataset must be bundled |
| 35 | + in the releases. |
| 36 | +- The index is constructed at build time, so construction time is not |
| 37 | + critical. |
| 38 | +- The bundled index should be small enough to stay below the crates.io and |
| 39 | + PyPI archive size limits. |
| 40 | +- No rebuild at startup or at runtime, and a fast enough load time from disk, |
| 41 | + ideally memory-mapped. |
| 42 | +- Fast enough lookup. |
| 43 | +- Libraries should be actively maintained FOSS for Rust, Go, and Python. |
| 44 | + |
| 45 | + |
| 46 | + |
| 47 | + |
| 48 | +Selected FST libraries |
| 49 | +-------------------------- |
| 50 | + |
| 51 | +Python uses ``ducer.Map`` with ``mmap``. The map is stored on disk and opened |
| 52 | +without loading the full catalog into Python objects. |
| 53 | + |
| 54 | +Rust uses ``fst::Set``. The generated FST is embedded into the crate. |
| 55 | + |
| 56 | +Go uses Vellum FST. The generated FST is embedded into the module. |
| 57 | + |
| 58 | +Alternatives |
| 59 | +------------ |
| 60 | + |
| 61 | +We also considered built-in sets and maps as a baseline: |
| 62 | + |
| 63 | +- Python: ``set`` and ``dict``. |
| 64 | +- Rust: ``HashSet`` and ``HashMap``. |
| 65 | +- Go: ``map[string]struct{}`` and ``map[string]bool``. |
| 66 | + |
| 67 | +These structures are simple and fast. They require loading all keys into |
| 68 | +runtime memory, so they are less useful as the packaged lookup format. |
| 69 | + |
| 70 | +Sorted arrays or slices can use binary search. They are simple and exact, but |
| 71 | +lookup takes repeated string comparisons and the strings still need to be |
| 72 | +loaded. |
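
As a rough illustration of the sorted-array alternative, exact membership can be written with the standard ``bisect`` module. This is a minimal sketch, not code from the benchmark: the function name ``sorted_contains`` and the sample PURLs are hypothetical.

```python
from bisect import bisect_left


def sorted_contains(sorted_keys, key):
    """Exact membership test on a sorted list using binary search."""
    # bisect_left performs O(log n) string comparisons to find the
    # position where key would be inserted.
    i = bisect_left(sorted_keys, key)
    return i < len(sorted_keys) and sorted_keys[i] == key


# Hypothetical sample data; the whole list must be resident in memory.
base_purls = sorted(
    [
        "pkg:pypi/requests",
        "pkg:npm/lodash",
        "pkg:maven/org.apache.commons/commons-lang3",
    ]
)

print(sorted_contains(base_purls, "pkg:npm/lodash"))
print(sorted_contains(base_purls, "pkg:npm/lodas"))
```

The lookup is exact, but every probe compares full strings, and the list itself has to be loaded before the first query.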
| 73 | + |
| 74 | +SQLite can store the PURLs in an indexed table. It gives exact results, but it |
| 75 | +adds a database dependency for a read-only membership check. It has far more |
| 76 | +features than needed and is overkill for our use case. |
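
For comparison, a SQLite-backed membership check could look like the sketch below, using Python's standard ``sqlite3`` module. The table name ``purls``, the helper names, and the in-memory database are assumptions made for illustration; a real package would ship a database file instead.

```python
import sqlite3


def build_purl_db(purls):
    """Create an in-memory database with one indexed column of PURLs."""
    conn = sqlite3.connect(":memory:")
    # The PRIMARY KEY gives a B-tree index on the purl column.
    conn.execute("CREATE TABLE purls (purl TEXT PRIMARY KEY)")
    conn.executemany(
        "INSERT INTO purls (purl) VALUES (?)", ((p,) for p in purls)
    )
    conn.commit()
    return conn


def sqlite_contains(conn, purl):
    """Exact membership test through an indexed lookup."""
    row = conn.execute(
        "SELECT 1 FROM purls WHERE purl = ? LIMIT 1", (purl,)
    ).fetchone()
    return row is not None


# Hypothetical sample data, for illustration only.
conn = build_purl_db(["pkg:npm/lodash", "pkg:pypi/requests"])
print(sqlite_contains(conn, "pkg:pypi/requests"))
print(sqlite_contains(conn, "pkg:pypi/request"))
```

The result is exact, but each lookup goes through SQL parsing and the database engine, which is a lot of machinery for a yes/no answer.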
| 77 | + |
| 78 | +Bloom filters are small and fast, but they can return false positives. They |
| 79 | +cannot be used as a validation index. |
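
The false-positive problem can be seen with a deliberately undersized Bloom filter. The ``TinyBloom`` class below is a toy written for this illustration, not a library API, and the 8-bit size is chosen on purpose so that the filter saturates.

```python
import hashlib


class TinyBloom:
    """Toy Bloom filter backed by a Python int used as a bit array."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # True means "possibly present"; False means "definitely absent".
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


# An 8-bit filter is far too small for 200 keys, so its bits fill up.
bloom = TinyBloom(num_bits=8, num_hashes=2)
added = [f"pkg:npm/package-{i}" for i in range(200)]
for purl in added:
    bloom.add(purl)

# This key was never added, yet a saturated filter reports it as present.
print(bloom.might_contain("pkg:npm/never-added"))
```

A real filter would be sized to keep the false-positive rate low, but it can never bring it to zero, which is why it does not fit an exact validation check.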
| 80 | + |
| 81 | +A DAWG can store a set of strings by sharing prefixes and suffixes. It may be a |
| 82 | +valid alternative to an FST, to which it is very similar, but there are few |
| 83 | +maintained libraries in the target languages. |