Commit 9302af7

Merge pull request #38 from aboutcode-org/docs
2 parents 160c5f0 + caf5703 commit 9302af7

17 files changed

Lines changed: 840 additions & 457 deletions

.github/workflows/docs-ci.yml

Lines changed: 5 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -2,6 +2,7 @@ name: CI Documentation
22

33
on: [push, pull_request]
44

5+
permissions: {}
56
jobs:
67
build:
78
runs-on: ubuntu-24.04
@@ -13,10 +14,12 @@ jobs:
1314

1415
steps:
1516
- name: Checkout code
16-
uses: actions/checkout@v4
17+
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd
18+
with:
19+
persist-credentials: false
1720

1821
- name: Set up Python ${{ matrix.python-version }}
19-
uses: actions/setup-python@v5
22+
uses: actions/setup-python@a309ff8b426b58ec0e2a45f0f869d46889d02405
2023
with:
2124
python-version: ${{ matrix.python-version }}
2225

.github/workflows/pypi-release.yml

Lines changed: 15 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -18,49 +18,55 @@ on:
1818
tags:
1919
- "v*.*.*"
2020

21+
permissions: {}
2122
jobs:
2223
build-pypi-distribs:
2324
name: Build and publish library to PyPI
2425
runs-on: ubuntu-24.04
2526

2627
steps:
27-
- uses: actions/checkout@v4
28+
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd
29+
with:
30+
persist-credentials: false
2831
- name: Set up Python
29-
uses: actions/setup-python@v5
32+
uses: actions/setup-python@a309ff8b426b58ec0e2a45f0f869d46889d02405
3033
with:
31-
python-version: 3.12
34+
python-version: 3.13
3235

3336
- name: Install pypa/build and twine
3437
run: python -m pip install --user --upgrade build twine pkginfo
3538

3639
- name: Build a binary wheel and a source tarball
3740
run: python -m build --wheel --sdist --outdir dist/
3841

39-
- name: Validate wheel and sdis for Pypi
42+
- name: Validate wheels and sdists for Pypi
4043
run: python -m twine check dist/*
4144

4245
- name: Upload built archives
43-
uses: actions/upload-artifact@v4
46+
uses: actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f
4447
with:
4548
name: pypi_archives
4649
path: dist/*
4750

4851

4952
create-gh-release:
53+
# Sets permissions of the GITHUB_TOKEN to allow release upload
54+
permissions:
55+
contents: write
5056
name: Create GH release
5157
needs:
5258
- build-pypi-distribs
5359
runs-on: ubuntu-24.04
5460

5561
steps:
5662
- name: Download built archives
57-
uses: actions/download-artifact@v4
63+
uses: actions/download-artifact@37930b1c2abaa49bbe596cd826c3c89aef350131
5864
with:
5965
name: pypi_archives
6066
path: dist
6167

6268
- name: Create GH release
63-
uses: softprops/action-gh-release@v2
69+
uses: softprops/action-gh-release@b4309332981a82ec1c5618f44dd2e27cc8bfbfda
6470
with:
6571
draft: true
6672
files: dist/*
@@ -77,11 +83,11 @@ jobs:
7783

7884
steps:
7985
- name: Download built archives
80-
uses: actions/download-artifact@v4
86+
uses: actions/download-artifact@37930b1c2abaa49bbe596cd826c3c89aef350131
8187
with:
8288
name: pypi_archives
8389
path: dist
8490

8591
- name: Publish to PyPI
8692
if: startsWith(github.ref, 'refs/tags')
87-
uses: pypa/gh-action-pypi-publish@release/v1
93+
uses: pypa/gh-action-pypi-publish@cef221092ed1bacb1cc03d23a2d87d1d172e277b

.github/workflows/zizmor.yml

Lines changed: 24 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,24 @@
1+
name: GitHub Actions Security Analysis with zizmor 🌈
2+
3+
on:
4+
push:
5+
branches: ["main"]
6+
pull_request:
7+
branches: ["**"]
8+
9+
permissions: {}
10+
11+
jobs:
12+
zizmor:
13+
name: Run zizmor 🌈
14+
runs-on: ubuntu-latest
15+
permissions:
16+
security-events: write # Required for upload-sarif (used by zizmor-action) to upload SARIF files.
17+
steps:
18+
- name: Checkout repository
19+
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
20+
with:
21+
persist-credentials: false
22+
23+
- name: Run zizmor 🌈
24+
uses: zizmorcore/zizmor-action@b1d7e1fb5de872772f31590499237e7cce841e8e # v0.5.3

README.md

Lines changed: 56 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -4,26 +4,28 @@
44
[![Version](https://img.shields.io/github/v/release/aboutcode-org/purl-validator?style=for-the-badge)](https://github.com/aboutcode-org/purl-validator/releases)
55
[![Test](https://img.shields.io/github/actions/workflow/status/aboutcode-org/purl-validator/ci.yml?style=for-the-badge&logo=github)](https://github.com/aboutcode-org/purl-validator/actions)
66

7-
**purl-validator** is a Python library for validating [Package URLs (PURLs)](https://github.com/package-url/purl-spec). It works fully offline, including in **air-gapped** or **restricted environments**, and answers one key question: **Does the package this PURL represents actually exist?**
7+
**purl-validator** is a Python library for validating [Package URLs (PURLs)](https://github.com/package-url/purl-spec).
8+
It works fully offline, including in **air-gapped** or **restricted environments**,
9+
and answers one key question: **Does the package this PURL represents actually exist?**
810

911
## How Does It Work?
1012

1113
**purl-validator** ships with a pre-built FST (Finite State Transducer), a set of compact automata containing the latest Package URLs mined by MineCode[^1]. The library uses this FST to perform lookups and confirm whether the **base PURL**[^2] exists.
1214

1315
## Currently Supported Ecosystems
1416

15-
- **apk**
16-
- **cargo**
17-
- **composer**
18-
- **conan**
19-
- **cpan**
20-
- **cran**
21-
- **debian**
22-
- **maven**
23-
- **npm**
24-
- **nuget**
25-
- **pypi**
26-
- **swift**
17+
- apk
18+
- cargo
19+
- composer
20+
- conan
21+
- cpan
22+
- cran
23+
- debian
24+
- maven
25+
- npm
26+
- nuget
27+
- pypi
28+
- swift
2729

2830
## Usage
2931

@@ -47,6 +49,46 @@ PurlValidator.validate_purl("pkg:nuget/FluentValidation")
4749
PurlValidator.validate_purl("pkg:nuget/non-existent-foo-bar")
4850
>>> False
4951
```
52+
The validator accepts a PURL string or a `packageurl.PackageURL` object:
53+
54+
```python
55+
from packageurl import PackageURL
56+
from purl_validator import PurlValidator
57+
58+
validator = PurlValidator()
59+
purl = PackageURL(type="npm", namespace="@angular", name="core")
60+
61+
exists = validator.validate_purl(purl)
62+
print(exists)
63+
```
64+
65+
Only the base PURL is used for queries (i.e., only the package type, namespace, and name).
66+
Version, qualifiers, and subpath are not part of the query:
67+
68+
```python
69+
from purl_validator import create_purl_map_entry
70+
71+
assert create_purl_map_entry("pkg:pypi/django@5.0.0") == b"pypi/django"
72+
```
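
As a rough illustration of what "base PURL only" means, the sketch below strips the version, qualifiers, and subpath from a PURL string with plain string operations. `base_purl_key` is a hypothetical helper written for this example; it is not the library's `create_purl_map_entry`, and it skips the normalization rules of the PURL spec.

```python
def base_purl_key(purl: str) -> bytes:
    """Return a 'type/namespace/name' key for a PURL string (simplified sketch)."""
    scheme = "pkg:"
    if not purl.startswith(scheme):
        raise ValueError(f"not a PURL: {purl!r}")
    remainder = purl[len(scheme):]
    # Drop the subpath (after '#') and the qualifiers (after '?').
    remainder = remainder.split("#", 1)[0]
    remainder = remainder.split("?", 1)[0]
    # Assumption: only the last path segment carries an '@version' suffix.
    head, slash, last = remainder.strip("/").rpartition("/")
    name = last.split("@", 1)[0]
    return f"{head}{slash}{name}".encode("utf-8")


print(base_purl_key("pkg:pypi/django@5.0.0"))
print(base_purl_key("pkg:npm/%40angular/core@17.0.0?arch=x#src"))
```

Both calls produce a key without the version, so two PURLs that differ only in version, qualifiers, or subpath map to the same entry.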
73+
74+
You can also build and load a custom index for tests or experiments:
75+
76+
```python
77+
from purl_validator import PurlValidator
78+
from purl_validator import create_purl_map
79+
80+
purl_map_location = create_purl_map([
81+
"pkg:pypi/django",
82+
"pkg:npm/%40angular/core",
83+
])
84+
85+
validator = PurlValidator(purl_map_location)
86+
assert validator.validate_purl("pkg:pypi/django") is True
87+
assert validator.validate_purl("pkg:pypi/not-a-real-package-name") is False
88+
```
89+
90+
Use one `PurlValidator` instance for many lookups. Creating the instance loads
91+
the packaged map, while each validation is an exact membership check.
5092

5193
## Contribution
5294

@@ -91,4 +133,4 @@ limitations under the License.
91133
```
92134

93135
[^1]: MineCode continuously collects package metadata from various package ecosystems to maintain an up-to-date catalog of known packages.
94-
[^2]: A Base Package URL is a Package URL without a version, qualifiers or subpath.
136+
[^2]: A Base Package URL is a Package URL without a version, qualifiers, or subpath.

docs/source/conf.py

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -17,7 +17,7 @@
1717

1818
# -- Project information -----------------------------------------------------
1919

20-
project = "nexb-skeleton"
20+
project = "purl-validator"
2121
copyright = "nexB Inc., AboutCode and others."
2222
author = "AboutCode.org authors and contributors"
2323

@@ -79,9 +79,9 @@
7979

8080
html_context = {
8181
"display_github": True,
82-
"github_user": "nexB",
83-
"github_repo": "nexb-skeleton",
84-
"github_version": "develop", # branch
82+
"github_user": "aboutcode-org",
83+
"github_repo": "purl-validator",
84+
"github_version": "main", # branch
8585
"conf_py_path": "/docs/source/", # path in the checkout to the docs root
8686
}
8787

docs/source/contribute/contrib_doc.rst

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -187,7 +187,7 @@ Style Conventions for the Documentaion
187187

188188
(`Refer <https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html#sections>`_)
189189
Normally, there are no heading levels assigned to certain characters as the structure is
190-
determined from the succession of headings. However, this convention is used in Pythons Style
190+
determined from the succession of headings. However, this convention is used in Python's Style
191191
Guide for documenting which you may follow:
192192

193193
# with overline, for parts
Lines changed: 83 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,83 @@
1+
.. _data_structure_rationale:
2+
3+
FST Data Structure Rationale
4+
=============================
5+
6+
PurlValidator needs exact membership lookup for a large list of base PURLs. The
7+
lookup data index is built before release and bundled with each library.
8+
9+
10+
See https://github.com/aboutcode-org/purl-validator/tree/main/etc/bench for
11+
the detailed rationale and benchmarks behind the choice of an FST.
12+
13+
14+
Why are FSTs used?
15+
------------------
16+
17+
Finite state transducers store sorted strings in a compact form. PURLs share
18+
prefixes such as ``pkg:npm/``, ``pkg:pypi/``, and ``pkg:maven/``. This makes an
19+
FST useful for exact package identity queries.
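
To make the effect of shared prefixes concrete, here is a minimal sketch that counts how many characters a plain trie stores for a few PURLs compared with storing each string separately. This is only an approximation for illustration: a real FST also shares suffixes and uses a compact binary encoding, and the `trie_node_count` helper and the sample keys are made up for this example.

```python
def trie_node_count(keys):
    """Count the nodes a character trie needs to store all keys."""
    root = {}
    nodes = 0
    for key in keys:
        node = root
        for ch in key:
            if ch not in node:
                node[ch] = {}  # a new node is only created for unshared characters
                nodes += 1
            node = node[ch]
    return nodes


keys = sorted([
    "pkg:pypi/django",
    "pkg:pypi/django-cms",
    "pkg:pypi/flask",
    "pkg:npm/express",
])
raw_chars = sum(len(key) for key in keys)
shared_nodes = trie_node_count(keys)
print(f"raw characters: {raw_chars}, trie nodes: {shared_nodes}")
```

The more keys share a prefix such as ``pkg:pypi/``, the larger the gap between the two numbers becomes.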
20+
21+
FSTs can be memory-mapped and are very compact. They are not as fast as a native
22+
set, but their memory consumption is so much lower that this makes them the most
23+
attractive solution, even if they take more time to build.
24+
25+
26+
Requirements
27+
---------------
28+
29+
The index structure and the libraries that read it should meet these
30+
high-level requirements:
31+
32+
33+
- We want exact results without false positives, which rules out Bloom filters.
34+
- Offline use with no network access is a must: the dataset must be bundled
35+
in the releases.
36+
- The index is constructed at build time, so construction time is not
37+
critical.
38+
- The bundled index should be small enough to ship below the crates.io and
39+
PyPI archive size limits.
40+
- No rebuild at startup/runtime, and fast enough load time from disk,
41+
ideally memory-mapped.
42+
- Fast enough lookup.
43+
- Libraries should be maintained, active FOSS for Rust/Go/Python.
44+
45+
46+
47+
48+
Selected FST libraries
49+
--------------------------
50+
51+
Python uses ``ducer.Map`` with ``mmap``. The map is stored on disk and opened
52+
without loading the full catalog into Python objects.
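
The general idea of memory-mapping can be sketched with the standard library alone. The example below does not use ``ducer`` and does not build an FST; it only shows, under that simplification, how a file on disk can be opened read-only and queried without first reading it into Python objects. The file name and its contents are invented for the example.

```python
import mmap
import os
import tempfile


def open_mapped(path):
    """Map a file read-only; the OS pages data in on demand."""
    with open(path, "rb") as handle:
        return mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "keys.bin")
    with open(path, "wb") as handle:
        handle.write(b"npm/express\npypi/django\n")

    mapped = open_mapped(path)
    # A linear byte search stands in for the real FST lookup here.
    found = mapped.find(b"pypi/django") != -1
    mapped.close()

print(found)
```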
53+
54+
Rust uses ``fst::Set``. The generated FST is embedded into the crate.
55+
56+
Go uses Vellum FST. The generated FST is embedded into the module.
57+
58+
Alternatives
59+
------------
60+
61+
We also considered built-in sets and maps as a baseline:
62+
63+
- Python: ``set`` and ``dict``.
64+
- Rust: ``HashSet`` and ``HashMap``.
65+
- Go: ``map[string]struct{}`` and ``map[string]bool``.
66+
67+
These structures are simple and fast. They require loading all keys into
68+
runtime memory, so they are less useful as the packaged lookup format.
69+
70+
Sorted arrays or slices can use binary search. They are simple and exact, but
71+
lookup takes repeated string comparisons and the strings still need to be
72+
loaded.
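
For comparison, a sorted-list lookup could look like the following minimal sketch, which uses the standard ``bisect`` module. The ``contains`` helper and the sample keys are illustrative only.

```python
from bisect import bisect_left


def contains(sorted_keys, key):
    """Exact membership test on a sorted list using binary search."""
    index = bisect_left(sorted_keys, key)
    return index < len(sorted_keys) and sorted_keys[index] == key


sorted_keys = sorted(["pypi/django", "npm/express", "cargo/serde"])
print(contains(sorted_keys, "pypi/django"))
print(contains(sorted_keys, "pypi/djnago"))
```

Each lookup compares whole strings O(log n) times, and the full list has to be in memory before the first query.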
73+
74+
SQLite can store the PURLs in an indexed table. It gives exact results, but it
75+
adds a database dependency for a read-only membership check. It has far more
76+
features than needed and is overkill for our use case.
77+
78+
Bloom filters are small and fast, but they can return false positives. They
79+
cannot be used as a validation index.
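
The false-positive behavior can be seen in a deliberately tiny Bloom filter. The sketch below is an illustration only: the `TinyBloom` class, its 64-bit size, and the generated package names are assumptions chosen to make collisions easy to find, not anything used by the project.

```python
import hashlib


class TinyBloom:
    """A very small Bloom filter backed by a Python integer bitset."""

    def __init__(self, size_bits=64, num_hashes=2):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        # Derive one bit position per seed from a salted SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for position in self._positions(key):
            self.bits |= 1 << position

    def might_contain(self, key):
        return all((self.bits >> position) & 1 for position in self._positions(key))


known = [f"pkg:pypi/package-{i}" for i in range(20)]
bloom = TinyBloom()
for purl in known:
    bloom.add(purl)

# Search for a PURL that was never added but is still reported as present.
candidates = (f"pkg:pypi/absent-{i}" for i in range(1000))
false_positive = next((c for c in candidates if bloom.might_contain(c)), None)
print(false_positive)
```

Every added key is reported as present, but so is at least one key that was never added, which is exactly the outcome a validation index has to avoid.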
80+
81+
A DAWG can store a set of strings by sharing prefixes and suffixes. It may be a
82+
valid alternative to an FST (the two are very similar), but there are few maintained
83+
libraries in the target languages.

docs/source/explanations.rst

Lines changed: 31 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,31 @@
1+
.. _explanations:
2+
3+
Explanations
4+
============
5+
6+
Syntax validation and identity validation
7+
-----------------------------------------
8+
9+
The Package-URL spec defines the PURL format. A PURL can follow the spec
10+
format and still name a package that is not known in the package ecosystems.
11+
12+
PurlValidator checks the package PURL against reference data of known PURLs. This
13+
helps find misspelled names, wrong package types, and PURLs that
14+
do not appear in the reference upstream ecosystem package repositories.
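
The difference between the two checks can be sketched as follows. This is a simplified illustration: the regular expression is only a loose shape check and not the full Package-URL grammar, and the `KNOWN_PURLS` set is a stand-in for the bundled reference data.

```python
import re

# Loose shape check: "pkg:", a type, a slash, and something after it.
PURL_SHAPE = re.compile(r"^pkg:[a-z][a-z0-9.+-]*/.+$")

# Stand-in for the reference data of known base PURLs.
KNOWN_PURLS = {"pkg:pypi/django", "pkg:npm/express"}


def is_well_formed(purl):
    """Syntax validation: does the string look like a PURL?"""
    return PURL_SHAPE.match(purl) is not None


def is_known(purl):
    """Identity validation: is this PURL in the reference data?"""
    return purl in KNOWN_PURLS


typo = "pkg:pypi/djnago"
print(is_well_formed(typo), is_known(typo))
```

The misspelled name passes the syntax check and fails only the identity check, which is the case identity validation is meant to catch.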
15+
16+
17+
Offline validation
18+
------------------
19+
20+
SBOM and compliance workflows may run in CI systems, private networks, or
21+
air-gapped environments. PurlValidator packages lookup data with each released
22+
library so validation does not need network access to a registry at runtime.
23+
24+
25+
Base PURL validation
26+
--------------------
27+
28+
PURL existence is checked before version existence.
29+
30+
The current libraries validate base PURLs only, not versions. Version support
31+
can be a future enhancement.
