
Commit a4bc86d

feat(data): shared download manifest for artefacts + binaries (#16)
* feat(data): shared download manifest for artefacts and binaries

  Introduce a single, language-agnostic manifest (data/manifest.schema.json) that lists every downloadable data artefact and external-binary bundle with a SHA256, consumed by both raven-python and (via the same JSON) MATLAB RAVEN.

  The manifest is a superset of the two runtime registries:
  * manifest["data"] -> raven_python.data._DATA_REGISTRY
  * manifest["binaries"] -> raven_python.binaries._REGISTRY

  Added:
  * data/manifest.schema.json (JSON Schema) + data/manifest.example.json (worked example) + data/manifest.json (empty, the live source of truth until assets are published).
  * raven_python.manifest — load_manifest / to_*_registry / load_into_registries.
  * Lazy autoload: data.ensure_* and binaries.ensure_binary populate themselves from $RAVEN_PYTHON_MANIFEST on first use when their registry is still empty (guarded; no effect when a registry is passed explicitly or the env var is unset).
  * scripts/make_registry_snippet.py: a `manifest` subcommand that computes url+sha256+bytes and writes/updates manifest.json.
  * tests/test_manifest.py (round-trip, converters, lazy autoload via file:// URLs, repo manifests valid).
  * docs/maintenance/data_manifest.md — format, Python + MATLAB consumers, GitHub-Releases vs Zenodo hosting (incl. a release→Zenodo GitHub Action), and per-asset recommendations.

* docs(data): host assets on existing-repo releases; KEGG redistribution permitted

  Reflect the chosen distribution model: GitHub release assets live outside the git tree, so a separate data repository is optional — attach assets to dedicated tags (e.g. kegg-kegg116, diamond-2.1.9) on an existing RAVEN repo and reuse the same URLs across raven-python and MATLAB RAVEN. Use Zenodo only for DOIs or files >2 GB.

  KEGG artefacts are redistributed with permission, so the prior 'confirm rights' caveat is removed. Example/schema URLs repointed from a hypothetical raven-data repo to raven-python.
1 parent 587ab4d commit a4bc86d

11 files changed

Lines changed: 667 additions & 4 deletions


data/manifest.example.json

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
+{
+  "manifest_version": 1,
+  "generated": "2026-05-30",
+  "data": {
+    "kegg": {
+      "version": "kegg116",
+      "description": "KEGG reference model, KO/reaction tables, and prokaryote/eukaryote HMM libraries for getKEGGModelForOrganism.",
+      "license": "Derived from the KEGG database; redistributed with permission from KEGG.",
+      "doi": "10.5281/zenodo.0000000",
+      "source": "https://github.com/SysBioChalmers/raven-python/releases/tag/kegg-kegg116",
+      "files": {
+        "reference_model.yml.gz": { "url": "https://github.com/SysBioChalmers/raven-python/releases/download/kegg-kegg116/reference_model.yml.gz", "sha256": "0000000000000000000000000000000000000000000000000000000000000000", "bytes": 0 },
+        "ko_reaction.tsv.gz": { "url": "https://github.com/SysBioChalmers/raven-python/releases/download/kegg-kegg116/ko_reaction.tsv.gz", "sha256": "0000000000000000000000000000000000000000000000000000000000000000", "bytes": 0 },
+        "ko_names.tsv.gz": { "url": "https://github.com/SysBioChalmers/raven-python/releases/download/kegg-kegg116/ko_names.tsv.gz", "sha256": "0000000000000000000000000000000000000000000000000000000000000000", "bytes": 0 },
+        "organism_gene_ko.tsv.xz": { "url": "https://github.com/SysBioChalmers/raven-python/releases/download/kegg-kegg116/organism_gene_ko.tsv.xz", "sha256": "0000000000000000000000000000000000000000000000000000000000000000", "bytes": 0 },
+        "rxn_flags.tsv.gz": { "url": "https://github.com/SysBioChalmers/raven-python/releases/download/kegg-kegg116/rxn_flags.tsv.gz", "sha256": "0000000000000000000000000000000000000000000000000000000000000000", "bytes": 0 },
+        "prokaryotes.hmm": { "url": "https://zenodo.org/records/0000000/files/prokaryotes.hmm", "sha256": "0000000000000000000000000000000000000000000000000000000000000000", "bytes": 0 }
+      }
+    }
+  },
+  "binaries": {
+    "diamond": {
+      "version": "2.1.9",
+      "provides": ["diamond"],
+      "description": "DIAMOND protein aligner (homology-based reconstruction).",
+      "license": "GPL-3.0-only — ship the upstream COPYING alongside each ZIP.",
+      "platforms": {
+        "linux-x86_64": { "url": "https://github.com/SysBioChalmers/raven-python/releases/download/diamond-2.1.9/diamond-2.1.9-linux-x86_64.zip", "sha256": "0000000000000000000000000000000000000000000000000000000000000000", "bytes": 0 },
+        "macos-arm64": { "url": "https://github.com/SysBioChalmers/raven-python/releases/download/diamond-2.1.9/diamond-2.1.9-macos-arm64.zip", "sha256": "0000000000000000000000000000000000000000000000000000000000000000", "bytes": 0 },
+        "windows-x86_64": { "url": "https://github.com/SysBioChalmers/raven-python/releases/download/diamond-2.1.9/diamond-2.1.9-windows-x86_64.zip", "sha256": "0000000000000000000000000000000000000000000000000000000000000000", "bytes": 0 }
+      }
+    }
+  }
+}
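
The `platforms` keys in the example above follow an `<os>-<arch>` pattern. As a rough sketch of how a consumer might select the matching entry for the running machine: the helper names `platform_key` and `pick_platform_entry` and both alias tables are assumptions made for illustration, and this is not the actual `raven_python.binaries._platform_key()` implementation.

```python
import platform

# Hypothetical alias tables; the real resolver may normalise names differently.
_OS_ALIASES = {"linux": "linux", "darwin": "macos", "windows": "windows"}
_ARCH_ALIASES = {"x86_64": "x86_64", "amd64": "x86_64",
                 "arm64": "arm64", "aarch64": "arm64"}


def platform_key(system=None, machine=None):
    """Build an '<os>-<arch>' key such as 'linux-x86_64'.

    Defaults to the running interpreter's platform when no values are passed.
    """
    system = (system or platform.system()).lower()
    machine = (machine or platform.machine()).lower()
    try:
        return f"{_OS_ALIASES[system]}-{_ARCH_ALIASES[machine]}"
    except KeyError as exc:
        raise RuntimeError(f"unsupported platform: {system}/{machine}") from exc


def pick_platform_entry(bundle, key=None):
    """Return the file entry (url, sha256, ...) of a bundle for one platform."""
    key = key or platform_key()
    platforms = bundle["platforms"]
    if key not in platforms:
        raise KeyError(f"no build for {key}; available: {sorted(platforms)}")
    return platforms[key]
```

Passing `system` and `machine` explicitly keeps the key derivation testable on any host; a bundle that lacks the current platform fails loudly rather than falling back to another build.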

data/manifest.json

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+{
+  "manifest_version": 1,
+  "generated": "2026-05-30",
+  "data": {},
+  "binaries": {}
+}

data/manifest.schema.json

Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
+{
+  "$schema": "https://json-schema.org/draft/2020-12/schema",
+  "$id": "https://github.com/SysBioChalmers/raven-python/manifest.schema.json",
+  "title": "RAVEN data/binary manifest",
+  "description": "Language-agnostic registry of downloadable raven-python / RAVEN data artefacts and external binary bundles. Consumed by the Python resolvers (raven_python.data / raven_python.binaries) and by MATLAB RAVEN. Every file carries a SHA256 so consumers verify integrity after download.",
+  "type": "object",
+  "required": ["manifest_version"],
+  "additionalProperties": false,
+  "properties": {
+    "manifest_version": {
+      "type": "integer",
+      "const": 1,
+      "description": "Format version of this manifest document."
+    },
+    "generated": {
+      "type": "string",
+      "description": "ISO-8601 date the manifest was generated (informational)."
+    },
+    "data": {
+      "type": "object",
+      "description": "Data-artefact datasets, keyed by dataset id (e.g. 'kegg'). Maps onto raven_python.data._DATA_REGISTRY.",
+      "additionalProperties": { "$ref": "#/$defs/dataset" }
+    },
+    "binaries": {
+      "type": "object",
+      "description": "External command-line tool bundles, keyed by bundle id (e.g. 'blast', 'diamond', 'hmmer'). Maps onto raven_python.binaries._REGISTRY.",
+      "additionalProperties": { "$ref": "#/$defs/bundle" }
+    }
+  },
+  "$defs": {
+    "file": {
+      "type": "object",
+      "required": ["url", "sha256"],
+      "additionalProperties": false,
+      "properties": {
+        "url": { "type": "string", "format": "uri", "description": "Direct download URL (GitHub release asset, Zenodo file, etc.)." },
+        "sha256": { "type": "string", "pattern": "^[0-9a-f]{64}$", "description": "Lowercase hex SHA256 of the file." },
+        "bytes": { "type": "integer", "minimum": 0, "description": "File size in bytes (informational; for progress bars / sanity checks)." }
+      }
+    },
+    "dataset": {
+      "type": "object",
+      "required": ["version", "files"],
+      "additionalProperties": false,
+      "properties": {
+        "version": { "type": "string", "description": "Dataset version tag, e.g. 'kegg116'. Used in the cache path." },
+        "description": { "type": "string" },
+        "license": { "type": "string", "description": "SPDX id or free text. NOTE: KEGG-derived artefacts are subject to KEGG's terms — confirm redistribution rights before publishing." },
+        "doi": { "type": "string", "description": "Zenodo (or other) DOI for this dataset version, if archived." },
+        "source": { "type": "string", "format": "uri", "description": "Human-facing page for the release/record (GitHub release or Zenodo landing page)." },
+        "files": {
+          "type": "object",
+          "minProperties": 1,
+          "description": "Artefact files keyed by filename.",
+          "additionalProperties": { "$ref": "#/$defs/file" }
+        }
+      }
+    },
+    "bundle": {
+      "type": "object",
+      "required": ["version", "provides", "platforms"],
+      "additionalProperties": false,
+      "properties": {
+        "version": { "type": "string", "description": "Upstream tool version, e.g. '2.16.0'." },
+        "provides": {
+          "type": "array",
+          "items": { "type": "string" },
+          "minItems": 1,
+          "description": "Executable names this bundle provides, e.g. ['blastp', 'makeblastdb']."
+        },
+        "description": { "type": "string" },
+        "license": { "type": "string", "description": "Upstream tool license (e.g. DIAMOND is GPL-3.0-only — ship its license text alongside the ZIP)." },
+        "platforms": {
+          "type": "object",
+          "minProperties": 1,
+          "description": "One entry per platform, keyed '<os>-<arch>' (e.g. 'linux-x86_64', 'macos-arm64', 'windows-x86_64'). Matches raven_python.binaries._platform_key().",
+          "additionalProperties": { "$ref": "#/$defs/file" }
+        }
+      }
+    }
+  }
+}
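
The `file` definition in the schema above reduces to a few rules: `url` and `sha256` are required, `sha256` is exactly 64 lowercase hex characters, `bytes` (if present) is a non-negative integer, and no other keys are allowed. A minimal stdlib-only sketch of those checks follows. `check_file_entry` is a hypothetical helper, not part of `raven_python.manifest`, and it does not check the `uri` format of `url`; a full JSON Schema validator would cover the rest of the document.

```python
import re

# Equivalent to the schema's "^[0-9a-f]{64}$" when used with fullmatch().
_SHA256_RE = re.compile(r"[0-9a-f]{64}")
_ALLOWED_KEYS = {"url", "sha256", "bytes"}


def check_file_entry(entry):
    """Return a list of problems found in one manifest file entry (empty if OK)."""
    problems = []
    for key in ("url", "sha256"):
        if key not in entry:
            problems.append(f"missing required key: {key}")
    extra = set(entry) - _ALLOWED_KEYS
    if extra:
        problems.append(f"unexpected keys: {sorted(extra)}")
    sha = entry.get("sha256")
    if sha is not None and not (isinstance(sha, str) and _SHA256_RE.fullmatch(sha)):
        problems.append("sha256 must be 64 lowercase hex characters")
    size = entry.get("bytes")
    # bool is a subclass of int in Python, so reject it explicitly.
    if size is not None and (isinstance(size, bool) or not isinstance(size, int) or size < 0):
        problems.append("bytes must be a non-negative integer")
    return problems
```

Returning a list of problems instead of raising on the first one lets a maintainer see every issue in an entry at once when editing the manifest by hand.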

docs/maintenance/data_manifest.md

Lines changed: 155 additions & 0 deletions
@@ -0,0 +1,155 @@
+# Data & binary manifest
+
+Large artefacts (KEGG tables / HMMs, template models) and external-binary bundles
+(BLAST / DIAMOND / HMMER) are **not** committed to the code repository. They are published
+as downloadable assets and described by a single, language-agnostic **manifest** that both
+raven-python and MATLAB RAVEN read. Every file carries a **SHA256**, so consumers verify
+integrity after download.
+
+- Format: [`data/manifest.schema.json`](https://github.com/SysBioChalmers/raven-python/blob/develop/data/manifest.schema.json) (JSON Schema)
+- Worked example: [`data/manifest.example.json`](https://github.com/SysBioChalmers/raven-python/blob/develop/data/manifest.example.json)
+- Live manifest: [`data/manifest.json`](https://github.com/SysBioChalmers/raven-python/blob/develop/data/manifest.json) (empty until assets are published)
+
+The manifest is a superset of the two runtime registries:
+
+| Manifest section | Runtime registry |
+| --- | --- |
+| `data` | {data}`raven_python.data._DATA_REGISTRY` |
+| `binaries` | `raven_python.binaries._REGISTRY` |
+
+```json
+{
+  "manifest_version": 1,
+  "data": { "<dataset>": { "version": "...", "doi": "...", "files": { "<name>": {"url": "...", "sha256": "...", "bytes": 0} } } },
+  "binaries": { "<bundle>": { "version": "...", "provides": ["..."], "platforms": { "<os>-<arch>": {"url": "...", "sha256": "...", "bytes": 0} } } }
+}
+```
+
+## Consuming it — Python
+
+Point raven-python at a manifest and the resolvers populate themselves on first use,
+verifying each download's checksum:
+
+```bash
+export RAVEN_PYTHON_MANIFEST=https://github.com/SysBioChalmers/raven-python/releases/download/manifest-v1/manifest.json
+```
+
+```python
+from raven_python import manifest
+manifest.load_into_registries() # or load_into_registries("/path/or/url")
+# now data.ensure_kegg_data() / binaries.ensure_binary("diamond") resolve from the manifest
+```
+
+If `RAVEN_PYTHON_MANIFEST` is set, `data.ensure_*` and `binaries.ensure_binary` load it
+lazily — no explicit call needed.
+
+## Consuming it — MATLAB
+
+The same JSON is trivial to read from MATLAB (`webread` + `jsondecode`), download
+(`websave`), and verify (Java's `MessageDigest`, always available in MATLAB):
+
+```matlab
+function file = ensureDataFile(manifestUrl, dataset, name, cacheDir)
+    m = jsondecode(webread(manifestUrl, weboptions('ContentType','text')));
+    entry = m.data.(dataset).files.(matlab.lang.makeValidName(name));
+    file = fullfile(cacheDir, name);
+    if ~isfile(file)
+        websave(file, entry.url);
+    end
+    assert(strcmp(sha256(file), entry.sha256), 'SHA256 mismatch for %s', name);
+end
+
+function hex = sha256(file)
+    fid = fopen(file, 'r'); raw = fread(fid, Inf, '*uint8'); fclose(fid);
+    md = java.security.MessageDigest.getInstance('SHA-256');
+    md.update(raw);
+    hex = lower(reshape(dec2hex(typecast(md.digest(), 'uint8'))', 1, []));
+end
+```
+
+## Publishing — generating manifest entries
+
+After uploading a release's files, add/update an entry with the maintainer script
+([`scripts/make_registry_snippet.py`](https://github.com/SysBioChalmers/raven-python/blob/develop/scripts/make_registry_snippet.py)),
+which computes each SHA256 and byte size:
+
+```bash
+python scripts/make_registry_snippet.py manifest --manifest data/manifest.json \
+    --target data --dataset kegg --version kegg116 --dir artefacts \
+    --base-url https://github.com/SysBioChalmers/raven-python/releases/download/kegg-kegg116 \
+    --doi 10.5281/zenodo.0000000 --source https://zenodo.org/records/0000000
+
+python scripts/make_registry_snippet.py manifest --manifest data/manifest.json \
+    --target binary --bundle diamond --version 2.1.9 --provides diamond --dir zips \
+    --base-url https://github.com/SysBioChalmers/raven-python/releases/download/diamond-2.1.9 \
+    --license GPL-3.0-only
+```
+
+## Where to host
+
+Release **assets are stored separately from the git tree** (GitHub keeps them in a blob
+store), so attaching them to a release does **not** bloat the repository. A dedicated assets
+repository is therefore **optional** — attach the assets to releases on an existing RAVEN
+repo (this one, or MATLAB [RAVEN](https://github.com/SysBioChalmers/RAVEN)) and have **both
+packages reuse the same release-asset URLs** via this manifest.
+
+Use **dedicated tags** for the assets — e.g. `kegg-kegg116`, `diamond-2.1.9` — rather than
+attaching them to code-milestone releases like `v0.1.0a1`. KEGG data updates roughly yearly
+while the code changes often; dedicated tags keep the two cadences decoupled while still
+living in one repository. The manifest's per-dataset `version` does the rest (it namespaces
+the download cache).
+
+Both GitHub Releases and Zenodo are just URLs in the manifest, so consumers don't care —
+mix them per file:
+
+- **GitHub Releases** — simplest, free, language-agnostic, up to **~2 GB per file**. The
+  default home for the manifest and most assets.
+- **Zenodo** — adds a citable **DOI**, long-term archival, and handles files **larger than
+  2 GB** (up to 50 GB/record). Use it for individual large HMM libraries or anything you want
+  citable; point just that file's `url` at the Zenodo record.
+
+### Auto-publishing to Zenodo from GitHub (only if you need DOIs / >2 GB files)
+
+:::{important}
+The **native GitHub↔Zenodo integration** (flip a switch, publish a Release → DOI) archives
+the **repository source zipball** at the tag — it does **not** capture files attached to the
+Release. So it only works for assets *committed into the repo*, which defeats the purpose for
+multi-GB binaries. Use it for a *software* DOI, not for the data assets.
+:::
+
+If you do want Zenodo DOIs (or need to host files >2 GB), keep it GitHub-driven with a small
+**GitHub Action** that, on release, uploads the assets to Zenodo via its REST API (e.g.
+[`zenodraft`](https://github.com/zenodraft/zenodraft)). You cut a normal GitHub Release with
+the files attached; the Action mirrors them to Zenodo and mints a new version DOI. Drop this
+into whichever repo hosts the asset releases as `.github/workflows/zenodo.yml`:
+
+```yaml
+name: Mirror release assets to Zenodo
+on:
+  release:
+    types: [published]
+jobs:
+  zenodo:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-node@v4
+        with: { node-version: "20" }
+      - name: Download this release's assets
+        run: gh release download "${{ github.event.release.tag_name }}" --dir assets
+        env: { GH_TOKEN: "${{ github.token }}" }
+      - name: Deposit a new version on Zenodo
+        run: npx zenodraft@latest version create --publish ${{ vars.ZENODO_CONCEPT_DOI }} assets/*
+        env: { ZENODO_ACCESS_TOKEN: "${{ secrets.ZENODO_TOKEN }}" }
+```
+
+Then record the resulting DOI in the manifest via the `--doi` flag above. Net result: you only
+ever interact with GitHub Releases; Zenodo archiving + DOIs happen automatically.
+
+## Per-asset recommendations
+
+| Asset | Home | Notes |
+| --- | --- | --- |
+| **Software binaries** (BLAST / DIAMOND / HMMER) | **bioconda** preferred; or release ZIPs via the resolver | DIAMOND is **GPL-3.0** — ship its license text in the ZIP; keep it as a separate asset, never bundled into the MIT wheel. |
+| **KEGG HMMs / tables** | GitHub release (dedicated `kegg-*` tag); Zenodo for libraries >2 GB | Derived from the KEGG dump and **redistributed with permission from KEGG**. Note the provenance in the release notes / manifest `license`. |
+| **Template models** (Human-GEM, yeast-GEM) | **Don't re-host** | Fetch from their canonical repos by pinned release tag — respects their licenses and avoids stale copies. |
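
For comparison with the MATLAB `ensureDataFile` helper shown in `data_manifest.md`, here is a minimal Python sketch of the same download-then-verify flow using only the standard library. `sha256_of` and `ensure_file` are hypothetical names, not the `raven_python.data` API, and `entry` is assumed to be one `{url, sha256}` object taken from the manifest.

```python
import hashlib
import urllib.request
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large artefacts need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def ensure_file(entry, name, cache_dir):
    """Download entry['url'] to cache_dir/name if absent, then verify its SHA256."""
    target = Path(cache_dir) / name
    if not target.is_file():
        target.parent.mkdir(parents=True, exist_ok=True)
        # Copy in 1 MiB chunks rather than reading the whole response at once.
        with urllib.request.urlopen(entry["url"]) as resp, open(target, "wb") as out:
            while chunk := resp.read(1 << 20):
                out.write(chunk)
    actual = sha256_of(target)
    if actual != entry["sha256"]:
        raise ValueError(f"SHA256 mismatch for {name}: got {actual}")
    return target
```

Like the MATLAB version, this verifies a cached file on every call, so a corrupted or truncated download is caught the next time it is requested. `urllib.request.urlopen` also accepts `file://` URLs, which is convenient for testing without network access.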

docs/maintenance/index.md

Lines changed: 4 additions & 0 deletions
@@ -8,11 +8,15 @@ rebuild and release them.
   artefact releases.
 - **[Maintaining binaries](maintaining_binaries.md)** — building and publishing the
   external-binary (BLAST / DIAMOND / HMMER) ZIP releases.
+- **[Data & binary manifest](data_manifest.md)** — the shared manifest that lists every
+  published artefact / binary (consumed by raven-python and MATLAB RAVEN), where to host
+  assets (GitHub Releases vs Zenodo), and the GitHub→Zenodo auto-publish setup.
 
 ```{toctree}
 :hidden:
 
 kegg_data_format
 maintaining_kegg_data
 maintaining_binaries
+data_manifest
 ```

docs/reference/api/resolvers.md

Lines changed: 10 additions & 0 deletions
@@ -20,3 +20,13 @@ Data-bundle resolver (KEGG artefacts and template-model data).
 .. automodule:: raven_python.data
    :members:
 ```
+
+## `raven_python.manifest`
+
+Loads a shared [data/binary manifest](../../maintenance/data_manifest.md) into the two
+registries above (and is consulted lazily via `$RAVEN_PYTHON_MANIFEST`).
+
+```{eval-rst}
+.. automodule:: raven_python.manifest
+   :members:
+```
