Commit f5042b2

docs(data): host assets on existing-repo releases; KEGG redistribution permitted
Reflect the chosen distribution model: GitHub release assets live outside the git tree, so a separate data repository is optional — attach assets to dedicated tags (e.g. kegg-kegg116, diamond-2.1.9) on an existing RAVEN repo and reuse the same URLs across raven-python and MATLAB RAVEN. Use Zenodo only for DOIs or files >2 GB. KEGG artefacts are redistributed with permission, so the prior 'confirm rights' caveat is removed. Example/schema URLs repointed from a hypothetical raven-data repo to raven-python.
1 parent 651aa37 commit f5042b2

3 files changed

Lines changed: 42 additions & 27 deletions

data/manifest.example.json

Lines changed: 10 additions & 10 deletions
@@ -5,15 +5,15 @@
   "kegg": {
     "version": "kegg116",
     "description": "KEGG reference model, KO/reaction tables, and prokaryote/eukaryote HMM libraries for getKEGGModelForOrganism.",
-    "license": "Derived from KEGG (subscription-licensed bulk dump) — confirm redistribution rights before publishing publicly.",
+    "license": "Derived from the KEGG database; redistributed with permission from KEGG.",
     "doi": "10.5281/zenodo.0000000",
-    "source": "https://github.com/SysBioChalmers/raven-data/releases/tag/kegg-kegg116",
+    "source": "https://github.com/SysBioChalmers/raven-python/releases/tag/kegg-kegg116",
     "files": {
-      "reference_model.yml.gz": { "url": "https://github.com/SysBioChalmers/raven-data/releases/download/kegg-kegg116/reference_model.yml.gz", "sha256": "0000000000000000000000000000000000000000000000000000000000000000", "bytes": 0 },
-      "ko_reaction.tsv.gz": { "url": "https://github.com/SysBioChalmers/raven-data/releases/download/kegg-kegg116/ko_reaction.tsv.gz", "sha256": "0000000000000000000000000000000000000000000000000000000000000000", "bytes": 0 },
-      "ko_names.tsv.gz": { "url": "https://github.com/SysBioChalmers/raven-data/releases/download/kegg-kegg116/ko_names.tsv.gz", "sha256": "0000000000000000000000000000000000000000000000000000000000000000", "bytes": 0 },
-      "organism_gene_ko.tsv.xz": { "url": "https://github.com/SysBioChalmers/raven-data/releases/download/kegg-kegg116/organism_gene_ko.tsv.xz", "sha256": "0000000000000000000000000000000000000000000000000000000000000000", "bytes": 0 },
-      "rxn_flags.tsv.gz": { "url": "https://github.com/SysBioChalmers/raven-data/releases/download/kegg-kegg116/rxn_flags.tsv.gz", "sha256": "0000000000000000000000000000000000000000000000000000000000000000", "bytes": 0 },
+      "reference_model.yml.gz": { "url": "https://github.com/SysBioChalmers/raven-python/releases/download/kegg-kegg116/reference_model.yml.gz", "sha256": "0000000000000000000000000000000000000000000000000000000000000000", "bytes": 0 },
+      "ko_reaction.tsv.gz": { "url": "https://github.com/SysBioChalmers/raven-python/releases/download/kegg-kegg116/ko_reaction.tsv.gz", "sha256": "0000000000000000000000000000000000000000000000000000000000000000", "bytes": 0 },
+      "ko_names.tsv.gz": { "url": "https://github.com/SysBioChalmers/raven-python/releases/download/kegg-kegg116/ko_names.tsv.gz", "sha256": "0000000000000000000000000000000000000000000000000000000000000000", "bytes": 0 },
+      "organism_gene_ko.tsv.xz": { "url": "https://github.com/SysBioChalmers/raven-python/releases/download/kegg-kegg116/organism_gene_ko.tsv.xz", "sha256": "0000000000000000000000000000000000000000000000000000000000000000", "bytes": 0 },
+      "rxn_flags.tsv.gz": { "url": "https://github.com/SysBioChalmers/raven-python/releases/download/kegg-kegg116/rxn_flags.tsv.gz", "sha256": "0000000000000000000000000000000000000000000000000000000000000000", "bytes": 0 },
       "prokaryotes.hmm": { "url": "https://zenodo.org/records/0000000/files/prokaryotes.hmm", "sha256": "0000000000000000000000000000000000000000000000000000000000000000", "bytes": 0 }
     }
   }
@@ -25,9 +25,9 @@
     "description": "DIAMOND protein aligner (homology-based reconstruction).",
     "license": "GPL-3.0-only — ship the upstream COPYING alongside each ZIP.",
     "platforms": {
-      "linux-x86_64": { "url": "https://github.com/SysBioChalmers/raven-data/releases/download/diamond-2.1.9/diamond-2.1.9-linux-x86_64.zip", "sha256": "0000000000000000000000000000000000000000000000000000000000000000", "bytes": 0 },
-      "macos-arm64": { "url": "https://github.com/SysBioChalmers/raven-data/releases/download/diamond-2.1.9/diamond-2.1.9-macos-arm64.zip", "sha256": "0000000000000000000000000000000000000000000000000000000000000000", "bytes": 0 },
-      "windows-x86_64": { "url": "https://github.com/SysBioChalmers/raven-data/releases/download/diamond-2.1.9/diamond-2.1.9-windows-x86_64.zip", "sha256": "0000000000000000000000000000000000000000000000000000000000000000", "bytes": 0 }
+      "linux-x86_64": { "url": "https://github.com/SysBioChalmers/raven-python/releases/download/diamond-2.1.9/diamond-2.1.9-linux-x86_64.zip", "sha256": "0000000000000000000000000000000000000000000000000000000000000000", "bytes": 0 },
+      "macos-arm64": { "url": "https://github.com/SysBioChalmers/raven-python/releases/download/diamond-2.1.9/diamond-2.1.9-macos-arm64.zip", "sha256": "0000000000000000000000000000000000000000000000000000000000000000", "bytes": 0 },
+      "windows-x86_64": { "url": "https://github.com/SysBioChalmers/raven-python/releases/download/diamond-2.1.9/diamond-2.1.9-windows-x86_64.zip", "sha256": "0000000000000000000000000000000000000000000000000000000000000000", "bytes": 0 }
     }
   }
 }
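
Every GitHub-hosted entry in the manifest above follows the release-download URL pattern `https://github.com/<owner>/<repo>/releases/download/<tag>/<file>`, with one dedicated tag per dataset or bundle version. The convention can be sketched in Python as below. The helper names `dataset_tag` and `release_asset_url` are hypothetical, introduced only for illustration, and are not part of raven-python.

```python
from urllib.parse import quote

GITHUB = "https://github.com"


def dataset_tag(name: str, version: str) -> str:
    """Dedicated asset tag, e.g. ("kegg", "kegg116") -> "kegg-kegg116"."""
    return f"{name}-{version}"


def release_asset_url(tag: str, filename: str,
                      repo: str = "SysBioChalmers/raven-python") -> str:
    """Download URL of a file attached to the release with the given tag.

    Hypothetical helper: the default repo mirrors the URLs in the example
    manifest; any repository that hosts the asset releases would work.
    """
    # Percent-encode tag and filename so unusual characters cannot break the path.
    return (f"{GITHUB}/{repo}/releases/download/"
            f"{quote(tag, safe='')}/{quote(filename, safe='')}")


if __name__ == "__main__":
    print(release_asset_url(dataset_tag("kegg", "kegg116"), "reference_model.yml.gz"))
    print(release_asset_url(dataset_tag("diamond", "2.1.9"),
                            "diamond-2.1.9-linux-x86_64.zip"))
```

Because the tag is derived from the dataset name and version rather than from a code release, the same URL stays valid for both raven-python and MATLAB RAVEN regardless of which code version is installed.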

data/manifest.schema.json

Lines changed: 2 additions & 2 deletions
@@ -1,7 +1,7 @@
 {
   "$schema": "https://json-schema.org/draft/2020-12/schema",
-  "$id": "https://github.com/SysBioChalmers/raven-data/manifest.schema.json",
-  "title": "raven-data manifest",
+  "$id": "https://github.com/SysBioChalmers/raven-python/manifest.schema.json",
+  "title": "RAVEN data/binary manifest",
   "description": "Language-agnostic registry of downloadable raven-python / RAVEN data artefacts and external binary bundles. Consumed by the Python resolvers (raven_python.data / raven_python.binaries) and by MATLAB RAVEN. Every file carries a SHA256 so consumers verify integrity after download.",
   "type": "object",
   "required": ["manifest_version"],
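
The schema description states that every file carries a SHA256 so consumers can verify integrity after download. A minimal Python sketch of such a check follows, assuming a file entry shaped like those in `data/manifest.example.json` (`url`, `sha256`, `bytes`). The function names `sha256_of` and `verify_entry` are hypothetical; this is not the actual code in `raven_python.data`.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Hex SHA256 of a file, read in chunks so multi-GB assets fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_entry(path, entry: dict) -> None:
    """Raise ValueError if the file disagrees with its manifest entry.

    `entry` is assumed to carry the "sha256" and "bytes" keys used in the
    example manifest.
    """
    size = Path(path).stat().st_size
    if size != entry["bytes"]:
        # Cheap check first: a truncated download fails here without hashing.
        raise ValueError(f"{path}: expected {entry['bytes']} bytes, got {size}")
    actual = sha256_of(path)
    if actual != entry["sha256"].lower():
        raise ValueError(f"{path}: SHA256 mismatch ({actual})")
```

Note that the example manifest uses placeholder values (all-zero digests and `"bytes": 0`), so a check like this only becomes meaningful once the real values have been generated.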

docs/maintenance/data_manifest.md

Lines changed: 30 additions & 15 deletions
@@ -31,7 +31,7 @@ Point raven-python at a manifest and the resolvers populate themselves on first
 verifying each download's checksum:
 
 ```bash
-export RAVEN_PYTHON_MANIFEST=https://github.com/SysBioChalmers/raven-data/releases/download/manifest-v1/manifest.json
+export RAVEN_PYTHON_MANIFEST=https://github.com/SysBioChalmers/raven-python/releases/download/manifest-v1/manifest.json
 ```
 
 ```python
@@ -76,25 +76,39 @@ which computes each SHA256 and byte size:
 ```bash
 python scripts/make_registry_snippet.py manifest --manifest data/manifest.json \
 --target data --dataset kegg --version kegg116 --dir artefacts \
---base-url https://github.com/SysBioChalmers/raven-data/releases/download/kegg-kegg116 \
+--base-url https://github.com/SysBioChalmers/raven-python/releases/download/kegg-kegg116 \
 --doi 10.5281/zenodo.0000000 --source https://zenodo.org/records/0000000
 
 python scripts/make_registry_snippet.py manifest --manifest data/manifest.json \
 --target binary --bundle diamond --version 2.1.9 --provides diamond --dir zips \
---base-url https://github.com/SysBioChalmers/raven-data/releases/download/diamond-2.1.9 \
+--base-url https://github.com/SysBioChalmers/raven-python/releases/download/diamond-2.1.9 \
 --license GPL-3.0-only
 ```
 
-## Where to host: GitHub Releases vs Zenodo
+## Where to host
 
-Both are just URLs in the manifest, so consumers don't care — choose per asset:
+Release **assets are stored separately from the git tree** (GitHub keeps them in a blob
+store), so attaching them to a release does **not** bloat the repository. A dedicated assets
+repository is therefore **optional** — attach the assets to releases on an existing RAVEN
+repo (this one, or MATLAB [RAVEN](https://github.com/SysBioChalmers/RAVEN)) and have **both
+packages reuse the same release-asset URLs** via this manifest.
 
-- **GitHub Releases** — simplest, free, language-agnostic, up to ~2 GB per file. Good default,
-and you're already on GitHub for the code.
-- **Zenodo** — adds a citable **DOI**, long-term archival, and handles files larger than 2 GB
-(up to 50 GB/record). Right for the KEGG HMM bundle and anything you want citable.
+Use **dedicated tags** for the assets — e.g. `kegg-kegg116`, `diamond-2.1.9` — rather than
+attaching them to code-milestone releases like `v0.1.0a1`. KEGG data updates roughly yearly
+while the code changes often; dedicated tags keep the two cadences decoupled while still
+living in one repository. The manifest's per-dataset `version` does the rest (it namespaces
+the download cache).
 
-### Auto-publishing to Zenodo from GitHub
+Both GitHub Releases and Zenodo are just URLs in the manifest, so consumers don't care —
+mix them per file:
+
+- **GitHub Releases** — simplest, free, language-agnostic, up to **~2 GB per file**. The
+default home for the manifest and most assets.
+- **Zenodo** — adds a citable **DOI**, long-term archival, and handles files **larger than
+2 GB** (up to 50 GB/record). Use it for individual large HMM libraries or anything you want
+citable; point just that file's `url` at the Zenodo record.
+
+### Auto-publishing to Zenodo from GitHub (only if you need DOIs / >2 GB files)
 
 :::{important}
 The **native GitHub↔Zenodo integration** (flip a switch, publish a Release → DOI) archives
@@ -103,10 +117,11 @@ Release. So it only works for assets *committed into the repo*, which defeats th
 multi-GB binaries. Use it for a *software* DOI, not for the data assets.
 :::
 
-For the data assets, keep everything GitHub-driven with a small **GitHub Action** that, on
-release, uploads the assets to Zenodo via its REST API (e.g. [`zenodraft`](https://github.com/zenodraft/zenodraft)).
-You cut a normal GitHub Release with the files attached; the Action mirrors them to Zenodo and
-mints a new version DOI. Drop this in the data repo as `.github/workflows/zenodo.yml`:
+If you do want Zenodo DOIs (or need to host files >2 GB), keep it GitHub-driven with a small
+**GitHub Action** that, on release, uploads the assets to Zenodo via its REST API (e.g.
+[`zenodraft`](https://github.com/zenodraft/zenodraft)). You cut a normal GitHub Release with
+the files attached; the Action mirrors them to Zenodo and mints a new version DOI. Drop this
+into whichever repo hosts the asset releases as `.github/workflows/zenodo.yml`:
 
 ```yaml
 name: Mirror release assets to Zenodo
@@ -136,5 +151,5 @@ ever interact with GitHub Releases; Zenodo archiving + DOIs happen automatically
 | Asset | Home | Notes |
 | --- | --- | --- |
 | **Software binaries** (BLAST / DIAMOND / HMMER) | **bioconda** preferred; or release ZIPs via the resolver | DIAMOND is **GPL-3.0** — ship its license text in the ZIP; keep it as a separate asset, never bundled into the MIT wheel. |
-| **KEGG HMMs / tables** | **Zenodo** (DOI, >2 GB, archival) | ⚠️ Derived from the subscription-licensed KEGG dump **confirm redistribution rights with KEGG before publishing publicly**. If not permitted, keep access-gated and have users build from their own dump (the resolver supports a local dir). |
+| **KEGG HMMs / tables** | GitHub release (dedicated `kegg-*` tag); Zenodo for libraries >2 GB | Derived from the KEGG dump and **redistributed with permission from KEGG**. Note the provenance in the release notes / manifest `license`. |
 | **Template models** (Human-GEM, yeast-GEM) | **Don't re-host** | Fetch from their canonical repos by pinned release tag — respects their licenses and avoids stale copies. |
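
The per-file hosting rule in the updated `data_manifest.md` (GitHub Releases by default, Zenodo only when a DOI is wanted or the file is too large) can be written as a small decision function. This is a sketch under stated assumptions: the name `choose_host` is hypothetical, and the cutoff treats the document's "~2 GB" as exactly 2 GiB, which is an assumption rather than a figure taken from the commit.

```python
# Assumption: "~2 GB per file" is modelled as exactly 2 GiB.
GITHUB_MAX_BYTES = 2 * 1024**3


def choose_host(size_bytes: int, needs_doi: bool = False) -> str:
    """Return "github" or "zenodo" for a single asset.

    Hypothetical helper mirroring the documented rule: Zenodo only for
    DOIs or files too large for a GitHub release asset.
    """
    if needs_doi or size_bytes >= GITHUB_MAX_BYTES:
        return "zenodo"
    return "github"


if __name__ == "__main__":
    print(choose_host(150 * 1024**2))             # small table
    print(choose_host(3 * 1024**3))               # large HMM library
    print(choose_host(10 * 1024**2, needs_doi=True))
```

Since the choice is made per file, a single dataset can mix both hosts, as the example manifest does by pointing only `prokaryotes.hmm` at a Zenodo record.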
