| 1 | +# Data & binary manifest |
| 2 | + |
| 3 | +Large artefacts (KEGG tables / HMMs, template models) and external-binary bundles |
| 4 | +(BLAST / DIAMOND / HMMER) are **not** committed to the code repository. They are published |
| 5 | +as downloadable assets and described by a single, language-agnostic **manifest** that both |
| 6 | +raven-python and MATLAB RAVEN read. Every file entry carries a **SHA256** checksum, so consumers |
| 7 | +can verify integrity after download. |
| 8 | + |
| 9 | +- Format: [`data/manifest.schema.json`](https://github.com/SysBioChalmers/raven-python/blob/develop/data/manifest.schema.json) (JSON Schema) |
| 10 | +- Worked example: [`data/manifest.example.json`](https://github.com/SysBioChalmers/raven-python/blob/develop/data/manifest.example.json) |
| 11 | +- Live manifest: [`data/manifest.json`](https://github.com/SysBioChalmers/raven-python/blob/develop/data/manifest.json) (empty until assets are published) |
| 12 | + |
| 13 | +The manifest is a superset of the two runtime registries: |
| 14 | + |
| 15 | +| Manifest section | Runtime registry | |
| 16 | +| --- | --- | |
| 17 | +| `data` | {data}`raven_python.data._DATA_REGISTRY` | |
| 18 | +| `binaries` | `raven_python.binaries._REGISTRY` | |
| 19 | + |
| 20 | +```json |
| 21 | +{ |
| 22 | + "manifest_version": 1, |
| 23 | + "data": { "<dataset>": { "version": "...", "doi": "...", "files": { "<name>": {"url": "...", "sha256": "...", "bytes": 0} } } }, |
| 24 | + "binaries": { "<bundle>": { "version": "...", "provides": ["..."], "platforms": { "<os>-<arch>": {"url": "...", "sha256": "...", "bytes": 0} } } } |
| 25 | +} |
| 26 | +``` |
| 27 | + |
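As a rough illustration of the layout above, the following sketch flattens a manifest dictionary into a list of downloadable entries. The helper `iter_assets`, the `Asset` tuple, and the sample values are hypothetical and not part of raven-python; only the key names are taken from the skeleton shown above.

```python
from typing import Iterator, NamedTuple


class Asset(NamedTuple):
    section: str  # "data" or "binaries"
    owner: str    # dataset or bundle name
    key: str      # file name (data) or "<os>-<arch>" (binaries)
    url: str
    sha256: str


def iter_assets(manifest: dict) -> Iterator[Asset]:
    """Yield every downloadable entry in a manifest dict (hypothetical helper)."""
    for dataset, spec in manifest.get("data", {}).items():
        for name, entry in spec.get("files", {}).items():
            yield Asset("data", dataset, name, entry["url"], entry["sha256"])
    for bundle, spec in manifest.get("binaries", {}).items():
        for platform, entry in spec.get("platforms", {}).items():
            yield Asset("binaries", bundle, platform, entry["url"], entry["sha256"])


# Made-up sample values, for illustration only.
sample = {
    "manifest_version": 1,
    "data": {
        "kegg": {
            "version": "kegg116",
            "files": {
                "example.hmm": {"url": "https://example.org/example.hmm", "sha256": "00", "bytes": 1}
            },
        }
    },
    "binaries": {
        "diamond": {
            "version": "2.1.9",
            "provides": ["diamond"],
            "platforms": {
                "linux-x86_64": {"url": "https://example.org/diamond.zip", "sha256": "11", "bytes": 2}
            },
        }
    },
}
assets = list(iter_assets(sample))
```

Treating both sections uniformly like this is one reason a single manifest can serve as a superset of the two registries: a consumer only needs a URL and a checksum per entry.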
| 28 | +## Consuming it — Python |
| 29 | + |
| 30 | +Point raven-python at a manifest and the resolvers populate themselves on first use, |
| 31 | +verifying each download's checksum: |
| 32 | + |
| 33 | +```bash |
| 34 | +export RAVEN_PYTHON_MANIFEST=https://github.com/SysBioChalmers/raven-python/releases/download/manifest-v1/manifest.json |
| 35 | +``` |
| 36 | + |
| 37 | +```python |
| 38 | +from raven_python import manifest |
| 39 | +manifest.load_into_registries() # or load_into_registries("/path/or/url") |
| 40 | +# now data.ensure_kegg_data() / binaries.ensure_binary("diamond") resolve from the manifest |
| 41 | +``` |
| 42 | + |
| 43 | +If `RAVEN_PYTHON_MANIFEST` is set, `data.ensure_*` and `binaries.ensure_binary` load it |
| 44 | +lazily — no explicit call needed. |
| 45 | + |
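The checksum verification mentioned above can be sketched with the standard library alone. This is a minimal sketch, not raven-python's actual implementation; the function names `sha256_of` and `verify` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the lowercase hex SHA256 of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, expected_sha256: str) -> None:
    """Raise ValueError if the file's SHA256 differs from the manifest value."""
    actual = sha256_of(path)
    if actual != expected_sha256.lower():
        raise ValueError(
            f"SHA256 mismatch for {path}: expected {expected_sha256}, got {actual}"
        )
```

Reading in chunks matters here because some assets (HMM libraries, binary bundles) can be large enough that loading them into memory at once is undesirable.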
| 46 | +## Consuming it — MATLAB |
| 47 | + |
| 48 | +The same JSON is trivial to read from MATLAB (`webread` + `jsondecode`), download |
| 49 | +(`websave`), and verify (Java's `MessageDigest`, always available in MATLAB): |
| 50 | + |
| 51 | +```matlab |
| 52 | +function file = ensureDataFile(manifestUrl, dataset, name, cacheDir) |
| 53 | + m = jsondecode(webread(manifestUrl, weboptions('ContentType','text'))); |
| 54 | + entry = m.data.(dataset).files.(matlab.lang.makeValidName(name)); |
| 55 | + file = fullfile(cacheDir, name); |
| 56 | + if ~isfile(file) |
| 57 | + websave(file, entry.url); |
| 58 | + end |
| 59 | + assert(strcmp(sha256(file), entry.sha256), 'SHA256 mismatch for %s', name); |
| 60 | +end |
| 61 | + |
| 62 | +function hex = sha256(file) |
| 63 | + fid = fopen(file, 'r'); raw = fread(fid, Inf, '*uint8'); fclose(fid); |
| 64 | + md = java.security.MessageDigest.getInstance('SHA-256'); |
| 65 | + md.update(raw); |
| 66 | + hex = lower(reshape(dec2hex(typecast(md.digest(), 'uint8'))', 1, [])); |
| 67 | +end |
| 68 | +``` |
| 69 | + |
| 70 | +## Publishing — generating manifest entries |
| 71 | + |
| 72 | +After uploading a release's files, add or update an entry with the maintainer script |
| 73 | +([`scripts/make_registry_snippet.py`](https://github.com/SysBioChalmers/raven-python/blob/develop/scripts/make_registry_snippet.py)), |
| 74 | +which computes the SHA256 and byte size of each file: |
| 75 | + |
| 76 | +```bash |
| 77 | +python scripts/make_registry_snippet.py manifest --manifest data/manifest.json \ |
| 78 | + --target data --dataset kegg --version kegg116 --dir artefacts \ |
| 79 | + --base-url https://github.com/SysBioChalmers/raven-python/releases/download/kegg-kegg116 \ |
| 80 | + --doi 10.5281/zenodo.0000000 --source https://zenodo.org/records/0000000 |
| 81 | + |
| 82 | +python scripts/make_registry_snippet.py manifest --manifest data/manifest.json \ |
| 83 | + --target binary --bundle diamond --version 2.1.9 --provides diamond --dir zips \ |
| 84 | + --base-url https://github.com/SysBioChalmers/raven-python/releases/download/diamond-2.1.9 \ |
| 85 | + --license GPL-3.0-only |
| 86 | +``` |
| 87 | + |
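To make the generated fields concrete, here is a minimal sketch of how a `files` mapping with `url`, `sha256`, and `bytes` could be derived from a directory of artefacts. It is not the code of `make_registry_snippet.py`; the function `describe_files` is hypothetical and only mirrors the field names of the manifest format.

```python
import hashlib
from pathlib import Path


def describe_files(directory: Path, base_url: str) -> dict:
    """Build a manifest-style `files` mapping for every regular file in a directory."""
    files = {}
    for path in sorted(directory.iterdir()):
        if not path.is_file():
            continue
        # Reading the whole file is fine for a sketch; stream in chunks for multi-GB files.
        data = path.read_bytes()
        files[path.name] = {
            "url": f"{base_url.rstrip('/')}/{path.name}",
            "sha256": hashlib.sha256(data).hexdigest(),
            "bytes": len(data),
        }
    return files
```

Computing the checksum from the exact bytes that were uploaded is what lets consumers detect a truncated or altered download later.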
| 88 | +## Where to host |
| 89 | + |
| 90 | +Release **assets are stored separately from the git tree** (GitHub keeps them in a blob |
| 91 | +store), so attaching them to a release does **not** bloat the repository. A dedicated assets |
| 92 | +repository is therefore **optional** — attach the assets to releases on an existing RAVEN |
| 93 | +repo (this one, or MATLAB [RAVEN](https://github.com/SysBioChalmers/RAVEN)) and have **both |
| 94 | +packages reuse the same release-asset URLs** via this manifest. |
| 95 | + |
| 96 | +Use **dedicated tags** for the assets (e.g. `kegg-kegg116`, `diamond-2.1.9`) rather than |
| 97 | +attaching them to code-milestone releases like `v0.1.0a1`. KEGG data updates roughly yearly |
| 98 | +while the code changes often; dedicated tags decouple the two cadences while keeping |
| 99 | +everything in one repository. The manifest's per-dataset `version` does the rest (it namespaces |
| 100 | +the download cache). |
| 101 | + |
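One way a per-dataset `version` could namespace a download cache is to make it a path component, so that files from different versions never collide. The layout `<cache_root>/<dataset>/<version>/<name>` and the function `cache_path` below are assumptions for illustration; the actual cache layout used by raven-python is not described here.

```python
from pathlib import PurePosixPath


def cache_path(cache_root: str, dataset: str, version: str, name: str) -> PurePosixPath:
    """Return a version-namespaced cache location (hypothetical layout)."""
    return PurePosixPath(cache_root) / dataset / version / name


old = cache_path("/cache", "kegg", "kegg115", "example.hmm")
new = cache_path("/cache", "kegg", "kegg116", "example.hmm")
```

With such a layout, bumping `version` in the manifest leads to a fresh download rather than silently reusing a stale file.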
| 102 | +Assets on GitHub Releases and on Zenodo both appear as plain URLs in the manifest, so |
| 103 | +consumers treat them identically and the two hosts can be mixed per file: |
| 104 | + |
| 105 | +- **GitHub Releases** — simplest, free, language-agnostic, up to **~2 GB per file**. The |
| 106 | + default home for the manifest and most assets. |
| 107 | +- **Zenodo** — adds a citable **DOI**, long-term archival, and handles files **larger than |
| 108 | + 2 GB** (up to 50 GB/record). Use it for individual large HMM libraries or anything you want |
| 109 | + citable; point just that file's `url` at the Zenodo record. |
| 110 | + |
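The size guidance in the list above can be expressed as a small check. This is only a sketch: the function `suggest_host` is hypothetical, and the thresholds are the approximate limits quoted in the text (about 2 GB per file on GitHub Releases, 50 GB per Zenodo record), not values read from either service.

```python
GITHUB_MAX_BYTES = 2 * 1024**3   # approximate per-file limit quoted above
ZENODO_MAX_BYTES = 50 * 1024**3  # approximate per-record limit quoted above


def suggest_host(num_bytes: int, needs_doi: bool = False) -> str:
    """Suggest a host for one file, following the guidance in the text."""
    if num_bytes >= ZENODO_MAX_BYTES:
        return "too large for either host as a single file"
    if needs_doi or num_bytes >= GITHUB_MAX_BYTES:
        return "zenodo"
    return "github"
```

For example, a 3 GB HMM library would be routed to Zenodo, while a small DIAMOND ZIP would stay on GitHub Releases unless a DOI is wanted.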
| 111 | +### Auto-publishing to Zenodo from GitHub (only if you need DOIs / >2 GB files) |
| 112 | + |
| 113 | +:::{important} |
| 114 | +The **native GitHub↔Zenodo integration** (flip a switch, publish a Release → DOI) archives |
| 115 | +the **repository source zipball** at the tag — it does **not** capture files attached to the |
| 116 | +Release. So it only works for assets *committed into the repo*, which defeats the purpose for |
| 117 | +multi-GB binaries. Use it for a *software* DOI, not for the data assets. |
| 118 | +::: |
| 119 | + |
| 120 | +If you do want Zenodo DOIs (or need to host files >2 GB), keep it GitHub-driven with a small |
| 121 | +**GitHub Action** that, on release, uploads the assets to Zenodo via its REST API (e.g. |
| 122 | +[`zenodraft`](https://github.com/zenodraft/zenodraft)). You cut a normal GitHub Release with |
| 123 | +the files attached; the Action mirrors them to Zenodo and mints a new version DOI. Drop this |
| 124 | +into whichever repo hosts the asset releases as `.github/workflows/zenodo.yml`: |
| 125 | + |
| 126 | +```yaml |
| 127 | +name: Mirror release assets to Zenodo |
| 128 | +on: |
| 129 | + release: |
| 130 | + types: [published] |
| 131 | +jobs: |
| 132 | + zenodo: |
| 133 | + runs-on: ubuntu-latest |
| 134 | + steps: |
| 135 | + - uses: actions/checkout@v4 |
| 136 | + - uses: actions/setup-node@v4 |
| 137 | + with: { node-version: "20" } |
| 138 | + - name: Download this release's assets |
| 139 | + run: gh release download "${{ github.event.release.tag_name }}" --dir assets |
| 140 | + env: { GH_TOKEN: "${{ github.token }}" } |
| 141 | + - name: Deposit a new version on Zenodo |
| 142 | + run: npx zenodraft@latest version create --publish ${{ vars.ZENODO_CONCEPT_DOI }} assets/* |
| 143 | + env: { ZENODO_ACCESS_TOKEN: "${{ secrets.ZENODO_TOKEN }}" } |
| 144 | +``` |
| 145 | + |
| 146 | +Then record the resulting DOI in the manifest via the `--doi` flag of the maintainer script. The |
| 147 | +net result is that you only ever interact with GitHub Releases; Zenodo archiving and DOI minting happen automatically. |
| 148 | + |
| 149 | +## Per-asset recommendations |
| 150 | + |
| 151 | +| Asset | Home | Notes | |
| 152 | +| --- | --- | --- | |
| 153 | +| **Software binaries** (BLAST / DIAMOND / HMMER) | **bioconda** preferred; or release ZIPs via the resolver | DIAMOND is **GPL-3.0** — ship its license text in the ZIP; keep it as a separate asset, never bundled into the MIT wheel. | |
| 154 | +| **KEGG HMMs / tables** | GitHub release (dedicated `kegg-*` tag); Zenodo for libraries >2 GB | Derived from the KEGG dump and **redistributed with permission from KEGG**. Note the provenance in the release notes / manifest `license`. | |
| 155 | +| **Template models** (Human-GEM, yeast-GEM) | **Don't re-host** | Fetch from their canonical repos by pinned release tag — respects their licenses and avoids stale copies. | |