Commit 4b4216c

Foundation: utilities, model manipulation, binary/data resolvers (#2)
* Add the foundation utilities: GPR, balance, parse, sort, validate
* Add the model-manipulation layer (add, remove, transport, merge, etc.)
* Add binary + data resolvers for external tools and published artefacts
1 parent b7b69ac commit 4b4216c

41 files changed

Lines changed: 4969 additions & 0 deletions

docs/maintaining_binaries.md

Lines changed: 236 additions & 0 deletions
@@ -0,0 +1,236 @@
# Maintaining bundled binaries (BLAST+, DIAMOND, …)

Audience: **raven-python maintainers / the GitHub repo owner.** This explains how
raven-python ships external command-line tools, how to update their versions, and how
to build **minimal-footprint** ZIPs to attach to a GitHub release.

> End users do not need to read this. They get a binary automatically via `ensure_binary`,
> or use their own (system/conda) install. This doc is only for whoever publishes
> the release assets.

---

## 1. How binary provisioning works

raven-python does **not** vendor binaries in the git repo or on PyPI. Instead:

1. For each tool we publish **version-pinned ZIPs as GitHub release assets**.
2. A **registry** (`src/raven_python/binaries_registry.json`) maps each *bundle* to its
   version, the executables it provides, and per-platform `{asset, sha256}`.
3. At run time `raven_python.binaries.ensure_binary("blastp")` resolves a tool in this
   order — and only reaches the download as a last resort:

   ```
   explicit binary= arg → env var (RAVEN_PYTHON_BLASTP / RAVEN_PYTHON_DIAMOND / …)
     → shutil.which on PATH (system / conda / apt / brew)
     → ensure_binary: download the pinned ZIP → verify SHA256 → cache → return path
     → actionable error (with conda / manual instructions)
   ```

So a pre-installed binary always wins; the bundle is the zero-setup fallback.
Pinning the version makes reconstruction **reproducible**.

A *bundle* can provide several executables from one download (e.g. the `blast`
bundle provides both `blastp` and `makeblastdb`), so they are fetched once.
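
The resolution order described in this section can be sketched in a few lines of Python. This is a minimal illustration only, not the actual `raven_python.binaries` code: `resolve_tool` and its `download` callback are hypothetical names, and the `RAVEN_PYTHON_<NAME>` pattern is an assumption inferred from the two variables named above.

```python
import os
import shutil
from pathlib import Path
from typing import Callable, Optional


def resolve_tool(
    name: str,
    binary: Optional[str] = None,
    download: Optional[Callable[[str], Path]] = None,
) -> Path:
    """Resolve an executable in the documented order (illustrative sketch)."""
    # 1. An explicit path passed by the caller always wins.
    if binary:
        return Path(binary)
    # 2. Per-tool environment variable, e.g. RAVEN_PYTHON_BLASTP (assumed pattern).
    env_value = os.environ.get(f"RAVEN_PYTHON_{name.upper()}")
    if env_value:
        return Path(env_value)
    # 3. Anything already on PATH (system, conda, apt, brew).
    found = shutil.which(name)
    if found:
        return Path(found)
    # 4. Last resort: the pinned download (hypothetical stand-in for the bundle fetch).
    if download is not None:
        return download(name)
    # 5. Nothing worked: fail with instructions rather than a bare traceback.
    raise FileNotFoundError(
        f"{name!r} not found; install it (e.g. via conda) or set "
        f"RAVEN_PYTHON_{name.upper()} to its path."
    )
```

Passing the download step in as a callback keeps the sketch free of network access, which also makes the ordering easy to unit-test.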

---

## 2. What raven-python actually needs — ship only these

Distribute the **minimum** set of executables. Everything else (other suite
tools, docs, examples, changelogs) must be excluded.

| Bundle | Executables to include | Everything else |
|---|---|---|
| `diamond` | `diamond` | — (it is a single static binary) |
| `blast` | `blastp`, `makeblastdb` | **drop** `blastn`, `tblastn`, `psiblast`, `rpsblast`, `blast_formatter`, `*_vdb`, the `doc/`, `ChangeLog`, `README`, ~30 other tools |

(Confirmed against RAVEN `getBlast`/`getDiamond`: only `makeblastdb`+`blastp`, and
`diamond` for its `makedb`/`blastp` subcommands, are ever invoked.)

For BLAST+ this is the big win: the full NCBI suite runs to hundreds of MB, while
the two stripped binaries are a small fraction of that.

---

## 3. Asset & ZIP conventions

**Asset filename:** `<bundle>-<version>-<os>-<arch>.zip`

- `<os>`: `linux`, `macos`, `windows`
- `<arch>`: `x86_64`, `arm64`
- examples: `diamond-2.1.11-linux-x86_64.zip`, `blast-2.16.0-macos-arm64.zip`

**ZIP layout — flat, executables at the root, plus the upstream licence:**

```
diamond-2.1.11-linux-x86_64.zip
├── diamond
└── LICENSE

blast-2.16.0-linux-x86_64.zip
├── blastp
├── makeblastdb
└── LICENSE
```

No nested `bin/`, no extra files. `ensure_binary` extracts the ZIP into the cache
and expects the executable at the top level.
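
As a rough illustration of the naming convention, the asset name for the running machine could be derived as below. `platform_key` and `asset_name` are hypothetical helpers, and the alias tables (for instance treating `aarch64` as `arm64` and `AMD64` as `x86_64`) are assumptions rather than the mapping raven-python actually uses.

```python
import platform
from typing import Optional

# Assumed aliases from platform.system()/platform.machine() to the names in this section.
_OS_NAMES = {"Linux": "linux", "Darwin": "macos", "Windows": "windows"}
_ARCH_NAMES = {"x86_64": "x86_64", "amd64": "x86_64", "arm64": "arm64", "aarch64": "arm64"}


def platform_key(system: Optional[str] = None, machine: Optional[str] = None) -> str:
    """Return '<os>-<arch>' for the given (or current) platform."""
    system = system or platform.system()
    machine = machine or platform.machine()
    try:
        return f"{_OS_NAMES[system]}-{_ARCH_NAMES[machine.lower()]}"
    except KeyError:
        raise ValueError(f"unsupported platform: {system}/{machine}") from None


def asset_name(bundle: str, version: str, key: Optional[str] = None) -> str:
    """Compose '<bundle>-<version>-<os>-<arch>.zip'."""
    return f"{bundle}-{version}-{key or platform_key()}.zip"
```

For example, `asset_name("diamond", "2.1.11", platform_key("Linux", "x86_64"))` gives `diamond-2.1.11-linux-x86_64.zip`, matching the first example filename above.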

---

## 4. Step-by-step: add or update a version

Example: bump DIAMOND to a new version for Linux x86-64. Repeat per `(os, arch)`.

1. **Download the official upstream build** (never rebuild from source unless you
   must):
   - DIAMOND → <https://github.com/bbuchfink/diamond/releases>
     (`diamond-linux64.tar.gz`, `diamond-macos.tar.gz`)
   - BLAST+ → <https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/> or a
     pinned version dir (`ncbi-blast-<ver>+-x64-linux.tar.gz`,
     `-x64-macosx.tar.gz`, `-aarch64-linux.tar.gz`, `-x64-win64.tar.gz`).
   - Record the upstream URL **and** its published checksum for provenance.
2. **Extract only the needed executables** (see §2) to a clean staging dir.
3. **Strip debug symbols** to shrink the binaries (skip on Windows / signed macOS builds):
   ```bash
   strip diamond          # or: strip blastp makeblastdb
   ```
4. **Smoke-test the stripped binaries in a clean shell** (no other tools on PATH):
   ```bash
   ./diamond --version
   ./blastp -version && ./makeblastdb -version
   ```
   If they fail for a missing shared library, add that `.so`/`.dylib` to the ZIP
   (rare — NCBI/DIAMOND release builds are largely self-contained).
5. **Add the upstream licence file** as `LICENSE` (see §6).
6. **Zip with max compression, flat layout:**
   ```bash
   zip -9 -j diamond-2.1.11-linux-x86_64.zip diamond LICENSE
   # -j junks paths so entries sit at the ZIP root
   ```
7. **Compute the SHA256:**
   ```bash
   sha256sum diamond-2.1.11-linux-x86_64.zip     # shasum -a 256 on macOS
   ```
8. **Attach the ZIP to a raven-python GitHub release** (a release tagged for the binary
   set, e.g. `binaries-2024.06`, keeps them independent of code releases).
9. **Update the registry** `src/raven_python/binaries_registry.json` — bump `version`
   and set the per-platform `asset` + `sha256`:
   ```json
   {
     "diamond": {
       "version": "2.1.11",
       "provides": ["diamond"],
       "platforms": {
         "linux-x86_64": {
           "asset": "diamond-2.1.11-linux-x86_64.zip",
           "url": "https://github.com/SysBioChalmers/raven-python/releases/download/binaries-2024.06/diamond-2.1.11-linux-x86_64.zip",
           "sha256": "<sha256>"
         }
       }
     },
     "blast": {
       "version": "2.16.0",
       "provides": ["blastp", "makeblastdb"],
       "platforms": { "linux-x86_64": { "asset": "...", "url": "...", "sha256": "..." } }
     }
   }
   ```
10. **Commit the registry change**, run the homology tests, and (if you have the
    binary) confirm `ensure_binary("diamond", version="2.1.11")` downloads,
    verifies, and runs.
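
Steps 6 and 7 can also be done from Python when `zip` or `sha256sum` is not available on the build machine. The sketch below assumes only the standard library; `build_flat_zip` is a hypothetical helper, not part of raven-python, and its digest is computed independently of the project's own `_sha256` helper.

```python
import hashlib
import zipfile
from pathlib import Path
from typing import Iterable


def build_flat_zip(zip_path: Path, files: Iterable[Path]) -> str:
    """Write ``files`` at the ZIP root with maximum deflate level; return the SHA256."""
    with zipfile.ZipFile(
        zip_path, "w", compression=zipfile.ZIP_DEFLATED, compresslevel=9
    ) as zf:
        for path in files:
            # arcname=path.name drops the directory part, like `zip -j`.
            zf.write(path, arcname=path.name)
    # Hash the finished archive in chunks so large ZIPs are not read in one go.
    digest = hashlib.sha256()
    with open(zip_path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned hex digest is the value that would go into the `sha256` field of the registry entry in step 9.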

---

## 5. Keeping the footprint minimal — checklist

- ✅ Only the executables in §2 (for BLAST+, exactly `blastp` + `makeblastdb`).
- ✅ `strip` the binaries (often halves their size).
- ✅ `zip -9 -j` (max compression, flat — no `bin/`, no folders).
- ✅ Exactly one extra file: `LICENSE`.
- ❌ No docs, examples, `ChangeLog`, `README`, man pages, test data, or sibling tools.
- ❌ No `.dSYM`/debug bundles; no duplicate static `.a` libraries.
- ➕ Only add a shared library if step-4 testing proves it is required.
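
Part of this checklist can be checked mechanically before upload. The following is a minimal sketch, assuming the strict case of "listed executables plus `LICENSE`"; `audit_zip` is a hypothetical helper, and a bundle that legitimately carries a shared library, or Windows `.exe` names, would need the expected set widened.

```python
import zipfile
from typing import Iterable, List


def audit_zip(zip_path: str, provides: Iterable[str]) -> List[str]:
    """Return a list of layout problems; an empty list means the ZIP looks minimal."""
    expected = set(provides) | {"LICENSE"}
    with zipfile.ZipFile(zip_path) as zf:
        names = zf.namelist()
    problems = []
    for name in names:
        if "/" in name:
            # Any directory component breaks the flat-layout rule.
            problems.append(f"nested path: {name}")
        elif name not in expected:
            problems.append(f"unexpected file: {name}")
    for missing in sorted(expected - set(names)):
        problems.append(f"missing: {missing}")
    return problems
```

Calling `audit_zip("blast-2.16.0-linux-x86_64.zip", ["blastp", "makeblastdb"])` on a correctly built archive should return an empty list.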

---

## 6. Platform / architecture matrix & licensing

**Coverage = what you build.** Start with `linux-x86_64` (CI default), then add
`macos-arm64`, `macos-x86_64`, `linux-arm64`, `windows-x86_64` as capacity allows.
For any `(os, arch)` **not** in the registry, `ensure_binary` raises an actionable
error pointing to conda (`conda install -c bioconda diamond blast`) or a manual
install — that is the documented fallback, not a failure to fix urgently.
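
The fallback for an unlisted platform might look roughly like this. It is a sketch under stated assumptions: `lookup_asset` is a hypothetical name, the registry is taken to be the parsed JSON shape from §4, and the exception type and message wording are illustrative rather than what `ensure_binary` actually raises.

```python
from typing import Any, Dict


def lookup_asset(registry: Dict[str, Any], bundle: str, platform_key: str) -> Dict[str, str]:
    """Return the registry record for ``platform_key`` or raise an actionable error."""
    entry = registry[bundle]
    try:
        return entry["platforms"][platform_key]
    except KeyError:
        # No ZIP was published for this (os, arch): point the user at the alternatives.
        raise RuntimeError(
            f"No bundled {bundle} {entry['version']} for {platform_key}. "
            "Install it yourself, e.g. `conda install -c bioconda diamond blast`, "
            "or point the matching RAVEN_PYTHON_* variable at an existing install."
        ) from None
```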

**Licensing (must comply when redistributing):**

- **BLAST+** — produced by NCBI (US Government); **public domain**, free to
  redistribute. Include NCBI's `LICENSE` for courtesy/provenance.
- **DIAMOND**: **GPLv3**. Redistribution is allowed; you **must** include the
  GPLv3 licence text in the ZIP and keep the binary unmodified (or offer source).
- **HMMER** (future) — BSD-3-Clause; include its `LICENSE`.

Always ship the upstream licence in the ZIP, and keep a `BINARIES_PROVENANCE.md`
(or a note in the release body) recording, per asset: upstream URL, upstream
version, upstream checksum, and the SHA256 you published.

### Native OS support per tool

raven-python invokes each tool through `subprocess.run([resolved_path, …])` — that
call is itself cross-platform, so the real constraint is whether a given tool has
a binary that runs natively on each OS. It varies:

| Tool | Linux | macOS (incl. arm64) | Windows (native) |
|---|---|---|---|
| BLAST+ (`blastp`, `makeblastdb`) | ✅ | ✅ | ✅ (NCBI ships Windows builds) |
| DIAMOND | ✅ | ✅ | ⚠️ native build exists but Linux-first |
| HMMER (`hmmbuild`/`hmmpress`/`hmmsearch`/`hmmscan`) | ✅ | ✅ | ❌ no official native build |
| MAFFT | ✅ | ✅ | ⚠️ Windows package is a wrapper |
| CD-HIT | ✅ | ✅ | ❌ no Windows build exists |

Implications:

- **Linux / macOS** — everything works. `conda install -c bioconda hmmer mafft
  cd-hit blast diamond`, or point the `RAVEN_PYTHON_*` env vars at your installs.
- **Native Windows** — the homology track (BLAST+/DIAMOND) works, but the **KEGG
  HMM build (3b.3) and HMM query (3b.5) do not**: HMMER and CD-HIT have no Windows
  binaries, and bioconda has no Windows packages for any of them. Bundling can't
  fix this — there is no binary to bundle.
- **Windows users should run raven-python inside WSL2** (or a Linux container), where
  every tool is native Linux. raven-python does **not** replicate RAVEN's
  `getWSLpath`/`wsl …` path translation: it calls the resolved binary directly, so
  mixing native-Windows Python with WSL binaries is unsupported — keep the whole
  stack inside WSL2.
- The common end-user paths — homology reconstruction and the KEGG *species* model
  (3b.4) — need no HMMER/MAFFT/CD-HIT, so they are fully cross-platform.

---

## 7. Emitting the registry entry

After building the per-platform ZIPs (named `<bundle>-<version>-<os>-<arch>.zip`)
and uploading them to the release, generate the `_REGISTRY` entry — checksums and
URLs — with [`scripts/make_registry_snippet.py`](../scripts/README.md):

```bash
python scripts/make_registry_snippet.py binary --bundle blast --version 2.16.0 \
    --provides blastp makeblastdb --dir zips \
    --base-url https://github.com/ORG/raven-python/releases/download/blast-2.16.0
```

It prints the ready-to-paste `_REGISTRY["blast"]` block; its SHA256 helper is the
same one `ensure_binary` verifies with, so the checksums always match. (Producing
the minimal ZIPs themselves — download upstream, `strip`, `zip -9 -j`, add
`LICENSE` per §3–§6 — is still a manual/per-tool step.)
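
The script takes each platform key straight from the ZIP filename, so a misnamed file silently becomes a bogus registry key. A small pre-flight check along these lines could catch that. `platform_from_asset` is a hypothetical helper, and the allowed `<os>`/`<arch>` sets are simply the ones listed in §3.

```python
KNOWN_OS = {"linux", "macos", "windows"}
KNOWN_ARCH = {"x86_64", "arm64"}


def platform_from_asset(filename: str, bundle: str, version: str) -> str:
    """Extract and validate '<os>-<arch>' from '<bundle>-<version>-<os>-<arch>.zip'."""
    prefix = f"{bundle}-{version}-"
    if not (filename.startswith(prefix) and filename.endswith(".zip")):
        raise ValueError(f"{filename!r} does not match {prefix}<os>-<arch>.zip")
    # Same slicing as the script: drop the prefix and the '.zip' suffix.
    key = filename[len(prefix) : -len(".zip")]
    # Split on the first hyphen only, since 'x86_64' contains no hyphen.
    os_name, _, arch = key.partition("-")
    if os_name not in KNOWN_OS or arch not in KNOWN_ARCH:
        raise ValueError(f"unrecognised platform key {key!r} in {filename!r}")
    return key
```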

---

## 8. Adding a new tool later (e.g. HMMER for KEGG reconstruction)

1. Decide the **minimal executable set** (e.g. HMMER → `hmmsearch`, `hmmscan`,
   maybe `hmmbuild`/`hmmpress`).
2. Add a bundle entry to the registry with `provides` listing those executables.
3. Build/attach ZIPs per §3–§4; include the tool's licence (§6).
4. The wrappers call `ensure_binary("hmmsearch", …)` with the same resolution
   order — no new provisioning code needed.
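
The reason no new provisioning code is needed is that an executable can be mapped to its bundle purely through the `provides` lists. A minimal sketch of that lookup follows, assuming the registry shape from §4. `bundle_for` is a hypothetical name, and the `hmmer` entry is an invented placeholder, not a published bundle.

```python
from typing import Any, Dict


def bundle_for(registry: Dict[str, Any], executable: str) -> str:
    """Return the name of the bundle whose ``provides`` list contains ``executable``."""
    for bundle, entry in registry.items():
        if executable in entry["provides"]:
            return bundle
    raise KeyError(f"no bundle provides {executable!r}")


# Illustrative registry only; versions and platform records are left out on purpose.
example_registry = {
    "diamond": {"provides": ["diamond"]},
    "blast": {"provides": ["blastp", "makeblastdb"]},
    "hmmer": {"provides": ["hmmsearch", "hmmscan"]},  # hypothetical future entry
}
```

With this shape, `bundle_for(example_registry, "makeblastdb")` returns `"blast"`, so both BLAST+ executables resolve to the same single download.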

scripts/make_registry_snippet.py

Lines changed: 102 additions & 0 deletions
@@ -0,0 +1,102 @@
#!/usr/bin/env python
"""Emit ready-to-paste registry entries for published artefacts / binary ZIPs.

Computes the SHA256 of each file and prints the Python/JSON entry to merge into
``raven_python.data._DATA_REGISTRY`` (data artefacts) or ``raven_python.binaries._REGISTRY``
(binary bundles). Run once per release, after uploading the files to the release.

Examples
--------
Data artefacts (KEGG reference model + tables + HMM libraries) for one release::

    python scripts/make_registry_snippet.py data \\
        --dataset kegg --version kegg116 --dir artefacts \\
        --base-url https://github.com/ORG/raven-python/releases/download/kegg-data-kegg116

Binary bundle (one ZIP per platform, named ``<bundle>-<version>-<os>-<arch>.zip``)::

    python scripts/make_registry_snippet.py binary \\
        --bundle blast --version 2.16.0 --provides blastp makeblastdb --dir zips \\
        --base-url https://github.com/ORG/raven-python/releases/download/blast-2.16.0

The SHA256 helper is shared with the runtime resolvers (``raven_python.binaries``), so
published checksums always match what ``ensure_data`` / ``ensure_binary`` verify.
"""
from __future__ import annotations

import argparse
import json
import sys
from pathlib import Path

from raven_python.binaries import _sha256


def _files_in(directory: Path) -> list[Path]:
    """Regular, non-hidden files in ``directory``, sorted by name."""
    return sorted(p for p in directory.iterdir() if p.is_file() and not p.name.startswith("."))


def data_entry(dataset: str, version: str, base_url: str, directory: Path) -> dict:
    """Build the ``_DATA_REGISTRY[dataset]`` entry for every file in ``directory``."""
    base = base_url.rstrip("/")
    files = {
        p.name: {"url": f"{base}/{p.name}", "sha256": _sha256(p)} for p in _files_in(directory)
    }
    if not files:
        raise SystemExit(f"No files found in {directory}")
    return {"version": version, "files": files}


def binary_entry(
    bundle: str, version: str, provides: list[str], base_url: str, directory: Path
) -> dict:
    """Build the ``_REGISTRY[bundle]`` entry from ``<bundle>-<version>-<os>-<arch>.zip``."""
    base = base_url.rstrip("/")
    prefix = f"{bundle}-{version}-"
    platforms = {}
    for zip_path in directory.glob(f"{prefix}*.zip"):
        # The platform key is whatever sits between the prefix and the ".zip" suffix.
        platform = zip_path.name[len(prefix) : -len(".zip")]
        platforms[platform] = {"url": f"{base}/{zip_path.name}", "sha256": _sha256(zip_path)}
    if not platforms:
        raise SystemExit(f"No {prefix}*.zip files found in {directory}")
    return {"version": version, "provides": provides, "platforms": dict(sorted(platforms.items()))}


def render(key: str, entry: dict) -> str:
    """Render ``{key: entry}`` as an indented JSON block (valid Python to paste)."""
    return json.dumps({key: entry}, indent=4)


def main(argv: list[str] | None = None) -> None:
    parser = argparse.ArgumentParser(
        description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter
    )
    sub = parser.add_subparsers(dest="kind", required=True)

    d = sub.add_parser("data", help="data-artefact registry entry (raven_python.data)")
    d.add_argument("--dataset", required=True, help="dataset key, e.g. 'kegg'")
    d.add_argument("--version", required=True)
    d.add_argument("--dir", required=True, type=Path, help="directory of uploaded artefacts")
    d.add_argument("--base-url", required=True, help="release download URL prefix")

    b = sub.add_parser("binary", help="binary-bundle registry entry (raven_python.binaries)")
    b.add_argument("--bundle", required=True, help="bundle key, e.g. 'blast'")
    b.add_argument("--version", required=True)
    b.add_argument("--provides", nargs="+", required=True, help="executables the bundle provides")
    b.add_argument("--dir", required=True, type=Path, help="directory of uploaded ZIPs")
    b.add_argument("--base-url", required=True, help="release download URL prefix")

    args = parser.parse_args(argv)
    if args.kind == "data":
        key, entry = args.dataset, data_entry(args.dataset, args.version, args.base_url, args.dir)
        target = "raven_python/data.py _DATA_REGISTRY"
    else:
        key = args.bundle
        entry = binary_entry(args.bundle, args.version, args.provides, args.base_url, args.dir)
        target = "raven_python/binaries.py _REGISTRY"

    # The merge hint goes to stderr so stdout holds only the pasteable JSON.
    print(f"# Merge into {target}:", file=sys.stderr)
    print(render(key, entry))


if __name__ == "__main__":
    main()
