| 1 | +# Maintaining bundled binaries (BLAST+, DIAMOND, …) |
| 2 | + |
| 3 | +Audience: **raven-python maintainers / the GitHub repo owner.** This explains how |
| 4 | +raven-python ships external command-line tools, how to update their versions, and how |
| 5 | +to build **minimal-footprint** ZIPs to attach to a GitHub release. |
| 6 | + |
| 7 | +> End users never need to read this. They get a binary automatically via `ensure_binary`, |
| 8 | +> or use their own (system/conda) install. This doc is only for whoever publishes |
| 9 | +> the release assets. |
| 10 | + |
| 11 | +--- |
| 12 | + |
| 13 | +## 1. How binary provisioning works |
| 14 | + |
| 15 | +raven-python does **not** vendor binaries in the git repo or on PyPI. Instead: |
| 16 | + |
| 17 | +1. For each tool we publish **version-pinned ZIPs as GitHub release assets**. |
| 18 | +2. A **registry** (`src/raven_python/binaries_registry.json`) maps each *bundle* to its |
| 19 | + version, the executables it provides, and per-platform `{asset, sha256}`. |
| 20 | +3. At run time `raven_python.binaries.ensure_binary("blastp")` resolves a tool in this |
| 21 | + order — and only reaches the download as a last resort: |
| 22 | + |
| 23 | + ``` |
| 24 | + explicit binary= arg → env var (RAVEN_PYTHON_BLASTP / RAVEN_PYTHON_DIAMOND / …) |
| 25 | + → shutil.which on PATH (system / conda / apt / brew) |
| 26 | + → ensure_binary: download the pinned ZIP → verify SHA256 → cache → return path |
| 27 | + → actionable error (with conda / manual instructions) |
| 28 | + ``` |
| 29 | + |
| 30 | +So a pre-installed binary always wins; the bundle is the zero-setup fallback. |
| 31 | +Pinning the version makes reconstruction **reproducible**. |
| 32 | + |
| 33 | +A *bundle* can provide several executables from one download (e.g. the `blast` |
| 34 | +bundle provides both `blastp` and `makeblastdb`), so they are fetched once. |
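The resolution order in step 3 can be sketched in a few lines of Python. This is a minimal illustration, not the actual `ensure_binary` implementation: `resolve_tool` and its `download` callback are hypothetical names, and the environment-variable name is assumed to follow the `RAVEN_PYTHON_<TOOL>` pattern shown above.

```python
import os
import shutil
from pathlib import Path
from typing import Callable, Optional


def resolve_tool(
    name: str,
    binary: Optional[str] = None,
    download: Optional[Callable[[str], Path]] = None,
) -> Path:
    """Resolve an executable in the documented order (illustrative only)."""
    # 1. An explicit binary= argument always wins.
    if binary:
        return Path(binary)

    # 2. Per-tool environment variable, e.g. RAVEN_PYTHON_BLASTP.
    env_value = os.environ.get(f"RAVEN_PYTHON_{name.upper()}")
    if env_value:
        return Path(env_value)

    # 3. Anything already on PATH (system / conda / apt / brew).
    found = shutil.which(name)
    if found:
        return Path(found)

    # 4. Last resort: the pinned-bundle download (hypothetical callback here).
    if download is not None:
        return download(name)

    # 5. Nothing worked: fail with instructions rather than a bare error.
    raise FileNotFoundError(
        f"Could not find '{name}'. Install it (for example with conda) "
        f"or set RAVEN_PYTHON_{name.upper()} to its path."
    )
```

Keeping the download behind a callback makes the "pre-installed binary always wins" rule easy to test without network access.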
| 35 | + |
| 36 | +--- |
| 37 | + |
| 38 | +## 2. What raven-python actually needs — ship only these |
| 39 | + |
| 40 | +Distribute the **minimum** set of executables. Everything else (other suite |
| 41 | +tools, docs, examples, changelogs) must be excluded. |
| 42 | + |
| 43 | +| Bundle | Executables to include | Everything else | |
| 44 | +|---|---|---| |
| 45 | +| `diamond` | `diamond` | — (it is a single static binary) | |
| 46 | +| `blast` | `blastp`, `makeblastdb` | **drop** `blastn`, `tblastn`, `psiblast`, `rpsblast`, `blast_formatter`, `*_vdb`, the `doc/`, `ChangeLog`, `README`, ~30 other tools | |
| 47 | + |
| 48 | +(Confirmed against RAVEN `getBlast`/`getDiamond`: only `makeblastdb`+`blastp`, and |
| 49 | +`diamond` for its `makedb`/`blastp` subcommands, are ever invoked.) |
| 50 | + |
| 51 | +For BLAST+ this is the big win: the full NCBI suite runs to hundreds of MB, while the two |
| 52 | +stripped binaries are a small fraction of that. |
| 53 | + |
| 54 | +--- |
| 55 | + |
| 56 | +## 3. Asset & ZIP conventions |
| 57 | + |
| 58 | +**Asset filename:** `<bundle>-<version>-<os>-<arch>.zip` |
| 59 | + |
| 60 | +- `<os>` ∈ `linux`, `macos`, `windows` |
| 61 | +- `<arch>` ∈ `x86_64`, `arm64` |
| 62 | +- examples: `diamond-2.1.11-linux-x86_64.zip`, `blast-2.16.0-macos-arm64.zip` |
| 63 | + |
| 64 | +**ZIP layout — flat, executables at the root, plus the upstream licence:** |
| 65 | + |
| 66 | +``` |
| 67 | +diamond-2.1.11-linux-x86_64.zip |
| 68 | +├── diamond |
| 69 | +└── LICENSE |
| 70 | + |
| 71 | +blast-2.16.0-linux-x86_64.zip |
| 72 | +├── blastp |
| 73 | +├── makeblastdb |
| 74 | +└── LICENSE |
| 75 | +``` |
| 76 | + |
| 77 | +No nested `bin/`, no extra files. `ensure_binary` extracts the ZIP into the cache |
| 78 | +and expects the executable at the top level. |
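Before uploading, the naming and layout conventions above can be checked mechanically. The sketch below is an assumption-laden helper, not part of raven-python: `parse_asset_name` and `check_flat_layout` are hypothetical names, the version pattern assumes dotted numeric versions, and Windows `.exe` suffixes are not handled.

```python
import re
import zipfile

# Pattern for <bundle>-<version>-<os>-<arch>.zip as described above.
ASSET_RE = re.compile(
    r"^(?P<bundle>[a-z0-9_]+)-(?P<version>\d+(?:\.\d+)*)"
    r"-(?P<os>linux|macos|windows)-(?P<arch>x86_64|arm64)\.zip$"
)


def parse_asset_name(filename: str) -> dict:
    """Split an asset filename into its parts, or raise ValueError."""
    match = ASSET_RE.match(filename)
    if match is None:
        raise ValueError(f"not a valid asset name: {filename}")
    return match.groupdict()


def check_flat_layout(zip_path, expected_executables) -> list:
    """Return a list of problems; an empty list means the layout looks right."""
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
    # Any "/" in an entry name means a folder such as bin/ slipped in.
    problems = [f"nested entry: {n}" for n in names if "/" in n]
    expected = set(expected_executables) | {"LICENSE"}
    problems += [f"missing: {n}" for n in sorted(expected - set(names))]
    problems += [
        f"unexpected: {n}" for n in sorted(set(names) - expected) if "/" not in n
    ]
    return problems
```

For example, `check_flat_layout("blast-2.16.0-linux-x86_64.zip", ["blastp", "makeblastdb"])` should return an empty list for a correctly built asset.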
| 79 | + |
| 80 | +--- |
| 81 | + |
| 82 | +## 4. Step-by-step: add or update a version |
| 83 | + |
| 84 | +Example: bump DIAMOND to a new version for Linux x86-64. Repeat per `(os, arch)`. |
| 85 | + |
| 86 | +1. **Download the official upstream build** (never rebuild from source unless you |
| 87 | + must): |
| 88 | + - DIAMOND → <https://github.com/bbuchfink/diamond/releases> |
| 89 | + (`diamond-linux64.tar.gz`, `diamond-macos.tar.gz`) |
| 90 | + - BLAST+ → <https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/> or a |
| 91 | + pinned version dir (`ncbi-blast-<ver>+-x64-linux.tar.gz`, |
| 92 | + `-x64-macosx.tar.gz`, `-aarch64-linux.tar.gz`, `-x64-win64.tar.gz`). |
| 93 | + - Record the upstream URL **and** its published checksum for provenance. |
| 94 | +2. **Extract only the needed executables** (see §2) to a clean staging dir. |
| 95 | +3. **Strip debug symbols** to shrink the binaries (skip on Windows / signed macOS builds): |
| 96 | + ```bash |
| 97 | + strip diamond # or: strip blastp makeblastdb |
| 98 | + ``` |
| 99 | +4. **Smoke-test the stripped binaries in a clean shell** (no other tools on PATH): |
| 100 | + ```bash |
| 101 | + ./diamond --version |
| 102 | + ./blastp -version && ./makeblastdb -version |
| 103 | + ``` |
| 104 | + If they fail for a missing shared library, add that `.so`/`.dylib` to the ZIP |
| 105 | + (rare — NCBI/DIAMOND release builds are largely self-contained). |
| 106 | +5. **Add the upstream licence file** as `LICENSE` (see §6). |
| 107 | +6. **Zip with max compression, flat layout:** |
| 108 | + ```bash |
| 109 | + zip -9 -j diamond-2.1.11-linux-x86_64.zip diamond LICENSE |
| 110 | + # -j junks paths so entries sit at the ZIP root |
| 111 | + ``` |
| 112 | +7. **Compute the SHA256:** |
| 113 | + ```bash |
| 114 | + sha256sum diamond-2.1.11-linux-x86_64.zip # shasum -a 256 on macOS |
| 115 | + ``` |
| 116 | +8. **Attach the ZIP to a raven-python GitHub release** (a release tagged for the binary |
| 117 | + set, e.g. `binaries-2024.06`, keeps them independent of code releases). |
| 118 | +9. **Update the registry** `src/raven_python/binaries_registry.json`: bump `version` |
| 119 | + and set the per-platform `asset`, `url`, and `sha256`: |
| 120 | + ```json |
| 121 | + { |
| 122 | + "diamond": { |
| 123 | + "version": "2.1.11", |
| 124 | + "provides": ["diamond"], |
| 125 | + "platforms": { |
| 126 | + "linux-x86_64": { |
| 127 | + "asset": "diamond-2.1.11-linux-x86_64.zip", |
| 128 | + "url": "https://github.com/SysBioChalmers/raven-python/releases/download/binaries-2024.06/diamond-2.1.11-linux-x86_64.zip", |
| 129 | + "sha256": "<sha256>" |
| 130 | + } |
| 131 | + } |
| 132 | + }, |
| 133 | + "blast": { |
| 134 | + "version": "2.16.0", |
| 135 | + "provides": ["blastp", "makeblastdb"], |
| 136 | + "platforms": { "linux-x86_64": { "asset": "...", "url": "...", "sha256": "..." } } |
| 137 | + } |
| 138 | + } |
| 139 | + ``` |
| 140 | +10. **Commit the registry change**, run the homology tests, and (if you have the |
| 141 | + binary) confirm `ensure_binary("diamond", version="2.1.11")` downloads, |
| 142 | + verifies, and runs. |
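Steps 6 and 7 can also be done from Python, which is convenient on machines without `zip` or `sha256sum`. This is a rough equivalent under stated assumptions, not a project script: `build_flat_zip` and `sha256_of` are hypothetical helper names.

```python
import hashlib
import zipfile
from pathlib import Path


def build_flat_zip(zip_path, files) -> Path:
    """Write files at the ZIP root with maximum deflate compression."""
    zip_path = Path(zip_path)
    with zipfile.ZipFile(
        zip_path, "w", zipfile.ZIP_DEFLATED, compresslevel=9
    ) as archive:
        for file in map(Path, files):
            # arcname=file.name drops the directory part, like `zip -j`.
            archive.write(file, arcname=file.name)
    return zip_path


def sha256_of(path, chunk_size=1 << 20) -> str:
    """Hex SHA256 of a file, read in chunks so large assets are not loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The hex digest returned by `sha256_of` is the value that goes into the registry's `sha256` field in step 9.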
| 143 | + |
| 144 | +--- |
| 145 | + |
| 146 | +## 5. Keeping the footprint minimal — checklist |
| 147 | + |
| 148 | +- ✅ Only the executables in §2 (for BLAST+, exactly `blastp` + `makeblastdb`). |
| 149 | +- ✅ `strip` the binaries (often halves their size). |
| 150 | +- ✅ `zip -9 -j` (max compression, flat — no `bin/`, no folders). |
| 151 | +- ✅ Exactly one extra file: `LICENSE`. |
| 152 | +- ❌ No docs, examples, `ChangeLog`, `README`, man pages, test data, or sibling tools. |
| 153 | +- ❌ No `.dSYM`/debug bundles; no duplicate static `.a` libraries. |
| 154 | +- ➕ Only add a shared library if step-4 testing proves it is required. |
| 155 | + |
| 156 | +--- |
| 157 | + |
| 158 | +## 6. Platform / architecture matrix & licensing |
| 159 | + |
| 160 | +**Coverage = what you build.** Start with `linux-x86_64` (CI default), then add |
| 161 | +`macos-arm64`, `macos-x86_64`, `linux-arm64`, `windows-x86_64` as capacity allows. |
| 162 | +For any `(os, arch)` **not** in the registry, `ensure_binary` raises an actionable |
| 163 | +error pointing to conda (`conda install -c bioconda diamond blast`) or a manual |
| 164 | +install — that is the documented fallback, not a failure to fix urgently. |
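The lookup described here might look roughly like the following. It is a sketch only: `platform_key` and `lookup_asset` are hypothetical names, the registry shape is taken from the JSON example in §4, and the mapping of `platform.machine()` values to `x86_64`/`arm64` is an assumption about how the real code normalizes them.

```python
import platform

_OS_NAMES = {"Linux": "linux", "Darwin": "macos", "Windows": "windows"}
_ARCH_NAMES = {
    "x86_64": "x86_64",
    "amd64": "x86_64",   # Windows reports AMD64
    "arm64": "arm64",    # macOS on Apple silicon
    "aarch64": "arm64",  # Linux on ARM
}


def platform_key(system=None, machine=None) -> str:
    """Return the '<os>-<arch>' key used in the registry's platforms map."""
    system = system or platform.system()
    machine = (machine or platform.machine()).lower()
    try:
        return f"{_OS_NAMES[system]}-{_ARCH_NAMES[machine]}"
    except KeyError:
        raise RuntimeError(f"unsupported platform: {system}/{machine}") from None


def lookup_asset(registry: dict, bundle: str, key: str) -> dict:
    """Return the per-platform entry, or raise an actionable error."""
    platforms = registry[bundle]["platforms"]
    if key not in platforms:
        raise RuntimeError(
            f"No bundled '{bundle}' for {key}. Install it yourself, e.g. "
            "`conda install -c bioconda diamond blast`, or point the "
            "matching RAVEN_PYTHON_* variable at an existing binary."
        )
    return platforms[key]
```

Raising a descriptive error instead of returning `None` keeps the missing-platform case visible to the user at the point of failure.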
| 165 | + |
| 166 | +**Licensing (must comply when redistributing):** |
| 167 | + |
| 168 | +- **BLAST+** — produced by NCBI (US Government); **public domain**, free to |
| 169 | + redistribute. Include NCBI's `LICENSE` for courtesy/provenance. |
| 170 | +- **DIAMOND** — **GPLv3**. Redistribution is allowed; you **must** include the |
| 171 | + GPLv3 licence text in the ZIP and keep the binary unmodified (or offer source). |
| 172 | +- **HMMER** (future) — BSD-3-Clause; include its `LICENSE`. |
| 173 | + |
| 174 | +Always ship the upstream licence in the ZIP, and keep a `BINARIES_PROVENANCE.md` |
| 175 | +(or a note in the release body) recording, per asset: upstream URL, upstream |
| 176 | +version, upstream checksum, and the SHA256 you published. |
| 177 | + |
| 178 | +### Native OS support per tool |
| 179 | + |
| 180 | +raven-python invokes each tool through `subprocess.run([resolved_path, …])` — that |
| 181 | +call is itself cross-platform, so the real constraint is whether a given tool has |
| 182 | +a binary that runs natively on each OS. It varies: |
| 183 | + |
| 184 | +| Tool | Linux | macOS (incl. arm64) | Windows (native) | |
| 185 | +|---|---|---|---| |
| 186 | +| BLAST+ (`blastp`, `makeblastdb`) | ✅ | ✅ | ✅ (NCBI ships Windows builds) | |
| 187 | +| DIAMOND | ✅ | ✅ | ⚠️ native build exists but Linux-first | |
| 188 | +| HMMER (`hmmbuild`/`hmmpress`/`hmmsearch`/`hmmscan`) | ✅ | ✅ | ❌ no official native build | |
| 189 | +| MAFFT | ✅ | ✅ | ⚠️ Windows package is a wrapper | |
| 190 | +| CD-HIT | ✅ | ✅ | ❌ no Windows build exists | |
| 191 | + |
| 192 | +Implications: |
| 193 | + |
| 194 | +- **Linux / macOS** — everything works. `conda install -c bioconda hmmer mafft |
| 195 | + cd-hit blast diamond`, or point the `RAVEN_PYTHON_*` env vars at your installs. |
| 196 | +- **Native Windows** — the homology track (BLAST+/DIAMOND) works, but the **KEGG |
| 197 | + HMM build (3b.3) and HMM query (3b.5) do not**: HMMER and CD-HIT have no Windows |
| 198 | + binaries, and bioconda has no Windows packages for any of them. Bundling can't |
| 199 | + fix this — there is no binary to bundle. |
| 200 | +- **Windows users should run raven-python inside WSL2** (or a Linux container), where |
| 201 | + every tool is native Linux. raven-python does **not** replicate RAVEN's |
| 202 | + `getWSLpath`/`wsl …` path translation: it calls the resolved binary directly, so |
| 203 | + mixing native-Windows Python with WSL binaries is unsupported — keep the whole |
| 204 | + stack inside WSL2. |
| 205 | +- The common end-user paths — homology reconstruction and the KEGG *species* model |
| 206 | + (3b.4) — need no HMMER/MAFFT/CD-HIT, so they are fully cross-platform. |
| 207 | + |
| 208 | +--- |
| 209 | + |
| 210 | +## 7. Emitting the registry entry |
| 211 | + |
| 212 | +After building the per-platform ZIPs (named `<bundle>-<version>-<os>-<arch>.zip`) |
| 213 | +and uploading them to the release, generate the `_REGISTRY` entry — checksums and |
| 214 | +URLs — with [`scripts/make_registry_snippet.py`](../scripts/README.md): |
| 215 | + |
| 216 | +```bash |
| 217 | +python scripts/make_registry_snippet.py binary --bundle blast --version 2.16.0 \ |
| 218 | + --provides blastp makeblastdb --dir zips \ |
| 219 | + --base-url https://github.com/ORG/raven-python/releases/download/blast-2.16.0 |
| 220 | +``` |
| 221 | + |
| 222 | +It prints the ready-to-paste `_REGISTRY["blast"]` block; its SHA256 helper is the |
| 223 | +same one `ensure_binary` verifies with, so the checksums always match. (Producing |
| 224 | +the minimal ZIPs themselves — download upstream, `strip`, `zip -9 -j`, add |
| 225 | +`LICENSE` per §3–§6 — is still a manual/per-tool step.) |
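To make the idea concrete, here is a rough sketch of how such an entry could be assembled from a directory of ZIPs. It is not the code of `scripts/make_registry_snippet.py`: `registry_entry` is a hypothetical function, and the output shape simply mirrors the JSON example in §4.

```python
import hashlib
from pathlib import Path


def registry_entry(bundle, version, provides, zip_dir, base_url) -> dict:
    """Build a registry block from every '<bundle>-<version>-*.zip' in zip_dir."""
    prefix = f"{bundle}-{version}-"
    platforms = {}
    for zip_path in sorted(Path(zip_dir).glob(f"{prefix}*.zip")):
        # The remainder of the stem is the platform key, e.g. "linux-x86_64".
        key = zip_path.stem[len(prefix):]
        platforms[key] = {
            "asset": zip_path.name,
            "url": f"{base_url.rstrip('/')}/{zip_path.name}",
            "sha256": hashlib.sha256(zip_path.read_bytes()).hexdigest(),
        }
    return {
        bundle: {
            "version": version,
            "provides": list(provides),
            "platforms": platforms,
        }
    }
```

Passing the result through `json.dumps(entry, indent=2)` gives text that can be pasted into the registry and reviewed in a normal diff.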
| 226 | + |
| 227 | +--- |
| 228 | + |
| 229 | +## 8. Adding a new tool later (e.g. HMMER for KEGG reconstruction) |
| 230 | + |
| 231 | +1. Decide the **minimal executable set** (e.g. HMMER → `hmmsearch`, `hmmscan`, |
| 232 | + maybe `hmmbuild`/`hmmpress`). |
| 233 | +2. Add a bundle entry to the registry with `provides` listing those executables. |
| 234 | +3. Build/attach ZIPs per §3–§4; include the tool's licence (§6). |
| 235 | +4. The wrappers call `ensure_binary("hmmsearch", …)` with the same resolution |
| 236 | + order — no new provisioning code needed. |
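The reason no new provisioning code is needed is that an executable name can be mapped to its bundle through `provides`. A minimal sketch of that lookup follows. `bundle_for` is a hypothetical name, and the `hmmer` entry is purely illustrative since no such bundle exists in the registry yet.

```python
def bundle_for(registry: dict, executable: str) -> str:
    """Return the name of the bundle whose 'provides' list contains executable."""
    for bundle, entry in registry.items():
        if executable in entry.get("provides", []):
            return bundle
    raise KeyError(f"no bundle provides '{executable}'")


# Illustrative registry: the first two entries mirror §4, "hmmer" is hypothetical.
EXAMPLE_REGISTRY = {
    "diamond": {"provides": ["diamond"]},
    "blast": {"provides": ["blastp", "makeblastdb"]},
    "hmmer": {"provides": ["hmmsearch", "hmmscan"]},
}
```

With this shape, `bundle_for(EXAMPLE_REGISTRY, "makeblastdb")` resolves to `"blast"`, so both BLAST+ executables share one download.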