
Commit 698258d

ci(cuda): fast single-arch CUDA build for validation, full arch set only on publish
Invert the CUDA build-time/coverage trade-off in CI without risking the distributed jar. The crosscompile-linux-x86_64-cuda job now sets CUDA_FAST_BUILD=1 (single arch, CUDA_ARCH=90) for validation runs (PR/push/non-publish dispatch) to cut nvcc time, and CUDA_FAST_BUILD=0 (full arch set) only when publish_to_central is set. Because publish-snapshot/publish-release require publish_to_central, every artifact that reaches Maven Central is still built for every GPU generation — only non-distributed validation builds go fast.

CI has no GPU so the fast path pins a fixed CUDA_ARCH (native would fail at configure); both vars are forwarded into the dockcross container via DOCKCROSS_ARGS -e. build_cuda_linux.sh's own default stays off, so local/manual builds remain release-safe unless you opt in. Docs updated in CLAUDE.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5
1 parent dd264b2 commit 698258d

2 files changed

Lines changed: 21 additions & 6 deletions

.github/workflows/publish.yml

Lines changed: 9 additions & 1 deletion
@@ -184,7 +184,15 @@ jobs:
       SCCACHE_LOG: debug
       SCCACHE_ERROR_LOG: /tmp/sccache_server.log
       RUST_BACKTRACE: full
-      DOCKCROSS_ARGS: "-e SCCACHE_WEBDAV_ENDPOINT -e SCCACHE_WEBDAV_TOKEN -e USE_CACHE -e SCCACHE_LOG -e SCCACHE_ERROR_LOG -e RUST_BACKTRACE"
+      # CUDA arch policy: FAST single-arch build for validation runs (PR / push / non-publish
+      # dispatch) to cut nvcc time; FULL arch set only when actually publishing to Central
+      # (publish_to_central=true) so the distributed jar runs on every GPU generation. The
+      # publish-snapshot/publish-release jobs require publish_to_central, so any artifact that
+      # reaches Central is always built with the full set. CI has no GPU, so the fast path pins a
+      # fixed CUDA_ARCH ('native' would fail at configure). '0' => full (release-safe), '1' => fast.
+      CUDA_FAST_BUILD: ${{ inputs.publish_to_central && '0' || '1' }}
+      CUDA_ARCH: '90'
+      DOCKCROSS_ARGS: "-e SCCACHE_WEBDAV_ENDPOINT -e SCCACHE_WEBDAV_TOKEN -e USE_CACHE -e SCCACHE_LOG -e SCCACHE_ERROR_LOG -e RUST_BACKTRACE -e CUDA_FAST_BUILD -e CUDA_ARCH"
     steps:
       - uses: actions/checkout@v6
       - name: Download shared WebUI assets
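
The `CUDA_FAST_BUILD` line in this hunk relies on the `cond && a || b` idiom to pick between the two modes. The same selection can be sketched in plain shell. This is a minimal illustration, assuming `publish_to_central` arrives as the literal string `true` only on a publishing dispatch and is empty on PR/push events; `select_cuda_fast_build` is a hypothetical helper name, not something in this repo.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repo): mirrors the workflow expression
#   ${{ inputs.publish_to_central && '0' || '1' }}
# Assumption: the input is "true" on a publishing dispatch and empty or "false" otherwise.
select_cuda_fast_build() {
  if [ "${1:-}" = "true" ]; then
    echo 0   # publishing: full arch set (release-safe)
  else
    echo 1   # validation run: fast single-arch build
  fi
}

select_cuda_fast_build true   # publish dispatch
select_cuda_fast_build ""     # PR / push, where the input is unset
```

Run as written, the two calls print `0` and then `1`. In the workflow itself the idiom only behaves like a ternary when the middle operand is truthy; the quoted `'0'` is a non-empty string, which is presumably why the commit quotes it rather than writing a bare number.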

CLAUDE.md

Lines changed: 12 additions & 5 deletions
@@ -62,11 +62,18 @@ CUDA_FAST_BUILD=1 CUDA_ARCH=90 .github/build_cuda_linux.sh "-DOS_NAME=Linux -DOS
 # Direct-cmake equivalent: cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=native
 ```
 
-**Why a separate, off-by-default flag (never enable it in CI/release):** an artifact built with
-`CUDA_FAST_BUILD` runs on only the single GPU generation it was compiled for. The flag exists
-purely to speed up **local iteration**; the CI CUDA job leaves it unset, so released jars keep
-full arch coverage. To cache the nvcc kernels too you would add
-`-DCMAKE_CUDA_COMPILER_LAUNCHER=sccache` (gated behind the same probe), but sccache's nvcc
+**Default + CI policy (release-safety is the invariant).** An artifact built with `CUDA_FAST_BUILD`
+runs on only the single GPU generation it was compiled for, so the **distributed jar must always be
+the full arch set**. The script default is **off** (full) so any *local/manual* build is
+release-safe. In CI (`publish.yml`, the `crosscompile-linux-x86_64-cuda` job) the flag is **on for
+validation runs** (PR / push / non-publish dispatch) to cut nvcc time, and **off only when actually
+publishing to Central** — it is wired as `CUDA_FAST_BUILD: ${{ inputs.publish_to_central && '0' || '1' }}`
+(`'0'`=full, `'1'`=fast). Because the `publish-snapshot`/`publish-release` jobs require
+`publish_to_central`, **every artifact that reaches Central is built with the full arch set** while
+ordinary PR/push CI stays fast. CI has no GPU, so the fast path pins a fixed `CUDA_ARCH` (default
+`90` in the job env) — `native` would fail at configure. Both `CUDA_FAST_BUILD` and `CUDA_ARCH` are
+forwarded into the dockcross container via `DOCKCROSS_ARGS` `-e`. To cache the nvcc kernels too you
+would add `-DCMAKE_CUDA_COMPILER_LAUNCHER=sccache` (gated behind the same probe), but sccache's nvcc
 caching is unreliable — the arch knob is the better lever and is what this repo ships.
 
 ## Android minimum API level
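
How `build_cuda_linux.sh` consumes the two variables is not shown in this commit. As a rough sketch only, a script could translate them into a CMake flag along these lines. `cuda_arch_flag` is a hypothetical name, the `native` fallback is inferred from the direct-cmake comment in CLAUDE.md, and omitting the flag for the full build is an assumption about how the full arch set gets selected.

```shell
#!/bin/sh
# Hypothetical sketch (not the repo's build_cuda_linux.sh).
# $1 = CUDA_FAST_BUILD ("1" = fast, anything else = full), $2 = CUDA_ARCH.
cuda_arch_flag() {
  fast="${1:-0}"        # default off, so a bare invocation stays release-safe
  arch="${2:-native}"   # 'native' needs a visible GPU; CI pins a number instead
  if [ "$fast" = "1" ]; then
    echo "-DCMAKE_CUDA_ARCHITECTURES=${arch}"
  else
    # Assumption: with no flag, the project's default (full) arch list applies.
    echo ""
  fi
}

cuda_arch_flag 1 90   # CI validation run: single arch
cuda_arch_flag 0 90   # publish run: no arch override
```

Taking the values as positional arguments keeps the function free of hidden environment state, so the release-safe default is visible at the call site: a caller has to pass `1` explicitly to get a single-arch build.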
