ci(cuda): fast single-arch CUDA build for validation, full arch set only on publish

claude · claude · commit 698258db03d5 · 2026-06-20T15:24:03.000Z
Invert the CUDA build-time/coverage trade-off in CI without risking the distributed jar. The crosscompile-linux-x86_64-cuda job now sets CUDA_FAST_BUILD=1 (single arch, CUDA_ARCH=90) for validation runs (PR/push/non-publish dispatch) to cut nvcc time, and CUDA_FAST_BUILD=0 (full arch set) only when publish_to_central is set. Because publish-snapshot/publish-release require publish_to_central, every artifact that reaches Maven Central is still built for every GPU generation — only non-distributed validation builds go fast. CI has no GPU so the fast path pins a fixed CUDA_ARCH (native would fail at configure); both vars are forwarded into the dockcross container via DOCKCROSS_ARGS -e. build_cuda_linux.sh's own default stays off, so local/manual builds remain release-safe unless you opt in. Docs updated in CLAUDE.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5
diff --git a/.github/workflows/publish.yml b/.github/workflows/publish.yml
@@ -184,7 +184,15 @@ jobs:
       SCCACHE_LOG: debug
       SCCACHE_ERROR_LOG: /tmp/sccache_server.log
       RUST_BACKTRACE: full
-      DOCKCROSS_ARGS: "-e SCCACHE_WEBDAV_ENDPOINT -e SCCACHE_WEBDAV_TOKEN -e USE_CACHE -e SCCACHE_LOG -e SCCACHE_ERROR_LOG -e RUST_BACKTRACE"
+      # CUDA arch policy: FAST single-arch build for validation runs (PR / push / non-publish
+      # dispatch) to cut nvcc time; FULL arch set only when actually publishing to Central
+      # (publish_to_central=true) so the distributed jar runs on every GPU generation. The
+      # publish-snapshot/publish-release jobs require publish_to_central, so any artifact that
+      # reaches Central is always built with the full set. CI has no GPU, so the fast path pins a
+      # fixed CUDA_ARCH ('native' would fail at configure). '0' => full (release-safe), '1' => fast.
+      CUDA_FAST_BUILD: ${{ inputs.publish_to_central && '0' || '1' }}
+      CUDA_ARCH: '90'
+      DOCKCROSS_ARGS: "-e SCCACHE_WEBDAV_ENDPOINT -e SCCACHE_WEBDAV_TOKEN -e USE_CACHE -e SCCACHE_LOG -e SCCACHE_ERROR_LOG -e RUST_BACKTRACE -e CUDA_FAST_BUILD -e CUDA_ARCH"
     steps:
       - uses: actions/checkout@v6
       - name: Download shared WebUI assets
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -62,11 +62,18 @@ CUDA_FAST_BUILD=1 CUDA_ARCH=90 .github/build_cuda_linux.sh "-DOS_NAME=Linux -DOS
 # Direct-cmake equivalent: cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=native
 ```
 
-**Why a separate, off-by-default flag (never enable it in CI/release):** an artifact built with
-`CUDA_FAST_BUILD` runs on only the single GPU generation it was compiled for. The flag exists
-purely to speed up **local iteration**; the CI CUDA job leaves it unset, so released jars keep
-full arch coverage. To cache the nvcc kernels too you would add
-`-DCMAKE_CUDA_COMPILER_LAUNCHER=sccache` (gated behind the same probe), but sccache's nvcc
+**Default + CI policy (release-safety is the invariant).** An artifact built with `CUDA_FAST_BUILD`
+runs on only the single GPU generation it was compiled for, so the **distributed jar must always be
+the full arch set**. The script default is **off** (full) so any *local/manual* build is
+release-safe. In CI (`publish.yml`, the `crosscompile-linux-x86_64-cuda` job) the flag is **on for
+validation runs** (PR / push / non-publish dispatch) to cut nvcc time, and **off only when actually
+publishing to Central** — it is wired as `CUDA_FAST_BUILD: ${{ inputs.publish_to_central && '0' || '1' }}`
+(`'0'`=full, `'1'`=fast). Because the `publish-snapshot`/`publish-release` jobs require
+`publish_to_central`, **every artifact that reaches Central is built with the full arch set** while
+ordinary PR/push CI stays fast. CI has no GPU, so the fast path pins a fixed `CUDA_ARCH` (default
+`90` in the job env) — `native` would fail at configure. Both `CUDA_FAST_BUILD` and `CUDA_ARCH` are
+forwarded into the dockcross container via `DOCKCROSS_ARGS` `-e`. To cache the nvcc kernels too you
+would add `-DCMAKE_CUDA_COMPILER_LAUNCHER=sccache` (gated behind the same probe), but sccache's nvcc
 caching is unreliable — the arch knob is the better lever and is what this repo ships.
 
 ## Android minimum API level