@@ -41,15 +41,18 @@ git commit -m "Upgrade CUDA from 13.2 to 13.3"
4141### Fast local CUDA builds (`CUDA_FAST_BUILD`) — single-arch speed knob
4242
4343The CUDA artifact must ship kernels for **every supported GPU generation**, so the default
44- build — and every CI/release build — compiles the **full `CMAKE_CUDA_ARCHITECTURES` set** that
44+ build — and every CI build — compiles the **full `CMAKE_CUDA_ARCHITECTURES` set** that
4545ggml/llama.cpp selects. nvcc recompiles each `.cu` kernel once per architecture, which is the
46- dominant cost of the ~70 min CUDA job. **`sccache` does not help here:** it caches the gcc
47- C/C++ TUs but not the nvcc `.cu` kernels (sccache's nvcc support is limited/experimental), so
48- the per-arch nvcc passes remain even with the cache on. The one reliable lever to cut that time
49- is to build **fewer architectures**.
46+ dominant cost of the ~70 min CUDA job. **`sccache` now wraps nvcc too:** `build.sh` adds
47+ `-DCMAKE_CUDA_COMPILER_LAUNCHER=sccache` for CUDA builds (it detects `GGML_CUDA` in the cmake
48+ args), so the per-arch `.cu` device passes are cached over Depot alongside the gcc C/C++ TUs.
49+ Because sccache keys each entry on a hash of the compiler inputs and llama.cpp is pinned, a
50+ **warm** cache recompiles only what changed, so CI keeps the **full arch set on every run**
51+ (release-safe everywhere) and relies on the cache, not a reduced arch set, for speed. The first
52+ (cold-cache) run still pays the full nvcc cost; the win shows on subsequent warm runs.
5053
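The launcher wiring described above can be sketched as follows. This is a minimal illustration, not the actual contents of `build.sh`: the function name `cuda_launcher_args` is hypothetical, and matching on the literal `-DGGML_CUDA=ON` is an assumption about how the detection works. `CMAKE_CUDA_COMPILER_LAUNCHER` itself is a standard CMake variable.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not taken from build.sh): prints the cmake args,
# appending the sccache launcher for nvcc only when the build enables CUDA.
cuda_launcher_args() {
  local cmake_args="$1"
  case "$cmake_args" in
    *-DGGML_CUDA=ON*)
      # Assumption: a CUDA build is identified by -DGGML_CUDA=ON in the args.
      printf '%s -DCMAKE_CUDA_COMPILER_LAUNCHER=sccache\n' "$cmake_args"
      ;;
    *)
      # Non-CUDA builds pass through unchanged.
      printf '%s\n' "$cmake_args"
      ;;
  esac
}

cuda_launcher_args "-DGGML_CUDA=ON -DOS_NAME=Linux"
cuda_launcher_args "-DOS_NAME=Linux"
```

Scoping the launcher to CUDA builds keeps the non-CUDA jobs' cmake invocations untouched.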
51- `build_cuda_linux.sh` therefore honors an **opt-in** env knob — default **off** (full arch set,
52- release-safe):
54+ `CUDA_FAST_BUILD` remains as a **local-dev** single-arch knob (CI no longer sets it).
55+ `build_cuda_linux.sh` honors it — default **off** (full arch set, release-safe):
5356
5457``` bash
5558# Full release build (default): all archs — slow, runs on every GPU generation.
@@ -65,17 +68,16 @@ CUDA_FAST_BUILD=1 CUDA_ARCH=90 .github/build_cuda_linux.sh "-DOS_NAME=Linux -DOS
6568**Default + CI policy (release-safety is the invariant).** An artifact built with `CUDA_FAST_BUILD`
6669runs on only the single GPU generation it was compiled for, so the **distributed jar must always be
6770the full arch set**. The script default is **off** (full) so any *local/manual* build is
68- release-safe. In CI (`publish.yml`, the `crosscompile-linux-x86_64-cuda` job) the flag is **on for
69- validation runs** (PR / push / non-publish dispatch) to cut nvcc time, and **off only when actually
70- publishing to Central** — it is wired as `CUDA_FAST_BUILD: ${{ inputs.publish_to_central && '0' || '1' }}`
71- (`'0'`=full, `'1'`=fast). Because the `publish-snapshot`/`publish-release` jobs require
72- `publish_to_central`, **every artifact that reaches Central is built with the full arch set** while
73- ordinary PR/push CI stays fast. CI has no GPU, so the fast path pins a fixed `CUDA_ARCH` (default
74- `120` — the newest CUDA 13.2 arch, sm_120 / consumer Blackwell — in the job env) — `native`
75- would fail at configure. Both `CUDA_FAST_BUILD` and `CUDA_ARCH` are
76- forwarded into the dockcross container via `DOCKCROSS_ARGS` `-e`. To cache the nvcc kernels too you
77- would add `-DCMAKE_CUDA_COMPILER_LAUNCHER=sccache` (gated behind the same probe), but sccache's nvcc
78- caching is unreliable — the arch knob is the better lever and is what this repo ships.
71+ release-safe, and **CI no longer sets `CUDA_FAST_BUILD` at all**: the `crosscompile-linux-x86_64-cuda`
72+ job always builds the full set on PR / push / dispatch / publish, so every artifact (not just the ones
73+ that reach Central) runs on every GPU generation. The full-arch CI cost is absorbed by the
74+ sccache-over-Depot cache, which now wraps nvcc (`-DCMAKE_CUDA_COMPILER_LAUNCHER=sccache`, added by
75+ `build.sh` for CUDA builds, gated behind the same probe). The launcher is safe to enable
76+ unconditionally: if sccache cannot wrap nvcc, it runs nvcc directly (uncached), and `build.sh`'s
77+ mid-build retry treats an sccache `Compiler not supported` failure like any other cache error and
78+ rebuilds the job without the launcher rather than failing it. **Verify it works:** the premise
79+ (sccache producing nvcc cache hits inside the manylinux_2_28 container) is proven only by a **warm**
80+ run, so check that `sccache --show-stats` shows CUDA hits on the second build before trusting the speedup.
7981
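The warm-run check can be scripted along these lines. This is a sketch under stated assumptions: the helper name `cuda_cache_hits` is hypothetical, and the `Cache hits (CUDA)` label is an assumption about the per-language breakdown that `sccache --show-stats` prints, so confirm it against the real output in the container before relying on it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `sccache --show-stats` text on stdin and prints
# the CUDA cache-hit count, or 0 when no CUDA line is present.
# Assumption: the stats contain a line starting with "Cache hits (CUDA)"
# whose last field is the count.
cuda_cache_hits() {
  awk '/^Cache hits \(CUDA\)/ { n = $NF } END { print n + 0 }'
}

# Run after the second (warm) build; skipped when sccache is not installed.
if command -v sccache >/dev/null 2>&1; then
  hits="$(sccache --show-stats | cuda_cache_hits)"
  echo "CUDA cache hits: ${hits}"
fi
```

A count of 0 after a warm build would suggest the nvcc passes are not being cached, in which case the job would be expected to stay near its cold-cache duration.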
8082## Android minimum API level
8183
@@ -321,10 +323,14 @@ v0.16.0 + the probe this is no longer a risk.) Job-by-job status:
321323 **v0.16.0** probe passed in-container (devtoolset-10 gcc), `sccache ON` over Depot WebDAV,
322324 warm cache 277/278 hits (99.64%), 1m46s build time.
3233252. `crosscompile-linux-x86_64-cuda` (via `build_cuda_linux.sh`, which execs `build.sh`) —
324- 🚧 **first run in progress** (diagnostics on). Only the gcc C/C++ TUs cache (134 model files
325- + ggml + httplib); the nvcc `.cu` kernels won't (limited sccache nvcc support) — still a
326- large partial win on the ~70 min full-arch job; the fast single-arch (sm_120) validation path
327- cuts nvcc time independently of sccache.
326+ 🚧 **nvcc caching enabled, full-arch always** (diagnostics on). `build.sh` now also wraps nvcc
327+ (`-DCMAKE_CUDA_COMPILER_LAUNCHER=sccache`, scoped to CUDA builds), so both the gcc C/C++ TUs
328+ (134 model files + ggml + httplib) **and** the per-arch `.cu` device passes cache over Depot.
329+ CI dropped the single-arch validation shortcut (`CUDA_FAST_BUILD`/`CUDA_ARCH` removed from the
330+ job), so every run builds the full arch set and leans on the warm cache for speed. **Unverified
331+ until a warm run:** confirm `sccache --show-stats` reports CUDA hits on the second build; if
332+ nvcc caching proves weak in this container, the cold-vs-warm delta will be small and the job
333+ stays ~70 min (the mid-build retry keeps an sccache that cannot handle nvcc from failing the build).
3283343. `crosscompile-linux-aarch64` — ✅ **enabled**, now a **native `ubuntu-24.04-arm` build** (not
329335 dockcross): `build.sh` self-fetches the aarch64 static-musl sccache (the fetch block in
330336 `build.sh` maps `uname -m` → `x86_64`/`aarch64`) and the probe guards it. See "Linux aarch64: