Commit c85ff78

Merge pull request #254 from bernardladenthin/claude/determined-brahmagupta-si4qyu
Enable sccache wrapping of nvcc for full-arch CUDA builds
2 parents 01263b7 + c91d6f2 commit c85ff78

4 files changed

Lines changed: 60 additions & 43 deletions

.github/build.sh

Lines changed: 18 additions & 2 deletions
@@ -98,6 +98,19 @@ if [ "${USE_CACHE:-true}" = "true" ] && command -v sccache >/dev/null 2>&1 \
    && [ -n "${SCCACHE_WEBDAV_TOKEN:-}${SCCACHE_GHA_ENABLED:-}" ] \
    && sccache_can_wrap_compiler; then
   LAUNCH="-DCMAKE_C_COMPILER_LAUNCHER=sccache -DCMAKE_CXX_COMPILER_LAUNCHER=sccache"
+  # CUDA builds: also wrap nvcc so the per-arch .cu device passes are cached too — not just
+  # the gcc host TUs. Those per-architecture device-pass objects are the dominant cost of the
+  # full-arch CUDA job, and sccache does support nvcc as a compiler. Scoped to CUDA builds
+  # (GGML_CUDA in the cmake args): CMAKE_CUDA_COMPILER_LAUNCHER is inert when CUDA is not an
+  # enabled language, but keeping it scoped leaves the CPU/Android jobs' configure output clean.
+  # If sccache cannot wrap nvcc it runs it directly (uncached); and the mid-build retry below
+  # also catches an sccache "Compiler not supported" failure and rebuilds without the launcher,
+  # so an nvcc-hostile sccache can never red the build.
+  case " $* " in
+    *" -DGGML_CUDA=1 "* | *" -DGGML_CUDA=ON "* | *" -DGGML_CUDA=on "*)
+      LAUNCH="$LAUNCH -DCMAKE_CUDA_COMPILER_LAUNCHER=sccache"
+      echo "build.sh: sccache will also wrap nvcc (CUDA build detected)" ;;
+  esac
   echo "build.sh: sccache ON (endpoint=${SCCACHE_WEBDAV_ENDPOINT:-default}), building with -j${JOBS}"
 else
   echo "build.sh: sccache OFF, building with -j${JOBS}"
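Review note: the `case " $* " in` test in the hunk above pads the joined arguments with spaces, so the flag only matches as a whole word. A minimal sketch of that check follows. `is_cuda_build` is a hypothetical helper name used for illustration; `build.sh` inlines the `case` instead.

```shell
#!/usr/bin/env sh
# Sketch only: is_cuda_build is a hypothetical helper, not a function in build.sh.
# It reuses the pattern from the diff: the joined arguments are padded with spaces,
# so "-DGGML_CUDA=ON" must appear as a complete, space-delimited word.
is_cuda_build() {
  case " $* " in
    *" -DGGML_CUDA=1 "* | *" -DGGML_CUDA=ON "* | *" -DGGML_CUDA=on "*) return 0 ;;
    *) return 1 ;;
  esac
}

is_cuda_build -DGGML_CUDA=ON -DOS_NAME=Linux && echo "CUDA build"
is_cuda_build -DGGML_CUDA=OFF || echo "not a CUDA build"
# Flags passed as one quoted string still match, because "$*" joins everything first.
is_cuda_build "-DGGML_CUDA=ON -DOS_NAME=Linux" && echo "CUDA build (single quoted argument)"
# Other spellings that CMake accepts as true are not covered by these three patterns.
is_cuda_build -DGGML_CUDA=TRUE || echo "TRUE is not matched by this pattern"
```

Only the three spellings `1`, `ON` and `on` are recognized. A caller passing `-DGGML_CUDA=TRUE` or `-DGGML_CUDA=On` would still get a working CUDA build, just without the nvcc launcher.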
@@ -113,12 +126,15 @@ cmake -Bbuild $LAUNCH $@ || exit 1
 # check, so recover by retrying the build once WITHOUT the launcher: a from-scratch uncached -O3
 # build is content-identical and release-safe, so the cache can never red the build. The retry is
 # gated on the failure output actually showing an sccache cache error, so a genuine compile error
-# still fails fast (and is reported) instead of triggering a wasteful uncached rebuild.
+# still fails fast (and is reported) instead of triggering a wasteful uncached rebuild. The
+# "Compiler not supported" signature additionally covers the CUDA case: if wrapping nvcc breaks
+# (sccache declining/erroring on the nvcc driver), the retry rebuilds the full-arch CUDA job
+# without any launcher rather than redding it.
 build_log="$(mktemp 2>/dev/null || echo "/tmp/jllama-build.$$.log")"
 cmake --build build --config Release -j"${JOBS}" 2>&1 | tee "$build_log"
 build_rc=${PIPESTATUS[0]}
 if [ "$build_rc" -ne 0 ]; then
-  if [ -n "$LAUNCH" ] && grep -qiE 'sccache: error|Server startup failed|cache storage failed' "$build_log"; then
+  if [ -n "$LAUNCH" ] && grep -qiE 'sccache: error|Server startup failed|cache storage failed|Compiler not supported' "$build_log"; then
     echo "build.sh: build failed via an sccache cache error — retrying WITHOUT cache (clean reconfigure)."
     rm -f "$build_log"
     rm -rf build && mkdir -p build

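Review note: the retry gate above depends on a case-insensitive `grep` over the captured build log. A minimal sketch of that classification follows. `is_cache_failure` is a hypothetical helper name, and both sample log lines are made up for illustration, not real sccache or nvcc output.

```shell
#!/usr/bin/env sh
# Sketch only: is_cache_failure is a hypothetical helper; build.sh inlines this grep.
# Returns 0 when the log file at $1 contains one of the failure signatures from the diff.
is_cache_failure() {
  grep -qiE 'sccache: error|Server startup failed|cache storage failed|Compiler not supported' "$1"
}

log="$(mktemp)"

# Made-up log line standing in for an sccache failure on the nvcc driver.
printf '%s\n' 'sccache: error: Compiler not supported' > "$log"
is_cache_failure "$log" && echo "cache failure: retry without the launcher"

# Made-up log line standing in for an ordinary compile error.
printf '%s\n' 'kernel.cu(12): error: identifier "x" is undefined' > "$log"
is_cache_failure "$log" || echo "genuine compile error: fail fast"

rm -f "$log"
```

Because the match is a plain substring search, a compiler diagnostic that happens to contain one of these phrases would also trigger the uncached rebuild. That costs time but does not change the build result.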
.github/workflows/publish.yml

Lines changed: 12 additions & 17 deletions
@@ -171,30 +171,25 @@ jobs:
     name: Cross-Compile manylinux_2_28 x86_64 (CUDA)
     needs: [startgate, build-webui]
     runs-on: ubuntu-latest
-    # Phase 2 dockcross cache rollout — job 2, enabled after manylinux2014 (job 1) verified green
-    # in CI with sccache v0.16.0 caching to Depot. build_cuda_linux.sh execs build.sh, so the same
-    # probe guards this job: only the gcc C/C++ TUs cache (the nvcc .cu kernels are not wrapped),
-    # still a large partial win on this ~70 min build. Diagnostics are on for its first run on the
-    # manylinux_2_28 image; drop them (and their -e passthroughs) once it is confirmed green with a
-    # cache hit, then enable the next job. Inert without DEPOT_TOKEN (fork PRs) or use_cache=false.
+    # CUDA cache rollout. build_cuda_linux.sh execs build.sh, so the same sccache probe guards
+    # this job. Unlike the other jobs, build.sh now also wraps nvcc (CMAKE_CUDA_COMPILER_LAUNCHER
+    # =sccache) for CUDA builds, so the per-arch .cu device passes — the dominant cost of this job
+    # — are cached too, not just the gcc host TUs. Because nvcc kernels now cache, this job always
+    # builds the FULL CMAKE_CUDA_ARCHITECTURES set (no single-arch validation shortcut): the warm
+    # cache, not a reduced arch set, is what keeps it fast, and every artifact stays release-safe
+    # (runs on every GPU generation) on PR/push as well as publish. The first (cold-cache) run still
+    # pays the full nvcc cost; the win shows on subsequent warm runs. CUDA_FAST_BUILD still exists in
+    # build_cuda_linux.sh as a LOCAL-dev knob, but CI no longer sets it. Diagnostics (SCCACHE_LOG /
+    # SCCACHE_ERROR_LOG / RUST_BACKTRACE) stay on until a warm run confirms nvcc cache hits; drop them
+    # (and their -e passthroughs) afterwards. Inert without DEPOT_TOKEN (fork PRs) or use_cache=false.
     env:
       USE_CACHE: ${{ github.event_name != 'workflow_dispatch' || inputs.use_cache }}
       SCCACHE_WEBDAV_ENDPOINT: https://cache.depot.dev
       SCCACHE_WEBDAV_TOKEN: ${{ secrets.DEPOT_TOKEN }}
       SCCACHE_LOG: debug
       SCCACHE_ERROR_LOG: /tmp/sccache_server.log
       RUST_BACKTRACE: full
-      # CUDA arch policy: FAST single-arch build for validation runs (PR / push / non-publish
-      # dispatch) to cut nvcc time; FULL arch set only when actually publishing to Central
-      # (publish_to_central=true) so the distributed jar runs on every GPU generation. The
-      # publish-snapshot/publish-release jobs require publish_to_central, so any artifact that
-      # reaches Central is always built with the full set. CI has no GPU, so the fast path pins a
-      # fixed CUDA_ARCH ('native' would fail at configure). '0' => full (release-safe), '1' => fast.
-      CUDA_FAST_BUILD: ${{ inputs.publish_to_central && '0' || '1' }}
-      # Newest CUDA 13.2 architecture: sm_120 (consumer Blackwell / RTX 50xx). Only used on the
-      # fast validation path; bump as newer GPU generations ship. Releases ignore it (full set).
-      CUDA_ARCH: '120'
-      DOCKCROSS_ARGS: "-e SCCACHE_WEBDAV_ENDPOINT -e SCCACHE_WEBDAV_TOKEN -e USE_CACHE -e SCCACHE_LOG -e SCCACHE_ERROR_LOG -e RUST_BACKTRACE -e CUDA_FAST_BUILD -e CUDA_ARCH"
+      DOCKCROSS_ARGS: "-e SCCACHE_WEBDAV_ENDPOINT -e SCCACHE_WEBDAV_TOKEN -e USE_CACHE -e SCCACHE_LOG -e SCCACHE_ERROR_LOG -e RUST_BACKTRACE"
     steps:
       - uses: actions/checkout@v7
       - name: Download shared WebUI assets

CLAUDE.md

Lines changed: 28 additions & 22 deletions
@@ -41,15 +41,18 @@ git commit -m "Upgrade CUDA from 13.2 to 13.3"
 ### Fast local CUDA builds (`CUDA_FAST_BUILD`) — single-arch speed knob
 
 The CUDA artifact must ship kernels for **every supported GPU generation**, so the default
-build — and every CI/release build — compiles the **full `CMAKE_CUDA_ARCHITECTURES` set** that
+build — and every CI build — compiles the **full `CMAKE_CUDA_ARCHITECTURES` set** that
 ggml/llama.cpp selects. nvcc recompiles each `.cu` kernel once per architecture, which is the
-dominant cost of the ~70 min CUDA job. **`sccache` does not help here:** it caches the gcc
-C/C++ TUs but not the nvcc `.cu` kernels (sccache's nvcc support is limited/experimental), so
-the per-arch nvcc passes remain even with the cache on. The one reliable lever to cut that time
-is to build **fewer architectures**.
+dominant cost of the ~70 min CUDA job. **`sccache` now wraps nvcc too:** `build.sh` adds
+`-DCMAKE_CUDA_COMPILER_LAUNCHER=sccache` for CUDA builds (it detects `GGML_CUDA` in the cmake
+args), so the per-arch `.cu` device passes are cached over Depot alongside the gcc C/C++ TUs.
+Because the kernels are content-addressed and llama.cpp is pinned, a **warm** cache recompiles
+only what changed — so CI keeps the **full arch set on every run** (release-safe everywhere)
+and relies on the cache, not a reduced arch set, for speed. The first (cold-cache) run still
+pays the full nvcc cost; the win shows on subsequent warm runs.
 
-`build_cuda_linux.sh` therefore honors an **opt-in** env knob — default **off** (full arch set,
-release-safe):
+`CUDA_FAST_BUILD` remains as a **local-dev** single-arch knob (CI no longer sets it).
+`build_cuda_linux.sh` honors it — default **off** (full arch set, release-safe):
 
 ```bash
 # Full release build (default): all archs — slow, runs on every GPU generation.
@@ -65,17 +68,16 @@ CUDA_FAST_BUILD=1 CUDA_ARCH=90 .github/build_cuda_linux.sh "-DOS_NAME=Linux -DOS
 **Default + CI policy (release-safety is the invariant).** An artifact built with `CUDA_FAST_BUILD`
 runs on only the single GPU generation it was compiled for, so the **distributed jar must always be
 the full arch set**. The script default is **off** (full) so any *local/manual* build is
-release-safe. In CI (`publish.yml`, the `crosscompile-linux-x86_64-cuda` job) the flag is **on for
-validation runs** (PR / push / non-publish dispatch) to cut nvcc time, and **off only when actually
-publishing to Central** — it is wired as `CUDA_FAST_BUILD: ${{ inputs.publish_to_central && '0' || '1' }}`
-(`'0'`=full, `'1'`=fast). Because the `publish-snapshot`/`publish-release` jobs require
-`publish_to_central`, **every artifact that reaches Central is built with the full arch set** while
-ordinary PR/push CI stays fast. CI has no GPU, so the fast path pins a fixed `CUDA_ARCH` (default
-`120` — the newest CUDA 13.2 arch, sm_120 / consumer Blackwell — in the job env) — `native`
-would fail at configure. Both `CUDA_FAST_BUILD` and `CUDA_ARCH` are
-forwarded into the dockcross container via `DOCKCROSS_ARGS` `-e`. To cache the nvcc kernels too you
-would add `-DCMAKE_CUDA_COMPILER_LAUNCHER=sccache` (gated behind the same probe), but sccache's nvcc
-caching is unreliable — the arch knob is the better lever and is what this repo ships.
+release-safe, and **CI no longer sets `CUDA_FAST_BUILD` at all** — the `crosscompile-linux-x86_64-cuda`
+job always builds the full set on PR / push / dispatch / publish, so every artifact (not just the ones
+that reach Central) runs on every GPU generation. The full-arch CI cost is absorbed by the
+sccache-over-Depot cache, which now wraps nvcc (`-DCMAKE_CUDA_COMPILER_LAUNCHER=sccache`, added by
+`build.sh` for CUDA builds, gated behind the same probe). The launcher is safe to enable
+unconditionally: if sccache cannot wrap nvcc it runs it directly (uncached), and `build.sh`'s
+mid-build retry treats an sccache `Compiler not supported` failure like any other cache error and
+rebuilds the job without the launcher rather than redding it. **Verify it works:** the premise
+(sccache producing nvcc cache hits inside the manylinux_2_28 container) is proven only by a **warm**
+run — check `sccache --show-stats` shows CUDA hits on the second build before trusting the speedup.
 
 ## Android minimum API level
 
@@ -321,10 +323,14 @@ v0.16.0 + the probe this is no longer a risk.) Job-by-job status:
    **v0.16.0** probe passed in-container (devtoolset-10 gcc), `sccache ON` over Depot WebDAV,
    warm cache 277/278 hits (99.64%), 1m46s build time.
 2. `crosscompile-linux-x86_64-cuda` (via `build_cuda_linux.sh`, which execs `build.sh`) —
-   🚧 **first run in progress** (diagnostics on). Only the gcc C/C++ TUs cache (134 model files
-   + ggml + httplib); the nvcc `.cu` kernels won't (limited sccache nvcc support) — still a
-   large partial win on the ~70 min full-arch job; the fast single-arch (sm_120) validation path
-   cuts nvcc time independently of sccache.
+   🚧 **nvcc caching enabled, full-arch always** (diagnostics on). `build.sh` now also wraps nvcc
+   (`-DCMAKE_CUDA_COMPILER_LAUNCHER=sccache`, scoped to CUDA builds), so both the gcc C/C++ TUs
+   (134 model files + ggml + httplib) **and** the per-arch `.cu` device passes cache over Depot.
+   CI dropped the single-arch validation shortcut (`CUDA_FAST_BUILD`/`CUDA_ARCH` removed from the
+   job) — every run builds the full arch set and leans on the warm cache for speed. **Unverified
+   until a warm run:** confirm `sccache --show-stats` reports CUDA hits on the second build; if
+   nvcc caching proves weak in this container, the cold-vs-warm delta will be small and the job
+   stays ~70 min (the mid-build retry guards against an nvcc-hostile sccache redding the build).
 3. `crosscompile-linux-aarch64` — ✅ **enabled**, now a **native `ubuntu-24.04-arm` build** (not
    dockcross): `build.sh` self-fetches the aarch64 static-musl sccache (the fetch block in
    `build.sh` maps `uname -m` → `x86_64`/`aarch64`) and the probe guards it. See "Linux aarch64:

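Review note: `CLAUDE.md` describes `CUDA_FAST_BUILD` as an opt-in knob that defaults to the full arch set. `build_cuda_linux.sh` itself is not part of this diff, so the following is only a sketch of how such a knob could behave. `cuda_arch_args` is a hypothetical helper, and mapping `CUDA_ARCH` onto `-DCMAKE_CUDA_ARCHITECTURES` is an assumption, not the script's confirmed implementation.

```shell
#!/usr/bin/env sh
# Sketch only: cuda_arch_args is a hypothetical helper. The real build_cuda_linux.sh is not
# shown in this diff and may work differently.
# Assumption: fast mode pins one architecture through CMAKE_CUDA_ARCHITECTURES.
cuda_arch_args() {
  if [ "${CUDA_FAST_BUILD:-0}" = "1" ]; then
    # Fast mode requires an explicit arch; the sketch refuses to guess one.
    printf '%s\n' "-DCMAKE_CUDA_ARCHITECTURES=${CUDA_ARCH:?set CUDA_ARCH for a fast build}"
  fi
  # Knob off (the default): print nothing, so the project's full arch set applies.
}

echo "default:   [$(cuda_arch_args)]"
echo "fast mode: [$(CUDA_FAST_BUILD=1 CUDA_ARCH=90 cuda_arch_args)]"
```

Keeping the default branch empty is what makes an unconfigured local build release-safe: the single-arch restriction only appears when the caller asks for it.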
src/main/java/net/ladenthin/llama/parameters/InferenceParameters.java

Lines changed: 2 additions & 2 deletions
@@ -216,7 +216,7 @@ public InferenceParameters withCachePrompt(boolean cachePrompt) {
      */
     public InferenceParameters withCacheReuse(int cacheReuse) {
         if (cacheReuse < 0) {
-            throw new IllegalArgumentException("cacheReuse must be non-negative");
+            throw new IllegalArgumentException("cacheReuse must be non-negative but was " + cacheReuse);
         }
         return withScalar(PARAM_CACHE_REUSE, cacheReuse);
     }
@@ -231,7 +231,7 @@ public InferenceParameters withCacheReuse(int cacheReuse) {
      */
     public InferenceParameters withSlotId(int slotId) {
         if (slotId < 0) {
-            throw new IllegalArgumentException("slotId must be non-negative");
+            throw new IllegalArgumentException("slotId must be non-negative but was " + slotId);
         }
         return withScalar(PARAM_SLOT_ID, slotId);
     }
