Skip to content

Commit b20c2d3

Browse files
Merge pull request #255 from bernardladenthin/claude/cuda-cache-cleanup
CUDA CI: drop sccache debug diagnostics now that nvcc caching is proven
2 parents c85ff78 + faa93d6 commit b20c2d3

4 files changed

Lines changed: 33 additions & 33 deletions

File tree

.github/build_cuda_linux.sh

Lines changed: 10 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -15,18 +15,19 @@ sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute
1515

1616
sudo dnf install -y cuda-toolkit-13-2
1717

18-
# CUDA target architectures — build-speed knob.
18+
# CUDA target architectures — LOCAL-dev build-speed knob.
1919
#
2020
# Default (CUDA_FAST_BUILD unset): we do NOT pass CMAKE_CUDA_ARCHITECTURES, so ggml/llama.cpp
21-
# compiles its full default arch set. That is exactly what release artifacts must ship (every
22-
# supported GPU generation) and is the slow part of this ~70 min job: nvcc recompiles each .cu
23-
# kernel once per architecture. sccache caches the gcc C/C++ TUs but NOT the nvcc .cu kernels
24-
# (sccache's nvcc support is limited/experimental), so the per-arch nvcc passes dominate even
25-
# with the cache on — which is why this knob exists as the real CUDA build-time lever.
21+
# compiles its full default arch set — exactly what release artifacts must ship (every supported
22+
# GPU generation). On a cold cache this is the slow part (nvcc recompiles each .cu kernel once per
23+
# architecture), but build.sh now wraps nvcc with sccache (-DCMAKE_CUDA_COMPILER_LAUNCHER=sccache),
24+
# so on a warm Depot cache the per-arch .cu device passes are served from cache (verified: 100%
25+
# CUDA hit rate, ~51 min cold -> ~15 min warm). CI therefore always builds the full set and relies
26+
# on the warm cache for speed — it no longer sets CUDA_FAST_BUILD.
2627
#
27-
# Dev fast build (CUDA_FAST_BUILD=1): compile for a SINGLE architecture instead of the full
28-
# set, removing most of the nvcc time. Defaults to `native` (the build machine's own GPU —
29-
# needs a GPU present at configure time); override with CUDA_ARCH, e.g. CUDA_ARCH=90. This is
28+
# Dev fast build (CUDA_FAST_BUILD=1): compile for a SINGLE architecture instead of the full set,
29+
# removing most of the nvcc time on a COLD cache. Defaults to `native` (the build machine's own
30+
# GPU — needs a GPU present at configure time); override with CUDA_ARCH, e.g. CUDA_ARCH=90. This is
3031
# a MANUAL local-dev knob only: CI and release never set it, because an artifact built this
3132
# way runs on a single GPU generation. (Direct-cmake equivalent: -DCMAKE_CUDA_ARCHITECTURES=native.)
3233
CUDA_ARCH_ARGS=""

.github/workflows/publish.yml

Lines changed: 13 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -171,25 +171,23 @@ jobs:
171171
name: Cross-Compile manylinux_2_28 x86_64 (CUDA)
172172
needs: [startgate, build-webui]
173173
runs-on: ubuntu-latest
174-
# CUDA cache rollout. build_cuda_linux.sh execs build.sh, so the same sccache probe guards
175-
# this job. Unlike the other jobs, build.sh now also wraps nvcc (CMAKE_CUDA_COMPILER_LAUNCHER
176-
# =sccache) for CUDA builds, so the per-arch .cu device passes — the dominant cost of this job
177-
# — are cached too, not just the gcc host TUs. Because nvcc kernels now cache, this job always
178-
# builds the FULL CMAKE_CUDA_ARCHITECTURES set (no single-arch validation shortcut): the warm
179-
# cache, not a reduced arch set, is what keeps it fast, and every artifact stays release-safe
180-
# (runs on every GPU generation) on PR/push as well as publish. The first (cold-cache) run still
181-
# pays the full nvcc cost; the win shows on subsequent warm runs. CUDA_FAST_BUILD still exists in
182-
# build_cuda_linux.sh as a LOCAL-dev knob, but CI no longer sets it. Diagnostics (SCCACHE_LOG /
183-
# SCCACHE_ERROR_LOG / RUST_BACKTRACE) stay on until a warm run confirms nvcc cache hits; drop them
184-
# (and their -e passthroughs) afterwards. Inert without DEPOT_TOKEN (fork PRs) or use_cache=false.
174+
# CUDA cache. build_cuda_linux.sh execs build.sh, so the same sccache probe guards this job.
175+
# build.sh also wraps nvcc (CMAKE_CUDA_COMPILER_LAUNCHER=sccache) for CUDA builds, so the
176+
# per-arch .cu device passes — the dominant cost of this job — cache over Depot alongside the
177+
# gcc host TUs. Verified on a warm run: 100% hit on CUDA / CUBIN / device-code (139 CUDA hits,
178+
# 99.86% overall), cutting the job from ~51 min cold to ~15 min warm. The job therefore always
179+
# builds the FULL CMAKE_CUDA_ARCHITECTURES set (no single-arch shortcut) and leans on the warm
180+
# cache for speed, so every artifact stays release-safe (runs on every GPU generation) on PR /
181+
# push as well as publish. CUDA_FAST_BUILD still exists in build_cuda_linux.sh as a LOCAL-dev
182+
# knob, but CI no longer sets it. The first-run sccache debug diagnostics (SCCACHE_LOG /
183+
# SCCACHE_ERROR_LOG / RUST_BACKTRACE) were dropped now that caching is confirmed; build.sh still
184+
# prints the `sccache --show-stats` hit table at the end of every run. Inert without DEPOT_TOKEN
185+
# (fork PRs) or use_cache=false.
185186
env:
186187
USE_CACHE: ${{ github.event_name != 'workflow_dispatch' || inputs.use_cache }}
187188
SCCACHE_WEBDAV_ENDPOINT: https://cache.depot.dev
188189
SCCACHE_WEBDAV_TOKEN: ${{ secrets.DEPOT_TOKEN }}
189-
SCCACHE_LOG: debug
190-
SCCACHE_ERROR_LOG: /tmp/sccache_server.log
191-
RUST_BACKTRACE: full
192-
DOCKCROSS_ARGS: "-e SCCACHE_WEBDAV_ENDPOINT -e SCCACHE_WEBDAV_TOKEN -e USE_CACHE -e SCCACHE_LOG -e SCCACHE_ERROR_LOG -e RUST_BACKTRACE"
190+
DOCKCROSS_ARGS: "-e SCCACHE_WEBDAV_ENDPOINT -e SCCACHE_WEBDAV_TOKEN -e USE_CACHE"
193191
steps:
194192
- uses: actions/checkout@v7
195193
- name: Download shared WebUI assets

CLAUDE.md

Lines changed: 10 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -75,9 +75,10 @@ sccache-over-Depot cache, which now wraps nvcc (`-DCMAKE_CUDA_COMPILER_LAUNCHER=
7575
`build.sh` for CUDA builds, gated behind the same probe). The launcher is safe to enable
7676
unconditionally: if sccache cannot wrap nvcc it runs it directly (uncached), and `build.sh`'s
7777
mid-build retry treats an sccache `Compiler not supported` failure like any other cache error and
78-
rebuilds the job without the launcher rather than redding it. **Verify it works:** the premise
79-
(sccache producing nvcc cache hits inside the manylinux_2_28 container) is proven only by a **warm**
80-
run — check `sccache --show-stats` shows CUDA hits on the second build before trusting the speedup.
78+
rebuilds the job without the launcher rather than redding it. **Verified:** a warm run in the
79+
manylinux_2_28 container hit **100%** on CUDA / CUBIN / device-code (139 CUDA hits, 99.86% overall,
80+
3 misses) and cut the job from **~51 min cold to ~15 min warm** — nvcc caching works here. `build.sh`
81+
prints `sccache --show-stats` at the end of every run so the hit table stays visible.
8182

8283
## Android minimum API level
8384

@@ -323,14 +324,15 @@ v0.16.0 + the probe this is no longer a risk.) Job-by-job status:
323324
**v0.16.0** probe passed in-container (devtoolset-10 gcc), `sccache ON` over Depot WebDAV,
324325
warm cache 277/278 hits (99.64%), 1m46s build time.
325326
2. `crosscompile-linux-x86_64-cuda` (via `build_cuda_linux.sh`, which execs `build.sh`) —
326-
🚧 **nvcc caching enabled, full-arch always** (diagnostics on). `build.sh` now also wraps nvcc
327+
**verified green with nvcc caching, full-arch always.** `build.sh` also wraps nvcc
327328
(`-DCMAKE_CUDA_COMPILER_LAUNCHER=sccache`, scoped to CUDA builds), so both the gcc C/C++ TUs
328329
(134 model files + ggml + httplib) **and** the per-arch `.cu` device passes cache over Depot.
329330
CI dropped the single-arch validation shortcut (`CUDA_FAST_BUILD`/`CUDA_ARCH` removed from the
330-
job) — every run builds the full arch set and leans on the warm cache for speed. **Unverified
331-
until a warm run:** confirm `sccache --show-stats` reports CUDA hits on the second build; if
332-
nvcc caching proves weak in this container, the cold-vs-warm delta will be small and the job
333-
stays ~70 min (the mid-build retry guards against an nvcc-hostile sccache redding the build).
331+
job) — every run builds the full arch set and leans on the warm cache for speed. A warm run hit
332+
**100%** on CUDA / CUBIN / device-code (139 CUDA hits, 99.86% overall, 3 misses), cutting the job
333+
from **~51 min cold to ~15 min warm**. The first-run debug diagnostics (`SCCACHE_LOG` /
334+
`SCCACHE_ERROR_LOG` / `RUST_BACKTRACE`) were dropped once confirmed; `sccache --show-stats` still
335+
prints the hit table every run.
334336
3. `crosscompile-linux-aarch64` — ✅ **enabled**, now a **native `ubuntu-24.04-arm` build** (not
335337
dockcross): `build.sh` self-fetches the aarch64 static-musl sccache (the fetch block in
336338
`build.sh` maps `uname -m``x86_64`/`aarch64`) and the probe guards it. See "Linux aarch64:

README.md

Lines changed: 0 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -22,7 +22,6 @@
2222

2323
**Build cache:**
2424
[![Build cache by Depot](https://img.shields.io/badge/build%20cache-Depot-FF5C35)](https://depot.dev)
25-
_Shared, incremental CI compiler caching (sccache) powered by [Depot](https://depot.dev)._
2625

2726
**Coverage:**
2827
[![Coverage Status](https://coveralls.io/repos/github/bernardladenthin/java-llama.cpp/badge.svg?branch=main)](https://coveralls.io/github/bernardladenthin/java-llama.cpp?branch=main)

0 commit comments

Comments
 (0)