Merge pull request #255 from bernardladenthin/claude/cuda-cache-cleanup

bernardladenthin · web-flow · commit b20c2d3dd74d · 2026-06-21T13:55:52.000+02:00
CUDA CI: drop sccache debug diagnostics now that nvcc caching is proven
diff --git a/.github/build_cuda_linux.sh b/.github/build_cuda_linux.sh
@@ -15,18 +15,19 @@ sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute
 
 sudo dnf install -y cuda-toolkit-13-2
 
-# CUDA target architectures — build-speed knob.
+# CUDA target architectures — LOCAL-dev build-speed knob.
 #
 # Default (CUDA_FAST_BUILD unset): we do NOT pass CMAKE_CUDA_ARCHITECTURES, so ggml/llama.cpp
-# compiles its full default arch set. That is exactly what release artifacts must ship (every
-# supported GPU generation) and is the slow part of this ~70 min job: nvcc recompiles each .cu
-# kernel once per architecture. sccache caches the gcc C/C++ TUs but NOT the nvcc .cu kernels
-# (sccache's nvcc support is limited/experimental), so the per-arch nvcc passes dominate even
-# with the cache on — which is why this knob exists as the real CUDA build-time lever.
+# compiles its full default arch set — exactly what release artifacts must ship (every supported
+# GPU generation). On a cold cache this is the slow part (nvcc recompiles each .cu kernel once per
+# architecture), but build.sh now wraps nvcc with sccache (-DCMAKE_CUDA_COMPILER_LAUNCHER=sccache),
+# so on a warm Depot cache the per-arch .cu device passes are served from cache (verified: 100%
+# CUDA hit rate, ~51 min cold -> ~15 min warm). CI therefore always builds the full set and relies
+# on the warm cache for speed — it no longer sets CUDA_FAST_BUILD.
 #
-# Dev fast build (CUDA_FAST_BUILD=1): compile for a SINGLE architecture instead of the full
-# set, removing most of the nvcc time. Defaults to `native` (the build machine's own GPU —
-# needs a GPU present at configure time); override with CUDA_ARCH, e.g. CUDA_ARCH=90. This is
+# Dev fast build (CUDA_FAST_BUILD=1): compile for a SINGLE architecture instead of the full set,
+# removing most of the nvcc time on a COLD cache. Defaults to `native` (the build machine's own
+# GPU — needs a GPU present at configure time); override with CUDA_ARCH, e.g. CUDA_ARCH=90. This is
 # a MANUAL local-dev knob only: CI and release never set it, because an artifact built this
 # way runs on a single GPU generation. (Direct-cmake equivalent: -DCMAKE_CUDA_ARCHITECTURES=native.)
 CUDA_ARCH_ARGS=""
diff --git a/.github/workflows/publish.yml b/.github/workflows/publish.yml
@@ -171,25 +171,23 @@ jobs:
     name: Cross-Compile manylinux_2_28 x86_64 (CUDA)
     needs: [startgate, build-webui]
     runs-on: ubuntu-latest
-    # CUDA cache rollout. build_cuda_linux.sh execs build.sh, so the same sccache probe guards
-    # this job. Unlike the other jobs, build.sh now also wraps nvcc (CMAKE_CUDA_COMPILER_LAUNCHER
-    # =sccache) for CUDA builds, so the per-arch .cu device passes — the dominant cost of this job
-    # — are cached too, not just the gcc host TUs. Because nvcc kernels now cache, this job always
-    # builds the FULL CMAKE_CUDA_ARCHITECTURES set (no single-arch validation shortcut): the warm
-    # cache, not a reduced arch set, is what keeps it fast, and every artifact stays release-safe
-    # (runs on every GPU generation) on PR/push as well as publish. The first (cold-cache) run still
-    # pays the full nvcc cost; the win shows on subsequent warm runs. CUDA_FAST_BUILD still exists in
-    # build_cuda_linux.sh as a LOCAL-dev knob, but CI no longer sets it. Diagnostics (SCCACHE_LOG /
-    # SCCACHE_ERROR_LOG / RUST_BACKTRACE) stay on until a warm run confirms nvcc cache hits; drop them
-    # (and their -e passthroughs) afterwards. Inert without DEPOT_TOKEN (fork PRs) or use_cache=false.
+    # CUDA cache. build_cuda_linux.sh execs build.sh, so the same sccache probe guards this job.
+    # build.sh also wraps nvcc (CMAKE_CUDA_COMPILER_LAUNCHER=sccache) for CUDA builds, so the
+    # per-arch .cu device passes — the dominant cost of this job — cache over Depot alongside the
+    # gcc host TUs. Verified on a warm run: 100% hit on CUDA / CUBIN / device-code (139 CUDA hits,
+    # 99.86% overall), cutting the job from ~51 min cold to ~15 min warm. The job therefore always
+    # builds the FULL CMAKE_CUDA_ARCHITECTURES set (no single-arch shortcut) and leans on the warm
+    # cache for speed, so every artifact stays release-safe (runs on every GPU generation) on PR /
+    # push as well as publish. CUDA_FAST_BUILD still exists in build_cuda_linux.sh as a LOCAL-dev
+    # knob, but CI no longer sets it. The first-run sccache debug diagnostics (SCCACHE_LOG /
+    # SCCACHE_ERROR_LOG / RUST_BACKTRACE) were dropped now that caching is confirmed; build.sh still
+    # prints the `sccache --show-stats` hit table at the end of every run. Inert without DEPOT_TOKEN
+    # (fork PRs) or use_cache=false.
     env:
       USE_CACHE: ${{ github.event_name != 'workflow_dispatch' || inputs.use_cache }}
       SCCACHE_WEBDAV_ENDPOINT: https://cache.depot.dev
       SCCACHE_WEBDAV_TOKEN: ${{ secrets.DEPOT_TOKEN }}
-      SCCACHE_LOG: debug
-      SCCACHE_ERROR_LOG: /tmp/sccache_server.log
-      RUST_BACKTRACE: full
-      DOCKCROSS_ARGS: "-e SCCACHE_WEBDAV_ENDPOINT -e SCCACHE_WEBDAV_TOKEN -e USE_CACHE -e SCCACHE_LOG -e SCCACHE_ERROR_LOG -e RUST_BACKTRACE"
+      DOCKCROSS_ARGS: "-e SCCACHE_WEBDAV_ENDPOINT -e SCCACHE_WEBDAV_TOKEN -e USE_CACHE"
     steps:
       - uses: actions/checkout@v7
       - name: Download shared WebUI assets
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -75,9 +75,10 @@ sccache-over-Depot cache, which now wraps nvcc (`-DCMAKE_CUDA_COMPILER_LAUNCHER=
 `build.sh` for CUDA builds, gated behind the same probe). The launcher is safe to enable
 unconditionally: if sccache cannot wrap nvcc it runs it directly (uncached), and `build.sh`'s
 mid-build retry treats an sccache `Compiler not supported` failure like any other cache error and
-rebuilds the job without the launcher rather than redding it. **Verify it works:** the premise
-(sccache producing nvcc cache hits inside the manylinux_2_28 container) is proven only by a **warm**
-run — check `sccache --show-stats` shows CUDA hits on the second build before trusting the speedup.
+rebuilds the job without the launcher rather than redding it. **Verified:** a warm run in the
+manylinux_2_28 container hit **100%** on CUDA / CUBIN / device-code (139 CUDA hits, 99.86% overall,
+3 misses) and cut the job from **~51 min cold to ~15 min warm** — nvcc caching works here. `build.sh`
+prints `sccache --show-stats` at the end of every run so the hit table stays visible.
 
 ## Android minimum API level
 
@@ -323,14 +324,15 @@ v0.16.0 + the probe this is no longer a risk.) Job-by-job status:
    **v0.16.0** probe passed in-container (devtoolset-10 gcc), `sccache ON` over Depot WebDAV,
    warm cache 277/278 hits (99.64%), 1m46s build time.
 2. `crosscompile-linux-x86_64-cuda` (via `build_cuda_linux.sh`, which execs `build.sh`) —
-   🚧 **nvcc caching enabled, full-arch always** (diagnostics on). `build.sh` now also wraps nvcc
+   ✅ **verified green with nvcc caching, full-arch always.** `build.sh` also wraps nvcc
    (`-DCMAKE_CUDA_COMPILER_LAUNCHER=sccache`, scoped to CUDA builds), so both the gcc C/C++ TUs
    (134 model files + ggml + httplib) **and** the per-arch `.cu` device passes cache over Depot.
    CI dropped the single-arch validation shortcut (`CUDA_FAST_BUILD`/`CUDA_ARCH` removed from the
-   job) — every run builds the full arch set and leans on the warm cache for speed. **Unverified
-   until a warm run:** confirm `sccache --show-stats` reports CUDA hits on the second build; if
-   nvcc caching proves weak in this container, the cold-vs-warm delta will be small and the job
-   stays ~70 min (the mid-build retry guards against an nvcc-hostile sccache redding the build).
+   job) — every run builds the full arch set and leans on the warm cache for speed. A warm run hit
+   **100%** on CUDA / CUBIN / device-code (139 CUDA hits, 99.86% overall, 3 misses), cutting the job
+   from **~51 min cold to ~15 min warm**. The first-run debug diagnostics (`SCCACHE_LOG` /
+   `SCCACHE_ERROR_LOG` / `RUST_BACKTRACE`) were dropped once confirmed; `sccache --show-stats` still
+   prints the hit table every run.
 3. `crosscompile-linux-aarch64` — ✅ **enabled**, now a **native `ubuntu-24.04-arm` build** (not
    dockcross): `build.sh` self-fetches the aarch64 static-musl sccache (the fetch block in
    `build.sh` maps `uname -m` → `x86_64`/`aarch64`) and the probe guards it. See "Linux aarch64:
diff --git a/README.md b/README.md
@@ -22,7 +22,6 @@
 
 **Build cache:**  
 [![Build cache by Depot](https://img.shields.io/badge/build%20cache-Depot-FF5C35)](https://depot.dev)  
-_Shared, incremental CI compiler caching (sccache) powered by [Depot](https://depot.dev)._  
 
 **Coverage:**  
 [![Coverage Status](https://coveralls.io/repos/github/bernardladenthin/java-llama.cpp/badge.svg?branch=main)](https://coveralls.io/github/bernardladenthin/java-llama.cpp?branch=main)