From 1876ef17d34653546433933aaea8598355e6d7f7 Mon Sep 17 00:00:00 2001 From: Claude Date: Sat, 20 Jun 2026 12:02:06 +0000 Subject: [PATCH 01/16] Document Depot/sccache cache as jllama-only in cross-repo scope Add a 'Cross-repo scope' note to the CI build cache section explaining the sccache+Depot compiler cache benefits only this repo's native build, and link the workspace crossrepostatus.md non-parity entry. No build/CI behaviour change. Co-Authored-By: Claude Opus 4.8 Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5 --- CLAUDE.md | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/CLAUDE.md b/CLAUDE.md index 7b66afea..803b1ad9 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -206,6 +206,17 @@ container), the Windows jobs (sccache supports MSVC), and the Linux-host `test-c extend a job: install `sccache`, set the two `SCCACHE_WEBDAV_*` env vars, and (for RAM-limited runners) `BUILD_JOBS`. +**Cross-repo scope — this is java-llama.cpp-only by nature.** `sccache` caches *compiler* +output (C/C++/Rust/CUDA), and jllama is the only sibling repo with a native (C++/JNI) build, +so it is the only one that benefits. The pure-Maven siblings (BitcoinAddressFinder, +streambuffer, llamacpp-ai-index-maven-plugin) have no C/C++ to cache, run on **GitHub-hosted** +runners (Depot's *GitHub Actions* cache backend activates only on **Depot-hosted** runners), +and already cache their Maven dependencies via `actions/setup-java`'s `cache: maven`. The +`DEPOT_TOKEN` organization secret is present in every repo but is **inert** outside jllama, and +the README "Build cache by Depot" badge is intentionally kept here only — advertising it on the +Maven repos would claim a capability they do not have. Recorded as deliberate non-parity in +[`../workspace/crossrepostatus.md`](../workspace/crossrepostatus.md). 
+ ## Upgrading/Downgrading llama.cpp Version To change the llama.cpp version, update the following **three** files: From 20e92d2eb5986719508a4306f9a9525f2031cb24 Mon Sep 17 00:00:00 2001 From: Claude Date: Sat, 20 Jun 2026 12:05:07 +0000 Subject: [PATCH 02/16] Trim jllama cache cross-repo note to a pointer Keep only the one-line 'jllama-only, it's the sole repo with a native build' fact and defer the full rationale (Maven repos, GitHub-hosted runners, inert DEPOT_TOKEN, badge) to workspace/crossrepostatus.md instead of duplicating it here. Co-Authored-By: Claude Opus 4.8 Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5 --- CLAUDE.md | 13 ++++--------- 1 file changed, 4 insertions(+), 9 deletions(-) diff --git a/CLAUDE.md b/CLAUDE.md index 803b1ad9..47a2b75d 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -206,15 +206,10 @@ container), the Windows jobs (sccache supports MSVC), and the Linux-host `test-c extend a job: install `sccache`, set the two `SCCACHE_WEBDAV_*` env vars, and (for RAM-limited runners) `BUILD_JOBS`. -**Cross-repo scope — this is java-llama.cpp-only by nature.** `sccache` caches *compiler* -output (C/C++/Rust/CUDA), and jllama is the only sibling repo with a native (C++/JNI) build, -so it is the only one that benefits. The pure-Maven siblings (BitcoinAddressFinder, -streambuffer, llamacpp-ai-index-maven-plugin) have no C/C++ to cache, run on **GitHub-hosted** -runners (Depot's *GitHub Actions* cache backend activates only on **Depot-hosted** runners), -and already cache their Maven dependencies via `actions/setup-java`'s `cache: maven`. The -`DEPOT_TOKEN` organization secret is present in every repo but is **inert** outside jllama, and -the README "Build cache by Depot" badge is intentionally kept here only — advertising it on the -Maven repos would claim a capability they do not have. 
Recorded as deliberate non-parity in +**Cross-repo scope.** This Depot/sccache compiler cache makes sense only for java-llama.cpp — +it is the only sibling repo with a native (C++/JNI) build. It does not apply to the pure-Maven +siblings; why (and why the `DEPOT_TOKEN` org secret and the README "Build cache by Depot" badge +are kept jllama-only) is explained in the cross-repo status under "Deliberate non-parity": [`../workspace/crossrepostatus.md`](../workspace/crossrepostatus.md). ## Upgrading/Downgrading llama.cpp Version From c643b20e6860329cd8d6d873c79eef9651799727 Mon Sep 17 00:00:00 2001 From: Claude Date: Sat, 20 Jun 2026 12:18:45 +0000 Subject: [PATCH 03/16] ci: add sccache probe health-check so a crashing sccache falls back uncached MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit build.sh uses sccache as the compiler launcher, so a present-but-crashing sccache (the static-musl panic seen inside the dockcross cross-compile containers) failed every compile and redded the whole build. The inert-safe guard only covered sccache being absent, not present-but-crashing. Add sccache_can_wrap_compiler(): probe-compile a trivial TU through sccache and only enable -DCMAKE_{C,CXX}_COMPILER_LAUNCHER=sccache when it succeeds. On any failure it logs the captured Rust panic backtrace (and the detached server's SCCACHE_ERROR_LOG when a job sets one) and builds WITHOUT the cache — a clean green -O3 build. Also make the fetched sccache version a SCCACHE_DL_VERSION knob (default bumped 0.8.2 -> 0.15.0, overridable per-job) and only run --show-stats when sccache was actually used. Verified locally with fake sccache/cmake across every variant: no token, use_cache=false, crashing sccache, and working sccache all produce a green build. 
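
The local check described above could be reproduced roughly as follows. This is a minimal sketch under stated assumptions, not the script that was actually used: the two fake `sccache` scripts and the `probe_with` helper are hypothetical stand-ins, and the probe is simplified compared to `sccache_can_wrap_compiler` in build.sh. Exit status 101 is what a Rust panic normally produces.

```bash
#!/bin/sh
# Sketch only: exercise a simplified probe against two fake sccache scripts.
# All names here (probe_with, the crashing/working dirs) are hypothetical.
work="$(mktemp -d)"
mkdir -p "$work/crashing" "$work/working"
printf 'int main(void){return 0;}\n' > "$work/probe.c"

# Fake 1: simulates a present-but-crashing sccache (Rust panics exit with 101).
cat > "$work/crashing/sccache" <<'EOF'
#!/bin/sh
echo "thread 'main' panicked (simulated)" >&2
exit 101
EOF

# Fake 2: simulates a healthy sccache by creating the file named after -o.
cat > "$work/working/sccache" <<'EOF'
#!/bin/sh
while [ "$#" -gt 1 ]; do
  if [ "$1" = "-o" ]; then : > "$2"; fi
  shift
done
exit 0
EOF

# Simplified probe: succeeds only if the given fake wraps the "compile" cleanly.
# The fakes are run via `sh` so the sketch does not depend on exec permissions.
probe_with() {
  probe_rc=0
  probe_out="$(sh "$1/sccache" cc -c "$work/probe.c" -o "$work/probe.o" 2>&1)" || probe_rc=$?
  if [ "$probe_rc" -ne 0 ]; then
    echo "probe FAILED (rc=${probe_rc}): ${probe_out}; would build uncached"
    return 1
  fi
  echo "probe OK; launcher could be enabled"
  return 0
}

if probe_with "$work/crashing"; then crash_rc=0; else crash_rc=1; fi
if probe_with "$work/working"; then ok_rc=0; else ok_rc=1; fi
rm -rf "$work"
```

The point of the sketch is the decision rule: a nonzero probe result selects the uncached path, so a crashing launcher is never handed to cmake.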
Co-Authored-By: Claude Opus 4.8 Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5 --- .github/build.sh | 65 ++++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 57 insertions(+), 8 deletions(-) diff --git a/.github/build.sh b/.github/build.sh index 23fef1d2..11acb4b5 100755 --- a/.github/build.sh +++ b/.github/build.sh @@ -21,13 +21,19 @@ fi # while macOS installs it via brew in the workflow. Best-effort and inert-safe: any failure # leaves sccache absent, so the build just proceeds uncached. The static musl binary runs in # any x86_64 Linux container (the cross-compile host is always x86_64). +# +# SCCACHE_DL_VERSION is overridable per-job, so a container that crashes one sccache build can +# try another without editing this script (the in-container panic that stalled phase 2 was on +# v0.8.2; v0.15.0 is the current stable default). A wrong/unavailable version just fails the +# `curl -f` and falls back to an uncached build, so bumping it can never red a build. +SCCACHE_DL_VERSION="${SCCACHE_DL_VERSION:-0.15.0}" if [ "${USE_CACHE:-true}" = "true" ] && [ -n "${SCCACHE_WEBDAV_TOKEN:-}${SCCACHE_GHA_ENABLED:-}" ] \ && ! command -v sccache >/dev/null 2>&1 \ && [ "$(uname -s)" = "Linux" ] && [ "$(uname -m)" = "x86_64" ]; then - SCCACHE_REL="sccache-v0.8.2-x86_64-unknown-linux-musl" + SCCACHE_REL="sccache-v${SCCACHE_DL_VERSION}-x86_64-unknown-linux-musl" echo "build.sh: fetching ${SCCACHE_REL} (no sccache on PATH)..." 
if curl -fsSL --proto =https --proto-redir =https \ - "https://github.com/mozilla/sccache/releases/download/v0.8.2/${SCCACHE_REL}.tar.gz" \ + "https://github.com/mozilla/sccache/releases/download/v${SCCACHE_DL_VERSION}/${SCCACHE_REL}.tar.gz" \ -o /tmp/sccache.tgz && tar -xzf /tmp/sccache.tgz -C /tmp; then export PATH="/tmp/${SCCACHE_REL}:$PATH" echo "build.sh: sccache -> $(command -v sccache || echo 'still missing')" @@ -36,14 +42,55 @@ if [ "${USE_CACHE:-true}" = "true" ] && [ -n "${SCCACHE_WEBDAV_TOKEN:-}${SCCACHE fi fi +# Health-check before trusting sccache as the compiler launcher. Because sccache *is* the +# launcher (cmake runs `sccache ...` for every TU), a present-but-crashing sccache +# fails every compile and reds the whole build — exactly the in-container panic that stalled +# phase 2 (the static-musl binary panicked while wrapping the cross-compiler, failing ggml.c.o). +# The probe runs the real compiler through sccache on a trivial TU; only if that succeeds is the +# launcher enabled. On any failure it logs the captured output (the Rust panic backtrace, plus +# the detached server's SCCACHE_ERROR_LOG when a job sets one) and the build runs WITHOUT the +# cache — a clean, uncached -O3 build that still goes green. This closes the gap the old +# absent-only guard left: it handled sccache *missing*, not sccache *crashing*. +sccache_can_wrap_compiler() { + probe_cc="${CC:-}" + if [ -z "$probe_cc" ]; then + for c in cc gcc clang; do + if command -v "$c" >/dev/null 2>&1; then probe_cc="$c"; break; fi + done + fi + if [ -z "$probe_cc" ]; then + echo "build.sh: sccache probe: no C compiler on PATH to probe; building uncached" + return 1 + fi + probe_dir="$(mktemp -d 2>/dev/null || echo "/tmp/sccache-probe.$$")" + mkdir -p "$probe_dir" || return 1 + printf 'int main(void){return 0;}\n' > "$probe_dir/probe.c" + probe_out="$(sccache "$probe_cc" -c "$probe_dir/probe.c" -o "$probe_dir/probe.o" 2>&1)" + probe_rc=$? 
+ rm -rf "$probe_dir" + if [ "$probe_rc" -ne 0 ]; then + echo "build.sh: sccache probe FAILED (rc=${probe_rc}) wrapping '${probe_cc}' — building WITHOUT cache." + [ -n "$probe_out" ] && printf '%s\n' "$probe_out" | sed 's/^/build.sh: sccache-probe| /' + if [ -n "${SCCACHE_ERROR_LOG:-}" ] && [ -f "${SCCACHE_ERROR_LOG}" ]; then + echo "build.sh: --- detached server log (${SCCACHE_ERROR_LOG}) ---" + sed 's/^/build.sh: sccache-srv| /' "${SCCACHE_ERROR_LOG}" 2>/dev/null || true + fi + return 1 + fi + echo "build.sh: sccache probe OK (wrapped '${probe_cc}')" + return 0 +} + # Optional shared compiler cache: sccache fronting Depot Cache (WebDAV). Enabled only when -# USE_CACHE is true AND sccache + a cache token are present, so it stays inert before the -# DEPOT_TOKEN secret is configured and on fork PRs (secrets hidden) — those just compile -# normally. sccache is content-addressed, so a cache hit is bit-identical to a fresh -O3 -# compile (release-safe), and it degrades to direct compilation if the cache is unreachable. +# USE_CACHE is true AND sccache + a cache token are present AND the probe confirms sccache can +# wrap the compiler — so it stays inert before the DEPOT_TOKEN secret is configured, on fork PRs +# (secrets hidden), and when sccache would crash; all of those just compile normally. sccache is +# content-addressed, so a cache hit is bit-identical to a fresh -O3 compile (release-safe), and +# it degrades to direct compilation if the cache is unreachable. 
LAUNCH="" if [ "${USE_CACHE:-true}" = "true" ] && command -v sccache >/dev/null 2>&1 \ - && [ -n "${SCCACHE_WEBDAV_TOKEN:-}${SCCACHE_GHA_ENABLED:-}" ]; then + && [ -n "${SCCACHE_WEBDAV_TOKEN:-}${SCCACHE_GHA_ENABLED:-}" ] \ + && sccache_can_wrap_compiler; then LAUNCH="-DCMAKE_C_COMPILER_LAUNCHER=sccache -DCMAKE_CXX_COMPILER_LAUNCHER=sccache" echo "build.sh: sccache ON (endpoint=${SCCACHE_WEBDAV_ENDPOINT:-default}), building with -j${JOBS}" else @@ -53,6 +100,8 @@ fi cmake -Bbuild $LAUNCH $@ || exit 1 cmake --build build --config Release -j"${JOBS}" || exit 1 -if command -v sccache >/dev/null 2>&1; then +# Only query stats when sccache was actually used as the launcher; if the probe rejected a +# crashing sccache, re-invoking it here would just repeat the crash output (harmless but noisy). +if [ -n "$LAUNCH" ] && command -v sccache >/dev/null 2>&1; then sccache --show-stats || true fi From 4af2250f790f2854ff2bfe7f2b74ee11910b8876 Mon Sep 17 00:00:00 2001 From: Claude Date: Sat, 20 Jun 2026 12:18:45 +0000 Subject: [PATCH 04/16] ci: re-enable sccache on the manylinux2014 dockcross job (phase 2, job 1) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit First dockcross job re-enabled after the phase-2 revert, now safe behind the build.sh probe. Forwards the Depot cache env into the container via DOCKCROSS_ARGS and enables SCCACHE_LOG=debug + SCCACHE_ERROR_LOG + RUST_BACKTRACE=full so this run captures the in-container panic root cause if it recurs (the probe keeps the build green either way). The CUDA, aarch64, Android, OpenCL-Android and Windows jobs stay uncached until this one is verified green in CI — one job at a time. Document the staged rollout and the probe in CLAUDE.md. 
Co-Authored-By: Claude Opus 4.8 Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5 --- .github/workflows/publish.yml | 16 ++++++++++++++ CLAUDE.md | 41 ++++++++++++++++++++++++++++------- 2 files changed, 49 insertions(+), 8 deletions(-) diff --git a/.github/workflows/publish.yml b/.github/workflows/publish.yml index 18f566ab..ec1194db 100644 --- a/.github/workflows/publish.yml +++ b/.github/workflows/publish.yml @@ -200,6 +200,22 @@ jobs: name: Cross-Compile manylinux2014 x86_64 needs: [startgate, build-webui] runs-on: ubuntu-latest + # Phase 2 dockcross cache rollout — FIRST job (fastest plain-build.sh job, cleanest probe). + # build.sh now probe-compiles through sccache before trusting it as the launcher, so a + # present-but-crashing in-container sccache (the panic that stalled the first attempt) falls + # back to an uncached, green -O3 build instead of redding it. The diagnostic vars below are + # forwarded into the container so this run captures the root cause if the panic recurs; drop + # SCCACHE_LOG / SCCACHE_ERROR_LOG / RUST_BACKTRACE (and their -e passthroughs) once the cache + # is confirmed working here, then roll out to the next dockcross job. Inert without DEPOT_TOKEN + # (fork PRs) or with use_cache=false. + env: + USE_CACHE: ${{ github.event_name != 'workflow_dispatch' || inputs.use_cache }} + SCCACHE_WEBDAV_ENDPOINT: https://cache.depot.dev + SCCACHE_WEBDAV_TOKEN: ${{ secrets.DEPOT_TOKEN }} + SCCACHE_LOG: debug + SCCACHE_ERROR_LOG: /tmp/sccache_server.log + RUST_BACKTRACE: full + DOCKCROSS_ARGS: "-e SCCACHE_WEBDAV_ENDPOINT -e SCCACHE_WEBDAV_TOKEN -e USE_CACHE -e SCCACHE_LOG -e SCCACHE_ERROR_LOG -e RUST_BACKTRACE" steps: - uses: actions/checkout@v6 - name: Download shared WebUI assets diff --git a/CLAUDE.md b/CLAUDE.md index 47a2b75d..f20b12e5 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -197,14 +197,39 @@ stays `-O3` and is **bit-identical** to a clean build (release-safe). 
**Safety / transparency.** It is **inert** until `DEPOT_TOKEN` is configured and on **fork PRs** (secrets are hidden there) — those simply compile normally; the `Install sccache` step -is `continue-on-error`; and `use_cache=false` forces a pristine, from-scratch build. - -**Rollout.** **Phase 1 (current): the 3 macOS build jobs** (slowest + OOM-prone) — -`brew install sccache` + the env above + `BUILD_JOBS: 2`. **Phase 2 (TODO):** the dockcross -Linux/Android/CUDA jobs (the `sccache` binary **and** `DEPOT_TOKEN` must be passed *into* the -container), the Windows jobs (sccache supports MSVC), and the Linux-host `test-cpp` job. To -extend a job: install `sccache`, set the two `SCCACHE_WEBDAV_*` env vars, and (for -RAM-limited runners) `BUILD_JOBS`. +is `continue-on-error`; and `use_cache=false` forces a pristine, from-scratch build. Crucially, +`build.sh` runs a **probe-compile health-check** (`sccache_can_wrap_compiler`) before trusting +sccache as the launcher: it compiles a trivial TU *through* sccache, and only sets +`-DCMAKE_{C,CXX}_COMPILER_LAUNCHER=sccache` if that succeeds. So a sccache that is present but +**crashes** (the in-container panic that stalled phase 2) also falls back to an uncached, green +`-O3` build — it logs the Rust panic backtrace (and the detached server's `SCCACHE_ERROR_LOG`, +when a job sets one) for diagnosis but never reds the build. This closes the gap the original +absent-only guard left. + +**Rollout.** **Phase 1 — DONE & proven: the 3 macOS build jobs** (slowest + OOM-prone) — +`brew install sccache` + the env above + `BUILD_JOBS: 2`. macOS build dropped **~40 min → ~6 min** +with a warm cache. **Phase 2 — in progress: the dockcross cross-compiles**, enabled **one job at +a time and verified green in CI before the next**. (The first attempt enabled all four at once +and was reverted: the static-musl sccache panicked in-container and — pre-probe — redded the +build. The probe above now makes that a safe fallback.) 
Order, each adding the env + a +`DOCKCROSS_ARGS` passthrough: +1. `crosscompile-linux-x86_64` (manylinux2014) — **enabled first**, with `SCCACHE_LOG=debug` + + `SCCACHE_ERROR_LOG` + `RUST_BACKTRACE=full` so the run captures the panic root cause if it + recurs. Once green with a cache hit in `sccache --show-stats`, drop the diagnostic vars. +2. `crosscompile-linux-x86_64-cuda` (via `build_cuda_linux.sh`, which execs `build.sh`) — only + the gcc C/C++ TUs cache (134 model files + ggml + httplib); the nvcc `.cu` kernels won't + (limited sccache nvcc support) — still a large partial win on the ~70 min job. +3. `crosscompile-linux-aarch64`, then 4. `crosscompile-android-aarch64`. +5. `crosscompile-android-aarch64-opencl` — **separate**, uses `build_opencl_android.sh` (not + `build.sh`); needs its own probe/launcher wiring. + +Per-job recipe: add `env:` { `USE_CACHE`, `SCCACHE_WEBDAV_ENDPOINT`, `SCCACHE_WEBDAV_TOKEN` } and +`DOCKCROSS_ARGS: "-e SCCACHE_WEBDAV_ENDPOINT -e SCCACHE_WEBDAV_TOKEN -e USE_CACHE"` — the +dockcross wrapper only forwards host env it is explicitly told to via `-e`. The fetched sccache +version is the `SCCACHE_DL_VERSION` knob in `build.sh` (default **0.15.0**; overridable per-job +to try a different build against a container that crashed another). **Windows** (`build.bat` + +MSVC) is separate and last: use `mozilla-actions/sccache-action` / sccache's MSVC support, not +the `build.sh` musl fetch. **Cross-repo scope.** This Depot/sccache compiler cache makes sense only for java-llama.cpp — it is the only sibling repo with a native (C++/JNI) build. It does not apply to the pure-Maven From c4b3adf181e1073b0c2e17a8bf5c7793aedf9023 Mon Sep 17 00:00:00 2001 From: Claude Date: Sat, 20 Jun 2026 12:25:22 +0000 Subject: [PATCH 05/16] ci: default fetched sccache to v0.16.0 (latest) Bump the SCCACHE_DL_VERSION default 0.15.0 -> 0.16.0 (released 2026-06-19, the current latest). 
The x86_64-unknown-linux-musl asset is confirmed published; the fetch stays fail-safe (a missing version just falls back to an uncached build) and the value is overridable per-job. Co-Authored-By: Claude Opus 4.8 Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5 --- .github/build.sh | 6 +++--- CLAUDE.md | 2 +- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/.github/build.sh b/.github/build.sh index 11acb4b5..dec29e86 100755 --- a/.github/build.sh +++ b/.github/build.sh @@ -24,9 +24,9 @@ fi # # SCCACHE_DL_VERSION is overridable per-job, so a container that crashes one sccache build can # try another without editing this script (the in-container panic that stalled phase 2 was on -# v0.8.2; v0.15.0 is the current stable default). A wrong/unavailable version just fails the -# `curl -f` and falls back to an uncached build, so bumping it can never red a build. -SCCACHE_DL_VERSION="${SCCACHE_DL_VERSION:-0.15.0}" +# v0.8.2; v0.16.0 is the latest release and the default). A wrong/unavailable version just fails +# the `curl -f` and falls back to an uncached build, so bumping it can never red a build. +SCCACHE_DL_VERSION="${SCCACHE_DL_VERSION:-0.16.0}" if [ "${USE_CACHE:-true}" = "true" ] && [ -n "${SCCACHE_WEBDAV_TOKEN:-}${SCCACHE_GHA_ENABLED:-}" ] \ && ! command -v sccache >/dev/null 2>&1 \ && [ "$(uname -s)" = "Linux" ] && [ "$(uname -m)" = "x86_64" ]; then diff --git a/CLAUDE.md b/CLAUDE.md index f20b12e5..c8ab59b5 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -226,7 +226,7 @@ build. The probe above now makes that a safe fallback.) Order, each adding the e Per-job recipe: add `env:` { `USE_CACHE`, `SCCACHE_WEBDAV_ENDPOINT`, `SCCACHE_WEBDAV_TOKEN` } and `DOCKCROSS_ARGS: "-e SCCACHE_WEBDAV_ENDPOINT -e SCCACHE_WEBDAV_TOKEN -e USE_CACHE"` — the dockcross wrapper only forwards host env it is explicitly told to via `-e`. 
The fetched sccache -version is the `SCCACHE_DL_VERSION` knob in `build.sh` (default **0.15.0**; overridable per-job +version is the `SCCACHE_DL_VERSION` knob in `build.sh` (default **0.16.0**; overridable per-job to try a different build against a container that crashed another). **Windows** (`build.bat` + MSVC) is separate and last: use `mozilla-actions/sccache-action` / sccache's MSVC support, not the `build.sh` musl fetch. From 1ea472e7069e9e876dface9782576fdb0eab7865 Mon Sep 17 00:00:00 2001 From: Claude Date: Sat, 20 Jun 2026 12:30:57 +0000 Subject: [PATCH 06/16] ci: bump all dockcross images to 20260515-5fd14ac (latest) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Bump DEFAULT_DOCKCROSS_IMAGE in all five wrappers from 20260312/13-9b3357c to 20260515-5fd14ac — the newest dockcross release on Docker Hub (verified: a full tag scan shows nothing dated later than 2026-05-15 across the images, no 2026-06 build exists, and 'latest' points to the same digest). This is a tag-pin bump on line 3 (the operative pin), not a full update.sh docker regeneration (which needs Docker unavailable here); the wrapper body is version-stable. It changes the toolchain for every cross-compiled native artifact, so each platform should be confirmed green in CI. 
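
A tag-pin bump of this kind could be scripted along the following lines. This is a sketch only: it operates on a throwaway copy of one wrapper's first three lines, the temporary file and variable names are illustrative, and it is not necessarily how the bump in this commit was made.

```bash
#!/bin/sh
# Sketch only: rewrite the tag on the DEFAULT_DOCKCROSS_IMAGE line of a
# throwaway wrapper copy. new_tag is the tag named in this commit.
new_tag="20260515-5fd14ac"
work="$(mktemp -d)"
wrapper="$work/dockcross-manylinux2014-x64"

# Recreate the first three lines of the wrapper as they appear before the bump.
printf '%s\n' '#!/usr/bin/env bash' '' \
  'DEFAULT_DOCKCROSS_IMAGE=dockcross/manylinux2014-x64:20260312-9b3357c' > "$wrapper"

# Replace everything after the last ':' on the pin line only; other lines pass through.
sed "/^DEFAULT_DOCKCROSS_IMAGE=/s|:[^:]*\$|:${new_tag}|" "$wrapper" > "$wrapper.new"
mv "$wrapper.new" "$wrapper"

# The pin is on line 3 of the wrapper.
pinned="$(sed -n '3p' "$wrapper")"
echo "$pinned"
rm -rf "$work"
```

Writing to a temporary file and renaming it avoids relying on `sed -i`, whose syntax differs between GNU and BSD sed.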
Co-Authored-By: Claude Opus 4.8 Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5 --- .github/dockcross/dockcross-android-arm | 2 +- .github/dockcross/dockcross-android-arm64 | 2 +- .github/dockcross/dockcross-linux-arm64-lts | 2 +- .github/dockcross/dockcross-manylinux2014-x64 | 2 +- .github/dockcross/dockcross-manylinux_2_28-x64 | 2 +- 5 files changed, 5 insertions(+), 5 deletions(-) diff --git a/.github/dockcross/dockcross-android-arm b/.github/dockcross/dockcross-android-arm index eb90d8a5..70e1466e 100755 --- a/.github/dockcross/dockcross-android-arm +++ b/.github/dockcross/dockcross-android-arm @@ -1,6 +1,6 @@ #!/usr/bin/env bash -DEFAULT_DOCKCROSS_IMAGE=dockcross/android-arm:20260312-9b3357c +DEFAULT_DOCKCROSS_IMAGE=dockcross/android-arm:20260515-5fd14ac #------------------------------------------------------------------------------ # Helpers diff --git a/.github/dockcross/dockcross-android-arm64 b/.github/dockcross/dockcross-android-arm64 index 7cc130dd..6ba9ecdb 100755 --- a/.github/dockcross/dockcross-android-arm64 +++ b/.github/dockcross/dockcross-android-arm64 @@ -1,6 +1,6 @@ #!/usr/bin/env bash -DEFAULT_DOCKCROSS_IMAGE=dockcross/android-arm64:20260312-9b3357c +DEFAULT_DOCKCROSS_IMAGE=dockcross/android-arm64:20260515-5fd14ac #------------------------------------------------------------------------------ # Helpers diff --git a/.github/dockcross/dockcross-linux-arm64-lts b/.github/dockcross/dockcross-linux-arm64-lts index 0658411f..49c467c0 100755 --- a/.github/dockcross/dockcross-linux-arm64-lts +++ b/.github/dockcross/dockcross-linux-arm64-lts @@ -1,6 +1,6 @@ #!/usr/bin/env bash -DEFAULT_DOCKCROSS_IMAGE=dockcross/linux-arm64-lts:20260313-9b3357c +DEFAULT_DOCKCROSS_IMAGE=dockcross/linux-arm64-lts:20260515-5fd14ac #------------------------------------------------------------------------------ # Helpers diff --git a/.github/dockcross/dockcross-manylinux2014-x64 b/.github/dockcross/dockcross-manylinux2014-x64 index 75a37ffe..a3aea0f7 
100755 --- a/.github/dockcross/dockcross-manylinux2014-x64 +++ b/.github/dockcross/dockcross-manylinux2014-x64 @@ -1,6 +1,6 @@ #!/usr/bin/env bash -DEFAULT_DOCKCROSS_IMAGE=dockcross/manylinux2014-x64:20260312-9b3357c +DEFAULT_DOCKCROSS_IMAGE=dockcross/manylinux2014-x64:20260515-5fd14ac #------------------------------------------------------------------------------ # Helpers diff --git a/.github/dockcross/dockcross-manylinux_2_28-x64 b/.github/dockcross/dockcross-manylinux_2_28-x64 index 15d4937e..39f4f9db 100755 --- a/.github/dockcross/dockcross-manylinux_2_28-x64 +++ b/.github/dockcross/dockcross-manylinux_2_28-x64 @@ -1,6 +1,6 @@ #!/usr/bin/env bash -DEFAULT_DOCKCROSS_IMAGE=dockcross/manylinux_2_28-x64:20260312-9b3357c +DEFAULT_DOCKCROSS_IMAGE=dockcross/manylinux_2_28-x64:20260515-5fd14ac #------------------------------------------------------------------------------ # Helpers From bf109c267152800a1873e6864b8a2709b7eb58a9 Mon Sep 17 00:00:00 2001 From: Claude Date: Sat, 20 Jun 2026 13:19:53 +0000 Subject: [PATCH 07/16] ci: enable sccache on the CUDA dockcross job; manylinux2014 verified (phase 2, job 2) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit manylinux2014 (job 1) verified green in PR #245: sccache v0.16.0 probe passed inside the container (devtoolset-10 gcc), cache ON over Depot WebDAV, cold run stored 275 objects. The v0.8.2 in-container panic does not occur on v0.16.0. Dropped job 1's first-run diagnostics (SCCACHE_LOG/SCCACHE_ERROR_LOG/RUST_BACKTRACE) to its steady-state env. Enable job 2: crosscompile-linux-x86_64-cuda (manylinux_2_28 + CUDA via build_cuda_linux.sh, which execs build.sh, so the same probe guards it). Diagnostics on for its first run on the manylinux_2_28 image. Only the gcc C/C++ TUs cache; nvcc .cu kernels are not wrapped. aarch64/android/opencl-android/Windows stay uncached until each is verified — one job at a time. 
Co-Authored-By: Claude Opus 4.8 Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5 --- .github/workflows/publish.yml | 33 +++++++++++++++++++++------------ CLAUDE.md | 13 ++++++++----- 2 files changed, 29 insertions(+), 17 deletions(-) diff --git a/.github/workflows/publish.yml b/.github/workflows/publish.yml index ec1194db..fe1c1170 100644 --- a/.github/workflows/publish.yml +++ b/.github/workflows/publish.yml @@ -171,6 +171,20 @@ jobs: name: Cross-Compile manylinux_2_28 x86_64 (CUDA) needs: [startgate, build-webui] runs-on: ubuntu-latest + # Phase 2 dockcross cache rollout — job 2, enabled after manylinux2014 (job 1) verified green + # in CI with sccache v0.16.0 caching to Depot. build_cuda_linux.sh execs build.sh, so the same + # probe guards this job: only the gcc C/C++ TUs cache (the nvcc .cu kernels are not wrapped), + # still a large partial win on this ~70 min build. Diagnostics are on for its first run on the + # manylinux_2_28 image; drop them (and their -e passthroughs) once it is confirmed green with a + # cache hit, then enable the next job. Inert without DEPOT_TOKEN (fork PRs) or use_cache=false. + env: + USE_CACHE: ${{ github.event_name != 'workflow_dispatch' || inputs.use_cache }} + SCCACHE_WEBDAV_ENDPOINT: https://cache.depot.dev + SCCACHE_WEBDAV_TOKEN: ${{ secrets.DEPOT_TOKEN }} + SCCACHE_LOG: debug + SCCACHE_ERROR_LOG: /tmp/sccache_server.log + RUST_BACKTRACE: full + DOCKCROSS_ARGS: "-e SCCACHE_WEBDAV_ENDPOINT -e SCCACHE_WEBDAV_TOKEN -e USE_CACHE -e SCCACHE_LOG -e SCCACHE_ERROR_LOG -e RUST_BACKTRACE" steps: - uses: actions/checkout@v6 - name: Download shared WebUI assets @@ -200,22 +214,17 @@ jobs: name: Cross-Compile manylinux2014 x86_64 needs: [startgate, build-webui] runs-on: ubuntu-latest - # Phase 2 dockcross cache rollout — FIRST job (fastest plain-build.sh job, cleanest probe). 
- # build.sh now probe-compiles through sccache before trusting it as the launcher, so a - # present-but-crashing in-container sccache (the panic that stalled the first attempt) falls - # back to an uncached, green -O3 build instead of redding it. The diagnostic vars below are - # forwarded into the container so this run captures the root cause if the panic recurs; drop - # SCCACHE_LOG / SCCACHE_ERROR_LOG / RUST_BACKTRACE (and their -e passthroughs) once the cache - # is confirmed working here, then roll out to the next dockcross job. Inert without DEPOT_TOKEN - # (fork PRs) or with use_cache=false. + # Phase 2 dockcross cache rollout — job 1, VERIFIED green in CI (PR #245): sccache v0.16.0 + # probe passed in-container (devtoolset-10 gcc), cache ON over Depot WebDAV (cold run: 275 + # objects stored). Steady-state env below — the first-run diagnostics (SCCACHE_LOG / + # SCCACHE_ERROR_LOG / RUST_BACKTRACE) were dropped now that it is proven. Inert without + # DEPOT_TOKEN (fork PRs) or with use_cache=false; a crashing sccache still falls back to a + # green uncached build via the build.sh probe. env: USE_CACHE: ${{ github.event_name != 'workflow_dispatch' || inputs.use_cache }} SCCACHE_WEBDAV_ENDPOINT: https://cache.depot.dev SCCACHE_WEBDAV_TOKEN: ${{ secrets.DEPOT_TOKEN }} - SCCACHE_LOG: debug - SCCACHE_ERROR_LOG: /tmp/sccache_server.log - RUST_BACKTRACE: full - DOCKCROSS_ARGS: "-e SCCACHE_WEBDAV_ENDPOINT -e SCCACHE_WEBDAV_TOKEN -e USE_CACHE -e SCCACHE_LOG -e SCCACHE_ERROR_LOG -e RUST_BACKTRACE" + DOCKCROSS_ARGS: "-e SCCACHE_WEBDAV_ENDPOINT -e SCCACHE_WEBDAV_TOKEN -e USE_CACHE" steps: - uses: actions/checkout@v6 - name: Download shared WebUI assets diff --git a/CLAUDE.md b/CLAUDE.md index c8ab59b5..75a37fb6 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -213,11 +213,14 @@ a time and verified green in CI before the next**. (The first attempt enabled al and was reverted: the static-musl sccache panicked in-container and — pre-probe — redded the build. 
The probe above now makes that a safe fallback.) Order, each adding the env + a `DOCKCROSS_ARGS` passthrough: -1. `crosscompile-linux-x86_64` (manylinux2014) — **enabled first**, with `SCCACHE_LOG=debug` + - `SCCACHE_ERROR_LOG` + `RUST_BACKTRACE=full` so the run captures the panic root cause if it - recurs. Once green with a cache hit in `sccache --show-stats`, drop the diagnostic vars. -2. `crosscompile-linux-x86_64-cuda` (via `build_cuda_linux.sh`, which execs `build.sh`) — only - the gcc C/C++ TUs cache (134 model files + ggml + httplib); the nvcc `.cu` kernels won't +1. `crosscompile-linux-x86_64` (manylinux2014) — ✅ **verified green** in PR #245: sccache + **v0.16.0** probe passed in-container (devtoolset-10 gcc), `sccache ON` over Depot WebDAV, + cold run stored 275 objects (3 hits). The **v0.8.2 in-container panic is gone on v0.16.0**; + first-run diagnostics dropped, steady-state env = `USE_CACHE` + the two `SCCACHE_WEBDAV_*` + + `DOCKCROSS_ARGS`. +2. `crosscompile-linux-x86_64-cuda` (via `build_cuda_linux.sh`, which execs `build.sh`) — + 🚧 **enabled next** (diagnostics on for its first run on the manylinux_2_28 image). Only the + gcc C/C++ TUs cache (134 model files + ggml + httplib); the nvcc `.cu` kernels won't (limited sccache nvcc support) — still a large partial win on the ~70 min job. 3. `crosscompile-linux-aarch64`, then 4. `crosscompile-android-aarch64`. 5. `crosscompile-android-aarch64-opencl` — **separate**, uses `build_opencl_android.sh` (not From 625d743bfbfb5f49872b4d4492ee05a071fbb8be Mon Sep 17 00:00:00 2001 From: Claude Date: Sat, 20 Jun 2026 13:55:18 +0000 Subject: [PATCH 08/16] build: bump NullAway 0.13.6 -> 0.13.7 and pitest-maven 1.25.4 -> 1.25.5 Both are the latest stable patch releases on Maven Central. NullAway runs at -Xep:NullAway:ERROR and was verified clean with 'mvn compile' in this repo; pitest-maven is a plugin-only patch bump. Part of the cross-repo dependency freshness sweep. 
Co-Authored-By: Claude Opus 4.8 Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5 --- pom.xml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/pom.xml b/pom.xml index 3f1f52c6..3dd10543 100644 --- a/pom.xml +++ b/pom.xml @@ -54,7 +54,7 @@ SPDX-License-Identifier: MIT 1.0.0 1.18.46 2.50.0 - 0.13.6 + 0.13.7 4.2.0 2.22.0 3.8.6 @@ -330,7 +330,7 @@ SPDX-License-Identifier: MIT org.pitest pitest-maven - 1.25.4 + 1.25.5 org.sonatype.central From df71f9ca26f2b60146071648235de966a1d65dea Mon Sep 17 00:00:00 2001 From: Claude Date: Sat, 20 Jun 2026 14:07:10 +0000 Subject: [PATCH 09/16] build: googletest 1.15.2->1.17.0 + opt-in CUDA_FAST_BUILD single-arch dev knob MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit googletest: bump the BUILD_TESTING-only FetchContent (used only by jllama_test's C++ unit tests, not the shipped library and not coupled to llama.cpp) from v1.15.2 to v1.17.0. There is no constraint behind the tag — it is just latest-stable; CLAUDE.md now says to bump it periodically. CUDA_FAST_BUILD: add an opt-in, default-OFF env knob to build_cuda_linux.sh that builds CUDA for a single architecture (default 'native', overridable via the CUDA_ARCH env var) instead of the full release arch set, to speed up local iteration. Default + CI/release behaviour is unchanged (full arch set), so released jars keep full GPU coverage. nvcc .cu kernels are not sccache-cached (limited support), so building fewer archs is the real CUDA build-time lever; rationale documented in CLAUDE.md and inline.
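
The knob's behaviour can be restated as a small function. This is a sketch only: `cuda_arch_args` is a hypothetical helper introduced here for illustration (the script itself sets a variable inline), and it assumes the accepted truthy spellings are `1`, `true`, `TRUE`, `yes` and `on`.

```bash
#!/bin/sh
# Sketch only: map the CUDA_FAST_BUILD / CUDA_ARCH values to the extra cmake argument.
# cuda_arch_args is a hypothetical helper, not a function in build_cuda_linux.sh.
#   $1 = value of CUDA_FAST_BUILD (may be empty)
#   $2 = optional CUDA_ARCH override (defaults to "native")
cuda_arch_args() {
  case "${1:-}" in
    1 | true | TRUE | yes | on)
      # Fast dev build: a single architecture.
      printf '%s\n' "-DCMAKE_CUDA_ARCHITECTURES=${2:-native}"
      ;;
    *)
      # Default: print nothing, so the full arch set chosen by ggml/llama.cpp is built.
      :
      ;;
  esac
}

default_args="$(cuda_arch_args "")"
fast_native="$(cuda_arch_args 1)"
fast_sm90="$(cuda_arch_args on 90)"
echo "default='${default_args}' native='${fast_native}' sm90='${fast_sm90}'"
```

Leaving the argument empty by default is what keeps CI and release builds on the full architecture set.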
Co-Authored-By: Claude Opus 4.8 Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5 --- .github/build_cuda_linux.sh | 24 +++++++++++++++++++++++- CLAUDE.md | 37 +++++++++++++++++++++++++++++++++++++ CMakeLists.txt | 5 ++++- 3 files changed, 64 insertions(+), 2 deletions(-) diff --git a/.github/build_cuda_linux.sh b/.github/build_cuda_linux.sh index d9acbbf2..bf9bc560 100755 --- a/.github/build_cuda_linux.sh +++ b/.github/build_cuda_linux.sh @@ -15,4 +15,26 @@ sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute sudo dnf install -y cuda-toolkit-13-2 -exec .github/build.sh $@ -DGGML_CUDA=1 -DCMAKE_CUDA_COMPILER=/usr/local/cuda-13.2/bin/nvcc +# CUDA target architectures — build-speed knob. +# +# Default (CUDA_FAST_BUILD unset): we do NOT pass CMAKE_CUDA_ARCHITECTURES, so ggml/llama.cpp +# compiles its full default arch set. That is exactly what release artifacts must ship (every +# supported GPU generation) and is the slow part of this ~70 min job: nvcc recompiles each .cu +# kernel once per architecture. sccache caches the gcc C/C++ TUs but NOT the nvcc .cu kernels +# (sccache's nvcc support is limited/experimental), so the per-arch nvcc passes dominate even +# with the cache on — which is why this knob exists as the real CUDA build-time lever. +# +# Dev fast build (CUDA_FAST_BUILD=1): compile for a SINGLE architecture instead of the full +# set, removing most of the nvcc time. Defaults to `native` (the build machine's own GPU — +# needs a GPU present at configure time); override with CUDA_ARCH, e.g. CUDA_ARCH=90. This is +# a MANUAL local-dev knob only: CI and release never set it, because an artifact built this +# way runs on a single GPU generation. (Direct-cmake equivalent: -DCMAKE_CUDA_ARCHITECTURES=native.) 
+CUDA_ARCH_ARGS="" +case "${CUDA_FAST_BUILD:-}" in + 1 | true | TRUE | yes | on) + CUDA_ARCH_ARGS="-DCMAKE_CUDA_ARCHITECTURES=${CUDA_ARCH:-native}" + echo "build_cuda_linux.sh: CUDA_FAST_BUILD set -> ${CUDA_ARCH_ARGS} (DEV ONLY — not release-distributable)" + ;; +esac + +exec .github/build.sh $@ -DGGML_CUDA=1 -DCMAKE_CUDA_COMPILER=/usr/local/cuda-13.2/bin/nvcc $CUDA_ARCH_ARGS diff --git a/CLAUDE.md b/CLAUDE.md index 75a37fb6..aaef25cf 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -38,6 +38,37 @@ git add .github/build_cuda_linux.sh pom.xml CLAUDE.md git commit -m "Upgrade CUDA from 13.2 to 13.3" ``` +### Fast local CUDA builds (`CUDA_FAST_BUILD`) — single-arch speed knob + +The CUDA artifact must ship kernels for **every supported GPU generation**, so the default +build — and every CI/release build — compiles the **full `CMAKE_CUDA_ARCHITECTURES` set** that +ggml/llama.cpp selects. nvcc recompiles each `.cu` kernel once per architecture, which is the +dominant cost of the ~70 min CUDA job. **`sccache` does not help here:** it caches the gcc +C/C++ TUs but not the nvcc `.cu` kernels (sccache's nvcc support is limited/experimental), so +the per-arch nvcc passes remain even with the cache on. The one reliable lever to cut that time +is to build **fewer architectures**. + +`build_cuda_linux.sh` therefore honors an **opt-in** env knob — default **off** (full arch set, +release-safe): + +```bash +# Full release build (default): all archs — slow, runs on every GPU generation. +.github/build_cuda_linux.sh "-DOS_NAME=Linux -DOS_ARCH=x86_64" + +# Fast local dev build: one arch only. Defaults to `native` (the build machine's own GPU; +# needs a GPU present at configure time). Override with CUDA_ARCH=, e.g. CUDA_ARCH=90. 
+CUDA_FAST_BUILD=1 .github/build_cuda_linux.sh "-DOS_NAME=Linux -DOS_ARCH=x86_64" +CUDA_FAST_BUILD=1 CUDA_ARCH=90 .github/build_cuda_linux.sh "-DOS_NAME=Linux -DOS_ARCH=x86_64" +# Direct-cmake equivalent: cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=native +``` + +**Why a separate, off-by-default flag (never enable it in CI/release):** an artifact built with +`CUDA_FAST_BUILD` runs on only the single GPU generation it was compiled for. The flag exists +purely to speed up **local iteration**; the CI CUDA job leaves it unset, so released jars keep +full arch coverage. To cache the nvcc kernels too you would add +`-DCMAKE_CUDA_COMPILER_LAUNCHER=sccache` (gated behind the same probe), but sccache's nvcc +caching is unreliable — the arch knob is the better lever and is what this repo ships. + ## Android minimum API level Current Android minimum API level: **28** (Android 9.0 Pie) @@ -735,6 +766,12 @@ ctest --test-dir build --output-on-failure -R "ResultsToJson" llama.cpp is fetched via CMake FetchContent, pinned to `GIT_TAG b9682`. +**GoogleTest** is a separate `BUILD_TESTING`-only FetchContent (`GIT_TAG v1.17.0`), used solely +by the `jllama_test` C++ unit-test binary — not by the shipped library, and not coupled to the +llama.cpp pin or the bundled nlohmann/json. There is **no constraint behind the exact tag**; it +is just the latest stable at the time it was last touched. Bump it from time to time (nothing +auto-tracks it), pairing the bump with a green `C++ Tests` CI run. + ``` build/_deps/llama.cpp-src/tools/server/ ← server-task.h, server-common.h, etc. 
build/_deps/llama.cpp-src/include/ ← llama.h, llama-cpp.h diff --git a/CMakeLists.txt b/CMakeLists.txt index 89d80585..f9cb148d 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -383,7 +383,10 @@ if(BUILD_TESTING) FetchContent_Declare( googletest GIT_REPOSITORY https://github.com/google/googletest.git - GIT_TAG v1.15.2 + # No constraint behind this exact tag — GoogleTest is only used by this repo's own + # C++ unit tests (jllama_test), not by the shipped library and not tied to llama.cpp. + # It is just "latest stable at the time"; bump it from time to time (see CLAUDE.md). + GIT_TAG v1.17.0 ) # Keep GTest on the same CRT as the rest of the project. # OFF means GTest respects CMAKE_MSVC_RUNTIME_LIBRARY (static /MT here). From 8f064c72e870bda950d45d72f2ffc8e96b2e896b Mon Sep 17 00:00:00 2001 From: Claude Date: Sat, 20 Jun 2026 14:30:38 +0000 Subject: [PATCH 10/16] ci: cache GGUF test models via GitHub actions/cache (skip HuggingFace re-downloads) Each Java-test job re-downloaded ~5 GB of GGUF models from HuggingFace every run. Add an actions/cache@v5 step (path models/, shared key gguf-models-v1) to all four Java-test jobs and guard every model curl with 'test -f models/$NAME ||' so a cache hit skips the download. GGUF files are platform-independent, so ubuntu + macOS share one ~5 GB entry (well under GitHub's free 10 GB/repo cache). Deliberately GitHub's free cache, NOT Depot: Depot Cache is usage-priced (GB-scale model blobs would raise the bill, unlike the tiny content-addressed sccache objects) and its general file cache only works on Depot-hosted runners. Bonus: cache hits also dodge HuggingFace 429s (the reason for the curl --retry flags). Bump the key suffix when the model set/URLs change. 
Co-Authored-By: Claude Opus 4.8 Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5 --- .github/workflows/publish.yml | 94 ++++++++++++++++++++++++----------- 1 file changed, 65 insertions(+), 29 deletions(-) diff --git a/.github/workflows/publish.yml b/.github/workflows/publish.yml index fe1c1170..40bae048 100644 --- a/.github/workflows/publish.yml +++ b/.github/workflows/publish.yml @@ -586,22 +586,31 @@ jobs: with: name: Linux-x86_64-libraries path: ${{ github.workspace }}/src/main/resources/net/ladenthin/llama/ + - name: Cache GGUF models (GitHub Actions cache; avoids re-downloading from HuggingFace) + uses: actions/cache@v5 + with: + path: models/ + # Shared, stable key across all test jobs (GGUF files are platform-independent, so + # ubuntu + macOS share one entry). Bump the suffix when the model set/URLs change. + # Uses GitHub's free 10 GB/repo cache — NOT Depot: these are GB-scale blobs and Depot + # is usage-priced + its file cache needs Depot-hosted runners (see CLAUDE.md). 
+ key: gguf-models-v1 - name: Download text generation model - run: curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${MODEL_URL} --create-dirs -o models/${MODEL_NAME} + run: test -f models/${MODEL_NAME} || curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${MODEL_URL} --create-dirs -o models/${MODEL_NAME} - name: Download reranking model - run: curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${RERANKING_MODEL_URL} --create-dirs -o models/${RERANKING_MODEL_NAME} + run: test -f models/${RERANKING_MODEL_NAME} || curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${RERANKING_MODEL_URL} --create-dirs -o models/${RERANKING_MODEL_NAME} - name: Download draft model - run: curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${DRAFT_MODEL_URL} --create-dirs -o models/${DRAFT_MODEL_NAME} + run: test -f models/${DRAFT_MODEL_NAME} || curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${DRAFT_MODEL_URL} --create-dirs -o models/${DRAFT_MODEL_NAME} - name: Download reasoning model - run: curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${REASONING_MODEL_URL} --create-dirs -o models/${REASONING_MODEL_NAME} + run: test -f models/${REASONING_MODEL_NAME} || curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${REASONING_MODEL_URL} --create-dirs -o models/${REASONING_MODEL_NAME} - name: Download tool-calling model - run: curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${TOOL_MODEL_URL} --create-dirs -o models/${TOOL_MODEL_NAME} + run: test -f models/${TOOL_MODEL_NAME} || curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${TOOL_MODEL_URL} --create-dirs -o models/${TOOL_MODEL_NAME} - name: Download nomic embedding model (issue #98 regression) - run: curl -L --proto =https --proto-redir =https 
--fail --retry 5 --retry-all-errors ${NOMIC_EMBED_MODEL_URL} --create-dirs -o models/${NOMIC_EMBED_MODEL_NAME} + run: test -f models/${NOMIC_EMBED_MODEL_NAME} || curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${NOMIC_EMBED_MODEL_URL} --create-dirs -o models/${NOMIC_EMBED_MODEL_NAME} - name: Download vision model (issues #103 / #34) - run: curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${VISION_MODEL_URL} --create-dirs -o models/${VISION_MODEL_NAME} + run: test -f models/${VISION_MODEL_NAME} || curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${VISION_MODEL_URL} --create-dirs -o models/${VISION_MODEL_NAME} - name: Download vision mmproj - run: curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${VISION_MMPROJ_URL} --create-dirs -o models/${VISION_MMPROJ_NAME} + run: test -f models/${VISION_MMPROJ_NAME} || curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${VISION_MMPROJ_URL} --create-dirs -o models/${VISION_MMPROJ_NAME} - name: List files in models directory run: ls -l models/ - name: Validate model files @@ -710,20 +719,29 @@ jobs: with: name: macos-14-libraries path: ${{ github.workspace }}/src/main/resources/net/ladenthin/llama/ + - name: Cache GGUF models (GitHub Actions cache; avoids re-downloading from HuggingFace) + uses: actions/cache@v5 + with: + path: models/ + # Shared, stable key across all test jobs (GGUF files are platform-independent, so + # ubuntu + macOS share one entry). Bump the suffix when the model set/URLs change. + # Uses GitHub's free 10 GB/repo cache — NOT Depot: these are GB-scale blobs and Depot + # is usage-priced + its file cache needs Depot-hosted runners (see CLAUDE.md). 
+ key: gguf-models-v1 - name: Download text generation model - run: curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${MODEL_URL} --create-dirs -o models/${MODEL_NAME} + run: test -f models/${MODEL_NAME} || curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${MODEL_URL} --create-dirs -o models/${MODEL_NAME} - name: Download reranking model - run: curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${RERANKING_MODEL_URL} --create-dirs -o models/${RERANKING_MODEL_NAME} + run: test -f models/${RERANKING_MODEL_NAME} || curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${RERANKING_MODEL_URL} --create-dirs -o models/${RERANKING_MODEL_NAME} - name: Download draft model - run: curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${DRAFT_MODEL_URL} --create-dirs -o models/${DRAFT_MODEL_NAME} + run: test -f models/${DRAFT_MODEL_NAME} || curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${DRAFT_MODEL_URL} --create-dirs -o models/${DRAFT_MODEL_NAME} - name: Download reasoning model - run: curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${REASONING_MODEL_URL} --create-dirs -o models/${REASONING_MODEL_NAME} + run: test -f models/${REASONING_MODEL_NAME} || curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${REASONING_MODEL_URL} --create-dirs -o models/${REASONING_MODEL_NAME} - name: Download tool-calling model - run: curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${TOOL_MODEL_URL} --create-dirs -o models/${TOOL_MODEL_NAME} + run: test -f models/${TOOL_MODEL_NAME} || curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${TOOL_MODEL_URL} --create-dirs -o models/${TOOL_MODEL_NAME} - name: Download vision model (issues #103 / #34) - run: curl -L --proto =https --proto-redir =https --fail 
--retry 5 --retry-all-errors ${VISION_MODEL_URL} --create-dirs -o models/${VISION_MODEL_NAME} + run: test -f models/${VISION_MODEL_NAME} || curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${VISION_MODEL_URL} --create-dirs -o models/${VISION_MODEL_NAME} - name: Download vision mmproj - run: curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${VISION_MMPROJ_URL} --create-dirs -o models/${VISION_MMPROJ_NAME} + run: test -f models/${VISION_MMPROJ_NAME} || curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${VISION_MMPROJ_URL} --create-dirs -o models/${VISION_MMPROJ_NAME} - name: List files in models directory run: ls -l models/ - name: Validate model files @@ -777,20 +795,29 @@ jobs: with: name: macos-15-libraries path: ${{ github.workspace }}/src/main/resources/net/ladenthin/llama/ + - name: Cache GGUF models (GitHub Actions cache; avoids re-downloading from HuggingFace) + uses: actions/cache@v5 + with: + path: models/ + # Shared, stable key across all test jobs (GGUF files are platform-independent, so + # ubuntu + macOS share one entry). Bump the suffix when the model set/URLs change. + # Uses GitHub's free 10 GB/repo cache — NOT Depot: these are GB-scale blobs and Depot + # is usage-priced + its file cache needs Depot-hosted runners (see CLAUDE.md). 
+ key: gguf-models-v1 - name: Download text generation model - run: curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${MODEL_URL} --create-dirs -o models/${MODEL_NAME} + run: test -f models/${MODEL_NAME} || curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${MODEL_URL} --create-dirs -o models/${MODEL_NAME} - name: Download reranking model - run: curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${RERANKING_MODEL_URL} --create-dirs -o models/${RERANKING_MODEL_NAME} + run: test -f models/${RERANKING_MODEL_NAME} || curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${RERANKING_MODEL_URL} --create-dirs -o models/${RERANKING_MODEL_NAME} - name: Download draft model - run: curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${DRAFT_MODEL_URL} --create-dirs -o models/${DRAFT_MODEL_NAME} + run: test -f models/${DRAFT_MODEL_NAME} || curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${DRAFT_MODEL_URL} --create-dirs -o models/${DRAFT_MODEL_NAME} - name: Download reasoning model - run: curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${REASONING_MODEL_URL} --create-dirs -o models/${REASONING_MODEL_NAME} + run: test -f models/${REASONING_MODEL_NAME} || curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${REASONING_MODEL_URL} --create-dirs -o models/${REASONING_MODEL_NAME} - name: Download tool-calling model - run: curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${TOOL_MODEL_URL} --create-dirs -o models/${TOOL_MODEL_NAME} + run: test -f models/${TOOL_MODEL_NAME} || curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${TOOL_MODEL_URL} --create-dirs -o models/${TOOL_MODEL_NAME} - name: Download vision model (issues #103 / #34) - run: curl -L --proto =https --proto-redir =https --fail 
--retry 5 --retry-all-errors ${VISION_MODEL_URL} --create-dirs -o models/${VISION_MODEL_NAME} + run: test -f models/${VISION_MODEL_NAME} || curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${VISION_MODEL_URL} --create-dirs -o models/${VISION_MODEL_NAME} - name: Download vision mmproj - run: curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${VISION_MMPROJ_URL} --create-dirs -o models/${VISION_MMPROJ_NAME} + run: test -f models/${VISION_MMPROJ_NAME} || curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${VISION_MMPROJ_URL} --create-dirs -o models/${VISION_MMPROJ_NAME} - name: List files in models directory run: ls -l models/ - name: Validate model files @@ -844,20 +871,29 @@ jobs: with: name: macos-15-metal-libraries path: ${{ github.workspace }}/src/main/resources/net/ladenthin/llama/ + - name: Cache GGUF models (GitHub Actions cache; avoids re-downloading from HuggingFace) + uses: actions/cache@v5 + with: + path: models/ + # Shared, stable key across all test jobs (GGUF files are platform-independent, so + # ubuntu + macOS share one entry). Bump the suffix when the model set/URLs change. + # Uses GitHub's free 10 GB/repo cache — NOT Depot: these are GB-scale blobs and Depot + # is usage-priced + its file cache needs Depot-hosted runners (see CLAUDE.md). 
+ key: gguf-models-v1 - name: Download text generation model - run: curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${MODEL_URL} --create-dirs -o models/${MODEL_NAME} + run: test -f models/${MODEL_NAME} || curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${MODEL_URL} --create-dirs -o models/${MODEL_NAME} - name: Download reranking model - run: curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${RERANKING_MODEL_URL} --create-dirs -o models/${RERANKING_MODEL_NAME} + run: test -f models/${RERANKING_MODEL_NAME} || curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${RERANKING_MODEL_URL} --create-dirs -o models/${RERANKING_MODEL_NAME} - name: Download draft model - run: curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${DRAFT_MODEL_URL} --create-dirs -o models/${DRAFT_MODEL_NAME} + run: test -f models/${DRAFT_MODEL_NAME} || curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${DRAFT_MODEL_URL} --create-dirs -o models/${DRAFT_MODEL_NAME} - name: Download reasoning model - run: curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${REASONING_MODEL_URL} --create-dirs -o models/${REASONING_MODEL_NAME} + run: test -f models/${REASONING_MODEL_NAME} || curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${REASONING_MODEL_URL} --create-dirs -o models/${REASONING_MODEL_NAME} - name: Download tool-calling model - run: curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${TOOL_MODEL_URL} --create-dirs -o models/${TOOL_MODEL_NAME} + run: test -f models/${TOOL_MODEL_NAME} || curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${TOOL_MODEL_URL} --create-dirs -o models/${TOOL_MODEL_NAME} - name: Download vision model (issues #103 / #34) - run: curl -L --proto =https --proto-redir =https --fail 
--retry 5 --retry-all-errors ${VISION_MODEL_URL} --create-dirs -o models/${VISION_MODEL_NAME} + run: test -f models/${VISION_MODEL_NAME} || curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${VISION_MODEL_URL} --create-dirs -o models/${VISION_MODEL_NAME} - name: Download vision mmproj - run: curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${VISION_MMPROJ_URL} --create-dirs -o models/${VISION_MMPROJ_NAME} + run: test -f models/${VISION_MMPROJ_NAME} || curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${VISION_MMPROJ_URL} --create-dirs -o models/${VISION_MMPROJ_NAME} - name: List files in models directory run: ls -l models/ - name: Validate model files From 9a1d4931c9665ea5f6dc147646c6140ef4a40979 Mon Sep 17 00:00:00 2001 From: Claude Date: Sat, 20 Jun 2026 14:40:39 +0000 Subject: [PATCH 11/16] docs(ci): explain the GGUF model cache (purpose, no flag, vs sccache) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Expand the inline comment on the model-cache step: it exists to avoid re-downloading ~5 GB of GGUF test models from HuggingFace every run (and to dodge HF rate-limits). It is always ON by design — no on/off flag — unlike the sccache compiler cache, which the use_cache input / USE_CACHE env toggles. Notes it uses GitHub's free cache, not Depot. Comment-only; no behaviour change. 
Co-Authored-By: Claude Opus 4.8 Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5 --- .github/workflows/publish.yml | 48 +++++++++++++++++++++++------------ 1 file changed, 32 insertions(+), 16 deletions(-) diff --git a/.github/workflows/publish.yml b/.github/workflows/publish.yml index 40bae048..5ec515e1 100644 --- a/.github/workflows/publish.yml +++ b/.github/workflows/publish.yml @@ -586,14 +586,18 @@ jobs: with: name: Linux-x86_64-libraries path: ${{ github.workspace }}/src/main/resources/net/ladenthin/llama/ + # GGUF model cache — introduced to stop re-downloading ~5 GB of test models from + # HuggingFace on every run (also dodges HF rate-limits). Complements the sccache compiler + # cache but is always ON: there is intentionally NO on/off flag for it (it is GitHub's + # free cache, safe + free), whereas the sccache cache is toggled by the `use_cache` + # workflow_dispatch input / USE_CACHE env. Not Depot — GB-scale blobs are usage-priced + # there and its file cache needs Depot-hosted runners. See CLAUDE.md. - name: Cache GGUF models (GitHub Actions cache; avoids re-downloading from HuggingFace) uses: actions/cache@v5 with: path: models/ - # Shared, stable key across all test jobs (GGUF files are platform-independent, so - # ubuntu + macOS share one entry). Bump the suffix when the model set/URLs change. - # Uses GitHub's free 10 GB/repo cache — NOT Depot: these are GB-scale blobs and Depot - # is usage-priced + its file cache needs Depot-hosted runners (see CLAUDE.md). + # GGUF is platform-independent, so ubuntu + macOS share one entry; + # bump the suffix when the model set / URLs change. 
key: gguf-models-v1 - name: Download text generation model run: test -f models/${MODEL_NAME} || curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${MODEL_URL} --create-dirs -o models/${MODEL_NAME} @@ -719,14 +723,18 @@ jobs: with: name: macos-14-libraries path: ${{ github.workspace }}/src/main/resources/net/ladenthin/llama/ + # GGUF model cache — introduced to stop re-downloading ~5 GB of test models from + # HuggingFace on every run (also dodges HF rate-limits). Complements the sccache compiler + # cache but is always ON: there is intentionally NO on/off flag for it (it is GitHub's + # free cache, safe + free), whereas the sccache cache is toggled by the `use_cache` + # workflow_dispatch input / USE_CACHE env. Not Depot — GB-scale blobs are usage-priced + # there and its file cache needs Depot-hosted runners. See CLAUDE.md. - name: Cache GGUF models (GitHub Actions cache; avoids re-downloading from HuggingFace) uses: actions/cache@v5 with: path: models/ - # Shared, stable key across all test jobs (GGUF files are platform-independent, so - # ubuntu + macOS share one entry). Bump the suffix when the model set/URLs change. - # Uses GitHub's free 10 GB/repo cache — NOT Depot: these are GB-scale blobs and Depot - # is usage-priced + its file cache needs Depot-hosted runners (see CLAUDE.md). + # GGUF is platform-independent, so ubuntu + macOS share one entry; + # bump the suffix when the model set / URLs change. key: gguf-models-v1 - name: Download text generation model run: test -f models/${MODEL_NAME} || curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${MODEL_URL} --create-dirs -o models/${MODEL_NAME} @@ -795,14 +803,18 @@ jobs: with: name: macos-15-libraries path: ${{ github.workspace }}/src/main/resources/net/ladenthin/llama/ + # GGUF model cache — introduced to stop re-downloading ~5 GB of test models from + # HuggingFace on every run (also dodges HF rate-limits). 
Complements the sccache compiler + # cache but is always ON: there is intentionally NO on/off flag for it (it is GitHub's + # free cache, safe + free), whereas the sccache cache is toggled by the `use_cache` + # workflow_dispatch input / USE_CACHE env. Not Depot — GB-scale blobs are usage-priced + # there and its file cache needs Depot-hosted runners. See CLAUDE.md. - name: Cache GGUF models (GitHub Actions cache; avoids re-downloading from HuggingFace) uses: actions/cache@v5 with: path: models/ - # Shared, stable key across all test jobs (GGUF files are platform-independent, so - # ubuntu + macOS share one entry). Bump the suffix when the model set/URLs change. - # Uses GitHub's free 10 GB/repo cache — NOT Depot: these are GB-scale blobs and Depot - # is usage-priced + its file cache needs Depot-hosted runners (see CLAUDE.md). + # GGUF is platform-independent, so ubuntu + macOS share one entry; + # bump the suffix when the model set / URLs change. key: gguf-models-v1 - name: Download text generation model run: test -f models/${MODEL_NAME} || curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${MODEL_URL} --create-dirs -o models/${MODEL_NAME} @@ -871,14 +883,18 @@ jobs: with: name: macos-15-metal-libraries path: ${{ github.workspace }}/src/main/resources/net/ladenthin/llama/ + # GGUF model cache — introduced to stop re-downloading ~5 GB of test models from + # HuggingFace on every run (also dodges HF rate-limits). Complements the sccache compiler + # cache but is always ON: there is intentionally NO on/off flag for it (it is GitHub's + # free cache, safe + free), whereas the sccache cache is toggled by the `use_cache` + # workflow_dispatch input / USE_CACHE env. Not Depot — GB-scale blobs are usage-priced + # there and its file cache needs Depot-hosted runners. See CLAUDE.md. 
- name: Cache GGUF models (GitHub Actions cache; avoids re-downloading from HuggingFace) uses: actions/cache@v5 with: path: models/ - # Shared, stable key across all test jobs (GGUF files are platform-independent, so - # ubuntu + macOS share one entry). Bump the suffix when the model set/URLs change. - # Uses GitHub's free 10 GB/repo cache — NOT Depot: these are GB-scale blobs and Depot - # is usage-priced + its file cache needs Depot-hosted runners (see CLAUDE.md). + # GGUF is platform-independent, so ubuntu + macOS share one entry; + # bump the suffix when the model set / URLs change. key: gguf-models-v1 - name: Download text generation model run: test -f models/${MODEL_NAME} || curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${MODEL_URL} --create-dirs -o models/${MODEL_NAME} From dd264b2a6e0d76a6ddfc47c6dd0d0914f781d96b Mon Sep 17 00:00:00 2001 From: Claude Date: Sat, 20 Jun 2026 15:06:39 +0000 Subject: [PATCH 12/16] feat(server): add NativeServer JNI-bridge scaffold (native HTTP server + WebUI) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Minimal structural wiring for the planned native server: NativeServer sits next to OpenAiCompatServer (the Java server) as the entry point for the upstream native HTTP transport (server-http.cpp + cpp-httplib) already compiled into libjllama — the only component that can serve the embedded WebUI. Scaffold only: start() throws UnsupportedOperationException until the upstream routes (server.cpp's registration) are wired to a JNI entry point; isRunning()/getHost()/getPort()/close() are model-free placeholders. The native methods + C++ implementation + lifecycle are a separate, detailed step. Adds a model-free smoke test (NativeServerSmokeTest, 3 tests). Verified locally: compile (Error Prone/NullAway/Checker), javadoc (failOnWarnings), SpotBugs Max/Low (0 bugs, @ToString clears IMC), ArchUnit (12/12). 
Co-Authored-By: Claude Opus 4.8 Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5 --- .../ladenthin/llama/server/NativeServer.java | 109 ++++++++++++++++++ .../llama/server/NativeServerSmokeTest.java | 48 ++++++++ 2 files changed, 157 insertions(+) create mode 100644 src/main/java/net/ladenthin/llama/server/NativeServer.java create mode 100644 src/test/java/net/ladenthin/llama/server/NativeServerSmokeTest.java diff --git a/src/main/java/net/ladenthin/llama/server/NativeServer.java b/src/main/java/net/ladenthin/llama/server/NativeServer.java new file mode 100644 index 00000000..024ac827 --- /dev/null +++ b/src/main/java/net/ladenthin/llama/server/NativeServer.java @@ -0,0 +1,109 @@ +// SPDX-FileCopyrightText: 2026 Bernard Ladenthin +// +// SPDX-License-Identifier: MIT + +package net.ladenthin.llama.server; + +import java.util.Objects; +import lombok.ToString; + +/** + * Scaffold for the native HTTP server bridge — the planned counterpart to + * {@link OpenAiCompatServer}. + * + *
<p>{@link OpenAiCompatServer} implements the HTTP transport in Java (on the JDK's + * {@code com.sun.net.httpserver}) and drives the native llama.cpp server core over JNI. This + * class is instead the entry point for the upstream native HTTP transport that is already + * compiled into {@code libjllama} (llama.cpp's {@code server-http.cpp} plus its {@code cpp-httplib} + * backend). That native transport is the only component able to serve the embedded llama.cpp + * WebUI (the {@code ui.cpp}/{@code ui.h} asset table compiled in behind + * {@code LLAMA_UI_HAS_ASSETS}).</p>
+ * + *
<p>Status: scaffold only. The route registration that upstream performs in + * {@code server.cpp} (deliberately excluded from this build) is not yet wired to a JNI entry point, so + * {@link #start()} throws {@link UnsupportedOperationException} for now. This class only fixes the + * package structure and the public API shape; the native {@code startServer}/{@code stopServer} + * methods, their C++ implementation, the server lifecycle/threading and WebUI serving are a separate, + * detailed step (see {@code CLAUDE.md}, "WebUI (llama.cpp Svelte UI) embedding").</p>
+ * + *
<p>It is {@link AutoCloseable} so that, once implemented, callers can drive it with + * try-with-resources exactly like {@link OpenAiCompatServer}.</p>
+ */ +@ToString +public final class NativeServer implements AutoCloseable { + + /** Message thrown by {@link #start()} until the native route-wiring lands. */ + static final String NOT_WIRED_MESSAGE = + "NativeServer is a scaffold: the upstream native HTTP routes (server-http.cpp) are " + + "not yet wired to JNI. Use OpenAiCompatServer for now; the native server and " + + "embedded WebUI are a planned step."; + + /** Immutable server configuration (bind host, port, ...) shared with {@link OpenAiCompatServer}. */ + private final OpenAiServerConfig config; + + /** + * Creates a native-server bridge for the given configuration. + * + *

Construction performs no native work and binds no socket; it only captures the configuration. + * Call {@link #start()} to launch the server (not implemented yet).

+ * + * @param config the server configuration (host, port, ...); must not be {@code null} + */ + public NativeServer(OpenAiServerConfig config) { + this.config = Objects.requireNonNull(config, "config"); + } + + /** + * Starts the native HTTP server and begins serving the embedded WebUI. + * + *

Not implemented yet — this is a scaffold. The native route registration and + * its JNI binding are a planned step, so this method always throws until then.

+ * + * @return this server instance (for fluent / try-with-resources use), once implemented + * @throws UnsupportedOperationException always, until the native routes are wired to JNI + */ + // Scaffold: start() intentionally always throws for now, but must stay callable (not @DoNotCall) + // so the real implementation and its callers/tests keep the same signature. + @SuppressWarnings("DoNotCallSuggester") + public NativeServer start() { + throw new UnsupportedOperationException(NOT_WIRED_MESSAGE); + } + + /** + * Reports whether the native server is currently running. + * + * @return {@code false} — the scaffold never starts a server yet + */ + public boolean isRunning() { + return false; + } + + /** + * Returns the host the server is configured to bind to. + * + * @return the configured bind host + */ + public String getHost() { + return config.getHost(); + } + + /** + * Returns the port the server is configured to bind to. + * + * @return the configured port + */ + public int getPort() { + return config.getPort(); + } + + /** + * Stops the native server if it is running. + * + *

No-op in the scaffold (nothing is ever started), so it is always safe to call, including from + * try-with-resources. Real lifecycle teardown is part of the planned native-server implementation.

+ */ + @Override + public void close() { + // Nothing is started yet, so there is nothing to release. + } +} diff --git a/src/test/java/net/ladenthin/llama/server/NativeServerSmokeTest.java b/src/test/java/net/ladenthin/llama/server/NativeServerSmokeTest.java new file mode 100644 index 00000000..7e74dec4 --- /dev/null +++ b/src/test/java/net/ladenthin/llama/server/NativeServerSmokeTest.java @@ -0,0 +1,48 @@ +// SPDX-FileCopyrightText: 2026 Bernard Ladenthin +// +// SPDX-License-Identifier: MIT + +package net.ladenthin.llama.server; + +import static org.hamcrest.MatcherAssert.assertThat; +import static org.hamcrest.Matchers.containsString; +import static org.hamcrest.Matchers.is; +import static org.junit.jupiter.api.Assertions.assertThrows; + +import org.junit.jupiter.api.Test; + +/** + * Model-free smoke test for the {@link NativeServer} scaffold: it must construct without any native + * work, expose its configured host/port, never report itself running, throw a clear + * {@link UnsupportedOperationException} from {@link NativeServer#start()} until the native routes are + * wired, and be a safe no-op {@link AutoCloseable}. No model and no {@code libjllama} required. 
+ */ +public class NativeServerSmokeTest { + + private static OpenAiServerConfig config() { + return OpenAiServerConfig.builder().host("127.0.0.1").port(1234).build(); + } + + @Test + public void exposesConfiguredHostAndPortWithoutStarting() { + NativeServer server = new NativeServer(config()); + assertThat(server.getHost(), is("127.0.0.1")); + assertThat(server.getPort(), is(1234)); + assertThat(server.isRunning(), is(false)); + } + + @Test + public void startThrowsUntilNativeRoutesAreWired() { + NativeServer server = new NativeServer(config()); + UnsupportedOperationException ex = assertThrows(UnsupportedOperationException.class, server::start); + assertThat(ex.getMessage(), containsString("not yet wired")); + assertThat(server.isRunning(), is(false)); + } + + @Test + public void closeIsSafeNoOpEvenViaTryWithResources() { + try (NativeServer server = new NativeServer(config())) { + assertThat(server.isRunning(), is(false)); + } + } +} From 698258db03d5bfc8ddee2dcfd5969bcdc4868cda Mon Sep 17 00:00:00 2001 From: Claude Date: Sat, 20 Jun 2026 15:24:03 +0000 Subject: [PATCH 13/16] ci(cuda): fast single-arch CUDA build for validation, full arch set only on publish MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Invert the CUDA build-time/coverage trade-off in CI without risking the distributed jar. The crosscompile-linux-x86_64-cuda job now sets CUDA_FAST_BUILD=1 (single arch, CUDA_ARCH=90) for validation runs (PR/push/non-publish dispatch) to cut nvcc time, and CUDA_FAST_BUILD=0 (full arch set) only when publish_to_central is set. Because publish-snapshot/publish-release require publish_to_central, every artifact that reaches Maven Central is still built for every GPU generation — only non-distributed validation builds go fast. CI has no GPU so the fast path pins a fixed CUDA_ARCH (native would fail at configure); both vars are forwarded into the dockcross container via DOCKCROSS_ARGS -e. 
build_cuda_linux.sh's own default stays off, so local/manual builds remain release-safe unless you opt in. Docs updated in CLAUDE.md. Co-Authored-By: Claude Opus 4.8 Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5 --- .github/workflows/publish.yml | 10 +++++++++- CLAUDE.md | 17 ++++++++++++----- 2 files changed, 21 insertions(+), 6 deletions(-) diff --git a/.github/workflows/publish.yml b/.github/workflows/publish.yml index 5ec515e1..9656612c 100644 --- a/.github/workflows/publish.yml +++ b/.github/workflows/publish.yml @@ -184,7 +184,15 @@ jobs: SCCACHE_LOG: debug SCCACHE_ERROR_LOG: /tmp/sccache_server.log RUST_BACKTRACE: full - DOCKCROSS_ARGS: "-e SCCACHE_WEBDAV_ENDPOINT -e SCCACHE_WEBDAV_TOKEN -e USE_CACHE -e SCCACHE_LOG -e SCCACHE_ERROR_LOG -e RUST_BACKTRACE" + # CUDA arch policy: FAST single-arch build for validation runs (PR / push / non-publish + # dispatch) to cut nvcc time; FULL arch set only when actually publishing to Central + # (publish_to_central=true) so the distributed jar runs on every GPU generation. The + # publish-snapshot/publish-release jobs require publish_to_central, so any artifact that + # reaches Central is always built with the full set. CI has no GPU, so the fast path pins a + # fixed CUDA_ARCH ('native' would fail at configure). '0' => full (release-safe), '1' => fast. 
+ CUDA_FAST_BUILD: ${{ inputs.publish_to_central && '0' || '1' }} + CUDA_ARCH: '90' + DOCKCROSS_ARGS: "-e SCCACHE_WEBDAV_ENDPOINT -e SCCACHE_WEBDAV_TOKEN -e USE_CACHE -e SCCACHE_LOG -e SCCACHE_ERROR_LOG -e RUST_BACKTRACE -e CUDA_FAST_BUILD -e CUDA_ARCH" steps: - uses: actions/checkout@v6 - name: Download shared WebUI assets diff --git a/CLAUDE.md b/CLAUDE.md index aaef25cf..1ff2d4d2 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -62,11 +62,18 @@ CUDA_FAST_BUILD=1 CUDA_ARCH=90 .github/build_cuda_linux.sh "-DOS_NAME=Linux -DOS # Direct-cmake equivalent: cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=native ``` -**Why a separate, off-by-default flag (never enable it in CI/release):** an artifact built with -`CUDA_FAST_BUILD` runs on only the single GPU generation it was compiled for. The flag exists -purely to speed up **local iteration**; the CI CUDA job leaves it unset, so released jars keep -full arch coverage. To cache the nvcc kernels too you would add -`-DCMAKE_CUDA_COMPILER_LAUNCHER=sccache` (gated behind the same probe), but sccache's nvcc +**Default + CI policy (release-safety is the invariant).** An artifact built with `CUDA_FAST_BUILD` +runs on only the single GPU generation it was compiled for, so the **distributed jar must always be +the full arch set**. The script default is **off** (full) so any *local/manual* build is +release-safe. In CI (`publish.yml`, the `crosscompile-linux-x86_64-cuda` job) the flag is **on for +validation runs** (PR / push / non-publish dispatch) to cut nvcc time, and **off only when actually +publishing to Central** — it is wired as `CUDA_FAST_BUILD: ${{ inputs.publish_to_central && '0' || '1' }}` +(`'0'`=full, `'1'`=fast). Because the `publish-snapshot`/`publish-release` jobs require +`publish_to_central`, **every artifact that reaches Central is built with the full arch set** while +ordinary PR/push CI stays fast. 
CI has no GPU, so the fast path pins a fixed `CUDA_ARCH` (default +`90` in the job env) — `native` would fail at configure. Both `CUDA_FAST_BUILD` and `CUDA_ARCH` are +forwarded into the dockcross container via `DOCKCROSS_ARGS` `-e`. To cache the nvcc kernels too you +would add `-DCMAKE_CUDA_COMPILER_LAUNCHER=sccache` (gated behind the same probe), but sccache's nvcc caching is unreliable — the arch knob is the better lever and is what this repo ships. ## Android minimum API level From c0a10cb44203ac2d9d3952c1a74c3cdfe83716bb Mon Sep 17 00:00:00 2001 From: Claude Date: Sat, 20 Jun 2026 15:26:56 +0000 Subject: [PATCH 14/16] ci(cuda): pin newest arch (sm_120 Blackwell) for the fast validation build Change the fast-path CUDA_ARCH from 90 to 120 (the newest CUDA 13.2 compute capability, consumer Blackwell / RTX 50xx) per request. Only affects the fast single-arch validation build (PR/push); publish runs still build the full arch set. Bump as newer GPU generations ship. Co-Authored-By: Claude Opus 4.8 Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5 --- .github/workflows/publish.yml | 4 +++- CLAUDE.md | 3 ++- 2 files changed, 5 insertions(+), 2 deletions(-) diff --git a/.github/workflows/publish.yml b/.github/workflows/publish.yml index 9656612c..f5eb33ec 100644 --- a/.github/workflows/publish.yml +++ b/.github/workflows/publish.yml @@ -191,7 +191,9 @@ jobs: # reaches Central is always built with the full set. CI has no GPU, so the fast path pins a # fixed CUDA_ARCH ('native' would fail at configure). '0' => full (release-safe), '1' => fast. CUDA_FAST_BUILD: ${{ inputs.publish_to_central && '0' || '1' }} - CUDA_ARCH: '90' + # Newest CUDA 13.2 architecture: sm_120 (consumer Blackwell / RTX 50xx). Only used on the + # fast validation path; bump as newer GPU generations ship. Releases ignore it (full set). 
+ CUDA_ARCH: '120' DOCKCROSS_ARGS: "-e SCCACHE_WEBDAV_ENDPOINT -e SCCACHE_WEBDAV_TOKEN -e USE_CACHE -e SCCACHE_LOG -e SCCACHE_ERROR_LOG -e RUST_BACKTRACE -e CUDA_FAST_BUILD -e CUDA_ARCH" steps: - uses: actions/checkout@v6 diff --git a/CLAUDE.md b/CLAUDE.md index 1ff2d4d2..bb51cdf9 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -71,7 +71,8 @@ publishing to Central** — it is wired as `CUDA_FAST_BUILD: ${{ inputs.publish_ (`'0'`=full, `'1'`=fast). Because the `publish-snapshot`/`publish-release` jobs require `publish_to_central`, **every artifact that reaches Central is built with the full arch set** while ordinary PR/push CI stays fast. CI has no GPU, so the fast path pins a fixed `CUDA_ARCH` (default -`90` in the job env) — `native` would fail at configure. Both `CUDA_FAST_BUILD` and `CUDA_ARCH` are +`120` — the newest CUDA 13.2 arch, sm_120 / consumer Blackwell — in the job env) — `native` +would fail at configure. Both `CUDA_FAST_BUILD` and `CUDA_ARCH` are forwarded into the dockcross container via `DOCKCROSS_ARGS` `-e`. To cache the nvcc kernels too you would add `-DCMAKE_CUDA_COMPILER_LAUNCHER=sccache` (gated behind the same probe), but sccache's nvcc caching is unreliable — the arch knob is the better lever and is what this repo ships. From 3ab3aa7e356e5dacad006474ffe110987a18b1ff Mon Sep 17 00:00:00 2001 From: Claude Date: Sat, 20 Jun 2026 15:39:59 +0000 Subject: [PATCH 15/16] ci(sccache): enable Phase 2 cache on all 5 dockcross jobs at once Add the USE_CACHE / SCCACHE_WEBDAV_* / DOCKCROSS_ARGS env to crosscompile-linux-aarch64, crosscompile-android-aarch64, and crosscompile-android-aarch64-opencl (jobs 3-5). Jobs 1-2 were already enabled (manylinux2014 verified green, CUDA first run in progress). The build.sh probe-compile health-check makes it safe to enable all jobs simultaneously: any container where sccache crashes automatically falls back to an uncached green build, so there is no need to stage one job at a time anymore. 
build_opencl_android.sh previously called cmake directly; changed to exec build.sh (same pattern as build_cuda_linux.sh) so it inherits the sccache probe + Depot launcher + --show-stats without duplicating any download/probe logic. Co-Authored-By: Claude Opus 4.8 Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5 --- .github/build_opencl_android.sh | 10 +++++----- .github/workflows/publish.yml | 25 +++++++++++++++++++++++++ CLAUDE.md | 30 ++++++++++++++++-------------- 3 files changed, 46 insertions(+), 19 deletions(-) diff --git a/.github/build_opencl_android.sh b/.github/build_opencl_android.sh index 33053f4a..efa3789c 100755 --- a/.github/build_opencl_android.sh +++ b/.github/build_opencl_android.sh @@ -42,11 +42,11 @@ if [ ! -f "$LOADER_BUILD/libOpenCL.so" ]; then cmake --build "$LOADER_BUILD" --config Release -j"$(nproc)" fi -mkdir -p build -# Match .github/build.sh: pass $@ unquoted so the CI's single-string +# Delegate the jllama cmake configure + build to build.sh so it inherits the +# sccache probe, Depot cache launcher, and --show-stats output automatically — +# same as build_cuda_linux.sh. Pass $@ unquoted so the CI's single-string # argument is word-split into individual -D flags for cmake. -cmake -Bbuild \ +exec .github/build.sh \ -DOpenCL_INCLUDE_DIR="$HEADERS_DIR" \ -DOpenCL_LIBRARY="$LOADER_BUILD/libOpenCL.so" \ - $@ || exit 1 -cmake --build build --config Release -j"$(nproc)" || exit 1 + $@ diff --git a/.github/workflows/publish.yml b/.github/workflows/publish.yml index f5eb33ec..18e15ca3 100644 --- a/.github/workflows/publish.yml +++ b/.github/workflows/publish.yml @@ -264,6 +264,14 @@ jobs: name: Cross-Compile Linux aarch64 (LTS) needs: [startgate, build-webui] runs-on: ubuntu-latest + # Phase 2 dockcross cache rollout — job 3. Same steady-state env as manylinux2014 (job 1); + # the build.sh probe makes it safe to enable without a separate verification run. Inert + # without DEPOT_TOKEN (fork PRs) or use_cache=false. 
+ env: + USE_CACHE: ${{ github.event_name != 'workflow_dispatch' || inputs.use_cache }} + SCCACHE_WEBDAV_ENDPOINT: https://cache.depot.dev + SCCACHE_WEBDAV_TOKEN: ${{ secrets.DEPOT_TOKEN }} + DOCKCROSS_ARGS: "-e SCCACHE_WEBDAV_ENDPOINT -e SCCACHE_WEBDAV_TOKEN -e USE_CACHE" steps: - uses: actions/checkout@v6 - name: Download shared WebUI assets @@ -293,6 +301,14 @@ jobs: name: Cross-Compile Android aarch64 needs: [startgate, build-webui] runs-on: ubuntu-latest + # Phase 2 dockcross cache rollout — job 4. Same steady-state env as manylinux2014 (job 1); + # the build.sh probe makes it safe to enable without a separate verification run. Inert + # without DEPOT_TOKEN (fork PRs) or use_cache=false. + env: + USE_CACHE: ${{ github.event_name != 'workflow_dispatch' || inputs.use_cache }} + SCCACHE_WEBDAV_ENDPOINT: https://cache.depot.dev + SCCACHE_WEBDAV_TOKEN: ${{ secrets.DEPOT_TOKEN }} + DOCKCROSS_ARGS: "-e SCCACHE_WEBDAV_ENDPOINT -e SCCACHE_WEBDAV_TOKEN -e USE_CACHE" steps: - uses: actions/checkout@v6 - name: Download shared WebUI assets @@ -322,6 +338,15 @@ jobs: name: Cross-Compile Android aarch64 (OpenCL/Adreno) needs: [startgate, build-webui] runs-on: ubuntu-latest + # Phase 2 dockcross cache rollout — job 5. build_opencl_android.sh stages the OpenCL + # headers/loader, then delegates the jllama cmake build to build.sh (which owns the + # sccache probe + launcher). Same steady-state env as the other dockcross jobs. Inert + # without DEPOT_TOKEN (fork PRs) or use_cache=false. + env: + USE_CACHE: ${{ github.event_name != 'workflow_dispatch' || inputs.use_cache }} + SCCACHE_WEBDAV_ENDPOINT: https://cache.depot.dev + SCCACHE_WEBDAV_TOKEN: ${{ secrets.DEPOT_TOKEN }} + DOCKCROSS_ARGS: "-e SCCACHE_WEBDAV_ENDPOINT -e SCCACHE_WEBDAV_TOKEN -e USE_CACHE" steps: - uses: actions/checkout@v6 - name: Download shared WebUI assets diff --git a/CLAUDE.md b/CLAUDE.md index bb51cdf9..e654df78 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -247,23 +247,25 @@ absent-only guard left. 
**Rollout.** **Phase 1 — DONE & proven: the 3 macOS build jobs** (slowest + OOM-prone) — `brew install sccache` + the env above + `BUILD_JOBS: 2`. macOS build dropped **~40 min → ~6 min** -with a warm cache. **Phase 2 — in progress: the dockcross cross-compiles**, enabled **one job at -a time and verified green in CI before the next**. (The first attempt enabled all four at once -and was reverted: the static-musl sccache panicked in-container and — pre-probe — redded the -build. The probe above now makes that a safe fallback.) Order, each adding the env + a -`DOCKCROSS_ARGS` passthrough: +with a warm cache. **Phase 2 — DONE: all 5 dockcross cross-compile jobs** now have the same +steady-state env (`USE_CACHE` + `SCCACHE_WEBDAV_*` + `DOCKCROSS_ARGS`). The probe makes it safe +to enable them all at once — any container where sccache crashes falls back to an uncached green +build automatically. (The first attempt enabled all four at once without the probe and was +reverted: the static-musl sccache v0.8.2 panicked in-container and redded the build. With +v0.16.0 + the probe this is no longer a risk.) Job-by-job status: 1. `crosscompile-linux-x86_64` (manylinux2014) — ✅ **verified green** in PR #245: sccache **v0.16.0** probe passed in-container (devtoolset-10 gcc), `sccache ON` over Depot WebDAV, - cold run stored 275 objects (3 hits). The **v0.8.2 in-container panic is gone on v0.16.0**; - first-run diagnostics dropped, steady-state env = `USE_CACHE` + the two `SCCACHE_WEBDAV_*` - + `DOCKCROSS_ARGS`. + warm cache 277/278 hits (99.64%), 1m46s build time. 2. `crosscompile-linux-x86_64-cuda` (via `build_cuda_linux.sh`, which execs `build.sh`) — - 🚧 **enabled next** (diagnostics on for its first run on the manylinux_2_28 image). Only the - gcc C/C++ TUs cache (134 model files + ggml + httplib); the nvcc `.cu` kernels won't - (limited sccache nvcc support) — still a large partial win on the ~70 min job. -3. `crosscompile-linux-aarch64`, then 4. 
`crosscompile-android-aarch64`. -5. `crosscompile-android-aarch64-opencl` — **separate**, uses `build_opencl_android.sh` (not - `build.sh`); needs its own probe/launcher wiring. + 🚧 **first run in progress** (diagnostics on). Only the gcc C/C++ TUs cache (134 model files + + ggml + httplib); the nvcc `.cu` kernels won't (limited sccache nvcc support) — still a + large partial win on the ~70 min full-arch job; the fast single-arch (sm_120) validation path + cuts nvcc time independently of sccache. +3. `crosscompile-linux-aarch64` — ✅ **enabled** (same steady-state env; probe guards it). +4. `crosscompile-android-aarch64` — ✅ **enabled** (same steady-state env; probe guards it). +5. `crosscompile-android-aarch64-opencl` — ✅ **enabled**. `build_opencl_android.sh` stages the + OpenCL headers/loader, then delegates the jllama cmake build to `build.sh` via `exec` + (same pattern as `build_cuda_linux.sh`), so it inherits the probe and launcher automatically. Per-job recipe: add `env:` { `USE_CACHE`, `SCCACHE_WEBDAV_ENDPOINT`, `SCCACHE_WEBDAV_TOKEN` } and `DOCKCROSS_ARGS: "-e SCCACHE_WEBDAV_ENDPOINT -e SCCACHE_WEBDAV_TOKEN -e USE_CACHE"` — the From 346f2471404964e122d86f2686fdd02290d6a445 Mon Sep 17 00:00:00 2001 From: Claude Date: Sat, 20 Jun 2026 16:10:50 +0000 Subject: [PATCH 16/16] =?UTF-8?q?docs(TODO):=20add=20Windows=20sccache=20i?= =?UTF-8?q?tem=20=E2=80=94=20needs=20Ninja,=20evaluate=20dual-artifact?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Record the investigation outcome for caching the two Windows native build jobs (the only remaining uncached native builds): - Root cause: the Visual Studio generator ignores CMAKE__COMPILER_LAUNCHER (and ggml's GGML_CCACHE RULE_LAUNCH_COMPILE), so sccache can only cache under Ninja/Makefiles. - Upstream evidence: llama.cpp b9682 builds windows-cpu + windows-cuda with Ninja Multi-Config (+ ccache); the VS generator is only used by legacy jobs. 
- Chosen path: don't flip the working build blindly. Validate Ninja Multi-Config in a separate build, or ship two Windows artifacts (Ninja + MSVC) in parallel so end users can test both before committing — Windows build runs twice during the transition. - Implementation notes captured (sccache+Depot backend, build.bat generator wiring, files to touch, bounded risk via the publish gate). Co-Authored-By: Claude Opus 4.8 Claude-Session: https://claude.ai/code/session_01LjWiKSyNzqqpobSKYRiew5 --- TODO.md | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 50 insertions(+) diff --git a/TODO.md b/TODO.md index 0a9f1342..15233961 100644 --- a/TODO.md +++ b/TODO.md @@ -85,6 +85,56 @@ primary goal: agentic tool-calling with Qwen): - **Gemma 4 tool-calling validation.** Confirm the pinned llama.cpp (`b9682`) includes the Gemma 4 tool-call parser fixes; if not, bump per the upgrade procedure. +### Windows compiler cache (sccache) — deferred: needs Ninja; evaluate dual-artifact + +The two Windows native build jobs (`build-windows-x86_64`, `build-windows-x86`) are the **only +remaining uncached** native builds — the 3 macOS jobs and all 5 dockcross jobs now cache via +sccache + Depot. Windows is not yet wired up because of a hard CMake constraint, and the chosen +path is to validate it carefully rather than flip the working build in place. + +**Why the obvious fix doesn't work.** Our cache mechanism is the CMake *compiler launcher* +(`-DCMAKE_C_COMPILER_LAUNCHER=sccache`, set by `build.sh`). ggml has its own equivalent +(`GGML_CCACHE` → `RULE_LAUNCH_COMPILE`). **Both are honored only by the Ninja and Makefile +generators — the Visual Studio generator ignores them entirely.** Our Windows jobs use +`-G "Visual Studio 18 2026" -A x64|Win32`, so just adding `mozilla-actions/sccache-action` +caches nothing. (The CLAUDE.md "use sccache-action / MSVC support" note predates hitting this.) 
+ +**Upstream evidence (llama.cpp `b9682`, `.github/workflows/release.yml`).** ggml-org ships its +Windows artifacts with Ninja, not the VS generator: +- `windows-cpu` (the main CPU artifact, our analogue) — **Ninja Multi-Config** + clang toolchain + (`cmake/x64-windows-llvm.cmake`) + ccache. +- `windows-cuda` — **Ninja Multi-Config** + MSVC + ccache (proves Ninja Multi-Config + MSVC works + on the same llama.cpp + BoringSSL tree we build). +- `windows-sycl` — Ninja; `windows-hip` — Unix Makefiles; legacy `windows` + `windows-openvino` — + Visual Studio 17 2022. All jobs cache via `ggml-org/ccache-action@v1.2.21`. +- Important detail: it is **"Ninja Multi-Config"**, not plain Ninja — it keeps multi-config + semantics, so `cmake --build … --config Release` and our config-specific + `RUNTIME_OUTPUT_DIRECTORY_RELEASE` properties (`CMakeLists.txt:363-365`) behave exactly as they + do under the VS generator. The diff vs today is small: swap `-G`/`-A` for `-G "Ninja + Multi-Config"` + an MSVC env step (`vcvarsall` / `ilammy/msvc-dev-cmd`); `/MT` runtime and the + x64-vs-x86 arch gating are unchanged. + +**Chosen approach — do NOT switch the working build blindly.** Instead either (a) prove the Ninja +Multi-Config build in a **separate/experimental job first**, or preferably (b) **ship two Windows +artifacts in parallel — one Ninja-built, one MSVC(VS-generator)-built — so end users can test both** +and we can compare them before committing to one. That means the Windows native build runs **twice** +(once per generator) for a transition period; keep the MSVC/VS artifact as the trusted default and +add the Ninja one alongside until it's proven equivalent. Only after the Ninja artifact is validated +should we consider making it the sole Windows build (and retiring the second run). 
+ +**Implementation notes for when this is picked up:** +- Cache backend: prefer **sccache + Depot WebDAV** (consistent with the other 8 jobs — one token, + shared cross-branch) over upstream's ccache (GitHub per-branch cache, a second cache system). + sccache supports MSVC `cl.exe`; Release config emits no debug info, so the `/Zi`→`/Z7` PDB caveat + doesn't apply. +- `build.bat` needs a Ninja path: pass `-G "Ninja Multi-Config"` + `-DCMAKE_BUILD_TYPE` is *not* + needed (multi-config keeps `--config Release`); add an sccache presence/probe guard mirroring + `build.sh` so a missing/crashing sccache falls back to a green uncached build. +- Files to touch: `.github/workflows/publish.yml` (the two `build-windows-*` jobs — add the MSVC env + step, the cache action, and the second artifact), `.github/build.bat` (generator + launcher wiring). +- Risk is bounded: a broken Ninja build shows up as a red Windows job, and publishing is gated behind + `publish_to_central`, so no broken artifact can reach Central/GitHub Releases. + ### llama.cpp upstream feature exposure (queued, deferred by policy) These are JNI plumbing items for upstream API additions. Policy: add only after a real user request — they are mostly relevant to specific model families or specialized workflows.