NativeServer mode, GPU/CPU classifier matrix (ROCm/SYCL/OpenVINO/Win-arm64/s390x), llama.cpp b9870 + bump automation by bernardladenthin · Pull Request #293 · bernardladenthin/java-llama.cpp

bernardladenthin · 2026-07-03T12:31:52Z

Summary

Server

NativeServer: run the full upstream llama.cpp HTTP server (embedded WebUI) inside libjllama over JNI, as the default server mode. ServerLauncher fat-jar entry point dispatches between NativeServer (default) and OpenAiCompatServer (via --jllama-openai-compat).
CLI: OpenAiServerCli parses llama.cpp tuning flags (-b, -ub, -tb, -ctk, -ctv, --jinja, --chat-template-kwargs) with JSON validation for chat-template kwargs.

Native classifier matrix expansion — new GPU backends, each wired with the same 5-place pattern (CMakeLists routing → build job in package.needs, fail-loud → pom profile → README row → git-ignored resources_* tree), all build-only (runners have no matching GPU), no vendor runtime bundled:

Vulkan Linux: vulkan-linux-x86-64, vulkan-linux-aarch64 (vendor-neutral NVIDIA/AMD/Intel)
AMD ROCm/HIP: rocm-linux-x86-64, rocm-windows-x86-64
Intel SYCL (oneAPI): sycl-fp16-linux-x86-64, sycl-fp32-linux-x86-64, sycl-windows-x86-64
Intel OpenVINO: openvino-linux-x86-64, openvino-windows-x86-64 (2026.2.1 archive + OpenCL via apt/vcpkg)
Windows-on-ARM OpenCL (Adreno): opencl-windows-aarch64

New default-jar CPU platforms

Windows arm64 (Snapdragon X; clang-cl, GGML_OPENMP=OFF)
Linux s390x (IBM Z, big-endian): cross-compiled + the full 462-test C++ suite run under qemu-user as a real big-endian correctness gate (WAV writer, JSON/token/embedding transforms, JNI helpers). OSInfo maps os.arch=s390x → Linux/s390x.

llama.cpp upgrade b9864 → b9870

New completion params surfaced through the Java layer: per-request sse_ping_interval, model ftype (quantization) on /v1/models, plus 8 audited withers (xtcProbability/Threshold, nDiscard, nIndent, tMaxPredictMs, postSamplingProbs, timingsPerToken, returnTokens).
Native patches: embedded-mode server support (signal-handler suppression, forwarded-argv parse, out-of-band shutdown) and recurrent/hybrid-model near-prompt-end checkpoint handling.

Build / CI / tooling

Version-bump automation: .github/scripts/llama-next-version.sh (diff-size chunking against a cached mirror) + docs/upgrade/llama-cpp-version-bump.md runbook, linked from CLAUDE.md.
Per-job sccache statistics table in the GitHub job summary.
CI stabilization across the new jobs (SpotBugs, ArchUnit, clang-format, REUSE).

Test plan

C++ unit suite (462 tests) green locally on x86-64 and on big-endian s390x under qemu
Affected Java unit tests pass locally (OpenAiServerCliTest, ServerLauncherTest, NativeServerSmokeTest, InferenceParametersTest, ModelParametersExtendedTest)
All 8 new GPU classifiers + Windows-arm64 + s390x build green in CI
Patches re-verified against b9870 via the fail-loud cmake PATCH_COMMAND
README and CLAUDE.md updated (server modes, full classifier table, version, s390x gate, bump runbook)

Related issues / PRs

Implements the native server mode + WebUI embedding and the GPU/CPU classifier matrix expansion described in CLAUDE.md and TODO.md.

Checklist

I have read CONTRIBUTING.md and CODE_OF_CONDUCT.md
My commits follow Conventional Commits
No security-sensitive changes

https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7

…s (Granite-4)

Recurrent/hybrid models (granitehybrid, Mamba, Jamba) can only roll a slot back to a saved context checkpoint. In upstream b9859 the near-prompt-end checkpoints are gated by checkpoint_min_step (default 8192 tokens) and new checkpoints are otherwise only created at user-message boundaries. An agentic tool-calling conversation appends only assistant/tool messages after turn 1, so no new checkpoint is ever created and every turn re-prefills the whole conversation tail.

Measured on a synthetic granitehybrid model (llama-server, 6-turn tool loop, ~643 new tokens/turn): prefilled tokens per turn grew 901 -> 1544 -> 2187 -> 2830 -> 3473 unpatched, i.e. quadratic total prefill.

patches/0005 (upstream-submittable, server-context.cpp):
- exempt near-prompt-end checkpoints from the min-step spacing when the memory can only roll back via checkpoints (seq-rm type FULL or RS); SWA-only models are unaffected
- never create a checkpoint at the same position as the newest one (the last-user-message checkpoint was re-created identically every turn, flooding the 32-entry checkpoint list)

With the patch the same loop prefills a constant 647 tokens/turn (each turn restores the previous turn's near-end checkpoint): 5.4x less prefill at turn 6, growing with conversation length. Outputs verified byte-identical to unpatched at temperature=0.

ModelParameters gains setCtxCheckpoints(int) / setCheckpointMinStep(int) (--ctx-checkpoints / --checkpoint-min-step, both LLAMA_EXAMPLE_SERVER scope, reach the embedded server through common_params_parse) so callers can tune checkpoint density/RAM from Java. +2 unit tests (144 pass), javadoc clean, spotless applied.

Complements open upstream PRs #24035/#24899/#24891 (checkpoint invalidation/retention); this fixes checkpoint starvation. Drop the patch once upstream lands role-boundary checkpoint placement.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
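The prefill figures quoted in this commit message can be checked with a little arithmetic. The sketch below is illustrative only: the class and method names are hypothetical, and it assumes the unpatched series grows by exactly 643 tokens per measured turn, which matches the five values listed.

```java
import java.util.stream.IntStream;

/** Worked arithmetic for the quoted prefill figures (illustrative, not project code). */
final class PrefillGrowthSketch {
    // Figures taken from the commit message.
    static final int FIRST_MEASURED_PREFILL = 901; // first value of the unpatched series
    static final int NEW_TOKENS_PER_TURN = 643;    // growth between consecutive values
    static final int PATCHED_PREFILL = 647;        // constant per-turn prefill with the patch

    /** Unpatched prefill for the k-th measured turn (k = 0 is the 901-token turn). */
    static int unpatchedPrefill(int k) {
        return FIRST_MEASURED_PREFILL + k * NEW_TOKENS_PER_TURN;
    }

    /** Total unpatched prefill over the first n measured turns (an arithmetic series). */
    static int unpatchedTotal(int n) {
        return IntStream.range(0, n).map(PrefillGrowthSketch::unpatchedPrefill).sum();
    }

    /** Total patched prefill over n turns (linear in n). */
    static int patchedTotal(int n) {
        return n * PATCHED_PREFILL;
    }

    /** Ratio of unpatched to patched prefill at the last of n measured turns. */
    static double lastTurnRatio(int n) {
        return (double) unpatchedPrefill(n - 1) / PATCHED_PREFILL;
    }
}
```

With five measured turns, `unpatchedPrefill(4)` reproduces the 3473-token figure and `lastTurnRatio(5)` is 3473 / 647, which rounds to the quoted 5.4x. The per-turn cost grows linearly, so the total grows quadratically with the number of turns, while the patched total stays linear.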

Pure readability refactor of the checkpoint-starvation fix, no behavior change. The two compound `do_checkpoint = do_checkpoint && (empty || ...)` assignments are lifted into named locals so the final gate reads:

    do_checkpoint = do_checkpoint && checkpoint_well_spaced && checkpoint_not_duplicate;

- checkpoint_well_spaced: the min-step spacing test with the last-user-message and near-prompt-end (checkpoint-only-rollback) exemptions
- checkpoint_not_duplicate: the same-position dedup guard

Each named bool keeps the leading `checkpoints.empty() ||` so the `checkpoints.back()` access stays short-circuit-guarded (identical semantics to the previous inlined `&&`-chains). Compiles clean; patch re-verified to apply and reverse-check (idempotence) against pristine b9859 via the same `git apply` path the CMake applier uses.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
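The shape of that gate can be sketched in Java for readers who do not have the C++ patch at hand. This is an approximation under stated assumptions: the real code lives in server-context.cpp, the exact spacing comparison and exemption conditions are not shown in the commit message, and every name below (including the single `spacingExempt` flag standing in for both exemptions) is hypothetical.

```java
import java.util.List;

/** Java rendering of the named-boolean gate described above (hypothetical names). */
final class CheckpointGateSketch {
    /**
     * @param doCheckpoint        the decision made so far by earlier conditions
     * @param checkpointPositions positions of existing checkpoints, oldest first
     * @param pos                 position of the candidate checkpoint
     * @param minStep             minimum spacing between checkpoints
     * @param spacingExempt       assumed stand-in for the last-user-message and
     *                            near-prompt-end exemptions
     */
    static boolean shouldCheckpoint(
            boolean doCheckpoint,
            List<Integer> checkpointPositions,
            int pos,
            int minStep,
            boolean spacingExempt) {
        boolean empty = checkpointPositions.isEmpty();

        // Each named boolean starts with "empty ||" so the access to the newest
        // checkpoint is only evaluated when the list is non-empty.
        boolean wellSpaced = empty
                || spacingExempt
                || pos - checkpointPositions.get(checkpointPositions.size() - 1) >= minStep;
        boolean notDuplicate = empty
                || checkpointPositions.get(checkpointPositions.size() - 1) != pos;

        return doCheckpoint && wellSpaced && notDuplicate;
    }
}
```

Naming the two conditions keeps the final `&&` chain readable while preserving the short-circuit guard on the "newest checkpoint" access, which is the point the commit message makes.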

…enAiServerCli

The fat-jar launcher (OpenAiCompatServer.main) parses args via OpenAiServerCli, which only understood a subset of flags and threw on anything else. Extend it with the seven tuning flags that a llama-server.exe user needs so the bundled `java -jar …-jar-with-dependencies.jar` covers a full invocation without any custom Java:

    -b/--batch-size, -ub/--ubatch-size     -> ModelParameters.setBatchSize/setUbatchSize
    -tb/--threads-batch                    -> setThreadsBatch
    -ctk/--cache-type-k, -ctv/--cache-type-v -> setCacheTypeK/V (case-insensitive CacheType lookup; unknown -> error)
    --jinja                                -> enableJinja
    --chat-template-kwargs <json>          -> setChatTemplateKwargs

--chat-template-kwargs is parsed here (Jackson, already a server-package dep) into the raw-per-value map setChatTemplateKwargs expects, so a malformed object fails fast with usage text instead of at native model load. All setters already existed; the ints/CacheType/kwargs use 0/null "leave the default" sentinels mirroring the existing ctx/threads/parallel handling.

+13 unit tests (30 pass total); usage() and README flag list updated; javadoc and spotless clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
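The case-insensitive cache-type lookup with a null "leave the default" sentinel might look roughly like the following. This is a minimal sketch, not the project's code: the enum is a hypothetical stand-in for the real CacheType (whose constants may differ), and the error message wording is invented.

```java
import java.util.Locale;

/** Sketch of a case-insensitive -ctk/-ctv lookup with a null "unset" sentinel. */
final class CacheTypeLookupSketch {
    /** Hypothetical stand-in for the project's CacheType enum. */
    enum CacheTypeSketch { F32, F16, Q8_0, Q4_0 }

    /**
     * Returns null when the flag was not given, meaning "leave the default".
     * Throws IllegalArgumentException for an unknown name so the CLI can fail
     * fast instead of deferring the error to native model load.
     */
    static CacheTypeSketch parse(String raw) {
        if (raw == null) {
            return null; // flag absent: keep the native default
        }
        try {
            // Locale.ROOT avoids locale-specific upper-casing surprises.
            return CacheTypeSketch.valueOf(raw.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            throw new IllegalArgumentException("unknown cache type: " + raw, e);
        }
    }
}
```

Under these assumptions `parse("q8_0")` and `parse("Q8_0")` both resolve to the same constant, and a caller only invokes the corresponding setter when the result is non-null.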

…LL via JNI

Second server mode alongside the Java-transport OpenAiCompatServer: NativeServer runs the *full* upstream llama.cpp HTTP server (embedded WebUI included) inside libjllama over JNI, with no separate llama-server executable. It forwards the raw llama-server argv verbatim, so every flag works exactly as for the standalone binary (no per-flag Java mapping).

How: b9859 already exposes `int llama_server(int, char**)` (no main() in server.cpp). patches/0006 makes it embeddable:
- skips installing process-wide SIGINT/SIGTERM handlers when embedded (they would hijack the JVM)
- parses the forwarded argv via common_params_parse instead of common_params_parse_main (whose GetCommandLineW recovery would grab java.exe's command line, the Windows bug class 0001 fixes)
- adds llama_server_request_shutdown() for out-of-band stop (ctx_server is loop-local)

native_server.cpp's JNI bridge runs llama_server on a worker thread; start/stop/isRunning map to the three native methods.

CMake: server.cpp + server-tools.cpp are now compiled in (non-Android: both pull subprocess.h/posix_spawn_*, so they share server-models.cpp's guard), plus native_server.cpp.

NativeServer is an independent lifecycle (loads its own model from the argv, like llama-server.exe), single-instance per process (upstream keeps shutdown state in file-scope globals), and unavailable on Android. Reusing an already-loaded LlamaModel's context is a documented TODO. libjllama loads lazily in start(), so construction/arg-parsing/close stay pure-Java unit-testable.

Verified end-to-end on Linux x86_64 with a synthetic granitehybrid model: server starts, GET /health -> 200 {"status":"ok"}, /v1/models and /props served, / is the native WebUI route (404 locally with the empty-asset stub; serves index.html in released jars that bake in webui-generated assets), close() shuts down cleanly. 7 pure-Java NativeServer tests + javadoc + spotless + clang-format(22.1.5) clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7

…CompatServer)

Two runnable server mains now exist. The fat jar's default Main-Class becomes NativeServer, so `java -jar …-jar-with-dependencies.jar -m model.gguf --port 8080` runs the full native llama.cpp server with its embedded WebUI, forwarding every argument. OpenAiCompatServer is unchanged and still runnable via `java -cp <jar> net.ladenthin.llama.server.OpenAiCompatServer …`.

- NativeServer.main(args): forwards argv, starts the server, registers a JVM shutdown hook (the embedded server installs no signal handlers of its own, see patches/0006, so the hook is what stops it cleanly on Ctrl-C/SIGTERM), and blocks until the native worker exits.
- llama/pom.xml assembly profile: Main-Class OpenAiCompatServer -> NativeServer.
- README + CLAUDE.md: document the two modes and how to select each.

Verified end-to-end (Linux x86_64, synthetic granitehybrid): `java -cp … NativeServer -m model --port 8972` serves /health=ok after load; SIGTERM to the JVM fires the shutdown hook -> clean "cleaning up before exit" -> port down. Javadoc + spotless clean; 7 pure-Java NativeServer tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7

…ai-compat

Tiny dispatcher as the fat jar's Main-Class: with --open-ai-compat it runs OpenAiCompatServer (Java-transport OpenAI API), without it the default NativeServer (full native llama.cpp server + WebUI). The --open-ai-compat marker is stripped (it is not a llama.cpp flag); all other args are forwarded verbatim to the chosen server. Both underlying mains stay runnable directly by class name via java -cp.

Note: the two servers accept different flag sets (NativeServer forwards every llama-server flag, OpenAiCompatServer's CLI accepts a curated subset and rejects unknown flags), so native-only flags can't be combined with --open-ai-compat.

Dispatch logic split into pure static helpers (selectsOpenAiCompat / withoutFlag) with 7 unit tests. Verified at runtime: `ServerLauncher --open-ai-compat -m model --port 8973` starts the Java server (/ -> invalid_request_error, its handler), and without the flag starts NativeServer (/ -> native File Not Found); both shut down cleanly on SIGTERM. pom Main-Class NativeServer -> ServerLauncher; README + CLAUDE.md updated. spotless + javadoc clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7

…g to --openai-compat

Per review: collapse the dispatch to a single pure helper. withoutFlag(args, flag) strips the selector and returns a (possibly shorter) array; main() selects the mode purely by whether that shortened the list (present iff result is smaller), so the separate selectsOpenAiCompat method + its baked-in constant are gone. The helper takes the flag as a parameter, so it is general and testable independent of the flag's meaning.

Also rename the selector --open-ai-compat -> --openai-compat ("OpenAI" is one word, matching the brand and the codebase's oaicompat / OpenAiCompatServer); constant OPEN_AI_COMPAT_FLAG -> OPENAI_COMPAT_FLAG.

Tests rewritten around withoutFlag: the length-change selection signal (shorter iff present, same length iff absent, position-independent) plus stripping behaviour (strips all occurrences, preserves order, no-op when absent, empty). 7 pass. Verified at runtime: `ServerLauncher --openai-compat -m model --port 8974` routes to OpenAiCompatServer (/ -> invalid_request_error) and shuts down cleanly; no-flag routes to NativeServer. README + CLAUDE.md + pom updated; spotless + javadoc clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
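A minimal sketch of the single-helper dispatch described above, assuming only what the commit message states: the helper strips every occurrence of the selector, preserves order, and the mode is chosen by whether the array got shorter. The class name, the `selectMode` helper, and the returned mode strings are hypothetical; the real ServerLauncher calls the two server mains instead of returning a label.

```java
import java.util.Arrays;

/** Sketch of flag-stripping dispatch; not the project's ServerLauncher. */
final class LauncherDispatchSketch {
    /** Returns args without any occurrence of flag, preserving the order of the rest. */
    static String[] withoutFlag(String[] args, String flag) {
        return Arrays.stream(args)
                .filter(arg -> !flag.equals(arg))
                .toArray(String[]::new);
    }

    /** The selector was present if and only if stripping shortened the array. */
    static String selectMode(String[] args, String flag) {
        String[] forwarded = withoutFlag(args, flag);
        return forwarded.length < args.length ? "openai-compat" : "native";
    }
}
```

For example, `withoutFlag(new String[] {"-m", "model.gguf", "--openai-compat"}, "--openai-compat")` yields `{"-m", "model.gguf"}`, and because that result is shorter than the input the launcher would route to the OpenAI-compatible server with the remaining arguments.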

Rename the fat-jar dispatch selector from --openai-compat to --jllama-openai-compat so it can never collide with a current or future llama.cpp / llama-server flag: upstream owns the --* space, this launcher owns --jllama-*. The jllama prefix is the project's native-library name, which upstream will never use, and it stays a lowercase-hyphen CLI token (not the verbose FQN, not the class name). ServerLauncher strips it before forwarding, so it never reaches llama_server (which rejects unknown flags).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7

Additive-only upgrade: no incompatibilities, no project source changes. Bumps GIT_TAG (and the TTS provenance banner), the README badge/link, and the CLAUDE.md pinned-version line + build examples.

The b9859..b9862 diff touches two patch-target files (server-context.cpp/.h, for the new model_ftype field + get_meta/get_model_info additions). Patches 0002/0003/0005 were applied in sequence against the actual b9862 server-context.{cpp,h} and all apply cleanly (their regions are disjoint from and far from the additions). Patches 0001/0004/0006 target files not in the changed-file list (common/arg.*, server-common.cpp, test-chat.cpp, server.cpp, the ~34 standalone mains) and the OuteTTS generator anchors (tts.cpp) are unchanged, so they apply byte-identically.

New upstream features surfaced by the bump (documented in the breaking-changes history):
- Additive C API llama_ftype_name / llama_model_ftype; the native server now emits a model 'ftype' in get_model_info(), so NativeServer mode surfaces the quant type automatically. Optional follow-up: bind it into LlamaModel and the Java OpenAiCompatServer propsJson.
- CUDA gated-delta-net fused snapshot-copy path (decode win for gated-delta / hybrid-recurrent models on the cuda13 classifiers).
- Vendored cpp-httplib bumped to v0.49.0 (locale-independent ASCII classifiers, additive MultipartFormDataWriter, base64 UB fix); internal to the compiled server transport, no bound symbol.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7

Wires the new b9862 llama_ftype_name / llama_model_ftype quant-type info (exposed by server_context_meta::model_ftype) up through JNI to Java:
- jllama.cpp getModelMetaJson now emits "ftype" from server_context_meta.
- ModelMeta.getFtype() and the convenience LlamaModel.getModelFtype() expose the quant label (e.g. "Q4_K - Medium"; a guessed type is prefixed with "(guessed) "), empty when the native layer does not report it.
- OpenAiCompatServer advertises it as data[].ftype in GET /v1/models, matching the upstream server's get_model_info() key. The value is threaded through OpenAiServerConfig.modelFtype (new field/builder/getter) from the loaded model, mirroring how supportsVision is threaded, keeping the "models built from config alone" invariant. The field is omitted when unknown/blank.

Tests: +2 ModelMeta, +1 OpenAiServerConfig, +1 OpenAiSseFormatter (ftype present/omitted). Verified end-to-end: full native jllama build against b9862 with all six patches applied (100% link), model-free load smoke test green, and 63 model-free Java unit tests pass; clang-format 22.1.5 + spotless + javadoc all clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
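The "omitted when unknown/blank" rule for data[].ftype can be illustrated with a small map-building sketch. It is an assumption-laden illustration: the class and method are hypothetical, the real server builds its response through OpenAiServerConfig and its own formatter, and only the `id`/`object` keys of the standard OpenAI model object are shown alongside `ftype`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch of one data[] entry for GET /v1/models with an optional ftype key. */
final class ModelEntrySketch {
    static Map<String, Object> modelEntry(String id, String ftype) {
        Map<String, Object> entry = new LinkedHashMap<>(); // keeps key order stable
        entry.put("id", id);
        entry.put("object", "model");
        // Only advertise the quantization label when the native layer reported one.
        if (ftype != null && !ftype.trim().isEmpty()) {
            entry.put("ftype", ftype);
        }
        return entry;
    }
}
```

Calling `modelEntry("model.gguf", "Q4_K - Medium")` would include the `ftype` key, while `modelEntry("model.gguf", "")` would leave it out entirely rather than emit an empty string.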

Extends the artifact matrix toward upstream llama.cpp's release set with three new native builds (all wired for CI; the user runs CI to validate on the GPU/arm runners this environment lacks):

1. vulkan-linux-x86-64: Linux x86_64 Vulkan classifier JAR
2. vulkan-linux-aarch64: Linux aarch64 Vulkan classifier JAR
3. Windows arm64 CPU: folded into the DEFAULT JAR (no classifier)

Linux Vulkan (vendor-neutral GPU jar, no CUDA toolkit), the intersection of the existing Vulkan-Windows and CUDA-Linux wiring:
- CMakeLists: the elseif(GGML_VULKAN) branch is now OS-aware like GGML_CUDA (Windows -> resources_windows_vulkan, else resources_linux_vulkan/.../Linux/${OS_ARCH}); one tree holds both arches.
- pom.xml: profiles vulkan-linux / vulkan-linux-aarch64, both reading the shared resources_linux_vulkan tree with an arch-scoped resource-copy <includes> (Linux/x86_64 vs Linux/aarch64), so each classifier JAR carries only its arch. Verified locally with staged dummy natives: each jar contains exactly one libjllama.so for its arch.
- publish.yml: build-linux-x86_64-vulkan (native ubuntu-latest) + build-linux-aarch64-vulkan (ubuntu-24.04-arm, GCC 14); both apt-install the Vulkan SDK, build -DGGML_VULKAN=ON -DGGML_NATIVE=OFF, build-only (GPU-less runners). Artifacts merge into one resources_linux_vulkan tree in package/publish; profiles added to the three -P lists.
- .gitignore: ignore resources_linux_vulkan (also fixed the pre-existing resources_cuda_linux -> resources_linux_cuda typo).

Windows arm64 CPU (default JAR):
- build-windows-arm64 on the free windows-11-arm runner (msvc-dev-cmd arch:arm64, Ninja Multi-Config, -DOS_ARCH=aarch64, build + ctest), emitting to the canonical resources/.../Windows/aarch64 and uploading Windows-aarch64-libraries, which the *-libraries glob merges into the default tree. No Java change: OSInfo already maps a Windows-on-ARM JVM (os.arch=aarch64) to Windows/aarch64. Matches the existing Windows CPU jobs (committed jllama.h + bundled JNI headers, so no mvn compile / setup-java needed).

All three added to package.needs. Runtime GPU libs are never bundled (driver supplies libvulkan.so.1), the same policy as every GPU classifier.

Local verification (CI does the real GPU/arm builds): CMake configures clean and the CPU branch still routes correctly; pom.xml is well-formed and both new profiles are recognized and activate; the per-arch classifier split was proven by packaging staged dummy natives; the workflow YAML parses (40 jobs) with all needs resolving and all -P lists updated. README classifier table + snippets and CLAUDE.md document the additions.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7

KISS analogue of upstream llama.cpp's ccache-action "CCache Statistics" table, but for this repo's sccache-over-Depot cache (ccache-action can't be dropped in: it manages its own ccache + actions/cache backend, conflicting with the Depot WebDAV design).

build.sh/build.bat already print `sccache --show-stats` to the log; now, when running in CI (GITHUB_STEP_SUMMARY set) and sccache was actually the launcher, they also parse those stats and append a small markdown table:

    ### sccache statistics
    | Cache hits | Requests | Hit rate |
    |------------|----------|----------|
    | 589        | 600      | 98.2%    |

Per-job (GitHub does not merge job summaries), covering every native build job uniformly: build.sh handles the dockcross/native-Linux/aarch64/vulkan-linux and macOS jobs; build.bat the Windows jobs. Parses the text stats (top-level "Compile requests" = total, top-level "Cache hits" = hits; the per-language "Cache hits (C/C++)" line is skipped by the digit-anchored regex). Best-effort: skipped silently if the numbers can't be parsed or there were no requests, and never emitted for local runs (no GITHUB_STEP_SUMMARY), so local builds are untouched.

Verified the build.sh parse end-to-end against a realistic sccache --show-stats sample (req=600, hits=589 -> 98.2%, with the "Compile requests executed" line correctly excluded); build.sh passes bash -n. The Windows batch path (integer math with rounding, escaped pipes/parens) is validated by CI on the Windows runners.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
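The real parsing lives in build.sh and build.bat, which are not shown here. The Java sketch below only illustrates the digit-anchored matching idea described above: a top-level line counts only if the label is followed directly by whitespace and digits, so "Compile requests executed" and "Cache hits (C/C++)" do not match. The class name, regexes, and sample layout are assumptions.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch of a best-effort sccache stats summary; not the project's shell code. */
final class SccacheSummarySketch {
    // Label, then spaces/tabs, then digits to end of line: longer labels cannot match.
    private static final Pattern REQUESTS =
            Pattern.compile("(?m)^Compile requests[ \\t]+(\\d+)[ \\t]*$");
    private static final Pattern HITS =
            Pattern.compile("(?m)^Cache hits[ \\t]+(\\d+)[ \\t]*$");

    /** Returns the markdown table, or empty if the stats cannot be parsed or are all zero. */
    static Optional<String> summaryTable(String stats) {
        Matcher requestsMatcher = REQUESTS.matcher(stats);
        Matcher hitsMatcher = HITS.matcher(stats);
        if (!requestsMatcher.find() || !hitsMatcher.find()) {
            return Optional.empty(); // best-effort: skip silently
        }
        long requests = Long.parseLong(requestsMatcher.group(1));
        long hits = Long.parseLong(hitsMatcher.group(1));
        if (requests == 0) {
            return Optional.empty(); // avoid dividing by zero on an idle cache
        }
        String rate = String.format(Locale.ROOT, "%.1f%%", 100.0 * hits / requests);
        return Optional.of(
                "### sccache statistics\n"
                + "| Cache hits | Requests | Hit rate |\n"
                + "|------------|----------|----------|\n"
                + "| " + hits + " | " + requests + " | " + rate + " |\n");
    }
}
```

Fed a sample with 600 compile requests and 589 cache hits, this produces a row reading `| 589 | 600 | 98.2% |`, matching the figures quoted in the commit message.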

Additive-only upgrade: no incompatibilities, no project source changes. Bumps GIT_TAG (and the TTS provenance banner), the README badge/link, and the CLAUDE.md pinned-version line + build examples.

The b9862..b9864 diff is almost entirely the Svelte WebUI (tools/ui/**, which auto-follows the pinned GIT_TAG via the build-webui CI job) plus one small server change: a new per-request sse_ping_interval in the completion API (task_params field + make_llama_cmpl_schema field + handle_completions_impl capture). It's inside upstream-compiled server TUs the project already links; NativeServer mode gets it for free, and the project binds no new symbol.

Patch verification: the diff touches exactly one patch-target file (server-context.cpp, only in handle_completions_impl ~L4089, far below every patched region). Patches 0002/0003/0005 were applied in sequence against the actual b9864 server-context.{cpp,h}, all clean; server-context.h is unchanged, and server-schema.cpp/server-task.h are not patch targets. Patches 0001/0004/0006 target files not in the changed-file list, so they apply unchanged. Confirmed end-to-end by a clean cmake configure: b9864 fetched and all six patches applied via the fail-loud PATCH_COMMAND (exit 0), OuteTTS generator anchors held.

Optional future work (documented in the breaking-changes history): expose sse_ping_interval on the Java InferenceParameters; it would flow through the OAI-compat completion path via eval_llama_cmpl_schema like any other field.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7

…meters

Wires the b9864 per-request sse_ping_interval into the Java API and, from a completion-schema audit, adds the other already-parseable-but-unexposed plain scalars so callers of the OAI-compat completion path can set them.

New withers (each emits a JSON key honored by eval_llama_cmpl_schema):
- withSsePingInterval(int) -> sse_ping_interval (b9864; -1 disables pings)
- withXtcProbability(float) / withXtcThreshold(float) -> XTC sampler
- withNDiscard(int) -> n_discard (context-shift discard)
- withNIndent(int) -> n_indent (infill indentation)
- withTMaxPredictMs(int) -> t_max_predict_ms (generation time budget)
- withPostSamplingProbs(boolean) -> post_sampling_probs
- withTimingsPerToken(boolean) -> timings_per_token
- withReturnTokens(boolean) -> return_tokens

Audit method: extracted every field name from b9864's make_llama_cmpl_schema and diffed against the InferenceParameters keys. t_max_prompt_ms was deliberately skipped (commented out upstream, so not parseable). The remaining unexposed fields are OAI aliases already covered (max_tokens/max_completion_tokens -> n_predict) or OAI/server-internal / array-shaped / advanced knobs (n, logprobs, echo, verbose, include_usage, return_progress, response_fields, lora, grammar_lazy/grammar_triggers/preserved_tokens, chat_format, parse_tool_calls, reasoning_control, backend_sampling, adaptive_*), left out on purpose and documented in the breaking-changes history.

Tests: +2 Java withers tests (InferenceParametersTest -> 104 pass) and +3 C++ schema round-trip guards in test_server.cpp pinning that the native parser honors sse_ping_interval (round-trip, -1 disables, below-hard-limit throws, absent inherits the server default) -> full C++ suite 462 pass (was 459). javadoc (llama module) + spotless + clang-format all clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
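How a wither maps to a JSON key can be sketched with a tiny stand-in class. This is not the project's InferenceParameters: the class below is hypothetical, it assumes an immutable copy-on-write design (the real class may mutate and return `this`), and it hand-rolls JSON for three of the listed scalars only to stay dependency-free. The JSON key names are the ones given in the commit message.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical stand-in showing withers that each emit one JSON key. */
final class InferenceParamsSketch {
    private final Map<String, String> rawJsonByKey; // key -> already-serialized JSON value

    InferenceParamsSketch() {
        this(new LinkedHashMap<>());
    }

    private InferenceParamsSketch(Map<String, String> rawJsonByKey) {
        this.rawJsonByKey = rawJsonByKey;
    }

    private InferenceParamsSketch with(String key, String rawJson) {
        Map<String, String> copy = new LinkedHashMap<>(rawJsonByKey);
        copy.put(key, rawJson);
        return new InferenceParamsSketch(copy); // the original instance stays unchanged
    }

    InferenceParamsSketch withSsePingInterval(int value) {
        return with("sse_ping_interval", Integer.toString(value)); // -1 disables pings
    }

    InferenceParamsSketch withNDiscard(int value) {
        return with("n_discard", Integer.toString(value));
    }

    InferenceParamsSketch withReturnTokens(boolean value) {
        return with("return_tokens", Boolean.toString(value));
    }

    /** Serializes the collected keys as a flat JSON object in insertion order. */
    String toJson() {
        return rawJsonByKey.entrySet().stream()
                .map(e -> "\"" + e.getKey() + "\":" + e.getValue())
                .collect(Collectors.joining(",", "{", "}"));
    }
}
```

Under these assumptions, `new InferenceParamsSketch().withSsePingInterval(-1).withReturnTokens(true).toJson()` yields `{"sse_ping_interval":-1,"return_tokens":true}`, which is the shape of request body the native schema parser would then evaluate.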

…arm64

Fixes three PR #293 CI failures.

1. SpotBugs (spotbugs-exclude.xml): the branch's new server classes tripped 14 fb-contrib/findsecbugs findings not yet covered by the exclude file. All are false-positives or intentional tradeoffs mirroring suppressions the project already applies elsewhere. Added six method-scoped Match blocks with rationale:
   - NativeServer: IMC_IMMATURE_CLASS_NO_EQUALS, MDM_THREAD_YIELD (main() poll), UVA_USE_VAR_ARGS (native JNI method), WEM_WEAK_EXCEPTION_MESSAGING.
   - ServerLauncher.main: THROWS_METHOD_THROWS_CLAUSE_BASIC_EXCEPTION.
   - OpenAiServerCli parse/usage/parseChatTemplateKwargs: ENMI_NULL_ENUM_VALUE (@nullable CacheType unset sentinel), POTENTIAL_XML_INJECTION + PRMC (plain-text usage() help, no XML), EXS_EXCEPTION_SOFTENING (Jackson->CLI error, cause chained), PSC_PRESIZE_COLLECTIONS.
   - OpenAiServerCli$Options.getChatTemplateKwargs: EI_EXPOSE_REP (returns an already-unmodifiable map).
   Verified: mvn spotbugs:check -> BugInstance size 0, BUILD SUCCESS.

2. Linux Vulkan (publish.yml): both build-linux-*-vulkan jobs failed find_package(SPIRV-Headers CONFIG REQUIRED) because the apt install omitted it. Added spirv-headers to both apt lines (exact parity with upstream llama.cpp's build-vulkan.yml: 'glslc libvulkan-dev spirv-headers').

3. Windows arm64 (publish.yml + CLAUDE.md): ggml aborts 'MSVC is not supported for ARM, use clang'. The generator (Ninja) was never the issue; the compiler was. Switched the job to clang-cl (-DCMAKE_C_COMPILER=clang-cl -DCMAKE_CXX_COMPILER=clang-cl): it satisfies ggml's guard (if (MSVC AND NOT CMAKE_C_COMPILER_ID STREQUAL "Clang")) while keeping CMake's MSVC=TRUE, so our static /MT CRT block still applies and the runner + Ninja + ctest all stay. msvc-dev-cmd (arm64) supplies the MSVC headers/libs. First-run watch item: clang-cl must be on PATH (VS Clang component / LLVM); if CI reports it missing, add an LLVM setup step.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7

Two LlamaArchitectureTest rules failed on PR #293 because of this branch's server additions:

- layeredArchitecture (12 violations): the branch adds two legitimate new edges out of the Server layer -- Server -> Args (OpenAiServerCli maps -ctk/-ctv to the args.CacheType enum) and Server -> Loader (NativeServer.start() calls LlamaLoader.initialize() before launching the embedded native server). The rule documents itself as the EXACT set of accessors today, to be updated when a new dependency is intended, so Server is added to the Loader and Args mayOnlyBeAccessedByLayers lists (+ a doc note). Server remains the only layer allowed to reach the Api root and stays mayNotBeAccessedByAnyLayer.
- noThreadSleep (1 violation): NativeServer.main() kept the JVM alive with a while(isRunning()) Thread.sleep(200) poll loop. The rule bans Thread.sleep and has no suppression seam (it prefers Condition.await/poll), so main() now blocks on a bounded CountDownLatch.await(200ms) signalled by the shutdown hook. This is also a behavioural improvement: Ctrl-C/SIGTERM wakes the wait immediately instead of after up to a 200 ms tick, while the timeout still re-checks isRunning() to catch a self-terminated native worker.

Verified: LlamaArchitectureTest 12/12 pass; server-package tests 44/44 pass (ServerLauncher, OpenAiServerCli, NativeServerSmoke); javadoc + spotless clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
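The bounded-latch wait that replaces the sleep loop can be sketched as below. It is a minimal illustration, assuming the shutdown hook counts the latch down and that `isRunning` reports whether the native worker is still alive; the class name, method name, and interrupt handling are choices made for this sketch, not taken from NativeServer.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/** Sketch of blocking main() without Thread.sleep; not the project's NativeServer. */
final class BlockUntilStoppedSketch {
    /**
     * Blocks while the worker is running. Returns as soon as the shutdown signal
     * is counted down, or within one 200 ms tick of the worker stopping by itself.
     */
    static void awaitStop(CountDownLatch shutdownSignal, BooleanSupplier isRunning) {
        try {
            while (isRunning.getAsBoolean()) {
                // await returns true immediately once the latch reaches zero,
                // so a shutdown hook wakes this loop without waiting out the tick.
                if (shutdownSignal.await(200, TimeUnit.MILLISECONDS)) {
                    return;
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for callers
        }
    }
}
```

A caller would typically register something like `Runtime.getRuntime().addShutdownHook(new Thread(latch::countDown))` (plus the actual server stop) before invoking `awaitStop(latch, server::isRunning)`, where `server` is whatever object exposes the running state.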

The clang-cl fix worked (jllama.dll linked for Windows/aarch64), but the next step failed at test discovery: gtest_discover_tests could not launch jllama_test.exe -> exit 0xc0000135 (STATUS_DLL_NOT_FOUND).

Root cause: with clang-cl, ggml links LLVM's OpenMP runtime (libomp.lib -> needs libomp140.aarch64.dll at run time). Unlike MSVC's ambient vcomp140.dll on x64, that DLL is not on PATH, so neither the test exe nor a consumer could load the binary. (Upstream llama.cpp works around this by copying libomp140.aarch64.dll next to its arm64 output.)

Fix: pass -DGGML_OPENMP=OFF for the arm64 job. ggml falls back to its own std::thread threadpool, so both jllama_test.exe and the shipped arm64 jllama.dll are self-contained with no libomp dependency to ship, which is cleaner than bundling an LLVM OpenMP DLL into the default JAR. The x86_64/x86 jobs keep OpenMP (MSVC vcomp, which is ambient and already proven).

Also updated the job comment + CLAUDE.md to record that VC\Tools\Llvm\ARM64 supplies clang-cl/lld-link (no separate LLVM install needed) and the OpenMP rationale. The getenv/strdup/ctime deprecation messages in the same log are warnings only (clang-cl flagging POSIX names against the MSVC UCRT headers), not the failure.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7

Trivial additive upgrade: no incompatibilities, no project source changes. Bumps GIT_TAG (+ the TTS provenance banner), the README badge/link, and the CLAUDE.md pinned-version line + build examples.

The b9864..b9866 diff is backend/WebUI-only: the CUDA topk-moe kernel gains a case 288 instantiation + accepts n_expert==288 (StepFun 3.7's non-power-of-2 expert count), device-side, affecting only the cuda13 classifiers; a test-backend-ops.cpp case (not built here, LLAMA_BUILD_TESTS OFF); and WebUI changes (a config string-boolean normalization migration + a thinking-default flip) that auto-follow the pinned GIT_TAG via the build-webui job. The project binds no new symbol.

Patch verification: the diff touches no patch-target file and no OuteTTS anchor, so all six patches are byte-identical to b9864. Confirmed end-to-end by a clean cmake configure: b9866 fetched (case 288 present) and all six patches applied via the fail-loud PATCH_COMMAND (exit 0; 0005 + 0006 markers present), OuteTTS anchors held.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7

Add a diff-size-driven bump workflow so a version upgrade never lands an unreviewably large diff in one step.

- .github/scripts/llama-next-version.sh: read-only helper that computes the next reviewable step. Reads the current pin from llama/CMakeLists.txt and the target from an explicit b<nnnn> arg or the GitHub releases atom feed, against a cached blobless mirror clone. If git diff cur..target is under the threshold (LLAMA_BUMP_MAX_DIFF_KB, default 100 KiB) it bumps straight to the target; otherwise it binary-searches the intermediate tags for the largest one still under the threshold and prints that chunk plus its compare/.patch URLs. LLAMA_BUMP_EXCLUDE_WEBUI sizes the diff excluding the auto-followed tools/ui WebUI.
- docs/upgrade/llama-cpp-version-bump.md: the runbook (documentation root) for target selection, byte-size chunking, the helper, and the edit/verify/commit loop.
- CLAUDE.md: link the runbook from the Upgrading/Downgrading section.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
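The chunk-selection logic of the helper can be sketched in Java, although the actual script is shell and is not reproduced here. The sketch assumes the diff size from the current pin grows monotonically with later tags (which a binary search needs but real diffs do not strictly guarantee), takes the size lookup as a caller-supplied function instead of running git, and falls back to the very next tag when even that exceeds the threshold; that fallback is an assumption of this sketch, not documented behaviour. All names are hypothetical.

```java
import java.util.List;
import java.util.function.ToLongFunction;

/** Sketch of "largest tag whose diff is still under the threshold"; not the shell helper. */
final class NextBumpStepSketch {
    /**
     * @param tagsAfterCurrent tags newer than the current pin, oldest first; the last is the target
     * @param diffBytes        size of the diff from the current pin to the given tag
     * @param maxBytes         review threshold, e.g. 100 * 1024 for 100 KiB
     */
    static String nextStep(List<String> tagsAfterCurrent, ToLongFunction<String> diffBytes, long maxBytes) {
        if (tagsAfterCurrent.isEmpty()) {
            throw new IllegalArgumentException("no newer tags");
        }
        int last = tagsAfterCurrent.size() - 1;
        String target = tagsAfterCurrent.get(last);
        if (diffBytes.applyAsLong(target) < maxBytes) {
            return target; // small enough: bump straight to the target
        }
        // Binary search the intermediate tags for the largest one under the threshold.
        int lo = 0;
        int hi = last - 1;
        int best = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (diffBytes.applyAsLong(tagsAfterCurrent.get(mid)) < maxBytes) {
                best = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        // Assumed fallback: if nothing fits, take the smallest possible step.
        return tagsAfterCurrent.get(best >= 0 ? best : 0);
    }
}
```

With hypothetical tags whose diffs measure 30, 60, 90, 120 and 150 KiB against a 100 KiB threshold, the search settles on the third tag, so the bump proceeds in reviewable chunks rather than in one oversized step.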

First bump driven by the new .github/scripts/llama-next-version.sh helper: b9866 -> b9867 is a 2 KiB single-commit final chunk (well under the 100 KiB threshold), so it bumps straight to the latest release.

b9867 (spec: support spec-draft-p-min in DFlash) changes only common/speculative.cpp: the DFlash draft path now also clamps n_min to the block size, raises the draft sampler top_k 1 -> 10, stops drafting when the top candidate probability drops below p_min, and discards a step producing fewer than n_min tokens. All three use existing common_speculative_params fields; common/speculative.h is untouched. Entirely inside upstream-compiled common; the project binds no common_speculative_* symbol. No project source changes required.

Re-verified all six patches (0001-0006) apply cleanly against b9867 via a fresh fail-loud cmake PATCH_COMMAND configure (0005/0006 markers present); OuteTTS generator anchors held. Appended the b9866->b9867 history rows.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7

The new docs/upgrade/llama-cpp-version-bump.md lacked copyright/licensing info, failing the License Compliance (REUSE) check. Add the top-of-file HTML-comment SPDX block used by the sibling docs (docs/history/*.md, docs/feature-investigation-*.md). reuse lint now reports 310/310 files compliant with REUSE 3.3.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7

Driven by .github/scripts/llama-next-version.sh: b9867 -> b9870 is a 21 KiB (11 KiB excl. WebUI) three-commit final chunk, under the 100 KiB threshold, so it bumps straight to the latest release.

The only source edit in b9867..b9870 is common/chat.cpp: a StepFun message-content whitespace workaround (issue #24181) that trims leading and trailing whitespace from each common_chat_msg content, reasoning_content and text content-part before Jinja rendering, detected by the StepFun template signature. It uses existing common_chat_msg fields; common/chat.h is untouched. The removed stepfun-ai-Step-3.5-Flash.jinja template and the test-chat additions are not built here (LLAMA_BUILD_TESTS OFF); tools/ui is the auto-followed WebUI. No project source changes required.

Re-verified all six patches (0001-0006) apply cleanly against b9870 via a fresh fail-loud cmake PATCH_COMMAND configure (0005/0006 markers and the b9870 trim_all_content change present); OuteTTS generator anchors held. Appended the b9867->b9870 history rows.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7

…VINO

Wire eight new GPU-backend classifiers following the exact same 5-place pattern as the existing CUDA/Vulkan classifiers (fail-loud, in package.needs, no continue-on-error, no special cases):

  rocm-linux-x86-64        GGML_HIP       Linux x86_64    (AMD ROCm/HIP)
  rocm-windows-x86-64      GGML_HIP       Windows x86_64  (AMD HIP SDK)
  sycl-fp16-linux-x86-64   GGML_SYCL+F16  Linux x86_64    (Intel oneAPI, fp16)
  sycl-fp32-linux-x86-64   GGML_SYCL      Linux x86_64    (Intel oneAPI, fp32)
  sycl-windows-x86-64      GGML_SYCL      Windows x86_64  (Intel oneAPI)
  opencl-windows-aarch64   GGML_OPENCL    Windows aarch64 (Snapdragon/Adreno)
  openvino-linux-x86-64    GGML_OPENVINO  Linux x86_64    (Intel OpenVINO)
  openvino-windows-x86-64  GGML_OPENVINO  Windows x86_64  (Intel OpenVINO)

- llama/CMakeLists.txt: extend the OS-aware backend routing with GGML_HIP, GGML_SYCL (Linux fp16/fp32 split by GGML_SYCL_F16) and GGML_OPENVINO branches.
- llama/pom.xml: eight classifier profiles; the existing opencl-windows include is now arch-scoped to Windows/x86_64 so the new aarch64 OpenCL build sharing the resources_windows_opencl tree does not leak into it (vulkan-linux split precedent).
- .github/workflows/publish.yml: eight build jobs (build-only; GitHub runners have no matching GPU), all added to package.needs and to the download + profile-activation steps of package/publish-snapshot/publish-release. Vendor toolchain installs are first-pass and intentionally fail loud if a URL/version is stale.
- README.md + CLAUDE.md: classifier table rows, dependency snippets, and a wiring/routing section.
- .gitignore: the seven new resources_* trees.

All build-only, no vendor runtime bundled (consumer's driver/toolkit supplies it). Validated locally: CMake CPU reconfigure parses the extended routing, Maven recognizes all 8 profiles, publish.yml is valid YAML, pom.xml is well-formed, REUSE compliant.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7

Mirror upstream llama.cpp's own release-job recipes for the Windows SYCL and Windows HIP builds, and fix the two OpenVINO installs:

- Windows ROCm/HIP: the AMD HIP SDK URL 404'd and find_package(hip) could not locate the SDK. Use HIP SDK 26.Q1 (upstream's pin), resolve HIP_PATH from the installed ROCm dir, and pass -DCMAKE_PREFIX_PATH plus the SDK's own clang/clang++ so ggml-hip's find_package(hip) resolves (GPU_TARGETS, upstream spelling).
- Windows SYCL: the oneAPI offline installer URL returned 403. Use upstream's intel-deep-learning-essentials-2025.3.3.18 offline installer with the extract + bootstrapper silent install (DPC++/MKL/oneDNN/TBB components), then setvars intel64 --force and build with cl (C) + icx (C++), matching upstream.
- Linux OpenVINO: OpenVINOConfig.cmake's find_package(TBB) failed. Add libtbb-dev (supplies TBBConfig.cmake).
- Windows OpenVINO: the archive extracts into a nested versioned folder, so the hard-coded C:\openvino\runtime\cmake did not exist. Resolve the nested dir and pass -DOpenVINO_DIR explicitly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7

…O OpenCL

Second round of fail-loud CI fixes for the new GPU classifiers, from the actual build logs:

- Windows ROCm/HIP: device-code compile failed because ROCm 7.1's HIP clang headers cannot overload the __host__ __device__ isgreater/isless/... that the very new VS 2026 MSVC <cmath> declares via _CLANG_BUILTIN2. Move the job to windows-2022 (MSVC 14.4x), which is what upstream llama.cpp uses for win-hip.
- Windows SYCL: icx rejected the project's static /MT CRT with '-fsycl' ("invalid argument 'MT' not allowed with '-fsycl'"). Exempt GGML_SYCL (and GGML_OPENVINO, whose import libs are /MD) from the static-CRT force in CMakeLists so they build with the dynamic /MD runtime. Those classifiers already need the vendor runtime on the host, so the self-contained-DLL rationale doesn't apply; CPU + CUDA/Vulkan/OpenCL keep /MT.
- Linux OpenVINO: past the TBB fix, ggml-openvino's find_package(OpenCL) failed. Add ocl-icd-opencl-dev + opencl-headers to the apt install.
- Windows OpenVINO: same find_package(OpenCL) need; build it via build_opencl_windows.bat (stages the Khronos headers + OpenCL.lib, then delegates to build.bat) instead of build.bat directly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7

OpenVINO backend now compiles but failed on both platforms:

- ov::Allocator template error (allocate/is_equal on a const void*): version mismatch. llama.cpp's ggml-openvino targets OpenVINO 2026.2.1 (what upstream ships), not the 2025.0.0 I pinned. Bump Linux apt to openvino-2026.2.1 (repo /openvino/2026) and the Windows archive to 2026.2.1.
- Windows 'CL/cl2.hpp' not found: the staged Khronos OpenCL-Headers dropped cl2.hpp. Install OpenCL via vcpkg (opencl:x64-windows ships cl2.hpp) and pass the vcpkg toolchain file, mirroring upstream's windows-openvino job; drop the build_opencl_windows.bat staging for this job.
- Linux: add opencl-clhpp-headers + intel-opencl-icd to the apt set (upstream's full OpenCL package list for ubuntu-openvino).

Also raise cmake_minimum_required 3.15 -> 3.22 to match what the build actually relies on (runners ship 3.31); no behavior change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7

…2026 apt repo

The apt repo https://apt.repos.intel.com/openvino/2026 returns 404: Intel only publishes OpenVINO apt repos up to ~2025, and 2025.x has the older ov::Allocator API that breaks ggml-openvino's template compile.

Switch Linux OpenVINO to the archive for 2026.2.1, exactly as upstream llama.cpp's linux-setup-openvino composite action does:

  storage.openvinotoolkit.org/repositories/openvino/packages/2026.2.1/linux/
  openvino_toolkit_ubuntu24_2026.2.1.21919.ede283a88e3_x86_64.tgz

extracted to /opt/intel/openvino, with OpenVINO_DIR set to its runtime/cmake. OpenCL headers (incl. the C++ CL/cl2.hpp via opencl-clhpp-headers) come from Ubuntu's own repos, so no Intel apt repo is needed at all.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7

Wire build-linux-s390x: cross-compile for IBM Z (s390x, big-endian) with the GCC cross toolchain (native x86 speed), then run the full 462-test C++ suite under qemu-user as a real big-endian correctness gate for our byte-order-sensitive code (the little-endian WAV writer, JSON/token/embedding transforms, JNI helpers). Model-backed Java tests are not run under emulation (slow/flaky); the Java<->JNI boundary uses host-native array copies, so the C++ gate covers the actual endian risk.

- publish.yml: build-linux-s390x (g++-s390x-linux-gnu + qemu-user-static; CMAKE_CROSSCOMPILING_EMULATOR + QEMU_LD_PREFIX make ctest run the s390x exe; GGML_OPENMP=OFF avoids cross-libgomp). s390x is a default-jar CPU platform like aarch64, so the artifact merges via the *-libraries glob (no classifier / pom profile). Fail-loud and in package.needs.
- OSInfo.java: map os.arch=s390x -> Linux/s390x (S390X constant + archMapping).
- README/CLAUDE.md: document the platform + the big-endian gate.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
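The OSInfo change in this commit (an S390X constant plus an archMapping entry so that os.arch=s390x resolves to Linux/s390x) might look roughly like the following. This is a minimal sketch, not the project's OSInfo.java: the class name, the resourceFolder method and the other map entries are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical stand-in for the arch lookup; not the project's actual OSInfo class. */
final class ArchMappingSketch {

    static final String X86_64 = "x86_64";
    static final String AARCH64 = "aarch64";
    static final String S390X = "s390x";

    // Maps raw os.arch values to the folder name used for bundled native libraries.
    private static final Map<String, String> ARCH_MAPPING = new HashMap<>();

    static {
        ARCH_MAPPING.put("x86_64", X86_64);
        ARCH_MAPPING.put("amd64", X86_64);
        ARCH_MAPPING.put("aarch64", AARCH64);
        ARCH_MAPPING.put("arm64", AARCH64);
        ARCH_MAPPING.put("s390x", S390X); // the new big-endian IBM Z entry
    }

    /** Builds an "OS/arch" folder name; unknown architectures pass through lower-cased. */
    static String resourceFolder(String osName, String osArch) {
        String os = osName.toLowerCase(Locale.ROOT).contains("linux") ? "Linux" : osName;
        String key = osArch.toLowerCase(Locale.ROOT);
        return os + "/" + ARCH_MAPPING.getOrDefault(key, key);
    }

    public static void main(String[] args) {
        // Folder for the JVM this runs on, then the s390x case from the commit.
        System.out.println(resourceFolder(System.getProperty("os.name"), System.getProperty("os.arch")));
        System.out.println(resourceFolder("Linux", "s390x")); // prints "Linux/s390x"
    }
}
```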

Address 'Resources should be closed' (java:S2095) on NativeServer.main. The server was already closed on every real path (shutdown hook on SIGTERM, explicit close on self-termination), but not in a structure Sonar recognizes. Wrap the body in try/finally so close() is guaranteed on normal or exceptional exit (S2095's 'close in a finally clause' option).

try-with-resources is deliberately NOT used: the shutdown hook must also call close() explicitly, which javac flags under -Werror as 'explicit call to close() on an auto-closeable resource'. close() is idempotent (guards on a zero handle), so the finally and the hook both firing is safe. The now-redundant stoppedByHook flag is dropped. All 7 NativeServerSmokeTest cases still pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
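The shape described here (a shutdown hook and a finally block that may both call an idempotent close()) can be sketched as below. It is an assumption-laden illustration rather than the actual NativeServer code: the class name, the fake handle value, the release counter and the runUntilExit helper are all hypothetical, and the native release call is replaced by a counter so the example runs without JNI.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for a JNI-backed server; the handle and counter are fake. */
final class EmbeddedServerSketch implements AutoCloseable {

    // Non-zero while the (pretend) native server is alive; zero once released.
    private final AtomicLong handle = new AtomicLong(42L);
    private final AtomicInteger releases = new AtomicInteger();

    /** Idempotent: only the first caller sees a non-zero handle and releases it. */
    @Override
    public void close() {
        long h = handle.getAndSet(0L);
        if (h == 0L) {
            return; // already closed by the hook or by the finally block
        }
        releases.incrementAndGet(); // a real implementation would free the native handle here
    }

    int releaseCount() {
        return releases.get();
    }

    /** Runs the body with close() guaranteed on normal exit, exception, or JVM shutdown. */
    static void runUntilExit(EmbeddedServerSketch server, Runnable body) {
        // The hook covers termination by signal, e.g. SIGTERM.
        Runtime.getRuntime().addShutdownHook(new Thread(server::close, "server-shutdown"));
        try {
            body.run();
        } finally {
            server.close(); // covers normal return and exceptions
        }
    }

    public static void main(String[] args) {
        EmbeddedServerSketch server = new EmbeddedServerSketch();
        runUntilExit(server, () -> System.out.println("serving..."));
        System.out.println("releases: " + server.releaseCount());
    }
}
```

Swapping the handle to zero atomically is one way to make the guard safe when the hook thread and the main thread race; the commit only states that close() guards on a zero handle, so the atomic swap is a design choice of this sketch.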

sonarqubecloud · 2026-07-04T14:28:35Z

Quality Gate failed

Failed conditions
78.9% Coverage on New Code (required ≥ 80%)

See analysis details on SonarQube Cloud

claude added 14 commits July 2, 2026 21:23

bernardladenthin temporarily deployed to startgate July 3, 2026 12:31 — with GitHub Actions Inactive

bernardladenthin temporarily deployed to startgate July 3, 2026 13:17 — with GitHub Actions Inactive

bernardladenthin had a problem deploying to startgate July 3, 2026 13:46 — with GitHub Actions Error

bernardladenthin temporarily deployed to startgate July 3, 2026 13:49 — with GitHub Actions Inactive

bernardladenthin temporarily deployed to startgate July 3, 2026 14:38 — with GitHub Actions Inactive

bernardladenthin temporarily deployed to startgate July 3, 2026 14:55 — with GitHub Actions Inactive

bernardladenthin temporarily deployed to startgate July 3, 2026 15:32 — with GitHub Actions Inactive

bernardladenthin had a problem deploying to startgate July 3, 2026 18:12 — with GitHub Actions Error

bernardladenthin temporarily deployed to startgate July 4, 2026 07:08 — with GitHub Actions Inactive

bernardladenthin temporarily deployed to startgate July 4, 2026 08:37 — with GitHub Actions Inactive

bernardladenthin temporarily deployed to startgate July 4, 2026 09:25 — with GitHub Actions Inactive

bernardladenthin temporarily deployed to startgate July 4, 2026 10:00 — with GitHub Actions Inactive

bernardladenthin temporarily deployed to startgate July 4, 2026 10:24 — with GitHub Actions Inactive

bernardladenthin temporarily deployed to startgate July 4, 2026 10:37 — with GitHub Actions Inactive

bernardladenthin temporarily deployed to startgate July 4, 2026 12:13 — with GitHub Actions Inactive

bernardladenthin changed the title ~~Add NativeServer mode, Vulkan Linux classifiers, and llama.cpp b9864 features~~ NativeServer mode, GPU/CPU classifier matrix (ROCm/SYCL/OpenVINO/Win-arm64/s390x), llama.cpp b9870 + bump automation Jul 4, 2026

bernardladenthin had a problem deploying to startgate July 4, 2026 14:25 — with GitHub Actions Error

bernardladenthin merged commit d8523a4 into main Jul 4, 2026
12 of 15 checks passed

bernardladenthin deleted the claude/granite-4-llama-cpp-decode-kk9xjp branch July 4, 2026 14:29

bernardladenthin merged 29 commits into main from claude/granite-4-llama-cpp-decode-kk9xjp

2 participants
