
NativeServer mode, GPU/CPU classifier matrix (ROCm/SYCL/OpenVINO/Win-arm64/s390x), llama.cpp b9870 + bump automation #293

Merged: bernardladenthin merged 29 commits into main from claude/granite-4-llama-cpp-decode-kk9xjp on Jul 4, 2026

@bernardladenthin bernardladenthin commented Jul 3, 2026

Summary

Server

  • NativeServer: run the full upstream llama.cpp HTTP server (embedded WebUI) inside libjllama over JNI, as the default server mode. ServerLauncher fat-jar entry point dispatches between NativeServer (default) and OpenAiCompatServer (via --jllama-openai-compat).
  • CLI: OpenAiServerCli parses llama.cpp tuning flags (-b, -ub, -tb, -ctk, -ctv, --jinja, --chat-template-kwargs) with JSON validation for chat-template kwargs.

Native classifier matrix expansion: new GPU backends, each wired with the same 5-place pattern (CMakeLists routing → build job in package.needs, fail-loud → pom profile → README row → git-ignored resources_* tree). All are build-only (the runners have no matching GPU) and bundle no vendor runtime:

  • Vulkan Linux: vulkan-linux-x86-64, vulkan-linux-aarch64 (vendor-neutral NVIDIA/AMD/Intel)
  • AMD ROCm/HIP: rocm-linux-x86-64, rocm-windows-x86-64
  • Intel SYCL (oneAPI): sycl-fp16-linux-x86-64, sycl-fp32-linux-x86-64, sycl-windows-x86-64
  • Intel OpenVINO: openvino-linux-x86-64, openvino-windows-x86-64 (2026.2.1 archive + OpenCL via apt/vcpkg)
  • Windows-on-ARM OpenCL (Adreno): opencl-windows-aarch64

New default-jar CPU platforms

  • Windows arm64 (Snapdragon X; clang-cl, GGML_OPENMP=OFF)
  • Linux s390x (IBM Z, big-endian): cross-compiled + the full 462-test C++ suite run under qemu-user as a real big-endian correctness gate (WAV writer, JSON/token/embedding transforms, JNI helpers). OSInfo maps os.arch=s390x → Linux/s390x.

llama.cpp upgrade b9864 → b9870

  • New completion params surfaced through the Java layer: per-request sse_ping_interval, model ftype (quantization) on /v1/models, plus 8 audited withers (xtcProbability/Threshold, nDiscard, nIndent, tMaxPredictMs, postSamplingProbs, timingsPerToken, returnTokens).
  • Native patches: embedded-mode server support (signal-handler suppression, forwarded-argv parse, out-of-band shutdown) and recurrent/hybrid-model near-prompt-end checkpoint handling.

Build / CI / tooling

  • Version-bump automation: .github/scripts/llama-next-version.sh (diff-size chunking against a cached mirror) + docs/upgrade/llama-cpp-version-bump.md runbook, linked from CLAUDE.md.
  • Per-job sccache statistics table in the GitHub job summary.
  • CI stabilization across the new jobs (SpotBugs, ArchUnit, clang-format, REUSE).

Test plan

  • C++ unit suite (462 tests) green locally on x86-64 and on big-endian s390x under qemu
  • Affected Java unit tests pass locally (OpenAiServerCliTest, ServerLauncherTest, NativeServerSmokeTest, InferenceParametersTest, ModelParametersExtendedTest)
  • All 8 new GPU classifiers + Windows-arm64 + s390x build green in CI
  • Patches re-verified against b9870 via the fail-loud cmake PATCH_COMMAND
  • README and CLAUDE.md updated (server modes, full classifier table, version, s390x gate, bump runbook)

Related issues / PRs

Implements the native server mode + WebUI embedding and the GPU/CPU classifier matrix expansion described in CLAUDE.md and TODO.md.

Checklist

  • I have read CONTRIBUTING.md and CODE_OF_CONDUCT.md
  • My commits follow Conventional Commits
  • No security-sensitive changes

https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7

claude added 14 commits July 2, 2026 21:23
…s (Granite-4)

Recurrent/hybrid models (granitehybrid, Mamba, Jamba) can only roll a slot
back to a saved context checkpoint. In upstream b9859 the near-prompt-end
checkpoints are gated by checkpoint_min_step (default 8192 tokens) and new
checkpoints are otherwise only created at user-message boundaries. An agentic
tool-calling conversation appends only assistant/tool messages after turn 1,
so no new checkpoint is ever created and every turn re-prefills the whole
conversation tail.

Measured on a synthetic granitehybrid model (llama-server, 6-turn tool loop,
~643 new tokens/turn): prefilled tokens per turn grew 901 -> 1544 -> 2187 ->
2830 -> 3473 unpatched, i.e. quadratic total prefill.

patches/0005 (upstream-submittable, server-context.cpp):
- exempt near-prompt-end checkpoints from the min-step spacing when the
  memory can only roll back via checkpoints (seq-rm type FULL or RS);
  SWA-only models are unaffected
- never create a checkpoint at the same position as the newest one (the
  last-user-message checkpoint was re-created identically every turn,
  flooding the 32-entry checkpoint list)

With the patch the same loop prefills a constant 647 tokens/turn (each turn
restores the previous turn's near-end checkpoint): 5.4x less prefill at
turn 6, growing with conversation length. Outputs verified byte-identical
to unpatched at temperature=0.
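
The figures above can be re-derived with a few lines of arithmetic. The sketch below only uses the numbers quoted in this commit message; the class and method names are made up for illustration and are not part of the project.

```java
import java.util.Locale;

/** Re-derives the prefill figures quoted in the commit message (illustrative only). */
final class PrefillGrowth {
    // Per-turn prefilled tokens reported for the unpatched run.
    static final int[] UNPATCHED = {901, 1544, 2187, 2830, 3473};
    // Constant per-turn prefill reported with the patch applied.
    static final int PATCHED_PER_TURN = 647;

    /** Differences between consecutive unpatched turns. */
    static int[] deltas() {
        int[] d = new int[UNPATCHED.length - 1];
        for (int i = 1; i < UNPATCHED.length; i++) {
            d[i - 1] = UNPATCHED[i] - UNPATCHED[i - 1];
        }
        return d;
    }

    /** Ratio of unpatched to patched prefill at the last measured turn. */
    static double lastTurnRatio() {
        return (double) UNPATCHED[UNPATCHED.length - 1] / PATCHED_PER_TURN;
    }

    public static void main(String[] args) {
        for (int d : deltas()) {
            System.out.println("delta = " + d); // every step adds 643 tokens
        }
        // 3473 / 647 is about 5.37, which rounds to the quoted 5.4x
        System.out.println(String.format(Locale.ROOT, "ratio = %.1fx", lastTurnRatio()));
    }
}
```

A constant per-turn increment means the per-turn cost grows linearly, so the summed prefill over a conversation grows quadratically, which is the behaviour the patch removes.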

ModelParameters gains setCtxCheckpoints(int) / setCheckpointMinStep(int)
(--ctx-checkpoints / --checkpoint-min-step, both LLAMA_EXAMPLE_SERVER scope,
reach the embedded server through common_params_parse) so callers can tune
checkpoint density/RAM from Java. +2 unit tests (144 pass), javadoc clean,
spotless applied.

Complements open upstream PRs #24035/#24899/#24891 (checkpoint invalidation/
retention); this fixes checkpoint starvation. Drop the patch once upstream
lands role-boundary checkpoint placement.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
Pure readability refactor of the checkpoint-starvation fix — no behavior
change. The two compound `do_checkpoint = do_checkpoint && (empty || ...)`
assignments are lifted into named locals so the final gate reads:

    do_checkpoint = do_checkpoint && checkpoint_well_spaced && checkpoint_not_duplicate;

- checkpoint_well_spaced: the min-step spacing test with the last-user-message
  and near-prompt-end (checkpoint-only-rollback) exemptions
- checkpoint_not_duplicate: the same-position dedup guard

Each named bool keeps the leading `checkpoints.empty() ||` so the
`checkpoints.back()` access stays short-circuit-guarded (identical semantics
to the previous inlined `&&`-chains). Compiles clean; patch re-verified to
apply and reverse-check (idempotence) against pristine b9859 via the same
`git apply` path the CMake applier uses.
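
For readers who do not have the patch open, the shape of the refactored gate can be modelled in Java roughly as follows. This is a transliteration under assumptions, not the C++ from patches/0005: the list-of-positions representation, the single spacingExempt flag, and the >= comparison are all guesses made for illustration.

```java
import java.util.List;

/** Illustrative model of the named-boolean checkpoint gate; not the patched C++. */
final class CheckpointGateSketch {
    /**
     * @param wanted        the caller's initial do_checkpoint decision
     * @param positions     positions of existing checkpoints, oldest first (assumed representation)
     * @param pos           position of the candidate checkpoint
     * @param minStep       minimum spacing between checkpoints
     * @param spacingExempt true when one of the exemptions described above applies
     */
    static boolean gate(boolean wanted, List<Integer> positions, int pos, int minStep,
                        boolean spacingExempt) {
        // Both locals keep the leading isEmpty() test so the "newest checkpoint"
        // lookup is never evaluated on an empty list (short-circuit guard).
        boolean wellSpaced = positions.isEmpty()
                || spacingExempt
                || pos - positions.get(positions.size() - 1) >= minStep; // assumed comparison
        boolean notDuplicate = positions.isEmpty()
                || pos != positions.get(positions.size() - 1);
        return wanted && wellSpaced && notDuplicate;
    }
}
```

The dedup guard applies even when the spacing exemption is active, which is what stops an identical checkpoint from being re-created every turn.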

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
…enAiServerCli

The fat-jar launcher (OpenAiCompatServer.main) parses args via OpenAiServerCli,
which only understood a subset of flags and threw on anything else. Extend it
with the seven tuning flags that a llama-server.exe user needs so the bundled
`java -jar …-jar-with-dependencies.jar` covers a full invocation without any
custom Java:

  -b/--batch-size, -ub/--ubatch-size        -> ModelParameters.setBatchSize/setUbatchSize
  -tb/--threads-batch                        -> setThreadsBatch
  -ctk/--cache-type-k, -ctv/--cache-type-v   -> setCacheTypeK/V (case-insensitive
                                                CacheType lookup; unknown -> error)
  --jinja                                    -> enableJinja
  --chat-template-kwargs <json>              -> setChatTemplateKwargs

--chat-template-kwargs is parsed here (Jackson, already a server-package dep)
into the raw-per-value map setChatTemplateKwargs expects, so a malformed object
fails fast with usage text instead of at native model load. All setters already
existed; the ints/CacheType/kwargs use 0/null "leave the default" sentinels
mirroring the existing ctx/threads/parallel handling.
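
A minimal sketch of the case-insensitive lookup with a null "leave the default" sentinel, as described above. The enum here is a hypothetical stand-in with made-up constants; the project's real CacheType and the actual OpenAiServerCli code may differ.

```java
import java.util.Locale;

/** Illustrative only: case-insensitive enum lookup with a null "unset" sentinel. */
final class CacheTypeLookupSketch {
    /** Hypothetical stand-in for the project's CacheType enum. */
    enum CacheType { F32, F16, Q8_0, Q4_0 }

    /** Returns null when the flag was not given, so the caller leaves the default alone. */
    static CacheType parse(String raw) {
        if (raw == null) {
            return null;
        }
        try {
            // Locale.ROOT keeps the upper-casing independent of the user's locale.
            return CacheType.valueOf(raw.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            // Unknown value: fail at parse time, keeping the cause chained.
            throw new IllegalArgumentException("unknown cache type: " + raw, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("q8_0")); // Q8_0
        System.out.println(parse(null));   // null
    }
}
```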

+13 unit tests (30 pass total); usage() and README flag list updated; javadoc
and spotless clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
…LL via JNI

Second server mode alongside the Java-transport OpenAiCompatServer: NativeServer
runs the *full* upstream llama.cpp HTTP server — embedded WebUI included — inside
libjllama over JNI, with no separate llama-server executable. It forwards the raw
llama-server argv verbatim, so every flag works exactly as for the standalone
binary (no per-flag Java mapping).

How: b9859 already exposes `int llama_server(int, char**)` (no main() in
server.cpp). patches/0006 makes it embeddable — skips installing process-wide
SIGINT/SIGTERM handlers when embedded (they would hijack the JVM), parses the
forwarded argv via common_params_parse instead of common_params_parse_main
(whose GetCommandLineW recovery would grab java.exe's command line — the Windows
bug class 0001 fixes), and adds llama_server_request_shutdown() for out-of-band
stop (ctx_server is loop-local). native_server.cpp's JNI bridge runs llama_server
on a worker thread; start/stop/isRunning map to the three native methods.

CMake: server.cpp + server-tools.cpp are now compiled in (non-Android — both pull
subprocess.h/posix_spawn_*, so they share server-models.cpp's guard), plus
native_server.cpp.

NativeServer is an independent lifecycle (loads its own model from the argv, like
llama-server.exe), single-instance per process (upstream keeps shutdown state in
file-scope globals), and unavailable on Android. Reusing an already-loaded
LlamaModel's context is a documented TODO. libjllama loads lazily in start(), so
construction/arg-parsing/close stay pure-Java unit-testable.
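
The lifecycle described above (blocking server call on a worker thread, out-of-band stop, one instance per process) can be sketched in pure Java as below. This is not the NativeServer source: the two Runnables stand in for the native methods, and every name in the block is hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

/** Illustrative lifecycle only; the Runnables stand in for the native calls. */
final class EmbeddedServerLifecycleSketch implements AutoCloseable {
    // Process-wide guard modelling the single-instance-per-process constraint.
    private static final AtomicBoolean IN_USE = new AtomicBoolean(false);

    private final Runnable blockingServerLoop; // stand-in for the blocking native server call
    private final Runnable requestShutdown;    // stand-in for the out-of-band native stop
    private Thread worker;

    EmbeddedServerLifecycleSketch(Runnable blockingServerLoop, Runnable requestShutdown) {
        this.blockingServerLoop = blockingServerLoop;
        this.requestShutdown = requestShutdown;
    }

    /** Builds an instance whose "server" just parks on a latch until stopped. */
    static EmbeddedServerLifecycleSketch latchBacked() {
        CountDownLatch stop = new CountDownLatch(1);
        return new EmbeddedServerLifecycleSketch(() -> {
            try {
                stop.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, stop::countDown);
    }

    synchronized void start() {
        if (worker != null) {
            throw new IllegalStateException("already started");
        }
        if (!IN_USE.compareAndSet(false, true)) {
            throw new IllegalStateException("another embedded server is active in this process");
        }
        worker = new Thread(() -> {
            try {
                blockingServerLoop.run();
            } finally {
                IN_USE.set(false); // release the guard once the loop returns
            }
        }, "embedded-server-worker");
        worker.start();
    }

    synchronized boolean isRunning() {
        return worker != null && worker.isAlive();
    }

    @Override
    public synchronized void close() {
        if (worker == null) {
            return;
        }
        requestShutdown.run();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        worker = null;
    }

    public static void main(String[] args) {
        try (EmbeddedServerLifecycleSketch server = latchBacked()) {
            server.start();
            System.out.println("running = " + server.isRunning());
        }
    }
}
```

Keeping construction free of side effects, as in this sketch, is what lets argument handling and close() be unit-tested without loading a native library.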

Verified end-to-end on Linux x86_64 with a synthetic granitehybrid model: server
starts, GET /health -> 200 {"status":"ok"}, /v1/models and /props served, / is
the native WebUI route (404 locally with the empty-asset stub; serves index.html
in released jars that bake in webui-generated assets), close() shuts down cleanly.
7 pure-Java NativeServer tests + javadoc + spotless + clang-format(22.1.5) clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
…CompatServer)

Two runnable server mains now exist. The fat jar's default Main-Class becomes
NativeServer, so `java -jar …-jar-with-dependencies.jar -m model.gguf --port 8080`
runs the full native llama.cpp server with its embedded WebUI, forwarding every
argument. OpenAiCompatServer is unchanged and still runnable via
`java -cp <jar> net.ladenthin.llama.server.OpenAiCompatServer …`.

- NativeServer.main(args): forwards argv, starts the server, registers a JVM
  shutdown hook (the embedded server installs no signal handlers of its own — see
  patches/0006 — so the hook is what stops it cleanly on Ctrl-C/SIGTERM), and
  blocks until the native worker exits.
- llama/pom.xml assembly profile: Main-Class OpenAiCompatServer -> NativeServer.
- README + CLAUDE.md: document the two modes and how to select each.

Verified end-to-end (Linux x86_64, synthetic granitehybrid): `java -cp … NativeServer
-m model --port 8972` serves /health=ok after load; SIGTERM to the JVM fires the
shutdown hook -> clean "cleaning up before exit" -> port down. Javadoc + spotless
clean; 7 pure-Java NativeServer tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
…ai-compat

Tiny dispatcher as the fat jar's Main-Class: with --open-ai-compat it runs
OpenAiCompatServer (Java-transport OpenAI API), without it the default
NativeServer (full native llama.cpp server + WebUI). The --open-ai-compat marker
is stripped (it is not a llama.cpp flag); all other args are forwarded verbatim
to the chosen server. Both underlying mains stay runnable directly by class name
via java -cp.

Note: the two servers accept different flag sets — NativeServer forwards every
llama-server flag, OpenAiCompatServer's CLI accepts a curated subset and rejects
unknown flags — so native-only flags can't be combined with --open-ai-compat.

Dispatch logic split into pure static helpers (selectsOpenAiCompat / withoutFlag)
with 7 unit tests. Verified at runtime: `ServerLauncher --open-ai-compat -m model
--port 8973` starts the Java server (/ -> invalid_request_error, its handler),
and without the flag starts NativeServer (/ -> native File Not Found); both shut
down cleanly on SIGTERM. pom Main-Class NativeServer -> ServerLauncher; README +
CLAUDE.md updated. spotless + javadoc clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
…g to --openai-compat

Per review: collapse the dispatch to a single pure helper. withoutFlag(args, flag)
strips the selector and returns a (possibly shorter) array; main() selects the
mode purely by whether that shortened the list (present iff result is smaller),
so the separate selectsOpenAiCompat method + its baked-in constant are gone. The
helper takes the flag as a parameter, so it is general and testable independent
of the flag's meaning.

Also rename the selector --open-ai-compat -> --openai-compat ("OpenAI" is one
word, matching the brand and the codebase's oaicompat / OpenAiCompatServer);
constant OPEN_AI_COMPAT_FLAG -> OPENAI_COMPAT_FLAG.

Tests rewritten around withoutFlag: the length-change selection signal (shorter
iff present, same length iff absent, position-independent) plus stripping
behaviour (strips all occurrences, preserves order, no-op when absent, empty).
7 pass. Verified at runtime: `ServerLauncher --openai-compat -m model --port
8974` routes to OpenAiCompatServer (/ -> invalid_request_error) and shuts down
cleanly; no-flag routes to NativeServer. README + CLAUDE.md + pom updated;
spotless + javadoc clean.
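
The length-change selection signal can be sketched as follows. The helper name and its (args, flag) shape come from the commit message; the body and the surrounding class are an illustrative reconstruction, not the ServerLauncher source.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative reconstruction of the dispatch helper described above. */
final class WithoutFlagSketch {
    /** Returns args with every occurrence of flag removed, order preserved. */
    static String[] withoutFlag(String[] args, String flag) {
        List<String> kept = new ArrayList<>(args.length);
        for (String arg : args) {
            if (!flag.equals(arg)) {
                kept.add(arg);
            }
        }
        return kept.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String flag = "--openai-compat"; // the selector name as of this commit
        String[] forwarded = withoutFlag(args, flag);
        // The flag was present if and only if stripping shortened the array.
        boolean openAiCompat = forwarded.length < args.length;
        System.out.println(openAiCompat ? "OpenAiCompatServer" : "NativeServer");
    }
}
```

Because the flag is a parameter, the helper has no knowledge of what the selector means, which is what makes it testable on its own.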

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
Rename the fat-jar dispatch selector from --openai-compat to
--jllama-openai-compat so it can never collide with a current or future
llama.cpp / llama-server flag: upstream owns the --* space, this launcher
owns --jllama-*. The jllama prefix is the project's native-library name,
which upstream will never use, and it stays a lowercase-hyphen CLI token
(not the verbose FQN, not the class name). ServerLauncher strips it before
forwarding, so it never reaches llama_server (which rejects unknown flags).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
Additive-only upgrade — no incompatibilities, no project source changes.
Bumps GIT_TAG (and the TTS provenance banner), the README badge/link, and
the CLAUDE.md pinned-version line + build examples.

The b9859..b9862 diff touches two patch-target files (server-context.cpp/.h,
for the new model_ftype field + get_meta/get_model_info additions). Patches
0002/0003/0005 were applied in sequence against the actual b9862
server-context.{cpp,h} and all apply cleanly (their regions are disjoint from
and far from the additions). Patches 0001/0004/0006 target files not in the
changed-file list (common/arg.*, server-common.cpp, test-chat.cpp, server.cpp,
the ~34 standalone mains) and the OuteTTS generator anchors (tts.cpp) are
unchanged, so they apply byte-identically.

New upstream features surfaced by the bump (documented in the breaking-changes
history):
- Additive C API llama_ftype_name / llama_model_ftype; the native server now
  emits a model 'ftype' in get_model_info(), so NativeServer mode surfaces the
  quant type automatically. Optional follow-up: bind it into LlamaModel and the
  Java OpenAiCompatServer propsJson.
- CUDA gated-delta-net fused snapshot-copy path (decode win for gated-delta /
  hybrid-recurrent models on the cuda13 classifiers).
- Vendored cpp-httplib bumped to v0.49.0 (locale-independent ASCII classifiers,
  additive MultipartFormDataWriter, base64 UB fix) — internal to the compiled
  server transport, no bound symbol.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
Wires the new b9862 llama_ftype_name / llama_model_ftype quant-type info
(exposed by server_context_meta::model_ftype) up through JNI to Java:

- jllama.cpp getModelMetaJson now emits "ftype" from server_context_meta.
- ModelMeta.getFtype() and the convenience LlamaModel.getModelFtype() expose
  the quant label (e.g. "Q4_K - Medium"; a guessed type is prefixed with
  "(guessed) "), empty when the native layer does not report it.
- OpenAiCompatServer advertises it as data[].ftype in GET /v1/models, matching
  the upstream server's get_model_info() key. The value is threaded through
  OpenAiServerConfig.modelFtype (new field/builder/getter) from the loaded
  model, mirroring how supportsVision is threaded — keeping the "models built
  from config alone" invariant. The field is omitted when unknown/blank.
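
A rough sketch of the "omit when unknown/blank" rule for a /v1/models entry. The method and class names are hypothetical and a plain Map stands in for whatever JSON model the server really uses; only the ftype key and the omission rule come from the text above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: builds one data[] entry, adding ftype only when it is known. */
final class ModelEntrySketch {
    static Map<String, Object> modelEntry(String id, String ftype) {
        Map<String, Object> entry = new LinkedHashMap<>();
        entry.put("id", id);
        entry.put("object", "model");
        // Blank or missing quant labels are left out rather than emitted as "".
        if (ftype != null && !ftype.trim().isEmpty()) {
            entry.put("ftype", ftype);
        }
        return entry;
    }

    public static void main(String[] args) {
        System.out.println(modelEntry("demo", "Q4_K - Medium"));
        System.out.println(modelEntry("demo", ""));
    }
}
```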

Tests: +2 ModelMeta, +1 OpenAiServerConfig, +1 OpenAiSseFormatter (ftype
present/omitted). Verified end-to-end: full native jllama build against b9862
with all six patches applied (100% link), model-free load smoke test green,
and 63 model-free Java unit tests pass; clang-format 22.1.5 + spotless +
javadoc all clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
Extends the artifact matrix toward upstream llama.cpp's release set with three
new native builds (all wired for CI; the user runs CI to validate on the GPU/
arm runners this environment lacks):

1. vulkan-linux-x86-64  — Linux x86_64 Vulkan classifier JAR
2. vulkan-linux-aarch64 — Linux aarch64 Vulkan classifier JAR
3. Windows arm64 CPU    — folded into the DEFAULT JAR (no classifier)

Linux Vulkan (vendor-neutral GPU jar, no CUDA toolkit) — the intersection of
the existing Vulkan-Windows and CUDA-Linux wiring:
- CMakeLists: the elseif(GGML_VULKAN) branch is now OS-aware like GGML_CUDA
  (Windows -> resources_windows_vulkan, else resources_linux_vulkan/.../Linux/
  ${OS_ARCH}); one tree holds both arches.
- pom.xml: profiles vulkan-linux / vulkan-linux-aarch64, both reading the shared
  resources_linux_vulkan tree with an arch-scoped resource-copy <includes>
  (Linux/x86_64 vs Linux/aarch64), so each classifier JAR carries only its arch.
  Verified locally with staged dummy natives: each jar contains exactly one
  libjllama.so for its arch.
- publish.yml: build-linux-x86_64-vulkan (native ubuntu-latest) +
  build-linux-aarch64-vulkan (ubuntu-24.04-arm, GCC 14); both apt-install the
  Vulkan SDK, build -DGGML_VULKAN=ON -DGGML_NATIVE=OFF, build-only (GPU-less
  runners). Artifacts merge into one resources_linux_vulkan tree in package/
  publish; profiles added to the three -P lists.
- .gitignore: ignore resources_linux_vulkan (also fixed the pre-existing
  resources_cuda_linux -> resources_linux_cuda typo).

Windows arm64 CPU (default JAR):
- build-windows-arm64 on the free windows-11-arm runner (msvc-dev-cmd arch:arm64,
  Ninja Multi-Config, -DOS_ARCH=aarch64, build + ctest), emitting to the canonical
  resources/.../Windows/aarch64 and uploading Windows-aarch64-libraries, which the
  *-libraries glob merges into the default tree. No Java change: OSInfo already
  maps a Windows-on-ARM JVM (os.arch=aarch64) to Windows/aarch64. Matches the
  existing Windows CPU jobs (committed jllama.h + bundled JNI headers, so no
  mvn compile / setup-java needed).

All three added to package.needs. Runtime GPU libs are never bundled (driver
supplies libvulkan.so.1) — same policy as every GPU classifier.

Local verification (CI does the real GPU/arm builds): CMake configures clean and
the CPU branch still routes correctly; pom.xml is well-formed and both new
profiles are recognized and activate; the per-arch classifier split was proven by
packaging staged dummy natives; the workflow YAML parses (40 jobs) with all needs
resolving and all -P lists updated. README classifier table + snippets and CLAUDE.md
document the additions.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
KISS analogue of upstream llama.cpp's ccache-action "CCache Statistics" table,
but for this repo's sccache-over-Depot cache (ccache-action can't be dropped in:
it manages its own ccache + actions/cache backend, conflicting with the Depot
WebDAV design). build.sh/build.bat already print `sccache --show-stats` to the
log; now, when running in CI (GITHUB_STEP_SUMMARY set) and sccache was actually
the launcher, they also parse those stats and append a small markdown table:

  ### sccache statistics
  | Cache hits | Requests | Hit rate |
  |------------|----------|----------|
  | 589        | 600      | 98.2%    |

Per-job (GitHub does not merge job summaries), covering every native build job
uniformly — build.sh handles the dockcross/native-Linux/aarch64/vulkan-linux and
macOS jobs; build.bat the Windows jobs. Parses the text stats (top-level
"Compile requests" = total, top-level "Cache hits" = hits; the per-language
"Cache hits (C/C++)" line is skipped by the digit-anchored regex). Best-effort:
skipped silently if the numbers can't be parsed or there were no requests, and
never emitted for local runs (no GITHUB_STEP_SUMMARY) — so local builds are
untouched.
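
The parsing rule is easy to get subtly wrong, so here is the same idea expressed in Java for illustration (the real implementation lives in build.sh/build.bat, not Java). The sample text in main is a made-up approximation of sccache --show-stats output, and the exact regexes are assumptions.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative Java rendering of the digit-anchored stats parse. */
final class SccacheStatsSketch {
    // Label, blanks, then digits to end of line. "Compile requests executed"
    // and "Cache hits (C/C++)" fail because a non-digit follows the blanks.
    private static final Pattern REQUESTS =
            Pattern.compile("(?m)^Compile requests[ \\t]+(\\d+)[ \\t]*$");
    private static final Pattern HITS =
            Pattern.compile("(?m)^Cache hits[ \\t]+(\\d+)[ \\t]*$");

    /** Returns the markdown table, or empty when nothing usable was parsed. */
    static Optional<String> summaryTable(String stats) {
        Matcher r = REQUESTS.matcher(stats);
        Matcher h = HITS.matcher(stats);
        if (!r.find() || !h.find()) {
            return Optional.empty(); // best-effort: skip silently
        }
        long requests = Long.parseLong(r.group(1));
        long hits = Long.parseLong(h.group(1));
        if (requests == 0) {
            return Optional.empty(); // no requests, no table
        }
        String rate = String.format(Locale.ROOT, "%.1f%%", 100.0 * hits / requests);
        return Optional.of("### sccache statistics\n"
                + "| Cache hits | Requests | Hit rate |\n"
                + "|------------|----------|----------|\n"
                + "| " + hits + " | " + requests + " | " + rate + " |\n");
    }

    public static void main(String[] args) {
        String sample = "Compile requests                    600\n"
                + "Compile requests executed           595\n"
                + "Cache hits                          589\n"
                + "Cache hits (C/C++)                  589\n"
                + "Cache misses                          6\n";
        System.out.print(summaryTable(sample).orElse("(no table)\n"));
    }
}
```

With the sample above the hit rate is 589 / 600, which formats as 98.2%, matching the figure quoted in the verification paragraph.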

Verified the build.sh parse end-to-end against a realistic sccache --show-stats
sample (req=600, hits=589 -> 98.2%, with the "Compile requests executed" line
correctly excluded); build.sh passes bash -n. The Windows batch path (integer
math with rounding, escaped pipes/parens) is validated by CI on the Windows
runners.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
Additive-only upgrade — no incompatibilities, no project source changes.
Bumps GIT_TAG (and the TTS provenance banner), the README badge/link, and the
CLAUDE.md pinned-version line + build examples.

The b9862..b9864 diff is almost entirely the Svelte WebUI (tools/ui/**, which
auto-follows the pinned GIT_TAG via the build-webui CI job) plus one small server
change: a new per-request sse_ping_interval in the completion API (task_params
field + make_llama_cmpl_schema field + handle_completions_impl capture). It's
inside upstream-compiled server TUs the project already links; NativeServer mode
gets it for free, and the project binds no new symbol.

Patch verification: the diff touches exactly one patch-target file
(server-context.cpp, only in handle_completions_impl ~L4089, far below every
patched region). Patches 0002/0003/0005 were applied in sequence against the
actual b9864 server-context.{cpp,h} — all clean; server-context.h is unchanged,
and server-schema.cpp/server-task.h are not patch targets. Patches 0001/0004/0006
target files not in the changed-file list, so they apply unchanged. Confirmed
end-to-end by a clean cmake configure: b9864 fetched and all six patches applied
via the fail-loud PATCH_COMMAND (exit 0), OuteTTS generator anchors held.

Optional future work (documented in the breaking-changes history): expose
sse_ping_interval on the Java InferenceParameters — it would flow through the
OAI-compat completion path via eval_llama_cmpl_schema like any other field.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
…meters

Wires the b9864 per-request sse_ping_interval into the Java API and, from a
completion-schema audit, adds the other already-parseable-but-unexposed plain
scalars so callers of the OAI-compat completion path can set them.

New withers (each emits a JSON key honored by eval_llama_cmpl_schema):
- withSsePingInterval(int)  -> sse_ping_interval (b9864; -1 disables pings)
- withXtcProbability(float) / withXtcThreshold(float) -> XTC sampler
- withNDiscard(int)         -> n_discard (context-shift discard)
- withNIndent(int)          -> n_indent (infill indentation)
- withTMaxPredictMs(int)    -> t_max_predict_ms (generation time budget)
- withPostSamplingProbs(boolean) -> post_sampling_probs
- withTimingsPerToken(boolean)   -> timings_per_token
- withReturnTokens(boolean)      -> return_tokens
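
To show what "each emits a JSON key" means in practice, here is a toy wither class. It is not InferenceParameters: the copy-on-write design, the hand-rolled JSON, and the class name are assumptions for illustration; only the wither names and JSON keys are taken from the list above.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of withers that each contribute one JSON key (illustrative only). */
final class WitherSketch {
    private final Map<String, Object> fields;

    private WitherSketch(Map<String, Object> fields) {
        this.fields = Collections.unmodifiableMap(fields);
    }

    static WitherSketch empty() {
        return new WitherSketch(new LinkedHashMap<>());
    }

    // Copy-on-write: every wither returns a new instance (an assumed design).
    private WitherSketch with(String key, Object value) {
        Map<String, Object> copy = new LinkedHashMap<>(fields);
        copy.put(key, value);
        return new WitherSketch(copy);
    }

    WitherSketch withSsePingInterval(int seconds) { return with("sse_ping_interval", seconds); }
    WitherSketch withNDiscard(int n)              { return with("n_discard", n); }
    WitherSketch withReturnTokens(boolean on)     { return with("return_tokens", on); }

    /** Serializes numbers and booleans only, which is all this sketch stores. */
    String toJson() {
        StringBuilder sb = new StringBuilder("{");
        for (Map.Entry<String, Object> e : fields.entrySet()) {
            if (sb.length() > 1) {
                sb.append(',');
            }
            sb.append('"').append(e.getKey()).append("\":").append(e.getValue());
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        // -1 disables SSE pings, per the list above.
        System.out.println(empty().withSsePingInterval(-1).withReturnTokens(true).toJson());
    }
}
```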

Audit method: extracted every field name from b9864's make_llama_cmpl_schema and
diffed against the InferenceParameters keys. t_max_prompt_ms was deliberately
skipped (commented out upstream, so not parseable). The
remaining unexposed fields are OAI aliases already covered (max_tokens/
max_completion_tokens -> n_predict) or OAI/server-internal / array-shaped /
advanced knobs (n, logprobs, echo, verbose, include_usage, return_progress,
response_fields, lora, grammar_lazy/grammar_triggers/preserved_tokens,
chat_format, parse_tool_calls, reasoning_control, backend_sampling, adaptive_*),
left out on purpose and documented in the breaking-changes history.

Tests: +2 Java wither tests (InferenceParametersTest -> 104 pass) and +3 C++
schema round-trip guards in test_server.cpp pinning that the native parser
honors sse_ping_interval (round-trip, -1 disables, below-hard-limit throws,
absent inherits the server default) -> full C++ suite 462 pass (was 459).
javadoc (llama module) + spotless + clang-format all clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
…arm64

Fixes three PR #293 CI failures.

1. SpotBugs (spotbugs-exclude.xml): the branch's new server classes tripped 14
   fb-contrib/findsecbugs findings not yet covered by the exclude file. All are
   false-positives or intentional tradeoffs mirroring suppressions the project
   already applies elsewhere. Added six method-scoped Match blocks with rationale:
   - NativeServer: IMC_IMMATURE_CLASS_NO_EQUALS, MDM_THREAD_YIELD (main() poll),
     UVA_USE_VAR_ARGS (native JNI method), WEM_WEAK_EXCEPTION_MESSAGING.
   - ServerLauncher.main: THROWS_METHOD_THROWS_CLAUSE_BASIC_EXCEPTION.
   - OpenAiServerCli parse/usage/parseChatTemplateKwargs: ENMI_NULL_ENUM_VALUE
     (@nullable CacheType unset sentinel), POTENTIAL_XML_INJECTION + PRMC (plain-
     text usage() help, no XML), EXS_EXCEPTION_SOFTENING (Jackson->CLI error, cause
     chained), PSC_PRESIZE_COLLECTIONS.
   - OpenAiServerCli$Options.getChatTemplateKwargs: EI_EXPOSE_REP (returns an
     already-unmodifiable map).
   Verified: mvn spotbugs:check -> BugInstance size 0, BUILD SUCCESS.

2. Linux Vulkan (publish.yml): both build-linux-*-vulkan jobs failed
   find_package(SPIRV-Headers CONFIG REQUIRED) because the apt install omitted it.
   Added spirv-headers to both apt lines (exact parity with upstream llama.cpp's
   build-vulkan.yml: 'glslc libvulkan-dev spirv-headers').

3. Windows arm64 (publish.yml + CLAUDE.md): ggml aborts 'MSVC is not supported for
   ARM, use clang'. The generator (Ninja) was never the issue — the compiler was.
   Switched the job to clang-cl (-DCMAKE_C_COMPILER=clang-cl
   -DCMAKE_CXX_COMPILER=clang-cl): it satisfies ggml's guard
   (if (MSVC AND NOT CMAKE_C_COMPILER_ID STREQUAL "Clang")) while keeping
   CMake's MSVC=TRUE, so our static /MT CRT block still applies and the runner +
   Ninja + ctest all stay. msvc-dev-cmd (arm64) supplies the MSVC headers/libs.
   First-run watch item: clang-cl must be on PATH (VS Clang component / LLVM); if
   CI reports it missing, add an LLVM setup step.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
Two LlamaArchitectureTest rules failed on PR #293 because of this branch's server
additions:

- layeredArchitecture (12 violations): the branch adds two legitimate new edges
  out of the Server layer -- Server -> Args (OpenAiServerCli maps -ctk/-ctv to the
  args.CacheType enum) and Server -> Loader (NativeServer.start() calls
  LlamaLoader.initialize() before launching the embedded native server). The rule
  documents itself as the EXACT set of accessors today, to be updated when a new
  dependency is intended, so Server is added to the Loader and Args
  mayOnlyBeAccessedByLayers lists (+ a doc note). Server remains the only layer
  allowed to reach the Api root and stays mayNotBeAccessedByAnyLayer.

- noThreadSleep (1 violation): NativeServer.main() kept the JVM alive with a
  while(isRunning()) Thread.sleep(200) poll loop. The rule bans Thread.sleep and has
  no suppression seam (it prefers Condition.await/poll), so main() now blocks on a
  bounded CountDownLatch.await(200ms) signalled by the shutdown hook. This is also a
  behavioural improvement: Ctrl-C/SIGTERM wakes the wait immediately instead of after
  up to a 200 ms tick, while the timeout still re-checks isRunning() to catch a
  self-terminated native worker.
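
The bounded wait can be sketched like this. It is a reconstruction from the description, not the NativeServer.main source: the helper name, its parameters, and the interrupt handling are assumptions, and in the real code the shutdown hook presumably also asks the native server to stop.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

/** Illustrative replacement for a Thread.sleep poll loop. */
final class BoundedWaitSketch {
    /**
     * Blocks until the shutdown latch is released or the worker stops by itself.
     *
     * @return true if woken by the shutdown signal, false if the worker ended on its own
     */
    static boolean awaitExit(BooleanSupplier isRunning, CountDownLatch shutdown, long tickMillis) {
        while (isRunning.getAsBoolean()) {
            try {
                // Wakes immediately on countDown(); otherwise times out and re-checks.
                if (shutdown.await(tickMillis, TimeUnit.MILLISECONDS)) {
                    return true;
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return true; // treat interruption as a stop request (assumed policy)
            }
        }
        return false;
    }

    public static void main(String[] args) {
        CountDownLatch shutdown = new CountDownLatch(1);
        Runtime.getRuntime().addShutdownHook(new Thread(shutdown::countDown));
        // Fake worker that reports "running" for two checks, then exits by itself.
        AtomicInteger checks = new AtomicInteger();
        boolean signalled = awaitExit(() -> checks.incrementAndGet() < 3, shutdown, 200);
        System.out.println("signalled = " + signalled);
    }
}
```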

Verified: LlamaArchitectureTest 12/12 pass; server-package tests 44/44 pass
(ServerLauncher, OpenAiServerCli, NativeServerSmoke); javadoc + spotless clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
The clang-cl fix worked (jllama.dll linked for Windows/aarch64), but the next
step failed at test discovery: gtest_discover_tests could not launch
jllama_test.exe -> exit 0xc0000135 (STATUS_DLL_NOT_FOUND).

Root cause: with clang-cl, ggml links LLVM's OpenMP runtime (libomp.lib ->
needs libomp140.aarch64.dll at run time). Unlike MSVC's ambient vcomp140.dll on
x64, that DLL is not on PATH, so neither the test exe nor a consumer could load
the binary. (Upstream llama.cpp works around this by copying libomp140.aarch64.dll
next to its arm64 output.)

Fix: pass -DGGML_OPENMP=OFF for the arm64 job. ggml falls back to its own
std::thread threadpool, so both jllama_test.exe and the shipped arm64 jllama.dll
are self-contained with no libomp dependency to ship — cleaner than bundling an
LLVM OpenMP DLL into the default JAR. The x86_64/x86 jobs keep OpenMP (MSVC vcomp,
which is ambient and already proven). Also updated the job comment + CLAUDE.md to
record that VC\Tools\Llvm\ARM64 supplies clang-cl/lld-link (no separate LLVM
install needed) and the OpenMP rationale.

The getenv/strdup/ctime deprecation messages in the same log are warnings only
(clang-cl flagging POSIX names against the MSVC UCRT headers), not the failure.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
Trivial additive upgrade — no incompatibilities, no project source changes.
Bumps GIT_TAG (+ the TTS provenance banner), the README badge/link, and the
CLAUDE.md pinned-version line + build examples.

The b9864..b9866 diff is backend/WebUI-only: the CUDA topk-moe kernel gains a
case 288 instantiation + accepts n_expert==288 (StepFun 3.7's non-power-of-2
expert count) — device-side, affecting only the cuda13 classifiers; a
test-backend-ops.cpp case (not built here, LLAMA_BUILD_TESTS OFF); and WebUI
changes (a config string-boolean normalization migration + a thinking-default
flip) that auto-follow the pinned GIT_TAG via the build-webui job. The project
binds no new symbol.

Patch verification: the diff touches no patch-target file and no OuteTTS anchor,
so all six patches are byte-identical to b9864. Confirmed end-to-end by a clean
cmake configure: b9866 fetched (case 288 present) and all six patches applied via
the fail-loud PATCH_COMMAND (exit 0; 0005 + 0006 markers present), OuteTTS anchors
held.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
Add a diff-size-driven bump workflow so a version upgrade never lands an
unreviewably large diff in one step.

- .github/scripts/llama-next-version.sh: read-only helper that computes the
  next reviewable step. Reads the current pin from llama/CMakeLists.txt and
  the target from an explicit b<nnnn> arg or the GitHub releases atom feed,
  against a cached blobless mirror clone. If git diff cur..target is under the
  threshold (LLAMA_BUMP_MAX_DIFF_KB, default 100 KiB) it bumps straight to the
  target; otherwise it binary-searches the intermediate tags for the largest
  one still under the threshold and prints that chunk plus its compare/.patch
  URLs. LLAMA_BUMP_EXCLUDE_WEBUI sizes the diff excluding the auto-followed
  tools/ui WebUI.
- docs/upgrade/llama-cpp-version-bump.md: the runbook (documentation root) for
  target selection, byte-size chunking, the helper, and the edit/verify/commit
  loop.
- CLAUDE.md: link the runbook from the Upgrading/Downgrading section.
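
The chunk selection described above can be sketched as a binary search. This is a minimal Java illustration, not the helper itself (which is a shell script): `BumpChunker`, `largestUnderThreshold`, and the `diffBytes` callback are hypothetical names, and the sketch assumes the diff size from the current pin grows monotonically with tag order, which is what makes a binary search valid.

```java
import java.util.List;
import java.util.function.ToLongFunction;

final class BumpChunker {
    /**
     * Returns the index of the last tag whose diff from the current pin is
     * strictly under maxBytes, or -1 if even the first tag is too large.
     * Assumption: diffBytes is non-decreasing along the tag list.
     */
    static int largestUnderThreshold(List<String> tags,
                                     ToLongFunction<String> diffBytes,
                                     long maxBytes) {
        int lo = 0;
        int hi = tags.size() - 1;
        int best = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (diffBytes.applyAsLong(tags.get(mid)) < maxBytes) {
                best = mid;      // mid fits; try a later tag
                lo = mid + 1;
            } else {
                hi = mid - 1;    // mid is too large; try an earlier tag
            }
        }
        return best;
    }
}
```

In this sketch the callback stands in for measuring `git diff cur..tag`; each probe costs one diff, so the search needs only a logarithmic number of diffs over the intermediate tags.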

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
First bump driven by the new .github/scripts/llama-next-version.sh helper:
b9866 -> b9867 is a 2 KiB single-commit final chunk (well under the 100 KiB
threshold), so it bumps straight to the latest release.

b9867 (spec: support spec-draft-p-min in DFlash) changes only
common/speculative.cpp: the DFlash draft path now also clamps n_min to the
block size, raises the draft sampler top_k 1 -> 10, stops drafting when the
top candidate probability drops below p_min, and discards a step producing
fewer than n_min tokens. All three use existing common_speculative_params
fields; common/speculative.h is untouched. Entirely inside upstream-compiled
common; the project binds no common_speculative_* symbol. No project source
changes required.
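
The draft-step rules listed above can be pictured with a loose Java sketch. It is an illustration of the conditions as this message describes them, not the upstream C++ in common/speculative.cpp: `DraftStepSketch`, `draft`, and the array-based inputs are hypothetical, and the sampler's top_k change is not modeled.

```java
import java.util.ArrayList;
import java.util.List;

final class DraftStepSketch {
    /**
     * tokens[i] is the top candidate at draft position i and topProb[i] its
     * probability (hypothetical inputs standing in for the draft sampler).
     */
    static List<Integer> draft(int[] tokens, double[] topProb,
                               int blockSize, int nMin, double pMin) {
        int minTokens = Math.min(nMin, blockSize);        // clamp n_min to the block size
        int limit = Math.min(blockSize, tokens.length);
        List<Integer> drafted = new ArrayList<>();
        for (int i = 0; i < limit; i++) {
            if (topProb[i] < pMin) {
                break;                                    // top candidate below p_min: stop drafting
            }
            drafted.add(tokens[i]);
        }
        if (drafted.size() < minTokens) {
            drafted.clear();                              // fewer than n_min tokens: discard the step
        }
        return drafted;
    }
}
```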

Re-verified all six patches (0001-0006) apply cleanly against b9867 via a
fresh fail-loud cmake PATCH_COMMAND configure (0005/0006 markers present);
OuteTTS generator anchors held. Appended the b9866->b9867 history rows.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
The new docs/upgrade/llama-cpp-version-bump.md lacked copyright/licensing
info, failing the License Compliance (REUSE) check. Add the top-of-file
HTML-comment SPDX block used by the sibling docs (docs/history/*.md,
docs/feature-investigation-*.md). reuse lint now reports 310/310 files
compliant with REUSE 3.3.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
Driven by .github/scripts/llama-next-version.sh: b9867 -> b9870 is a 21 KiB
(11 KiB excl. WebUI) three-commit final chunk, under the 100 KiB threshold,
so it bumps straight to the latest release.

The only source edit in b9867..b9870 is common/chat.cpp: a StepFun
message-content whitespace workaround (issue #24181) that trims leading and
trailing whitespace from each common_chat_msg content, reasoning_content and
text content-part before Jinja rendering, detected by the StepFun template
signature. It uses existing common_chat_msg fields; common/chat.h is
untouched. The removed stepfun-ai-Step-3.5-Flash.jinja template and the
test-chat additions are not built here (LLAMA_BUILD_TESTS OFF); tools/ui is
the auto-followed WebUI. No project source changes required.

Re-verified all six patches (0001-0006) apply cleanly against b9870 via a
fresh fail-loud cmake PATCH_COMMAND configure (0005/0006 markers and the
b9870 trim_all_content change present); OuteTTS generator anchors held.
Appended the b9867->b9870 history rows.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
…VINO

Wire eight new GPU-backend classifiers following the exact same 5-place pattern
as the existing CUDA/Vulkan classifiers (fail-loud, in package.needs, no
continue-on-error, no special cases):

  rocm-linux-x86-64        GGML_HIP        Linux    x86_64   (AMD ROCm/HIP)
  rocm-windows-x86-64      GGML_HIP        Windows  x86_64   (AMD HIP SDK)
  sycl-fp16-linux-x86-64   GGML_SYCL+F16   Linux    x86_64   (Intel oneAPI, fp16)
  sycl-fp32-linux-x86-64   GGML_SYCL       Linux    x86_64   (Intel oneAPI, fp32)
  sycl-windows-x86-64      GGML_SYCL       Windows  x86_64   (Intel oneAPI)
  opencl-windows-aarch64   GGML_OPENCL     Windows  aarch64  (Snapdragon/Adreno)
  openvino-linux-x86-64    GGML_OPENVINO   Linux    x86_64   (Intel OpenVINO)
  openvino-windows-x86-64  GGML_OPENVINO   Windows  x86_64   (Intel OpenVINO)

- llama/CMakeLists.txt: extend the OS-aware backend routing with GGML_HIP,
  GGML_SYCL (Linux fp16/fp32 split by GGML_SYCL_F16) and GGML_OPENVINO branches.
- llama/pom.xml: eight classifier profiles; the existing opencl-windows include
  is now arch-scoped to Windows/x86_64 so the new aarch64 OpenCL build sharing
  the resources_windows_opencl tree does not leak into it (vulkan-linux split
  precedent).
- .github/workflows/publish.yml: eight build jobs (build-only; GitHub runners
  have no matching GPU), all added to package.needs and to the download +
  profile-activation steps of package/publish-snapshot/publish-release. Vendor
  toolchain installs are first-pass and intentionally fail loud if a URL/version
  is stale.
- README.md + CLAUDE.md: classifier table rows, dependency snippets, and a
  wiring/routing section. .gitignore: the seven new resources_* trees (seven,
  not eight, because opencl-windows-aarch64 shares resources_windows_opencl).

All build-only, no vendor runtime bundled (consumer's driver/toolkit supplies
it). Validated locally: CMake CPU reconfigure parses the extended routing,
Maven recognizes all 8 profiles, publish.yml is valid YAML, pom.xml is
well-formed, REUSE compliant.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
Mirror upstream llama.cpp's own release-job recipes for the Windows SYCL and
Windows HIP builds, and fix the two OpenVINO installs:

- Windows ROCm/HIP: the AMD HIP SDK URL returned 404 and find_package(hip)
  could not locate the SDK. Use HIP SDK 26.Q1 (upstream's pin), resolve
  HIP_PATH from the installed ROCm dir, and pass -DCMAKE_PREFIX_PATH plus the
  SDK's own clang/clang++ so ggml-hip's find_package(hip) resolves. The target
  list is passed as GPU_TARGETS (upstream's spelling).
- Windows SYCL: the oneAPI offline installer URL returned 403. Use upstream's
  intel-deep-learning-essentials-2025.3.3.18 offline installer with the
  extract + bootstrapper silent install (DPC++/MKL/oneDNN/TBB components), then
  setvars intel64 --force and build with cl (C) + icx (C++), matching upstream.
- Linux OpenVINO: OpenVINOConfig.cmake's find_package(TBB) failed. Add
  libtbb-dev (supplies TBBConfig.cmake).
- Windows OpenVINO: the archive extracts into a nested versioned folder, so the
  hard-coded C:\openvino\runtime\cmake did not exist. Resolve the nested dir and
  pass -DOpenVINO_DIR explicitly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
…O OpenCL

Second round of fail-loud CI fixes for the new GPU classifiers, from the actual
build logs:

- Windows ROCm/HIP: device-code compile failed because ROCm 7.1's HIP clang
  headers cannot overload the __host__ __device__ isgreater/isless/... that the
  very new VS 2026 MSVC <cmath> declares via _CLANG_BUILTIN2. Move the job to
  windows-2022 (MSVC 14.4x), which is what upstream llama.cpp uses for win-hip.
- Windows SYCL: icx rejected the project's static /MT CRT with '-fsycl'
  ("invalid argument 'MT' not allowed with '-fsycl'"). Exempt GGML_SYCL (and
  GGML_OPENVINO, whose import libs are /MD) from the static-CRT force in
  CMakeLists so they build with the dynamic /MD runtime. Those classifiers
  already need the vendor runtime on the host, so the self-contained-DLL
  rationale doesn't apply; CPU + CUDA/Vulkan/OpenCL keep /MT.
- Linux OpenVINO: past the TBB fix, ggml-openvino's find_package(OpenCL) failed.
  Add ocl-icd-opencl-dev + opencl-headers to the apt install.
- Windows OpenVINO: same find_package(OpenCL) need — build it via
  build_opencl_windows.bat (stages the Khronos headers + OpenCL.lib, then
  delegates to build.bat) instead of build.bat directly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
The OpenVINO backend now reaches compilation but fails there on both platforms:
- ov::Allocator template error (allocate/is_equal on a const void*): version
  mismatch. llama.cpp's ggml-openvino targets OpenVINO 2026.2.1 (what upstream
  ships), not the 2025.0.0 I pinned. Bump Linux apt to openvino-2026.2.1 (repo
  /openvino/2026) and the Windows archive to 2026.2.1.
- Windows 'CL/cl2.hpp' not found: the staged Khronos OpenCL-Headers dropped
  cl2.hpp. Install OpenCL via vcpkg (opencl:x64-windows ships cl2.hpp) and pass
  the vcpkg toolchain file, mirroring upstream's windows-openvino job; drop the
  build_opencl_windows.bat staging for this job.
- Linux: add opencl-clhpp-headers + intel-opencl-icd to the apt set (upstream's
  full OpenCL package list for ubuntu-openvino).

Also raise cmake_minimum_required 3.15 -> 3.22 to match what the build actually
relies on (runners ship 3.31); no behavior change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
…2026 apt repo

The apt repo https://apt.repos.intel.com/openvino/2026 returns 404 — Intel only
publishes OpenVINO apt repos up to ~2025, and 2025.x has the older ov::Allocator
API that breaks ggml-openvino's template compile. Switch Linux OpenVINO to the
archive for 2026.2.1, exactly as upstream llama.cpp's linux-setup-openvino
composite action does:
  storage.openvinotoolkit.org/repositories/openvino/packages/2026.2.1/linux/
  openvino_toolkit_ubuntu24_2026.2.1.21919.ede283a88e3_x86_64.tgz
extracted to /opt/intel/openvino, with OpenVINO_DIR set to its runtime/cmake.
OpenCL headers (incl. the C++ CL/cl2.hpp via opencl-clhpp-headers) come from
Ubuntu's own repos, so no Intel apt repo is needed at all.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
Wire build-linux-s390x: cross-compile for IBM Z (s390x, big-endian) with the
GCC cross toolchain (native x86 speed), then run the full 462-test C++ suite
under qemu-user as a real big-endian correctness gate for our byte-order-
sensitive code (the little-endian WAV writer, JSON/token/embedding transforms,
JNI helpers). Model-backed Java tests are not run under emulation (slow/flaky);
the Java<->JNI boundary uses host-native array copies, so the C++ gate covers
the actual endian risk.
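
To illustrate why a little-endian writer needs a big-endian gate: a field such as a WAV sample rate has to be emitted low byte first regardless of the host's byte order. A minimal Java sketch follows, assuming a hypothetical helper `u32le`; the project's real writer is C++ and is not reproduced here.

```java
final class LittleEndianSketch {
    /**
     * Encodes the low 32 bits of value as four little-endian bytes.
     * Explicit shifts make the result independent of host byte order,
     * unlike copying the native in-memory representation.
     */
    static byte[] u32le(long value) {
        return new byte[] {
            (byte) (value & 0xFF),
            (byte) ((value >>> 8) & 0xFF),
            (byte) ((value >>> 16) & 0xFF),
            (byte) ((value >>> 24) & 0xFF)
        };
    }
}
```

For example, 16000 (0x3E80) encodes as the bytes 0x80, 0x3E, 0x00, 0x00. A writer that instead copied a native 32-bit integer would produce the reverse order on s390x, which is the kind of defect the qemu-user test run is meant to catch.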

- publish.yml: build-linux-s390x (g++-s390x-linux-gnu + qemu-user-static;
  CMAKE_CROSSCOMPILING_EMULATOR + QEMU_LD_PREFIX make ctest run the s390x exe;
  GGML_OPENMP=OFF avoids cross-libgomp). s390x is a default-jar CPU platform
  like aarch64, so the artifact merges via the *-libraries glob (no classifier
  / pom profile). Fail-loud and in package.needs.
- OSInfo.java: map os.arch=s390x -> Linux/s390x (S390X constant + archMapping).
- README/CLAUDE.md: document the platform + the big-endian gate.
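
The os.arch mapping mentioned for OSInfo.java might look roughly like the following. This is a hypothetical sketch, not the actual OSInfo source: the class name, the `resolve` method, and every table entry other than s390x are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class ArchMappingSketch {
    static final String S390X = "s390x";
    static final String X86_64 = "x86_64";

    // Hypothetical table from os.arch values to native-library folder names.
    private static final Map<String, String> ARCH_MAPPING = new HashMap<>();
    static {
        ARCH_MAPPING.put("s390x", S390X);
        ARCH_MAPPING.put("amd64", X86_64);
        ARCH_MAPPING.put("x86_64", X86_64);
    }

    /** Maps an os.arch value to a folder name, falling back to the lower-cased input. */
    static String resolve(String osArch) {
        String key = osArch.toLowerCase(Locale.ROOT);
        return ARCH_MAPPING.getOrDefault(key, key);
    }
}
```

A caller would typically pass `System.getProperty("os.arch")` and combine the result with the OS name to locate the bundled library.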

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
@bernardladenthin bernardladenthin changed the title Add NativeServer mode, Vulkan Linux classifiers, and llama.cpp b9864 features NativeServer mode, GPU/CPU classifier matrix (ROCm/SYCL/OpenVINO/Win-arm64/s390x), llama.cpp b9870 + bump automation Jul 4, 2026
Address 'Resources should be closed' (java:S2095) on NativeServer.main. The
server was already closed on every real path (shutdown hook on SIGTERM, explicit
close on self-termination), but not in a structure Sonar recognizes. Wrap the
body in try/finally so close() is guaranteed on normal or exceptional exit —
S2095's 'close in a finally clause' option.

try-with-resources is deliberately NOT used: the shutdown hook must also call
close() explicitly, which javac flags under -Werror as 'explicit call to close()
on an auto-closeable resource'. close() is idempotent (guards on a zero handle),
so the finally and the hook both firing is safe. The now-redundant stoppedByHook
flag is dropped. All 7 NativeServerSmokeTest cases still pass.
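
The shape described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the NativeServer source: `ServerHandleSketch`, `runGuarded`, and the release counter are hypothetical, and an `AtomicLong` stands in for the native handle.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

final class ServerHandleSketch implements AutoCloseable {
    private final AtomicLong handle;                          // stand-in for a native pointer
    private final AtomicInteger releases = new AtomicInteger();

    ServerHandleSketch(long nativeHandle) {
        this.handle = new AtomicLong(nativeHandle);
    }

    /** Idempotent: only the first caller sees a non-zero handle and releases it. */
    @Override
    public void close() {
        long h = handle.getAndSet(0L);
        if (h == 0L) {
            return;                                           // already closed
        }
        releases.incrementAndGet();                           // real code would free the native server here
    }

    int releaseCount() {
        return releases.get();
    }

    /** Runs body with close() guaranteed by both a shutdown hook and a finally clause. */
    static void runGuarded(ServerHandleSketch server, Runnable body) {
        Runtime.getRuntime().addShutdownHook(new Thread(server::close));
        try {
            body.run();
        } finally {
            server.close();                                   // normal or exceptional exit
        }
    }
}
```

Because the variable is not a try-with-resources resource, the explicit close() calls do not trigger the javac lint mentioned above, and the zero-handle guard makes a second call from the hook a no-op.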

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
@sonarqubecloud

sonarqubecloud Bot commented Jul 4, 2026

Quality Gate failed

Failed conditions
78.9% Coverage on New Code (required ≥ 80%)

See analysis details on SonarQube Cloud

@bernardladenthin bernardladenthin merged commit d8523a4 into main Jul 4, 2026
12 of 15 checks passed
@bernardladenthin bernardladenthin deleted the claude/granite-4-llama-cpp-decode-kk9xjp branch July 4, 2026 14:29