bernardladenthin · bernardladenthin · Jun 7, 2026 · Jun 7, 2026 · Jun 7, 2026 · Jun 7, 2026
@@ -56,6 +56,25 @@ jobs:
   # Cross-compile jobs (Docker / dockcross) — produce release artifacts, no testing
   # ---------------------------------------------------------------------------
 
+  code-style:
+    name: Code style (spotless) + package graph
+    needs: startgate
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v6
+      - uses: actions/setup-java@v5
+        with:
+          java-version: '21'
+          distribution: temurin
+      - name: Spotless check (fail fast on format violations)
+        run: mvn -B --no-transfer-progress spotless:check
+      - name: Print internal package dependency graph (jdeps, informational)
+        continue-on-error: true
+        run: |
+          mvn -B --no-transfer-progress -DskipTests -Denforcer.skip=true compile
+          echo "=== internal package dependency graph (jdeps, bytecode) ==="
+          jdeps -verbose:package target/classes | grep 'net.ladenthin.llama' || true
+
   crosscompile-linux-x86_64-cuda:
     name: Cross-Compile manylinux_2_28 x86_64 (CUDA)
     needs: startgate
@@ -822,7 +841,7 @@ jobs:
 
   publish-snapshot:
     name: Publish Snapshot to Central
-    needs: [check-snapshot, crosscompile-linux-x86_64-cuda, crosscompile-android-aarch64-opencl]
+    needs: [check-snapshot, crosscompile-linux-x86_64-cuda, crosscompile-android-aarch64-opencl, code-style]
     if: needs.check-snapshot.result == 'success'
     runs-on: ubuntu-latest
     environment: maven-central
@@ -898,7 +917,7 @@ jobs:
   publish-release:
     name: Publish Release to Central
     if: needs.check-tag.result == 'success'
-    needs: [check-tag, crosscompile-linux-x86_64-cuda, crosscompile-android-aarch64-opencl]
+    needs: [check-tag, crosscompile-linux-x86_64-cuda, crosscompile-android-aarch64-opencl, code-style]
     runs-on: ubuntu-latest
     environment: maven-central
     permissions:

@@ -6,7 +6,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 
 Java bindings for [llama.cpp](https://github.com/ggerganov/llama.cpp) via JNI, providing a high-level API for LLM inference in Java. The Java layer communicates with a native C++ library through JNI.
 
-Current llama.cpp pinned version: **b9543**
+Current llama.cpp pinned version: **b9549**
 
 ## Upgrading CUDA Version
 

@@ -114,7 +114,7 @@ set(LLAMA_BUILD_APP OFF CACHE BOOL "" FORCE)
 FetchContent_Declare(
 	llama.cpp
 	GIT_REPOSITORY https://github.com/ggerganov/llama.cpp.git
-	GIT_TAG        b9543
+	GIT_TAG        b9549
 )
 FetchContent_MakeAvailable(llama.cpp)
 
@@ -159,7 +159,7 @@ if(NOT DEFINED OS_NAME)
     find_package(Java REQUIRED)
     find_program(JAVA_EXECUTABLE NAMES java)
 	execute_process(
-      COMMAND ${JAVA_EXECUTABLE} -cp ${CMAKE_SOURCE_DIR}/target/classes net.ladenthin.llama.OSInfo --os
+      COMMAND ${JAVA_EXECUTABLE} -cp ${CMAKE_SOURCE_DIR}/target/classes net.ladenthin.llama.loader.OSInfo --os
       OUTPUT_VARIABLE OS_NAME
       OUTPUT_STRIP_TRAILING_WHITESPACE
     )
@@ -177,7 +177,7 @@ if(NOT DEFINED OS_ARCH)
     find_package(Java REQUIRED)
     find_program(JAVA_EXECUTABLE NAMES java)
     execute_process(
-      COMMAND ${JAVA_EXECUTABLE} -cp ${CMAKE_SOURCE_DIR}/target/classes net.ladenthin.llama.OSInfo --arch
+      COMMAND ${JAVA_EXECUTABLE} -cp ${CMAKE_SOURCE_DIR}/target/classes net.ladenthin.llama.loader.OSInfo --arch
       OUTPUT_VARIABLE OS_ARCH
       OUTPUT_STRIP_TRAILING_WHITESPACE
     )

@@ -1,7 +1,7 @@
 **Build:**  
 ![Java 8+](https://img.shields.io/badge/Java-8%2B-informational)  
 ![Platform](https://img.shields.io/badge/Platform-Linux%20%7C%20macOS%20%7C%20Windows%20%7C%20Android-lightgrey)  
-[![llama.cpp b9543](https://img.shields.io/badge/llama.cpp-%23b9543-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9543)  
+[![llama.cpp b9549](https://img.shields.io/badge/llama.cpp-%23b9549-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9549)  
 [![JPMS](https://img.shields.io/badge/JPMS-modular%20JAR-25A162)](https://openjdk.org/projects/jigsaw/)  
 ![JUnit](https://img.shields.io/badge/tested%20with-JUnit6-25A162)  
 [![JSpecify](https://img.shields.io/badge/JSpecify-1.0.0%20%40NullMarked-25A162)](https://jspecify.dev)  

@@ -69,12 +69,41 @@ These are JNI plumbing items for upstream API additions. Policy: add only after
   (`07109cc`): 25 sites. The same rule is suppressed in BAF
   (`52c8c95`) for identical reasons.
 
-- **Additional ArchUnit rules to consider** — layered-architecture rules (`layeredArchitecture().consideringAllDependencies()`), per-module banned-imports lists, public-API-surface constraints (no public mutable static state, etc.). Partial progress: `7b6667d` covers the "no public field that is not final" sub-rule.
+- **Additional ArchUnit rules to consider** — the full **`layeredArchitecture()`** rule and a **per-module banned-import** rule (`jacksonBannedFromContractsAndLoader` — Jackson kept out of `args`/`callback`/`exception`/`loader`) are now DONE. Still open: more per-module banned-imports if useful, public-API-surface constraints (no public mutable static state, etc.). Partial progress: `7b6667d` covers the "no public field that is not final" sub-rule.
 
 - **Cross-repo code-quality TODOs** — see [`../workspace/policies/code-quality-todos.md`](../workspace/policies/code-quality-todos.md) for the canonical `@VisibleForTesting` design-fit review, package hierarchy review, and class/method naming review. This repo has no `@VisibleForTesting` usages today; package and naming reviews remain open.
 
 ## Done (kept for history)
 
+### Layered package restructure (flat root package → layered hierarchy)
+
+The flat `net.ladenthin.llama` root package was split (via `git mv`, history
+preserved) into layered packages so boundaries align with the layers, enforced
+by a new `layeredArchitecture()` ArchUnit rule (Api → Loader → Marshalling →
+Foundation):
+
+- **Foundation**: `value` (18 DTOs: ChatMessage, ContentPart, Pair, LlamaOutput,
+  …), `callback` (CancellationToken, LoadProgressCallback, ToolHandler),
+  `exception` (LlamaException, ModelUnavailableException), `args` (existing leaf).
+- **Marshalling**: `json` (response parsers + `TimingsLogger`, its only consumer),
+  `parameters` (Inference/Model/Json/Cli parameters + `ParameterJsonSerializer` +
+  `ChatRequest`).
+- **Loader** (internal, NOT exported): `loader` (LlamaLoader, OSInfo,
+  ProcessRunner, NativeLibraryPermissionSetter, Java8CompatibilityHelper,
+  SkipDownloadFailureTranslator, LlamaSystemProperties).
+- **Api** (root): LlamaModel, Session, LlamaIterable, LlamaIterator.
+
+Cycle-breaking moves: `TimingsLogger` root→`json`, `ParameterJsonSerializer`
+`json`→`parameters`, `ChatRequest` root→`parameters` (it carries an
+`InferenceParameters` customizer). Test classes mirrored into their subjects'
+packages; cross-layer members promoted to `public`. Cross-package Javadoc
+`{@link}` references fully-qualified (palantir's `removeUnusedImports` strips
+javadoc-only imports). `module-info` exports the new public-API packages and
+keeps `loader` internal. All 11 ArchUnit rules green; `javadoc:jar` clean.
+
+**Breaking change**: public-API FQNs changed (e.g. `net.ladenthin.llama.ChatMessage`
+→ `net.ladenthin.llama.value.ChatMessage`) — ship under a major version bump.
+
 - **Reactive `LlamaPublisher` removed in favour of consumer-side adapters.**
   The hand-rolled `LlamaPublisher` + `LlamaModel.streamPublisher` /
   `streamChatPublisher` (shipped in PR #188 as §2.3 of the Kotlin SDK
@@ -95,7 +124,7 @@ These are JNI plumbing items for upstream API additions. Policy: add only after
 - **`javac -Werror` + `-Xlint:all,-serial,-options,-classfile,-processing`** — `3e2efbb`. ~20 EP warnings addressed first (EqualsGetClass on `Pair` via instanceof; MissingOverride on `PoolingType` / `RopeScalingType`; JdkObsolete `LinkedList` → `ArrayList` in `LlamaLoader`; StringSplitter inline-suppressed; 3× StringCaseLocaleUsage `Locale.ROOT` in `OSInfo`; EmptyCatch in `OSInfo.isAlpineLinux`; FutureReturnValueIgnored in `LlamaModel.completeAsync`; Finalize on `LlamaModel.finalize`; MixedMutabilityReturnType in 4 parser methods; EnumOrdinal in `InferenceParameters.setMiroStat`; EscapedEntity in `InferenceParameters` javadoc; 4× TypeParameterUnusedInFormals; AnnotateFormatMethod on `Java8CompatibilityHelper.formatted`; SafeVarargs + varargs on `Java8CompatibilityHelper.listOf`).
 - **`-parameters` javac arg** — `4350cf2`.
 - **`--release N`** — `4350cf2` (`<release>8</release>`).
-- **Mutation-testing threshold enforcement (PIT)** — `62f8a00` + `bb93a8f` (docs) + `3bfa51f` (README badge). "Single class, full plumbing" pattern: PIT runs every CI build with `<mutationThreshold>100</mutationThreshold>`, `<targetClasses>` narrowed to `net.ladenthin.llama.Pair`.
+- **Mutation-testing threshold enforcement (PIT)** — `62f8a00` + `bb93a8f` (docs) + `3bfa51f` (README badge). Runs every CI build with `<mutationThreshold>100</mutationThreshold>`. **Scope expanded 2026-06-07** from the original single `Pair` target (which was stale after the restructure — `llama.Pair`→`value.Pair` matched nothing) to `value.*` + `exception.*` + `args.*` + `json.TimingsLogger` = 27 classes / 163 mutations, all killed. Still open (optional): `json.ChatResponseParser` / `CompletionResponseParser` private-helper survivors (`RerankResponseParser` is excluded — equivalent empty-list mutant).
 - **Checker Framework as a second static-nullness pass** — `c63870b`. The original
   `@PolyNull` on `JsonParameters.toJsonString` was simplified to plain `@Nullable`
   (the only `@PolyNull` site in production; eliminated in a later cleanup).

@@ -312,3 +312,12 @@ Used during `llama.cpp` version bumps: when upgrading, scan this file from the r
 | ~b9495–b9543 | `ggml/src/ggml-cuda/mmvq.cu` + `ggml/src/ggml-cpu/arch/{riscv,wasm}/quants.c` + `ggml/src/ggml-metal/ggml-metal-device.m` + `ggml/src/ggml-opencl/*` + `ggml/src/ggml-sycl/*` + `ggml/src/ggml-vulkan/*` + `ggml/src/ggml-webgpu/*` + `ggml/src/ggml-cpu/kleidiai/kleidiai.cpp` | Per-backend numerical & performance work: (1) CUDA `mul_mat_vec_q_moe` switched to `GGML_CUDA_RESTRICT` aliasing + PDL launch params for Hopper. (2) RISC-V Vector quants: dispatch-by-VL refactor (`vl128` / `vl256` / `vl512` / `vl1024` separate kernels for Q2_K, Q3_K, Q4_K, Q6_K, IQ1_S, IQ1_M, IQ2_S, IQ2_XS, IQ3_S, IQ3_XXS, IQ4_XS, TQ1_0, TQ2_0). (3) WebAssembly SIMD path for Q4_1. (4) Metal residency-set keep-alive polling interval tightened to 5 ms (was 500 ms). (5) OpenCL Adreno: faster `concat`/`cpy`/`get_rows` packed kernels for narrow tensors (`<32` cols); Q6_K mat-vec rewritten with vec4 weight gather. (6) SYCL: multi-column MMVQ paths added for all quant types (ncols=2..8) used by speculative decoding's draft verification batches; `should_reorder_tensor` gate widened from `ne[1]==1` to `ne[1]<=8`. (7) Vulkan: NV cooperative-matrix2 feature detection now requires every `coopmat2_features.*` bit; FWHT shader gains shmem fallback (Intel Windows driver bug workaround). (8) WebGPU: flash-attention split into vector / tile / subgroup-matrix variants with K/V quantization-aware staging (`U32_DEQUANT_HELPERS`); GRANITE_SPEECH bumped to multi-projector. (9) KleidiAI: env vars `GGML_KLEIDIAI_CHUNK_MULTIPLIER` & `GGML_KLEIDIAI_SME` thread-cap auto-detect; SME + non-SME hybrid scheduling. All purely backend-internal; project compiles backends through FetchContent with no API surface change visible to `jllama.cpp`. No project source changes required |
 | ~b9495–b9543 | `conversion/__init__.py` + `conversion/granite.py` + `conversion/gemma.py` + `convert_lora_to_gguf.py` + `gguf-py/gguf/{constants,tensor_mapping,gguf_writer}.py` | Python-side: new `Granite4VisionMmprojModel` (vision-projector for Granite4 with QFormer-window deepstack + per-projector spatial offsets + image-grid pinpoints); Gemma4 unified vision/audio conversion fix-ups for newer HF checkpoints (`hidden_size` falls back to `audio_embed_dim`; `model_patch_size` falls back to `patch_size * pooling_kernel_size`). `convert_lora_to_gguf.py` gained `--trust-remote-code`. New `LLM_KV_DEEPSTACK_MAPPING` writer (`add_deepstack_mapping`) and new clip-vision keys (`KEY_PROJ_SAMPLE_QUERY_SIDE`, `KEY_PROJ_SAMPLE_WINDOW_SIDE`, `KEY_PROJ_SPATIAL_OFFSETS`, `KEY_FEATURE_LAYERS`, `KEY_IMAGE_GRID_PINPOINTS`) for the Granite4 vision projector. Python-side only; no impact on the Java/JNI build. No project source changes required |
 | ~b9495–b9543 | upstream build / verification | Local build pending: the b9495 &#x2192; b9543 bump is expected to compile cleanly given the audit above (zero `grep` matches in `src/main/cpp/` for any of the renamed or removed symbols: `hparams.n_layer`, `nextn_predict_layers`, `n_layer_nextn`, `n_layer_all`, `LLAMA_STATE_SEQ_FLAGS_ON_DEVICE`, `clip_image_u8`/`clip_image_f32` field access, `clip_build_img_from_pixels`, `clip_get_newline_tensor`, `clip_image_u8_get_data`, `clip_embd_nbytes`, `clip_embd_nbytes_by_img`, `clip_encode_float_image`, `clip_image_f32_batch_add_mel`, `mtmd_helper_bitmap_init_from_file`, `mtmd_helper_bitmap_init_from_buf`, `common_imatrix_load`). The only project-visible signature change &mdash; `process_mtmd_prompt()`'s new `bool is_placeholder` parameter &mdash; is defaulted, so existing call sites inside the project compile unchanged. All breaking changes in this range are absorbed inside upstream-compiled translation units; no project source edits required for the version bump itself |
+| ~b9543–b9549 | `include/llama.h` + `src/llama-context.{h,cpp}` + `src/llama-cparams.h` + `src/llama-ext.h` | New `llama_context_params::ctx_other` field (a source/target/parent `llama_context *`, default `nullptr`) used to share results or `llama_memory` between two contexts; mirrored by new `cparams.ctx_other` and the new staging API `llama_get_ctx_other()` (`llama-ext.h`). `llama_get_memory()` was moved earlier in `llama-context.cpp` and made null-safe (returns `nullptr` for a null ctx). `llama_context_default_params()` initializes `ctx_other = nullptr`. Project does not aggregate-init `llama_context_params` (it goes through `llama_context_default_params()` inside upstream `server-context.cpp`) and never includes `llama-ext.h` &mdash; verified via `grep -rn "llama_context_params\|ctx_other\|llama-ext.h\|llama_get_ctx_other\|llama_get_memory" src/main/cpp/` returns zero matches. No project source changes required |
+| ~b9543–b9549 | `src/llama-kv-cache.{h,cpp}` + `llama-kv-cache-iswa.{h,cpp}` + `llama-kv-cache-dsa.cpp` + `llama-memory.h` + `llama-memory-hybrid{,-iswa}.cpp` | KV-cache constructors gained two new parameters: `llama_memory_t mem_other` and `layer_share_cb share` (`std::function<int32_t(int32_t il)>` returning the source layer index to share cells from, or negative to skip). Enables one context's KV cache to share cells with another's (used by the new Gemma4-assistant MTP head). `llama_memory_params` gained a `mem_other` field. All call sites (iswa/dsa/hybrid wrappers, `llama_model::create_memory`) updated upstream; the project never constructs a `llama_kv_cache*` or `llama_memory_*` directly. No project source changes required |
+| ~b9543–b9549 | `src/llama-arch.{h,cpp}` + new `src/models/gemma4-assistant.cpp` + `src/models/models.h` + `src/llama-model.{h,cpp}` + `src/llama-hparams.{h,cpp}` + `src/llama-graph.{h,cpp}` + `gguf-py/` + `conversion/gemma.py` | **New model architecture `LLM_ARCH_GEMMA4_ASSISTANT` ("gemma4-assistant")** &mdash; a NextN/MTP draft "assistant" head that shares the target Gemma4's KV cache and reads its post-final-norm hidden state. New tensors `LLM_TENSOR_NEXTN_PROJ_PRE`/`NEXTN_PROJ_POST` (`nextn.pre_projection`/`post_projection`) plus model-level `nextn_proj_pre`/`nextn_proj_post`; new hparams `n_embd_inp_impl` (input-embedding dim override, honoured by `n_embd_inp()`) and graph field `n_layer_nextn`. Python conversion registers `Gemma4AssistantForCausalLM`/`Gemma4UnifiedAssistantForCausalLM`. This is the headline new feature; it is a speculative-decoding / **MTP** mechanism, which this project tracks as deferred-by-policy (see Open TODOs / `spec-draft-backend-sampling` + MTP). Consumed entirely inside upstream-compiled TUs &mdash; loading a non-assistant GGUF is unaffected. No project source changes required to build; exposing MTP through the Java API remains the existing deferred TODO |
+| ~b9543–b9549 | `common/chat.cpp` + new `models/templates/LFM2.5-8B-A1B.jinja` | LFM2 chat-template handling: prior-turn `reasoning_content` is now copied into the template's `thinking` field, and `<think>` reasoning extraction is gated on the template source actually containing `<think>` (and no longer on `enable_thinking`). New `LFM2.5-8B-A1B` template + parser test consolidation. Routing happens inside upstream-compiled `chat.cpp`; the project calls no `common_chat_params_init_lfm2*` symbol. Handled automatically when such a model is loaded; no project source or Java API changes required |
+| ~b9543–b9549 | `common/arg.cpp` + `common/speculative.cpp` + `src/llama-graph.cpp` | `common_params_handle_models()` mmproj auto-download now also requires `params.mmproj.path.empty() && params.mmproj.url.empty()` (an explicitly-specified mmproj is no longer re-downloaded). `speculative.cpp` MTP path adds a shared-memory fast path (`is_mem_shared = llama_get_ctx_other(ctx_dft) == ctx_tgt`) that skips the catch-up decode and reuses the target position for draft tokens (Gemma4 assistant), and switched to `llama_model_n_embd_out()` for the MTP row width. `llama-graph.cpp` moved the `set_input_kq_mask` / `can_reuse_kq_mask` calls out of the k-idxs-buffer guard (iswa/hybrid-iswa mask bugfix). All inside upstream-compiled TUs; no project source changes required |
+| ~b9543–b9549 | `tools/server/server-context.cpp` (project-linked) | The one project-linked server TU changed: now `#include`s `ggml-cpp.h` and `../../src/llama-ext.h`; sets `cparams.ctx_other = ctx_tgt` for MTP draft/MTP contexts; moved the `ctx_dft_seq_rm_type = common_context_can_seq_rm(...)` assignment to after context init (guarded by `if (ctx_dft)`); downgraded the spec memory-measure failure log from `SRV_ERR` to `SRV_WRN`; and gated the mtmd draft-processing block on `llama_get_ctx_other(ctx_dft) != ctx_tgt`. All changes are internal to the TU and the new includes resolve against the FetchContent'd `src/` and `ggml` headers. Compiles into `jllama` unchanged from the project's side. No project source changes required |
+| ~b9543–b9549 | `.github/workflows/docker.yml` (upstream CI) | Upstream's `cuda13` Docker image bumped from CUDA `13.1.1` to `13.3.0`. Upstream's own CI only; this project ships its own `publish.yml` and pins CUDA 13.2 via `.github/build_cuda_linux.sh` (see CLAUDE.md "Upgrading CUDA Version"). No impact |
+| ~b9543–b9549 | project `CMakeLists.txt` (pre-existing latent bug, fixed in this bump) | **Not an upstream change** &mdash; surfaced while build-testing this bump locally. The OS/arch detection block invoked `net.ladenthin.llama.OSInfo`, but the class had moved to `net.ladenthin.llama.loader.OSInfo` in the earlier layered-package restructure, so `cmake -B build` failed with "Could not determine OS name" on any host that does not pass `-DOS_NAME`/`-DOS_ARCH` explicitly (CI does, which is why it went unnoticed). Fixed both `execute_process` invocations (`--os` and `--arch`) to the `loader.OSInfo` FQN. Same stale-FQN-after-restructure class as the earlier `spotbugs-exclude.xml` / PIT-`targetClasses` repairs &mdash; the standing reminder to re-validate every FQN-bearing config after a package move now also covers `CMakeLists.txt` |
+| ~b9543–b9549 | upstream build / verification | Local build with `GIT_TAG b9549` verified clean on Linux x86_64: `cmake -B build -DBUILD_TESTING=ON` configures cleanly (after the `loader.OSInfo` FQN fix above), `cmake --build build --config Release -j$(nproc)` links `libjllama.so` + `jllama_test` with zero warnings on any project translation unit (incl. the changed `server-context.cpp`), and `ctest --test-dir build --output-on-failure` reports 435/435 tests passing. All upstream breaking changes in this range are absorbed inside upstream-compiled translation units; no project C++ source edits were required for the version bump itself |