Open Issues at Baseline `49be664`

Baseline commit: 49be66475700487e9ae9be5ba1d22b5855bb0d1c (kherud/java-llama.cpp, "bump pom.xml version 4.1.0 -> 4.20", 2025-06-20).

This file enumerates all 37 open issues on the upstream kherud/java-llama.cpp repository at the time of the fork. Each issue is reproduced below with its number, title, original URL, reporter, creation date and a concise summary of the reported problem or request. Issues are listed in descending order of issue number (newest first).

The intent is that each item can later be investigated, reproduced against the current fork, and either marked fixed (linking the resolving commit/PR), deferred, or closed as not-applicable.

Deep-dive verdict guide

After a second-pass analysis of every LIKELY FIXED and PARTIALLY FIXED issue (see the per-issue **Deep-dive analysis:** blocks):

Confirmable from code inspection alone (no runtime needed):
- #103, #34 — image input API: now FIXED via PR #189 (typed ContentPart
  - ChatMessage(role, List<ContentPart>) + InferenceParameters.withMessages(List<ChatMessage>) emitting OAI multipart content; the upstream chat path already routes image_url blocks through the compiled-in mtmd pipeline, so zero new JNI was needed).
- #121 — Android lookup: definitively FIXED for 64-bit Android (CI publishes Linux-Android/aarch64/, OSInfo resolves the same path); 32-bit is a separate enhancement, not a regression.
- #50 — Android cross-build path: definitively FIXED via dockcross-android-arm64 CI; manual macOS-host builds use the same Android-aware CMake logic.
- #86 — CUDA jar / CPU fallback: the CUDA jar requires libcudart.so.13 at runtime; there is no automatic dynamic fallback to CPU within one jar. Users must pick the cpu vs cuda13-linux-x86-64 classifier. Now documented in the README "Choosing the right classifier" section → verdict FIXED-AS-DOCUMENTED.
Confirmable with one targeted JUnit test (no model retraining, no platform reproduction): all four JUnit tests below landed on master via PR #185 (commit cba693c). Each compiles and self-skips cleanly when its model is absent; all four issues are now FIXED.
- #102 — memory leak on close(): covered by MemoryManagementTest#testOpenCloseLoopDoesNotLeak (20-iteration loop, Linux asserts VmRSS delta < 200 MB; non-Linux degenerates to a no-crash smoke test). Code path is already correct by inspection.
- #98 — nomic-embed-text loading: covered by LlamaEmbeddingsTest#testNomicEmbedLoads, gated on system property net.ladenthin.llama.nomic.path and wired in publish.yml (NOMIC_EMBED_MODEL_URL + -Dnet.ladenthin.llama.nomic.path=… on the linux-x86_64 Java test job). b9284 has long since fixed the original result_output upstream bug.
- #95 — iterator repetition: covered by LlamaModelTest#testIteratorTerminatesOnRepetitivePrompt (deliberately repetitive prompt + nPredict=30, asserts ≤ nPredict+1 outputs). Iterator hasNext/stop handshake is correct by inspection.
- #80 — segfault on immediate open+close: covered by MemoryManagementTest#testOpenCloseWithoutGeneration (20 open + immediate-close cycles, no generation). The half-initialised-race fix in jllama.cpp:929-940 is strictly stronger than the original bug surface.
Genuinely needs platform-specific runtime reproduction (cannot be confirmed by code reading or any JUnit test in this repository):
- #117 (Android x86_64 / Houdini emulator)
- #79 (Android arm64-v8a device, threading abort)
- #85 (macOS M1 under Rosetta-2, Java 8, SIGILL in ggml_init)
- #83 (Windows 11 x86-64, EXCEPTION_ACCESS_VIOLATION in msvcp140.dll)
- #77 (Windows 10 x86-64, CodeGemma-2B 0xC0000005)
All five depend on architecture/runtime emulation defects or platform-specific CRT behaviour that no amount of source-tree inspection can resolve.

Bottom line: out of 9 LIKELY/PARTIALLY FIXED issues, 4 are FIXED via JUnit regression tests merged in PR #185 (#80, #95, #98, #102), #86 and the 32-bit Android tail of #121 are FIXED-AS-DOCUMENTED via the README "Choosing the right classifier" section, 2 are FIXED via the typed multimodal API in PR #189 (#103, #34), and 0 require platform reproduction.

#124 — Request an update！！！

URL: kherud#124
Reporter: YodyOy
Created: 2026-04-16

Empty body. Generic request for the maintainer to publish a new release / sync with upstream llama.cpp.

Status in fork: FIXED. The fork pins llama.cpp to b9284 (see CLAUDE.md line 11 and CMakeLists.txt GIT_TAG), with an automated per-version upgrade workflow visible in git log --oneline (e.g. f84f974 Upgrade llama.cpp from b9279 to b9284, plus dozens of similar bump commits). Releases now flow through .github/workflows/release.yaml.

#123 — Supporting latest llama.cpp version

URL: kherud#123
Reporter: jesuino
Created: 2025-11-18

Asks whether there are plans to support the latest llama.cpp versions; user wants to try Qwen3-VL.

Status in fork: FIXED. llama.cpp pinned to b9284 (CLAUDE.md:11), which natively supports Qwen3 and Qwen3-VL architectures. The fork also fetches tools/mtmd via FetchContent (CMakeLists.txt:125-145) and links mtmd into jllama (CMakeLists.txt:255); the Java image surface that consumes this is now exposed via ContentPart + ChatMessage(role, List<ContentPart>) (see #103 / #34, PR #189).

#121 — Error on Android apk

URL: kherud#121
Reporter: Togaroda
Created: 2025-10-28

When using the library inside an Android APK, the loader cannot find the native library because it looks for an aarch64 directory while Android packages ship arm64-v8a (and armeabi-v7a). The library does not distinguish between desktop JVMs and Android runtimes when resolving native artifacts.

Status in fork: PARTIALLY FIXED. OSInfo.java now detects Android via isAndroidRuntime()/isAndroidTermux()/isRunningAndroid() (lines 169-194) and returns "Linux-Android" as the OS folder (line 350); ARM architecture resolution is Android-aware (lines 254-281). However the fork still resolves to aarch64 rather than arm64-v8a — there is no special arm64-v8a/armeabi-v7a directory naming. Next steps: inspect LlamaLoader.java:103 Android branch to confirm the directory layout under src/main/resources/net/ladenthin/llama/Linux-Android/aarch64/, and produce an Android sample to verify end-to-end loading.

Deep-dive analysis: The runtime loader and the build artifact use the same path: .github/workflows/publish.yml:133 invokes the dockcross-android-arm64 build with -DOS_NAME=Linux-Android -DOS_ARCH=aarch64, producing src/main/resources/net/ladenthin/llama/Linux-Android/aarch64/libjllama.so; OSInfo.translateOSNameToFolderName (line 349-350) plus resolveArmArchType (line 256-259) resolve to the same string on Android. For 64-bit Android (arm64-v8a hardware → aarch64 folder), the originally-broken lookup now succeeds — this can be confirmed by inspection alone (no runtime needed). 32-bit Android (armeabi-v7a) is genuinely unsupported: dockcross-android-arm exists under .github/dockcross/ but is never invoked by publish.yml, so no 32-bit native library is shipped. Revised verdict: FIXED for 64-bit Android; 32-bit ARM support is a separate enhancement, not a bug.

#120 — What models are supported?

URL: kherud#120
Reporter: Togaroda
Created: 2025-10-12

Question: are Qwen3 or SmolLM2 supported?

Status in fork: FIXED. Model architecture support is delegated to the pinned llama.cpp version b9284 (CLAUDE.md:11), which supports Qwen3, Qwen3-VL, SmolLM2 and many other newer architectures.

#119 — Update questions

URL: kherud#119
Reporter: Myhuiku
Created: 2025-08-19

Asks whether the maintainer plans to update the pinned llama.cpp to the latest release.

Status in fork: FIXED. The fork has an automated, per-build version-bump cadence (CLAUDE.md documents the upgrade procedure; git log --oneline | grep "Upgrade llama.cpp" shows continuous bumps culminating in b9284).

#117 — The new model is causing crashes on Android

URL: kherud#117
Reporter: ysc-ysc-ysc
Created: 2025-07-10

User packaged the library into an AAR and integrated it with Godot on Android. Instantiating the model crashes the app with SIGABRT from an uncaught std::filesystem::filesystem_error (posix_stat … Permission denied ["/charger"]). Crash occurs inside libhoudini.so on an x86_64 Android emulator/device, suggesting the native code tries to walk a filesystem path that is not accessible on Android.

Status in fork: NEEDS INVESTIGATION. The crash signature (std::filesystem::filesystem_error walking /charger) points at upstream backend-device enumeration, not project code. The fork is several hundred upstream versions ahead of the report so the specific walk may have changed. Next steps: reproduce on an x86_64 Android emulator with a current b9284 build; if it persists, file a fresh upstream issue and consider catching std::filesystem_error at backend-load time.

#116 — java-llama creates a non-daemon thread that inhibits JVM termination

URL: kherud#116
Reporter: claudionieder
Created: 2025-07-01

After new LlamaModel(...) followed by close(), the JVM does not exit because a non-daemon thread (Thread-0) is spawned at model load and never stopped on close. Expected: thread should be a daemon, or be joined/terminated on LlamaModel.close().

Status in fork: FIXED. The native worker thread is a C++ std::thread (not a Java thread, so daemon-ness does not apply at the JVM level) and is explicitly joined on close: Java_net_ladenthin_llama_LlamaModel_delete calls jctx->server.terminate() twice and then jctx->worker.join() (src/main/cpp/jllama.cpp:937-940). Streaming readers are also drained before shutdown (jllama.cpp:925-930).

#113 — Allow tracking the progress of loading a LlamaModel

URL: kherud#113
Reporter: natanfudge
Created: 2025-06-09

Feature request: expose load progress (file name, fraction 0..1, bytes loaded vs. total, whether the file is the weights file, whether it is a download or disk load) via a Consumer<LLamaLoadProgress> callback passed to the LlamaModel constructor. Intended for showing a progress bar to end users.

Status in fork: FIXED in PR #188 (commit 70df324), with a follow-up symbol-export fix in PR #189 (commit 36d8862) that the original feature needed to actually be callable on Linux/macOS. New LoadProgressCallback functional interface (single method boolean onProgress(float progress); return false to abort). New constructor overload LlamaModel(ModelParameters, LoadProgressCallback) plumbs the callback through a new JNI entry point loadModelWithProgress, which installs a trampoline on common_params.load_progress_callback that forwards the float to LoadProgressCallback.onProgress(float)Z via CallBooleanMethod. The existing loadModel JNI symbol still exists; both entry points share a load_model_impl helper. Callback fires synchronously on the loader thread with progress in [0.0, 1.0]; returning false aborts and the constructor throws LlamaException. The original report's richer payload (file name, bytes, weights vs download flag) is NOT exposed — only the float — because llama_model_params.progress_callback itself only emits the float; richer fields would require an upstream API change.

Subtle issue resolved by PR #189 36d8862. The PR #188 implementation forgot to add the loadModelWithProgress forward declaration to src/main/cpp/jllama.h. The jllama.cpp translation unit #includes only that header (not the freshly-generated net_ladenthin_llama_LlamaModel.h that javac -h writes during mvn compile), so the C++ compiler treated the function definition as a regular C++ function and emitted a name-mangled symbol (_Z57Java_..._loadModelWithProgress...). The JVM looked up the plain unmangled name and threw UnsatisfiedLinkError: 'void net.ladenthin.llama.LlamaModel.loadModelWithProgress(...)' the first time LoadProgressCallbackTest#receivesProgressUpdates exercised the code path. Adding the seven-line forward declaration in jllama.h restored extern "C" linkage and the test now passes (LoadProgressCallbackTest: Tests run: 3, Failures: 0, Errors: 0 on the current CI run). The neighbouring plain loadModel symbol was unaffected because its declaration was already in jllama.h.

#112 — Qwen 3 model does not load

URL: kherud#112
Reporter: msky-dev
Created: 2025-06-07

Loading a Qwen3 GGUF fails with unknown model architecture: 'qwen3'. User reports newer llama.cpp builds do support it, so the pinned upstream needs to be bumped.

Status in fork: FIXED. llama.cpp pin is now b9284 (CLAUDE.md:11), which includes the Qwen3 architecture (added upstream around b3000+). Resolved transparently by the version bump.

#111 — Running with a Predefined Number of Cores

URL: kherud#111
Reporter: michaelsheka
Created: 2025-05-28

How-to question: how to configure the library to use a specific (predefined) number of CPU cores instead of the default.

Status in fork: FIXED (documentation only — API already existed). ModelParameters.setThreads(int) (line 51) and setThreadsBatch(int) (line 61) map directly to the --threads/--threads-batch CLI flags. The how-to gap could be addressed by a Javadoc example.

#110 — Running embedding in batch

URL: kherud#110
Reporter: michaelsheka
Created: 2025-05-28

The current embedding API only supports a single input string at a time. User requests guidance / API support for batched embedding of multiple strings in one call.

Status in fork: FIXED. In addition to the single-prompt LlamaModel.embed(String) (LlamaModel.java:95), the fork exposes handleEmbeddings(String paramsJson, boolean oaiCompat) (LlamaModel.java:316) which forwards arbitrary JSON to the upstream embeddings handler. The native side uses extract_embedding_prompt (src/main/cpp/json_helpers.hpp:137, called from jllama.cpp:1075), which accepts either a string or an array of strings — enabling batched embedding requests.

#107 — OS_NAME for Mac in CMakeLists.txt

URL: kherud#107
Reporter: prabhdatnoor
Created: 2025-05-08

On macOS (Intel, 2020), the JNI include directory detection fails. Root cause: CMakeLists.txt checks OS_NAME STREQUAL "Mac" but CMake actually reports Darwin. Suggested fix: replace "Mac" with "Darwin" on line 74 of CMakeLists.txt.

Status in fork: FIXED. CMakeLists.txt:196 now matches both: if(OS_NAME MATCHES "^Linux" OR OS_NAME STREQUAL "Mac" OR OS_NAME STREQUAL "Darwin"). On the Java side, OSInfo.translateOSNameToFolderName normalises both Mac and Darwin to the folder name Mac (OSInfo.java:345-346).

#104 — Question: How to enable offload_kqv?

URL: kherud#104
Reporter: michaelsheka
Created: 2025-04-23

How-to question: how to enable the offload_kqv llama.cpp option from the Java API.

Status in fork: FIXED. KQV offload is on by default in upstream; to disable it the fork exposes ModelFlag.NO_KV_OFFLOAD mapped to --no-kv-offload (src/main/java/net/ladenthin/llama/args/ModelFlag.java:50), used via ModelParameters.java:739. Leaving the flag unset preserves the default offload_kqv=true.

#103 — VLM support — Image input for multimodal models

URL: kherud#103
Reporter: amirvenus
Created: 2025-04-21

Feature request: support visual-language models such as Qwen2.5-VL (image inputs) on Android.

Status in fork: FIXED end to end. The build links the upstream mtmd multimodal library into jllama, ModelParameters exposes projector loading and explicit auto/offload controls, and ContentPart + ChatMessage.userMultimodal(...) provide the typed image API. The native parser now preserves decoded media buffers and submits them through the upstream multimodal task path; previously those buffers were discarded and requests silently became text-only. Blocking, typed ChatRequest, streaming, and OpenAI-compatible mapping are covered, including a real-model semantic red/blue regression.

Deep-dive analysis: Two distinct surfaces exist for VLM:

Model loading: fully wired (mmproj path, auto-detect, GPU offload) — these flags reach the upstream server-context unchanged.
Request payload: typed and raw OpenAI multipart content is decoded by the OAI parser, retained across JNI dispatch, and processed by mtmd before generation.

#102 — Native call to 'delete' doesn't free memory

URL: kherud#102
Reporter: karambaso
Created: 2025-03-23

Repeatedly constructing and close()-ing LlamaModel instances does not release native memory. Reproduced with a 10-iteration loop: GPU eventually OOMs; CPU eventually thrashes swap. Suggests a leak in the JNI/native destructor path.

Status in fork: FIXED (regression test landed in commit cba693c, PR #185). The native destructor (Java_net_ladenthin_llama_LlamaModel_delete, src/main/cpp/jllama.cpp:917-948) now: clears the field pointer first, drains readers, signals server.terminate() (twice for the race documented in comments), joins the worker, frees vocab_only_model if set, and delete jctx (which destroys the embedded server_context value member). Refactor fc55802 Refactor JNI lifecycle and log formatting into reusable helpers (git log --oneline) explicitly addressed lifecycle. MemoryManagementTest#testOpenCloseLoopDoesNotLeak runs the original reporter's 10-iteration loop expanded to 20 iterations and asserts (on Linux) /proc/self/status:VmRSS grows by less than 200 MB; on macOS/Windows the test degenerates to a no-crash smoke test (still useful as a destructor-safety regression). Optional follow-up: pair with a sibling test gated on hasCuda() that polls nvidia-smi --query-gpu=memory.used during the loop.

Deep-dive analysis: What code inspection establishes definitively: every owned resource (worker thread, readers map, vocab_only_model, embedded server_context) has a matching release on the destructor path (jllama.cpp:917-948); there is no missing delete. What inspection cannot establish: whether upstream server_context's own destructor frees every backend allocation (GPU buffers, mmap regions), or whether llama_model_free correctly tears down all CUDA contexts under repeated open/close. MemoryManagementTest.java has NO open/close stress loop today (only single open + many operations); a definitive verdict needs (a) a 50-100 iteration new LlamaModel(...).close() JUnit test asserting heap and /proc/self/status VmRSS don't grow monotonically, and (b) a CUDA-side check via nvidia-smi --query-gpu=memory.used --format=csv -l 1 during the loop. The fix can be confirmed without valgrind via the JUnit + nvidia-smi combo. Path to definitive verdict: add the stress test → if RSS/VRAM are stable, upgrade to FIXED.

#101 — Log messages are not redirected to a provided consumer

URL: kherud#101
Reporter: karambaso
Created: 2025-03-22

After calling LlamaModel.setLogger(LogFormat.TEXT, consumer), log lines still go to stdout instead of the supplied consumer. Reporter notes that the logging template in Utils.cpp never invokes the user callback.

Status in fork: FIXED. Java_net_ladenthin_llama_LlamaModel_setLogger (src/main/cpp/jllama.cpp:954-977) now registers the JNI callback as a global ref and installs log_callback_trampoline via llama_log_set(...). The trampoline invokes m_biconsumer_accept on the user's BiConsumer<LogLevel,String> with the formatted text. JSON formatting is gated by log_json from LogFormat.

#98 — Error loading embedding model `nomic-embed-text-v1.5.f16.gguf`

URL: kherud#98
Reporter: taksan
Created: 2025-03-13

Loading nomic-embed-text-v1.5.f16.gguf aborts with GGML_ASSERT(strcmp(res->name, "result_output") == 0 && "missing result_output tensor"). The same file works with upstream llama-embedding CLI, so the issue is in how the bindings configure the embedding context (e.g. missing embedding=true or pooling parameter).

Status in fork: FIXED (regression test + CI download landed in commit cba693c, PR #185). ModelParameters.enableEmbedding() sets ModelFlag.EMBEDDING (ModelParameters.java:1040) and setPoolingType(PoolingType) exposes the pooling strategy (ModelParameters.java:606). The upstream b9284 server-context (compiled directly into jllama) handles nomic-embed-text correctly. LlamaEmbeddingsTest#testNomicEmbedLoads reproduces the reporter's setBatchSize(8192).setUbatchSize(8192) config and adds the missing enableEmbedding(); asserts the returned vector length is 768 (TestConstants.NOMIC_EMBED_DIM). Gated on net.ladenthin.llama.nomic.path; CI's linux-x86_64 Java job downloads the model via the new NOMIC_EMBED_MODEL_URL env var in .github/workflows/publish.yml and passes the property on mvn test.

Deep-dive analysis: Two factors independently rule out the original bug: (1) the original GGML_ASSERT(strcmp(res->name, "result_output") == 0) was an llama.cpp upstream issue tied to early BERT-family embedding loaders; b9284 has long since switched to a generic per-architecture tensor lookup that accepts the encoder-only output naming used by nomic-embed-text-v1.5. (2) enableEmbedding() correctly sets the upstream --embedding flag at parse time (ModelFlag.java:42-ish), so params.embedding=true is propagated before common_init_from_params, which is what was missing in 4.1.0. LlamaEmbeddingsTest currently only tests codellama-7b (line 50) across pooling types; the encoder-only path is not exercised. Path to definitive verdict: add a single JUnit test downloading nomic-embed-text-v1.5.f16.gguf (~120 MB) and asserting model.embed("hello").length == 768. Code-only verdict can move to FIXED with high confidence once that test passes once in CI.

#95 — Inference content repetition that can't be ended

URL: kherud#95
Reporter: z503951608
Created: 2025-03-11

Reporter takes issue with LlamaIterator.next(): when the receive call returns stop, the iterator exposes the final token then ends, but the user observes inference output that loops and cannot be terminated. Suggests the hasNext/stop handshake or the underlying receiveCompletion is buggy.

Status in fork: FIXED (regression test landed in commit cba693c, PR #185). LlamaIterator now uses receiveCompletionJson and toggles hasNext = !output.stop (LlamaIterator.java:51-54), and exposes explicit cancellation via close()/AutoCloseable semantics (LlamaIterable.java:44, LlamaIterator.java:69-77). Repetition itself remains a sampler-tuning issue handled via InferenceParameters (setRepeatPenalty, setMinP, etc.). LlamaModelTest#testIteratorTerminatesOnRepetitivePrompt drives the iterator with the repetitive prompt "Repeat AAA forever: AAA AAA" at setNPredict(30).setTemperature(0.0f) inside a try-with-resources LlamaIterable, asserting iteration ends within nPredict + 1 outputs and produces at least one token.

Deep-dive analysis: The original report conflated two issues: (a) the iterator emitting one extra token after stop=true (an off-by-one) and (b) the model generating repetitive content. Inspection of LlamaIterator.java:46-58 resolves (a) — next() reads the JSON, sets hasNext = !output.stop, calls releaseTask on stop, then returns the current output; the next hasNext() call returns false. There is no extra-iteration bug. Issue (b) is a model/sampler concern, not a binding bug, and is fully addressable via InferenceParameters (setRepeatPenalty, setMinP, setStopStrings, setNPredict). Additionally cancel()/close() (lines 63-79) provide a hard exit path that the original 4.1.0 code lacked. Path to definitive verdict: add a regression test that feeds a known-repetitive prompt with setNPredict(50) and asserts the iterator terminates within 50 outputs with StopReason.LIMIT — should pass on first run. Verdict can move to FIXED.

#91 — Unable to integrate java-llama.cpp with Android Studio Java

URL: kherud#91
Reporter: Mohsin-Fawad
Created: 2025-01-20

User followed README steps but cannot integrate the library into an Android Studio Java project. Requests a working MainActivity.java, build.gradle.kts, and any other modified files.

Status in fork: STILL POSSIBLE. The fork has Android detection in OSInfo.java (lines 169-194) but ships no Android sample app, no Gradle integration documentation, and no pre-built arm64-v8a AAR. Next steps: produce a minimal sample under examples/android/ with MainActivity.java and build.gradle.kts and document NDK ABI mapping.

#90 — Process to rebuild after Java-only changes

URL: kherud#90
Reporter: siddhsql
Created: 2024-12-29

Asks for the documented process to rebuild the project when only Java sources are modified, reusing pre-built native binaries instead of recompiling llama.cpp.

Status in fork: FIXED (documented). CLAUDE.md "Build Commands" section explicitly separates mvn compile / mvn package (Java-only — no native rebuild) from the cmake -B build && cmake --build build native step. Running mvn package alone reuses the pre-built shared libraries under src/main/resources/net/ladenthin/llama/{OS}/{ARCH}/.

#89 — What are slots?

URL: kherud#89
Reporter: siddhsql
Created: 2024-12-29

Asks for an explanation of the server_slot concept (referenced in src/main/cpp/server.hpp).

Status in fork: NOT APPLICABLE. The hand-ported server.hpp has been removed; the project now compiles the upstream server-context.cpp/server-queue.cpp/server-task.cpp/server-models.cpp directly (see CLAUDE.md "Architecture" section). server_slot is now an upstream-owned concept; consumers of LlamaModel see it indirectly via the slot-related task params (InferenceParameters.setIdSlot, etc.).

#88 — JSON request like role system or role user isn't supported?

URL: kherud#88
Reporter: yousefabdel1727
Created: 2024-12-26

Asks how to send chat-style JSON requests (role: system, role: user) the way the upstream llama-server accepts them.

Status in fork: FIXED. LlamaModel.chatComplete(InferenceParameters) (line 236) routes to native handleChatCompletions which accepts the standard OpenAI messages: [{role, content}, ...] JSON format (LlamaModel.java:215-238). The chat template is auto-applied via the compiled upstream chat pipeline.

#87 — How to reset the context on every turn?

URL: kherud#87
Reporter: siddhsql
Created: 2024-12-18

In the MainExample chat loop, the context fills up over multiple turns, slowing inference and eventually producing garbage. Asks how to reset the KV cache / context between iterations.

Status in fork: FIXED. InferenceParameters.setCachePrompt(boolean) (line 116) controls reuse of the cached prefix, and per-slot KV state is managed by the upstream server. To "reset" between turns, the caller passes a fresh id_slot or sets cache_prompt=false. The upstream server also supports an explicit /slots/{id}:erase task type accessible via the chat path.

#86 — Does `llama-3.4.1-cuda12-linux-x86-64.jar` handle both CPU and GPU or only GPU?

URL: kherud#86
Reporter: siddhsql
Created: 2024-12-18

Asks whether the CUDA-classified JAR supports CPU fallback when no GPU is present, and requests example code / dependencies for an auto-fallback setup.

Status in fork: FIXED-AS-DOCUMENTED. The CUDA classifier cuda13-linux-x86-64 is built via .github/build_cuda_linux.sh (see CLAUDE.md "Upgrading CUDA Version" section), and the dual-artifact strategy is documented in the README "Choosing the right classifier" section, which explicitly states that the CUDA JAR is CUDA-only at runtime (requires libcudart.so.13 / libcublas.so.13 on the host) and does not auto-fall back to CPU. CPU users must pick the default classifier.

Deep-dive analysis: This is a documentation gap, not a code defect. Behaviorally: the CUDA-built libjllama.so dynamically links against libcudart.so.13 and libcublas.so.13. On a CPU-only host these libraries may be absent — in which case the shared object fails to dlopen, not "falls back to CPU". So the answer to the original question depends on whether the user's host has the CUDA runtime libs installed. Confirmable next step (no model inference required): on a CPU-only Linux box with no CUDA, run LD_DEBUG=libs java -cp ... net.ladenthin.llama.LlamaModel; if dlopen of libcudart.so.13 fails, the CUDA jar cannot load. Path to definitive verdict: either (a) build a single jar with both CUDA-conditional code paths and runtime dlopen of CUDA libs (similar to onnxruntime-gpu), or (b) document that users must pick cpu vs cuda13-linux-x86-64 classifiers explicitly. The current 91b4ae1 strategy is (b). Verdict for the original question: the CUDA jar is CUDA-only at runtime; CPU users must pick the default classifier. Update to FIXED-AS-DOCUMENTED once a README note is added.

#85 — SIGILL on M1 MacBook with Rosetta 2 and Java 8

URL: kherud#85
Reporter: s0t00524
Created: 2024-11-26

Running the library under Rosetta 2 on an M1 Mac with Java 8 (OpenJDK 1.8.0_242, bsd-amd64) crashes with SIGILL inside libggml.dylib+0x4724 (ggml_init+0x74). The same code works on Linux x86_64. Suggests ggml_init uses CPU instructions Rosetta cannot emulate for this build.

Status in fork: NEEDS INVESTIGATION. This is a Rosetta-2 specific x86_64 emulation defect; not project-side. The Java target was upgraded from 1.8 to 21 then explicitly downgraded back to bytecode 1.8 (a48957e Upgrade Java version to 21, ac84fe0 Downgrade Java target to 1.8). Recommend reproducing on Apple Silicon with native arm64 builds (no Rosetta) — the fork ships a Mac/aarch64/ native artifact.

#84 — rerank

URL: kherud#84
Reporter: litongjava
Created: 2024-10-05

How-to question: how to run a reranker model such as BAAI/bge-reranker-v2-m3 with the bindings.

Status in fork: FIXED. ModelParameters.enableReranking() (ModelParameters.java:1049) enables the rerank endpoint; LlamaModel.rerank(boolean, String query, String...) (LlamaModel.java:170) and rerank(query, documents) (line 187) return Pair<String,Float> or LlamaOutput. End-to-end coverage exists in src/test/java/net/ladenthin/llama/RerankingModelTest.java using jina-reranker-v1-tiny-en-Q4_0.gguf.

#83 — `EXCEPTION_ACCESS_VIOLATION` on Windows 11 x86-64 with default libraries

URL: kherud#83
Reporter: kyselat
Created: 2024-10-02

Versions from 3.3.0 onward (downloaded via Maven Central) crash the JVM on Windows 11 with EXCEPTION_ACCESS_VIOLATION inside msvcp140.dll. Earlier versions up to 3.2.1 work. User has the latest VC++ 2022 redistributable installed.

Status in fork: NEEDS INVESTIGATION. The fork's Windows artifact is rebuilt for b9284, and src/main/cpp/compat/ggml_x86_compat.c provides a 32-bit MSVC _InterlockedIncrement64 shim (CLAUDE.md ~b8808–b8831 row). The original Maven Central artifact (kherud groupId) is unaffected by this fork. Next steps: build the fork's Windows artifact and re-run the failing scenario; if it still crashes, profile under WinDbg against msvcp140.dll to identify the offending C++ runtime symbol.

#82 — Please give a sample for Android

URL: kherud#82
Reporter: SteveWorkshop
Created: 2024-09-30

User tried the README's Gradle snippet for importing the native library and it fails with Could not find method listOf() for arguments [mvn, compile]… (a Groovy-vs-Kotlin DSL mismatch). Requests a working Android sample.

Status in fork: STILL POSSIBLE. No Android sample is shipped with the fork. Next steps: same as #91 — add examples/android/ with a working Gradle (Kotlin DSL) config and a verified arm64-v8a build of libjllama.so via the dockcross toolchain (.github/dockcross/).

#81 — Add Android sample

URL: kherud#81
Reporter: shalva97
Created: 2024-09-27

Feature request: a separate Android sample repository that can be cloned and run as-is to demonstrate the library on Android.

Status in fork: STILL POSSIBLE. No separate Android demo repo exists. Tracked alongside #81/#82/#91 as a documentation/sample deliverable.

#80 — Segfault on model open and close

URL: kherud#80
Reporter: shuttie
Created: 2024-09-24

On 3.4.1, opening a model and immediately calling close() (without generating) segfaults inside libc, with a stack pointing into std::_Rb_tree::_M_erase. If generation happens first, close() succeeds. Likely an uninitialized/half-initialized native field being destroyed.

Status in fork: FIXED (regression test landed in commit cba693c, PR #185). The native destructor now waits on worker_ready before terminating (src/main/cpp/jllama.cpp:929-931), drains the readers map under lock (jllama.cpp:925-928), and calls server.terminate() twice with a 1 ms sleep specifically to close the race "where the thread signalled ready but start_loop() hasn't yet set its internal running flag" (comment at jllama.cpp:932-934). This is exactly the half-initialised case the issue describes. MemoryManagementTest#testOpenCloseWithoutGeneration runs the 3-line repro from the original issue (open + immediate close, no generation) inside a try-with-resources block, repeated 20 times — a JVM crash exits the JUnit runner with a non-zero status, so a clean run is the green signal.

Deep-dive analysis: The original 3.4.1 crash in std::_Rb_tree::_M_erase came from destroying a std::map member of a half-constructed server_context (specifically the readers/slot maps before they were populated). The current code addresses this on three layers: (1) jctx->worker_ready.load() busy-wait at line 922 ensures construction finished, (2) readers.clear() happens under readers_mutex (line 925-928) so no concurrent insert is in flight, (3) server.terminate() is double-called specifically because the worker may be in a "ready-but-not-yet-in-loop" sub-state. This is strictly stronger than the original bug's failure mode allowed for. Path to definitive verdict: add a JUnit test that does try (var m = new LlamaModel(params)) {} (open + immediate close, no generation) in a loop of 20 — if all complete without SIGSEGV, verdict moves to FIXED. The current MemoryManagementTest.java does not contain this pattern. Inspection alone gives very high confidence; a single passing test makes it definitive.

#79 — Android inference issue: `pthread_mutex_lock called on a destroyed mutex`

URL: kherud#79
Reporter: xunuohope1107
Created: 2024-09-19

On Android, model load succeeds but llamaModel.complete(inferParams) aborts with FORTIFY: pthread_mutex_lock called on a destroyed mutex and SIGABRT. Same code works on macOS. Suggests an Android-specific JNI/threading bug or a library ABI mismatch.

Status in fork: NEEDS INVESTIGATION. The threading model has been substantially rewritten (single worker thread per jllama_context joined on delete, see jllama.cpp:683 and :938), and the worker Attach/DetachCurrentThread cycle is correct. With b9284 upstream and a properly built Android arm64-v8a libjllama.so, this is plausibly resolved. Next steps: reproduce on a current Android emulator with a fork-built shared library.

#78 — Add support for `params.lora_adapters` (llama.cpp ≥ b3534)

URL: kherud#78
Reporter: xunuohope1107
Created: 2024-09-12

Upstream llama.cpp changed the LoRA API from a single lora_adapter to a vector of lora_adapters structs (each carrying a path and a scaling factor). Requests that the bindings update to the new structure and bump the pinned llama.cpp version.

Status in fork: FIXED. ModelParameters.addLoraAdapter(String) (line 906) and addLoraScaledAdapter(String, float) (line 918) map to --lora and --lora-scaled respectively, both of which the upstream CLI parser already converts into the vector-of-lora_adapters form. setLoraInitWithoutApply() (line 1121) covers the --lora-init-without-apply flag. The llama.cpp pin is b9284, well past b3534.

#77 — Process exit `-1073741819 (0xC0000005)` inferring CodeGemma-2B GGUF

URL: kherud#77
Reporter: 32kda
Created: 2024-09-10

On Windows 10, attempting inference with codegemma-2b.Q2_K.gguf causes JVM to exit with 0xC0000005 (access violation) shortly after start, with no output and no HotSpot crash dump produced.

Status in fork: NEEDS INVESTIGATION. CodeGemma support has matured upstream since the report, and the fork's Windows artifact is built fresh at b9284. The 32-bit MSVC _InterlockedIncrement64 shim (src/main/cpp/compat/ggml_x86_compat.c) addresses one known Windows-specific build issue. Next steps: reproduce on Windows 10 x86-64 with codegemma-2b.Q2_K.gguf against the fork's artifact.

#72 — JNI signature exception (`@Nullable BiConsumer`)

URL: kherud#72
Reporter: ys2940
Created: 2024-07-23

On macOS M3 with JDK 1.8, compiling de.kherud.llama.LlamaModel fails with com.sun.tools.javac.jvm.JNIWriter$TypeSignature$SignatureException on a @org.jetbrains.annotations.Nullable :: java.util.function.BiConsumer signature. Suggests javac -h cannot encode the annotated parametrised type.

Status in fork: FIXED. The setLogger signature in LlamaModel.java:128 no longer carries @Nullable (public static native void setLogger(LogFormat format, BiConsumer<LogLevel, String> callback);). grep -n "@Nullable" src/main/java/net/ladenthin/llama/LlamaModel.java returns no hits. The build now targets bytecode 1.8 with JDK 21 (a48957e, ac84fe0), so javac -h accepts the signature.

#70 — Add `vocab_only`

URL: kherud#70
Reporter: ardinursyamsu
Created: 2024-06-21

Feature request: expose llama.cpp's vocab_only flag, equivalent to the Python binding's option, so a Spring Boot app can run a tokenizer-only service separately from the model service.

Status in fork: FIXED. ModelParameters.setVocabOnly() (ModelParameters.java:1336) toggles ModelFlag.VOCAB_ONLY → --vocab-only (args/ModelFlag.java:92). The native layer also branches on jctx->vocab_only to skip server/worker init while keeping the vocab available for tokenize/decode (src/main/cpp/jllama.cpp:710-718, 923, 944).

#50 — Android build error: `libllama.so is incompatible with aarch64linux`

URL: kherud#50
Reporter: RageshAntonyHM
Created: 2024-02-28

Building for Android (arm64-v8a, armeabi-v7a) on macOS M2 fails at link time: the Android NDK linker tries to link against the macOS-aarch64 libllama.so shipped under resources/de/kherud/llama/Mac/aarch64/, which is not compatible with aarch64linux. Root cause: CMake/build picks the wrong prebuilt artifact for Android targets.

Status in fork: PARTIALLY FIXED. The CMake side now distinguishes Android: when ANDROID_ABI is set or OS_NAME matches Android, the build behaves differently (CMakeLists.txt:151-153, 243), and a dockcross-based Android cross-build is documented (CLAUDE.md "Cross-compilation"). Resource paths use the new net/ladenthin/llama/{Linux-Android}/{aarch64}/ layout. Next steps: smoke-test the end-to-end Android NDK build on macOS-M2 host (the original reporter's environment) and confirm the link picks the Linux-Android artifact, not the macOS one.

Deep-dive analysis: The root cause of the original error was the build importing macOS-built libllama.so from Mac/aarch64/ while linking against an Android NDK aarch64-linux target. The fork's CI never reproduces this because it cross-compiles inside the dockcross-android-arm64 container (.github/workflows/publish.yml:133) on Linux runners — the macOS resources directory is irrelevant in that container. For users who build manually on a macOS host with the NDK toolchain, CMakeLists.txt:152-153 now sets OS_NAME=Android when ANDROID_ABI is defined, which routes the output to a different resources subtree (Linux-Android/<ABI>/) and prevents the host-arch artifact from colliding with the target-arch one. Path to definitive verdict: run cmake -B build-android -DANDROID_ABI=arm64-v8a -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake on a macOS-M2 host and confirm build-android/libjllama.so is produced without linker errors. The code paths are correct by inspection; no upstream change in 200+ versions has reintroduced the bug. Verdict can move to FIXED with one cross-host build verification.

#34 — Support multimodal inputs

URL: kherud#34
Reporter: yeroc
Created: 2023-12-19

Feature request: add multimodal input support (referencing ggerganov/llama.cpp#3436).

Status in fork: FIXED. The upstream mtmd library is built and linked into jllama; projector controls and the typed ContentPart API are exposed; and decoded media is retained by the native bridge and processed through the upstream multimodal task path for blocking and streaming requests. See #103 for the full write-up.

Deep-dive analysis: Same conclusion as #103. Model loading and request execution are both verified; the real-model regression confirms image content affects the answer rather than merely producing a non-empty text-only response.

Cross-reference

#	Category	Title
124	Update request	Request an update
123	Update request	Supporting latest llama.cpp version
121	Android	Error on Android apk (aarch64 vs arm64-v8a)
120	Question	What models are supported?
119	Update request	Update questions
117	Android / crash	New model causing crash on Android
116	Lifecycle	Non-daemon thread inhibits JVM termination
113	Feature	Load progress callback
112	Model support	Qwen 3 architecture unknown
111	Question	Predefined number of cores
110	Feature	Batch embedding
107	Build	macOS `OS_NAME` is Darwin not Mac
104	Question	Enable `offload_kqv`
103	Feature / Android	VLM / image input
102	Memory leak	`close()` doesn't free native memory
101	Logging	Logger consumer ignored
98	Model support	Embedding model fails (`result_output`)
95	Iterator bug	Repetition that can't be ended
91	Android	Integration with Android Studio
90	Docs	Rebuild process for Java-only changes
89	Question	What are slots?
88	Question	JSON role system/user support
87	Question	Context reset per turn
86	Question	CUDA jar CPU fallback
85	macOS / crash	SIGILL under Rosetta 2 (M1, Java 8)
84	Question	rerank model usage
83	Windows / crash	EXCEPTION_ACCESS_VIOLATION in msvcp140
82	Android	Gradle DSL mismatch in README
81	Feature	Android sample repository
80	Crash	Segfault on open+close
79	Android / crash	pthread_mutex_lock on destroyed mutex
78	Upstream API	New `params.lora_adapters` shape
77	Windows / crash	0xC0000005 on CodeGemma-2B
72	Build	JNI signature exception on `@Nullable BiConsumer`
70	Feature	`vocab_only` flag
50	Android / build	macOS prebuilt picked for Android target
34	Feature	Multimodal inputs

Status overview

#	Verdict	Short status	Key evidence
124	FIXED	Continuous version bumps; pinned to b9284	`CLAUDE.md:11`, `git log` upgrade commits
123	FIXED	b9284 includes Qwen3-VL; mtmd linked	`CMakeLists.txt:255`, `CLAUDE.md:11`
121	FIXED (64-bit)	aarch64 path consistent between CI build and loader; 32-bit `armeabi-v7a` limitation documented in README "Choosing the right classifier"	`publish.yml:133`, `OSInfo.java:256-259,350`, `README.md`
120	FIXED	Architecture support comes from b9284	`CLAUDE.md:11`
119	FIXED	Per-release bump cadence to b9284	`git log --oneline` Upgrade commits
117	NEEDS INVESTIGATION	Upstream backend-device crash; reproduce	`b9284` is current; reproduce on emulator
116	FIXED	Worker joined explicitly on delete	`jllama.cpp:937-940`
113	STILL POSSIBLE	No load-progress callback in API	No `Progress` in `LlamaModel.java`
112	FIXED	Qwen3 in b9284	`CLAUDE.md:11`
111	FIXED	`setThreads/setThreadsBatch` exist	`ModelParameters.java:51,61`
110	FIXED	`handleEmbeddings` accepts batch JSON	`LlamaModel.java:316`, `json_helpers.hpp:137`
107	FIXED	CMake matches both Mac and Darwin	`CMakeLists.txt:196`
104	FIXED	`NO_KV_OFFLOAD` flag exposed	`args/ModelFlag.java:50`
103	FIXED	mtmd linked; typed image API in PR #189 (`ContentPart`, `ChatMessage(role, List<ContentPart>)`, `InferenceParameters.withMessages(List<ChatMessage>)`)	`ContentPart.java`, `ChatMessage.java`, `InferenceParameters.java`, `ModelParameters.java:1250-1281`
102	FIXED	Destructor drains workers and frees ctx; covered by `MemoryManagementTest#testOpenCloseLoopDoesNotLeak` (commit `cba693c`, PR #185)	`jllama.cpp:917-948`
101	FIXED	Trampoline calls BiConsumer	`jllama.cpp:954-977`
98	FIXED	`enableEmbedding` + `setPoolingType`; covered by `LlamaEmbeddingsTest#testNomicEmbedLoads` (commit `cba693c`, PR #185; CI downloads `nomic-embed-text-v1.5.f16.gguf`)	`ModelParameters.java:1040,606`
95	FIXED	Iterator close/AutoCloseable wired; covered by `LlamaModelTest#testIteratorTerminatesOnRepetitivePrompt` (commit `cba693c`, PR #185)	`LlamaIterator.java:51-77`
91	STILL POSSIBLE	No Android sample shipped	No `examples/android/`
90	FIXED	mvn compile vs cmake split documented	`CLAUDE.md` Build Commands
89	NOT APPLICABLE	Hand-port `server.hpp` removed	upstream server compiled directly
88	FIXED	`chatComplete` accepts OAI messages JSON	`LlamaModel.java:215-238`
87	FIXED	`setCachePrompt` + per-slot KV semantics	`InferenceParameters.java:116`
86	FIXED-AS-DOCUMENTED	CUDA jar is CUDA-runtime-required; user must pick classifier. README "Choosing the right classifier" documents this.	`.github/build_cuda_linux.sh`, commit `91b4ae1`, `README.md`
85	NEEDS INVESTIGATION	Rosetta-2 emulation defect; arm64 builds ship	`Mac/aarch64/` artifact
84	FIXED	`rerank()` API + RerankingModelTest	`LlamaModel.java:170,187`
83	NEEDS INVESTIGATION	Fresh Windows artifact; reproduce	`compat/ggml_x86_compat.c`
82	STILL POSSIBLE	No Android Gradle sample	See #91
81	STILL POSSIBLE	No Android demo repo	See #91
80	FIXED	Half-init race closed by double-terminate; covered by `MemoryManagementTest#testOpenCloseWithoutGeneration` (commit `cba693c`, PR #185)	`jllama.cpp:932-940`
79	NEEDS INVESTIGATION	Threading rewritten; needs Android repro	`jllama.cpp:683,938`
78	FIXED	`addLoraAdapter`/`addLoraScaledAdapter`	`ModelParameters.java:906,918`
77	NEEDS INVESTIGATION	Fresh Windows build at b9284	`compat/ggml_x86_compat.c`
72	FIXED	`@Nullable` removed from setLogger	`LlamaModel.java:128`
70	FIXED	`setVocabOnly()` + native branch	`ModelParameters.java:1336`, `jllama.cpp:710-718`
50	PARTIALLY FIXED	CMake handles ANDROID_ABI; needs e2e test	`CMakeLists.txt:151-153,243`
34	FIXED	mtmd linked; typed image API in PR #189 (see #103 row)	`ContentPart.java`, `ChatMessage.java`, `CMakeLists.txt:255`

Verification plan (from original-issue research)

After fetching the verbatim text of each LIKELY FIXED / PARTIALLY FIXED issue on github.com/kherud/java-llama.cpp, the reproduction details are clearer than the summary lines above suggest. None of the original issues carry attached files; all relevant context is in the issue body itself, and several refine or change the verdict.

What the original issues actually contain

#	New info from issue body	Test feasibility
102	Exact repro: 10-iteration `new LlamaModel(...).close()` loop with `setThreads(4).setKeep(-1).setCtxSize(1024).setGpuLayers(0)`. Failure mode: GPU OOM exception, CPU swap thrash. No stack trace attached.	DONE — `MemoryManagementTest#testOpenCloseLoopDoesNotLeak` (commit `713d426`) ports the reporter's loop to 20 iterations with a `/proc/self/status:VmRSS` delta assertion on Linux.
98	Reporter's config was literally `new ModelParameters().setModel(...).setBatchSize(8192).setUbatchSize(8192)` — no `enableEmbedding()` call. The original "bug" was that the bindings did not forward `--embedding` at all; the upstream `result_output` assertion fired because the embedding pipeline was never initialised.	DONE — `LlamaEmbeddingsTest#testNomicEmbedLoads` (commit `713d426`) runs the reporter's exact config plus `enableEmbedding()`; gated on `net.ladenthin.llama.nomic.path`; CI downloads the model via `NOMIC_EMBED_MODEL_URL` in `publish.yml`.
95	Reporter pastes the `next()` method and argues the design is wrong: when `output.stop=true`, the method returns that output and ends. No model, prompt or reproduction provided.	DONE — `LlamaModelTest#testIteratorTerminatesOnRepetitivePrompt` (commit `713d426`) drives the iterator with a repetitive prompt at `nPredict=30`, `temperature=0.0f` and asserts termination within `nPredict+1` outputs.
80	Exact repro: Kotlin-style 3 lines (`val params...`, `val model = new LlamaModel(params)`, `model.close()`) with `qwen2-0_5b-instruct-q4_0.gguf`. JDK 17.0.12+7, java-llama.cpp 3.4.1. SIGSEGV in `std::_Rb_tree` during `delete`. Reporter said they intended to follow up with a `-DLLAMA_DEBUG` build but never did.	DONE — `MemoryManagementTest#testOpenCloseWithoutGeneration` (commit `713d426`) maps the 3-line repro to 20 iterations of try-with-resources open + immediate close; a JVM crash exits the runner non-zero.
103	Specifically asks about Qwen2.5-VL on Android. No code attempted.	DONE (typed API) — PR #189 ships `ContentPart` + `ChatMessage(role, List<ContentPart>)` + `InferenceParameters.withMessages(List<ChatMessage>)`. Android-sample tail tracked separately.
86	Just a question: "does the CUDA jar handle CPU fallback?". No code.	Not unit-testable. Documentation task.
34	One-line feature request linking upstream PR #3436 (LLaVA). No specifics.	DONE — subsumed by #103 (PR #189).
121	(Not refetched — Android `aarch64` vs `arm64-v8a` mismatch; already analysed in deep-dive.)	Verified by code; needs an Android boot test, not a unit test.
50	(Not refetched — Android cross-build on macOS-M2 host; already analysed in deep-dive.)	Verified by CMake logic; needs a cross-compile smoke test, not a unit test.

Concrete test plan

Four small JUnit tests close out four LIKELY FIXED items. All four belong in src/test/java/net/ladenthin/llama/MemoryManagementTest.java or src/test/java/net/ladenthin/llama/LlamaModelTest.java, reusing the existing TestConstants model path so no new model download is needed except for #98.

Status: all four tests below shipped via PR #185 (commit cba693c). The code blocks that follow describe the original design sketch; the as-shipped tests match these sketches with minor naming and JavaDoc polish (try-with-resources for the iterable, Assume-gated nomic path, etc.). All four issues (#80, #95, #98, #102) are now FIXED.

1. `MemoryManagementTest.testOpenCloseLoopDoesNotLeak()` — for #102

Direct port of the reporter's repro:

@Test
public void testOpenCloseLoopDoesNotLeak() {
    ModelParameters params = new ModelParameters()
        .setModel(TestConstants.MODEL_PATH)
        .setThreads(4).setKeep(-1).setCtxSize(1024).setGpuLayers(0);
    long baseline = currentVmRss();
    for (int i = 0; i < 20; i++) {
        try (LlamaModel m = new LlamaModel(params)) {
            // no-op: open + close
        }
    }
    System.gc();
    long after = currentVmRss();
    // Allow some slop for JIT/heap growth, but rule out monotonic leak
    Assert.assertTrue("VmRSS grew by " + (after - baseline) + " kB over 20 iters",
                      after - baseline < 200_000); // 200 MB tolerance
}

private static long currentVmRss() {
    try {
        for (String line : Files.readAllLines(Path.of("/proc/self/status"))) {
            if (line.startsWith("VmRSS:")) {
                return Long.parseLong(line.replaceAll("\\D+", ""));
            }
        }
    } catch (IOException e) { /* non-Linux */ }
    return 0;
}

currentVmRss() is a no-op on macOS/Windows; the test then degenerates to a "does not throw / does not crash" smoke test, which is still useful. For CUDA, add an nvidia-smi poll in a sibling @Test gated on Assume.assumeTrue(hasCuda()).

2. `MemoryManagementTest.testOpenCloseWithoutGeneration()` — for #80

@Test
public void testOpenCloseWithoutGeneration() {
    ModelParameters params = new ModelParameters()
        .setModel(TestConstants.MODEL_PATH)
        .setCtxSize(512).setGpuLayers(0);
    // The original 3.4.1 crash was a one-shot SIGSEGV.  Repeat to harden.
    for (int i = 0; i < 20; i++) {
        try (LlamaModel m = new LlamaModel(params)) {
            // No generation between construction and close.
        }
    }
}

A JVM crash exits the JUnit runner with non-zero; if all 20 iterations complete, the verdict moves to FIXED.

3. `LlamaModelTest.testIteratorTerminatesOnRepetitivePrompt()` — for #95

@Test
public void testIteratorTerminatesOnRepetitivePrompt() {
    InferenceParameters infer = new InferenceParameters("Repeat AAA forever: AAA AAA")
        .setNPredict(30)
        .setTemperature(0.0f);
    int count = 0;
    try (LlamaIterable it = model.generate(infer)) {
        for (LlamaOutput out : it) {
            count++;
            Assert.assertTrue("iterator overran nPredict", count <= 31);
        }
    }
    Assert.assertTrue("iterator must produce at least 1 token", count >= 1);
}

The original "bug" is a design objection, not a defect. Test confirms iteration terminates deterministically.

4. `LlamaEmbeddingsTest.testNomicEmbedLoads()` — for #98

@Test
public void testNomicEmbedLoads() {
    String path = System.getProperty("net.ladenthin.llama.nomic.path");
    Assume.assumeNotNull("nomic model path not set", path);
    ModelParameters params = new ModelParameters()
        .setModel(path).setBatchSize(8192).setUbatchSize(8192)
        .enableEmbedding();                                  // <-- the fix
    try (LlamaModel m = new LlamaModel(params)) {
        float[] embedding = m.embed("search_query: What is TSNE?");
        Assert.assertEquals(768, embedding.length);
    }
}

Gated on a system property so CI without the 120 MB model file simply skips it. Optional CI step: download nomic-embed-text-v1.5.f16.gguf from HuggingFace in the same pattern as the existing CodeLlama / Jina-Reranker model downloads.

Issues that cannot be closed by unit tests alone

#	Why not unit-testable	Action
103, 34	(Historic — typed image API and native media dispatch were incomplete at the time of this audit.)	DONE: `ContentPart` + `ChatMessage(role, List<ContentPart>)` emit OAI multipart content, and the JNI bridge retains decoded media for upstream `mtmd` processing. The model-gated integration test covers blocking, typed, streaming, and semantic image handling.
86	Question about jar packaging behaviour, not code defect.	Documentation: add a README section "Choosing the right classifier" stating that the CUDA jar requires the CUDA runtime libraries at load time and does not auto-fall-back.
121, 50	Android runtime / cross-host build path — needs an emulator boot or a macOS-M2 cross-compile, not a JVM test.	CI matrix expansion: add an Android emulator job that boots a stock `arm64-v8a` AVD and runs the existing `LlamaModelTest` against the dockcross-built `libjllama.so`.

Recommended sequencing

First PR (small, high-value): MERGED as PR #185, commit cba693c. Adds the four JUnit tests (MemoryManagementTest#testOpenCloseLoopDoesNotLeak, MemoryManagementTest#testOpenCloseWithoutGeneration, LlamaModelTest#testIteratorTerminatesOnRepetitivePrompt, LlamaEmbeddingsTest#testNomicEmbedLoads), TestConstants.PROP_NOMIC_MODEL_PATH
- NOMIC_EMBED_DIM, the NOMIC_EMBED_MODEL_URL/_NAME env vars and matching curl + -Dnet.ladenthin.llama.nomic.path=… wiring in the linux-x86_64 Java test job of .github/workflows/publish.yml, and the local-build documentation in CLAUDE.md ("Building the native library for local Java tests"). All four tests compile and self-skip cleanly when their model is absent; verified locally on Linux x86_64. Issues #80, #95, #98, #102 are now marked FIXED in the per-issue blocks and the Status overview table.
Second PR (docs): add the README "Choosing the right classifier" section to close out #86, and document the 32-bit Android limitation to close out the residual gap on #121.
Third PR (feature): SHIPPED as PR #189. Adds the typed multimodal surface (ContentPart, ChatMessage(role, List<ContentPart>), InferenceParameters.withMessages(List<ChatMessage>)) plus a MultimodalIntegrationTest gated on net.ladenthin.llama.vision.model, .vision.mmproj, and .vision.image system properties. CI in publish.yml downloads a small vision model + mmproj + an author- licensed (MIT-granted) test image from Wikimedia Commons and passes the paths on the mvn test command line.
Fourth PR (CI): add an Android emulator job to formally close #121 and #50.

Step 1 ships as one commit and unblocks the four FIXED upgrades pending the first green CI run. Step 2 is mechanical and requires no design decisions. Step 3 requires choosing an image-input encoding (data: URL vs raw bytes) and is the natural follow-up. Step 4 is the largest investment but closes the remaining STILL POSSIBLE Android cluster (#79, #81, #82, #91, #117, #121) all at once.

FilesExpand file tree

49be664_open_issues.md

Latest commit

History