Baseline commit: 49be66475700487e9ae9be5ba1d22b5855bb0d1c (kherud/java-llama.cpp,
"bump pom.xml version 4.1.0 -> 4.20", 2025-06-20).
This file enumerates all 37 open issues on the upstream
kherud/java-llama.cpp repository at
the time of the fork. Each issue is reproduced below with its number, title,
original URL, reporter, creation date and a concise summary of the reported
problem or request. Issues are listed in descending order of issue number
(newest first).
The intent is that each item can later be investigated, reproduced against the current fork, and either marked fixed (linking the resolving commit/PR), deferred, or closed as not-applicable.
After a second-pass analysis of every LIKELY FIXED and PARTIALLY FIXED issue
(see the per-issue **Deep-dive analysis:** blocks):
-
Confirmable from code inspection alone (no runtime needed):
- #103, #34 — image input API: now FIXED via PR #189 (typed
ContentPartChatMessage(role, List<ContentPart>)+InferenceParameters.withMessages(List<ChatMessage>)emitting OAI multipart content; the upstream chat path already routesimage_urlblocks through the compiled-inmtmdpipeline, so zero new JNI was needed).
- #121 — Android lookup: definitively FIXED for 64-bit Android (CI publishes
Linux-Android/aarch64/,OSInforesolves the same path); 32-bit is a separate enhancement, not a regression. - #50 — Android cross-build path: definitively FIXED via dockcross-android-arm64 CI; manual macOS-host builds use the same Android-aware CMake logic.
- #86 — CUDA jar / CPU fallback: the CUDA jar requires
libcudart.so.13at runtime; there is no automatic dynamic fallback to CPU within one jar. Users must pick thecpuvscuda13-linux-x86-64classifier. Now documented in the README "Choosing the right classifier" section → verdict FIXED-AS-DOCUMENTED.
- #103, #34 — image input API: now FIXED via PR #189 (typed
-
Confirmable with one targeted JUnit test (no model retraining, no platform reproduction): all four JUnit tests below landed on
mastervia PR #185 (commitcba693c). Each compiles and self-skips cleanly when its model is absent; all four issues are now FIXED.- #102 — memory leak on
close(): covered byMemoryManagementTest#testOpenCloseLoopDoesNotLeak(20-iteration loop, Linux assertsVmRSSdelta< 200 MB; non-Linux degenerates to a no-crash smoke test). Code path is already correct by inspection. - #98 —
nomic-embed-textloading: covered byLlamaEmbeddingsTest#testNomicEmbedLoads, gated on system propertynet.ladenthin.llama.nomic.pathand wired inpublish.yml(NOMIC_EMBED_MODEL_URL+-Dnet.ladenthin.llama.nomic.path=…on the linux-x86_64 Java test job). b9284 has long since fixed the originalresult_outputupstream bug. - #95 — iterator repetition: covered by
LlamaModelTest#testIteratorTerminatesOnRepetitivePrompt(deliberately repetitive prompt +nPredict=30, asserts ≤nPredict+1outputs). IteratorhasNext/stophandshake is correct by inspection. - #80 — segfault on immediate open+close: covered by
MemoryManagementTest#testOpenCloseWithoutGeneration(20 open + immediate-close cycles, no generation). The half-initialised-race fix injllama.cpp:929-940is strictly stronger than the original bug surface.
- #102 — memory leak on
-
Genuinely needs platform-specific runtime reproduction (cannot be confirmed by code reading or any JUnit test in this repository):
- #117 (Android
x86_64/ Houdini emulator) - #79 (Android
arm64-v8adevice, threading abort) - #85 (macOS M1 under Rosetta-2, Java 8,
SIGILLinggml_init) - #83 (Windows 11 x86-64,
EXCEPTION_ACCESS_VIOLATIONinmsvcp140.dll) - #77 (Windows 10 x86-64, CodeGemma-2B
0xC0000005)
All five depend on architecture/runtime emulation defects or platform-specific CRT behaviour that no amount of source-tree inspection can resolve.
- #117 (Android
Bottom line: out of 9 LIKELY/PARTIALLY FIXED issues, 4 are FIXED via JUnit
regression tests merged in PR #185 (#80, #95, #98, #102), #86 and the
32-bit Android tail of #121 are FIXED-AS-DOCUMENTED via the README "Choosing
the right classifier" section, 2 are FIXED via the typed multimodal API
in PR #189 (#103, #34), and 0 require platform reproduction.
- URL: kherud#124
- Reporter: YodyOy
- Created: 2026-04-16
Empty body. Generic request for the maintainer to publish a new release / sync with upstream llama.cpp.
Status in fork: FIXED. The fork pins llama.cpp to b9284 (see CLAUDE.md line 11 and CMakeLists.txt GIT_TAG), with an automated per-version upgrade workflow visible in git log --oneline (e.g. f84f974 Upgrade llama.cpp from b9279 to b9284, plus dozens of similar bump commits). Releases now flow through .github/workflows/release.yaml.
- URL: kherud#123
- Reporter: jesuino
- Created: 2025-11-18
Asks whether there are plans to support the latest llama.cpp versions; user wants to try Qwen3-VL.
Status in fork: FIXED. llama.cpp pinned to b9284 (CLAUDE.md:11), which natively supports Qwen3 and Qwen3-VL architectures. The fork also fetches tools/mtmd via FetchContent (CMakeLists.txt:125-145) and links mtmd into jllama (CMakeLists.txt:255); the Java image surface that consumes this is now exposed via ContentPart + ChatMessage(role, List<ContentPart>) (see #103 / #34, PR #189).
- URL: kherud#121
- Reporter: Togaroda
- Created: 2025-10-28
When using the library inside an Android APK, the loader cannot find the native
library because it looks for an aarch64 directory while Android packages
ship arm64-v8a (and armeabi-v7a). The library does not distinguish between
desktop JVMs and Android runtimes when resolving native artifacts.
Status in fork: PARTIALLY FIXED. OSInfo.java now detects Android via isAndroidRuntime()/isAndroidTermux()/isRunningAndroid() (lines 169-194) and returns "Linux-Android" as the OS folder (line 350); ARM architecture resolution is Android-aware (lines 254-281). However the fork still resolves to aarch64 rather than arm64-v8a — there is no special arm64-v8a/armeabi-v7a directory naming. Next steps: inspect LlamaLoader.java:103 Android branch to confirm the directory layout under src/main/resources/net/ladenthin/llama/Linux-Android/aarch64/, and produce an Android sample to verify end-to-end loading.
Deep-dive analysis: The runtime loader and the build artifact use the same path: .github/workflows/publish.yml:133 invokes the dockcross-android-arm64 build with -DOS_NAME=Linux-Android -DOS_ARCH=aarch64, producing src/main/resources/net/ladenthin/llama/Linux-Android/aarch64/libjllama.so; OSInfo.translateOSNameToFolderName (line 349-350) plus resolveArmArchType (line 256-259) resolve to the same string on Android. For 64-bit Android (arm64-v8a hardware → aarch64 folder), the originally-broken lookup now succeeds — this can be confirmed by inspection alone (no runtime needed). 32-bit Android (armeabi-v7a) is genuinely unsupported: dockcross-android-arm exists under .github/dockcross/ but is never invoked by publish.yml, so no 32-bit native library is shipped. Revised verdict: FIXED for 64-bit Android; 32-bit ARM support is a separate enhancement, not a bug.
- URL: kherud#120
- Reporter: Togaroda
- Created: 2025-10-12
Question: are Qwen3 or SmolLM2 supported?
Status in fork: FIXED. Model architecture support is delegated to the pinned llama.cpp version b9284 (CLAUDE.md:11), which supports Qwen3, Qwen3-VL, SmolLM2 and many other newer architectures.
- URL: kherud#119
- Reporter: Myhuiku
- Created: 2025-08-19
Asks whether the maintainer plans to update the pinned llama.cpp to the latest release.
Status in fork: FIXED. The fork has an automated, per-build version-bump cadence (CLAUDE.md documents the upgrade procedure; git log --oneline | grep "Upgrade llama.cpp" shows continuous bumps culminating in b9284).
- URL: kherud#117
- Reporter: ysc-ysc-ysc
- Created: 2025-07-10
User packaged the library into an AAR and integrated it with Godot on Android.
Instantiating the model crashes the app with SIGABRT from an uncaught
std::filesystem::filesystem_error (posix_stat … Permission denied ["/charger"]). Crash occurs inside libhoudini.so on an x86_64 Android
emulator/device, suggesting the native code tries to walk a filesystem path
that is not accessible on Android.
Status in fork: NEEDS INVESTIGATION. The crash signature (std::filesystem::filesystem_error walking /charger) points at upstream backend-device enumeration, not project code. The fork is several hundred upstream versions ahead of the report so the specific walk may have changed. Next steps: reproduce on an x86_64 Android emulator with a current b9284 build; if it persists, file a fresh upstream issue and consider catching std::filesystem_error at backend-load time.
- URL: kherud#116
- Reporter: claudionieder
- Created: 2025-07-01
After new LlamaModel(...) followed by close(), the JVM does not exit
because a non-daemon thread (Thread-0) is spawned at model load and never
stopped on close. Expected: thread should be a daemon, or be joined/terminated
on LlamaModel.close().
Status in fork: FIXED. The native worker thread is a C++ std::thread (not a Java thread, so daemon-ness does not apply at the JVM level) and is explicitly joined on close: Java_net_ladenthin_llama_LlamaModel_delete calls jctx->server.terminate() twice and then jctx->worker.join() (src/main/cpp/jllama.cpp:937-940). Streaming readers are also drained before shutdown (jllama.cpp:925-930).
- URL: kherud#113
- Reporter: natanfudge
- Created: 2025-06-09
Feature request: expose load progress (file name, fraction 0..1, bytes loaded
vs. total, whether the file is the weights file, whether it is a download or
disk load) via a Consumer<LLamaLoadProgress> callback passed to the
LlamaModel constructor. Intended for showing a progress bar to end users.
Status in fork: FIXED in PR #188 (commit 70df324), with a
follow-up symbol-export fix in PR #189 (commit 36d8862) that the
original feature needed to actually be callable on Linux/macOS.
New LoadProgressCallback functional interface (single method
boolean onProgress(float progress); return false to abort).
New constructor overload
LlamaModel(ModelParameters, LoadProgressCallback) plumbs the
callback through a new JNI entry point loadModelWithProgress,
which installs a trampoline on common_params.load_progress_callback
that forwards the float to LoadProgressCallback.onProgress(float)Z
via CallBooleanMethod. The existing loadModel JNI symbol still
exists; both entry points share a load_model_impl helper.
Callback fires synchronously on the loader thread with progress in
[0.0, 1.0]; returning false aborts and the constructor throws
LlamaException. The original report's richer payload (file name,
bytes, weights vs download flag) is NOT exposed — only the float —
because llama_model_params.progress_callback itself only emits the
float; richer fields would require an upstream API change.
Subtle issue resolved by PR #189 36d8862. The PR #188 implementation
forgot to add the loadModelWithProgress forward declaration to
src/main/cpp/jllama.h. The jllama.cpp translation unit #includes only
that header (not the freshly-generated net_ladenthin_llama_LlamaModel.h
that javac -h writes during mvn compile), so the C++ compiler treated
the function definition as a regular C++ function and emitted a name-mangled
symbol (_Z57Java_..._loadModelWithProgress...). The JVM looked up the
plain unmangled name and threw
UnsatisfiedLinkError: 'void net.ladenthin.llama.LlamaModel.loadModelWithProgress(...)'
the first time LoadProgressCallbackTest#receivesProgressUpdates exercised
the code path. Adding the seven-line forward declaration in jllama.h
restored extern "C" linkage and the test now passes
(LoadProgressCallbackTest: Tests run: 3, Failures: 0, Errors: 0 on the
current CI run). The neighbouring plain loadModel symbol was unaffected
because its declaration was already in jllama.h.
- URL: kherud#112
- Reporter: msky-dev
- Created: 2025-06-07
Loading a Qwen3 GGUF fails with unknown model architecture: 'qwen3'. User
reports newer llama.cpp builds do support it, so the pinned upstream needs to
be bumped.
Status in fork: FIXED. llama.cpp pin is now b9284 (CLAUDE.md:11), which includes the Qwen3 architecture (added upstream around b3000+). Resolved transparently by the version bump.
- URL: kherud#111
- Reporter: michaelsheka
- Created: 2025-05-28
How-to question: how to configure the library to use a specific (predefined) number of CPU cores instead of the default.
Status in fork: FIXED (documentation only — API already existed). ModelParameters.setThreads(int) (line 51) and setThreadsBatch(int) (line 61) map directly to the --threads/--threads-batch CLI flags. The how-to gap could be addressed by a Javadoc example.
- URL: kherud#110
- Reporter: michaelsheka
- Created: 2025-05-28
The current embedding API only supports a single input string at a time. User requests guidance / API support for batched embedding of multiple strings in one call.
Status in fork: FIXED. In addition to the single-prompt LlamaModel.embed(String) (LlamaModel.java:95), the fork exposes handleEmbeddings(String paramsJson, boolean oaiCompat) (LlamaModel.java:316) which forwards arbitrary JSON to the upstream embeddings handler. The native side uses extract_embedding_prompt (src/main/cpp/json_helpers.hpp:137, called from jllama.cpp:1075), which accepts either a string or an array of strings — enabling batched embedding requests.
- URL: kherud#107
- Reporter: prabhdatnoor
- Created: 2025-05-08
On macOS (Intel, 2020), the JNI include directory detection fails. Root cause:
CMakeLists.txt checks OS_NAME STREQUAL "Mac" but CMake actually reports
Darwin. Suggested fix: replace "Mac" with "Darwin" on line 74 of
CMakeLists.txt.
Status in fork: FIXED. CMakeLists.txt:196 now matches both: if(OS_NAME MATCHES "^Linux" OR OS_NAME STREQUAL "Mac" OR OS_NAME STREQUAL "Darwin"). On the Java side, OSInfo.translateOSNameToFolderName normalises both Mac and Darwin to the folder name Mac (OSInfo.java:345-346).
- URL: kherud#104
- Reporter: michaelsheka
- Created: 2025-04-23
How-to question: how to enable the offload_kqv llama.cpp option from the
Java API.
Status in fork: FIXED. KQV offload is on by default in upstream; to disable it the fork exposes ModelFlag.NO_KV_OFFLOAD mapped to --no-kv-offload (src/main/java/net/ladenthin/llama/args/ModelFlag.java:50), used via ModelParameters.java:739. Leaving the flag unset preserves the default offload_kqv=true.
- URL: kherud#103
- Reporter: amirvenus
- Created: 2025-04-21
Feature request: support visual-language models such as Qwen2.5-VL (image inputs) on Android.
Status in fork: FIXED end to end. The build links the upstream mtmd multimodal library into jllama, ModelParameters exposes projector loading and explicit auto/offload controls, and ContentPart + ChatMessage.userMultimodal(...) provide the typed image API. The native parser now preserves decoded media buffers and submits them through the upstream multimodal task path; previously those buffers were discarded and requests silently became text-only. Blocking, typed ChatRequest, streaming, and OpenAI-compatible mapping are covered, including a real-model semantic red/blue regression.
Deep-dive analysis: Two distinct surfaces exist for VLM:
- Model loading: fully wired (mmproj path, auto-detect, GPU offload) — these flags reach the upstream server-context unchanged.
- Request payload: typed and raw OpenAI multipart content is decoded by the OAI parser, retained across JNI dispatch, and processed by
mtmdbefore generation.
- URL: kherud#102
- Reporter: karambaso
- Created: 2025-03-23
Repeatedly constructing and close()-ing LlamaModel instances does not
release native memory. Reproduced with a 10-iteration loop: GPU eventually
OOMs; CPU eventually thrashes swap. Suggests a leak in the JNI/native
destructor path.
Status in fork: FIXED (regression test landed in commit cba693c, PR #185).
The native destructor
(Java_net_ladenthin_llama_LlamaModel_delete, src/main/cpp/jllama.cpp:917-948)
now: clears the field pointer first, drains readers, signals server.terminate()
(twice for the race documented in comments), joins the worker, frees
vocab_only_model if set, and delete jctx (which destroys the embedded
server_context value member). Refactor fc55802 Refactor JNI lifecycle and log formatting into reusable helpers (git log --oneline) explicitly addressed
lifecycle. MemoryManagementTest#testOpenCloseLoopDoesNotLeak runs the original
reporter's 10-iteration loop expanded to 20 iterations and asserts (on Linux)
/proc/self/status:VmRSS grows by less than 200 MB; on macOS/Windows the test
degenerates to a no-crash smoke test (still useful as a destructor-safety
regression). Optional follow-up: pair with a sibling test gated on
hasCuda() that polls nvidia-smi --query-gpu=memory.used during the loop.
Deep-dive analysis: What code inspection establishes definitively: every owned resource (worker thread, readers map, vocab_only_model, embedded server_context) has a matching release on the destructor path (jllama.cpp:917-948); there is no missing delete. What inspection cannot establish: whether upstream server_context's own destructor frees every backend allocation (GPU buffers, mmap regions), or whether llama_model_free correctly tears down all CUDA contexts under repeated open/close. MemoryManagementTest.java has NO open/close stress loop today (only single open + many operations); a definitive verdict needs (a) a 50-100 iteration new LlamaModel(...).close() JUnit test asserting heap and /proc/self/status VmRSS don't grow monotonically, and (b) a CUDA-side check via nvidia-smi --query-gpu=memory.used --format=csv -l 1 during the loop. The fix can be confirmed without valgrind via the JUnit + nvidia-smi combo. Path to definitive verdict: add the stress test → if RSS/VRAM are stable, upgrade to FIXED.
- URL: kherud#101
- Reporter: karambaso
- Created: 2025-03-22
After calling LlamaModel.setLogger(LogFormat.TEXT, consumer), log lines still
go to stdout instead of the supplied consumer. Reporter notes that the
logging template in Utils.cpp never invokes the user callback.
Status in fork: FIXED. Java_net_ladenthin_llama_LlamaModel_setLogger (src/main/cpp/jllama.cpp:954-977) now registers the JNI callback as a global ref and installs log_callback_trampoline via llama_log_set(...). The trampoline invokes m_biconsumer_accept on the user's BiConsumer<LogLevel,String> with the formatted text. JSON formatting is gated by log_json from LogFormat.
- URL: kherud#98
- Reporter: taksan
- Created: 2025-03-13
Loading nomic-embed-text-v1.5.f16.gguf aborts with
GGML_ASSERT(strcmp(res->name, "result_output") == 0 && "missing result_output tensor").
The same file works with upstream llama-embedding CLI, so the issue is in
how the bindings configure the embedding context (e.g. missing embedding=true
or pooling parameter).
Status in fork: FIXED (regression test + CI download landed in commit cba693c, PR #185). ModelParameters.enableEmbedding()
sets ModelFlag.EMBEDDING (ModelParameters.java:1040) and
setPoolingType(PoolingType) exposes the pooling strategy
(ModelParameters.java:606). The upstream b9284 server-context (compiled directly
into jllama) handles nomic-embed-text correctly.
LlamaEmbeddingsTest#testNomicEmbedLoads reproduces the reporter's
setBatchSize(8192).setUbatchSize(8192) config and adds the missing
enableEmbedding(); asserts the returned vector length is 768
(TestConstants.NOMIC_EMBED_DIM). Gated on net.ladenthin.llama.nomic.path; CI's
linux-x86_64 Java job downloads the model via the new NOMIC_EMBED_MODEL_URL
env var in .github/workflows/publish.yml and passes the property on
mvn test.
Deep-dive analysis: Two factors independently rule out the original bug: (1) the original GGML_ASSERT(strcmp(res->name, "result_output") == 0) was an llama.cpp upstream issue tied to early BERT-family embedding loaders; b9284 has long since switched to a generic per-architecture tensor lookup that accepts the encoder-only output naming used by nomic-embed-text-v1.5. (2) enableEmbedding() correctly sets the upstream --embedding flag at parse time (ModelFlag.java:42-ish), so params.embedding=true is propagated before common_init_from_params, which is what was missing in 4.1.0. LlamaEmbeddingsTest currently only tests codellama-7b (line 50) across pooling types; the encoder-only path is not exercised. Path to definitive verdict: add a single JUnit test downloading nomic-embed-text-v1.5.f16.gguf (~120 MB) and asserting model.embed("hello").length == 768. Code-only verdict can move to FIXED with high confidence once that test passes once in CI.
- URL: kherud#95
- Reporter: z503951608
- Created: 2025-03-11
Reporter takes issue with LlamaIterator.next(): when the receive call returns
stop, the iterator exposes the final token then ends, but the user observes
inference output that loops and cannot be terminated. Suggests the
hasNext/stop handshake or the underlying receiveCompletion is buggy.
Status in fork: FIXED (regression test landed in commit cba693c, PR #185). LlamaIterator now uses
receiveCompletionJson and toggles hasNext = !output.stop
(LlamaIterator.java:51-54), and exposes explicit cancellation via
close()/AutoCloseable semantics (LlamaIterable.java:44,
LlamaIterator.java:69-77). Repetition itself remains a sampler-tuning issue
handled via InferenceParameters (setRepeatPenalty, setMinP, etc.).
LlamaModelTest#testIteratorTerminatesOnRepetitivePrompt drives the iterator
with the repetitive prompt "Repeat AAA forever: AAA AAA" at
setNPredict(30).setTemperature(0.0f) inside a try-with-resources
LlamaIterable, asserting iteration ends within nPredict + 1 outputs and
produces at least one token.
Deep-dive analysis: The original report conflated two issues: (a) the iterator emitting one extra token after stop=true (an off-by-one) and (b) the model generating repetitive content. Inspection of LlamaIterator.java:46-58 resolves (a) — next() reads the JSON, sets hasNext = !output.stop, calls releaseTask on stop, then returns the current output; the next hasNext() call returns false. There is no extra-iteration bug. Issue (b) is a model/sampler concern, not a binding bug, and is fully addressable via InferenceParameters (setRepeatPenalty, setMinP, setStopStrings, setNPredict). Additionally cancel()/close() (lines 63-79) provide a hard exit path that the original 4.1.0 code lacked. Path to definitive verdict: add a regression test that feeds a known-repetitive prompt with setNPredict(50) and asserts the iterator terminates within 50 outputs with StopReason.LIMIT — should pass on first run. Verdict can move to FIXED.
- URL: kherud#91
- Reporter: Mohsin-Fawad
- Created: 2025-01-20
User followed README steps but cannot integrate the library into an
Android Studio Java project. Requests a working MainActivity.java,
build.gradle.kts, and any other modified files.
Status in fork: STILL POSSIBLE. The fork has Android detection in OSInfo.java (lines 169-194) but ships no Android sample app, no Gradle integration documentation, and no pre-built arm64-v8a AAR. Next steps: produce a minimal sample under examples/android/ with MainActivity.java and build.gradle.kts and document NDK ABI mapping.
- URL: kherud#90
- Reporter: siddhsql
- Created: 2024-12-29
Asks for the documented process to rebuild the project when only Java sources are modified, reusing pre-built native binaries instead of recompiling llama.cpp.
Status in fork: FIXED (documented). CLAUDE.md "Build Commands" section explicitly separates mvn compile / mvn package (Java-only — no native rebuild) from the cmake -B build && cmake --build build native step. Running mvn package alone reuses the pre-built shared libraries under src/main/resources/net/ladenthin/llama/{OS}/{ARCH}/.
- URL: kherud#89
- Reporter: siddhsql
- Created: 2024-12-29
Asks for an explanation of the server_slot concept (referenced in
src/main/cpp/server.hpp).
Status in fork: NOT APPLICABLE. The hand-ported server.hpp has been removed; the project now compiles the upstream server-context.cpp/server-queue.cpp/server-task.cpp/server-models.cpp directly (see CLAUDE.md "Architecture" section). server_slot is now an upstream-owned concept; consumers of LlamaModel see it indirectly via the slot-related task params (InferenceParameters.setIdSlot, etc.).
- URL: kherud#88
- Reporter: yousefabdel1727
- Created: 2024-12-26
Asks how to send chat-style JSON requests (role: system, role: user) the
way the upstream llama-server accepts them.
Status in fork: FIXED. LlamaModel.chatComplete(InferenceParameters) (line 236) routes to native handleChatCompletions which accepts the standard OpenAI messages: [{role, content}, ...] JSON format (LlamaModel.java:215-238). The chat template is auto-applied via the compiled upstream chat pipeline.
- URL: kherud#87
- Reporter: siddhsql
- Created: 2024-12-18
In the MainExample chat loop, the context fills up over multiple turns,
slowing inference and eventually producing garbage. Asks how to reset the KV
cache / context between iterations.
Status in fork: FIXED. InferenceParameters.setCachePrompt(boolean) (line 116) controls reuse of the cached prefix, and per-slot KV state is managed by the upstream server. To "reset" between turns, the caller passes a fresh id_slot or sets cache_prompt=false. The upstream server also supports an explicit /slots/{id}:erase task type accessible via the chat path.
- URL: kherud#86
- Reporter: siddhsql
- Created: 2024-12-18
Asks whether the CUDA-classified JAR supports CPU fallback when no GPU is present, and requests example code / dependencies for an auto-fallback setup.
Status in fork: FIXED-AS-DOCUMENTED. The CUDA classifier cuda13-linux-x86-64 is built via .github/build_cuda_linux.sh (see CLAUDE.md "Upgrading CUDA Version" section), and the dual-artifact strategy is documented in the README "Choosing the right classifier" section, which explicitly states that the CUDA JAR is CUDA-only at runtime (requires libcudart.so.13 / libcublas.so.13 on the host) and does not auto-fall back to CPU. CPU users must pick the default classifier.
Deep-dive analysis: This is a documentation gap, not a code defect. Behaviorally: the CUDA-built libjllama.so dynamically links against libcudart.so.13 and libcublas.so.13. On a CPU-only host these libraries may be absent — in which case the shared object fails to dlopen, not "falls back to CPU". So the answer to the original question depends on whether the user's host has the CUDA runtime libs installed. Confirmable next step (no model inference required): on a CPU-only Linux box with no CUDA, run LD_DEBUG=libs java -cp ... net.ladenthin.llama.LlamaModel; if dlopen of libcudart.so.13 fails, the CUDA jar cannot load. Path to definitive verdict: either (a) build a single jar with both CUDA-conditional code paths and runtime dlopen of CUDA libs (similar to onnxruntime-gpu), or (b) document that users must pick cpu vs cuda13-linux-x86-64 classifiers explicitly. The current 91b4ae1 strategy is (b). Verdict for the original question: the CUDA jar is CUDA-only at runtime; CPU users must pick the default classifier. Update to FIXED-AS-DOCUMENTED once a README note is added.
- URL: kherud#85
- Reporter: s0t00524
- Created: 2024-11-26
Running the library under Rosetta 2 on an M1 Mac with Java 8
(OpenJDK 1.8.0_242, bsd-amd64) crashes with SIGILL inside
libggml.dylib+0x4724 (ggml_init+0x74). The same code works on
Linux x86_64. Suggests ggml_init uses CPU instructions Rosetta cannot
emulate for this build.
Status in fork: NEEDS INVESTIGATION. This is a Rosetta-2 specific x86_64 emulation defect; not project-side. The Java target was upgraded from 1.8 to 21 then explicitly downgraded back to bytecode 1.8 (a48957e Upgrade Java version to 21, ac84fe0 Downgrade Java target to 1.8). Recommend reproducing on Apple Silicon with native arm64 builds (no Rosetta) — the fork ships a Mac/aarch64/ native artifact.
- URL: kherud#84
- Reporter: litongjava
- Created: 2024-10-05
How-to question: how to run a reranker model such as
BAAI/bge-reranker-v2-m3 with the bindings.
Status in fork: FIXED. ModelParameters.enableReranking() (ModelParameters.java:1049) enables the rerank endpoint; LlamaModel.rerank(boolean, String query, String...) (LlamaModel.java:170) and rerank(query, documents) (line 187) return Pair<String,Float> or LlamaOutput. End-to-end coverage exists in src/test/java/net/ladenthin/llama/RerankingModelTest.java using jina-reranker-v1-tiny-en-Q4_0.gguf.
- URL: kherud#83
- Reporter: kyselat
- Created: 2024-10-02
Versions from 3.3.0 onward (downloaded via Maven Central) crash the JVM on
Windows 11 with EXCEPTION_ACCESS_VIOLATION inside msvcp140.dll. Earlier
versions up to 3.2.1 work. User has the latest VC++ 2022 redistributable
installed.
Status in fork: NEEDS INVESTIGATION. The fork's Windows artifact is rebuilt for b9284, and src/main/cpp/compat/ggml_x86_compat.c provides a 32-bit MSVC _InterlockedIncrement64 shim (CLAUDE.md ~b8808–b8831 row). The original Maven Central artifact (kherud groupId) is unaffected by this fork. Next steps: build the fork's Windows artifact and re-run the failing scenario; if it still crashes, profile under WinDbg against msvcp140.dll to identify the offending C++ runtime symbol.
- URL: kherud#82
- Reporter: SteveWorkshop
- Created: 2024-09-30
User tried the README's Gradle snippet for importing the native library and it
fails with Could not find method listOf() for arguments [mvn, compile]…
(a Groovy-vs-Kotlin DSL mismatch). Requests a working Android sample.
Status in fork: STILL POSSIBLE. No Android sample is shipped with the fork. Next steps: same as #91 — add examples/android/ with a working Gradle (Kotlin DSL) config and a verified arm64-v8a build of libjllama.so via the dockcross toolchain (.github/dockcross/).
- URL: kherud#81
- Reporter: shalva97
- Created: 2024-09-27
Feature request: a separate Android sample repository that can be cloned and run as-is to demonstrate the library on Android.
Status in fork: STILL POSSIBLE. No separate Android demo repo exists. Tracked alongside #81/#82/#91 as a documentation/sample deliverable.
- URL: kherud#80
- Reporter: shuttie
- Created: 2024-09-24
On 3.4.1, opening a model and immediately calling close() (without
generating) segfaults inside libc, with a stack pointing into
std::_Rb_tree::_M_erase. If generation happens first, close() succeeds.
Likely an uninitialized/half-initialized native field being destroyed.
Status in fork: FIXED (regression test landed in commit cba693c, PR #185). The native destructor now waits on
worker_ready before terminating (src/main/cpp/jllama.cpp:929-931), drains the
readers map under lock (jllama.cpp:925-928), and calls server.terminate()
twice with a 1 ms sleep specifically to close the race "where the thread
signalled ready but start_loop() hasn't yet set its internal running flag"
(comment at jllama.cpp:932-934). This is exactly the half-initialised case the
issue describes. MemoryManagementTest#testOpenCloseWithoutGeneration runs the
3-line repro from the original issue (open + immediate close, no generation)
inside a try-with-resources block, repeated 20 times — a JVM crash exits the
JUnit runner with a non-zero status, so a clean run is the green signal.
Deep-dive analysis: The original 3.4.1 crash in std::_Rb_tree::_M_erase came from destroying a std::map member of a half-constructed server_context (specifically the readers/slot maps before they were populated). The current code addresses this on three layers: (1) jctx->worker_ready.load() busy-wait at line 922 ensures construction finished, (2) readers.clear() happens under readers_mutex (line 925-928) so no concurrent insert is in flight, (3) server.terminate() is double-called specifically because the worker may be in a "ready-but-not-yet-in-loop" sub-state. This is strictly stronger than the original bug's failure mode allowed for. Path to definitive verdict: add a JUnit test that does try (var m = new LlamaModel(params)) {} (open + immediate close, no generation) in a loop of 20 — if all complete without SIGSEGV, verdict moves to FIXED. The current MemoryManagementTest.java does not contain this pattern. Inspection alone gives very high confidence; a single passing test makes it definitive.
- URL: kherud#79
- Reporter: xunuohope1107
- Created: 2024-09-19
On Android, model load succeeds but llamaModel.complete(inferParams) aborts
with FORTIFY: pthread_mutex_lock called on a destroyed mutex and SIGABRT.
Same code works on macOS. Suggests an Android-specific JNI/threading bug or a
library ABI mismatch.
Status in fork: NEEDS INVESTIGATION. The threading model has been substantially rewritten (single worker thread per jllama_context joined on delete, see jllama.cpp:683 and :938), and the worker Attach/DetachCurrentThread cycle is correct. With b9284 upstream and a properly built Android arm64-v8a libjllama.so, this is plausibly resolved. Next steps: reproduce on a current Android emulator with a fork-built shared library.
- URL: kherud#78
- Reporter: xunuohope1107
- Created: 2024-09-12
Upstream llama.cpp changed the LoRA API from a single lora_adapter to a
vector of lora_adapters structs (each carrying a path and a scaling factor).
Requests that the bindings update to the new structure and bump the pinned
llama.cpp version.
Status in fork: FIXED. ModelParameters.addLoraAdapter(String) (line 906) and addLoraScaledAdapter(String, float) (line 918) map to --lora and --lora-scaled respectively, both of which the upstream CLI parser already converts into the vector-of-lora_adapters form. setLoraInitWithoutApply() (line 1121) covers the --lora-init-without-apply flag. The llama.cpp pin is b9284, well past b3534.
- URL: kherud#77
- Reporter: 32kda
- Created: 2024-09-10
On Windows 10, attempting inference with codegemma-2b.Q2_K.gguf causes JVM
to exit with 0xC0000005 (access violation) shortly after start, with no
output and no HotSpot crash dump produced.
Status in fork: NEEDS INVESTIGATION. CodeGemma support has matured upstream since the report, and the fork's Windows artifact is built fresh at b9284. The 32-bit MSVC _InterlockedIncrement64 shim (src/main/cpp/compat/ggml_x86_compat.c) addresses one known Windows-specific build issue. Next steps: reproduce on Windows 10 x86-64 with codegemma-2b.Q2_K.gguf against the fork's artifact.
- URL: kherud#72
- Reporter: ys2940
- Created: 2024-07-23
On macOS M3 with JDK 1.8, compiling de.kherud.llama.LlamaModel fails with
com.sun.tools.javac.jvm.JNIWriter$TypeSignature$SignatureException on a
@org.jetbrains.annotations.Nullable :: java.util.function.BiConsumer
signature. Suggests javac -h cannot encode the annotated parametrised
type.
Status in fork: FIXED. The setLogger signature in LlamaModel.java:128 no longer carries @Nullable (public static native void setLogger(LogFormat format, BiConsumer<LogLevel, String> callback);). grep -n "@Nullable" src/main/java/net/ladenthin/llama/LlamaModel.java returns no hits. The build now targets bytecode 1.8 with JDK 21 (a48957e, ac84fe0), so javac -h accepts the signature.
- URL: kherud#70
- Reporter: ardinursyamsu
- Created: 2024-06-21
Feature request: expose llama.cpp's vocab_only flag, equivalent to the
Python binding's option, so a Spring Boot app can run a tokenizer-only
service separately from the model service.
Status in fork: FIXED. ModelParameters.setVocabOnly() (ModelParameters.java:1336) toggles ModelFlag.VOCAB_ONLY → --vocab-only (args/ModelFlag.java:92). The native layer also branches on jctx->vocab_only to skip server/worker init while keeping the vocab available for tokenize/decode (src/main/cpp/jllama.cpp:710-718, 923, 944).
- URL: kherud#50
- Reporter: RageshAntonyHM
- Created: 2024-02-28
Building for Android (arm64-v8a, armeabi-v7a) on macOS M2 fails at link
time: the Android NDK linker tries to link against the macOS-aarch64
libllama.so shipped under resources/de/kherud/llama/Mac/aarch64/, which is
not compatible with aarch64linux. Root cause: CMake/build picks the wrong
prebuilt artifact for Android targets.
Status in fork: PARTIALLY FIXED. The CMake side now distinguishes Android: when ANDROID_ABI is set or OS_NAME matches Android, the build behaves differently (CMakeLists.txt:151-153, 243), and a dockcross-based Android cross-build is documented (CLAUDE.md "Cross-compilation"). Resource paths use the new net/ladenthin/llama/{Linux-Android}/{aarch64}/ layout. Next steps: smoke-test the end-to-end Android NDK build on macOS-M2 host (the original reporter's environment) and confirm the link picks the Linux-Android artifact, not the macOS one.
Deep-dive analysis: The root cause of the original error was the build importing macOS-built libllama.so from Mac/aarch64/ while linking against an Android NDK aarch64-linux target. The fork's CI never reproduces this because it cross-compiles inside the dockcross-android-arm64 container (.github/workflows/publish.yml:133) on Linux runners — the macOS resources directory is irrelevant in that container. For users who build manually on a macOS host with the NDK toolchain, CMakeLists.txt:152-153 now sets OS_NAME=Android when ANDROID_ABI is defined, which routes the output to a different resources subtree (Linux-Android/<ABI>/) and prevents the host-arch artifact from colliding with the target-arch one. Path to definitive verdict: run cmake -B build-android -DANDROID_ABI=arm64-v8a -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake on a macOS-M2 host and confirm build-android/libjllama.so is produced without linker errors. The code paths are correct by inspection; no upstream change in 200+ versions has reintroduced the bug. Verdict can move to FIXED with one cross-host build verification.
- URL: kherud#34
- Reporter: yeroc
- Created: 2023-12-19
Feature request: add multimodal input support (referencing ggerganov/llama.cpp#3436).
Status in fork: FIXED. The upstream mtmd library is built and linked into jllama; projector controls and the typed ContentPart API are exposed; and decoded media is retained by the native bridge and processed through the upstream multimodal task path for blocking and streaming requests. See #103 for the full write-up.
Deep-dive analysis: Same conclusion as #103. Model loading and request execution are both verified; the real-model regression confirms image content affects the answer rather than merely producing a non-empty text-only response.
| # | Category | Title |
|---|---|---|
| 124 | Update request | Request an update |
| 123 | Update request | Supporting latest llama.cpp version |
| 121 | Android | Error on Android apk (aarch64 vs arm64-v8a) |
| 120 | Question | What models are supported? |
| 119 | Update request | Update questions |
| 117 | Android / crash | New model causing crash on Android |
| 116 | Lifecycle | Non-daemon thread inhibits JVM termination |
| 113 | Feature | Load progress callback |
| 112 | Model support | Qwen 3 architecture unknown |
| 111 | Question | Predefined number of cores |
| 110 | Feature | Batch embedding |
| 107 | Build | macOS OS_NAME is Darwin not Mac |
| 104 | Question | Enable offload_kqv |
| 103 | Feature / Android | VLM / image input |
| 102 | Memory leak | close() doesn't free native memory |
| 101 | Logging | Logger consumer ignored |
| 98 | Model support | Embedding model fails (result_output) |
| 95 | Iterator bug | Repetition that can't be ended |
| 91 | Android | Integration with Android Studio |
| 90 | Docs | Rebuild process for Java-only changes |
| 89 | Question | What are slots? |
| 88 | Question | JSON role system/user support |
| 87 | Question | Context reset per turn |
| 86 | Question | CUDA jar CPU fallback |
| 85 | macOS / crash | SIGILL under Rosetta 2 (M1, Java 8) |
| 84 | Question | rerank model usage |
| 83 | Windows / crash | EXCEPTION_ACCESS_VIOLATION in msvcp140 |
| 82 | Android | Gradle DSL mismatch in README |
| 81 | Feature | Android sample repository |
| 80 | Crash | Segfault on open+close |
| 79 | Android / crash | pthread_mutex_lock on destroyed mutex |
| 78 | Upstream API | New params.lora_adapters shape |
| 77 | Windows / crash | 0xC0000005 on CodeGemma-2B |
| 72 | Build | JNI signature exception on @Nullable BiConsumer |
| 70 | Feature | vocab_only flag |
| 50 | Android / build | macOS prebuilt picked for Android target |
| 34 | Feature | Multimodal inputs |
| # | Verdict | Short status | Key evidence |
|---|---|---|---|
| 124 | FIXED | Continuous version bumps; pinned to b9284 | CLAUDE.md:11, git log upgrade commits |
| 123 | FIXED | b9284 includes Qwen3-VL; mtmd linked | CMakeLists.txt:255, CLAUDE.md:11 |
| 121 | FIXED (64-bit) | aarch64 path consistent between CI build and loader; 32-bit armeabi-v7a limitation documented in README "Choosing the right classifier" |
publish.yml:133, OSInfo.java:256-259,350, README.md |
| 120 | FIXED | Architecture support comes from b9284 | CLAUDE.md:11 |
| 119 | FIXED | Per-release bump cadence to b9284 | git log --oneline Upgrade commits |
| 117 | NEEDS INVESTIGATION | Upstream backend-device crash; reproduce | b9284 is current; reproduce on emulator |
| 116 | FIXED | Worker joined explicitly on delete | jllama.cpp:937-940 |
| 113 | STILL POSSIBLE | No load-progress callback in API | No Progress in LlamaModel.java |
| 112 | FIXED | Qwen3 in b9284 | CLAUDE.md:11 |
| 111 | FIXED | setThreads/setThreadsBatch exist |
ModelParameters.java:51,61 |
| 110 | FIXED | handleEmbeddings accepts batch JSON |
LlamaModel.java:316, json_helpers.hpp:137 |
| 107 | FIXED | CMake matches both Mac and Darwin | CMakeLists.txt:196 |
| 104 | FIXED | NO_KV_OFFLOAD flag exposed |
args/ModelFlag.java:50 |
| 103 | FIXED | mtmd linked; typed image API in PR #189 (ContentPart, ChatMessage(role, List<ContentPart>), InferenceParameters.withMessages(List<ChatMessage>)) |
ContentPart.java, ChatMessage.java, InferenceParameters.java, ModelParameters.java:1250-1281 |
| 102 | FIXED | Destructor drains workers and frees ctx; covered by MemoryManagementTest#testOpenCloseLoopDoesNotLeak (commit cba693c, PR #185) |
jllama.cpp:917-948 |
| 101 | FIXED | Trampoline calls BiConsumer | jllama.cpp:954-977 |
| 98 | FIXED | enableEmbedding + setPoolingType; covered by LlamaEmbeddingsTest#testNomicEmbedLoads (commit cba693c, PR #185; CI downloads nomic-embed-text-v1.5.f16.gguf) |
ModelParameters.java:1040,606 |
| 95 | FIXED | Iterator close/AutoCloseable wired; covered by LlamaModelTest#testIteratorTerminatesOnRepetitivePrompt (commit cba693c, PR #185) |
LlamaIterator.java:51-77 |
| 91 | STILL POSSIBLE | No Android sample shipped | No examples/android/ |
| 90 | FIXED | mvn compile vs cmake split documented | CLAUDE.md Build Commands |
| 89 | NOT APPLICABLE | Hand-port server.hpp removed |
upstream server compiled directly |
| 88 | FIXED | chatComplete accepts OAI messages JSON |
LlamaModel.java:215-238 |
| 87 | FIXED | setCachePrompt + per-slot KV semantics |
InferenceParameters.java:116 |
| 86 | FIXED-AS-DOCUMENTED | CUDA jar is CUDA-runtime-required; user must pick classifier. README "Choosing the right classifier" documents this. | .github/build_cuda_linux.sh, commit 91b4ae1, README.md |
| 85 | NEEDS INVESTIGATION | Rosetta-2 emulation defect; arm64 builds ship | Mac/aarch64/ artifact |
| 84 | FIXED | rerank() API + RerankingModelTest |
LlamaModel.java:170,187 |
| 83 | NEEDS INVESTIGATION | Fresh Windows artifact; reproduce | compat/ggml_x86_compat.c |
| 82 | STILL POSSIBLE | No Android Gradle sample | See #91 |
| 81 | STILL POSSIBLE | No Android demo repo | See #91 |
| 80 | FIXED | Half-init race closed by double-terminate; covered by MemoryManagementTest#testOpenCloseWithoutGeneration (commit cba693c, PR #185) |
jllama.cpp:932-940 |
| 79 | NEEDS INVESTIGATION | Threading rewritten; needs Android repro | jllama.cpp:683,938 |
| 78 | FIXED | addLoraAdapter/addLoraScaledAdapter |
ModelParameters.java:906,918 |
| 77 | NEEDS INVESTIGATION | Fresh Windows build at b9284 | compat/ggml_x86_compat.c |
| 72 | FIXED | @Nullable removed from setLogger |
LlamaModel.java:128 |
| 70 | FIXED | setVocabOnly() + native branch |
ModelParameters.java:1336, jllama.cpp:710-718 |
| 50 | PARTIALLY FIXED | CMake handles ANDROID_ABI; needs e2e test | CMakeLists.txt:151-153,243 |
| 34 | FIXED | mtmd linked; typed image API in PR #189 (see #103 row) | ContentPart.java, ChatMessage.java, CMakeLists.txt:255 |
After fetching the verbatim text of each LIKELY FIXED / PARTIALLY FIXED issue
on github.com/kherud/java-llama.cpp, the reproduction details are clearer than
the summary lines above suggest. None of the original issues carry attached
files; all relevant context is in the issue body itself, and several refine
or change the verdict.
| # | New info from issue body | Test feasibility |
|---|---|---|
| 102 | Exact repro: 10-iteration new LlamaModel(...).close() loop with setThreads(4).setKeep(-1).setCtxSize(1024).setGpuLayers(0). Failure mode: GPU OOM exception, CPU swap thrash. No stack trace attached. |
DONE — MemoryManagementTest#testOpenCloseLoopDoesNotLeak (commit 713d426) ports the reporter's loop to 20 iterations with a /proc/self/status:VmRSS delta assertion on Linux. |
| 98 | Reporter's config was literally new ModelParameters().setModel(...).setBatchSize(8192).setUbatchSize(8192) — no enableEmbedding() call. The original "bug" was that the bindings did not forward --embedding at all; the upstream result_output assertion fired because the embedding pipeline was never initialised. |
DONE — LlamaEmbeddingsTest#testNomicEmbedLoads (commit 713d426) runs the reporter's exact config plus enableEmbedding(); gated on net.ladenthin.llama.nomic.path; CI downloads the model via NOMIC_EMBED_MODEL_URL in publish.yml. |
| 95 | Reporter pastes the next() method and argues the design is wrong: when output.stop=true, the method returns that output and ends. No model, prompt or reproduction provided. |
DONE — LlamaModelTest#testIteratorTerminatesOnRepetitivePrompt (commit 713d426) drives the iterator with a repetitive prompt at nPredict=30, temperature=0.0f and asserts termination within nPredict+1 outputs. |
| 80 | Exact repro: Kotlin-style 3 lines (val params..., val model = new LlamaModel(params), model.close()) with qwen2-0_5b-instruct-q4_0.gguf. JDK 17.0.12+7, java-llama.cpp 3.4.1. SIGSEGV in std::_Rb_tree during delete. Reporter said they intended to follow up with a -DLLAMA_DEBUG build but never did. |
DONE — MemoryManagementTest#testOpenCloseWithoutGeneration (commit 713d426) maps the 3-line repro to 20 iterations of try-with-resources open + immediate close; a JVM crash exits the runner non-zero. |
| 103 | Specifically asks about Qwen2.5-VL on Android. No code attempted. | DONE (typed API) — PR #189 ships ContentPart + ChatMessage(role, List<ContentPart>) + InferenceParameters.withMessages(List<ChatMessage>). Android-sample tail tracked separately. |
| 86 | Just a question: "does the CUDA jar handle CPU fallback?". No code. | Not unit-testable. Documentation task. |
| 34 | One-line feature request linking upstream PR #3436 (LLaVA). No specifics. | DONE — subsumed by #103 (PR #189). |
| 121 | (Not refetched — Android aarch64 vs arm64-v8a mismatch; already analysed in deep-dive.) |
Verified by code; needs an Android boot test, not a unit test. |
| 50 | (Not refetched — Android cross-build on macOS-M2 host; already analysed in deep-dive.) | Verified by CMake logic; needs a cross-compile smoke test, not a unit test. |
Four small JUnit tests close out four LIKELY FIXED items. All four belong in
src/test/java/net/ladenthin/llama/MemoryManagementTest.java or
src/test/java/net/ladenthin/llama/LlamaModelTest.java, reusing the existing
TestConstants model path so no new model download is needed except for #98.
Status: all four tests below shipped via PR #185 (commit
cba693c). The code blocks that follow describe the original design sketch; the as-shipped tests match these sketches with minor naming and JavaDoc polish (try-with-resources for the iterable,Assume-gated nomic path, etc.). All four issues (#80, #95, #98, #102) are now FIXED.
Direct port of the reporter's repro:
@Test
public void testOpenCloseLoopDoesNotLeak() {
ModelParameters params = new ModelParameters()
.setModel(TestConstants.MODEL_PATH)
.setThreads(4).setKeep(-1).setCtxSize(1024).setGpuLayers(0);
long baseline = currentVmRss();
for (int i = 0; i < 20; i++) {
try (LlamaModel m = new LlamaModel(params)) {
// no-op: open + close
}
}
System.gc();
long after = currentVmRss();
// Allow some slop for JIT/heap growth, but rule out monotonic leak
Assert.assertTrue("VmRSS grew by " + (after - baseline) + " kB over 20 iters",
after - baseline < 200_000); // 200 MB tolerance
}
private static long currentVmRss() {
try {
for (String line : Files.readAllLines(Path.of("/proc/self/status"))) {
if (line.startsWith("VmRSS:")) {
return Long.parseLong(line.replaceAll("\\D+", ""));
}
}
} catch (IOException e) { /* non-Linux */ }
return 0;
}currentVmRss() is a no-op on macOS/Windows; the test then degenerates to a
"does not throw / does not crash" smoke test, which is still useful. For
CUDA, add an nvidia-smi poll in a sibling @Test gated on
Assume.assumeTrue(hasCuda()).
@Test
public void testOpenCloseWithoutGeneration() {
ModelParameters params = new ModelParameters()
.setModel(TestConstants.MODEL_PATH)
.setCtxSize(512).setGpuLayers(0);
// The original 3.4.1 crash was a one-shot SIGSEGV. Repeat to harden.
for (int i = 0; i < 20; i++) {
try (LlamaModel m = new LlamaModel(params)) {
// No generation between construction and close.
}
}
}A JVM crash exits the JUnit runner with non-zero; if all 20 iterations complete, the verdict moves to FIXED.
@Test
public void testIteratorTerminatesOnRepetitivePrompt() {
InferenceParameters infer = new InferenceParameters("Repeat AAA forever: AAA AAA")
.setNPredict(30)
.setTemperature(0.0f);
int count = 0;
try (LlamaIterable it = model.generate(infer)) {
for (LlamaOutput out : it) {
count++;
Assert.assertTrue("iterator overran nPredict", count <= 31);
}
}
Assert.assertTrue("iterator must produce at least 1 token", count >= 1);
}The original "bug" is a design objection, not a defect. Test confirms iteration terminates deterministically.
@Test
public void testNomicEmbedLoads() {
String path = System.getProperty("net.ladenthin.llama.nomic.path");
Assume.assumeNotNull("nomic model path not set", path);
ModelParameters params = new ModelParameters()
.setModel(path).setBatchSize(8192).setUbatchSize(8192)
.enableEmbedding(); // <-- the fix
try (LlamaModel m = new LlamaModel(params)) {
float[] embedding = m.embed("search_query: What is TSNE?");
Assert.assertEquals(768, embedding.length);
}
}Gated on a system property so CI without the 120 MB model file simply skips it.
Optional CI step: download nomic-embed-text-v1.5.f16.gguf from HuggingFace in
the same pattern as the existing CodeLlama / Jina-Reranker model downloads.
| # | Why not unit-testable | Action |
|---|---|---|
| 103, 34 | (Historic — typed image API and native media dispatch were incomplete at the time of this audit.) | DONE: ContentPart + ChatMessage(role, List<ContentPart>) emit OAI multipart content, and the JNI bridge retains decoded media for upstream mtmd processing. The model-gated integration test covers blocking, typed, streaming, and semantic image handling. |
| 86 | Question about jar packaging behaviour, not code defect. | Documentation: add a README section "Choosing the right classifier" stating that the CUDA jar requires the CUDA runtime libraries at load time and does not auto-fall-back. |
| 121, 50 | Android runtime / cross-host build path — needs an emulator boot or a macOS-M2 cross-compile, not a JVM test. | CI matrix expansion: add an Android emulator job that boots a stock arm64-v8a AVD and runs the existing LlamaModelTest against the dockcross-built libjllama.so. |
- First PR (small, high-value): MERGED as PR #185, commit
cba693c. Adds the four JUnit tests (MemoryManagementTest#testOpenCloseLoopDoesNotLeak,MemoryManagementTest#testOpenCloseWithoutGeneration,LlamaModelTest#testIteratorTerminatesOnRepetitivePrompt,LlamaEmbeddingsTest#testNomicEmbedLoads),TestConstants.PROP_NOMIC_MODEL_PATHNOMIC_EMBED_DIM, theNOMIC_EMBED_MODEL_URL/_NAMEenv vars and matching curl +-Dnet.ladenthin.llama.nomic.path=…wiring in the linux-x86_64 Java test job of.github/workflows/publish.yml, and the local-build documentation inCLAUDE.md("Building the native library for local Java tests"). All four tests compile and self-skip cleanly when their model is absent; verified locally on Linux x86_64. Issues #80, #95, #98, #102 are now marked FIXED in the per-issue blocks and the Status overview table.
- Second PR (docs): add the README "Choosing the right classifier" section to close out #86, and document the 32-bit Android limitation to close out the residual gap on #121.
- Third PR (feature): SHIPPED as PR #189. Adds the typed multimodal
surface (
ContentPart,ChatMessage(role, List<ContentPart>),InferenceParameters.withMessages(List<ChatMessage>)) plus aMultimodalIntegrationTestgated onnet.ladenthin.llama.vision.model,.vision.mmproj, and.vision.imagesystem properties. CI inpublish.ymldownloads a small vision model + mmproj + an author- licensed (MIT-granted) test image from Wikimedia Commons and passes the paths on themvn testcommand line. - Fourth PR (CI): add an Android emulator job to formally close #121 and #50.
Step 1 ships as one commit and unblocks the four FIXED upgrades pending the
first green CI run. Step 2 is mechanical and requires no design decisions.
Step 3 requires choosing an image-input encoding (data: URL vs raw bytes)
and is the natural follow-up. Step 4 is the largest investment but closes the
remaining STILL POSSIBLE Android cluster (#79, #81, #82, #91, #117, #121)
all at once.