Comparison source: ogx-ai/llama-stack-client-kotlin (version 0.2.14, a Stainless-generated Kotlin SDK for the Llama Stack REST API with an optional ExecuTorch-backed local-inference path).
This document inventories candidate features for java-llama.cpp derived from
that comparison, with rough effort estimates. Effort is given in
T-shirt sizes:
| Size | Calendar effort (1 engineer) | Description |
|---|---|---|
| XS | < 0.5 day | Trivial Java-side change, no JNI |
| S | 0.5 – 2 days | Java surface + minor JNI/JSON wiring |
| M | 2 – 5 days | New JNI methods, native plumbing, tests |
| L | 1 – 2 weeks | New native subsystem or large API surface |
| XL | > 2 weeks | Architectural addition |
| Capability | Status |
|---|---|
Chat / completion (blocking + streaming via LlamaIterable) |
✅ |
Embeddings (embed(String)) |
✅ |
Rerank (rerank(query, docs)) |
✅ |
| Grammar + JSON-schema constrained output | ✅ |
| Rich sampling (DRY, mirostat, dyn-temp, XTC, top-n-σ) | ✅ |
| Reasoning budget / reasoning format | ✅ |
| Slot save / restore / erase | ✅ |
continueFinalMessage |
✅ |
| Tokenize / decode / template apply | ✅ |
Metrics string (getMetrics()) + typed ServerMetrics |
✅ |
| Speculative draft model wiring | ✅ |
Typed ChatRequest / ChatResponse + tool calling |
✅ (§2.2) |
CompletableFuture async wrappers |
✅ (§2.3) |
Reactive Streams Publisher<LlamaOutput> token stream |
✅ (§2.3) |
completeBatch / chatBatch parallel dispatch |
✅ (§2.4) |
Typed Usage / Timings / CompletionResult |
✅ (§2.5) |
Session helper (thread-safe per instance) |
✅ (§2.6) |
AutoCloseable iterator + cancel polish |
✅ (§2.7) |
Per-request setJsonSchema + completeAsJson<T> |
✅ (§2.8) |
Typed TokenLogprob in LlamaOutput |
✅ (§2.9) |
CancellationToken (cooperative) |
✅ (§2.10) |
LoadProgressCallback model-load progress |
✅ (#113) |
These do not need work — they already match or exceed the Kotlin client.
Each §2.x subsection below carries a Status: line at the top:
| Marker | Meaning |
|---|---|
SHIPPED |
Fully landed; commit refs follow. |
PARTIAL |
Core landed; a documented follow-up remains (called out inline). |
OPEN |
Not started. |
All references point to PR #188 on the claude/upbeat-hypatia-wPdK5
branch unless noted.
Status: SHIPPED (typed Java surface). The original L-effort scope assumed
new JNI plumbing was required, but on inspection the upstream OAI chat path
(oaicompat_chat_params_parse in server-common.cpp) already detects
{"type":"image_url","image_url":{"url":"data:..."}} blocks and routes them
through the compiled-in mtmd pipeline, and the project's
handleChatCompletions JNI method forwards the request JSON intact. Only the
Java-side convenience to emit the multipart-array content was missing.
This pass adds:
ContentPartvalue type (TEXT/IMAGE_URL) with static factoriestext(...),imageUrl(...),imageBytes(byte[], mime), andimageFile(Path)(auto-detects png/jpeg/webp/gif from the extension and base64-encodes into adata:URI).ChatMessage(String role, List<ContentPart> parts)constructor plususerMultimodal(ContentPart...)factory,getParts(), andhasParts(). The legacyChatMessage(role, content)ctor and existing serializer path are unchanged.InferenceParameters.setMessages(List<ChatMessage>)overload that routes through a newParameterJsonSerializer.buildMessages(List<ChatMessage>)emitting array-formcontentonly when a message has parts.- 25 unit tests in
ContentPartTestandMultimodalMessagesTestcover the factory contracts, the parts/legacy split, and the OAI multipart JSON shape; the 123 existingChatMessage/InferenceParameters/ParameterJsonSerializertests still pass.
A multimodal call from Java now looks like:
LlamaModel model = new LlamaModel(new ModelParameters()
.setModel("vision-model.gguf")
.setMmproj("vision-projector.gguf"));
String reply = model.chatCompleteText(new InferenceParameters("")
.setMessages(java.util.Collections.singletonList(
ChatMessage.userMultimodal(
ContentPart.text("What is in this image?"),
ContentPart.imageFile(java.nio.file.Paths.get("photo.jpg"))))));Zero new JNI symbols; zero risk to existing text-only chat callers.
Gap. Upstream llama.cpp ships mtmd (vision + audio for some models) and
the compiled-in server already pulls it in via mtmd.h / mtmd-helper.h. No
Java method currently accepts image input. Kotlin examples show base64 image
chat against vision models.
Proposal.
InferenceParameters.addImage(byte[] png)/addImage(Path)/addImageBase64(String).ModelParameters.setMmproj(Path)to load the mmproj projector file.- JNI: feed images into the server task params (
mtmd_*API).
Effort: L — non-trivial JNI plumbing, lifecycle of mtmd_context,
test fixtures for vision models, but most of the heavy lifting is already
upstream.
Value. Biggest user-visible capability missing today. Unlocks Qwen-VL, Gemma 3, MiniCPM-V, LLaVA, etc.
Status: SHIPPED (PR #188, commit f2c7ed1). New value types
ChatRequest, ChatResponse, ChatChoice, ChatMessage (extended),
ToolCall, ToolDefinition, plus ToolHandler functional interface.
LlamaModel.chat(ChatRequest) returns a typed ChatResponse;
chatWithTools(ChatRequest, Map<String, ToolHandler>) runs the agent
auto-loop (capturing handler exceptions as {"error":...} tool results
so the loop continues; cap via ChatRequest.maxToolRounds, default 8).
The tier-1 (typed response only) and tier-2 (manual tool round-trip via
ChatMessage.toolResult) APIs are equally usable.
Gap. Today: setMessages(String system, List<Pair<String,String>>) and
chatComplete → String. The server parses tool calls
(common_chat_* infrastructure) but Java callers must scrape JSON
themselves. Kotlin exposes typed ChatCompletionRequest / ChatResponse
with toolCalls, finishReason, usage.
Proposal.
ChatMessagerecord:role(enum: system/user/assistant/tool),content(text + optional image parts),toolCalls,toolCallId,name.ToolDefinitionrecord (name, description, JSON-schema parameters) andInferenceParameters.setTools(List<ToolDefinition>).ChatResponserecord:content,finishReason(enum),toolCalls,usage(see §2.5),timings, optionallogprobs.- New
LlamaModel.chat(ChatRequest)/chatStream(ChatRequest).
Effort: M — Java-side data model + JSON marshalling + a wrapper around the existing native chat path. No new native code needed.
Value. Brings the library in line with every modern Java LLM SDK (LangChain4j, Spring AI). Removes the "parse the JSON string yourself" papercut.
Status: SHIPPED + REVERTED REACTIVE PUBLISHER. CompletableFuture wrappers
(completeAsync, chatCompleteAsync, chatCompleteTextAsync, plus a
completeAsync(params, CancellationToken) bridge that propagates
future.cancel(true) into the cooperative token) in commit 1e673a9 —
still shipped.
The reactive Publisher<LlamaOutput> follow-up was shipped in commit afa4f65
as LlamaModel.streamPublisher(...) / streamChatPublisher(...) backed by
LlamaPublisher. It has since been removed — see the README section
"Reactive integration" for the rationale and the canonical replacement
patterns. In short: LlamaIterable is already
Iterable<LlamaOutput> & AutoCloseable, and every mainstream reactive
library (Project Reactor, RxJava 3, Kotlin coroutines Flow, Akka Streams)
wraps it in a few lines via its own resource-management primitive
(Flux.using, Flowable.using, Kotlin use {}, etc.). Keeping a hand-rolled
Publisher in the binding imposed a mandatory org.reactivestreams runtime
dep on every consumer for a class that had zero non-test callers —
including the canonical Android sample LLaMAndroid,
which uses LlamaIterable directly inside a Kotlin flow { } block. Pattern
correctness is now pinned end-to-end by ReactorIntegrationTest
(test-scope reactor-core); zero runtime deps added.
Gap. All LlamaModel methods are blocking. Kotlin offers
suspend fun + Flow variants. JVM users currently dedicate platform
threads per inference.
Proposal.
CompletableFuture<String> completeAsync(InferenceParameters)CompletableFuture<ChatResponse> chatAsync(ChatRequest)- Reactive Streams
Publisher<String> generateReactive(...)(orFlow.Publisherfromjava.util.concurrent.Flowto avoid a new dependency). - Backed by the existing native worker thread inside
jllama_context; no extra Java thread pool.
Effort: S for CompletableFuture wrappers, M for a proper
backpressure-aware Flow.Publisher streaming token producer.
Value. Enables composition with virtual threads, project Reactor, RxJava, Kotlin coroutines from Java consumers.
Status: SHIPPED (PR #188, commit de457b2).
LlamaModel.completeBatch(List<InferenceParameters>),
completeBatchWithStats(...), and chatBatch(List<ChatRequest>) dispatch
all requests at once via the existing async wrappers; results returned in
input order. Throughput scales with ModelParameters.setParallel(N)
(default N=1 runs sequentially across the single slot).
Gap. llama.cpp natively serves parallel slots; the compiled-in server
handles concurrent tasks. LlamaModel exposes no batch entry point.
Proposal.
List<String> completeBatch(List<InferenceParameters>).List<ChatResponse> chatBatch(List<ChatRequest>).Stream<String> generateBatch(...)returning results in submission order.- Configure parallelism via existing
ModelParameters(e.g. n_parallel).
Effort: M — server already supports it; needs a Java entry that dispatches N tasks, awaits all, and maps results.
Value. Throughput multiplier for embedding / classification / rerank pipelines; close to a free win.
Status: SHIPPED (PR #188, commits fe1cf3b + c529499). Usage,
Timings, and ServerMetrics value classes + LlamaModel.getMetricsTyped()
parse server-wide metrics. Per-completion Usage/Timings land in
ChatResponse (§2.2) and in the new CompletionResult returned by
LlamaModel.completeWithStats(InferenceParameters).
Gap. getMetrics() returns a raw JSON String. Kotlin exposes
Usage(promptTokens, completionTokens, totalTokens) plus a richer
Timings (tokensPerSecond, promptMs, predictedMs, cacheHit,
drafted, accepted).
Proposal.
UsageandTimingsrecords.- Returned as part of
ChatResponse(§2.2) and from a newLlamaModel.getMetricsTyped()accessor.
Effort: XS if scoped to Timings/Usage parsing only,
S when wired into the new ChatResponse.
Value. Removes ad-hoc JSON parsing. Cheap.
Status: SHIPPED. Initial AutoCloseable wrapper with send(...),
stream(...), commitStreamedReply(...), save(Path) / restore(Path),
and an optional InferenceParameters customizer landed in PR #188
(commit e4f531c). Per-session locking landed as the M-effort
follow-up: every public Session method is now serialized on a private
intrinsic lock, and stream(...) sets a "streaming in progress" guard
that causes send(...), a second stream(...), save(...), and
restore(...) to fail-fast with IllegalStateException until
commitStreamedReply(...) clears it. Covered by
SessionConcurrencyTest.
Gap. Slots exist as a low-level primitive. Kotlin offers "agents/sessions/turns" with persistence and resume.
Proposal.
Session(AutoCloseable) owning a slot id, an accumulatingList<ChatMessage>, andsave(Path)/restore(Path)that delegate tosaveSlot/restoreSlot.session.send(userMessage)→ChatResponse.session.stream(userMessage)→LlamaIterable.
Effort: S if it stays a thin wrapper over existing slot APIs, M if we add proper concurrency / per-session locking.
Value. Common conversational use case becomes a one-liner.
Status: SHIPPED (PR #188, commit d1c9fb0). LlamaIterator already
implemented AutoCloseable with cancel()/close(); this commit
audited the path, documented the cancel-vs-stop nuance and idempotency
in the javadoc, added a try-with-resources example on
LlamaModel.generate(...), and added testIteratorCloseIdempotent.
Gap. LlamaIterable / LlamaIterator cannot be cancelled mid-stream;
the underlying slot task keeps running until natural stop. Kotlin marks
streaming returns @MustBeClosed.
Proposal.
- Make
LlamaIteratorimplementAutoCloseablewithclose()cancelling the server task (slot stop). - Document try-with-resources usage.
Effort: S — small JNI addition (cancel_task(task_id)), straightforward
Java side.
Value. Prevents wasted compute when the consumer gives up early.
Status: SHIPPED (PR #188, commit 80e5c13).
InferenceParameters.setJsonSchema(String) mirrors the existing
setGrammar. LlamaModel.completeAsJson(Class<T>, String schema, InferenceParameters)
sets the schema and Jackson-deserializes the result. The
single-argument overload completeAsJson(Class<T>, InferenceParameters)
trusts that the caller already set schema/grammar.
Gap. setJsonSchema / setGrammar already exist on ModelParameters
but not on InferenceParameters. No typed-result helper.
Proposal.
- Add
setJsonSchema(String)/setGrammar(String)toInferenceParameters(per-request schema). <T> T completeAsJson(Class<T>, InferenceParameters)— derive the JSON schema from the class via Jackson (JacksonModulealready on classpath?) or document a manual-schema variant.
Effort: S.
Value. Matches the "structured outputs" UX of OpenAI / Llama Stack SDKs.
Status: SHIPPED (PR #188, commit a8077b6). TokenLogprob value
type carries token, tokenId, logprob, and the nested
topLogprobs alternatives. LlamaOutput.logprobs is populated by
CompletionResponseParser.parseLogprobs (post-sampling prob or
pre-sampling logprob mode auto-detected). Also surfaces in
CompletionResult.getLogprobs() (§2.5).
Gap. setNProbs exists; the result type is a plain String, so
per-token probabilities are not surfaced.
Proposal.
- Add
List<TokenLogprob> logprobstoChatResponse(§2.2) and to a newLlamaOutputfield for non-chat completion. TokenLogprob { String token; int tokenId; float logprob; List<TokenLogprob> topLogprobs; }.
Effort: S.
Value. Required for evaluation / uncertainty estimation use cases.
Status: SHIPPED — cooperative only (PR #188, commits
ad66e3a + e3b9043). CancellationToken.cancel() sets a
volatile flag observed between tokens by the inference loop in
LlamaModel.complete(InferenceParameters, CancellationToken).
Effective latency is one token interval (the loop checks at each
token boundary, typically tens to a few hundred ms).
Sub-token follow-up: attempted and reverted. Two design passes
tried to make cancel() post a SERVER_TASK_TYPE_CANCEL message to
the upstream server_queue from the cancelling thread so the worker
would interrupt mid-generation:
- First attempt eagerly erased the reader's
unique_ptrfromjctx->readerson the cancelling thread. That raced against the inference thread's rawrdpointer borrowed insidereceiveCompletionJson's call tord->next(), producing a use-after-free and astd::system_errorJVM abort in CI. - Second attempt left the reader alive and posted CANCEL through
the reader's public
queue_tasks.post(...)reference, expecting the slot's release to emit a natural stop result onqueue_results. Inspection ofbuild/_deps/llama.cpp-src/tools/server/server-context.cpp:356-376showedserver_slot::release()only setsstate = SLOT_STATE_IDLEand callsreset(); it does not post any result toqueue_results. The inference thread therefore blocked forever insiderecv_with_timeout(id_tasks, 1 s), polling at 1 Hz with a hard-codedshould_stop = []{ return false; }. CI hung onLlamaModelTestand had to be cancelled manually.
Both attempts were reverted on the same branch (705f980 and the
follow-up revert commit). The cooperative path already achieves
one-token-interval latency, which matches the sub-token goal in
practice. A genuine sub-token cancel would require an upstream
change to SERVER_TASK_TYPE_CANCEL's slot handler so it emits a
stop result on queue_results, or a should_stop predicate plumbed
into receiveCompletionJson that observes the token (latency ≥
1 s polling interval — worse than cooperative).
Gap. A blocking complete(...) cannot be aborted from another thread.
Proposal.
complete(params, CancellationToken token)overload.- Token wraps a native task id;
token.cancel()issues a stop.
Effort: S — overlaps with §2.7's cancel_task primitive.
Value. Enables UI timeouts and request-scoped cancellation in web frameworks.
| Kotlin feature | Why we should not port it |
|---|---|
| Agents framework / Turns | Belongs in a higher-level "stack" layer (LangChain4j, Spring AI). Out of scope for a thin JNI binding. |
| Shields / Safety service | Same — separate library concern. |
| Eval / Benchmarks / Scoring | Separate test/eval tooling, not inference. |
| SyntheticDataGeneration | Out of scope. |
| Telemetry spans (OTEL) | Users can add their own around LlamaModel calls. |
| VectorDB / VectorStore / RAG | Would duplicate ObjectBox / Lucene / pgvector / Chroma integrations. Higher-level libs do this better. |
| Files / Datasets | REST-only concept; no on-device analogue. |
| HTTP client conveniences (retries, proxy, timeouts) | N/A for in-process JNI. |
Historical note (kept for context). All ten items below are now SHIPPED — the original §2.1 / §2.2 / §2.5 / §2.7 / §2.3 / §2.4 / §2.8 / §2.9 / §2.6 / §2.10 sequence was delivered across PR #188 and PR #189. §2.10 ships cooperative only; the M-effort sub-token follow-up was attempted twice and reverted (see the postmortem in the §2.10 entry above for why). §2.1 ended up much smaller than the original L estimate once the upstream OAI chat path was found to already accept
image_urlblocks. This section is preserved so the original effort estimates can be compared against what actually shipped; it is no longer a roadmap.
- §2.1 Multimodal (L) — biggest capability gap, isolated subsystem.
- §2.2 Typed Chat model + tool calling (M) — foundational; other features (usage, logprobs, async) all return / accept these types.
- §2.5 Usage / Timings (XS) — quick, lands inside §2.2.
- §2.7 + §2.10 Cancellation primitive (S) — small JNI add, unblocks §2.3.
- §2.3 Async API (S–M) —
CompletableFuturefirst, reactive later. - §2.4 Batch inference (M).
- §2.6 Session helper (S).
- §2.8 Structured-output helpers (S).
- §2.9 Logprobs in result (S).
Total realistic effort for the full list: ~5–7 engineer-weeks, with multimodal alone accounting for roughly a third of that.
- Multimodal scope — vision only first, or also audio (Qwen-Audio, Whisper-style mmproj)?
- Reactive dep — use
java.util.concurrent.Flow(JDK 9+ but project targets bytecode 1.8) or add Reactive Streams as an optional Maven classifier? - Tool calling — expose the raw
common_chat_*parser output, or normalize to an OpenAI-compatible shape? - Session persistence format — reuse llama.cpp's slot binary file, or also store the message history as JSON next to it?