Feature Investigation — ideas from pure-Java sibling runtimes and `llamacpp4j`

Comparison sources (all surveyed in one pass for this document):

| Repo | Shape | License | Survey notes |
| --- | --- | --- | --- |
| mukel/llama3.java | Pure Java, single-file (~3.4k LOC), Vector API + GraalVM Native Image | MIT | Llama 3 / 3.1 / 3.2 |
| mukel/gemma4.java | Pure Java, single-file (~3.9k LOC) | Apache 2.0 | Gemma 4 + earlier Gemma 2/3 |
| mukel/gptoss.java | Pure Java, single-file | Apache 2.0 | OpenAI GPT-OSS (Harmony chat format) |
| mukel/qwen35.java | Pure Java, single-file | Apache 2.0 | Qwen 3.5 dense + MoE |
| mukel/nemotron3.java | Pure Java, single-file | Apache 2.0 | NVIDIA Nemotron-3 (dense + MoE + recurrent SSM) |
| sebicom/llamacpp4j | Alternative JNI binding (SWIG-generated facade over `llama.h`) | unspecified | Dormant: 1 commit (2023-07-04), pre-GGUF (llama.cpp build 491), no LICENSE, no tests, no CI |

The 5 mukel projects are written by the same author (Alfonso² Peterssen), share a single-file template, and re-implement GGUF parsing + tensor kernels in pure Java. They are NOT direct competitors to java-llama.cpp (which delegates inference to llama.cpp via JNI); they are interesting because they have better operator-facing ergonomics at the CLI and example layers.

llamacpp4j is the only other Java-side JNI binding to llama.cpp; the survey looked specifically for API-shape ideas and capabilities not currently exposed here.

Effort sizing (mirrors feature-investigation-llama-stack-client-kotlin.md):

| Size | Calendar effort (1 engineer) | Description |
| --- | --- | --- |
| XS | < 0.5 day | Trivial Java-side change, no JNI |
| S | 0.5 – 2 days | Java surface + minor JNI/JSON wiring |
| M | 2 – 5 days | New JNI methods, native plumbing, tests |
| L | 1 – 2 weeks | New native subsystem or large API surface |

1. What this project already covers

The following are confirmed present in java-llama.cpp as of this survey — flagged so we do not re-investigate them:

| Capability | Status |
| --- | --- |
| `setOffline(boolean)` (was `setSkipDownload`) + typed `ModelUnavailableException` | ✅ (commit `37754d4`) |
| Reasoning-format toggle, reasoning-budget tokens | ✅ (`InferenceParameters#setReasoningFormat` etc.) |
| Tool calls + custom chat templates | ✅ |
| Speculative draft model | ✅ |
| Multimodal vision (mmproj) | ✅ |
| Infill (fill-in-the-middle) | ✅ |
| Streaming via `LlamaIterator` / Reactive Streams `Publisher` | ✅ |
| `CompletableFuture` async + `CancellationToken` | ✅ |
| `LoadProgressCallback` model-load progress | ✅ |

2. Cross-cutting themes — universal across the 5 `mukel` projects

These ideas appear in every (or nearly every) mukel runtime; portability across reasoning-model families makes them the highest-leverage items.

2.1 Streaming UTF-8 decoder for multi-byte boundary safety (S, medium-high priority)

Sources: qwen35.java (StreamingDecoder, L2929–2987), nemotron3.java, gemma4.java.

GGUF byte-fallback tokenisation can split a single Unicode codepoint across two consecutive token pieces. LlamaIterator callers today can receive a LlamaOutput.text value containing a partial UTF-8 sequence and either render mojibake (CJK, emoji) or hand-roll their own buffering. The mukel runtimes wrap the token stream in a small decoder that holds back trailing bytes until a complete codepoint is available, then flushes.

- Why: silent correctness bug for non-ASCII users; ~50-LOC fix.
- Shape: `Utf8BoundaryStreamingDecoder` helper in the Java layer (no JNI change); optional `setUtf8BoundarySafe(true)` opt-in on `InferenceParameters`, or always-on inside `LlamaIterator`.
- Test: use any of the existing CJK / emoji prompts; assert no partial codepoint ever crosses the iterator boundary.
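
The hold-back behaviour described above can be sketched in plain Java. This is a minimal sketch rather than the mukel implementation: it assumes the token pieces are available to the Java layer as raw bytes, and the `feed` / `finish` method names are hypothetical. It relies on `CharsetDecoder.decode(in, out, false)`, which leaves an incomplete trailing sequence unconsumed in the input buffer.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: holds back incomplete trailing UTF-8 bytes between token pieces. */
final class Utf8BoundaryStreamingDecoder {
    private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE);

    // Bytes of a codepoint whose remaining bytes have not arrived yet.
    private byte[] pending = new byte[0];

    /** Decodes one token piece and returns only the complete codepoints seen so far. */
    String feed(byte[] piece) {
        ByteBuffer in = ByteBuffer.allocate(pending.length + piece.length);
        in.put(pending).put(piece).flip();
        // UTF-8 never produces more chars than input bytes, so this capacity is sufficient.
        CharBuffer out = CharBuffer.allocate(in.remaining() + 1);
        // endOfInput=false: an incomplete trailing sequence stays unread in `in`.
        decoder.decode(in, out, false);
        pending = new byte[in.remaining()];
        in.get(pending);
        out.flip();
        return out.toString();
    }

    /** Flushes at end of stream; a dangling partial sequence becomes U+FFFD. */
    String finish() {
        ByteBuffer in = ByteBuffer.wrap(pending);
        CharBuffer out = CharBuffer.allocate(pending.length + 1);
        decoder.decode(in, out, true);
        decoder.flush(out);
        decoder.reset();
        pending = new byte[0];
        out.flip();
        return out.toString();
    }
}
```

A caller would invoke `feed` once per token piece and `finish` once when generation ends. Whether this belongs behind an opt-in flag or always-on inside `LlamaIterator` is the open design question noted under Shape.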

2.2 Tri-state thinking-channel router for reasoning models (S, medium priority)

Sources: gemma4.java, gptoss.java (Harmony channels), qwen35.java, nemotron3.java.

The runtimes expose a `--think off|on|inline` flag with three semantics: `off` strips reasoning tokens from the visible stream (and from chat history); `on` (the default) routes them to a separate sink (e.g. stderr in CLI examples); `inline` interleaves them in the main output. This pairs cleanly with this project's existing `setReasoningFormat` / `setReasoningBudgetTokens`.

- Why: every reasoning model in this project's test matrix (Qwen3-0.6B, plus any GPT-OSS / Gemma / Nemotron load) exposes thought tokens, but operators currently hand-roll the routing.
- Shape: helper class `ThinkingChannelRouter` (or analogous) that consumes a `LlamaIterator` and produces two streams (visible / reasoning), plus an enum knob on `InferenceParameters`.
- gptoss specifically: needs a Harmony-channel state machine that recognises `<|start|>`, `<|channel|>`, `<|message|>`, `<|end|>` and exposes the `analysis` / `commentary` / `final` channels separately. Worth shipping as a separate `HarmonyChannelDecoder` if GPT-OSS users materialise. (M for the Harmony variant; S for the generic `<think>` variant.)
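
The generic `<think>` variant can be sketched as a small chunk-oriented splitter. This is an illustrative sketch under stated assumptions: the `ThinkMode` enum, the `accept` / `finish` methods, and the choice to drop the tags themselves in `INLINE` mode are all hypothetical, and the input is assumed to arrive as `String` chunks rather than through the real `LlamaIterator` type. The one subtle part is that a tag may be split across chunks, so any suffix that could still become a tag is held back.

```java
import java.util.function.Consumer;

/** Hypothetical tri-state knob mirroring --think off|on|inline. */
enum ThinkMode { OFF, ON, INLINE }

/** Hypothetical router: splits streamed text into visible and reasoning sinks. */
final class ThinkingChannelRouter {
    private static final String OPEN = "<think>";
    private static final String CLOSE = "</think>";

    private final ThinkMode mode;
    private final Consumer<String> visible;
    private final Consumer<String> reasoning;
    private final StringBuilder buf = new StringBuilder();
    private boolean inThink;

    ThinkingChannelRouter(ThinkMode mode, Consumer<String> visible, Consumer<String> reasoning) {
        this.mode = mode;
        this.visible = visible;
        this.reasoning = reasoning;
    }

    /** Feeds one streamed chunk; tags may be split across chunk boundaries. */
    void accept(String chunk) {
        buf.append(chunk);
        while (true) {
            String tag = inThink ? CLOSE : OPEN;
            int at = buf.indexOf(tag);
            if (at >= 0) {
                emit(buf.substring(0, at));      // text before the tag belongs to the current state
                buf.delete(0, at + tag.length());
                inThink = !inThink;
                continue;
            }
            int keep = partialTagSuffix(buf, tag); // hold back a possible tag prefix
            emit(buf.substring(0, buf.length() - keep));
            buf.delete(0, buf.length() - keep);
            return;
        }
    }

    /** Flushes whatever is still held back at end of stream. */
    void finish() {
        emit(buf.toString());
        buf.setLength(0);
    }

    private void emit(String text) {
        if (text.isEmpty()) return;
        if (!inThink) visible.accept(text);
        else if (mode == ThinkMode.ON) reasoning.accept(text);
        else if (mode == ThinkMode.INLINE) visible.accept(text);
        // ThinkMode.OFF: reasoning text is dropped.
    }

    /** Length of the longest suffix of s that is a proper prefix of tag. */
    private static int partialTagSuffix(CharSequence s, String tag) {
        int max = Math.min(s.length(), tag.length() - 1);
        for (int n = max; n > 0; n--) {
            if (tag.startsWith(s.subSequence(s.length() - n, s.length()).toString())) return n;
        }
        return 0;
    }
}
```

In a CLI example the two sinks would typically be `System.out::print` and `System.err::print`. The Harmony variant needs a richer state machine and is not covered by this sketch.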

2.3 Interactive chat REPL with slash commands (XS, low-medium priority)

Sources: llama3.java, gemma4.java, gptoss.java, qwen35.java, nemotron3.java.

The REPLs accept `/quit`, `/exit`, and `/context`; the last prints used / max / remaining tokens for the current chat session. Users currently have to Ctrl-C out of ChatExample.

- Shape: a `ChatRepl` example under `src/test/java/examples/`. No new production API surface; it composes existing `LlamaModel` calls.
- Effort: 1 new file, ~150 LOC.
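
The command handling at the core of such a REPL is small. The following is a sketch only: the `SlashCommands` class, its `Action` enum, the output wording, and the handling of unknown commands are assumptions for illustration, and the token counts are passed in as plain integers rather than read from any real `LlamaModel` call.

```java
import java.io.PrintStream;
import java.util.Locale;

/** Hypothetical slash-command dispatcher for a ChatRepl example. */
final class SlashCommands {
    enum Action { QUIT, HANDLED, SEND_TO_MODEL }

    /**
     * Decides what the REPL loop should do with one line of user input.
     * usedTokens / maxTokens describe the current chat session's context window.
     */
    static Action dispatch(String line, int usedTokens, int maxTokens, PrintStream out) {
        String cmd = line.trim().toLowerCase(Locale.ROOT);
        switch (cmd) {
            case "/quit":
            case "/exit":
                return Action.QUIT;
            case "/context":
                out.println("context: " + usedTokens + " used / " + maxTokens + " max / "
                        + (maxTokens - usedTokens) + " remaining");
                return Action.HANDLED;
            default:
                if (cmd.startsWith("/")) {
                    // Assumption: unknown commands are reported, not sent to the model.
                    out.println("unknown command: " + cmd);
                    return Action.HANDLED;
                }
                return Action.SEND_TO_MODEL;
        }
    }
}
```

The surrounding loop would read a line, call `dispatch`, and only forward the text to the model when the result is `SEND_TO_MODEL`.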

2.4 ANSI colour auto-detection honouring `NO_COLOR` + `TERM=dumb` (XS, low priority)

Sources: gemma4.java, gptoss.java, qwen35.java, nemotron3.java.

Tri-state --color on|off|auto helper that honours the NO_COLOR informal standard, detects TERM=dumb, and falls back to no-colour when System.console() is null. ~15 LOC; useful in every example CLI that prints reasoning tokens or perf summaries in a different style.
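
The detection rules above fit in one method. This is a minimal sketch with hypothetical names (`ColorMode`, `AnsiColor.enabled`); the environment and console check are passed in so the logic can be tested without touching the real process state. Per the NO_COLOR convention, the variable only counts when it is present and non-empty.

```java
import java.util.Map;

/** Hypothetical tri-state knob mirroring --color on|off|auto. */
enum ColorMode { ON, OFF, AUTO }

/** Hypothetical helper deciding whether example CLIs should emit ANSI escapes. */
final class AnsiColor {
    /** Testable core: explicit environment and console flag. */
    static boolean enabled(ColorMode mode, Map<String, String> env, boolean hasConsole) {
        if (mode == ColorMode.ON) return true;
        if (mode == ColorMode.OFF) return false;
        // AUTO: NO_COLOR disables colour when set to any non-empty value.
        String noColor = env.get("NO_COLOR");
        if (noColor != null && !noColor.isEmpty()) return false;
        if ("dumb".equals(env.get("TERM"))) return false;
        // No console usually means output is piped or redirected.
        return hasConsole;
    }

    /** Convenience overload reading the real process environment. */
    static boolean enabled(ColorMode mode) {
        return enabled(mode, System.getenv(), System.console() != null);
    }
}
```

On JDK 22 and later, `Console.isTerminal()` may be a more precise signal than a null check, since `System.console()` can return a non-null value for redirected streams there.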

2.5 Operator-grade timing line on stderr (XS, medium priority)

Sources: qwen35.java, nemotron3.java.

After every generation, print a one-line summary to stderr of the form `prompt: X tok/s (P tokens) | generation: Y tok/s (G tokens) | context: U/M`. `LlamaModel.getTimings()` already has all the inputs; no example formats them.
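
A formatter for that line could look like the sketch below. The `TimingLine` class is hypothetical, and because the fields returned by `getTimings()` are not listed in this document, the sketch takes plain token counts and millisecond durations as parameters instead of assuming accessor names.

```java
import java.util.Locale;

/** Hypothetical formatter for the per-run timing summary. */
final class TimingLine {
    /** Builds the one-line summary from raw counts and durations in milliseconds. */
    static String format(int promptTokens, double promptMs,
                         int genTokens, double genMs,
                         int contextUsed, int contextMax) {
        return String.format(Locale.ROOT,
                "prompt: %.1f tok/s (%d tokens) | generation: %.1f tok/s (%d tokens) | context: %d/%d",
                rate(promptTokens, promptMs), promptTokens,
                rate(genTokens, genMs), genTokens,
                contextUsed, contextMax);
    }

    /** Tokens per second; returns 0 when no time was recorded to avoid dividing by zero. */
    private static double rate(int tokens, double ms) {
        return ms > 0 ? tokens * 1000.0 / ms : 0.0;
    }
}
```

An example would print the result with `System.err.println(...)` so the summary stays out of the generated text on stdout.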

2.6 `AutoCloseable Timer.log("label")` idiom (XS, low priority)

Sources: gemma4.java (Timer class, L320–333), qwen35.java.

`try (var t = Timer.log("Load tensors")) { ... }` prints `Load tensors: 312 ms` to stderr on close. It is a 12-line helper. The project already times model load, JNI init, and first-token latency in ad-hoc places; one helper would unify them. It is also friendly to LogCaptor (already wired in tests).
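
A sketch of such a helper follows. It reproduces the idiom described above rather than the gemma4.java source; writing directly to `System.err` is an assumption, and a version for this project might route through its logger instead.

```java
/** Sketch of an AutoCloseable timer that reports elapsed time when the try block exits. */
final class Timer implements AutoCloseable {
    private final String label;
    private final long startNanos = System.nanoTime();

    private Timer(String label) {
        this.label = label;
    }

    /** Starts timing; pair with try-with-resources so close() always runs. */
    static Timer log(String label) {
        return new Timer(label);
    }

    /** Prints "<label>: <elapsed> ms" to stderr. */
    @Override
    public void close() {
        long elapsedMs = (System.nanoTime() - startNanos) / 1_000_000;
        System.err.println(label + ": " + elapsedMs + " ms");
    }
}
```

Because `close()` declares no checked exception, the `try` block needs no `catch` clause.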

2.7 `jbang`-runnable single-file example (XS, medium priority)

Sources: all 5 mukel runtimes.

Ship a self-contained `Example.java` with the `///usr/bin/env jbang` shebang and `//DEPS net.ladenthin:llama:5.0.0`. Lowers the "try it once" barrier from `mvn dependency:get` + classpath wrangling to one curl-and-run line. Pairs naturally with publishing on Maven Central.

2.8 Documented system-properties table in the README (XS, medium priority)

Sources: all mukel runtimes (each documents its own -D… knobs alongside --flag parameters).

Currently the LlamaSystemProperties setters (net.ladenthin.llama.lib.path, .tmpdir, .osinfo.architecture, .test.ngl, the per-test .vision.* and .nomic.path properties) are scattered across CLAUDE.md, source javadoc, and test setup. A single README table listing every supported property + default + meaning improves discoverability.

3. Per-repo unique ideas

3.1 `llama3.java`

- `--echo` debug mode (XS, low): dump every token to stderr separately from `--stream`. Useful for teaching / first-time-user debugging.
- `-Dllama.VectorBitSize=0|128|256|512` (XS, low): runtime knob to pin SIMD width / benchmark when multiple ISA variants are co-located. Equivalent for this project: a system property selecting the GGML CPU backend variant when multiple are on the library path.

3.2 `gemma4.java`

- README note about `llama-quantize --pure` (XS, low): mixed-quant GGUF files (e.g. Q4_0 with embedded F16 tensors) cause subtle issues that users discover only by trawling the upstream issue tracker. Surface the workaround in the troubleshooting section.

3.3 `gptoss.java`

- `Reasoning: low|medium|high` system-message injection (S, high if GPT-OSS users present): add `InferenceParameters.setReasoningEffort(LOW|MEDIUM|HIGH)` that synthesises the Harmony `Reasoning: X` line. Encodes a contract operators currently discover only by reading the Jinja template.
- See also the Harmony channel decoder under §2.2.
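
The synthesis step itself is tiny. The sketch below is hypothetical throughout: `ReasoningEffort`, `toSystemLine`, and `HarmonySystemMessage.withEffort` are illustrative names, and appending the line after a blank line at the end of the system message is an assumption, not a verified placement within the Harmony template.

```java
import java.util.Locale;

/** Hypothetical enum backing a setReasoningEffort(...) option. */
enum ReasoningEffort {
    LOW, MEDIUM, HIGH;

    /** Renders the line the Harmony system message is expected to carry. */
    String toSystemLine() {
        return "Reasoning: " + name().toLowerCase(Locale.ROOT);
    }
}

/** Hypothetical helper that adds the effort line to a system message. */
final class HarmonySystemMessage {
    /** Assumption: the line goes at the end, separated by a blank line. */
    static String withEffort(String systemMessage, ReasoningEffort effort) {
        String line = effort.toSystemLine();
        if (systemMessage == null || systemMessage.isEmpty()) {
            return line;
        }
        return systemMessage + "\n\n" + line;
    }
}
```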

3.4 `qwen35.java`

"Empty <think></think> injection" to disable thinking on Qwen models (S, medium) — prefill the assistant header with <think>\n\n</think>\n\n so the model produces only the visible answer with zero reasoning tokens, regardless of whether llama.cpp's reasoning_format understands the family. Complements existing setReasoningFormat / setReasoningBudgetTokens. Should land as a ChatRequest option or a thin Qwen-aware preset.

3.5 `nemotron3.java`

All unique-value findings overlap with §2 themes; no Nemotron-specific item warranted its own row beyond what §2.1 / §2.2 already cover.

3.6 `llamacpp4j`

llamacpp4j is dormant (single commit, July 2023, pre-GGUF era) and its design is largely uninteresting (SWIG-generated facade with opaque SWIGTYPE_p_* pointers leaking through). The useful ideas come from the underlying llama.h API surface that SWIG happens to expose, not from anything Sebicom designed:

- `llama_state_*` save/load API (M, medium): llamacpp4j exposes the older names `llama_copy_state_data`, `llama_set_state_data`, `llama_save_session_file` / `llama_load_session_file`; current `llama.h` provides the same capability as `llama_state_get_data`, `llama_state_set_data`, `llama_state_save_file` / `llama_state_load_file`. Useful for prompt-warm-start, multi-tenant resumption, and benchmarking. `ModelParameters` doesn't surface KV-cache snapshotting as a first-class Java API.
- `llama_apply_lora_*` hot-apply at runtime (M, medium): adapter hot-swap without reloading the base model (a common multi-tenant pattern). Use the modern `llama_adapter_lora_*` API, not the deprecated file-based one Sebicom exposes.
- `llama_model_quantize` exposure (S, low): one-line wrapper that converts FP16 → Q4/Q5/Q8 GGUF in-process. Lets Java apps build a "download FP16 → quantize for this device" path without shelling out.
- `llama_print_system_info()` wrapper (XS, low): trivial diagnostic that prints `AVX = 1 | AVX2 = 1 | …`. Useful for bug reports.

Explicitly skip from llamacpp4j: the SWIG-generated facade itself (brittle, opaque pointer types leak), the mainn(argv) shortcut that forwards to llama.cpp's reference CLI, the single-OS prebuilt .so checked into git, the README-documented "install JAR into local Maven repo" workflow. java-llama.cpp's JSON-over-JNI + classifier-based packaging is strictly better.

4. Explicitly out of scope

Recurring "don't port" themes across all 6 sources:

- Pure-Java tensor kernels / GGUF parser / quantization classes: redundant with llama.cpp; the entire raison d'être of this project is to delegate these to the upstream C++.
- GraalVM Native Image AOT model preloading: already captured as its own design-investigation TODO in CLAUDE.md; not duplicated here.
- Reimplementations of samplers (`ToppSampler`, `CategoricalSampler`): llama.cpp's sampler chain already covers TOP_P, TYP_P, MIN_P, XTC, DRY, etc.
- Single-file jbang distribution of the whole library: wrong shape for a JNI library that ships per-OS classifier JARs. (A single-file jbang example per §2.7 is fine; the library itself stays multi-module.)
- Hard-coded per-model chat-template token strings (e.g. Gemma's `<|turn>` / `<|think|>`): llama.cpp's chat-template engine handles these generically.

5. Prioritised backlog (top picks across all 6 sources)

Sorted by priority × (1 / effort). Items in bold are the recommended first batch.

| # | Item | Source(s) | Effort | Priority |
| --- | --- | --- | --- | --- |
| 1 | **UTF-8 boundary-safe streaming decoder** | §2.1 | S | medium-high |
| 2 | **Tri-state thinking-channel router (generic `<think>`)** | §2.2 | S | medium |
| 3 | **Operator-grade per-run timing line on stderr** | §2.5 | XS | medium |
| 4 | **`jbang`-runnable single-file example** | §2.7 | XS | medium |
| 5 | **System-properties table in README** | §2.8 | XS | medium |
| 6 | Empty `<think></think>` injection (Qwen) | §3.4 | S | medium |
| 7 | `llama_state_*` save/load Java API | §3.6 | M | medium |
| 8 | `llama_adapter_lora_*` hot-apply API | §3.6 | M | medium |
| 9 | Chat REPL with `/quit /exit /context` | §2.3 | XS | low-medium |
| 10 | Harmony channel decoder for GPT-OSS | §2.2 | M | conditional (ship when GPT-OSS users ask) |
| 11 | `Reasoning: X` system-message injection | §3.3 | S | conditional |
| 12 | ANSI colour auto-detection helper | §2.4 | XS | low |
| 13 | `AutoCloseable Timer.log()` idiom | §2.6 | XS | low |
| 14 | `llama_print_system_info()` wrapper | §3.6 | XS | low |
| 15 | `llama_model_quantize` Java surface | §3.6 | S | low |
| 16 | README note on `llama-quantize --pure` | §3.2 | XS | low |
| 17 | `--echo` debug knob in example | §3.1 | XS | low |
| 18 | `-Dllama.VectorBitSize`-style ISA knob | §3.1 | XS | low |

Items 1–5 are the recommended first batch — none requires JNI changes and each closes a documented operator pain point.

6. Recommended next action

Implement items 1, 3, 4, and 5 of the first batch in one focused "operator-facing ergonomics" commit:

- UTF-8 boundary-safe streaming decoder (genuine correctness fix)
- Per-run timing line (cheap operator signal)
- One jbang-runnable example file
- README system-properties table

Estimated total: ~1–2 days of work, zero JNI changes, all backed by Java-only tests. Item 2 (the remaining first-batch item) and items 6–8 are good follow-ups once a real user asks.
