7 changes: 6 additions & 1 deletion .github/workflows/publish.yml
@@ -818,7 +818,12 @@ jobs:
           distribution: 'temurin'
           java-version: ${{ env.JAVA_VERSION }}
       - name: Build JARs
-        run: mvn --batch-mode --no-transfer-progress -P release,cuda,opencl-android -Dmaven.test.skip=true -Dgpg.skip=true package
+        # `assembly` additionally produces the fat jar-with-dependencies uber JAR
+        # (llama-<version>-jar-with-dependencies.jar: library classes + Java runtime deps +
+        # default-platform native libs in one drop-on-classpath JAR, runnable via its
+        # LlamaServer Main-Class). It lands in target/ and is uploaded in the `llama-jars`
+        # artifact below - a CI run artifact only, not a Maven Central / GitHub-Release asset.
+        run: mvn --batch-mode --no-transfer-progress -P release,cuda,opencl-android,assembly -Dmaven.test.skip=true -Dgpg.skip=true package
       - name: Upload JARs
         uses: actions/upload-artifact@v7
         with:
2 changes: 2 additions & 0 deletions CLAUDE.md
@@ -229,6 +229,7 @@ For the full record of upstream API breaks across version ranges (b5022 &#x2192;
mvn compile # Compiles Java and generates JNI headers
mvn test # Run all tests (requires native library and model files)
mvn package # Build JAR
mvn -P assembly package # Also build the fat jar-with-dependencies uber JAR (library + Java deps + native libs); CI builds it and uploads it in the `llama-jars` artifact
mvn test -Dtest=LlamaModelTest#testGenerate # Run a single test method
```

@@ -452,6 +453,7 @@ If the local check passes (`BUILD SUCCESS`), the `mvn package` job in
- `LlamaIterator` / `LlamaIterable` — Streaming generation via Java `Iterator`/`Iterable`.
- `LlamaLoader` — Extracts the platform-specific native library from the JAR to a temp directory, or finds it on `java.library.path`.
- `OSInfo` — Detects OS and architecture for library resolution.
- `server.LlamaServer` — Optional OpenAI-compatible HTTP server and the fat-jar `Main-Class`. `LlamaServerArgs` parses the CLI; `OaiRouter` / `OaiHttpServer` (NanoHTTPD) map `POST /v1/chat/completions`, `/v1/completions`, `/v1/embeddings` and `GET /v1/models` to the `LlamaModel.handle*` methods. NanoHTTPD is an `<optional>` dependency (bundled only in the fat jar, not inherited by library consumers). The `server` package is a dedicated top layer in the ArchUnit `layeredArchitecture` rule (the only layer allowed to access the root `Api`). See README "OpenAI-compatible HTTP server".

**Native layer** (`src/main/cpp/`):
- `jllama.cpp` — JNI implementation bridging Java calls to llama.cpp. ~1,215 lines; 17 native methods.
32 changes: 32 additions & 0 deletions README.md
@@ -97,6 +97,7 @@ Inference of Meta's LLaMA model (and others) in pure C/C++.
- **Infilling** (fill-in-the-middle) for code models.
- **Tokenize / detokenize** and **JSON-schema → grammar** conversion.
- **Raw JSON endpoint handlers** mirroring the upstream llama.cpp HTTP server (`/completions`, `/v1/completions`, `/embeddings`, `/infill`, `/tokenize`, `/detokenize`).
- **Runnable OpenAI-compatible HTTP server** (`LlamaServer`, the fat-jar `Main-Class`): `java -jar …-jar-with-dependencies.jar --model model.gguf --port 8080`.
- **Model metadata** access (`getModelMeta()`) and **server management** (metrics, slot save/restore, runtime thread reconfiguration).
- Pre-built native binaries for Linux (x86-64, aarch64), macOS (x86-64, arm64), and Windows (x86-64, x86); CUDA, Metal, and Vulkan supported via local build.

@@ -396,6 +397,37 @@ a JSON response, matching the HTTP server's contract:
Server state is exposed via `getMetrics()`, `eraseSlot(int)`, `saveSlot(int, String)`,
`restoreSlot(int, String)`, and `getModelMeta()`.

### OpenAI-compatible HTTP server

The fat jar built by the `assembly` profile (`mvn -P assembly package`) is runnable: its
`Main-Class` is `net.ladenthin.llama.server.LlamaServer`, a small [NanoHTTPD](https://github.com/NanoHttpd/nanohttpd)
server that loads a GGUF model in-process and serves OpenAI-compatible endpoints by forwarding each
request body to the matching `LlamaModel.handle*` method:

```bash
java -jar target/llama-<version>-jar-with-dependencies.jar \
--model models/Qwen3-0.6B-Q4_K_M.gguf --host 0.0.0.0 --port 8080 --n-gpu-layers 99
```

| Method &amp; path | Backed by |
|---|---|
| `POST /v1/chat/completions` | `LlamaModel.handleChatCompletions` |
| `POST /v1/completions` | `LlamaModel.handleCompletionsOai` |
| `POST /v1/embeddings` (requires `--embedding`) | `LlamaModel.handleEmbeddings` |
| `GET /v1/models` | the configured model alias |
| `GET /health` | static `{"status":"ok"}` |

```bash
curl http://localhost:8080/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"messages":[{"role":"user","content":"Hello!"}]}'
```

Run with `--help` for all options (`--ctx-size`, `--threads`, `--model-alias`, …). Responses are
non-streaming (the full JSON result is returned per request). The NanoHTTPD dependency is declared
`<optional>`, so it is bundled in the fat jar but **not** inherited by projects that use this
library as a Maven dependency; running the server requires the fat jar (or adding NanoHTTPD yourself).
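
The `curl` call above can also be issued from plain Java 8. The following is a minimal sketch, not part of this library: the class `ChatClientSketch` and its helpers `jsonEscape`, `chatRequest` and `post` are hypothetical names, it uses only the JDK's `HttpURLConnection`, and it assumes the server is reachable at the URL passed as the first argument (for example `http://localhost:8080/v1/chat/completions`).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration; not shipped with the library.
final class ChatClientSketch {

    /** Escapes the characters that JSON requires to be escaped inside a string. */
    public static String jsonEscape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    /** Builds the same single-message request body as the curl example. */
    public static String chatRequest(String userMessage) {
        return "{\"messages\":[{\"role\":\"user\",\"content\":\""
                + jsonEscape(userMessage) + "\"}]}";
    }

    /** POSTs a JSON body and returns the response body (success or error) as text. */
    public static String post(String url, String body) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        InputStream in = conn.getResponseCode() < 400
                ? conn.getInputStream() : conn.getErrorStream();
        if (in == null) {
            return "";
        }
        try (InputStream stream = in) {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = stream.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return new String(buf.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        String body = chatRequest("Hello!");
        System.out.println(body);
        // Only contact a server when an endpoint URL is supplied.
        if (args.length > 0) {
            System.out.println(post(args[0], body));
        }
    }
}
```

Because responses are non-streaming, a single blocking read of the response body is enough here.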

### Model/Inference Configuration

There are two sets of parameters you can configure, `ModelParameters` and `InferenceParameters`. Both provide builder
34 changes: 34 additions & 0 deletions TODO.md
@@ -55,6 +55,40 @@ These are JNI plumbing items for upstream API additions. Policy: add only after

**Out of scope until evidence supports it**: actually implementing any of the above. This entry exists so that when someone asks "can I ship java-llama.cpp as a single 30 MB binary?" the answer points to a concrete investigation plan rather than restarting from zero.

### OpenAI-compatible server: token streaming (SSE) + Java-8 HTTP layer

The `net.ladenthin.llama.server.LlamaServer` MVP is **non-streaming**: every request calls
the blocking `LlamaModel.handle*` method and returns the full JSON response in one shot. A
client that sends `"stream": true` still receives a single response, not the incremental
`text/event-stream` (SSE) `data: {chunk}\n\n` events the OpenAI API emits for streaming
chat/completions. This is the main functional gap of the server today.

The token source already exists — `LlamaModel.generateChat(InferenceParameters)` /
`generate(...)` yield tokens incrementally through a Java `Iterator` (`LlamaIterable`). What
is missing is an HTTP layer that emits SSE.

**Find a Java-8-compatible HTTP layer with good SSE support (alternative to Javalin), or
implement SSE on NanoHTTPD.** Javalin has first-class SSE support (an `sse(...)` handler API)
but is **not usable here**: Javalin 5 and later require Java 11 or newer, while this repo
targets Java 8; Javalin 4 (the last Java-8 release) is EOL. Options, in rough order of
preference:
- **Implement SSE on the existing NanoHTTPD** via `NanoHTTPD.newChunkedResponse(status,
"text/event-stream", InputStream)`, bridging a `LlamaIterable` to an `InputStream` that
writes `data: {chunk}\n\n` frames. No new dependency, stays Java-8 clean; likely the right
answer. Cost: the iterator→SSE bridge plus closing the `LlamaIterable` on client
disconnect.
- **Undertow** — Java-8 compatible, has a server-sent-events handler, but a heavier
dependency tree.
- **Spark Java** (Jetty 9) — Java-8 compatible; SSE support is limited/manual.
- Avoid: Javalin 5 and later (Java 11 or newer), Javalin 4 (EOL), and the JDK
`com.sun.net.httpserver` (ArchUnit-banned `com.sun..`).

Scope when implemented: honour `"stream": true` on `POST /v1/chat/completions` and
`POST /v1/completions`, emit OpenAI-style SSE chunks terminated by `data: [DONE]`, close the
underlying `LlamaIterable` on disconnect, and keep the non-streaming path as the default. Add
a model-free routing test plus a real-socket SSE integration test (mirroring
`OaiHttpServerIntegrationTest`).
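
The iterator-to-SSE bridge for the NanoHTTPD option could look roughly like the sketch below. It is an illustration under stated assumptions, not the planned implementation: `SseIteratorInputStream`, `drain` and the `onClose` callback are hypothetical names, each iterator element is assumed to be an already-serialised single-line JSON chunk, and the HTTP layer is assumed to close the stream when the response ends or the client disconnects.

```java
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical bridge: pulls chunks lazily from an iterator and exposes them as SSE frames.
final class SseIteratorInputStream extends InputStream {
    private final Iterator<String> chunks; // assumed: one single-line JSON chunk per element
    private final Runnable onClose;        // e.g. closes the underlying token source
    private byte[] frame = new byte[0];
    private int pos = 0;
    private boolean doneSent = false;
    private boolean closed = false;

    public SseIteratorInputStream(Iterator<String> chunks, Runnable onClose) {
        this.chunks = chunks;
        this.onClose = onClose;
    }

    /** Loads the next frame when the current one is consumed; false once [DONE] was read. */
    private boolean fill() {
        while (pos >= frame.length) {
            if (chunks.hasNext()) {
                frame = ("data: " + chunks.next() + "\n\n").getBytes(StandardCharsets.UTF_8);
            } else if (!doneSent) {
                frame = "data: [DONE]\n\n".getBytes(StandardCharsets.UTF_8);
                doneSent = true;
            } else {
                return false;
            }
            pos = 0;
        }
        return true;
    }

    @Override
    public int read() {
        return fill() ? (frame[pos++] & 0xFF) : -1;
    }

    @Override
    public int read(byte[] b, int off, int len) {
        if (len == 0) {
            return 0;
        }
        if (!fill()) {
            return -1;
        }
        // Return at most one frame per call so a frame is not held back waiting for the next token.
        int n = Math.min(len, frame.length - pos);
        System.arraycopy(frame, pos, b, off, n);
        pos += n;
        return n;
    }

    @Override
    public void close() {
        if (!closed) {
            closed = true;
            onClose.run();
        }
    }

    /** Test/demo helper: reads the stream to the end and returns it as text. */
    public static String drain(SseIteratorInputStream in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf, 0, buf.length)) != -1) {
            out.write(buf, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        SseIteratorInputStream in = new SseIteratorInputStream(
                Arrays.asList("{\"delta\":\"Hel\"}", "{\"delta\":\"lo\"}").iterator(),
                () -> System.out.println("closed"));
        System.out.print(drain(in));
        in.close();
    }
}
```

In the server, such a stream would be handed to `NanoHTTPD.newChunkedResponse(status, "text/event-stream", stream)`, with `onClose` releasing the `LlamaIterable`. Chunks containing raw newlines would break the `data:` framing, so the real bridge would need to guarantee single-line JSON.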

## Open — cross-cutting (slice for this repo)

- **jqwik pin policy** — see [`../workspace/policies/jqwik-prompt-injection.md`](../workspace/policies/jqwik-prompt-injection.md). `jqwik.version ≤ 1.9.3` is mandatory.
59 changes: 59 additions & 0 deletions pom.xml
@@ -58,6 +58,7 @@ SPDX-License-Identifier: MIT
<checker.version>4.2.0</checker.version>
<jackson.version>2.22.0</jackson.version>
<reactor.version>3.8.6</reactor.version>
<nanohttpd.version>2.3.1</nanohttpd.version>
<slf4j.version>2.0.18</slf4j.version>
<logback.version>1.5.34</logback.version>
<animal-sniffer.version>1.27</animal-sniffer.version>
@@ -148,6 +149,20 @@ SPDX-License-Identifier: MIT
<artifactId>jackson-databind</artifactId>
<version>${jackson.version}</version>
</dependency>
<!--
Embedded HTTP server for the optional OpenAI-compatible server entry point
(net.ladenthin.llama.server.LlamaServer, the fat-jar Main-Class). Declared
<optional> so library consumers do NOT inherit it on their classpath; the
assembly (jar-with-dependencies) profile still bundles it so the fat jar can
run the server. Pure Java, zero transitive deps (Java-8 clean), so it does
not perturb the enforcer dependencyConvergence rule.
-->
<dependency>
<groupId>org.nanohttpd</groupId>
<artifactId>nanohttpd</artifactId>
<version>${nanohttpd.version}</version>
<optional>true</optional>
</dependency>
<!-- Required by OSInfo (vendored from xerial/sqlite-jdbc) for log emission. -->
<dependency>
<groupId>org.slf4j</groupId>
@@ -259,6 +274,11 @@ SPDX-License-Identifier: MIT
<artifactId>git-commit-id-maven-plugin</artifactId>
<version>10.0.0</version>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-assembly-plugin</artifactId>
<version>3.8.0</version>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
@@ -968,5 +988,44 @@ SPDX-License-Identifier: MIT
</plugins>
</build>
</profile>
<profile>
<!--
Builds the fat jar-with-dependencies uber JAR: the library classes, the
default-platform native libs from src/main/resources, and all runtime Java
dependencies in one drop-on-classpath JAR, runnable via the LlamaServer
Main-Class (set below) to start the OpenAI-compatible HTTP server. Off by
default; the CI `package` job activates it so the uber JAR rides along in the
`llama-jars` upload-artifact bundle. Documented in CLAUDE.md "Build Commands"
as `mvn -P assembly package`.
-->
<id>assembly</id>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-assembly-plugin</artifactId>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
<archive>
<manifest>
<mainClass>net.ladenthin.llama.server.LlamaServer</mainClass>
</manifest>
</archive>
</configuration>
<executions>
<execution>
<id>build-fat-jar</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
</profile>
</profiles>
</project>
47 changes: 47 additions & 0 deletions spotbugs-exclude.xml
@@ -360,4 +360,51 @@ SPDX-License-Identifier: MIT
<Method name="requireNonNull"/>
</Match>

<!--
The OpenAI-compatible server (net.ladenthin.llama.server.*) is a CLI entry point:
the model path, host, port and alias all come from command-line arguments supplied
by whoever launches the process. findsecbugs flags Paths.get on the model path
(PATH_TRAVERSAL_IN) and the startup log lines that echo these values
(CRLF_INJECTION_LOGS) because they are non-literal, but the threat model is identical
to the LlamaLoader PATH_TRAVERSAL suppression above: an attacker who can set the
server's command line has already won, and there is no untrusted end-user input
reaching these paths or log statements. There is also no meaningful "allowed root"
to canonicalise the operator-chosen model path against.
-->
<Match>
<Class name="~net\.ladenthin\.llama\.server\..*"/>
<Or>
<Bug pattern="PATH_TRAVERSAL_IN"/>
<Bug pattern="CRLF_INJECTION_LOGS"/>
</Or>
</Match>

<!--
LlamaServerArgs.parse is a flat command-line flag dispatcher: a single switch over
the known flags, one case per option, read top to bottom. javac desugars a String
switch into a hashCode lookup plus an equals chain (two branches per case), which
fb-contrib's bytecode-level CC_CYCLOMATIC_COMPLEXITY counts as a very high score.
The source complexity is low and table-flat; extracting the cases into a dispatch
map would not make it clearer, so we accept the detector artifact here.
-->
<Match>
<Class name="net.ladenthin.llama.server.LlamaServerArgs"/>
<Bug pattern="CC_CYCLOMATIC_COMPLEXITY"/>
<Method name="parse"/>
</Match>

<!--
LlamaModelOaiBackend is a thin non-owning wrapper around a LlamaModel (the same
deliberate dependency-injection contract as Session above): the server owns the one
LlamaModel and its native context, and the backend holds the passed-in reference to
serve requests. The model must NOT be defensively copied, so storing the reference is
by design; spotbugs flags it as EI_EXPOSE_REP2 because the constructor stores an
externally-mutable object, which is true but intended.
-->
<Match>
<Class name="net.ladenthin.llama.server.LlamaModelOaiBackend"/>
<Bug pattern="EI_EXPOSE_REP2"/>
<Method name="&lt;init&gt;"/>
</Match>

</FindBugsFilter>
15 changes: 14 additions & 1 deletion src/main/java/net/ladenthin/llama/LlamaModel.java
@@ -835,7 +835,20 @@ public String restoreSlot(int slotId, String filepath) {

     native String handleSlotAction(int action, int slotId, @Nullable String filename);

-    native String handleChatCompletions(String params);
+    /**
+     * Run an OpenAI-compatible chat completion (mirrors the {@code /v1/chat/completions}
+     * endpoint). The request JSON must contain a {@code "messages"} array in the standard
+     * OpenAI chat format; the model's chat template is applied automatically. Returns the
+     * result in OAI format with a {@code "choices"} array. This is the raw JSON-in/JSON-out
+     * form used by {@link #chatComplete(net.ladenthin.llama.parameters.InferenceParameters)}
+     * and by the embedded OpenAI-compatible server
+     * ({@link net.ladenthin.llama.server.LlamaServer}); it is the chat counterpart of
+     * {@link #handleCompletionsOai(String)} and {@link #handleEmbeddings(String, boolean)}.
+     *
+     * @param params JSON string with OAI-compatible chat-completion parameters (incl. {@code "messages"})
+     * @return JSON response in OAI chat-completion format
+     */
+    public native String handleChatCompletions(String params);

     native int requestChatCompletion(String params);
 }
@@ -0,0 +1,67 @@
// SPDX-FileCopyrightText: 2026 Bernard Ladenthin <bernard.ladenthin@gmail.com>
//
// SPDX-License-Identifier: MIT

package net.ladenthin.llama.server;

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ArrayNode;
import com.fasterxml.jackson.databind.node.ObjectNode;
import lombok.ToString;
import net.ladenthin.llama.LlamaModel;

/**
 * {@link OaiBackend} backed by a loaded {@link LlamaModel}. Each operation forwards the raw request
 * JSON to the matching {@code LlamaModel.handle*} method, which already produces
 * OpenAI-compatible response JSON, so no per-field marshalling happens here.
 *
 * <p>The model is owned by the caller ({@link LlamaServer}); this class neither closes it nor holds
 * any other resource.</p>
 */
@ToString
public final class LlamaModelOaiBackend implements OaiBackend {

    private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();

    private final LlamaModel model;
    private final String modelId;

    /**
     * Create a backend over a loaded model.
     *
     * @param model the loaded model to serve requests with
     * @param modelId the identifier reported by {@link #listModels()} and echoed in responses
     */
    public LlamaModelOaiBackend(LlamaModel model, String modelId) {
        this.model = model;
        this.modelId = modelId;
    }

    @Override
    public String chatCompletions(String requestJson) {
        return model.handleChatCompletions(requestJson);
    }

    @Override
    public String completions(String requestJson) {
        return model.handleCompletionsOai(requestJson);
    }

    @Override
    public String embeddings(String requestJson) {
        return model.handleEmbeddings(requestJson, true);
    }

    @Override
    public String listModels() {
        final ObjectNode root = OBJECT_MAPPER.createObjectNode();
        root.put("object", "list");
        final ArrayNode data = root.putArray("data");
        final ObjectNode entry = data.addObject();
        entry.put("id", modelId);
        entry.put("object", "model");
        entry.put("owned_by", "llamacpp");
        // ObjectNode.toString() emits valid JSON without a checked exception.
        return root.toString();
    }
}