Commit 59abd79

Merge pull request #242 from bernardladenthin/claude/cool-curie-ym3acr
Add OpenAI-compatible HTTP server (LlamaServer)
2 parents b502a1c + e595af6 commit 59abd79

20 files changed

Lines changed: 1347 additions & 7 deletions

.github/workflows/publish.yml

Lines changed: 6 additions & 1 deletion
@@ -818,7 +818,12 @@ jobs:
           distribution: 'temurin'
           java-version: ${{ env.JAVA_VERSION }}
       - name: Build JARs
-        run: mvn --batch-mode --no-transfer-progress -P release,cuda,opencl-android -Dmaven.test.skip=true -Dgpg.skip=true package
+        # `assembly` additionally produces the fat jar-with-dependencies uber JAR
+        # (llama-<version>-jar-with-dependencies.jar: library classes + Java runtime deps +
+        # default-platform native libs in one drop-on-classpath JAR, runnable via its
+        # LlamaServer Main-Class). It lands in target/ and is uploaded in the `llama-jars`
+        # artifact below - a CI run artifact only, not a Maven Central / GitHub-Release asset.
+        run: mvn --batch-mode --no-transfer-progress -P release,cuda,opencl-android,assembly -Dmaven.test.skip=true -Dgpg.skip=true package
       - name: Upload JARs
         uses: actions/upload-artifact@v7
         with:

CLAUDE.md

Lines changed: 2 additions & 0 deletions
@@ -229,6 +229,7 @@ For the full record of upstream API breaks across version ranges (b5022 &#x2192;
 mvn compile # Compiles Java and generates JNI headers
 mvn test # Run all tests (requires native library and model files)
 mvn package # Build JAR
+mvn -P assembly package # Also build the fat jar-with-dependencies uber JAR (library + Java deps + native libs); CI builds it and uploads it in the `llama-jars` artifact
 mvn test -Dtest=LlamaModelTest#testGenerate # Run a single test method
 ```

@@ -452,6 +453,7 @@ If the local check passes (`BUILD SUCCESS`), the `mvn package` job in
 - `LlamaIterator` / `LlamaIterable` — Streaming generation via Java `Iterator`/`Iterable`.
 - `LlamaLoader` — Extracts the platform-specific native library from the JAR to a temp directory, or finds it on `java.library.path`.
 - `OSInfo` — Detects OS and architecture for library resolution.
+- `server.LlamaServer` — Optional OpenAI-compatible HTTP server and the fat-jar `Main-Class`. `LlamaServerArgs` parses the CLI; `OaiRouter` / `OaiHttpServer` (NanoHTTPD) map `POST /v1/chat/completions`, `/v1/completions`, `/v1/embeddings` and `GET /v1/models` to the `LlamaModel.handle*` methods. NanoHTTPD is an `<optional>` dependency (bundled only in the fat jar, not inherited by library consumers). The `server` package is a dedicated top layer in the ArchUnit `layeredArchitecture` rule (the only layer allowed to access the root `Api`). See README "OpenAI-compatible HTTP server".

 **Native layer** (`src/main/cpp/`):
 - `jllama.cpp` — JNI implementation bridging Java calls to llama.cpp. ~1,215 lines; 17 native methods.

README.md

Lines changed: 32 additions & 0 deletions
@@ -97,6 +97,7 @@ Inference of Meta's LLaMA model (and others) in pure C/C++.
 - **Infilling** (fill-in-the-middle) for code models.
 - **Tokenize / detokenize** and **JSON-schema → grammar** conversion.
 - **Raw JSON endpoint handlers** mirroring the upstream llama.cpp HTTP server (`/completions`, `/v1/completions`, `/embeddings`, `/infill`, `/tokenize`, `/detokenize`).
+- **Runnable OpenAI-compatible HTTP server** (`LlamaServer`, the fat-jar `Main-Class`): `java -jar …-jar-with-dependencies.jar --model model.gguf --port 8080`.
 - **Model metadata** access (`getModelMeta()`) and **server management** (metrics, slot save/restore, runtime thread reconfiguration).
 - Pre-built native binaries for Linux (x86-64, aarch64), macOS (x86-64, arm64), and Windows (x86-64, x86); CUDA, Metal, and Vulkan supported via local build.

@@ -396,6 +397,37 @@ a JSON response, matching the HTTP server's contract:
 Server state is exposed via `getMetrics()`, `eraseSlot(int)`, `saveSlot(int, String)`,
 `restoreSlot(int, String)`, and `getModelMeta()`.

+### OpenAI-compatible HTTP server
+
+The fat jar built by the `assembly` profile (`mvn -P assembly package`) is runnable: its
+`Main-Class` is `net.ladenthin.llama.server.LlamaServer`, a small [NanoHTTPD](https://github.com/NanoHttpd/nanohttpd)
+server that loads a GGUF model in-process and serves OpenAI-compatible endpoints by forwarding each
+request body to the matching `LlamaModel.handle*` method:
+
+```bash
+java -jar target/llama-<version>-jar-with-dependencies.jar \
+  --model models/Qwen3-0.6B-Q4_K_M.gguf --host 0.0.0.0 --port 8080 --n-gpu-layers 99
+```
+
+| Method &amp; path | Backed by |
+|---|---|
+| `POST /v1/chat/completions` | `LlamaModel.handleChatCompletions` |
+| `POST /v1/completions` | `LlamaModel.handleCompletionsOai` |
+| `POST /v1/embeddings` (requires `--embedding`) | `LlamaModel.handleEmbeddings` |
+| `GET /v1/models` | the configured model alias |
+| `GET /health` | static `{"status":"ok"}` |
+
+```bash
+curl http://localhost:8080/v1/chat/completions \
+  -H 'Content-Type: application/json' \
+  -d '{"messages":[{"role":"user","content":"Hello!"}]}'
+```
+
+Run with `--help` for all options (`--ctx-size`, `--threads`, `--model-alias`, …). Responses are
+non-streaming (the full JSON result is returned per request). The NanoHTTPD dependency is declared
+`<optional>`, so it is bundled in the fat jar but **not** inherited by projects that use this
+library as a Maven dependency; running the server requires the fat jar (or adding NanoHTTPD yourself).
+
 ### Model/Inference Configuration

 There are two sets of parameters you can configure, `ModelParameters` and `InferenceParameters`. Both provide builder

TODO.md

Lines changed: 34 additions & 0 deletions
@@ -55,6 +55,40 @@ These are JNI plumbing items for upstream API additions. Policy: add only after

 **Out of scope until evidence supports it**: actually implementing any of the above. This entry exists so that when someone asks "can I ship java-llama.cpp as a single 30 MB binary?" the answer points to a concrete investigation plan rather than restarting from zero.

+### OpenAI-compatible server: token streaming (SSE) + Java-8 HTTP layer
+
+The `net.ladenthin.llama.server.LlamaServer` MVP is **non-streaming**: every request calls
+the blocking `LlamaModel.handle*` method and returns the full JSON response in one shot. A
+client that sends `"stream": true` still receives a single response, not the incremental
+`text/event-stream` (SSE) `data: {chunk}\n\n` events the OpenAI API emits for streaming
+chat/completions. This is the main functional gap of the server today.
+
+The token source already exists — `LlamaModel.generateChat(InferenceParameters)` /
+`generate(...)` yield tokens incrementally through a Java `Iterator` (`LlamaIterable`). What
+is missing is an HTTP layer that emits SSE.
+
+**Find a Java-8-compatible HTTP layer with good SSE support (alternative to Javalin), or
+implement SSE on NanoHTTPD.** Javalin has a first-class `ctx.sse(...)` API but is **not
+usable here**: Javalin 5 requires Java 11 and Javalin 6 requires Java 17, while this repo
+targets Java 8; Javalin 4 (the last Java-8 release) is EOL. Options, in rough order of
+preference:
+- **Implement SSE on the existing NanoHTTPD** via `NanoHTTPD.newChunkedResponse(status,
+  "text/event-stream", InputStream)`, bridging a `LlamaIterable` to an `InputStream` that
+  writes `data: {chunk}\n\n` frames. No new dependency, stays Java-8 clean; likely the right
+  answer. Cost: the iterator→SSE bridge plus closing the `LlamaIterable` on client
+  disconnect.
+- **Undertow** — Java-8 compatible, has a server-sent-events handler, but a heavier
+  dependency tree.
+- **Spark Java** (Jetty 9) — Java-8 compatible; SSE support is limited/manual.
+- Avoid: Javalin 5/6 (Java 11/17), Javalin 4 (EOL), and the JDK `com.sun.net.httpserver`
+  (ArchUnit-banned `com.sun..`).
+
+Scope when implemented: honour `"stream": true` on `POST /v1/chat/completions` and
+`POST /v1/completions`, emit OpenAI-style SSE chunks terminated by `data: [DONE]`, close the
+underlying `LlamaIterable` on disconnect, and keep the non-streaming path as the default. Add
+a model-free routing test plus a real-socket SSE integration test (mirroring
+`OaiHttpServerIntegrationTest`).
+
 ## Open — cross-cutting (slice for this repo)

 - **jqwik pin policy** — see [`../workspace/policies/jqwik-prompt-injection.md`](../workspace/policies/jqwik-prompt-injection.md). `jqwik.version ≤ 1.9.3` is mandatory.
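
Review note on the TODO.md entry above: its preferred option bridges a `LlamaIterable` to an `InputStream` that writes `data: {chunk}\n\n` frames. A minimal sketch of such a bridge follows. It assumes a plain `Iterator<String>` stands in for the token source and that each chunk is already a single-line JSON string. The class name `SseFrameInputStream` and the `onClose` callback are hypothetical, and the wiring into `NanoHTTPD.newChunkedResponse` is deliberately left out.

```java
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;

/** Hypothetical sketch: exposes an Iterator of chunk strings as a stream of SSE frames. */
final class SseFrameInputStream extends InputStream {

    private static final byte[] DONE = "data: [DONE]\n\n".getBytes(StandardCharsets.UTF_8);

    private final Iterator<String> chunks;
    private final Runnable onClose; // assumption: e.g. closes the underlying token source
    private byte[] current = new byte[0];
    private int position;
    private boolean doneSent;

    SseFrameInputStream(Iterator<String> chunks, Runnable onClose) {
        this.chunks = chunks;
        this.onClose = onClose;
    }

    @Override
    public int read() {
        byte[] one = new byte[1];
        return read(one, 0, 1) == -1 ? -1 : one[0] & 0xFF;
    }

    /** Returns at most the rest of the current frame, so a frame is never held back for the next token. */
    @Override
    public int read(byte[] buffer, int offset, int length) {
        if (length == 0) {
            return 0;
        }
        while (position >= current.length) {
            if (!refill()) {
                return -1;
            }
        }
        int count = Math.min(length, current.length - position);
        System.arraycopy(current, position, buffer, offset, count);
        position += count;
        return count;
    }

    /** Loads the next frame: one per chunk, then a single terminating [DONE] frame. */
    private boolean refill() {
        if (chunks.hasNext()) {
            current = ("data: " + chunks.next() + "\n\n").getBytes(StandardCharsets.UTF_8);
        } else if (!doneSent) {
            current = DONE;
            doneSent = true;
        } else {
            return false;
        }
        position = 0;
        return true;
    }

    @Override
    public void close() {
        onClose.run();
    }
}
```

Overriding the three-argument `read` matters here: the inherited implementation keeps calling `read()` until the buffer is full, which would delay a frame until later tokens arrive.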

pom.xml

Lines changed: 59 additions & 0 deletions
@@ -58,6 +58,7 @@ SPDX-License-Identifier: MIT
         <checker.version>4.2.0</checker.version>
         <jackson.version>2.22.0</jackson.version>
         <reactor.version>3.8.6</reactor.version>
+        <nanohttpd.version>2.3.1</nanohttpd.version>
         <slf4j.version>2.0.18</slf4j.version>
         <logback.version>1.5.34</logback.version>
         <animal-sniffer.version>1.27</animal-sniffer.version>
@@ -148,6 +149,20 @@ SPDX-License-Identifier: MIT
             <artifactId>jackson-databind</artifactId>
             <version>${jackson.version}</version>
         </dependency>
+        <!--
+            Embedded HTTP server for the optional OpenAI-compatible server entry point
+            (net.ladenthin.llama.server.LlamaServer, the fat-jar Main-Class). Declared
+            <optional> so library consumers do NOT inherit it on their classpath; the
+            assembly (jar-with-dependencies) profile still bundles it so the fat jar can
+            run the server. Pure Java, zero transitive deps (Java-8 clean), so it does
+            not perturb the enforcer dependencyConvergence rule.
+        -->
+        <dependency>
+            <groupId>org.nanohttpd</groupId>
+            <artifactId>nanohttpd</artifactId>
+            <version>${nanohttpd.version}</version>
+            <optional>true</optional>
+        </dependency>
         <!-- Required by OSInfo (vendored from xerial/sqlite-jdbc) for log emission. -->
         <dependency>
             <groupId>org.slf4j</groupId>
@@ -259,6 +274,11 @@ SPDX-License-Identifier: MIT
                     <artifactId>git-commit-id-maven-plugin</artifactId>
                     <version>10.0.0</version>
                 </plugin>
+                <plugin>
+                    <groupId>org.apache.maven.plugins</groupId>
+                    <artifactId>maven-assembly-plugin</artifactId>
+                    <version>3.8.0</version>
+                </plugin>
                 <plugin>
                     <groupId>org.apache.maven.plugins</groupId>
                     <artifactId>maven-compiler-plugin</artifactId>
@@ -968,5 +988,44 @@ SPDX-License-Identifier: MIT
                 </plugins>
             </build>
         </profile>
+        <profile>
+            <!--
+                Builds the fat jar-with-dependencies uber JAR: the library classes, the
+                default-platform native libs from src/main/resources, and all runtime Java
+                dependencies in one drop-on-classpath JAR, runnable via the LlamaServer
+                Main-Class (set below) to start the OpenAI-compatible HTTP server. Off by
+                default; the CI `package` job activates it so the uber JAR rides along in the
+                `llama-jars` upload-artifact bundle. Documented in CLAUDE.md "Build Commands"
+                as `mvn -P assembly package`.
+            -->
+            <id>assembly</id>
+            <build>
+                <plugins>
+                    <plugin>
+                        <groupId>org.apache.maven.plugins</groupId>
+                        <artifactId>maven-assembly-plugin</artifactId>
+                        <configuration>
+                            <descriptorRefs>
+                                <descriptorRef>jar-with-dependencies</descriptorRef>
+                            </descriptorRefs>
+                            <archive>
+                                <manifest>
+                                    <mainClass>net.ladenthin.llama.server.LlamaServer</mainClass>
+                                </manifest>
+                            </archive>
+                        </configuration>
+                        <executions>
+                            <execution>
+                                <id>build-fat-jar</id>
+                                <phase>package</phase>
+                                <goals>
+                                    <goal>single</goal>
+                                </goals>
+                            </execution>
+                        </executions>
+                    </plugin>
+                </plugins>
+            </build>
+        </profile>
     </profiles>
 </project>

spotbugs-exclude.xml

Lines changed: 47 additions & 0 deletions
@@ -360,4 +360,51 @@ SPDX-License-Identifier: MIT
         <Method name="requireNonNull"/>
     </Match>

+    <!--
+        The OpenAI-compatible server (net.ladenthin.llama.server.*) is a CLI entry point:
+        the model path, host, port and alias all come from command-line arguments supplied
+        by whoever launches the process. findsecbugs flags Paths.get on the model path
+        (PATH_TRAVERSAL_IN) and the startup log lines that echo these values
+        (CRLF_INJECTION_LOGS) because they are non-literal, but the threat model is identical
+        to the LlamaLoader PATH_TRAVERSAL suppression above: an attacker who can set the
+        server's command line has already won, and there is no untrusted end-user input
+        reaching these paths or log statements. There is also no meaningful "allowed root"
+        to canonicalise the operator-chosen model path against.
+    -->
+    <Match>
+        <Class name="~net\.ladenthin\.llama\.server\..*"/>
+        <Or>
+            <Bug pattern="PATH_TRAVERSAL_IN"/>
+            <Bug pattern="CRLF_INJECTION_LOGS"/>
+        </Or>
+    </Match>
+
+    <!--
+        LlamaServerArgs.parse is a flat command-line flag dispatcher: a single switch over
+        the known flags, one case per option, read top to bottom. javac desugars a String
+        switch into a hashCode lookup plus an equals chain (two branches per case), which
+        fb-contrib's bytecode-level CC_CYCLOMATIC_COMPLEXITY counts as a very high score.
+        The source complexity is low and table-flat; extracting the cases into a dispatch
+        map would not make it clearer, so we accept the detector artifact here.
+    -->
+    <Match>
+        <Class name="net.ladenthin.llama.server.LlamaServerArgs"/>
+        <Bug pattern="CC_CYCLOMATIC_COMPLEXITY"/>
+        <Method name="parse"/>
+    </Match>
+
+    <!--
+        LlamaModelOaiBackend is a thin non-owning wrapper around a LlamaModel (the same
+        deliberate dependency-injection contract as Session above): the server owns the one
+        LlamaModel and its native context, and the backend holds the passed-in reference to
+        serve requests. The model must NOT be defensively copied, so storing the reference is
+        by design; spotbugs flags it as EI_EXPOSE_REP2 because the constructor stores an
+        externally-mutable object, which is true but intended.
+    -->
+    <Match>
+        <Class name="net.ladenthin.llama.server.LlamaModelOaiBackend"/>
+        <Bug pattern="EI_EXPOSE_REP2"/>
+        <Method name="&lt;init&gt;"/>
+    </Match>
+
 </FindBugsFilter>
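
Review note on the `CC_CYCLOMATIC_COMPLEXITY` suppression above: the comment describes `LlamaServerArgs.parse` as a single flat `switch` over the known flags, one case per option. The real class is not part of this excerpt, so the following is only a sketch of that shape. The class name `ServerArgsSketch`, its fields, its defaults, and its error messages are all assumptions; only the flag names are taken from the README hunk.

```java
/** Hypothetical sketch of a flat, switch-based flag parser; not the real LlamaServerArgs. */
final class ServerArgsSketch {

    String model;               // required in this sketch
    String host = "127.0.0.1";  // assumed default
    int port = 8080;            // assumed default
    boolean embedding;

    static ServerArgsSketch parse(String[] args) {
        ServerArgsSketch parsed = new ServerArgsSketch();
        for (int i = 0; i < args.length; i++) {
            String flag = args[i];
            // One case per option, read top to bottom; value flags consume the next argument.
            switch (flag) {
                case "--model":
                    parsed.model = value(args, ++i, flag);
                    break;
                case "--host":
                    parsed.host = value(args, ++i, flag);
                    break;
                case "--port":
                    parsed.port = Integer.parseInt(value(args, ++i, flag));
                    break;
                case "--embedding":
                    parsed.embedding = true;
                    break;
                default:
                    throw new IllegalArgumentException("Unknown option: " + flag);
            }
        }
        if (parsed.model == null) {
            throw new IllegalArgumentException("--model is required");
        }
        return parsed;
    }

    /** Returns the argument at index, or fails if the flag was the last token. */
    private static String value(String[] args, int index, String flag) {
        if (index >= args.length) {
            throw new IllegalArgumentException("Missing value for " + flag);
        }
        return args[index];
    }
}
```

Each added `case` reads as one line of source intent but compiles to a hash lookup plus an `equals` check, which is the bytecode-level inflation the suppression comment refers to.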

src/main/java/net/ladenthin/llama/LlamaModel.java

Lines changed: 14 additions & 1 deletion
@@ -835,7 +835,20 @@ public String restoreSlot(int slotId, String filepath) {

     native String handleSlotAction(int action, int slotId, @Nullable String filename);

-    native String handleChatCompletions(String params);
+    /**
+     * Run an OpenAI-compatible chat completion (mirrors the {@code /v1/chat/completions}
+     * endpoint). The request JSON must contain a {@code "messages"} array in the standard
+     * OpenAI chat format; the model's chat template is applied automatically. Returns the
+     * result in OAI format with a {@code "choices"} array. This is the raw JSON-in/JSON-out
+     * form used by {@link #chatComplete(net.ladenthin.llama.parameters.InferenceParameters)}
+     * and by the embedded OpenAI-compatible server
+     * ({@link net.ladenthin.llama.server.LlamaServer}); it is the chat counterpart of
+     * {@link #handleCompletionsOai(String)} and {@link #handleEmbeddings(String, boolean)}.
+     *
+     * @param params JSON string with OAI-compatible chat-completion parameters (incl. {@code "messages"})
+     * @return JSON response in OAI chat-completion format
+     */
+    public native String handleChatCompletions(String params);

     native int requestChatCompletion(String params);
 }

src/main/java/net/ladenthin/llama/server/LlamaModelOaiBackend.java

Lines changed: 67 additions & 0 deletions
@@ -0,0 +1,67 @@
+// SPDX-FileCopyrightText: 2026 Bernard Ladenthin <bernard.ladenthin@gmail.com>
+//
+// SPDX-License-Identifier: MIT
+
+package net.ladenthin.llama.server;
+
+import com.fasterxml.jackson.databind.ObjectMapper;
+import com.fasterxml.jackson.databind.node.ArrayNode;
+import com.fasterxml.jackson.databind.node.ObjectNode;
+import lombok.ToString;
+import net.ladenthin.llama.LlamaModel;
+
+/**
+ * {@link OaiBackend} backed by a loaded {@link LlamaModel}. Each operation forwards the raw request
+ * JSON to the matching {@code LlamaModel.handle*} method, which already produces
+ * OpenAI-compatible response JSON, so no per-field marshalling happens here.
+ *
+ * <p>The model is owned by the caller ({@link LlamaServer}); this class neither closes it nor holds
+ * any other resource.</p>
+ */
+@ToString
+public final class LlamaModelOaiBackend implements OaiBackend {
+
+    private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
+
+    private final LlamaModel model;
+    private final String modelId;
+
+    /**
+     * Create a backend over a loaded model.
+     *
+     * @param model the loaded model to serve requests with
+     * @param modelId the identifier reported by {@link #listModels()} and echoed in responses
+     */
+    public LlamaModelOaiBackend(LlamaModel model, String modelId) {
+        this.model = model;
+        this.modelId = modelId;
+    }
+
+    @Override
+    public String chatCompletions(String requestJson) {
+        return model.handleChatCompletions(requestJson);
+    }
+
+    @Override
+    public String completions(String requestJson) {
+        return model.handleCompletionsOai(requestJson);
+    }
+
+    @Override
+    public String embeddings(String requestJson) {
+        return model.handleEmbeddings(requestJson, true);
+    }
+
+    @Override
+    public String listModels() {
+        final ObjectNode root = OBJECT_MAPPER.createObjectNode();
+        root.put("object", "list");
+        final ArrayNode data = root.putArray("data");
+        final ObjectNode entry = data.addObject();
+        entry.put("id", modelId);
+        entry.put("object", "model");
+        entry.put("owned_by", "llamacpp");
+        // ObjectNode.toString() emits valid JSON without a checked exception.
+        return root.toString();
+    }
+}
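
Review note on `LlamaModelOaiBackend`: because the backend only passes strings through, the same four-method shape can be stubbed without loading a model, which is what a model-free routing test needs. The sketch below assumes that `OaiBackend` declares exactly the four `String` methods overridden above; the interface itself is not part of this excerpt, so `BackendSketch` is a stand-in. `RoutingSketch.route` is a hypothetical dispatcher that follows the method and path table from the README hunk and is not the real `OaiRouter`.

```java
/** Assumed shape of OaiBackend, inferred from the @Override methods in the diff. */
interface BackendSketch {
    String chatCompletions(String requestJson);

    String completions(String requestJson);

    String embeddings(String requestJson);

    String listModels();
}

/** Hypothetical dispatcher; the real OaiRouter is not shown in this excerpt. */
final class RoutingSketch {

    private RoutingSketch() {
    }

    /** Maps an HTTP method and path to a backend call; returns null for unknown routes. */
    static String route(BackendSketch backend, String method, String path, String body) {
        switch (method + " " + path) {
            case "POST /v1/chat/completions":
                return backend.chatCompletions(body);
            case "POST /v1/completions":
                return backend.completions(body);
            case "POST /v1/embeddings":
                return backend.embeddings(body);
            case "GET /v1/models":
                return backend.listModels();
            case "GET /health":
                return "{\"status\":\"ok\"}";
            default:
                return null; // assumption: the caller turns this into a 404
        }
    }
}
```

A test can pass an anonymous `BackendSketch` that returns canned strings and assert on which method each route reaches, with no native library or GGUF file involved.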
