Commit 18e3008

Add an OpenAI-compatible HTTP server entry point (NanoHTTPD)
Introduce net.ladenthin.llama.server.LlamaServer, a runnable main class (and the
fat-jar Main-Class) that loads a GGUF model in-process and serves the
OpenAI-compatible endpoints over a tiny NanoHTTPD server:

  POST /v1/chat/completions -> LlamaModel.handleChatCompletions
  POST /v1/completions      -> LlamaModel.handleCompletionsOai
  POST /v1/embeddings       -> LlamaModel.handleEmbeddings (needs --embedding)
  GET  /v1/models           -> configured model alias
  GET  /health              -> {"status":"ok"}

The handle* methods already return OAI-shaped JSON, so the server only forwards
request bodies.

Design:
- OaiRouter (model-free, unit-tested) maps method+path+body to a response;
  OaiHttpServer is the thin NanoHTTPD adapter; LlamaModelOaiBackend bridges to
  LlamaModel; LlamaServerArgs parses --model/--host/--port/--ctx-size/
  --n-gpu-layers/--threads/--embedding/--model-alias/--help.
- handleChatCompletions widened to public to match the other raw OAI handlers.
- NanoHTTPD is an <optional> compile dependency: bundled in the fat jar, not
  inherited by library consumers (Java-8 clean, zero transitive deps).
- New `server` ArchUnit layer (the only layer allowed to access the Api root).
- spotbugs-exclude: PATH_TRAVERSAL_IN + CRLF_INJECTION_LOGS on the server
  package (operator-supplied CLI input; same threat model as LlamaLoader), CC on
  the flag switch (desugared String-switch artifact), EI_EXPOSE_REP2 on the
  backend (non-owning model wrapper, mirrors Session).

Tests (model-free): LlamaServerArgsTest (10), OaiRouterTest (10),
OaiHttpServerIntegrationTest (real loopback socket + fake backend, 1).

Verified: spotless, compile (Error Prone/NullAway/Checker), spotbugs Max+Low,
javadoc, and the assembly fat jar (Main-Class set, NanoHTTPD bundled) all clean.

Docs: README "OpenAI-compatible HTTP server" + Features bullet; CLAUDE.md note.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UZbmBX5CjqVwPcaTS61im6
1 parent f3ee50c commit 18e3008
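
The commit message describes OaiRouter as a model-free component that maps method, path, and body to a response. Its source is not part of this excerpt, so the following is only a minimal sketch of that idea. The names `SketchBackend`, `SketchResponse`, and `RouterSketch` are hypothetical, the backend method names are borrowed from the `@Override` methods of `LlamaModelOaiBackend`, and the 404 error body is an assumption.

```java
import java.util.Locale;

/** Hypothetical stand-in for the commit's OaiBackend (its source is not shown in this excerpt). */
interface SketchBackend {
    String chatCompletions(String requestJson);

    String completions(String requestJson);

    String embeddings(String requestJson);

    String listModels();
}

/** Hypothetical response holder: an HTTP status code plus a JSON body. */
final class SketchResponse {
    final int status;
    final String body;

    SketchResponse(int status, String body) {
        this.status = status;
        this.body = body;
    }
}

/** Model-free router: picks a backend call from the HTTP method and path. */
final class RouterSketch {
    private final SketchBackend backend;

    RouterSketch(SketchBackend backend) {
        this.backend = backend;
    }

    SketchResponse route(String method, String path, String body) {
        // Normalise the method so "post" and "POST" hit the same case label.
        final String key = method.toUpperCase(Locale.ROOT) + " " + path;
        switch (key) {
            case "GET /health":
                return new SketchResponse(200, "{\"status\":\"ok\"}");
            case "GET /v1/models":
                return new SketchResponse(200, backend.listModels());
            case "POST /v1/chat/completions":
                return new SketchResponse(200, backend.chatCompletions(body));
            case "POST /v1/completions":
                return new SketchResponse(200, backend.completions(body));
            case "POST /v1/embeddings":
                return new SketchResponse(200, backend.embeddings(body));
            default:
                // Assumption: the real router's error status and body are not shown in the commit.
                return new SketchResponse(404, "{\"error\":\"not found\"}");
        }
    }
}
```

Because the router only depends on the backend interface, a test can pass in an anonymous fake backend and exercise every route without loading a model, which is presumably what "model-free" refers to in the test list above.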

18 files changed

Lines changed: 1267 additions & 6 deletions

CLAUDE.md

Lines changed: 1 addition & 0 deletions
@@ -453,6 +453,7 @@ If the local check passes (`BUILD SUCCESS`), the `mvn package` job in
 - `LlamaIterator` / `LlamaIterable` — Streaming generation via Java `Iterator`/`Iterable`.
 - `LlamaLoader` — Extracts the platform-specific native library from the JAR to a temp directory, or finds it on `java.library.path`.
 - `OSInfo` — Detects OS and architecture for library resolution.
+- `server.LlamaServer` — Optional OpenAI-compatible HTTP server and the fat-jar `Main-Class`. `LlamaServerArgs` parses the CLI; `OaiRouter` / `OaiHttpServer` (NanoHTTPD) map `POST /v1/chat/completions`, `/v1/completions`, `/v1/embeddings` and `GET /v1/models` to the `LlamaModel.handle*` methods. NanoHTTPD is an `<optional>` dependency (bundled only in the fat jar, not inherited by library consumers). The `server` package is a dedicated top layer in the ArchUnit `layeredArchitecture` rule (the only layer allowed to access the root `Api`). See README "OpenAI-compatible HTTP server".
 
 **Native layer** (`src/main/cpp/`):
 - `jllama.cpp` — JNI implementation bridging Java calls to llama.cpp. ~1,215 lines; 17 native methods.

README.md

Lines changed: 32 additions & 0 deletions
@@ -97,6 +97,7 @@ Inference of Meta's LLaMA model (and others) in pure C/C++.
 - **Infilling** (fill-in-the-middle) for code models.
 - **Tokenize / detokenize** and **JSON-schema → grammar** conversion.
 - **Raw JSON endpoint handlers** mirroring the upstream llama.cpp HTTP server (`/completions`, `/v1/completions`, `/embeddings`, `/infill`, `/tokenize`, `/detokenize`).
+- **Runnable OpenAI-compatible HTTP server** (`LlamaServer`, the fat-jar `Main-Class`): `java -jar …-jar-with-dependencies.jar --model model.gguf --port 8080`.
 - **Model metadata** access (`getModelMeta()`) and **server management** (metrics, slot save/restore, runtime thread reconfiguration).
 - Pre-built native binaries for Linux (x86-64, aarch64), macOS (x86-64, arm64), and Windows (x86-64, x86); CUDA, Metal, and Vulkan supported via local build.
 
@@ -396,6 +397,37 @@ a JSON response, matching the HTTP server's contract:
 Server state is exposed via `getMetrics()`, `eraseSlot(int)`, `saveSlot(int, String)`,
 `restoreSlot(int, String)`, and `getModelMeta()`.
 
+### OpenAI-compatible HTTP server
+
+The fat jar built by the `assembly` profile (`mvn -P assembly package`) is runnable: its
+`Main-Class` is `net.ladenthin.llama.server.LlamaServer`, a small [NanoHTTPD](https://github.com/NanoHttpd/nanohttpd)
+server that loads a GGUF model in-process and serves OpenAI-compatible endpoints by forwarding each
+request body to the matching `LlamaModel.handle*` method:
+
+```bash
+java -jar target/llama-<version>-jar-with-dependencies.jar \
+  --model models/Qwen3-0.6B-Q4_K_M.gguf --host 0.0.0.0 --port 8080 --n-gpu-layers 99
+```
+
+| Method & path | Backed by |
+|---|---|
+| `POST /v1/chat/completions` | `LlamaModel.handleChatCompletions` |
+| `POST /v1/completions` | `LlamaModel.handleCompletionsOai` |
+| `POST /v1/embeddings` (requires `--embedding`) | `LlamaModel.handleEmbeddings` |
+| `GET /v1/models` | the configured model alias |
+| `GET /health` | static `{"status":"ok"}` |
+
+```bash
+curl http://localhost:8080/v1/chat/completions \
+  -H 'Content-Type: application/json' \
+  -d '{"messages":[{"role":"user","content":"Hello!"}]}'
+```
+
+Run with `--help` for all options (`--ctx-size`, `--threads`, `--model-alias`, …). Responses are
+non-streaming (the full JSON result is returned per request). The NanoHTTPD dependency is declared
+`<optional>`, so it is bundled in the fat jar but **not** inherited by projects that use this
+library as a Maven dependency; running the server requires the fat jar (or adding NanoHTTPD yourself).
+
 ### Model/Inference Configuration
 
 There are two sets of parameters you can configure, `ModelParameters` and `InferenceParameters`. Both provide builder
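
The README section added in this file shows a `curl` call against `/v1/chat/completions`. For readers who want to make the same request from Java, here is a minimal, Java-8-compatible sketch using only the JDK. The class `ChatClientSketch` and its methods are hypothetical helpers and not part of the library; the URL in the usage note assumes the server was started as in the README example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

/** Hypothetical client helper; not part of the library. */
final class ChatClientSketch {

    private ChatClientSketch() {}

    /** Escape a string so it can be embedded in a JSON string literal. */
    static String jsonEscape(String s) {
        final StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            final char c = s.charAt(i);
            switch (c) {
                case '"': sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters become 4-digit hex escapes.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    /** Build the same single-message body as the curl example in the README. */
    static String buildChatRequest(String userMessage) {
        return "{\"messages\":[{\"role\":\"user\",\"content\":\"" + jsonEscape(userMessage) + "\"}]}";
    }

    /** POST a JSON body and return the response body (or the error body on HTTP >= 400). */
    static String post(String url, String json) throws IOException {
        final HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        try {
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Content-Type", "application/json");
            conn.setDoOutput(true);
            try (OutputStream out = conn.getOutputStream()) {
                out.write(json.getBytes(StandardCharsets.UTF_8));
            }
            final InputStream raw = conn.getResponseCode() < 400 ? conn.getInputStream() : conn.getErrorStream();
            if (raw == null) {
                return "";
            }
            try (InputStream in = raw) {
                final ByteArrayOutputStream buf = new ByteArrayOutputStream();
                final byte[] chunk = new byte[4096];
                int n;
                while ((n = in.read(chunk)) != -1) {
                    buf.write(chunk, 0, n);
                }
                return new String(buf.toByteArray(), StandardCharsets.UTF_8);
            }
        } finally {
            conn.disconnect();
        }
    }
}
```

With a server listening on port 8080, `ChatClientSketch.post("http://localhost:8080/v1/chat/completions", ChatClientSketch.buildChatRequest("Hello!"))` should return the full JSON result in one piece, since the README states that responses are non-streaming.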

pom.xml

Lines changed: 20 additions & 0 deletions
@@ -58,6 +58,7 @@ SPDX-License-Identifier: MIT
         <checker.version>4.2.0</checker.version>
         <jackson.version>2.22.0</jackson.version>
         <reactor.version>3.8.6</reactor.version>
+        <nanohttpd.version>2.3.1</nanohttpd.version>
         <slf4j.version>2.0.18</slf4j.version>
         <logback.version>1.5.34</logback.version>
         <animal-sniffer.version>1.27</animal-sniffer.version>
@@ -148,6 +149,20 @@ SPDX-License-Identifier: MIT
             <artifactId>jackson-databind</artifactId>
             <version>${jackson.version}</version>
         </dependency>
+        <!--
+            Embedded HTTP server for the optional OpenAI-compatible server entry point
+            (net.ladenthin.llama.server.LlamaServer, the fat-jar Main-Class). Declared
+            <optional> so library consumers do NOT inherit it on their classpath; the
+            assembly (jar-with-dependencies) profile still bundles it so the fat jar can
+            run the server. Pure Java, zero transitive deps (Java-8 clean), so it does
+            not perturb the enforcer dependencyConvergence rule.
+        -->
+        <dependency>
+            <groupId>org.nanohttpd</groupId>
+            <artifactId>nanohttpd</artifactId>
+            <version>${nanohttpd.version}</version>
+            <optional>true</optional>
+        </dependency>
         <!-- Required by OSInfo (vendored from xerial/sqlite-jdbc) for log emission. -->
         <dependency>
             <groupId>org.slf4j</groupId>
@@ -994,6 +1009,11 @@ SPDX-License-Identifier: MIT
                     <descriptorRefs>
                         <descriptorRef>jar-with-dependencies</descriptorRef>
                     </descriptorRefs>
+                    <archive>
+                        <manifest>
+                            <mainClass>net.ladenthin.llama.server.LlamaServer</mainClass>
+                        </manifest>
+                    </archive>
                 </configuration>
                 <executions>
                     <execution>

spotbugs-exclude.xml

Lines changed: 47 additions & 0 deletions
@@ -360,4 +360,51 @@ SPDX-License-Identifier: MIT
         <Method name="requireNonNull"/>
     </Match>
 
+    <!--
+        The OpenAI-compatible server (net.ladenthin.llama.server.*) is a CLI entry point:
+        the model path, host, port and alias all come from command-line arguments supplied
+        by whoever launches the process. findsecbugs flags Paths.get on the model path
+        (PATH_TRAVERSAL_IN) and the startup log lines that echo these values
+        (CRLF_INJECTION_LOGS) because they are non-literal, but the threat model is identical
+        to the LlamaLoader PATH_TRAVERSAL suppression above: an attacker who can set the
+        server's command line has already won, and there is no untrusted end-user input
+        reaching these paths or log statements. There is also no meaningful "allowed root"
+        to canonicalise the operator-chosen model path against.
+    -->
+    <Match>
+        <Class name="~net\.ladenthin\.llama\.server\..*"/>
+        <Or>
+            <Bug pattern="PATH_TRAVERSAL_IN"/>
+            <Bug pattern="CRLF_INJECTION_LOGS"/>
+        </Or>
+    </Match>
+
+    <!--
+        LlamaServerArgs.parse is a flat command-line flag dispatcher: a single switch over
+        the known flags, one case per option, read top to bottom. javac desugars a String
+        switch into a hashCode lookup plus an equals chain (two branches per case), which
+        fb-contrib's bytecode-level CC_CYCLOMATIC_COMPLEXITY counts as a very high score.
+        The source complexity is low and table-flat; extracting the cases into a dispatch
+        map would not make it clearer, so we accept the detector artifact here.
+    -->
+    <Match>
+        <Class name="net.ladenthin.llama.server.LlamaServerArgs"/>
+        <Bug pattern="CC_CYCLOMATIC_COMPLEXITY"/>
+        <Method name="parse"/>
+    </Match>
+
+    <!--
+        LlamaModelOaiBackend is a thin non-owning wrapper around a LlamaModel (the same
+        deliberate dependency-injection contract as Session above): the server owns the one
+        LlamaModel and its native context, and the backend holds the passed-in reference to
+        serve requests. The model must NOT be defensively copied, so storing the reference is
+        by design; spotbugs flags it as EI_EXPOSE_REP2 because the constructor stores an
+        externally-mutable object, which is true but intended.
+    -->
+    <Match>
+        <Class name="net.ladenthin.llama.server.LlamaModelOaiBackend"/>
+        <Bug pattern="EI_EXPOSE_REP2"/>
+        <Method name="&lt;init&gt;"/>
+    </Match>
+
 </FindBugsFilter>
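
The CC_CYCLOMATIC_COMPLEXITY suppression above describes `LlamaServerArgs.parse` as a single flat switch over the known flags. That class is not included in this excerpt, so the following is only a sketch of what such a dispatcher might look like. The class `FlagSwitchSketch`, its fields, the default values, and the error messages are all assumptions; only the flag names come from the commit message.

```java
/** Hypothetical flat flag dispatcher; the real LlamaServerArgs is not shown in this excerpt. */
final class FlagSwitchSketch {
    String modelPath = "";
    String host = "127.0.0.1"; // assumption: the real default host is not shown
    int port = 8080; // assumption: the real default port is not shown
    boolean embedding;

    static FlagSwitchSketch parse(String[] args) {
        final FlagSwitchSketch cfg = new FlagSwitchSketch();
        for (int i = 0; i < args.length; i++) {
            final String flag = args[i];
            // One case per option. javac compiles this String switch into a hashCode
            // lookup followed by equals checks, which inflates bytecode-level branch counts.
            switch (flag) {
                case "--model":
                    cfg.modelPath = value(args, ++i, flag);
                    break;
                case "--host":
                    cfg.host = value(args, ++i, flag);
                    break;
                case "--port":
                    cfg.port = Integer.parseInt(value(args, ++i, flag));
                    break;
                case "--embedding":
                    cfg.embedding = true; // boolean flag, consumes no value
                    break;
                default:
                    throw new IllegalArgumentException("Unknown option: " + flag);
            }
        }
        return cfg;
    }

    /** Return the value following a flag, or fail if the command line ends early. */
    private static String value(String[] args, int index, String flag) {
        if (index >= args.length) {
            throw new IllegalArgumentException("Missing value for " + flag);
        }
        return args[index];
    }
}
```

Read in source form, the switch is a lookup table with one row per option, which is the argument the suppression comment makes for accepting the detector's score rather than restructuring the code.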

src/main/java/net/ladenthin/llama/LlamaModel.java

Lines changed: 14 additions & 1 deletion
@@ -835,7 +835,20 @@ public String restoreSlot(int slotId, String filepath) {
 
     native String handleSlotAction(int action, int slotId, @Nullable String filename);
 
-    native String handleChatCompletions(String params);
+    /**
+     * Run an OpenAI-compatible chat completion (mirrors the {@code /v1/chat/completions}
+     * endpoint). The request JSON must contain a {@code "messages"} array in the standard
+     * OpenAI chat format; the model's chat template is applied automatically. Returns the
+     * result in OAI format with a {@code "choices"} array. This is the raw JSON-in/JSON-out
+     * form used by {@link #chatComplete(net.ladenthin.llama.parameters.InferenceParameters)}
+     * and by the embedded OpenAI-compatible server
+     * ({@link net.ladenthin.llama.server.LlamaServer}); it is the chat counterpart of
+     * {@link #handleCompletionsOai(String)} and {@link #handleEmbeddings(String, boolean)}.
+     *
+     * @param params JSON string with OAI-compatible chat-completion parameters (incl. {@code "messages"})
+     * @return JSON response in OAI chat-completion format
+     */
+    public native String handleChatCompletions(String params);
 
     native int requestChatCompletion(String params);
 }
src/main/java/net/ladenthin/llama/server/LlamaModelOaiBackend.java

Lines changed: 67 additions & 0 deletions
@@ -0,0 +1,67 @@
+// SPDX-FileCopyrightText: 2026 Bernard Ladenthin <bernard.ladenthin@gmail.com>
+//
+// SPDX-License-Identifier: MIT
+
+package net.ladenthin.llama.server;
+
+import com.fasterxml.jackson.databind.ObjectMapper;
+import com.fasterxml.jackson.databind.node.ArrayNode;
+import com.fasterxml.jackson.databind.node.ObjectNode;
+import lombok.ToString;
+import net.ladenthin.llama.LlamaModel;
+
+/**
+ * {@link OaiBackend} backed by a loaded {@link LlamaModel}. Each operation forwards the raw request
+ * JSON to the matching {@code LlamaModel.handle*} method, which already produces
+ * OpenAI-compatible response JSON, so no per-field marshalling happens here.
+ *
+ * <p>The model is owned by the caller ({@link LlamaServer}); this class neither closes it nor holds
+ * any other resource.</p>
+ */
+@ToString
+public final class LlamaModelOaiBackend implements OaiBackend {
+
+    private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
+
+    private final LlamaModel model;
+    private final String modelId;
+
+    /**
+     * Create a backend over a loaded model.
+     *
+     * @param model the loaded model to serve requests with
+     * @param modelId the identifier reported by {@link #listModels()} and echoed in responses
+     */
+    public LlamaModelOaiBackend(LlamaModel model, String modelId) {
+        this.model = model;
+        this.modelId = modelId;
+    }
+
+    @Override
+    public String chatCompletions(String requestJson) {
+        return model.handleChatCompletions(requestJson);
+    }
+
+    @Override
+    public String completions(String requestJson) {
+        return model.handleCompletionsOai(requestJson);
+    }
+
+    @Override
+    public String embeddings(String requestJson) {
+        return model.handleEmbeddings(requestJson, true);
+    }
+
+    @Override
+    public String listModels() {
+        final ObjectNode root = OBJECT_MAPPER.createObjectNode();
+        root.put("object", "list");
+        final ArrayNode data = root.putArray("data");
+        final ObjectNode entry = data.addObject();
+        entry.put("id", modelId);
+        entry.put("object", "model");
+        entry.put("owned_by", "llamacpp");
+        // ObjectNode.toString() emits valid JSON without a checked exception.
+        return root.toString();
+    }
+}
src/main/java/net/ladenthin/llama/server/LlamaServer.java

Lines changed: 93 additions & 0 deletions
@@ -0,0 +1,93 @@
+// SPDX-FileCopyrightText: 2026 Bernard Ladenthin <bernard.ladenthin@gmail.com>
+//
+// SPDX-License-Identifier: MIT
+
+package net.ladenthin.llama.server;
+
+import fi.iki.elonen.NanoHTTPD;
+import java.io.IOException;
+import net.ladenthin.llama.LlamaModel;
+import net.ladenthin.llama.parameters.ModelParameters;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * Entry point for the optional OpenAI-compatible HTTP server, and the {@code Main-Class} of the
+ * {@code -jar-with-dependencies} assembly.
+ *
+ * <p>It parses the command line ({@link LlamaServerArgs}), loads a GGUF model into a
+ * {@link LlamaModel}, and serves OpenAI-compatible endpoints over NanoHTTPD via {@link OaiRouter} /
+ * {@link OaiHttpServer}. A shutdown hook stops the server and closes the model on JVM exit
+ * (e.g. Ctrl-C / SIGTERM). Run {@code --help} for the full option list.</p>
+ *
+ * <p>Example:</p>
+ *
+ * <pre>{@code
+ * java -jar llama-<version>-jar-with-dependencies.jar \
+ *     --model models/Qwen3-0.6B-Q4_K_M.gguf --host 0.0.0.0 --port 8080 --n-gpu-layers 99
+ * }</pre>
+ *
+ * <p>Responses are non-streaming: the full JSON result is returned per request.</p>
+ */
+public final class LlamaServer {
+
+    private static final Logger LOG = LoggerFactory.getLogger(LlamaServer.class);
+
+    private LlamaServer() {}
+
+    /**
+     * Start the server (blocks the JVM alive on a non-daemon listener thread), or print help.
+     *
+     * @param args command-line arguments; see {@link LlamaServerArgs#usage()}
+     * @throws IOException if the HTTP server cannot bind the configured host/port
+     */
+    public static void main(String[] args) throws IOException {
+        if (LlamaServerArgs.isHelpRequested(args)) {
+            LOG.info("{}{}", System.lineSeparator(), LlamaServerArgs.usage());
+            return;
+        }
+
+        final LlamaServerConfig config = LlamaServerArgs.parse(args);
+        final LlamaModel model = loadModel(config);
+        final OaiBackend backend = new LlamaModelOaiBackend(model, config.getModelAlias());
+        final OaiHttpServer server = new OaiHttpServer(config.getHost(), config.getPort(), new OaiRouter(backend));
+
+        Runtime.getRuntime().addShutdownHook(new Thread(() -> shutdown(server, model), "llama-server-shutdown"));
+
+        try {
+            // daemon=false: the non-daemon listener thread keeps the JVM alive after main() returns.
+            server.start(NanoHTTPD.SOCKET_READ_TIMEOUT, false);
+        } catch (IOException e) {
+            // Close the just-loaded native model before propagating the bind failure.
+            model.close();
+            throw e;
+        }
+
+        LOG.info(
+                "LlamaServer listening on http://{}:{} (model={})",
+                config.getHost(),
+                config.getPort(),
+                config.getModelAlias());
+    }
+
+    private static LlamaModel loadModel(LlamaServerConfig config) {
+        final ModelParameters params =
+                new ModelParameters().setModel(config.getModelPath()).setGpuLayers(config.getGpuLayers());
+        if (config.getCtxSize() > 0) {
+            params.setCtxSize(config.getCtxSize());
+        }
+        if (config.getThreads() > 0) {
+            params.setThreads(config.getThreads());
+        }
+        if (config.isEmbedding()) {
+            params.enableEmbedding();
+        }
+        LOG.info("Loading model {} ...", config.getModelPath());
+        return new LlamaModel(params);
+    }
+
+    private static void shutdown(OaiHttpServer server, LlamaModel model) {
+        server.stop();
+        model.close();
+    }
+}
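
One detail in `main` may be worth a second look: the shutdown hook is registered before `server.start(...)`, so if the bind fails, `model.close()` runs once in the `catch` block and again when the hook fires at JVM exit. Whether `LlamaModel.close()` tolerates a second call is not visible in this diff. If it does not, a small guard like the following sketch would make the release idempotent. `CloseOnceSketch` is a hypothetical helper, not part of the commit.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical guard that runs a close action at most once, even under concurrent callers. */
final class CloseOnceSketch implements AutoCloseable {
    private final Runnable closeAction;
    private final AtomicBoolean closed = new AtomicBoolean(false);

    CloseOnceSketch(Runnable closeAction) {
        this.closeAction = closeAction;
    }

    @Override
    public void close() {
        // compareAndSet lets exactly one caller win; later calls are no-ops.
        if (closed.compareAndSet(false, true)) {
            closeAction.run();
        }
    }
}
```

Under that assumption, `main` could wrap the model as `new CloseOnceSketch(model::close)` and call the guard from both the `catch` block and the shutdown hook. If `LlamaModel.close()` is already idempotent, the guard is unnecessary.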
