bernardladenthin · Jun 18, 2026
@@ -818,7 +818,12 @@ jobs:
           distribution: 'temurin'
           java-version: ${{ env.JAVA_VERSION }}
       - name: Build JARs
-        run: mvn --batch-mode --no-transfer-progress -P release,cuda,opencl-android -Dmaven.test.skip=true -Dgpg.skip=true package
+        # `assembly` additionally produces the fat jar-with-dependencies uber JAR
+        # (llama-<version>-jar-with-dependencies.jar: library classes + Java runtime deps +
+        # default-platform native libs in one drop-on-classpath JAR, runnable via its
+        # LlamaServer Main-Class). It lands in target/ and is uploaded in the `llama-jars`
+        # artifact below - a CI run artifact only, not a Maven Central / GitHub-Release asset.
+        run: mvn --batch-mode --no-transfer-progress -P release,cuda,opencl-android,assembly -Dmaven.test.skip=true -Dgpg.skip=true package
       - name: Upload JARs
         uses: actions/upload-artifact@v7
         with:

@@ -229,6 +229,7 @@ For the full record of upstream API breaks across version ranges (b5022 →
 mvn compile          # Compiles Java and generates JNI headers
 mvn test             # Run all tests (requires native library and model files)
 mvn package          # Build JAR
+mvn -P assembly package  # Also build the fat jar-with-dependencies uber JAR (library + Java deps + native libs); CI builds it and uploads it in the `llama-jars` artifact
 mvn test -Dtest=LlamaModelTest#testGenerate  # Run a single test method
 ```
 
@@ -452,6 +453,7 @@ If the local check passes (`BUILD SUCCESS`), the `mvn package` job in
 - `LlamaIterator` / `LlamaIterable` — Streaming generation via Java `Iterator`/`Iterable`.
 - `LlamaLoader` — Extracts the platform-specific native library from the JAR to a temp directory, or finds it on `java.library.path`.
 - `OSInfo` — Detects OS and architecture for library resolution.
+- `server.LlamaServer` — Optional OpenAI-compatible HTTP server and the fat-jar `Main-Class`. `LlamaServerArgs` parses the CLI; `OaiRouter` / `OaiHttpServer` (NanoHTTPD) map `POST /v1/chat/completions`, `/v1/completions`, `/v1/embeddings` and `GET /v1/models` to the `LlamaModel.handle*` methods. NanoHTTPD is an `<optional>` dependency (bundled only in the fat jar, not inherited by library consumers). The `server` package is a dedicated top layer in the ArchUnit `layeredArchitecture` rule (the only layer allowed to access the root `Api`). See README "OpenAI-compatible HTTP server".
 
 **Native layer** (`src/main/cpp/`):
 - `jllama.cpp` — JNI implementation bridging Java calls to llama.cpp. ~1,215 lines; 17 native methods.

@@ -97,6 +97,7 @@ Inference of Meta's LLaMA model (and others) in pure C/C++.
 - **Infilling** (fill-in-the-middle) for code models.
 - **Tokenize / detokenize** and **JSON-schema → grammar** conversion.
 - **Raw JSON endpoint handlers** mirroring the upstream llama.cpp HTTP server (`/completions`, `/v1/completions`, `/embeddings`, `/infill`, `/tokenize`, `/detokenize`).
+- **Runnable OpenAI-compatible HTTP server** (`LlamaServer`, the fat-jar `Main-Class`): `java -jar …-jar-with-dependencies.jar --model model.gguf --port 8080`.
 - **Model metadata** access (`getModelMeta()`) and **server management** (metrics, slot save/restore, runtime thread reconfiguration).
 - Pre-built native binaries for Linux (x86-64, aarch64), macOS (x86-64, arm64), and Windows (x86-64, x86); CUDA, Metal, and Vulkan supported via local build.
 
@@ -396,6 +397,37 @@ a JSON response, matching the HTTP server's contract:
 Server state is exposed via `getMetrics()`, `eraseSlot(int)`, `saveSlot(int, String)`,
 `restoreSlot(int, String)`, and `getModelMeta()`.
 
+### OpenAI-compatible HTTP server
+
+The fat jar built by the `assembly` profile (`mvn -P assembly package`) is runnable: its
+`Main-Class` is `net.ladenthin.llama.server.LlamaServer`, a small [NanoHTTPD](https://github.com/NanoHttpd/nanohttpd)
+server that loads a GGUF model in-process and serves OpenAI-compatible endpoints by forwarding each
+request body to the matching `LlamaModel.handle*` method:
+
+```bash
+java -jar target/llama-<version>-jar-with-dependencies.jar \
+    --model models/Qwen3-0.6B-Q4_K_M.gguf --host 0.0.0.0 --port 8080 --n-gpu-layers 99
+```
+
+| Method & path | Backed by |
+|---|---|
+| `POST /v1/chat/completions` | `LlamaModel.handleChatCompletions` |
+| `POST /v1/completions` | `LlamaModel.handleCompletionsOai` |
+| `POST /v1/embeddings` (requires `--embedding`) | `LlamaModel.handleEmbeddings` |
+| `GET /v1/models` | the configured model alias |
+| `GET /health` | static `{"status":"ok"}` |
+
+```bash
+curl http://localhost:8080/v1/chat/completions \
+  -H 'Content-Type: application/json' \
+  -d '{"messages":[{"role":"user","content":"Hello!"}]}'
+```
+
+Run with `--help` for all options (`--ctx-size`, `--threads`, `--model-alias`, …). Responses are
+non-streaming: each request returns the full JSON result in a single response, even when the
+request sets `"stream": true`. The NanoHTTPD dependency is declared `<optional>`, so it is bundled
+in the fat jar but **not** inherited by projects that use this library as a Maven dependency;
+running the server requires the fat jar (or adding NanoHTTPD to the classpath yourself).
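Reviewer sketch (not part of this PR): the `curl` call in the README section above could also be made from plain Java 8. This is a minimal sketch, assuming the server is listening on `http://localhost:8080`. `ChatClientSketch`, `escapeJson`, `buildChatRequest` and `postJson` are hypothetical names introduced for illustration, and the hand-rolled escaping only covers the common JSON cases.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

/** Hypothetical client helper; not part of the library. */
final class ChatClientSketch {

    /** Escapes quotes, backslashes and control characters for use inside a JSON string. */
    static String escapeJson(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    /** Builds the same single-message body the curl example sends. */
    static String buildChatRequest(String userMessage) {
        return "{\"messages\":[{\"role\":\"user\",\"content\":\""
                + escapeJson(userMessage) + "\"}]}";
    }

    /** POSTs a JSON body and returns the response body (or error body) as text. */
    static String postJson(String url, String body) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        InputStream in = conn.getResponseCode() < 400
                ? conn.getInputStream() : conn.getErrorStream();
        if (in == null) {
            return "";
        }
        try (InputStream stream = in) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = stream.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        }
    }
}
```

A call such as `ChatClientSketch.postJson("http://localhost:8080/v1/chat/completions", ChatClientSketch.buildChatRequest("Hello!"))` would then return the full response JSON in one piece, since the server is non-streaming.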
+
 ### Model/Inference Configuration
 
 There are two sets of parameters you can configure, `ModelParameters` and `InferenceParameters`. Both provide builder 

@@ -55,6 +55,40 @@ These are JNI plumbing items for upstream API additions. Policy: add only after
 
   **Out of scope until evidence supports it**: actually implementing any of the above. This entry exists so that when someone asks "can I ship java-llama.cpp as a single 30 MB binary?" the answer points to a concrete investigation plan rather than restarting from zero.
 
+### OpenAI-compatible server: token streaming (SSE) + Java-8 HTTP layer
+
+The `net.ladenthin.llama.server.LlamaServer` MVP is **non-streaming**: every request calls
+the blocking `LlamaModel.handle*` method and returns the full JSON response in one shot. A
+client that sends `"stream": true` still receives a single response, not the incremental
+`text/event-stream` (SSE) `data: {chunk}\n\n` events the OpenAI API emits for streaming
+chat/completions. This is the main functional gap of the server today.
+
+The token source already exists — `LlamaModel.generateChat(InferenceParameters)` /
+`generate(...)` yield tokens incrementally through a Java `Iterator` (`LlamaIterable`). What
+is missing is an HTTP layer that emits SSE.
+
+**Find a Java-8-compatible HTTP layer with good SSE support (alternative to Javalin), or
+implement SSE on NanoHTTPD.** Javalin has a first-class `ctx.sse(...)` API but is **not
+usable here**: Javalin 5 requires Java 11 and Javalin 6 requires Java 17, while this repo
+targets Java 8; Javalin 4 (the last Java-8 release) is EOL. Options, in rough order of
+preference:
+- **Implement SSE on the existing NanoHTTPD** via `NanoHTTPD.newChunkedResponse(status,
+  "text/event-stream", InputStream)`, bridging a `LlamaIterable` to an `InputStream` that
+  writes `data: {chunk}\n\n` frames. No new dependency, stays Java-8 clean; likely the right
+  answer. Cost: the iterator→SSE bridge plus closing the `LlamaIterable` on client
+  disconnect.
+- **Undertow** — Java-8 compatible, has a server-sent-events handler, but a heavier
+  dependency tree.
+- **Spark Java** (Jetty 9) — Java-8 compatible; SSE support is limited/manual.
+- Avoid: Javalin 5/6 (Java 11/17), Javalin 4 (EOL), and the JDK `com.sun.net.httpserver`
+  (ArchUnit-banned `com.sun..`).
+
+Scope when implemented: honour `"stream": true` on `POST /v1/chat/completions` and
+`POST /v1/completions`, emit OpenAI-style SSE chunks terminated by `data: [DONE]`, close the
+underlying `LlamaIterable` on disconnect, and keep the non-streaming path as the default. Add
+a model-free routing test plus a real-socket SSE integration test (mirroring
+`OaiHttpServerIntegrationTest`).
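Reviewer sketch (not part of this PR): the iterator-to-SSE bridge described in the first option above could look roughly like this. It is a sketch under stated assumptions: each chunk is already a single-line JSON string (an embedded newline would break SSE framing), and the `onClose` hook is a hypothetical stand-in for whatever cancels or closes the `LlamaIterable`. `SseFrameInputStream` and `drain` are illustrative names only.

```java
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;

/**
 * Hypothetical helper, not part of this PR: turns an iterator of single-line JSON
 * chunks into SSE "data:" frames, followed by a final "data: [DONE]" frame.
 */
final class SseFrameInputStream extends InputStream {

    private final Iterator<String> chunks;
    private final Runnable onClose; // assumed hook, e.g. to cancel the token source
    private byte[] frame = new byte[0];
    private int pos = 0;
    private boolean doneSent = false;
    private boolean closed = false;

    SseFrameInputStream(Iterator<String> chunks, Runnable onClose) {
        this.chunks = chunks;
        this.onClose = onClose;
    }

    /** Ensures unread bytes are buffered; returns false once [DONE] was fully read. */
    private boolean fill() {
        if (pos < frame.length) {
            return true;
        }
        if (chunks.hasNext()) {
            frame = ("data: " + chunks.next() + "\n\n").getBytes(StandardCharsets.UTF_8);
        } else if (!doneSent) {
            frame = "data: [DONE]\n\n".getBytes(StandardCharsets.UTF_8);
            doneSent = true;
        } else {
            return false;
        }
        pos = 0;
        return true;
    }

    @Override
    public int read() {
        return fill() ? (frame[pos++] & 0xFF) : -1;
    }

    @Override
    public int read(byte[] buffer, int offset, int length) {
        if (length == 0) {
            return 0;
        }
        if (!fill()) {
            return -1;
        }
        int n = Math.min(length, frame.length - pos);
        System.arraycopy(frame, pos, buffer, offset, n);
        pos += n;
        return n;
    }

    @Override
    public void close() {
        if (!closed) {
            closed = true;
            onClose.run(); // runs once, on the first close() call
        }
    }

    /** Demo helper: reads the whole stream into a UTF-8 string, then closes it. */
    static String drain(SseFrameInputStream in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[64];
        int n;
        while ((n = in.read(buffer, 0, buffer.length)) != -1) {
            out.write(buffer, 0, n);
        }
        in.close();
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

A stream like this could be handed to `NanoHTTPD.newChunkedResponse(status, "text/event-stream", stream)` as the note proposes. Whether NanoHTTPD reliably calls `close()` when the client disconnects mid-stream should be verified before relying on the hook for cleanup.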
+
 ## Open — cross-cutting (slice for this repo)
 
 - **jqwik pin policy** — see [`../workspace/policies/jqwik-prompt-injection.md`](../workspace/policies/jqwik-prompt-injection.md). `jqwik.version ≤ 1.9.3` is mandatory.

@@ -58,6 +58,7 @@ SPDX-License-Identifier: MIT
 		<checker.version>4.2.0</checker.version>
 		<jackson.version>2.22.0</jackson.version>
 		<reactor.version>3.8.6</reactor.version>
+		<nanohttpd.version>2.3.1</nanohttpd.version>
 		<slf4j.version>2.0.18</slf4j.version>
 		<logback.version>1.5.34</logback.version>
 		<animal-sniffer.version>1.27</animal-sniffer.version>
@@ -148,6 +149,20 @@ SPDX-License-Identifier: MIT
 			<artifactId>jackson-databind</artifactId>
 			<version>${jackson.version}</version>
 		</dependency>
+		<!--
+			Embedded HTTP server for the optional OpenAI-compatible server entry point
+			(net.ladenthin.llama.server.LlamaServer, the fat-jar Main-Class). Declared
+			<optional> so library consumers do NOT inherit it on their classpath; the
+			assembly (jar-with-dependencies) profile still bundles it so the fat jar can
+			run the server. Pure Java, zero transitive deps (Java-8 clean), so it does
+			not perturb the enforcer dependencyConvergence rule.
+		-->
+		<dependency>
+			<groupId>org.nanohttpd</groupId>
+			<artifactId>nanohttpd</artifactId>
+			<version>${nanohttpd.version}</version>
+			<optional>true</optional>
+		</dependency>
 		<!-- Required by OSInfo (vendored from xerial/sqlite-jdbc) for log emission. -->
 		<dependency>
 			<groupId>org.slf4j</groupId>
@@ -259,6 +274,11 @@ SPDX-License-Identifier: MIT
 					<artifactId>git-commit-id-maven-plugin</artifactId>
 					<version>10.0.0</version>
 				</plugin>
+				<plugin>
+					<groupId>org.apache.maven.plugins</groupId>
+					<artifactId>maven-assembly-plugin</artifactId>
+					<version>3.8.0</version>
+				</plugin>
 				<plugin>
 					<groupId>org.apache.maven.plugins</groupId>
 					<artifactId>maven-compiler-plugin</artifactId>
@@ -968,5 +988,44 @@ SPDX-License-Identifier: MIT
 				</plugins>
 			</build>
 		</profile>
+		<profile>
+			<!--
+				Builds the fat jar-with-dependencies uber JAR: the library classes, the
+				default-platform native libs from src/main/resources, and all runtime Java
+				dependencies in one drop-on-classpath JAR, runnable via the LlamaServer
+				Main-Class (set below) to start the OpenAI-compatible HTTP server. Off by
+				default; the CI `package` job activates it so the uber JAR rides along in the
+				`llama-jars` upload-artifact bundle. Documented in CLAUDE.md "Build Commands"
+				as `mvn -P assembly package`.
+			-->
+			<id>assembly</id>
+			<build>
+				<plugins>
+					<plugin>
+						<groupId>org.apache.maven.plugins</groupId>
+						<artifactId>maven-assembly-plugin</artifactId>
+						<configuration>
+							<descriptorRefs>
+								<descriptorRef>jar-with-dependencies</descriptorRef>
+							</descriptorRefs>
+							<archive>
+								<manifest>
+									<mainClass>net.ladenthin.llama.server.LlamaServer</mainClass>
+								</manifest>
+							</archive>
+						</configuration>
+						<executions>
+							<execution>
+								<id>build-fat-jar</id>
+								<phase>package</phase>
+								<goals>
+									<goal>single</goal>
+								</goals>
+							</execution>
+						</executions>
+					</plugin>
+				</plugins>
+			</build>
+		</profile>
 	</profiles>
 </project>
@@ -360,4 +360,51 @@ SPDX-License-Identifier: MIT
         <Method name="requireNonNull"/>
     </Match>
 
+    <!--
+        The OpenAI-compatible server (net.ladenthin.llama.server.*) is a CLI entry point:
+        the model path, host, port and alias all come from command-line arguments supplied
+        by whoever launches the process. findsecbugs flags Paths.get on the model path
+        (PATH_TRAVERSAL_IN) and the startup log lines that echo these values
+        (CRLF_INJECTION_LOGS) because they are non-literal, but the threat model is identical
+        to the LlamaLoader PATH_TRAVERSAL suppression above: an attacker who can set the
+        server's command line has already won, and there is no untrusted end-user input
+        reaching these paths or log statements. There is also no meaningful "allowed root"
+        to canonicalise the operator-chosen model path against.
+    -->
+    <Match>
+        <Class name="~net\.ladenthin\.llama\.server\..*"/>
+        <Or>
+            <Bug pattern="PATH_TRAVERSAL_IN"/>
+            <Bug pattern="CRLF_INJECTION_LOGS"/>
+        </Or>
+    </Match>
+
+    <!--
+        LlamaServerArgs.parse is a flat command-line flag dispatcher: a single switch over
+        the known flags, one case per option, read top to bottom. javac desugars a String
+        switch into a hashCode lookup plus an equals chain (two branches per case), which
+        fb-contrib's bytecode-level CC_CYCLOMATIC_COMPLEXITY counts as a very high score.
+        The source complexity is low and table-flat; extracting the cases into a dispatch
+        map would not make it clearer, so we accept the detector artifact here.
+    -->
+    <Match>
+        <Class name="net.ladenthin.llama.server.LlamaServerArgs"/>
+        <Bug pattern="CC_CYCLOMATIC_COMPLEXITY"/>
+        <Method name="parse"/>
+    </Match>
+
+    <!--
+        LlamaModelOaiBackend is a thin non-owning wrapper around a LlamaModel (the same
+        deliberate dependency-injection contract as Session above): the server owns the one
+        LlamaModel and its native context, and the backend holds the passed-in reference to
+        serve requests. The model must NOT be defensively copied, so storing the reference is
+        by design; spotbugs flags it as EI_EXPOSE_REP2 because the constructor stores an
+        externally-mutable object, which is true but intended.
+    -->
+    <Match>
+        <Class name="net.ladenthin.llama.server.LlamaModelOaiBackend"/>
+        <Bug pattern="EI_EXPOSE_REP2"/>
+        <Method name="&lt;init&gt;"/>
+    </Match>
+
 </FindBugsFilter>
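Reviewer sketch (not part of this PR): for readers unfamiliar with the pattern the `CC_CYCLOMATIC_COMPLEXITY` suppression above describes, a flat String-switch flag dispatcher might look like the following. This is a hypothetical stand-in, not the actual `LlamaServerArgs.parse`; the class name, the map-based result and the error messages are assumptions. The flag names are taken from the README section in this PR.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for a flat flag dispatcher; not the PR's LlamaServerArgs. */
final class FlatArgsSketch {

    static Map<String, String> parse(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        for (int i = 0; i < args.length; i++) {
            String flag = args[i];
            // One case per flag, read top to bottom. javac compiles a String switch
            // into a hashCode() lookup plus equals() checks, which is what inflates
            // the bytecode-level branch count.
            switch (flag) {
                case "--model":
                    options.put("model", valueAfter(args, ++i, flag));
                    break;
                case "--host":
                    options.put("host", valueAfter(args, ++i, flag));
                    break;
                case "--port":
                    options.put("port", valueAfter(args, ++i, flag));
                    break;
                case "--ctx-size":
                    options.put("ctx-size", valueAfter(args, ++i, flag));
                    break;
                case "--threads":
                    options.put("threads", valueAfter(args, ++i, flag));
                    break;
                case "--model-alias":
                    options.put("model-alias", valueAfter(args, ++i, flag));
                    break;
                case "--n-gpu-layers":
                    options.put("n-gpu-layers", valueAfter(args, ++i, flag));
                    break;
                case "--embedding":
                    options.put("embedding", "true"); // boolean flag, takes no value
                    break;
                default:
                    throw new IllegalArgumentException("Unknown option: " + flag);
            }
        }
        return options;
    }

    /** Returns the value following a flag, or fails if the flag is the last argument. */
    private static String valueAfter(String[] args, int index, String flag) {
        if (index >= args.length) {
            throw new IllegalArgumentException("Missing value for " + flag);
        }
        return args[index];
    }
}
```

The source stays table-flat even as flags are added, which is the argument the suppression comment makes for accepting the detector score.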
@@ -835,7 +835,20 @@ public String restoreSlot(int slotId, String filepath) {
 
     native String handleSlotAction(int action, int slotId, @Nullable String filename);
 
-    native String handleChatCompletions(String params);
+    /**
+     * Run an OpenAI-compatible chat completion (mirrors the {@code /v1/chat/completions}
+     * endpoint). The request JSON must contain a {@code "messages"} array in the standard
+     * OpenAI chat format; the model's chat template is applied automatically. Returns the
+     * result in OAI format with a {@code "choices"} array. This is the raw JSON-in/JSON-out
+     * form used by {@link #chatComplete(net.ladenthin.llama.parameters.InferenceParameters)}
+     * and by the embedded OpenAI-compatible server
+     * ({@link net.ladenthin.llama.server.LlamaServer}); it is the chat counterpart of
+     * {@link #handleCompletionsOai(String)} and {@link #handleEmbeddings(String, boolean)}.
+     *
+     * @param params JSON string with OAI-compatible chat-completion parameters (incl. {@code "messages"})
+     * @return JSON response in OAI chat-completion format
+     */
+    public native String handleChatCompletions(String params);
 
     native int requestChatCompletion(String params);
 }
@@ -0,0 +1,67 @@
+// SPDX-FileCopyrightText: 2026 Bernard Ladenthin <bernard.ladenthin@gmail.com>
+//
+// SPDX-License-Identifier: MIT
+
+package net.ladenthin.llama.server;
+
+import com.fasterxml.jackson.databind.ObjectMapper;
+import com.fasterxml.jackson.databind.node.ArrayNode;
+import com.fasterxml.jackson.databind.node.ObjectNode;
+import lombok.ToString;
+import net.ladenthin.llama.LlamaModel;
+
+/**
+ * {@link OaiBackend} backed by a loaded {@link LlamaModel}. Each operation forwards the raw request
+ * JSON to the matching {@code LlamaModel.handle*} method, which already produces
+ * OpenAI-compatible response JSON, so no per-field marshalling happens here.
+ *
+ * <p>The model is owned by the caller ({@link LlamaServer}); this class neither closes it nor holds
+ * any other resource.</p>
+ */
+@ToString
+public final class LlamaModelOaiBackend implements OaiBackend {
+
+    private static final ObjectMapper OBJECT_MAPPER = new ObjectMapper();
+
+    private final LlamaModel model;
+    private final String modelId;
+
+    /**
+     * Create a backend over a loaded model.
+     *
+     * @param model   the loaded model to serve requests with
+     * @param modelId the identifier reported by {@link #listModels()} and echoed in responses
+     */
+    public LlamaModelOaiBackend(LlamaModel model, String modelId) {
+        this.model = model;
+        this.modelId = modelId;
+    }
+
+    @Override
+    public String chatCompletions(String requestJson) {
+        return model.handleChatCompletions(requestJson);
+    }
+
+    @Override
+    public String completions(String requestJson) {
+        return model.handleCompletionsOai(requestJson);
+    }
+
+    @Override
+    public String embeddings(String requestJson) {
+        return model.handleEmbeddings(requestJson, true);
+    }
+
+    @Override
+    public String listModels() {
+        final ObjectNode root = OBJECT_MAPPER.createObjectNode();
+        root.put("object", "list");
+        final ArrayNode data = root.putArray("data");
+        final ObjectNode entry = data.addObject();
+        entry.put("id", modelId);
+        entry.put("object", "model");
+        entry.put("owned_by", "llamacpp");
+        // ObjectNode.toString() emits valid JSON without a checked exception.
+        return root.toString();
+    }
+}
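Reviewer sketch (not part of this PR): because `LlamaModelOaiBackend` only forwards JSON strings, a model-free routing test could use a recording stub in its place. The `OaiBackend` interface is not shown in this diff, so the sketch declares a local mirror of the four operations the class overrides. `OaiBackendSketch`, `RecordingBackend` and the canned response bodies are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Local mirror of the four operations LlamaModelOaiBackend overrides (assumed shape). */
interface OaiBackendSketch {
    String chatCompletions(String requestJson);
    String completions(String requestJson);
    String embeddings(String requestJson);
    String listModels();
}

/** Hypothetical test double: records each call and returns canned JSON, no native model. */
final class RecordingBackend implements OaiBackendSketch {

    final List<String> calls = new ArrayList<>();
    private final String modelId;

    RecordingBackend(String modelId) {
        this.modelId = modelId;
    }

    @Override
    public String chatCompletions(String requestJson) {
        calls.add("chatCompletions:" + requestJson);
        return "{\"object\":\"chat.completion\",\"choices\":[]}";
    }

    @Override
    public String completions(String requestJson) {
        calls.add("completions:" + requestJson);
        return "{\"object\":\"text_completion\",\"choices\":[]}";
    }

    @Override
    public String embeddings(String requestJson) {
        calls.add("embeddings:" + requestJson);
        return "{\"object\":\"list\",\"data\":[]}";
    }

    @Override
    public String listModels() {
        calls.add("listModels");
        // Same field layout as LlamaModelOaiBackend.listModels() in this PR.
        return "{\"object\":\"list\",\"data\":[{\"id\":\"" + modelId
                + "\",\"object\":\"model\",\"owned_by\":\"llamacpp\"}]}";
    }
}
```

A router test could then assert on `calls` to check that each path reaches the intended backend method, without loading a GGUF model.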