server: make NativeServer the default fat-jar Main-Class (keep OpenAiCompatServer)

claude · claude · commit 8a1a68fb02af · 2026-07-03T08:46:47.000Z
Two runnable server mains now exist. The fat jar's default Main-Class becomes NativeServer, so `java -jar …-jar-with-dependencies.jar -m model.gguf --port 8080` runs the full native llama.cpp server with its embedded WebUI, forwarding every argument. OpenAiCompatServer is unchanged and still runnable via `java -cp <jar> net.ladenthin.llama.server.OpenAiCompatServer …`.

- NativeServer.main(args): forwards argv, registers a JVM shutdown hook (the embedded server installs no signal handlers of its own, see patches/0006, so the hook is what stops it cleanly on Ctrl-C/SIGTERM), starts the server, and blocks until the native worker exits.
- llama/pom.xml assembly profile: Main-Class OpenAiCompatServer -> NativeServer.
- README + CLAUDE.md: document the two modes and how to select each.

Verified end-to-end (Linux x86_64, synthetic granitehybrid): `java -cp … NativeServer -m model --port 8972` serves /health=ok after load; SIGTERM to the JVM fires the shutdown hook -> clean "cleaning up before exit" -> port down. Javadoc + spotless clean; 7 pure-Java NativeServer tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
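
Illustration of the hook-then-poll pattern that NativeServer.main uses, as a minimal self-contained sketch. `FakeServer`, `awaitStop` and the class name are hypothetical stand-ins invented for this note; they only mimic the start/isRunning/close shape and are not the real NativeServer API.

```java
import java.util.concurrent.atomic.AtomicBoolean;

final class ShutdownHookSketch {

    /** Hypothetical stand-in for a server exposing start/isRunning/close. */
    static final class FakeServer implements AutoCloseable {
        private final AtomicBoolean running = new AtomicBoolean(false);

        void start() {
            running.set(true);
        }

        boolean isRunning() {
            return running.get();
        }

        @Override
        public void close() {
            running.set(false);
        }
    }

    /**
     * Polls until the server reports it is no longer running.
     * Returns true on a normal stop, false if the waiting thread was interrupted.
     */
    static boolean awaitStop(FakeServer server, long pollMillis) {
        while (server.isRunning()) {
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag for the caller
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        final FakeServer server = new FakeServer();
        final AtomicBoolean stoppedByHook = new AtomicBoolean(false);
        // Register the hook before starting, so Ctrl-C / SIGTERM always has a way to stop the server.
        Runtime.getRuntime().addShutdownHook(new Thread(
                () -> {
                    stoppedByHook.set(true);
                    server.close();
                },
                "sketch-shutdown"));
        server.start();
        // Blocks until the server stops; with this fake that only happens via the hook (Ctrl-C).
        awaitStop(server, 200L);
        if (!stoppedByHook.get()) {
            server.close();
        }
    }
}
```

Registering the hook before `start()` means there is no window in which the server runs without a stop path; the flag only avoids a redundant second `close()` when the hook has already run.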
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -836,7 +836,7 @@ If the local check passes (`BUILD SUCCESS`), the `mvn package` job in
 - `LlamaLoader` — Extracts the platform-specific native library from the JAR to a temp directory, or finds it on `java.library.path`.
 - `OSInfo` — Detects OS and architecture for library resolution.
 - **`server` package — OpenAI-compatible HTTP endpoint (a single implementation).**
-  - `server.OpenAiCompatServer` — built only on the JDK's `com.sun.net.httpserver` (no new dependency), both embeddable and the fat-jar `Main-Class`. Serves `POST /v1/chat/completions` (streaming via SSE + non-streaming), `POST /v1/completions`, `POST /v1/embeddings`, `POST /v1/rerank`, `POST /infill`, `GET /v1/models` and `GET /health` (every route is also reachable without the `/v1` prefix), so editors that speak the OpenAI protocol (e.g. VS Code Copilot "Custom Endpoint", Cline, Roo Code, Continue) can drive a local model. Streaming chat uses the native OAI chunk path (`LlamaModel.streamChatCompletion` → `requestChatCompletionStream` / `receiveChatCompletionChunk` + the C++ `wrap_stream_chunk` helper), preserving `delta.tool_calls`; completions/embeddings/infill forward verbatim to the matching `LlamaModel.handle*`; rerank reshapes `handleRerank` into the OAI `results`/`data` shape. The chat mapper forwards `stream_options` and `response_format` and defaults `cache_prompt=true`; a CORS `Filter` answers `OPTIONS` preflights; `OpenAiSseFormatter.ensureUsageCachedTokens` guarantees `usage.prompt_tokens_details.cached_tokens` on the streamed usage chunk (Copilot crash fix, microsoft/vscode #273482). **Agentic tool-calling is the primary target**; a C++ guard (`test_server.cpp`) pins `tool_calls.function.arguments` as a JSON string (llama.cpp #20198).
+  - `server.OpenAiCompatServer` — built only on the JDK's `com.sun.net.httpserver` (no new dependency), embeddable and runnable via `java -cp <jar> net.ladenthin.llama.server.OpenAiCompatServer …` (the fat-jar default `Main-Class` is now `NativeServer` — see "Two server modes"). Serves `POST /v1/chat/completions` (streaming via SSE + non-streaming), `POST /v1/completions`, `POST /v1/embeddings`, `POST /v1/rerank`, `POST /infill`, `GET /v1/models` and `GET /health` (every route is also reachable without the `/v1` prefix), so editors that speak the OpenAI protocol (e.g. VS Code Copilot "Custom Endpoint", Cline, Roo Code, Continue) can drive a local model. Streaming chat uses the native OAI chunk path (`LlamaModel.streamChatCompletion` → `requestChatCompletionStream` / `receiveChatCompletionChunk` + the C++ `wrap_stream_chunk` helper), preserving `delta.tool_calls`; completions/embeddings/infill forward verbatim to the matching `LlamaModel.handle*`; rerank reshapes `handleRerank` into the OAI `results`/`data` shape. The chat mapper forwards `stream_options` and `response_format` and defaults `cache_prompt=true`; a CORS `Filter` answers `OPTIONS` preflights; `OpenAiSseFormatter.ensureUsageCachedTokens` guarantees `usage.prompt_tokens_details.cached_tokens` on the streamed usage chunk (Copilot crash fix, microsoft/vscode #273482). **Agentic tool-calling is the primary target**; a C++ guard (`test_server.cpp`) pins `tool_calls.function.arguments` as a JSON string (llama.cpp #20198).
   - **Alternative protocol surfaces** (pure translation over the OpenAI chat core — no second inference path; each reconstructs streamed tool calls via `ToolCallDeltaAccumulator`): **Ollama-native** (`GET /api/version`, `/api/tags`, `POST /api/show`, `/api/chat` with NDJSON streaming, `/api/generate` prompt-completion/FIM — `OllamaApiSupport`; `/api/show` advertises tools/insert/vision capabilities + context length for Copilot's Ollama provider), **Anthropic Messages** (`POST /v1/messages`, SSE event stream — `AnthropicApiSupport` + `AnthropicStreamTranslator`), and **OpenAI Responses** (`POST /v1/responses`, SSE event stream — `ResponsesApiSupport` + `ResponsesStreamTranslator`). The llama.cpp-native `GET /props` (context length + `modalities`) is served via `OpenAiSseFormatter.propsJson` for autocomplete clients that size their context from it.
   - Supporting classes: `OpenAiServerConfig` (builder; optional bearer auth; binds `127.0.0.1`; `corsAllowOrigin`; `supportsVision`), `OpenAiServerCli` (testable CLI arg parser → `ModelParameters` + `OpenAiServerConfig`; flags incl. `--mmproj`/`--embedding`/`--reranking`), `OpenAiRequestMapper` (OAI chat request → `InferenceParameters`), `OpenAiSseFormatter` (SSE/models/error JSON + usage normalization), `OaiRerankSupport` (pure rerank request/response shaping), and the model-free test seam `OpenAiBackend`/`ChunkSink` + `LlamaModelBackend`. The streaming envelope is parsed by `json.ChatStreamChunkParser`.
   - The `server` package is a dedicated top layer in the ArchUnit `layeredArchitecture` rule (the only layer allowed to access the root `Api`); `noInternalJdkImports` carries an explicit exception for the supported `com.sun.net.httpserver` (the exported `jdk.httpserver` module, which `module-info.java` `requires`). See README "OpenAI-compatible HTTP server".
@@ -853,8 +853,8 @@ If the local check passes (`BUILD SUCCESS`), the `mvn package` job in
 
 The library exposes **two** ways to serve a model over HTTP, on two different transports:
 
-1. **`server.OpenAiCompatServer` (Java transport).** OpenAI/Ollama/Anthropic-compatible JSON API on the JDK's `com.sun.net.httpserver`, driving the compiled server *core* over JNI. Embeddable, no extra dependency, and it can share/reuse a `LlamaModel`. It serves **no** static assets — its `/` route is a 404, so **no WebUI**. This is the fat-jar `Main-Class`; its CLI (`OpenAiServerCli`) maps a curated flag subset (`-m/-c/-b/-ub/-ngl/-t/-tb/-ctk/-ctv/--jinja/--chat-template-kwargs/--host/--port/--parallel/--mmproj/--api-key/--embedding/--reranking`).
-2. **`server.NativeServer` (native transport).** Runs the **full upstream `llama_server`** (via `patches/0006` + `native_server.cpp`) inside `libjllama`, forwarding the raw llama-server argv verbatim — so **every** llama-server flag works and the **embedded WebUI is served** (when the assets are compiled in; CI's released jars have them, local `cmake` builds use the empty-asset stub). It is an **independent lifecycle** (loads its own model from the argv, like `llama-server.exe`; owns the process's llama backend + stderr logging while running), **single-instance per process** (upstream keeps shutdown state in file-scope globals), and **not available on Android** (the `subprocess.h` guard). Reusing an already-loaded `LlamaModel`'s context is a documented TODO. `libjllama` loading anywhere a JVM runs is what makes this "no separate `llama-server.exe`" possible.
+1. **`server.OpenAiCompatServer` (Java transport).** OpenAI/Ollama/Anthropic-compatible JSON API on the JDK's `com.sun.net.httpserver`, driving the compiled server *core* over JNI. Embeddable, no extra dependency, and it can share/reuse a `LlamaModel`. It serves **no** static assets — its `/` route is a 404, so **no WebUI**. It has its own `main` (run via `java -cp <jar> net.ladenthin.llama.server.OpenAiCompatServer …`); its CLI (`OpenAiServerCli`) maps a curated flag subset (`-m/-c/-b/-ub/-ngl/-t/-tb/-ctk/-ctv/--jinja/--chat-template-kwargs/--host/--port/--parallel/--mmproj/--api-key/--embedding/--reranking`).
+2. **`server.NativeServer` (native transport) — the fat-jar default `Main-Class`.** Runs the **full upstream `llama_server`** (via `patches/0006` + `native_server.cpp`) inside `libjllama`, forwarding the raw llama-server argv verbatim — so **every** llama-server flag works and the **embedded WebUI is served** (when the assets are compiled in; CI's released jars have them, local `cmake` builds use the empty-asset stub). It is an **independent lifecycle** (loads its own model from the argv, like `llama-server.exe`; owns the process's llama backend + stderr logging while running), **single-instance per process** (upstream keeps shutdown state in file-scope globals), and **not available on Android** (the `subprocess.h` guard). Reusing an already-loaded `LlamaModel`'s context is a documented TODO. `libjllama` loading anywhere a JVM runs is what makes this "no separate `llama-server.exe`" possible.
 
 ### Native Helper Architecture
 
diff --git a/README.md b/README.md
@@ -107,7 +107,7 @@ Inference of Meta's LLaMA model (and others) in pure C/C++.
 - **Infilling** (fill-in-the-middle) for code models.
 - **Tokenize / detokenize** and **JSON-schema → grammar** conversion.
 - **Raw JSON endpoint handlers** mirroring the upstream llama.cpp HTTP server (`/completions`, `/v1/completions`, `/embeddings`, `/infill`, `/tokenize`, `/detokenize`).
-- **Runnable OpenAI-compatible HTTP server** (`OpenAiCompatServer`, the fat-jar `Main-Class`, streaming SSE, zero extra dependency): `java -jar …-jar-with-dependencies.jar --model model.gguf --port 8080`.
+- **Two runnable HTTP server modes.** The fat jar's default `Main-Class` is `NativeServer` — the full upstream llama.cpp server (embedded **WebUI**, every llama-server flag forwarded) hosted inside `libjllama` over JNI, no separate `llama-server.exe`: `java -jar …-jar-with-dependencies.jar -m model.gguf --port 8080`. The Java-transport, zero-extra-dependency **OpenAI-compatible** server (`OpenAiCompatServer`, streaming SSE) is also available: `java -cp …-jar-with-dependencies.jar net.ladenthin.llama.server.OpenAiCompatServer --model model.gguf --port 8080`.
 - **Model metadata** access (`getModelMeta()`) and **server management** (metrics, slot save/restore, runtime thread reconfiguration).
 - Pre-built native binaries for Linux (x86-64, aarch64), macOS (x86-64, arm64), and Windows (x86-64, x86); CUDA, Metal, and Vulkan supported via local build.
 
@@ -591,7 +591,9 @@ array alone at `GET /slots`. OpenAI responses preserve
 
 `net.ladenthin.llama.server.OpenAiCompatServer` turns a loaded model into a local
 OpenAI-compatible HTTP endpoint using only the JDK's built-in `com.sun.net.httpserver` — no extra
-dependency and no separate server process. It is both embeddable and the fat-jar `Main-Class`. It
+dependency and no separate server process. It is embeddable, and runnable via
+`java -cp <jar> net.ladenthin.llama.server.OpenAiCompatServer …` (the fat jar's default
+`Main-Class` is instead `NativeServer` — see "Native server with the built-in WebUI" below). It
 serves:
 
 | Method & path | Backed by |
@@ -646,16 +648,17 @@ try (LlamaModel model = new LlamaModel(modelParams);
 }
 ```
 
-…or run it standalone. The fat jar built by the `assembly` profile (`mvn -P assembly package`) is
-runnable (its `Main-Class` is `net.ladenthin.llama.server.OpenAiCompatServer`); the plain library jar
-works too via `-cp`:
+…or run it standalone. It has its own `main`, launched by class name via `-cp` (the fat jar's
+default `java -jar` `Main-Class` is `NativeServer` — the native server below — so name
+`OpenAiCompatServer` explicitly to get this Java one):
 
 ```bash
-# fat jar (bundles the native lib + Java deps)
-java -jar target/llama-<version>-jar-with-dependencies.jar \
+# fat jar (bundles the native lib + Java deps) — name the class explicitly
+java -cp target/llama-<version>-jar-with-dependencies.jar \
+    net.ladenthin.llama.server.OpenAiCompatServer \
     --model models/Qwen3-0.6B-Q4_K_M.gguf --host 0.0.0.0 --port 8080 --n-gpu-layers 99
 
-# or the plain jar
+# or the plain library jar
 java -cp target/llama-<version>.jar net.ladenthin.llama.server.OpenAiCompatServer \
   --model models/model.gguf --port 8080 --model-id local-model
 ```
@@ -716,7 +719,17 @@ tool calling depends on the model's own tool-calling quality. Pass `--api-key` (
 the **full upstream llama.cpp server, including its bundled Svelte WebUI**, use
 `net.ladenthin.llama.server.NativeServer`. It runs the real `llama_server` inside `libjllama` over
 JNI — no separate `llama-server.exe` — and **forwards the raw llama-server arguments verbatim**, so
-every flag works exactly as it does for the standalone binary:
+every flag works exactly as it does for the standalone binary. It is the fat jar's default
+`Main-Class`, so `java -jar` just forwards its args to the native server (pass `--help` for the full
+llama-server option list):
+
+```bash
+java -jar target/llama-<version>-jar-with-dependencies.jar \
+    -m models/model.gguf --host 127.0.0.1 --port 8080 -c 65536 --jinja
+# then open http://127.0.0.1:8080/ for the WebUI
+```
+
+Or embed it:
 
 ```java
 try (NativeServer server = new NativeServer(
diff --git a/llama/pom.xml b/llama/pom.xml
@@ -1296,8 +1296,10 @@ SPDX-License-Identifier: MIT
 			<!--
 				Builds the fat jar-with-dependencies uber JAR: the library classes, the
 				default-platform native libs from src/main/resources, and all runtime Java
-				dependencies in one drop-on-classpath JAR, runnable via the OpenAiCompatServer
-				Main-Class (set below) to start the OpenAI-compatible HTTP server. Off by
+				dependencies in one drop-on-classpath JAR, with NativeServer as the fat-jar
+				Main-Class (set below) to start the full native llama.cpp server (embedded WebUI,
+					every llama-server flag forwarded); OpenAiCompatServer stays runnable via
+					`java -cp <jar> net.ladenthin.llama.server.OpenAiCompatServer`. Off by
 				default; the CI `package` job activates it so the uber JAR rides along in the
 				`llama-jars` upload-artifact bundle. Documented in CLAUDE.md "Build Commands"
 				as `mvn -P assembly package`.
@@ -1314,7 +1316,7 @@ SPDX-License-Identifier: MIT
 							</descriptorRefs>
 							<archive>
 								<manifest>
-									<mainClass>net.ladenthin.llama.server.OpenAiCompatServer</mainClass>
+									<mainClass>net.ladenthin.llama.server.NativeServer</mainClass>
 								</manifest>
 							</archive>
 						</configuration>
diff --git a/llama/src/main/java/net/ladenthin/llama/server/NativeServer.java b/llama/src/main/java/net/ladenthin/llama/server/NativeServer.java
@@ -180,6 +180,44 @@ public void close() {
         }
     }
 
+    /**
+     * Fat-jar entry point (the assembly JAR's {@code Main-Class}): starts the full native llama.cpp
+     * server — WebUI included — forwarding every argument to it verbatim, and blocks until the
+     * server exits or the JVM is asked to shut down (Ctrl-C / SIGTERM), stopping the server cleanly
+     * on the way out.
+     *
+     * <p>This is the default runnable server. The Java-transport {@link OpenAiCompatServer} remains
+     * available via its own {@code main} — run it explicitly with
+     * {@code java -cp <jar> net.ladenthin.llama.server.OpenAiCompatServer …}.</p>
+     *
+     * @param args the llama-server command-line arguments, forwarded verbatim (e.g. {@code -m
+     *             model.gguf --host 127.0.0.1 --port 8080}); pass {@code --help} for the full
+     *             llama-server option list
+     * @throws InterruptedException if interrupted while waiting for the server to exit
+     */
+    public static void main(String[] args) throws InterruptedException {
+        final NativeServer server = new NativeServer(args);
+        final AtomicBoolean stoppedByHook = new AtomicBoolean(false);
+        // Graceful Ctrl-C / SIGTERM: the embedded server installs no signal handlers of its own
+        // (see patches/0006), so the JVM-level shutdown hook is what stops it before exit.
+        Runtime.getRuntime()
+                .addShutdownHook(new Thread(
+                        () -> {
+                            stoppedByHook.set(true);
+                            server.close();
+                        },
+                        "jllama-native-server-shutdown"));
+        server.start();
+        // Keep the JVM alive until the native worker exits — on its own (e.g. a fatal startup/model
+        // error that llama_server has already logged) or because the shutdown hook stopped it.
+        while (server.isRunning()) {
+            Thread.sleep(200L);
+        }
+        if (!stoppedByHook.get()) {
+            server.close();
+        }
+    }
+
     /**
      * Starts the native server on a worker thread and returns an opaque handle. The argv is
      * forwarded verbatim (with a synthetic {@code argv[0]}).