Commit 8a1a68f

server: make NativeServer the default fat-jar Main-Class (keep OpenAiCompatServer)
Two runnable server mains now exist. The fat jar's default Main-Class becomes NativeServer, so `java -jar …-jar-with-dependencies.jar -m model.gguf --port 8080` runs the full native llama.cpp server with its embedded WebUI, forwarding every argument. OpenAiCompatServer is unchanged and still runnable via `java -cp <jar> net.ladenthin.llama.server.OpenAiCompatServer …`.

- NativeServer.main(args): forwards argv, starts the server, registers a JVM shutdown hook (the embedded server installs no signal handlers of its own, see patches/0006, so the hook is what stops it cleanly on Ctrl-C/SIGTERM), and blocks until the native worker exits.
- llama/pom.xml assembly profile: Main-Class OpenAiCompatServer -> NativeServer.
- README + CLAUDE.md: document the two modes and how to select each.

Verified end-to-end (Linux x86_64, synthetic granitehybrid): `java -cp … NativeServer -m model --port 8972` serves /health=ok after load; SIGTERM to the JVM fires the shutdown hook -> clean "cleaning up before exit" -> port down. Javadoc + spotless clean; 7 pure-Java NativeServer tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
1 parent 3c8aeb8 commit 8a1a68f
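
The end-to-end check described in the commit message (start the server, wait for `/health`) could be scripted rather than done by hand. The following is a minimal sketch, assuming only that the server answers `GET /health` with HTTP 200 once it is ready; the class `HealthProbe`, its method names, and the retry policy are hypothetical and not part of the repository.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

/** Hypothetical helper (not in the repository): polls a health URL until it returns 200. */
final class HealthProbe {

    /** Returns true as soon as the URL answers 200, false once all attempts are used up. */
    static boolean waitUntilHealthy(String url, int attempts, long delayMillis) {
        for (int i = 0; i < attempts; i++) {
            if (statusOf(url) == 200) {
                return true;
            }
            try {
                Thread.sleep(delayMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }

    /** Returns the HTTP status code, or -1 if the request fails (e.g. the port is not open yet). */
    static int statusOf(String url) {
        try {
            HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setConnectTimeout(1000);
            conn.setReadTimeout(1000);
            conn.setRequestMethod("GET");
            try {
                return conn.getResponseCode();
            } finally {
                conn.disconnect();
            }
        } catch (IOException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        // The default URL is an assumption; pass the real one as the first argument.
        String url = args.length > 0 ? args[0] : "http://127.0.0.1:8080/health";
        boolean ok = waitUntilHealthy(url, 60, 1000L);
        System.out.println(ok ? "healthy" : "not healthy");
        System.exit(ok ? 0 : 1);
    }
}
```

Any non-200 status or refused connection is treated as "not ready yet", so the probe can be started before the model has finished loading.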

4 files changed

Lines changed: 68 additions & 15 deletions

CLAUDE.md

Lines changed: 3 additions & 3 deletions
Original file line number | Diff line number | Diff line change
@@ -836,7 +836,7 @@ If the local check passes (`BUILD SUCCESS`), the `mvn package` job in
836836
- `LlamaLoader` — Extracts the platform-specific native library from the JAR to a temp directory, or finds it on `java.library.path`.
837837
- `OSInfo` — Detects OS and architecture for library resolution.
838838
- **`server` package — OpenAI-compatible HTTP endpoint (a single implementation).**
839-
- `server.OpenAiCompatServer` — built only on the JDK's `com.sun.net.httpserver` (no new dependency), both embeddable and the fat-jar `Main-Class`. Serves `POST /v1/chat/completions` (streaming via SSE + non-streaming), `POST /v1/completions`, `POST /v1/embeddings`, `POST /v1/rerank`, `POST /infill`, `GET /v1/models` and `GET /health` (every route is also reachable without the `/v1` prefix), so editors that speak the OpenAI protocol (e.g. VS Code Copilot "Custom Endpoint", Cline, Roo Code, Continue) can drive a local model. Streaming chat uses the native OAI chunk path (`LlamaModel.streamChatCompletion` → `requestChatCompletionStream` / `receiveChatCompletionChunk` + the C++ `wrap_stream_chunk` helper), preserving `delta.tool_calls`; completions/embeddings/infill forward verbatim to the matching `LlamaModel.handle*`; rerank reshapes `handleRerank` into the OAI `results`/`data` shape. The chat mapper forwards `stream_options` and `response_format` and defaults `cache_prompt=true`; a CORS `Filter` answers `OPTIONS` preflights; `OpenAiSseFormatter.ensureUsageCachedTokens` guarantees `usage.prompt_tokens_details.cached_tokens` on the streamed usage chunk (Copilot crash fix, microsoft/vscode #273482). **Agentic tool-calling is the primary target**; a C++ guard (`test_server.cpp`) pins `tool_calls.function.arguments` as a JSON string (llama.cpp #20198).
839+
- `server.OpenAiCompatServer` — built only on the JDK's `com.sun.net.httpserver` (no new dependency), embeddable and runnable via `java -cp <jar> net.ladenthin.llama.server.OpenAiCompatServer …` (the fat-jar default `Main-Class` is now `NativeServer` — see "Two server modes"). Serves `POST /v1/chat/completions` (streaming via SSE + non-streaming), `POST /v1/completions`, `POST /v1/embeddings`, `POST /v1/rerank`, `POST /infill`, `GET /v1/models` and `GET /health` (every route is also reachable without the `/v1` prefix), so editors that speak the OpenAI protocol (e.g. VS Code Copilot "Custom Endpoint", Cline, Roo Code, Continue) can drive a local model. Streaming chat uses the native OAI chunk path (`LlamaModel.streamChatCompletion` → `requestChatCompletionStream` / `receiveChatCompletionChunk` + the C++ `wrap_stream_chunk` helper), preserving `delta.tool_calls`; completions/embeddings/infill forward verbatim to the matching `LlamaModel.handle*`; rerank reshapes `handleRerank` into the OAI `results`/`data` shape. The chat mapper forwards `stream_options` and `response_format` and defaults `cache_prompt=true`; a CORS `Filter` answers `OPTIONS` preflights; `OpenAiSseFormatter.ensureUsageCachedTokens` guarantees `usage.prompt_tokens_details.cached_tokens` on the streamed usage chunk (Copilot crash fix, microsoft/vscode #273482). **Agentic tool-calling is the primary target**; a C++ guard (`test_server.cpp`) pins `tool_calls.function.arguments` as a JSON string (llama.cpp #20198).
840840
- **Alternative protocol surfaces** (pure translation over the OpenAI chat core — no second inference path; each reconstructs streamed tool calls via `ToolCallDeltaAccumulator`): **Ollama-native** (`GET /api/version`, `/api/tags`, `POST /api/show`, `/api/chat` with NDJSON streaming, `/api/generate` prompt-completion/FIM — `OllamaApiSupport`; `/api/show` advertises tools/insert/vision capabilities + context length for Copilot's Ollama provider), **Anthropic Messages** (`POST /v1/messages`, SSE event stream — `AnthropicApiSupport` + `AnthropicStreamTranslator`), and **OpenAI Responses** (`POST /v1/responses`, SSE event stream — `ResponsesApiSupport` + `ResponsesStreamTranslator`). The llama.cpp-native `GET /props` (context length + `modalities`) is served via `OpenAiSseFormatter.propsJson` for autocomplete clients that size their context from it.
841841
- Supporting classes: `OpenAiServerConfig` (builder; optional bearer auth; binds `127.0.0.1`; `corsAllowOrigin`; `supportsVision`), `OpenAiServerCli` (testable CLI arg parser → `ModelParameters` + `OpenAiServerConfig`; flags incl. `--mmproj`/`--embedding`/`--reranking`), `OpenAiRequestMapper` (OAI chat request → `InferenceParameters`), `OpenAiSseFormatter` (SSE/models/error JSON + usage normalization), `OaiRerankSupport` (pure rerank request/response shaping), and the model-free test seam `OpenAiBackend`/`ChunkSink` + `LlamaModelBackend`. The streaming envelope is parsed by `json.ChatStreamChunkParser`.
842842
- The `server` package is a dedicated top layer in the ArchUnit `layeredArchitecture` rule (the only layer allowed to access the root `Api`); `noInternalJdkImports` carries an explicit exception for the supported `com.sun.net.httpserver` (the exported `jdk.httpserver` module, which `module-info.java` `requires`). See README "OpenAI-compatible HTTP server".
@@ -853,8 +853,8 @@ If the local check passes (`BUILD SUCCESS`), the `mvn package` job in
853853

854854
The library exposes **two** ways to serve a model over HTTP, on two different transports:
855855

856-
1. **`server.OpenAiCompatServer` (Java transport).** OpenAI/Ollama/Anthropic-compatible JSON API on the JDK's `com.sun.net.httpserver`, driving the compiled server *core* over JNI. Embeddable, no extra dependency, and it can share/reuse a `LlamaModel`. It serves **no** static assets — its `/` route is a 404, so **no WebUI**. This is the fat-jar `Main-Class`; its CLI (`OpenAiServerCli`) maps a curated flag subset (`-m/-c/-b/-ub/-ngl/-t/-tb/-ctk/-ctv/--jinja/--chat-template-kwargs/--host/--port/--parallel/--mmproj/--api-key/--embedding/--reranking`).
857-
2. **`server.NativeServer` (native transport).** Runs the **full upstream `llama_server`** (via `patches/0006` + `native_server.cpp`) inside `libjllama`, forwarding the raw llama-server argv verbatim — so **every** llama-server flag works and the **embedded WebUI is served** (when the assets are compiled in; CI's released jars have them, local `cmake` builds use the empty-asset stub). It is an **independent lifecycle** (loads its own model from the argv, like `llama-server.exe`; owns the process's llama backend + stderr logging while running), **single-instance per process** (upstream keeps shutdown state in file-scope globals), and **not available on Android** (the `subprocess.h` guard). Reusing an already-loaded `LlamaModel`'s context is a documented TODO. `libjllama` loading anywhere a JVM runs is what makes this "no separate `llama-server.exe`" possible.
856+
1. **`server.OpenAiCompatServer` (Java transport).** OpenAI/Ollama/Anthropic-compatible JSON API on the JDK's `com.sun.net.httpserver`, driving the compiled server *core* over JNI. Embeddable, no extra dependency, and it can share/reuse a `LlamaModel`. It serves **no** static assets — its `/` route is a 404, so **no WebUI**. It has its own `main` (run via `java -cp <jar> net.ladenthin.llama.server.OpenAiCompatServer …`); its CLI (`OpenAiServerCli`) maps a curated flag subset (`-m/-c/-b/-ub/-ngl/-t/-tb/-ctk/-ctv/--jinja/--chat-template-kwargs/--host/--port/--parallel/--mmproj/--api-key/--embedding/--reranking`).
857+
2. **`server.NativeServer` (native transport) — the fat-jar default `Main-Class`.** Runs the **full upstream `llama_server`** (via `patches/0006` + `native_server.cpp`) inside `libjllama`, forwarding the raw llama-server argv verbatim — so **every** llama-server flag works and the **embedded WebUI is served** (when the assets are compiled in; CI's released jars have them, local `cmake` builds use the empty-asset stub). It is an **independent lifecycle** (loads its own model from the argv, like `llama-server.exe`; owns the process's llama backend + stderr logging while running), **single-instance per process** (upstream keeps shutdown state in file-scope globals), and **not available on Android** (the `subprocess.h` guard). Reusing an already-loaded `LlamaModel`'s context is a documented TODO. `libjllama` loading anywhere a JVM runs is what makes this "no separate `llama-server.exe`" possible.
858858

859859
### Native Helper Architecture
860860

README.md

Lines changed: 22 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -107,7 +107,7 @@ Inference of Meta's LLaMA model (and others) in pure C/C++.
107107
- **Infilling** (fill-in-the-middle) for code models.
108108
- **Tokenize / detokenize** and **JSON-schema → grammar** conversion.
109109
- **Raw JSON endpoint handlers** mirroring the upstream llama.cpp HTTP server (`/completions`, `/v1/completions`, `/embeddings`, `/infill`, `/tokenize`, `/detokenize`).
110-
- **Runnable OpenAI-compatible HTTP server** (`OpenAiCompatServer`, the fat-jar `Main-Class`, streaming SSE, zero extra dependency): `java -jar …-jar-with-dependencies.jar --model model.gguf --port 8080`.
110+
- **Two runnable HTTP server modes.** The fat jar's default `Main-Class` is `NativeServer` — the full upstream llama.cpp server (embedded **WebUI**, every llama-server flag forwarded) hosted inside `libjllama` over JNI, no separate `llama-server.exe`: `java -jar …-jar-with-dependencies.jar -m model.gguf --port 8080`. The Java-transport, zero-extra-dependency **OpenAI-compatible** server (`OpenAiCompatServer`, streaming SSE) is also available: `java -cp …-jar-with-dependencies.jar net.ladenthin.llama.server.OpenAiCompatServer --model model.gguf --port 8080`.
111111
- **Model metadata** access (`getModelMeta()`) and **server management** (metrics, slot save/restore, runtime thread reconfiguration).
112112
- Pre-built native binaries for Linux (x86-64, aarch64), macOS (x86-64, arm64), and Windows (x86-64, x86); CUDA, Metal, and Vulkan supported via local build.
113113

@@ -591,7 +591,9 @@ array alone at `GET /slots`. OpenAI responses preserve
591591

592592
`net.ladenthin.llama.server.OpenAiCompatServer` turns a loaded model into a local
593593
OpenAI-compatible HTTP endpoint using only the JDK's built-in `com.sun.net.httpserver` — no extra
594-
dependency and no separate server process. It is both embeddable and the fat-jar `Main-Class`. It
594+
dependency and no separate server process. It is embeddable, and runnable via
595+
`java -cp <jar> net.ladenthin.llama.server.OpenAiCompatServer …` (the fat jar's default
596+
`Main-Class` is instead `NativeServer` — see "Native server with the built-in WebUI" below). It
595597
serves:
596598

597599
| Method & path | Backed by |
@@ -646,16 +648,17 @@ try (LlamaModel model = new LlamaModel(modelParams);
646648
}
647649
```
648650

649-
…or run it standalone. The fat jar built by the `assembly` profile (`mvn -P assembly package`) is
650-
runnable (its `Main-Class` is `net.ladenthin.llama.server.OpenAiCompatServer`); the plain library jar
651-
works too via `-cp`:
651+
…or run it standalone. It has its own `main`, launched by class name via `-cp` (the fat jar's
652+
default `java -jar` `Main-Class` is `NativeServer`, the native server below, so name
653+
`OpenAiCompatServer` explicitly to get this Java one):
652654

653655
```bash
654-
# fat jar (bundles the native lib + Java deps)
655-
java -jar target/llama-<version>-jar-with-dependencies.jar \
656+
# fat jar (bundles the native lib + Java deps) — name the class explicitly
657+
java -cp target/llama-<version>-jar-with-dependencies.jar \
658+
net.ladenthin.llama.server.OpenAiCompatServer \
656659
--model models/Qwen3-0.6B-Q4_K_M.gguf --host 0.0.0.0 --port 8080 --n-gpu-layers 99
657660

658-
# or the plain jar
661+
# or the plain library jar
659662
java -cp target/llama-<version>.jar net.ladenthin.llama.server.OpenAiCompatServer \
660663
--model models/model.gguf --port 8080 --model-id local-model
661664
```
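
Once `OpenAiCompatServer` is listening, an OpenAI-style client can call it. The following is a minimal sketch of a non-streaming `POST /v1/chat/completions` request using the JDK's `java.net.http` client (Java 11+). It assumes the server from the plain-jar command above (port 8080, `--model-id local-model`); the class `ChatCompletionClientSketch` and its hand-rolled JSON escaping are hypothetical and not part of the repository.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Hypothetical client sketch (not in the repository) for the OpenAI-compatible chat route. */
final class ChatCompletionClientSketch {

    /** Builds a minimal OpenAI-style chat request body with a single user message. */
    static String chatBody(String modelId, String userMessage) {
        return "{\"model\":\"" + escape(modelId) + "\","
                + "\"messages\":[{\"role\":\"user\",\"content\":\"" + escape(userMessage) + "\"}],"
                + "\"stream\":false}";
    }

    /** Escapes backslashes, quotes and newlines only; a real client would use a JSON library. */
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }

    /** Builds the POST request against baseUrl + /v1/chat/completions. */
    static HttpRequest chatRequest(String baseUrl, String modelId, String userMessage) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/v1/chat/completions"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(chatBody(modelId, userMessage)))
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Base URL and model id are assumptions matching the example command line.
        HttpRequest request = chatRequest("http://127.0.0.1:8080", "local-model", "Say hello.");
        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```

If the server was started with `--api-key`, the request would presumably also need a bearer `Authorization` header; that case is left out of the sketch.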
@@ -716,7 +719,17 @@ tool calling depends on the model's own tool-calling quality. Pass `--api-key` (
716719
the **full upstream llama.cpp server, including its bundled Svelte WebUI**, use
717720
`net.ladenthin.llama.server.NativeServer`. It runs the real `llama_server` inside `libjllama` over
718721
JNI — no separate `llama-server.exe` — and **forwards the raw llama-server arguments verbatim**, so
719-
every flag works exactly as it does for the standalone binary:
722+
every flag works exactly as it does for the standalone binary. It is the fat jar's default
723+
`Main-Class`, so `java -jar` just forwards its args to the native server (pass `--help` for the full
724+
llama-server option list):
725+
726+
```bash
727+
java -jar target/llama-<version>-jar-with-dependencies.jar \
728+
-m models/model.gguf --host 127.0.0.1 --port 8080 -c 65536 --jinja
729+
# then open http://127.0.0.1:8080/ for the WebUI
730+
```
731+
732+
Or embed it:
720733

721734
```java
722735
try (NativeServer server = new NativeServer(

llama/pom.xml

Lines changed: 5 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1296,8 +1296,10 @@ SPDX-License-Identifier: MIT
12961296
<!--
12971297
Builds the fat jar-with-dependencies uber JAR: the library classes, the
12981298
default-platform native libs from src/main/resources, and all runtime Java
1299-
dependencies in one drop-on-classpath JAR, runnable via the OpenAiCompatServer
1300-
Main-Class (set below) to start the OpenAI-compatible HTTP server. Off by
1299+
dependencies in one drop-on-classpath JAR, with NativeServer as the fat-jar
1300+
Main-Class (set below) to start the full native llama.cpp server (embedded WebUI,
1301+
every llama-server flag forwarded); OpenAiCompatServer stays runnable via
1302+
`java -cp <jar> net.ladenthin.llama.server.OpenAiCompatServer`. Off by
13011303
default; the CI `package` job activates it so the uber JAR rides along in the
13021304
`llama-jars` upload-artifact bundle. Documented in CLAUDE.md "Build Commands"
13031305
as `mvn -P assembly package`.
@@ -1314,7 +1316,7 @@ SPDX-License-Identifier: MIT
13141316
</descriptorRefs>
13151317
<archive>
13161318
<manifest>
1317-
<mainClass>net.ladenthin.llama.server.OpenAiCompatServer</mainClass>
1319+
<mainClass>net.ladenthin.llama.server.NativeServer</mainClass>
13181320
</manifest>
13191321
</archive>
13201322
</configuration>
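
Which class `java -jar` will launch is determined by the `Main-Class` attribute that the assembly plugin writes into the jar manifest. One way to confirm the switch after a build is to read that attribute back. The following is a minimal sketch using only `java.util.jar`; the class `ManifestMainClass` and the throwaway demo jar are hypothetical, and the path of the real fat jar has to be supplied as an argument.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.jar.Attributes;
import java.util.jar.JarFile;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

/** Hypothetical helper (not in the repository): reads the Main-Class of a jar's manifest. */
final class ManifestMainClass {

    /** Returns the manifest's Main-Class value, or null if the jar has no manifest or no such entry. */
    static String mainClassOf(Path jar) {
        try (JarFile jarFile = new JarFile(jar.toFile())) {
            Manifest manifest = jarFile.getManifest();
            return manifest == null
                    ? null
                    : manifest.getMainAttributes().getValue(Attributes.Name.MAIN_CLASS);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes a temporary jar containing only a manifest that names mainClass (demo only). */
    static Path writeDemoJar(String mainClass) {
        Manifest manifest = new Manifest();
        // Manifest-Version must be present, otherwise the main attributes are not written.
        manifest.getMainAttributes().put(Attributes.Name.MANIFEST_VERSION, "1.0");
        manifest.getMainAttributes().put(Attributes.Name.MAIN_CLASS, mainClass);
        try {
            Path jar = Files.createTempFile("demo", ".jar");
            try (JarOutputStream out = new JarOutputStream(Files.newOutputStream(jar), manifest)) {
                // No further entries: the constructor already wrote META-INF/MANIFEST.MF.
            }
            jar.toFile().deleteOnExit();
            return jar;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // With no argument, fall back to a generated demo jar so the sketch runs anywhere.
        Path jar = args.length > 0
                ? Path.of(args[0])
                : writeDemoJar("net.ladenthin.llama.server.NativeServer");
        System.out.println(mainClassOf(jar));
    }
}
```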

llama/src/main/java/net/ladenthin/llama/server/NativeServer.java

Lines changed: 38 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -180,6 +180,44 @@ public void close() {
180180
}
181181
}
182182

183+
/**
184+
* Fat-jar entry point (the assembly JAR's {@code Main-Class}): starts the full native llama.cpp
185+
* server — WebUI included — forwarding every argument to it verbatim, and blocks until the
186+
* server exits or the JVM is asked to shut down (Ctrl-C / SIGTERM), stopping the server cleanly
187+
* on the way out.
188+
*
189+
* <p>This is the default runnable server. The Java-transport {@link OpenAiCompatServer} remains
190+
* available via its own {@code main} — run it explicitly with
191+
* {@code java -cp <jar> net.ladenthin.llama.server.OpenAiCompatServer …}.</p>
192+
*
193+
* @param args the llama-server command-line arguments, forwarded verbatim (e.g. {@code -m
194+
* model.gguf --host 127.0.0.1 --port 8080}); pass {@code --help} for the full
195+
* llama-server option list
196+
* @throws InterruptedException if interrupted while waiting for the server to exit
197+
*/
198+
public static void main(String[] args) throws InterruptedException {
199+
final NativeServer server = new NativeServer(args);
200+
final AtomicBoolean stoppedByHook = new AtomicBoolean(false);
201+
// Graceful Ctrl-C / SIGTERM: the embedded server installs no signal handlers of its own
202+
// (see patches/0006), so the JVM-level shutdown hook is what stops it before exit.
203+
Runtime.getRuntime()
204+
.addShutdownHook(new Thread(
205+
() -> {
206+
stoppedByHook.set(true);
207+
server.close();
208+
},
209+
"jllama-native-server-shutdown"));
210+
server.start();
211+
// Keep the JVM alive until the native worker exits — on its own (e.g. a fatal startup/model
212+
// error that llama_server has already logged) or because the shutdown hook stopped it.
213+
while (server.isRunning()) {
214+
Thread.sleep(200L);
215+
}
216+
if (!stoppedByHook.get()) {
217+
server.close();
218+
}
219+
}
220+
183221
/**
184222
* Starts the native server on a worker thread and returns an opaque handle. The argv is
185223
* forwarded verbatim (with a synthetic {@code argv[0]}).
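
The lifecycle in the new `main` (register a shutdown hook, start, poll until the worker stops, close unless the hook already did) can be tried without `libjllama`. The following is a minimal sketch of that pattern in which a hypothetical `StubServer` stands in for `NativeServer`; only the `start`/`isRunning`/`close` calls are taken from the diff, and everything else (class names, `exitOnItsOwn`, the close counter) is invented for illustration.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Illustration only: mirrors the hook-then-poll shape of the new main with a stub server. */
final class ShutdownHookSketch {

    /** Hypothetical stand-in for NativeServer; counts close() calls so the flow can be observed. */
    static final class StubServer implements AutoCloseable {
        private final AtomicBoolean running = new AtomicBoolean(false);
        private final AtomicInteger closeCalls = new AtomicInteger();

        void start() { running.set(true); }
        boolean isRunning() { return running.get(); }
        /** Simulates the native worker exiting by itself (e.g. after a startup error). */
        void exitOnItsOwn() { running.set(false); }
        @Override public void close() { closeCalls.incrementAndGet(); running.set(false); }
        int closeCalls() { return closeCalls.get(); }
    }

    /** Builds the hook thread: record that the hook ran, then stop the server. */
    static Thread shutdownHook(StubServer server, AtomicBoolean stoppedByHook) {
        return new Thread(() -> {
            stoppedByHook.set(true);
            server.close();
        }, "sketch-shutdown");
    }

    /** Polls until the server stops, then closes it unless the hook already did. */
    static void awaitExit(StubServer server, AtomicBoolean stoppedByHook, long pollMillis) {
        try {
            while (server.isRunning()) {
                Thread.sleep(pollMillis);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        if (!stoppedByHook.get()) {
            server.close();
        }
    }

    public static void main(String[] args) {
        StubServer server = new StubServer();
        AtomicBoolean stoppedByHook = new AtomicBoolean(false);
        Runtime.getRuntime().addShutdownHook(shutdownHook(server, stoppedByHook));
        server.start();
        // Simulate the worker exiting by itself shortly after start.
        new Thread(() -> {
            try {
                Thread.sleep(300L);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            server.exitOnItsOwn();
        }).start();
        awaitExit(server, stoppedByHook, 50L);
        System.out.println("close calls before JVM exit: " + server.closeCalls());
    }
}
```

In this sketch the registered hook still runs at JVM exit and calls `close()` a second time when the server stopped by itself, so the stub's `close()` is written to be safe to call repeatedly.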
