Commit 8a4b1ff

Namespace the ServerLauncher selector flag as --jllama-openai-compat
Rename the fat-jar dispatch selector from --openai-compat to --jllama-openai-compat so it can never collide with a current or future llama.cpp / llama-server flag: upstream owns the --* space, this launcher owns --jllama-*. The jllama prefix is the project's native-library name, which upstream will never use, and it stays a lowercase-hyphen CLI token (not the verbose FQN, not the class name). ServerLauncher strips it before forwarding, so it never reaches llama_server (which rejects unknown flags).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
1 parent b274520 commit 8a4b1ff
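
The dispatch rule described in the commit message can be sketched roughly as follows. This is an illustrative Java sketch, not the project's `ServerLauncher`: the class name `LauncherDispatchSketch` and its two helper methods are hypothetical, and it assumes the selector may appear at any position and that every occurrence is removed, which the commit does not spell out. It only reports which server would be chosen instead of invoking `OpenAiCompatServer` or `NativeServer`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of the selector dispatch; not the project's actual ServerLauncher. */
final class LauncherDispatchSketch {

    /** Same literal as ServerLauncher.OPENAI_COMPAT_FLAG after this commit. */
    static final String OPENAI_COMPAT_FLAG = "--jllama-openai-compat";

    private LauncherDispatchSketch() {}

    /** Returns true when the selector flag appears anywhere in the arguments (assumed behavior). */
    static boolean selectsOpenAiCompat(String[] args) {
        return Arrays.asList(args).contains(OPENAI_COMPAT_FLAG);
    }

    /** Returns a copy of args without the selector flag, preserving the order of the rest. */
    static String[] stripSelector(String[] args) {
        List<String> forwarded = new ArrayList<>(args.length);
        for (String arg : args) {
            if (!OPENAI_COMPAT_FLAG.equals(arg)) {
                forwarded.add(arg);
            }
        }
        return forwarded.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String target = selectsOpenAiCompat(args) ? "OpenAiCompatServer" : "NativeServer";
        // A real launcher would call the chosen server's main here; the sketch only prints the decision.
        System.out.println(target + " <- " + Arrays.toString(stripSelector(args)));
    }
}
```

Because the selector is removed before anything is forwarded, the remaining arguments contain only flags the target server is expected to understand. Note that the old `--openai-compat` spelling is no longer treated as a selector in this sketch, so it would be forwarded untouched.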

4 files changed

Lines changed: 18 additions & 10 deletions

CLAUDE.md

Lines changed: 2 additions & 2 deletions
@@ -851,10 +851,10 @@ If the local check passes (`BUILD SUCCESS`), the `mvn package` job in

 ### Two server modes (`OpenAiCompatServer` vs `NativeServer`)

-The library exposes **two** ways to serve a model over HTTP, on two different transports. The fat jar's `Main-Class` is `server.ServerLauncher`, a tiny dispatcher: it runs `OpenAiCompatServer` when `--openai-compat` is present (that marker is stripped, the rest forwarded) and the default `NativeServer` otherwise. Both mains are also runnable directly by class name via `java -cp`. The two modes:
+The library exposes **two** ways to serve a model over HTTP, on two different transports. The fat jar's `Main-Class` is `server.ServerLauncher`, a tiny dispatcher: it runs `OpenAiCompatServer` when `--jllama-openai-compat` is present (that marker is stripped, the rest forwarded) and the default `NativeServer` otherwise. Both mains are also runnable directly by class name via `java -cp`. The two modes:

 1. **`server.OpenAiCompatServer` (Java transport).** OpenAI/Ollama/Anthropic-compatible JSON API on the JDK's `com.sun.net.httpserver`, driving the compiled server *core* over JNI. Embeddable, no extra dependency, and it can share/reuse a `LlamaModel`. It serves **no** static assets — its `/` route is a 404, so **no WebUI**. It has its own `main` (run via `java -cp <jar> net.ladenthin.llama.server.OpenAiCompatServer …`); its CLI (`OpenAiServerCli`) maps a curated flag subset (`-m/-c/-b/-ub/-ngl/-t/-tb/-ctk/-ctv/--jinja/--chat-template-kwargs/--host/--port/--parallel/--mmproj/--api-key/--embedding/--reranking`).
-2. **`server.NativeServer` (native transport) — the default fat-jar server (when `--openai-compat` is absent).** Runs the **full upstream `llama_server`** (via `patches/0006` + `native_server.cpp`) inside `libjllama`, forwarding the raw llama-server argv verbatim — so **every** llama-server flag works and the **embedded WebUI is served** (when the assets are compiled in; CI's released jars have them, local `cmake` builds use the empty-asset stub). It is an **independent lifecycle** (loads its own model from the argv, like `llama-server.exe`; owns the process's llama backend + stderr logging while running), **single-instance per process** (upstream keeps shutdown state in file-scope globals), and **not available on Android** (the `subprocess.h` guard). Reusing an already-loaded `LlamaModel`'s context is a documented TODO. `libjllama` loading anywhere a JVM runs is what makes this "no separate `llama-server.exe`" possible.
+2. **`server.NativeServer` (native transport) — the default fat-jar server (when `--jllama-openai-compat` is absent).** Runs the **full upstream `llama_server`** (via `patches/0006` + `native_server.cpp`) inside `libjllama`, forwarding the raw llama-server argv verbatim — so **every** llama-server flag works and the **embedded WebUI is served** (when the assets are compiled in; CI's released jars have them, local `cmake` builds use the empty-asset stub). It is an **independent lifecycle** (loads its own model from the argv, like `llama-server.exe`; owns the process's llama backend + stderr logging while running), **single-instance per process** (upstream keeps shutdown state in file-scope globals), and **not available on Android** (the `subprocess.h` guard). Reusing an already-loaded `LlamaModel`'s context is a documented TODO. `libjllama` loading anywhere a JVM runs is what makes this "no separate `llama-server.exe`" possible.

 ### Native Helper Architecture

README.md

Lines changed: 5 additions & 5 deletions
@@ -107,7 +107,7 @@ Inference of Meta's LLaMA model (and others) in pure C/C++.
 - **Infilling** (fill-in-the-middle) for code models.
 - **Tokenize / detokenize** and **JSON-schema → grammar** conversion.
 - **Raw JSON endpoint handlers** mirroring the upstream llama.cpp HTTP server (`/completions`, `/v1/completions`, `/embeddings`, `/infill`, `/tokenize`, `/detokenize`).
-- **Two runnable HTTP server modes, one fat-jar entry.** The fat jar's `Main-Class` is `ServerLauncher`, which dispatches on the `--openai-compat` flag. Without it, `java -jar …-jar-with-dependencies.jar -m model.gguf --port 8080` runs the full upstream llama.cpp server (embedded **WebUI**, every llama-server flag forwarded) hosted inside `libjllama` over JNI — no separate `llama-server.exe`. With it, `java -jar … --openai-compat --model model.gguf --port 8080` runs the Java-transport, zero-extra-dependency **OpenAI-compatible** server (`OpenAiCompatServer`, streaming SSE) instead. Both are also runnable directly by class name via `java -cp … net.ladenthin.llama.server.{NativeServer,OpenAiCompatServer}`.
+- **Two runnable HTTP server modes, one fat-jar entry.** The fat jar's `Main-Class` is `ServerLauncher`, which dispatches on the `--jllama-openai-compat` flag. Without it, `java -jar …-jar-with-dependencies.jar -m model.gguf --port 8080` runs the full upstream llama.cpp server (embedded **WebUI**, every llama-server flag forwarded) hosted inside `libjllama` over JNI — no separate `llama-server.exe`. With it, `java -jar … --jllama-openai-compat --model model.gguf --port 8080` runs the Java-transport, zero-extra-dependency **OpenAI-compatible** server (`OpenAiCompatServer`, streaming SSE) instead. Both are also runnable directly by class name via `java -cp … net.ladenthin.llama.server.{NativeServer,OpenAiCompatServer}`.
 - **Model metadata** access (`getModelMeta()`) and **server management** (metrics, slot save/restore, runtime thread reconfiguration).
 - Pre-built native binaries for Linux (x86-64, aarch64), macOS (x86-64, arm64), and Windows (x86-64, x86); CUDA, Metal, and Vulkan supported via local build.

@@ -649,12 +649,12 @@ try (LlamaModel model = new LlamaModel(modelParams);
 ```

 …or run it standalone. The fat jar's `Main-Class` is the `ServerLauncher` dispatcher, so add
-`--openai-compat` to select this Java server (the launcher strips that flag and forwards the rest);
+`--jllama-openai-compat` to select this Java server (the launcher strips that flag and forwards the rest);
 or name the class explicitly via `-cp`:

 ```bash
-# fat jar (bundles the native lib + Java deps) — select the Java server with --openai-compat
-java -jar target/llama-<version>-jar-with-dependencies.jar --openai-compat \
+# fat jar (bundles the native lib + Java deps) — select the Java server with --jllama-openai-compat
+java -jar target/llama-<version>-jar-with-dependencies.jar --jllama-openai-compat \
 --model models/Qwen3-0.6B-Q4_K_M.gguf --host 0.0.0.0 --port 8080 --n-gpu-layers 99

 # or name the class explicitly (fat jar or plain library jar)
@@ -719,7 +719,7 @@ the **full upstream llama.cpp server, including its bundled Svelte WebUI**, use
 `net.ladenthin.llama.server.NativeServer`. It runs the real `llama_server` inside `libjllama` over
 JNI — no separate `llama-server.exe` — and **forwards the raw llama-server arguments verbatim**, so
 every flag works exactly as it does for the standalone binary. The fat jar runs it **by default**
-(when `--openai-compat` is absent), forwarding its args to the native server (pass `--help` for the
+(when `--jllama-openai-compat` is absent), forwarding its args to the native server (pass `--help` for the
 full llama-server option list):

 ```bash

llama/pom.xml

Lines changed: 1 addition & 1 deletion
@@ -1297,7 +1297,7 @@ SPDX-License-Identifier: MIT
 Builds the fat jar-with-dependencies uber JAR: the library classes, the
 default-platform native libs from src/main/resources, and all runtime Java
 dependencies in one drop-on-classpath JAR, with ServerLauncher as the fat-jar
-Main-Class (set below), which dispatches on an `openai-compat` selector flag: with it, runs
+Main-Class (set below), which dispatches on an `jllama-openai-compat` selector flag: with it, runs
 OpenAiCompatServer (Java OpenAI API); without it, the default NativeServer (native
 server, embedded WebUI, all flags forwarded). Both mains stay runnable by class name via `java -cp <jar> …`. Off by
 default; the CI `package` job activates it so the uber JAR rides along in the

llama/src/main/java/net/ladenthin/llama/server/ServerLauncher.java

Lines changed: 10 additions & 2 deletions
@@ -28,8 +28,16 @@
  */
 public final class ServerLauncher {

-    /** Selector flag: when present, run {@link OpenAiCompatServer} instead of the default {@link NativeServer}. */
-    public static final String OPENAI_COMPAT_FLAG = "--openai-compat";
+    /**
+     * Selector flag: when present, run {@link OpenAiCompatServer} instead of the default
+     * {@link NativeServer}.
+     *
+     * <p>Namespaced with the {@code jllama} prefix (this project's native-library name) so it can
+     * never collide with a current or future llama.cpp / llama-server flag — upstream owns the
+     * {@code --*} space, this launcher owns {@code --jllama-*}. The launcher strips it before
+     * forwarding, so it never reaches {@code llama_server} (which rejects unknown flags).</p>
+     */
+    public static final String OPENAI_COMPAT_FLAG = "--jllama-openai-compat";

     private ServerLauncher() {}
