server: simplify ServerLauncher dispatch to one primitive; rename flag to --openai-compat

claude · claude · commit b2745209361a · 2026-07-03T09:10:14.000Z
Per review: collapse the dispatch to a single pure helper. withoutFlag(args, flag) strips the selector and returns a (possibly shorter) array; main() selects the mode purely by whether stripping shortened the list (the flag was present iff the result is shorter), so the separate selectsOpenAiCompat method, with its baked-in constant, is gone. The helper takes the flag as a parameter, so it is general and testable independently of the flag's meaning.

Also rename the selector --open-ai-compat -> --openai-compat ("OpenAI" is one word, matching the brand and the codebase's oaicompat / OpenAiCompatServer); constant OPEN_AI_COMPAT_FLAG -> OPENAI_COMPAT_FLAG.

Tests rewritten around withoutFlag: the length-change selection signal (shorter iff present, same length iff absent, position-independent) plus stripping behaviour (strips all occurrences, preserves order, no-op when absent, empty input stays empty). All 7 tests pass.

Verified at runtime: `ServerLauncher --openai-compat -m model --port 8974` routes to OpenAiCompatServer (/ -> invalid_request_error) and shuts down cleanly; without the flag it routes to NativeServer. README + CLAUDE.md + pom updated; spotless + javadoc clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
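
For readers skimming the message, the length-change dispatch described above can be sketched as below. This is only an illustration under stated assumptions: the body of withoutFlag is not shown in this diff, so the loop here is a guess at one straightforward implementation, and `DispatchSketch` and `selectMode` are hypothetical names that stand in for `ServerLauncher` and its `main` without starting any server.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for ServerLauncher; no server is started here.
final class DispatchSketch {

    // Mirrors the renamed selector constant from this commit.
    static final String FLAG = "--openai-compat";

    private DispatchSketch() {}

    // Assumed implementation: copy every argument that is not the flag,
    // preserving the order of the remaining arguments.
    static String[] withoutFlag(String[] args, String flag) {
        List<String> kept = new ArrayList<>(args.length);
        for (String arg : args) {
            if (!flag.equals(arg)) {
                kept.add(arg);
            }
        }
        return kept.toArray(new String[0]);
    }

    // Hypothetical helper: reports the mode main() would pick.
    // The flag was present exactly when stripping shortened the array.
    static String selectMode(String[] args) {
        String[] forwarded = withoutFlag(args, FLAG);
        return forwarded.length != args.length ? "openai-compat" : "native";
    }

    public static void main(String[] args) {
        System.out.println(selectMode(args) + ": " + String.join(" ", withoutFlag(args, FLAG)));
    }
}
```

Because removal can only shorten the array, comparing lengths is equivalent to a separate "contains" scan, which is why one helper suffices for both stripping and selection.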
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -851,10 +851,10 @@ If the local check passes (`BUILD SUCCESS`), the `mvn package` job in
 
 ### Two server modes (`OpenAiCompatServer` vs `NativeServer`)
 
-The library exposes **two** ways to serve a model over HTTP, on two different transports. The fat jar's `Main-Class` is `server.ServerLauncher`, a tiny dispatcher: it runs `OpenAiCompatServer` when `--open-ai-compat` is present (that marker is stripped, the rest forwarded) and the default `NativeServer` otherwise. Both mains are also runnable directly by class name via `java -cp`. The two modes:
+The library exposes **two** ways to serve a model over HTTP, on two different transports. The fat jar's `Main-Class` is `server.ServerLauncher`, a tiny dispatcher: it runs `OpenAiCompatServer` when `--openai-compat` is present (that marker is stripped, the rest forwarded) and the default `NativeServer` otherwise. Both mains are also runnable directly by class name via `java -cp`. The two modes:
 
 1. **`server.OpenAiCompatServer` (Java transport).** OpenAI/Ollama/Anthropic-compatible JSON API on the JDK's `com.sun.net.httpserver`, driving the compiled server *core* over JNI. Embeddable, no extra dependency, and it can share/reuse a `LlamaModel`. It serves **no** static assets — its `/` route is a 404, so **no WebUI**. It has its own `main` (run via `java -cp <jar> net.ladenthin.llama.server.OpenAiCompatServer …`); its CLI (`OpenAiServerCli`) maps a curated flag subset (`-m/-c/-b/-ub/-ngl/-t/-tb/-ctk/-ctv/--jinja/--chat-template-kwargs/--host/--port/--parallel/--mmproj/--api-key/--embedding/--reranking`).
-2. **`server.NativeServer` (native transport) — the default fat-jar server (when `--open-ai-compat` is absent).** Runs the **full upstream `llama_server`** (via `patches/0006` + `native_server.cpp`) inside `libjllama`, forwarding the raw llama-server argv verbatim — so **every** llama-server flag works and the **embedded WebUI is served** (when the assets are compiled in; CI's released jars have them, local `cmake` builds use the empty-asset stub). It is an **independent lifecycle** (loads its own model from the argv, like `llama-server.exe`; owns the process's llama backend + stderr logging while running), **single-instance per process** (upstream keeps shutdown state in file-scope globals), and **not available on Android** (the `subprocess.h` guard). Reusing an already-loaded `LlamaModel`'s context is a documented TODO. `libjllama` loading anywhere a JVM runs is what makes this "no separate `llama-server.exe`" possible.
+2. **`server.NativeServer` (native transport) — the default fat-jar server (when `--openai-compat` is absent).** Runs the **full upstream `llama_server`** (via `patches/0006` + `native_server.cpp`) inside `libjllama`, forwarding the raw llama-server argv verbatim — so **every** llama-server flag works and the **embedded WebUI is served** (when the assets are compiled in; CI's released jars have them, local `cmake` builds use the empty-asset stub). It is an **independent lifecycle** (loads its own model from the argv, like `llama-server.exe`; owns the process's llama backend + stderr logging while running), **single-instance per process** (upstream keeps shutdown state in file-scope globals), and **not available on Android** (the `subprocess.h` guard). Reusing an already-loaded `LlamaModel`'s context is a documented TODO. `libjllama` loading anywhere a JVM runs is what makes this "no separate `llama-server.exe`" possible.
 
 ### Native Helper Architecture
 
diff --git a/README.md b/README.md
@@ -107,7 +107,7 @@ Inference of Meta's LLaMA model (and others) in pure C/C++.
 - **Infilling** (fill-in-the-middle) for code models.
 - **Tokenize / detokenize** and **JSON-schema → grammar** conversion.
 - **Raw JSON endpoint handlers** mirroring the upstream llama.cpp HTTP server (`/completions`, `/v1/completions`, `/embeddings`, `/infill`, `/tokenize`, `/detokenize`).
-- **Two runnable HTTP server modes, one fat-jar entry.** The fat jar's `Main-Class` is `ServerLauncher`, which dispatches on the `--open-ai-compat` flag. Without it, `java -jar …-jar-with-dependencies.jar -m model.gguf --port 8080` runs the full upstream llama.cpp server (embedded **WebUI**, every llama-server flag forwarded) hosted inside `libjllama` over JNI — no separate `llama-server.exe`. With it, `java -jar … --open-ai-compat --model model.gguf --port 8080` runs the Java-transport, zero-extra-dependency **OpenAI-compatible** server (`OpenAiCompatServer`, streaming SSE) instead. Both are also runnable directly by class name via `java -cp … net.ladenthin.llama.server.{NativeServer,OpenAiCompatServer}`.
+- **Two runnable HTTP server modes, one fat-jar entry.** The fat jar's `Main-Class` is `ServerLauncher`, which dispatches on the `--openai-compat` flag. Without it, `java -jar …-jar-with-dependencies.jar -m model.gguf --port 8080` runs the full upstream llama.cpp server (embedded **WebUI**, every llama-server flag forwarded) hosted inside `libjllama` over JNI — no separate `llama-server.exe`. With it, `java -jar … --openai-compat --model model.gguf --port 8080` runs the Java-transport, zero-extra-dependency **OpenAI-compatible** server (`OpenAiCompatServer`, streaming SSE) instead. Both are also runnable directly by class name via `java -cp … net.ladenthin.llama.server.{NativeServer,OpenAiCompatServer}`.
 - **Model metadata** access (`getModelMeta()`) and **server management** (metrics, slot save/restore, runtime thread reconfiguration).
 - Pre-built native binaries for Linux (x86-64, aarch64), macOS (x86-64, arm64), and Windows (x86-64, x86); CUDA, Metal, and Vulkan supported via local build.
 
@@ -649,12 +649,12 @@ try (LlamaModel model = new LlamaModel(modelParams);
 ```
 
 …or run it standalone. The fat jar's `Main-Class` is the `ServerLauncher` dispatcher, so add
-`--open-ai-compat` to select this Java server (the launcher strips that flag and forwards the rest);
+`--openai-compat` to select this Java server (the launcher strips that flag and forwards the rest);
 or name the class explicitly via `-cp`:
 
 ```bash
-# fat jar (bundles the native lib + Java deps) — select the Java server with --open-ai-compat
-java -jar target/llama-<version>-jar-with-dependencies.jar --open-ai-compat \
+# fat jar (bundles the native lib + Java deps) — select the Java server with --openai-compat
+java -jar target/llama-<version>-jar-with-dependencies.jar --openai-compat \
     --model models/Qwen3-0.6B-Q4_K_M.gguf --host 0.0.0.0 --port 8080 --n-gpu-layers 99
 
 # or name the class explicitly (fat jar or plain library jar)
@@ -719,7 +719,7 @@ the **full upstream llama.cpp server, including its bundled Svelte WebUI**, use
 `net.ladenthin.llama.server.NativeServer`. It runs the real `llama_server` inside `libjllama` over
 JNI — no separate `llama-server.exe` — and **forwards the raw llama-server arguments verbatim**, so
 every flag works exactly as it does for the standalone binary. The fat jar runs it **by default**
-(when `--open-ai-compat` is absent), forwarding its args to the native server (pass `--help` for the
+(when `--openai-compat` is absent), forwarding its args to the native server (pass `--help` for the
 full llama-server option list):
 
 ```bash
diff --git a/llama/pom.xml b/llama/pom.xml
@@ -1297,7 +1297,7 @@ SPDX-License-Identifier: MIT
 				Builds the fat jar-with-dependencies uber JAR: the library classes, the
 				default-platform native libs from src/main/resources, and all runtime Java
 				dependencies in one drop-on-classpath JAR, with ServerLauncher as the fat-jar
-				Main-Class (set below), which dispatches on an `open-ai-compat` selector flag: with it, runs
+				Main-Class (set below), which dispatches on an `openai-compat` selector flag: with it, runs
 					OpenAiCompatServer (Java OpenAI API); without it, the default NativeServer (native
 					server, embedded WebUI, all flags forwarded). Both mains stay runnable by class name via `java -cp <jar> …`. Off by
 				default; the CI `package` job activates it so the uber JAR rides along in the
diff --git a/llama/src/main/java/net/ladenthin/llama/server/ServerLauncher.java b/llama/src/main/java/net/ladenthin/llama/server/ServerLauncher.java
@@ -9,63 +9,52 @@
 
 /**
  * Fat-jar entry point that dispatches to one of the two server modes based on a single selector
- * flag. With {@value #OPEN_AI_COMPAT_FLAG} present it runs {@link OpenAiCompatServer} (the
+ * flag. With {@value #OPENAI_COMPAT_FLAG} present it runs {@link OpenAiCompatServer} (the
  * Java-transport, OpenAI-compatible JSON API); without it, {@link NativeServer} (the full native
  * llama.cpp server with embedded WebUI, the default).
  *
- * <p>Every other argument is forwarded verbatim to the chosen server; the {@value
- * #OPEN_AI_COMPAT_FLAG} marker itself is stripped so it never reaches either parser (it is not a
- * llama.cpp flag, and {@code llama_server} rejects unknown flags).</p>
+ * <p>The dispatch uses a single primitive, {@link #withoutFlag(String[], String)}: it strips the
+ * selector from the arguments (the flag is not a llama.cpp flag, and {@code llama_server} rejects
+ * unknown flags), and the mode is chosen purely by whether that shortened the list — present iff the
+ * result is smaller. Every other argument is forwarded verbatim.</p>
  *
  * <p><strong>Flag sets differ.</strong> {@link NativeServer} forwards <em>every</em> llama-server
  * flag to {@code llama_server}, whereas {@link OpenAiCompatServer}'s CLI ({@link OpenAiServerCli})
  * accepts a curated subset and rejects unknown flags — so native-only flags (e.g. {@code --ui},
- * {@code -fa}) cannot be combined with {@value #OPEN_AI_COMPAT_FLAG}.</p>
+ * {@code -fa}) cannot be combined with {@value #OPENAI_COMPAT_FLAG}.</p>
  *
  * <p>Both underlying mains remain directly runnable by class name via {@code java -cp}; this
  * launcher is purely a convenience so a single {@code java -jar} covers both.</p>
  */
 public final class ServerLauncher {
 
     /** Selector flag: when present, run {@link OpenAiCompatServer} instead of the default {@link NativeServer}. */
-    public static final String OPEN_AI_COMPAT_FLAG = "--open-ai-compat";
+    public static final String OPENAI_COMPAT_FLAG = "--openai-compat";
 
     private ServerLauncher() {}
 
     /**
-     * Dispatches to {@link OpenAiCompatServer#main(String[])} when {@value #OPEN_AI_COMPAT_FLAG} is
-     * present (with that marker removed from the arguments), otherwise to
-     * {@link NativeServer#main(String[])} with all arguments forwarded unchanged.
+     * Dispatches to {@link OpenAiCompatServer#main(String[])} when {@value #OPENAI_COMPAT_FLAG} is
+     * present (with that marker removed), otherwise to {@link NativeServer#main(String[])} with all
+     * arguments forwarded unchanged. Selection is derived from whether stripping the flag shortened
+     * the argument list.
      *
      * @param args the process arguments
      * @throws Exception if the selected server's {@code main} throws (it blocks until shutdown)
      */
     public static void main(String[] args) throws Exception {
-        if (selectsOpenAiCompat(args)) {
-            OpenAiCompatServer.main(withoutFlag(args, OPEN_AI_COMPAT_FLAG));
+        final String[] forwarded = withoutFlag(args, OPENAI_COMPAT_FLAG);
+        if (forwarded.length != args.length) {
+            OpenAiCompatServer.main(forwarded);
         } else {
             NativeServer.main(args);
         }
     }
 
-    /**
-     * Whether the arguments request the OpenAI-compatible server via {@value #OPEN_AI_COMPAT_FLAG}.
-     *
-     * @param args the process arguments
-     * @return {@code true} if the selector flag is present
-     */
-    static boolean selectsOpenAiCompat(String[] args) {
-        for (final String arg : args) {
-            if (OPEN_AI_COMPAT_FLAG.equals(arg)) {
-                return true;
-            }
-        }
-        return false;
-    }
-
     /**
      * Returns a copy of {@code args} with every occurrence of {@code flag} removed, preserving the
-     * order of the remaining arguments.
+     * order of the remaining arguments. The result is shorter than {@code args} exactly when
+     * {@code flag} was present — which is how {@link #main(String[])} selects the server mode.
      *
      * @param args the arguments
      * @param flag the flag token to strip
diff --git a/llama/src/test/java/net/ladenthin/llama/server/ServerLauncherTest.java b/llama/src/test/java/net/ladenthin/llama/server/ServerLauncherTest.java
@@ -12,50 +12,56 @@
 import org.junit.jupiter.api.Test;
 
 /**
- * Pure-Java unit tests for {@link ServerLauncher}'s dispatch logic (selector detection + flag
- * stripping). No server is started and no native library is required.
+ * Pure-Java unit tests for {@link ServerLauncher}'s single dispatch primitive,
+ * {@link ServerLauncher#withoutFlag(String[], String)}. Selection is derived from the length change
+ * (result shorter iff the flag was present), so these tests cover both the stripping behaviour and
+ * that selection signal. No server is started and no native library is required.
  */
 public class ServerLauncherTest {
 
+    private static final String FLAG = ServerLauncher.OPENAI_COMPAT_FLAG;
+
+    // --- selection signal: shorter iff the flag was present ---
+
     @Test
-    public void selectsNativeByDefault() {
-        assertThat(ServerLauncher.selectsOpenAiCompat(new String[] {"-m", "m.gguf", "--port", "8080"}), is(false));
+    public void resultIsShorterWhenFlagPresent() {
+        String[] in = {FLAG, "-m", "m.gguf", "--port", "8080"};
+        assertThat(ServerLauncher.withoutFlag(in, FLAG).length < in.length, is(true));
     }
 
     @Test
-    public void selectsOpenAiCompatWhenFlagPresent() {
-        assertThat(ServerLauncher.selectsOpenAiCompat(new String[] {"--open-ai-compat", "-m", "m.gguf"}), is(true));
+    public void resultKeepsLengthWhenFlagAbsent() {
+        String[] in = {"-m", "m.gguf", "--port", "8080"};
+        assertThat(ServerLauncher.withoutFlag(in, FLAG).length == in.length, is(true));
     }
 
     @Test
-    public void selectorFlagPositionDoesNotMatter() {
-        assertThat(ServerLauncher.selectsOpenAiCompat(new String[] {"-m", "m.gguf", "--open-ai-compat"}), is(true));
+    public void flagPositionDoesNotMatter() {
+        String[] in = {"-m", "m.gguf", FLAG};
+        assertThat(ServerLauncher.withoutFlag(in, FLAG).length < in.length, is(true));
     }
 
+    // --- stripping behaviour ---
+
     @Test
-    public void withoutFlagStripsTheSelectorAndPreservesTheRest() {
-        String[] out = ServerLauncher.withoutFlag(
-                new String[] {"--open-ai-compat", "-m", "m.gguf", "--port", "8080"},
-                ServerLauncher.OPEN_AI_COMPAT_FLAG);
+    public void stripsTheSelectorAndPreservesTheRest() {
+        String[] out = ServerLauncher.withoutFlag(new String[] {FLAG, "-m", "m.gguf", "--port", "8080"}, FLAG);
         assertThat(out, arrayContaining("-m", "m.gguf", "--port", "8080"));
     }
 
     @Test
-    public void withoutFlagRemovesEveryOccurrence() {
-        String[] out = ServerLauncher.withoutFlag(
-                new String[] {"--open-ai-compat", "-m", "m.gguf", "--open-ai-compat"},
-                ServerLauncher.OPEN_AI_COMPAT_FLAG);
+    public void removesEveryOccurrence() {
+        String[] out = ServerLauncher.withoutFlag(new String[] {FLAG, "-m", "m.gguf", FLAG}, FLAG);
         assertThat(out, arrayContaining("-m", "m.gguf"));
     }
 
     @Test
-    public void withoutFlagIsNoOpWhenAbsent() {
-        String[] in = new String[] {"-m", "m.gguf"};
-        assertThat(ServerLauncher.withoutFlag(in, ServerLauncher.OPEN_AI_COMPAT_FLAG), arrayContaining("-m", "m.gguf"));
+    public void isNoOpWhenAbsent() {
+        assertThat(ServerLauncher.withoutFlag(new String[] {"-m", "m.gguf"}, FLAG), arrayContaining("-m", "m.gguf"));
     }
 
     @Test
-    public void withoutFlagOnEmptyArgsIsEmpty() {
-        assertThat(ServerLauncher.withoutFlag(new String[] {}, ServerLauncher.OPEN_AI_COMPAT_FLAG), is(emptyArray()));
+    public void emptyArgsStayEmpty() {
+        assertThat(ServerLauncher.withoutFlag(new String[] {}, FLAG), is(emptyArray()));
     }
 }