Commit c288e43

server: add ServerLauncher — one fat-jar entry, pick mode via --open-ai-compat
Tiny dispatcher as the fat jar's Main-Class: with --open-ai-compat it runs OpenAiCompatServer (Java-transport OpenAI API), without it the default NativeServer (full native llama.cpp server + WebUI). The --open-ai-compat marker is stripped (it is not a llama.cpp flag); all other args are forwarded verbatim to the chosen server. Both underlying mains stay runnable directly by class name via java -cp.

Note: the two servers accept different flag sets. NativeServer forwards every llama-server flag, while OpenAiCompatServer's CLI accepts a curated subset and rejects unknown flags, so native-only flags can't be combined with --open-ai-compat.

Dispatch logic split into pure static helpers (selectsOpenAiCompat / withoutFlag) with 7 unit tests.

Verified at runtime: `ServerLauncher --open-ai-compat -m model --port 8973` starts the Java server (/ -> invalid_request_error, its handler), and without the flag starts NativeServer (/ -> native File Not Found); both shut down cleanly on SIGTERM.

pom Main-Class NativeServer -> ServerLauncher; README + CLAUDE.md updated. spotless + javadoc clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
1 parent 8a1a68f commit c288e43

5 files changed

Lines changed: 161 additions & 18 deletions

CLAUDE.md

Lines changed: 2 additions & 2 deletions
@@ -851,10 +851,10 @@ If the local check passes (`BUILD SUCCESS`), the `mvn package` job in

 ### Two server modes (`OpenAiCompatServer` vs `NativeServer`)

-The library exposes **two** ways to serve a model over HTTP, on two different transports:
+The library exposes **two** ways to serve a model over HTTP, on two different transports. The fat jar's `Main-Class` is `server.ServerLauncher`, a tiny dispatcher: it runs `OpenAiCompatServer` when `--open-ai-compat` is present (that marker is stripped, the rest forwarded) and the default `NativeServer` otherwise. Both mains are also runnable directly by class name via `java -cp`. The two modes:

 1. **`server.OpenAiCompatServer` (Java transport).** OpenAI/Ollama/Anthropic-compatible JSON API on the JDK's `com.sun.net.httpserver`, driving the compiled server *core* over JNI. Embeddable, no extra dependency, and it can share/reuse a `LlamaModel`. It serves **no** static assets — its `/` route is a 404, so **no WebUI**. It has its own `main` (run via `java -cp <jar> net.ladenthin.llama.server.OpenAiCompatServer …`); its CLI (`OpenAiServerCli`) maps a curated flag subset (`-m/-c/-b/-ub/-ngl/-t/-tb/-ctk/-ctv/--jinja/--chat-template-kwargs/--host/--port/--parallel/--mmproj/--api-key/--embedding/--reranking`).
-2. **`server.NativeServer` (native transport) — the fat-jar default `Main-Class`.** Runs the **full upstream `llama_server`** (via `patches/0006` + `native_server.cpp`) inside `libjllama`, forwarding the raw llama-server argv verbatim — so **every** llama-server flag works and the **embedded WebUI is served** (when the assets are compiled in; CI's released jars have them, local `cmake` builds use the empty-asset stub). It is an **independent lifecycle** (loads its own model from the argv, like `llama-server.exe`; owns the process's llama backend + stderr logging while running), **single-instance per process** (upstream keeps shutdown state in file-scope globals), and **not available on Android** (the `subprocess.h` guard). Reusing an already-loaded `LlamaModel`'s context is a documented TODO. `libjllama` loading anywhere a JVM runs is what makes this "no separate `llama-server.exe`" possible.
+2. **`server.NativeServer` (native transport) — the default fat-jar server (when `--open-ai-compat` is absent).** Runs the **full upstream `llama_server`** (via `patches/0006` + `native_server.cpp`) inside `libjllama`, forwarding the raw llama-server argv verbatim — so **every** llama-server flag works and the **embedded WebUI is served** (when the assets are compiled in; CI's released jars have them, local `cmake` builds use the empty-asset stub). It is an **independent lifecycle** (loads its own model from the argv, like `llama-server.exe`; owns the process's llama backend + stderr logging while running), **single-instance per process** (upstream keeps shutdown state in file-scope globals), and **not available on Android** (the `subprocess.h` guard). Reusing an already-loaded `LlamaModel`'s context is a documented TODO. `libjllama` loading anywhere a JVM runs is what makes this "no separate `llama-server.exe`" possible.

 ### Native Helper Architecture

README.md

Lines changed: 10 additions & 11 deletions
@@ -107,7 +107,7 @@ Inference of Meta's LLaMA model (and others) in pure C/C++.
 - **Infilling** (fill-in-the-middle) for code models.
 - **Tokenize / detokenize** and **JSON-schema → grammar** conversion.
 - **Raw JSON endpoint handlers** mirroring the upstream llama.cpp HTTP server (`/completions`, `/v1/completions`, `/embeddings`, `/infill`, `/tokenize`, `/detokenize`).
-- **Two runnable HTTP server modes.** The fat jar's default `Main-Class` is `NativeServer`, the full upstream llama.cpp server (embedded **WebUI**, every llama-server flag forwarded) hosted inside `libjllama` over JNI, no separate `llama-server.exe`: `java -jar …-jar-with-dependencies.jar -m model.gguf --port 8080`. The Java-transport, zero-extra-dependency **OpenAI-compatible** server (`OpenAiCompatServer`, streaming SSE) is also available: `java -cp …-jar-with-dependencies.jar net.ladenthin.llama.server.OpenAiCompatServer --model model.gguf --port 8080`.
+- **Two runnable HTTP server modes, one fat-jar entry.** The fat jar's `Main-Class` is `ServerLauncher`, which dispatches on the `--open-ai-compat` flag. Without it, `java -jar …-jar-with-dependencies.jar -m model.gguf --port 8080` runs the full upstream llama.cpp server (embedded **WebUI**, every llama-server flag forwarded) hosted inside `libjllama` over JNI, no separate `llama-server.exe`. With it, `java -jar … --open-ai-compat --model model.gguf --port 8080` runs the Java-transport, zero-extra-dependency **OpenAI-compatible** server (`OpenAiCompatServer`, streaming SSE) instead. Both are also runnable directly by class name via `java -cp … net.ladenthin.llama.server.{NativeServer,OpenAiCompatServer}`.
 - **Model metadata** access (`getModelMeta()`) and **server management** (metrics, slot save/restore, runtime thread reconfiguration).
 - Pre-built native binaries for Linux (x86-64, aarch64), macOS (x86-64, arm64), and Windows (x86-64, x86); CUDA, Metal, and Vulkan supported via local build.

@@ -648,17 +648,16 @@ try (LlamaModel model = new LlamaModel(modelParams);
 }
 ```

-…or run it standalone. It has its own `main`, launched by class name via `-cp` (the fat jar's
-default `java -jar` `Main-Class` is `NativeServer`, the native server below — so name
-`OpenAiCompatServer` explicitly to get this Java one):
+…or run it standalone. The fat jar's `Main-Class` is the `ServerLauncher` dispatcher, so add
+`--open-ai-compat` to select this Java server (the launcher strips that flag and forwards the rest);
+or name the class explicitly via `-cp`:

 ```bash
-# fat jar (bundles the native lib + Java deps) — name the class explicitly
-java -cp target/llama-<version>-jar-with-dependencies.jar \
-  net.ladenthin.llama.server.OpenAiCompatServer \
+# fat jar (bundles the native lib + Java deps) — select the Java server with --open-ai-compat
+java -jar target/llama-<version>-jar-with-dependencies.jar --open-ai-compat \
   --model models/Qwen3-0.6B-Q4_K_M.gguf --host 0.0.0.0 --port 8080 --n-gpu-layers 99

-# or the plain library jar
+# or name the class explicitly (fat jar or plain library jar)
 java -cp target/llama-<version>.jar net.ladenthin.llama.server.OpenAiCompatServer \
   --model models/model.gguf --port 8080 --model-id local-model
 ```
@@ -719,9 +718,9 @@ tool calling depends on the model's own tool-calling quality. Pass `--api-key` (
 the **full upstream llama.cpp server, including its bundled Svelte WebUI**, use
 `net.ladenthin.llama.server.NativeServer`. It runs the real `llama_server` inside `libjllama` over
 JNI — no separate `llama-server.exe` — and **forwards the raw llama-server arguments verbatim**, so
-every flag works exactly as it does for the standalone binary. It is the fat jar's default
-`Main-Class`, so `java -jar` just forwards its args to the native server (pass `--help` for the full
-llama-server option list):
+every flag works exactly as it does for the standalone binary. The fat jar runs it **by default**
+(when `--open-ai-compat` is absent), forwarding its args to the native server (pass `--help` for the
+full llama-server option list):

 ```bash
 java -jar target/llama-<version>-jar-with-dependencies.jar \

llama/pom.xml

Lines changed: 5 additions & 5 deletions
@@ -1296,10 +1296,10 @@ SPDX-License-Identifier: MIT
 <!--
   Builds the fat jar-with-dependencies uber JAR: the library classes, the
   default-platform native libs from src/main/resources, and all runtime Java
-  dependencies in one drop-on-classpath JAR, with NativeServer as the fat-jar
-  Main-Class (set below) to start the full native llama.cpp server (embedded WebUI,
-  every llama-server flag forwarded); OpenAiCompatServer stays runnable via
-  `java -cp <jar> net.ladenthin.llama.server.OpenAiCompatServer`. Off by
+  dependencies in one drop-on-classpath JAR, with ServerLauncher as the fat-jar
+  Main-Class (set below), which dispatches on an `open-ai-compat` selector flag: with it, runs
+  OpenAiCompatServer (Java OpenAI API); without it, the default NativeServer (native
+  server, embedded WebUI, all flags forwarded). Both mains stay runnable by class name via `java -cp <jar> `. Off by
   default; the CI `package` job activates it so the uber JAR rides along in the
   `llama-jars` upload-artifact bundle. Documented in CLAUDE.md "Build Commands"
   as `mvn -P assembly package`.
@@ -1316,7 +1316,7 @@ SPDX-License-Identifier: MIT
     </descriptorRefs>
     <archive>
         <manifest>
-            <mainClass>net.ladenthin.llama.server.NativeServer</mainClass>
+            <mainClass>net.ladenthin.llama.server.ServerLauncher</mainClass>
         </manifest>
     </archive>
 </configuration>
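
One way to confirm the `Main-Class` switch on a built jar is to read the manifest back with the JDK's `java.util.jar` API. The following is a minimal sketch and not part of this commit: the class `MainClassCheck` and both helper names are hypothetical, and the demo jar it writes stands in for the real `-jar-with-dependencies` artifact so that the example runs without a build.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.jar.Attributes;
import java.util.jar.JarFile;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

// Hypothetical helper, not part of the commit: reports a jar's Main-Class attribute.
public final class MainClassCheck {

    private MainClassCheck() {}

    /** Returns the Main-Class attribute of the jar's manifest, or null if there is none. */
    public static String mainClassOf(Path jar) throws IOException {
        try (JarFile jarFile = new JarFile(jar.toFile())) {
            Manifest manifest = jarFile.getManifest();
            return manifest == null
                    ? null
                    : manifest.getMainAttributes().getValue(Attributes.Name.MAIN_CLASS);
        }
    }

    /** Writes a throwaway jar containing only a manifest with the given Main-Class (demo only). */
    public static Path writeDemoJar(String mainClass) throws IOException {
        Manifest manifest = new Manifest();
        // Manifest-Version must be set, otherwise the main attributes are not written out.
        manifest.getMainAttributes().put(Attributes.Name.MANIFEST_VERSION, "1.0");
        manifest.getMainAttributes().put(Attributes.Name.MAIN_CLASS, mainClass);
        Path jar = Files.createTempFile("main-class-demo", ".jar");
        try (OutputStream out = Files.newOutputStream(jar);
                JarOutputStream jarOut = new JarOutputStream(out, manifest)) {
            // No entries needed: the constructor already wrote META-INF/MANIFEST.MF.
        }
        return jar;
    }

    public static void main(String[] args) throws IOException {
        // Pass the path of a real fat jar as the first argument, or fall back to the demo jar.
        Path jar = args.length > 0
                ? Paths.get(args[0])
                : writeDemoJar("net.ladenthin.llama.server.ServerLauncher");
        System.out.println(mainClassOf(jar));
    }
}
```

Run against the assembled fat jar, this should print whatever `<mainClass>` the assembly plugin wrote into the manifest.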
Lines changed: 83 additions & 0 deletions
@@ -0,0 +1,83 @@
+// SPDX-FileCopyrightText: 2026 Bernard Ladenthin <bernard.ladenthin@gmail.com>
+//
+// SPDX-License-Identifier: MIT
+
+package net.ladenthin.llama.server;
+
+import java.util.ArrayList;
+import java.util.List;
+
+/**
+ * Fat-jar entry point that dispatches to one of the two server modes based on a single selector
+ * flag. With {@value #OPEN_AI_COMPAT_FLAG} present it runs {@link OpenAiCompatServer} (the
+ * Java-transport, OpenAI-compatible JSON API); without it, {@link NativeServer} (the full native
+ * llama.cpp server with embedded WebUI, the default).
+ *
+ * <p>Every other argument is forwarded verbatim to the chosen server; the {@value
+ * #OPEN_AI_COMPAT_FLAG} marker itself is stripped so it never reaches either parser (it is not a
+ * llama.cpp flag, and {@code llama_server} rejects unknown flags).</p>
+ *
+ * <p><strong>Flag sets differ.</strong> {@link NativeServer} forwards <em>every</em> llama-server
+ * flag to {@code llama_server}, whereas {@link OpenAiCompatServer}'s CLI ({@link OpenAiServerCli})
+ * accepts a curated subset and rejects unknown flags — so native-only flags (e.g. {@code --ui},
+ * {@code -fa}) cannot be combined with {@value #OPEN_AI_COMPAT_FLAG}.</p>
+ *
+ * <p>Both underlying mains remain directly runnable by class name via {@code java -cp}; this
+ * launcher is purely a convenience so a single {@code java -jar} covers both.</p>
+ */
+public final class ServerLauncher {
+
+    /** Selector flag: when present, run {@link OpenAiCompatServer} instead of the default {@link NativeServer}. */
+    public static final String OPEN_AI_COMPAT_FLAG = "--open-ai-compat";
+
+    private ServerLauncher() {}
+
+    /**
+     * Dispatches to {@link OpenAiCompatServer#main(String[])} when {@value #OPEN_AI_COMPAT_FLAG} is
+     * present (with that marker removed from the arguments), otherwise to
+     * {@link NativeServer#main(String[])} with all arguments forwarded unchanged.
+     *
+     * @param args the process arguments
+     * @throws Exception if the selected server's {@code main} throws (it blocks until shutdown)
+     */
+    public static void main(String[] args) throws Exception {
+        if (selectsOpenAiCompat(args)) {
+            OpenAiCompatServer.main(withoutFlag(args, OPEN_AI_COMPAT_FLAG));
+        } else {
+            NativeServer.main(args);
+        }
+    }
+
+    /**
+     * Whether the arguments request the OpenAI-compatible server via {@value #OPEN_AI_COMPAT_FLAG}.
+     *
+     * @param args the process arguments
+     * @return {@code true} if the selector flag is present
+     */
+    static boolean selectsOpenAiCompat(String[] args) {
+        for (final String arg : args) {
+            if (OPEN_AI_COMPAT_FLAG.equals(arg)) {
+                return true;
+            }
+        }
+        return false;
+    }
+
+    /**
+     * Returns a copy of {@code args} with every occurrence of {@code flag} removed, preserving the
+     * order of the remaining arguments.
+     *
+     * @param args the arguments
+     * @param flag the flag token to strip
+     * @return a new array without {@code flag}
+     */
+    static String[] withoutFlag(String[] args, String flag) {
+        final List<String> filtered = new ArrayList<>(args.length);
+        for (final String arg : args) {
+            if (!flag.equals(arg)) {
+                filtered.add(arg);
+            }
+        }
+        return filtered.toArray(new String[0]);
+    }
+}
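
To see what each server would receive for the command lines quoted in the commit message, the dispatch can be traced without starting anything. This is a rough sketch under stated assumptions: `DispatchTrace` and `describe` are hypothetical names, and the two helpers are standalone copies of the launcher's logic so the example compiles without the library on the classpath.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical trace harness, not part of the commit: reports the dispatch decision only.
public final class DispatchTrace {

    static final String FLAG = "--open-ai-compat";

    private DispatchTrace() {}

    // Standalone copy of the launcher's selector check (exact token match).
    public static boolean selectsOpenAiCompat(String[] args) {
        for (final String arg : args) {
            if (FLAG.equals(arg)) {
                return true;
            }
        }
        return false;
    }

    // Standalone copy of the launcher's flag stripping (order of the other args is preserved).
    public static String[] withoutFlag(String[] args, String flag) {
        final List<String> filtered = new ArrayList<>(args.length);
        for (final String arg : args) {
            if (!flag.equals(arg)) {
                filtered.add(arg);
            }
        }
        return filtered.toArray(new String[0]);
    }

    /** Names the server that would be chosen and the argv it would be handed. */
    public static String describe(String[] args) {
        if (selectsOpenAiCompat(args)) {
            return "OpenAiCompatServer " + Arrays.toString(withoutFlag(args, FLAG));
        }
        return "NativeServer " + Arrays.toString(args);
    }

    public static void main(String[] args) {
        // prints: OpenAiCompatServer [-m, model, --port, 8973]
        System.out.println(describe(new String[] {"--open-ai-compat", "-m", "model", "--port", "8973"}));
        // prints: NativeServer [-m, model, --port, 8973]
        System.out.println(describe(new String[] {"-m", "model", "--port", "8973"}));
    }
}
```

Both invocations forward the same remaining arguments; only the target differs, which is why a flag outside the curated subset is expected to fail only in the `--open-ai-compat` case.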
Lines changed: 61 additions & 0 deletions
@@ -0,0 +1,61 @@
+// SPDX-FileCopyrightText: 2026 Bernard Ladenthin <bernard.ladenthin@gmail.com>
+//
+// SPDX-License-Identifier: MIT
+
+package net.ladenthin.llama.server;
+
+import static org.hamcrest.MatcherAssert.assertThat;
+import static org.hamcrest.Matchers.arrayContaining;
+import static org.hamcrest.Matchers.emptyArray;
+import static org.hamcrest.Matchers.is;
+
+import org.junit.jupiter.api.Test;
+
+/**
+ * Pure-Java unit tests for {@link ServerLauncher}'s dispatch logic (selector detection + flag
+ * stripping). No server is started and no native library is required.
+ */
+public class ServerLauncherTest {
+
+    @Test
+    public void selectsNativeByDefault() {
+        assertThat(ServerLauncher.selectsOpenAiCompat(new String[] {"-m", "m.gguf", "--port", "8080"}), is(false));
+    }
+
+    @Test
+    public void selectsOpenAiCompatWhenFlagPresent() {
+        assertThat(ServerLauncher.selectsOpenAiCompat(new String[] {"--open-ai-compat", "-m", "m.gguf"}), is(true));
+    }
+
+    @Test
+    public void selectorFlagPositionDoesNotMatter() {
+        assertThat(ServerLauncher.selectsOpenAiCompat(new String[] {"-m", "m.gguf", "--open-ai-compat"}), is(true));
+    }
+
+    @Test
+    public void withoutFlagStripsTheSelectorAndPreservesTheRest() {
+        String[] out = ServerLauncher.withoutFlag(
+                new String[] {"--open-ai-compat", "-m", "m.gguf", "--port", "8080"},
+                ServerLauncher.OPEN_AI_COMPAT_FLAG);
+        assertThat(out, arrayContaining("-m", "m.gguf", "--port", "8080"));
+    }
+
+    @Test
+    public void withoutFlagRemovesEveryOccurrence() {
+        String[] out = ServerLauncher.withoutFlag(
+                new String[] {"--open-ai-compat", "-m", "m.gguf", "--open-ai-compat"},
+                ServerLauncher.OPEN_AI_COMPAT_FLAG);
+        assertThat(out, arrayContaining("-m", "m.gguf"));
+    }
+
+    @Test
+    public void withoutFlagIsNoOpWhenAbsent() {
+        String[] in = new String[] {"-m", "m.gguf"};
+        assertThat(ServerLauncher.withoutFlag(in, ServerLauncher.OPEN_AI_COMPAT_FLAG), arrayContaining("-m", "m.gguf"));
+    }
+
+    @Test
+    public void withoutFlagOnEmptyArgsIsEmpty() {
+        assertThat(ServerLauncher.withoutFlag(new String[] {}, ServerLauncher.OPEN_AI_COMPAT_FLAG), is(emptyArray()));
+    }
+}
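
The seven tests cover presence, position, and repetition of the selector. Because the helpers compare whole tokens with `equals`, a few edge cases follow from the code but are not exercised above. The sketch below illustrates them with plain Java rather than JUnit so it runs standalone. `SelectorEdgeCases`, `selects`, and `strip` are hypothetical names, and the stream-based bodies are a swapped-in equivalent of the launcher's loops, not the commit's code.

```java
import java.util.Arrays;

// Hypothetical edge-case probe, not part of the commit's test suite.
public final class SelectorEdgeCases {

    static final String FLAG = "--open-ai-compat";

    private SelectorEdgeCases() {}

    // Stream/collection equivalent of the launcher's exact-token selector check.
    public static boolean selects(String... args) {
        return Arrays.asList(args).contains(FLAG);
    }

    // Stream equivalent of stripping every token equal to the flag, keeping order.
    public static String[] strip(String... args) {
        return Arrays.stream(args).filter(arg -> !FLAG.equals(arg)).toArray(String[]::new);
    }

    public static void main(String[] args) {
        // A "--flag=value" spelling is a different token, so it does not select the Java server.
        System.out.println(selects("--open-ai-compat=true", "-m", "m.gguf")); // prints false

        // Matching is case-sensitive.
        System.out.println(selects("--OPEN-AI-COMPAT")); // prints false

        // A value that happens to equal the flag is stripped as well, leaving its option bare.
        System.out.println(Arrays.toString(
                strip("--open-ai-compat", "--model-id", "--open-ai-compat"))); // prints [--model-id]
    }
}
```

In the first two cases the launcher would fall through to `NativeServer` and forward the unrecognized token to it; whether that is worth a dedicated test is a judgment call for the author.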
