Commit fa3afb5

Consolidate OpenAI server: keep JDK/streaming OpenAiCompatServer, drop NanoHTTPD
Two interim OpenAI-compatible servers coexisted in net.ladenthin.llama.server (PR #240's JDK com.sun.net.httpserver streaming server on top of #242's NanoHTTPD blocking server). Settle on one: keep the JDK + SSE-streaming core, absorb the NanoHTTPD server's extra routes / CLI / fat-jar entry point, then delete it.

Survivor: OpenAiCompatServer (dependency-free, embeddable, fat-jar Main-Class).
- Streaming chat via SSE with delta.tool_calls + prefill heartbeats (unchanged).
- Ported routes: POST /v1/completions, POST /v1/embeddings, GET /health.
- Broadened the model-free test seam ChatBackend -> OpenAiBackend (+ completions, embeddings); LlamaModelChatBackend -> LlamaModelBackend forwards the two new routes to handleCompletionsOai / handleEmbeddings.
- New testable CLI parser OpenAiServerCli (short/long/alias flags, --help, validation) replacing the inline arg map and the deleted LlamaServerArgs; produces ModelParameters + OpenAiServerConfig.

Deleted NanoHTTPD impl: LlamaServer, LlamaServerArgs, LlamaServerConfig, OaiHttpServer, OaiRouter, OaiBackend, OaiResponse, LlamaModelOaiBackend (+ OaiRouterTest, LlamaServerArgsTest, OaiHttpServerIntegrationTest).

Reconciliation:
- pom.xml: drop org.nanohttpd dependency + version; assembly Main-Class -> OpenAiCompatServer.
- spotbugs-exclude.xml: retarget CC_CYCLOMATIC_COMPLEXITY to OpenAiServerCli.parse; drop the LlamaModelOaiBackend EI_EXPOSE_REP2 entry (survivor is package-private, like the old LlamaModelChatBackend, which needed none).
- LlamaArchitectureTest Server layer + com.sun.net.httpserver exception and module-info `requires jdk.httpserver` unchanged (still correct for the survivor).
- LlamaModel javadoc link, README, CLAUDE.md, TODO.md, publish.yml comment updated; removed the consolidation block and the now-moot "implement SSE" TODO (its premise that com.sun.net.httpserver is ArchUnit-banned was wrong: it is the supported, exported jdk.httpserver module).

C++ (jllama.cpp / json_helpers.hpp / wrap_stream_chunk + its tests) unchanged: the streaming path survives.

Verification (model-free): mvn compile test-compile; targeted tests (LlamaArchitectureTest, OpenAiRequestMapperTest, OpenAiSseFormatterTest, ChatStreamChunkParserTest, OpenAiCompatServerHttpTest, OpenAiServerCliTest) all green; javadoc:jar clean; spotless:check clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF
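
For readers unfamiliar with the JDK-only approach the message describes, the following is a minimal sketch of an SSE route and a static health route on `com.sun.net.httpserver`. It is not the project's `OpenAiCompatServer`: the class name `SseSketch`, the `/stream` path, the hard-coded tokens, and the simplified chunk JSON are all assumptions made for illustration.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch; not the project's OpenAiCompatServer. */
final class SseSketch {

    /** Formats one Server-Sent Events frame: "data: <payload>" plus a blank line. */
    static String frame(String payload) {
        return "data: " + payload + "\n\n";
    }

    /** Starts a loopback-only server on the given port (0 picks an ephemeral port). */
    static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", port), 0);
        server.createContext("/health", ex -> {
            byte[] body = "{\"status\":\"ok\"}".getBytes(StandardCharsets.UTF_8);
            ex.getResponseHeaders().set("Content-Type", "application/json");
            ex.sendResponseHeaders(200, body.length);
            try (OutputStream out = ex.getResponseBody()) {
                out.write(body);
            }
        });
        server.createContext("/stream", SseSketch::stream);
        server.start();
        return server;
    }

    /** Streams two fake tokens as SSE frames, then the OpenAI-style [DONE] terminator. */
    static void stream(HttpExchange ex) throws IOException {
        ex.getResponseHeaders().set("Content-Type", "text/event-stream");
        ex.getResponseHeaders().set("Cache-Control", "no-cache");
        ex.sendResponseHeaders(200, 0); // a length of 0 selects chunked transfer encoding
        try (OutputStream out = ex.getResponseBody()) {
            for (String token : new String[] {"Hel", "lo"}) {
                // Simplified chunk shape (assumption); real chunks carry more fields.
                String json = "{\"choices\":[{\"delta\":{\"content\":\"" + token + "\"}}]}";
                out.write(frame(json).getBytes(StandardCharsets.UTF_8));
                out.flush(); // push each frame to the client as soon as it is ready
            }
            out.write(frame("[DONE]").getBytes(StandardCharsets.UTF_8));
        }
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = start(8080);
        System.out.println("listening on " + server.getAddress());
    }
}
```

With the sketch running, `curl -N http://127.0.0.1:8080/stream` should show the frames arriving one by one. Flushing after every frame is what makes the response incremental; a real server would write frames as the model produces chunks rather than from a fixed array.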
1 parent af13cf0 commit fa3afb5

26 files changed

Lines changed: 895 additions & 1443 deletions

.github/workflows/publish.yml

Lines changed: 1 addition & 1 deletion
@@ -821,7 +821,7 @@ jobs:
821821
# `assembly` additionally produces the fat jar-with-dependencies uber JAR
822822
# (llama-<version>-jar-with-dependencies.jar: library classes + Java runtime deps +
823823
# default-platform native libs in one drop-on-classpath JAR, runnable via its
824-
# LlamaServer Main-Class). It lands in target/ and is uploaded in the `llama-jars`
824+
# OpenAiCompatServer Main-Class). It lands in target/ and is uploaded in the `llama-jars`
825825
# artifact below - a CI run artifact only, not a Maven Central / GitHub-Release asset.
826826
run: mvn --batch-mode --no-transfer-progress -P release,cuda,opencl-android,assembly -Dmaven.test.skip=true -Dgpg.skip=true package
827827
- name: Upload JARs

CLAUDE.md

Lines changed: 4 additions & 3 deletions
@@ -471,9 +471,10 @@ If the local check passes (`BUILD SUCCESS`), the `mvn package` job in
471471
- `LlamaIterator` / `LlamaIterable` — Streaming generation via Java `Iterator`/`Iterable`.
472472
- `LlamaLoader` — Extracts the platform-specific native library from the JAR to a temp directory, or finds it on `java.library.path`.
473473
- `OSInfo` — Detects OS and architecture for library resolution.
474-
- **`server` package — OpenAI-compatible HTTP endpoint. NOTE: two implementations coexist on this branch pending a "best of both" consolidation (see [`TODO.md`](TODO.md)).**
475-
- `server.OpenAiCompatServer` — built on the JDK's `com.sun.net.httpserver` (no new dependency). Serves `POST /v1/chat/completions` (streaming via SSE + non-streaming) and `GET /v1/models` by delegating to `LlamaModel.chatComplete` / `LlamaModel.streamChatCompletion`, so editors that speak the OpenAI protocol (e.g. VS Code Copilot "Custom Endpoint") can drive a local model. Streaming uses the native OAI chunk path (`requestChatCompletionStream` / `receiveChatCompletionChunk`), preserving `delta.tool_calls`.
476-
- `server.LlamaServer` — an OpenAI-compatible HTTP server and the fat-jar `Main-Class`. `LlamaServerArgs` parses the CLI; `OaiRouter` / `OaiHttpServer` (NanoHTTPD) map `POST /v1/chat/completions`, `/v1/completions`, `/v1/embeddings` and `GET /v1/models` to the `LlamaModel.handle*` methods. NanoHTTPD is an `<optional>` dependency (bundled only in the fat jar, not inherited by library consumers). The `server` package is a dedicated top layer in the ArchUnit `layeredArchitecture` rule (the only layer allowed to access the root `Api`). See README "OpenAI-compatible HTTP server".
474+
- **`server` package — OpenAI-compatible HTTP endpoint (a single implementation).**
475+
- `server.OpenAiCompatServer` — built only on the JDK's `com.sun.net.httpserver` (no new dependency), both embeddable and the fat-jar `Main-Class`. Serves `POST /v1/chat/completions` (streaming via SSE + non-streaming), `POST /v1/completions`, `POST /v1/embeddings`, `GET /v1/models` and `GET /health`, so editors that speak the OpenAI protocol (e.g. VS Code Copilot "Custom Endpoint") can drive a local model. Streaming chat uses the native OAI chunk path (`LlamaModel.streamChatCompletion` → `requestChatCompletionStream` / `receiveChatCompletionChunk` + the C++ `wrap_stream_chunk` helper), preserving `delta.tool_calls`; completions/embeddings forward verbatim to `LlamaModel.handleCompletionsOai` / `handleEmbeddings`.
476+
- Supporting classes: `OpenAiServerConfig` (builder; optional bearer auth; binds `127.0.0.1`), `OpenAiServerCli` (testable CLI arg parser → `ModelParameters` + `OpenAiServerConfig`), `OpenAiRequestMapper` (OAI chat request → `InferenceParameters`), `OpenAiSseFormatter` (SSE/models/error JSON), and the model-free test seam `OpenAiBackend`/`ChunkSink` + `LlamaModelBackend`. The streaming envelope is parsed by `json.ChatStreamChunkParser`.
477+
- The `server` package is a dedicated top layer in the ArchUnit `layeredArchitecture` rule (the only layer allowed to access the root `Api`); `noInternalJdkImports` carries an explicit exception for the supported `com.sun.net.httpserver` (the exported `jdk.httpserver` module, which `module-info.java` `requires`). See README "OpenAI-compatible HTTP server".
477478

478479
**Native layer** (`src/main/cpp/`):
479480
- `jllama.cpp` — JNI implementation bridging Java calls to llama.cpp. ~1,215 lines; 17 native methods.

README.md

Lines changed: 26 additions & 45 deletions
@@ -97,7 +97,7 @@ Inference of Meta's LLaMA model (and others) in pure C/C++.
9797
- **Infilling** (fill-in-the-middle) for code models.
9898
- **Tokenize / detokenize** and **JSON-schema → grammar** conversion.
9999
- **Raw JSON endpoint handlers** mirroring the upstream llama.cpp HTTP server (`/completions`, `/v1/completions`, `/embeddings`, `/infill`, `/tokenize`, `/detokenize`).
100-
- **Runnable OpenAI-compatible HTTP server** (`LlamaServer`, the fat-jar `Main-Class`): `java -jar …-jar-with-dependencies.jar --model model.gguf --port 8080`.
100+
- **Runnable OpenAI-compatible HTTP server** (`OpenAiCompatServer`, the fat-jar `Main-Class`, streaming SSE, zero extra dependency): `java -jar …-jar-with-dependencies.jar --model model.gguf --port 8080`.
101101
- **Model metadata** access (`getModelMeta()`) and **server management** (metrics, slot save/restore, runtime thread reconfiguration).
102102
- Pre-built native binaries for Linux (x86-64, aarch64), macOS (x86-64, arm64), and Windows (x86-64, x86); CUDA, Metal, and Vulkan supported via local build.
103103

@@ -399,20 +399,22 @@ Server state is exposed via `getMetrics()`, `eraseSlot(int)`, `saveSlot(int, Str
399399

400400
### OpenAI-compatible HTTP server
401401

402-
> **Note — two implementations pending consolidation.** This branch currently ships **two**
403-
> independent OpenAI-compatible servers in `net.ladenthin.llama.server`, awaiting a "best of both"
404-
> merge (see [TODO.md](TODO.md)). Both are documented below; they will be unified into one.
405-
406-
#### Option A — `OpenAiCompatServer` (dependency-free, streaming SSE) — for VS Code Copilot and other OpenAI clients
407-
408402
`net.ladenthin.llama.server.OpenAiCompatServer` turns a loaded model into a local
409403
OpenAI-compatible HTTP endpoint using only the JDK's built-in `com.sun.net.httpserver` — no extra
410-
dependency and no separate server process. It serves:
404+
dependency and no separate server process. It is both embeddable and the fat-jar `Main-Class`. It
405+
serves:
406+
407+
| Method & path | Backed by |
408+
|---|---|
409+
| `POST /v1/chat/completions` | `LlamaModel.streamChatCompletion` (streaming SSE) / `chatComplete` (blocking) |
410+
| `POST /v1/completions` | `LlamaModel.handleCompletionsOai` |
411+
| `POST /v1/embeddings` (requires `--embedding`) | `LlamaModel.handleEmbeddings` |
412+
| `GET /v1/models` | the configured model id |
413+
| `GET /health` | static `{"status":"ok"}` (unauthenticated) |
411414

412-
- `POST /v1/chat/completions` — streaming (Server-Sent Events) and non-streaming, forwarding
413-
`messages`/`tools` verbatim. The streaming path carries `delta.tool_calls`, so agent/tool-calling
414-
clients work.
415-
- `GET /v1/models` — advertises the configured model id.
415+
Chat completions support **streaming via Server-Sent Events** and non-streaming, forwarding
416+
`messages`/`tools` verbatim. The streaming path carries `delta.tool_calls`, so agent/tool-calling
417+
clients work. Completions and embeddings are non-streaming (the full JSON result per request).
416418

417419
Embed it in your app:
418420

@@ -425,14 +427,24 @@ try (LlamaModel model = new LlamaModel(modelParams);
425427
}
426428
```
427429

428-
…or run it standalone:
430+
…or run it standalone. The fat jar built by the `assembly` profile (`mvn -P assembly package`) is
431+
runnable (its `Main-Class` is `net.ladenthin.llama.server.OpenAiCompatServer`); the plain library jar
432+
works too via `-cp`:
429433

430434
```bash
435+
# fat jar (bundles the native lib + Java deps)
436+
java -jar target/llama-<version>-jar-with-dependencies.jar \
437+
--model models/Qwen3-0.6B-Q4_K_M.gguf --host 0.0.0.0 --port 8080 --n-gpu-layers 99
438+
439+
# or the plain jar
431440
java -cp target/llama-<version>.jar net.ladenthin.llama.server.OpenAiCompatServer \
432441
--model models/model.gguf --port 8080 --model-id local-model
433442
```
434443

435-
Verify with curl:
444+
Run with `--help` for the full option list (`-m/--model`, `--host`, `-p/--port`, `-c/--ctx-size`,
445+
`-ngl/--n-gpu-layers`, `-t/--threads`, `--parallel`, `--model-id`, `--api-key`, `--embedding`).
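
As an illustration of how short/long alias flags like those above can be normalised, here is a minimal sketch. It is not the project's `OpenAiServerCli`: the class name `CliSketch`, the map-based result, and the treatment of `--embedding` and `--help` as value-less switches are assumptions, and unlike a real parser it does not reject unknown `--` flags.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of alias-aware flag parsing; not the project's OpenAiServerCli. */
final class CliSketch {

    // Short alias -> canonical long flag, taken from the option list above.
    private static final Map<String, String> ALIASES = new HashMap<>();

    static {
        ALIASES.put("-m", "--model");
        ALIASES.put("-p", "--port");
        ALIASES.put("-c", "--ctx-size");
        ALIASES.put("-ngl", "--n-gpu-layers");
        ALIASES.put("-t", "--threads");
    }

    /** Parses args into canonical flag -> value; value-less switches map to "true". */
    static Map<String, String> parse(String[] args) {
        Map<String, String> out = new LinkedHashMap<>();
        for (int i = 0; i < args.length; i++) {
            String flag = ALIASES.getOrDefault(args[i], args[i]);
            if (!flag.startsWith("--")) {
                throw new IllegalArgumentException("unexpected argument: " + args[i]);
            }
            // Assumption: these two flags take no value.
            if (flag.equals("--embedding") || flag.equals("--help")) {
                out.put(flag, "true");
                continue;
            }
            if (i + 1 >= args.length) {
                throw new IllegalArgumentException("missing value for " + flag);
            }
            out.put(flag, args[++i]);
        }
        // Minimal validation: the port must be a number in the valid TCP range.
        if (out.containsKey("--port")) {
            int port = Integer.parseInt(out.get("--port"));
            if (port < 0 || port > 65535) {
                throw new IllegalArgumentException("port out of range: " + port);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(parse(args));
    }
}
```

Keeping the parser a pure function from `String[]` to a value object is what makes it testable without loading a model, which is the property the commit attributes to the real CLI class.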
446+
447+
Verify with curl (streaming chat):
436448

437449
```bash
438450
curl -N http://127.0.0.1:8080/v1/chat/completions \
@@ -474,37 +486,6 @@ tool calling depends on the model's own tool-calling quality. Pass `--api-key` (
474486
`OpenAiServerConfig.apiKey(...)`) to require an `Authorization: Bearer` token; the server binds to
475487
`127.0.0.1` by default.
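
The optional bearer check described above can be sketched as follows. This is an assumption about one reasonable way to do it, not the server's actual code: the class and method names are hypothetical, and the constant-time comparison is a design choice of this sketch rather than something the source states.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

/** Hypothetical sketch of an optional bearer-token check; not the project's code. */
final class BearerAuthSketch {

    private static final String PREFIX = "Bearer ";

    /**
     * Returns true when no API key is configured (auth disabled), or when the
     * Authorization header value is exactly "Bearer " followed by the key.
     */
    static boolean authorized(String authorizationHeader, String apiKey) {
        if (apiKey == null || apiKey.isEmpty()) {
            return true; // no key configured: every request is allowed
        }
        if (authorizationHeader == null || !authorizationHeader.startsWith(PREFIX)) {
            return false;
        }
        byte[] presented = authorizationHeader.substring(PREFIX.length())
                .getBytes(StandardCharsets.UTF_8);
        byte[] expected = apiKey.getBytes(StandardCharsets.UTF_8);
        // MessageDigest.isEqual compares in constant time, which avoids leaking
        // how many leading bytes of the token matched.
        return MessageDigest.isEqual(presented, expected);
    }

    public static void main(String[] args) {
        System.out.println(authorized("Bearer s3cret", "s3cret"));
        System.out.println(authorized("Bearer wrong", "s3cret"));
    }
}
```

A handler would call such a check before dispatching and answer `401` on failure, skipping it for routes that are meant to stay open, such as the unauthenticated `GET /health`.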
476488

477-
#### Option B — `LlamaServer` (NanoHTTPD, fat-jar `Main-Class`)
478-
479-
The fat jar built by the `assembly` profile (`mvn -P assembly package`) is runnable: its
480-
`Main-Class` is `net.ladenthin.llama.server.LlamaServer`, a small [NanoHTTPD](https://github.com/NanoHttpd/nanohttpd)
481-
server that loads a GGUF model in-process and serves OpenAI-compatible endpoints by forwarding each
482-
request body to the matching `LlamaModel.handle*` method:
483-
484-
```bash
485-
java -jar target/llama-<version>-jar-with-dependencies.jar \
486-
--model models/Qwen3-0.6B-Q4_K_M.gguf --host 0.0.0.0 --port 8080 --n-gpu-layers 99
487-
```
488-
489-
| Method & path | Backed by |
490-
|---|---|
491-
| `POST /v1/chat/completions` | `LlamaModel.handleChatCompletions` |
492-
| `POST /v1/completions` | `LlamaModel.handleCompletionsOai` |
493-
| `POST /v1/embeddings` (requires `--embedding`) | `LlamaModel.handleEmbeddings` |
494-
| `GET /v1/models` | the configured model alias |
495-
| `GET /health` | static `{"status":"ok"}` |
496-
497-
```bash
498-
curl http://localhost:8080/v1/chat/completions \
499-
-H 'Content-Type: application/json' \
500-
-d '{"messages":[{"role":"user","content":"Hello!"}]}'
501-
```
502-
503-
Run with `--help` for all options (`--ctx-size`, `--threads`, `--model-alias`, …). Responses are
504-
non-streaming (the full JSON result is returned per request). The NanoHTTPD dependency is declared
505-
`<optional>`, so it is bundled in the fat jar but **not** inherited by projects that use this
506-
library as a Maven dependency; running the server requires the fat jar (or adding NanoHTTPD yourself).
507-
508489
### Model/Inference Configuration
509490

510491
There are two sets of parameters you can configure, `ModelParameters` and `InferenceParameters`. Both provide builder

TODO.md

Lines changed: 15 additions & 72 deletions
@@ -13,52 +13,29 @@ cross-cutting initiative.
1313

1414
## Open — jllama-specific
1515

16-
### ⚠️ OpenAI server: TWO implementations to consolidate ("best of both")
17-
18-
Two independent, Claude-generated OpenAI-compatible servers now coexist in
19-
`net.ladenthin.llama.server` after PR #240 was merged on top of the NanoHTTPD server that landed
20-
via #242. **This is a temporary state**; one unified implementation must be chosen. Until then both
21-
compile and are tested side by side.
22-
23-
| | **Option A — `OpenAiCompatServer`** (from PR #240) | **Option B — `LlamaServer`** (from #242) |
24-
|---|---|---|
25-
| HTTP layer | JDK `com.sun.net.httpserver` (the supported `jdk.httpserver` module — **no dependency**) | NanoHTTPD (`<optional>` dep, bundled only in fat jar) |
26-
| Streaming | **Yes** — SSE with `delta.tool_calls`, heartbeats during prefill | No — blocking, full JSON per request |
27-
| Routes | `POST /v1/chat/completions`, `GET /v1/models` | `POST /v1/chat/completions`, `/v1/completions`, `/v1/embeddings`, `GET /v1/models`, `GET /health` |
28-
| Entry point | CLI launcher + embeddable; `OpenAiServerConfig` builder; optional bearer auth; binds `127.0.0.1` | fat-jar `Main-Class`; `LlamaServerArgs` CLI (`--host/--port/--ctx-size/--threads/…`) |
29-
| Native path | `requestChatCompletionStream` / `receiveChatCompletionChunk` (+ `wrap_stream_chunk` C++ helper) | `LlamaModel.handle*` (blocking) |
30-
| Tests | mapper/SSE/parser unit tests + model-free HTTP test over a socket (`ChatBackend` seam) | `OaiRouterTest`, `LlamaServerArgsTest`, `OaiHttpServerIntegrationTest` |
31-
32-
**Important cross-insight:** Option B's own follow-up TODO below ("OpenAI-compatible server: token
33-
streaming (SSE) + Java-8 HTTP layer") lists SSE as *the main functional gap* and says to **avoid**
34-
`com.sun.net.httpserver` because it is "ArchUnit-banned". Option A **already implements that SSE
35-
streaming** with `com.sun.net.httpserver`, and the ban was lifted correctly: `com.sun.net.httpserver`
36-
is a *supported, exported* JDK API (the `jdk.httpserver` module), not an internal `com.sun..` package —
37-
the `noInternalJdkImports` ArchUnit rule now carries an explicit exception for it. So the premise that
38-
blocked the JDK approach on Option B's side does not hold.
39-
40-
**Consolidation task (separate session — a kickoff prompt accompanies this change):** go through both
41-
implementations, take the best of each, settle on ONE server, delete the other, reconcile the
42-
dependency (`pom.xml` NanoHTTPD + assembly), the ArchUnit `layeredArchitecture` `Server` layer, the
43-
`spotbugs-exclude.xml` entries, `package-info.java`, the README "OpenAI-compatible HTTP server"
44-
section, and this TODO (including the now-partly-moot SSE section below).
45-
4616
### OpenAI-compatible HTTP endpoint (shipped; follow-ups open)
4717

48-
`net.ladenthin.llama.server.OpenAiCompatServer` exposes `POST /v1/chat/completions` (streaming via
49-
SSE + non-streaming) and `GET /v1/models` over the JDK's built-in `com.sun.net.httpserver` (no new
50-
dependency), so editors that speak the OpenAI protocol (e.g. VS Code Copilot "Custom Endpoint") can
51-
drive a local model. Streaming uses the native OAI chunk path (`requestChatCompletionStream` /
52-
`receiveChatCompletionChunk`), preserving `delta.tool_calls` for agent mode. Follow-ups, deferred
53-
until requested:
18+
`net.ladenthin.llama.server.OpenAiCompatServer` is the single OpenAI-compatible server. It exposes
19+
`POST /v1/chat/completions` (streaming via SSE + non-streaming), `POST /v1/completions`,
20+
`POST /v1/embeddings`, `GET /v1/models` and `GET /health` over the JDK's built-in
21+
`com.sun.net.httpserver` (no new dependency), and is the fat-jar `Main-Class`. Streaming chat uses the
22+
native OAI chunk path (`requestChatCompletionStream` / `receiveChatCompletionChunk` + the C++
23+
`wrap_stream_chunk` helper), preserving `delta.tool_calls` for agent mode; completions/embeddings
24+
forward verbatim to `LlamaModel.handleCompletionsOai` / `handleEmbeddings`. The CLI is parsed by the
25+
testable `OpenAiServerCli`. (Consolidated from the two interim implementations — PR #240's JDK +
26+
streaming server and #242's NanoHTTPD server — by keeping the JDK/streaming core, porting the extra
27+
routes + a fuller CLI + the fat-jar entry point onto it, and deleting the NanoHTTPD impl + its
28+
`org.nanohttpd` dependency.) Follow-ups, deferred until requested:
5429

5530
- **Multi-model registry.** Only one model id is advertised/served today; support several models
5631
chosen by the request `model` field (and listed in `/v1/models`).
5732
- **`stream_options.include_usage` passthrough** so the final streamed `usage` chunk is emitted
5833
(needs a generic raw-param passthrough on `InferenceParameters`, or explicit mapping).
34+
- **Streaming `/v1/completions`.** The chat route streams; `/v1/completions` is non-streaming today
35+
(a `"stream": true` body still returns one full JSON object). Honour SSE there too if a client needs
36+
it.
5937
- **Additional `apiType`s.** VS Code "Custom Endpoint" also offers Anthropic `messages` and OpenAI
60-
`responses`; only `chat-completions` is implemented. Also consider `/v1/completions` and
61-
`/v1/embeddings` routes.
38+
`responses`; only `chat-completions` is implemented.
6239
- **Gemma 4 tool-calling validation.** Confirm the pinned llama.cpp (`b9682`) includes the Gemma 4
6340
tool-call parser fixes (landed upstream ~Apr 2026); if not, bump per the upgrade procedure so
6441
streamed/blocking `tool_calls` come through for Gemma 4 GGUFs.
@@ -105,40 +82,6 @@ These are JNI plumbing items for upstream API additions. Policy: add only after
10582

10683
**Out of scope until evidence supports it**: actually implementing any of the above. This entry exists so that when someone asks "can I ship java-llama.cpp as a single 30 MB binary?" the answer points to a concrete investigation plan rather than restarting from zero.
10784

108-
### OpenAI-compatible server: token streaming (SSE) + Java-8 HTTP layer
109-
110-
The `net.ladenthin.llama.server.LlamaServer` MVP is **non-streaming**: every request calls
111-
the blocking `LlamaModel.handle*` method and returns the full JSON response in one shot. A
112-
client that sends `"stream": true` still receives a single response, not the incremental
113-
`text/event-stream` (SSE) `data: {chunk}\n\n` events the OpenAI API emits for streaming
114-
chat/completions. This is the main functional gap of the server today.
115-
116-
The token source already exists — `LlamaModel.generateChat(InferenceParameters)` /
117-
`generate(...)` yield tokens incrementally through a Java `Iterator` (`LlamaIterable`). What
118-
is missing is an HTTP layer that emits SSE.
119-
120-
**Find a Java-8-compatible HTTP layer with good SSE support (alternative to Javalin), or
121-
implement SSE on NanoHTTPD.** Javalin has a first-class `ctx.sse(...)` API but is **not
122-
usable here**: Javalin 5 requires Java 11 and Javalin 6 requires Java 17, while this repo
123-
targets Java 8; Javalin 4 (the last Java-8 release) is EOL. Options, in rough order of
124-
preference:
125-
- **Implement SSE on the existing NanoHTTPD** via `NanoHTTPD.newChunkedResponse(status,
126-
"text/event-stream", InputStream)`, bridging a `LlamaIterable` to an `InputStream` that
127-
writes `data: {chunk}\n\n` frames. No new dependency, stays Java-8 clean; likely the right
128-
answer. Cost: the iterator→SSE bridge plus closing the `LlamaIterable` on client
129-
disconnect.
130-
- **Undertow** — Java-8 compatible, has a server-sent-events handler, but a heavier
131-
dependency tree.
132-
- **Spark Java** (Jetty 9) — Java-8 compatible; SSE support is limited/manual.
133-
- Avoid: Javalin 5/6 (Java 11/17), Javalin 4 (EOL), and the JDK `com.sun.net.httpserver`
134-
(ArchUnit-banned `com.sun..`).
135-
136-
Scope when implemented: honour `"stream": true` on `POST /v1/chat/completions` and
137-
`POST /v1/completions`, emit OpenAI-style SSE chunks terminated by `data: [DONE]`, close the
138-
underlying `LlamaIterable` on disconnect, and keep the non-streaming path as the default. Add
139-
a model-free routing test plus a real-socket SSE integration test (mirroring
140-
`OaiHttpServerIntegrationTest`).
141-
14285
## Open — cross-cutting (slice for this repo)
14386

14487
- **jqwik pin policy** — see [`../workspace/policies/jqwik-prompt-injection.md`](../workspace/policies/jqwik-prompt-injection.md). `jqwik.version ≤ 1.9.3` is mandatory.
