bernardladenthin
diff --git a/.github/workflows/publish.yml b/.github/workflows/publish.yml
Lines changed: 1 addition & 1 deletion
diff --git a/CLAUDE.md b/CLAUDE.md
Lines changed: 4 additions & 3 deletions
diff --git a/README.md b/README.md
Lines changed: 26 additions & 45 deletions
diff --git a/TODO.md b/TODO.md
Lines changed: 15 additions & 72 deletions
@@ -821,7 +821,7 @@ jobs:
         # `assembly` additionally produces the fat jar-with-dependencies uber JAR
         # (llama-<version>-jar-with-dependencies.jar: library classes + Java runtime deps +
         # default-platform native libs in one drop-on-classpath JAR, runnable via its
-        # LlamaServer Main-Class). It lands in target/ and is uploaded in the `llama-jars`
+        # OpenAiCompatServer Main-Class). It lands in target/ and is uploaded in the `llama-jars`
         # artifact below - a CI run artifact only, not a Maven Central / GitHub-Release asset.
         run: mvn --batch-mode --no-transfer-progress -P release,cuda,opencl-android,assembly -Dmaven.test.skip=true -Dgpg.skip=true package
       - name: Upload JARs
 
@@ -471,9 +471,10 @@ If the local check passes (`BUILD SUCCESS`), the `mvn package` job in
 - `LlamaIterator` / `LlamaIterable` — Streaming generation via Java `Iterator`/`Iterable`.
 - `LlamaLoader` — Extracts the platform-specific native library from the JAR to a temp directory, or finds it on `java.library.path`.
 - `OSInfo` — Detects OS and architecture for library resolution.
-- **`server` package — OpenAI-compatible HTTP endpoint. NOTE: two implementations coexist on this branch pending a "best of both" consolidation (see [`TODO.md`](TODO.md)).**
-  - `server.OpenAiCompatServer` — built on the JDK's `com.sun.net.httpserver` (no new dependency). Serves `POST /v1/chat/completions` (streaming via SSE + non-streaming) and `GET /v1/models` by delegating to `LlamaModel.chatComplete` / `LlamaModel.streamChatCompletion`, so editors that speak the OpenAI protocol (e.g. VS Code Copilot "Custom Endpoint") can drive a local model. Streaming uses the native OAI chunk path (`requestChatCompletionStream` / `receiveChatCompletionChunk`), preserving `delta.tool_calls`.
-  - `server.LlamaServer` — an OpenAI-compatible HTTP server and the fat-jar `Main-Class`. `LlamaServerArgs` parses the CLI; `OaiRouter` / `OaiHttpServer` (NanoHTTPD) map `POST /v1/chat/completions`, `/v1/completions`, `/v1/embeddings` and `GET /v1/models` to the `LlamaModel.handle*` methods. NanoHTTPD is an `<optional>` dependency (bundled only in the fat jar, not inherited by library consumers). The `server` package is a dedicated top layer in the ArchUnit `layeredArchitecture` rule (the only layer allowed to access the root `Api`). See README "OpenAI-compatible HTTP server".
+- **`server` package — OpenAI-compatible HTTP endpoint (a single implementation).**
+  - `server.OpenAiCompatServer` — built only on the JDK's `com.sun.net.httpserver` (no new dependency), both embeddable and the fat-jar `Main-Class`. Serves `POST /v1/chat/completions` (streaming via SSE + non-streaming), `POST /v1/completions`, `POST /v1/embeddings`, `GET /v1/models` and `GET /health`, so editors that speak the OpenAI protocol (e.g. VS Code Copilot "Custom Endpoint") can drive a local model. Streaming chat uses the native OAI chunk path (`LlamaModel.streamChatCompletion` → `requestChatCompletionStream` / `receiveChatCompletionChunk` + the C++ `wrap_stream_chunk` helper), preserving `delta.tool_calls`; completions/embeddings forward verbatim to `LlamaModel.handleCompletionsOai` / `handleEmbeddings`.
+  - Supporting classes: `OpenAiServerConfig` (builder; optional bearer auth; binds `127.0.0.1`), `OpenAiServerCli` (testable CLI arg parser → `ModelParameters` + `OpenAiServerConfig`), `OpenAiRequestMapper` (OAI chat request → `InferenceParameters`), `OpenAiSseFormatter` (SSE/models/error JSON), and the model-free test seam `OpenAiBackend`/`ChunkSink` + `LlamaModelBackend`. The streaming envelope is parsed by `json.ChatStreamChunkParser`.
+  - The `server` package is a dedicated top layer in the ArchUnit `layeredArchitecture` rule (the only layer allowed to access the root `Api`); `noInternalJdkImports` carries an explicit exception for the supported `com.sun.net.httpserver` (the exported `jdk.httpserver` module, which `module-info.java` `requires`). See README "OpenAI-compatible HTTP server".
 
 **Native layer** (`src/main/cpp/`):
 - `jllama.cpp` — JNI implementation bridging Java calls to llama.cpp. ~1,215 lines; 17 native methods.
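
Reviewer sketch (not part of the patch): the CLAUDE.md hunk above describes a server built only on the JDK's `com.sun.net.httpserver` that binds `127.0.0.1` and answers `GET /health` with a static `{"status":"ok"}`. The snippet below is a minimal illustration of that idea using the JDK API alone. The class name `HealthSketch` and its `start` method are hypothetical and are not taken from `OpenAiCompatServer`; the real implementation may be structured quite differently.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not the project's OpenAiCompatServer.
final class HealthSketch {

    /** Starts a loopback-only server; port 0 lets the OS pick a free port. */
    static HttpServer start(int port) throws IOException {
        InetSocketAddress address =
                new InetSocketAddress(InetAddress.getByName("127.0.0.1"), port);
        HttpServer server = HttpServer.create(address, 0);

        // Static health payload, matching the README table in this change.
        server.createContext("/health", exchange -> {
            byte[] body = "{\"status\":\"ok\"}".getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });

        server.start();
        return server;
    }
}
```

Calling `HealthSketch.start(0)` and reading `server.getAddress().getPort()` gives the port to probe with `curl http://127.0.0.1:<port>/health`; `server.stop(0)` shuts it down.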
 
@@ -97,7 +97,7 @@ Inference of Meta's LLaMA model (and others) in pure C/C++.
 - **Infilling** (fill-in-the-middle) for code models.
 - **Tokenize / detokenize** and **JSON-schema → grammar** conversion.
 - **Raw JSON endpoint handlers** mirroring the upstream llama.cpp HTTP server (`/completions`, `/v1/completions`, `/embeddings`, `/infill`, `/tokenize`, `/detokenize`).
-- **Runnable OpenAI-compatible HTTP server** (`LlamaServer`, the fat-jar `Main-Class`): `java -jar …-jar-with-dependencies.jar --model model.gguf --port 8080`.
+- **Runnable OpenAI-compatible HTTP server** (`OpenAiCompatServer`, the fat-jar `Main-Class`, streaming SSE, zero extra dependency): `java -jar …-jar-with-dependencies.jar --model model.gguf --port 8080`.
 - **Model metadata** access (`getModelMeta()`) and **server management** (metrics, slot save/restore, runtime thread reconfiguration).
 - Pre-built native binaries for Linux (x86-64, aarch64), macOS (x86-64, arm64), and Windows (x86-64, x86); CUDA, Metal, and Vulkan supported via local build.
 
@@ -399,20 +399,22 @@ Server state is exposed via `getMetrics()`, `eraseSlot(int)`, `saveSlot(int, Str
 
 ### OpenAI-compatible HTTP server
 
-> **Note — two implementations pending consolidation.** This branch currently ships **two**
-> independent OpenAI-compatible servers in `net.ladenthin.llama.server`, awaiting a "best of both"
-> merge (see [TODO.md](TODO.md)). Both are documented below; they will be unified into one.
-
-#### Option A — `OpenAiCompatServer` (dependency-free, streaming SSE) — for VS Code Copilot and other OpenAI clients
-
 `net.ladenthin.llama.server.OpenAiCompatServer` turns a loaded model into a local
 OpenAI-compatible HTTP endpoint using only the JDK's built-in `com.sun.net.httpserver` — no extra
-dependency and no separate server process. It serves:
+dependency and no separate server process. It is both embeddable and the fat-jar `Main-Class`. It
+serves:
+
+| Method &amp; path | Backed by |
+|---|---|
+| `POST /v1/chat/completions` | `LlamaModel.streamChatCompletion` (streaming SSE) / `chatComplete` (blocking) |
+| `POST /v1/completions` | `LlamaModel.handleCompletionsOai` |
+| `POST /v1/embeddings` (requires `--embedding`) | `LlamaModel.handleEmbeddings` |
+| `GET /v1/models` | the configured model id |
+| `GET /health` | static `{"status":"ok"}` (unauthenticated) |
 
-- `POST /v1/chat/completions` — streaming (Server-Sent Events) and non-streaming, forwarding
-  `messages`/`tools` verbatim. The streaming path carries `delta.tool_calls`, so agent/tool-calling
-  clients work.
-- `GET /v1/models` — advertises the configured model id.
+Chat completions support **streaming via Server-Sent Events** and non-streaming, forwarding
+`messages`/`tools` verbatim. The streaming path carries `delta.tool_calls`, so agent/tool-calling
+clients work. Completions and embeddings are non-streaming: each request returns the full JSON result.
 
 Embed it in your app:
 
@@ -425,14 +427,24 @@ try (LlamaModel model = new LlamaModel(modelParams);
 }
 ```
 
-…or run it standalone:
+…or run it standalone. The fat jar built by the `assembly` profile (`mvn -P assembly package`) is
+runnable (its `Main-Class` is `net.ladenthin.llama.server.OpenAiCompatServer`); the plain library jar
+works too via `-cp`:
 
 ```bash
+# fat jar (bundles the native lib + Java deps)
+java -jar target/llama-<version>-jar-with-dependencies.jar \
+    --model models/Qwen3-0.6B-Q4_K_M.gguf --host 0.0.0.0 --port 8080 --n-gpu-layers 99
+
+# or the plain jar
 java -cp target/llama-<version>.jar net.ladenthin.llama.server.OpenAiCompatServer \
   --model models/model.gguf --port 8080 --model-id local-model
 ```
 
-Verify with curl:
+Run with `--help` for the full option list (`-m/--model`, `--host`, `-p/--port`, `-c/--ctx-size`,
+`-ngl/--n-gpu-layers`, `-t/--threads`, `--parallel`, `--model-id`, `--api-key`, `--embedding`).
+
+Verify with curl (streaming chat):
 
 ```bash
 curl -N http://127.0.0.1:8080/v1/chat/completions \
@@ -474,37 +486,6 @@ tool calling depends on the model's own tool-calling quality. Pass `--api-key` (
 `OpenAiServerConfig.apiKey(...)`) to require an `Authorization: Bearer` token; the server binds to
 `127.0.0.1` by default.
 
-#### Option B — `LlamaServer` (NanoHTTPD, fat-jar `Main-Class`)
-
-The fat jar built by the `assembly` profile (`mvn -P assembly package`) is runnable: its
-`Main-Class` is `net.ladenthin.llama.server.LlamaServer`, a small [NanoHTTPD](https://github.com/NanoHttpd/nanohttpd)
-server that loads a GGUF model in-process and serves OpenAI-compatible endpoints by forwarding each
-request body to the matching `LlamaModel.handle*` method:
-
-```bash
-java -jar target/llama-<version>-jar-with-dependencies.jar \
-    --model models/Qwen3-0.6B-Q4_K_M.gguf --host 0.0.0.0 --port 8080 --n-gpu-layers 99
-```
-
-| Method &amp; path | Backed by |
-|---|---|
-| `POST /v1/chat/completions` | `LlamaModel.handleChatCompletions` |
-| `POST /v1/completions` | `LlamaModel.handleCompletionsOai` |
-| `POST /v1/embeddings` (requires `--embedding`) | `LlamaModel.handleEmbeddings` |
-| `GET /v1/models` | the configured model alias |
-| `GET /health` | static `{"status":"ok"}` |
-
-```bash
-curl http://localhost:8080/v1/chat/completions \
-  -H 'Content-Type: application/json' \
-  -d '{"messages":[{"role":"user","content":"Hello!"}]}'
-```
-
-Run with `--help` for all options (`--ctx-size`, `--threads`, `--model-alias`, …). Responses are
-non-streaming (the full JSON result is returned per request). The NanoHTTPD dependency is declared
-`<optional>`, so it is bundled in the fat jar but **not** inherited by projects that use this
-library as a Maven dependency; running the server requires the fat jar (or adding NanoHTTPD yourself).
-
 ### Model/Inference Configuration
 
 There are two sets of parameters you can configure, `ModelParameters` and `InferenceParameters`. Both provide builder 
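
Reviewer sketch (not part of the patch): the README hunk above says that `--api-key` (or `OpenAiServerConfig.apiKey(...)`) makes the server require an `Authorization: Bearer` token. One way such a check could look is shown below. `BearerAuthSketch` and `isAuthorized` are hypothetical names, and the choice of a constant-time comparison is an assumption of this sketch rather than something the diff states.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Hypothetical illustration only; not the project's actual auth code.
final class BearerAuthSketch {

    /**
     * Returns true when no key is configured (auth disabled) or when the
     * header carries "Bearer <token>" with a token equal to the expected key.
     */
    static boolean isAuthorized(String authorizationHeader, String expectedKey) {
        if (expectedKey == null || expectedKey.isEmpty()) {
            return true; // no key configured: every request is accepted
        }
        if (authorizationHeader == null) {
            return false;
        }
        String prefix = "Bearer ";
        // The scheme name is matched case-insensitively.
        if (!authorizationHeader.regionMatches(true, 0, prefix, 0, prefix.length())) {
            return false;
        }
        String presented = authorizationHeader.substring(prefix.length()).trim();
        // MessageDigest.isEqual compares in constant time for equal-length inputs.
        return MessageDigest.isEqual(
                presented.getBytes(StandardCharsets.UTF_8),
                expectedKey.getBytes(StandardCharsets.UTF_8));
    }
}
```

With a configured key of `secret`, the header `Bearer secret` is accepted, while a missing header, a wrong token, or a different scheme such as `Basic` is rejected.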
 
@@ -13,52 +13,29 @@ cross-cutting initiative.
 
 ## Open — jllama-specific
 
-### ⚠️ OpenAI server: TWO implementations to consolidate ("best of both")
-
-Two independent, Claude-generated OpenAI-compatible servers now coexist in
-`net.ladenthin.llama.server` after PR #240 was merged on top of the NanoHTTPD server that landed
-via #242. **This is a temporary state**; one unified implementation must be chosen. Until then both
-compile and are tested side by side.
-
-| | **Option A — `OpenAiCompatServer`** (from PR #240) | **Option B — `LlamaServer`** (from #242) |
-|---|---|---|
-| HTTP layer | JDK `com.sun.net.httpserver` (the supported `jdk.httpserver` module — **no dependency**) | NanoHTTPD (`<optional>` dep, bundled only in fat jar) |
-| Streaming | **Yes** — SSE with `delta.tool_calls`, heartbeats during prefill | No — blocking, full JSON per request |
-| Routes | `POST /v1/chat/completions`, `GET /v1/models` | `POST /v1/chat/completions`, `/v1/completions`, `/v1/embeddings`, `GET /v1/models`, `GET /health` |
-| Entry point | CLI launcher + embeddable; `OpenAiServerConfig` builder; optional bearer auth; binds `127.0.0.1` | fat-jar `Main-Class`; `LlamaServerArgs` CLI (`--host/--port/--ctx-size/--threads/…`) |
-| Native path | `requestChatCompletionStream` / `receiveChatCompletionChunk` (+ `wrap_stream_chunk` C++ helper) | `LlamaModel.handle*` (blocking) |
-| Tests | mapper/SSE/parser unit tests + model-free HTTP test over a socket (`ChatBackend` seam) | `OaiRouterTest`, `LlamaServerArgsTest`, `OaiHttpServerIntegrationTest` |
-
-**Important cross-insight:** Option B's own follow-up TODO below ("OpenAI-compatible server: token
-streaming (SSE) + Java-8 HTTP layer") lists SSE as *the main functional gap* and says to **avoid**
-`com.sun.net.httpserver` because it is "ArchUnit-banned". Option A **already implements that SSE
-streaming** with `com.sun.net.httpserver`, and the ban was lifted correctly: `com.sun.net.httpserver`
-is a *supported, exported* JDK API (the `jdk.httpserver` module), not an internal `com.sun..` package —
-the `noInternalJdkImports` ArchUnit rule now carries an explicit exception for it. So the premise that
-blocked the JDK approach on Option B's side does not hold.
-
-**Consolidation task (separate session — a kickoff prompt accompanies this change):** go through both
-implementations, take the best of each, settle on ONE server, delete the other, reconcile the
-dependency (`pom.xml` NanoHTTPD + assembly), the ArchUnit `layeredArchitecture` `Server` layer, the
-`spotbugs-exclude.xml` entries, `package-info.java`, the README "OpenAI-compatible HTTP server"
-section, and this TODO (including the now-partly-moot SSE section below).
-
 ### OpenAI-compatible HTTP endpoint (shipped; follow-ups open)
 
-`net.ladenthin.llama.server.OpenAiCompatServer` exposes `POST /v1/chat/completions` (streaming via
-SSE + non-streaming) and `GET /v1/models` over the JDK's built-in `com.sun.net.httpserver` (no new
-dependency), so editors that speak the OpenAI protocol (e.g. VS Code Copilot "Custom Endpoint") can
-drive a local model. Streaming uses the native OAI chunk path (`requestChatCompletionStream` /
-`receiveChatCompletionChunk`), preserving `delta.tool_calls` for agent mode. Follow-ups, deferred
-until requested:
+`net.ladenthin.llama.server.OpenAiCompatServer` is the single OpenAI-compatible server. It exposes
+`POST /v1/chat/completions` (streaming via SSE + non-streaming), `POST /v1/completions`,
+`POST /v1/embeddings`, `GET /v1/models` and `GET /health` over the JDK's built-in
+`com.sun.net.httpserver` (no new dependency), and is the fat-jar `Main-Class`. Streaming chat uses the
+native OAI chunk path (`requestChatCompletionStream` / `receiveChatCompletionChunk` + the C++
+`wrap_stream_chunk` helper), preserving `delta.tool_calls` for agent mode; completions/embeddings
+forward verbatim to `LlamaModel.handleCompletionsOai` / `handleEmbeddings`. The CLI is parsed by the
+testable `OpenAiServerCli`. (Consolidated from the two interim implementations — PR #240's JDK +
+streaming server and #242's NanoHTTPD server — by keeping the JDK/streaming core, porting the extra
+routes + a fuller CLI + the fat-jar entry point onto it, and deleting the NanoHTTPD impl + its
+`org.nanohttpd` dependency.) Follow-ups, deferred until requested:
 
 - **Multi-model registry.** Only one model id is advertised/served today; support several models
   chosen by the request `model` field (and listed in `/v1/models`).
 - **`stream_options.include_usage` passthrough** so the final streamed `usage` chunk is emitted
   (needs a generic raw-param passthrough on `InferenceParameters`, or explicit mapping).
+- **Streaming `/v1/completions`.** The chat route streams; `/v1/completions` is non-streaming today
+  (a `"stream": true` body still returns one full JSON object). Honour SSE there too if a client needs
+  it.
 - **Additional `apiType`s.** VS Code "Custom Endpoint" also offers Anthropic `messages` and OpenAI
-  `responses`; only `chat-completions` is implemented. Also consider `/v1/completions` and
-  `/v1/embeddings` routes.
+  `responses`; only `chat-completions` is implemented.
 - **Gemma 4 tool-calling validation.** Confirm the pinned llama.cpp (`b9682`) includes the Gemma 4
   tool-call parser fixes (landed upstream ~Apr 2026); if not, bump per the upgrade procedure so
   streamed/blocking `tool_calls` come through for Gemma 4 GGUFs.
@@ -105,40 +82,6 @@ These are JNI plumbing items for upstream API additions. Policy: add only after
 
   **Out of scope until evidence supports it**: actually implementing any of the above. This entry exists so that when someone asks "can I ship java-llama.cpp as a single 30 MB binary?" the answer points to a concrete investigation plan rather than restarting from zero.
 
-### OpenAI-compatible server: token streaming (SSE) + Java-8 HTTP layer
-
-The `net.ladenthin.llama.server.LlamaServer` MVP is **non-streaming**: every request calls
-the blocking `LlamaModel.handle*` method and returns the full JSON response in one shot. A
-client that sends `"stream": true` still receives a single response, not the incremental
-`text/event-stream` (SSE) `data: {chunk}\n\n` events the OpenAI API emits for streaming
-chat/completions. This is the main functional gap of the server today.
-
-The token source already exists — `LlamaModel.generateChat(InferenceParameters)` /
-`generate(...)` yield tokens incrementally through a Java `Iterator` (`LlamaIterable`). What
-is missing is an HTTP layer that emits SSE.
-
-**Find a Java-8-compatible HTTP layer with good SSE support (alternative to Javalin), or
-implement SSE on NanoHTTPD.** Javalin has a first-class `ctx.sse(...)` API but is **not
-usable here**: Javalin 5 requires Java 11 and Javalin 6 requires Java 17, while this repo
-targets Java 8; Javalin 4 (the last Java-8 release) is EOL. Options, in rough order of
-preference:
-- **Implement SSE on the existing NanoHTTPD** via `NanoHTTPD.newChunkedResponse(status,
-  "text/event-stream", InputStream)`, bridging a `LlamaIterable` to an `InputStream` that
-  writes `data: {chunk}\n\n` frames. No new dependency, stays Java-8 clean; likely the right
-  answer. Cost: the iterator→SSE bridge plus closing the `LlamaIterable` on client
-  disconnect.
-- **Undertow** — Java-8 compatible, has a server-sent-events handler, but a heavier
-  dependency tree.
-- **Spark Java** (Jetty 9) — Java-8 compatible; SSE support is limited/manual.
-- Avoid: Javalin 5/6 (Java 11/17), Javalin 4 (EOL), and the JDK `com.sun.net.httpserver`
-  (ArchUnit-banned `com.sun..`).
-
-Scope when implemented: honour `"stream": true` on `POST /v1/chat/completions` and
-`POST /v1/completions`, emit OpenAI-style SSE chunks terminated by `data: [DONE]`, close the
-underlying `LlamaIterable` on disconnect, and keep the non-streaming path as the default. Add
-a model-free routing test plus a real-socket SSE integration test (mirroring
-`OaiHttpServerIntegrationTest`).
-
 ## Open — cross-cutting (slice for this repo)
 
 - **jqwik pin policy** — see [`../workspace/policies/jqwik-prompt-injection.md`](../workspace/policies/jqwik-prompt-injection.md). `jqwik.version ≤ 1.9.3` is mandatory.
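
Reviewer sketch (not part of the patch): the TODO text in this change describes OpenAI-style streaming as `data: {chunk}\n\n` events terminated by `data: [DONE]`. The snippet below illustrates that wire framing in isolation. `SseFrameSketch` and its methods are hypothetical and are not the project's `OpenAiSseFormatter`; prefixing every line of a multi-line payload with `data:` follows the general Server-Sent Events format and is an assumption about how a formatter might handle such input.

```java
import java.util.List;

// Hypothetical illustration only; not the project's OpenAiSseFormatter.
final class SseFrameSketch {

    /** Wraps one payload as an SSE event; each payload line gets its own "data:" field. */
    static String dataFrame(String payload) {
        StringBuilder sb = new StringBuilder();
        for (String line : payload.split("\n", -1)) {
            sb.append("data: ").append(line).append('\n');
        }
        return sb.append('\n').toString(); // a blank line ends the event
    }

    /** Terminator event that OpenAI-style streams send last. */
    static String doneFrame() {
        return "data: [DONE]\n\n";
    }

    /** Concatenates all chunk events followed by the terminator. */
    static String stream(List<String> jsonChunks) {
        StringBuilder sb = new StringBuilder();
        for (String chunk : jsonChunks) {
            sb.append(dataFrame(chunk));
        }
        return sb.append(doneFrame()).toString();
    }
}
```

In a real handler each frame would be written and flushed to the response body as it is produced, with `Content-Type: text/event-stream`, rather than concatenated into one string as done here for clarity.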