Trim cross-repo note from assembly comments; add SSE server TODO

claude · claude · commit e595af66703b · 2026-06-18T23:10:44.000Z
Build comments: drop the 'deliberate non-parity (BAF + jllama only)' restatement and the crossrepostatus.md pointer from the package job + assembly profile comments (that lives only in the cross-repo doc). Also correct the now-stale 'no Main-Class' wording in both: the assembly fat jar is runnable via its LlamaServer Main-Class.

TODO: add an item to implement OpenAI-style SSE token streaming for the server (stream:true) and to find a Java-8-compatible HTTP layer with SSE support, or implement SSE on the existing NanoHTTPD via chunked responses. Javalin (the SSE-capable option) is unusable here: v5 needs Java 11, v6 needs Java 17, v4 is EOL.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UZbmBX5CjqVwPcaTS61im6
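
For reviewers, a minimal sketch of the iterator-to-SSE bridge that the new TODO entry describes; it is an illustration only and not code from this commit. `SseFrameStream` and its constructor parameters are hypothetical names. The `AutoCloseable` stands in for whatever close hook `LlamaIterable` exposes, and each chunk is assumed to be a single-line JSON string (a chunk containing raw newlines would need one `data:` prefix per line). The code is Java-8 compatible and uses only the JDK.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;

/** Hypothetical bridge: exposes an iterator of JSON chunks as a stream of SSE frames. */
final class SseFrameStream extends InputStream {
    private static final byte[] DONE =
            "data: [DONE]\n\n".getBytes(StandardCharsets.UTF_8);

    private final Iterator<String> chunks;
    private final AutoCloseable source; // assumed close hook of the token source; may be null
    private byte[] frame = new byte[0];
    private int pos = 0;
    private boolean doneSent = false;
    private boolean closed = false;

    SseFrameStream(Iterator<String> chunks, AutoCloseable source) {
        this.chunks = chunks;
        this.source = source;
    }

    /** Loads the next frame if the current one is used up; false means end of stream. */
    private boolean fill() {
        while (pos == frame.length) {
            if (closed) {
                return false;
            }
            if (chunks.hasNext()) {
                frame = ("data: " + chunks.next() + "\n\n").getBytes(StandardCharsets.UTF_8);
            } else if (!doneSent) {
                frame = DONE; // OpenAI-style terminator, sent exactly once
                doneSent = true;
            } else {
                return false;
            }
            pos = 0;
        }
        return true;
    }

    @Override
    public int read() throws IOException {
        return fill() ? (frame[pos++] & 0xFF) : -1;
    }

    /**
     * Returns at most the rest of the current frame, so a caller with a large
     * buffer is not blocked until several tokens have been generated.
     */
    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        if (!fill()) {
            return -1;
        }
        int n = Math.min(len, frame.length - pos);
        System.arraycopy(frame, pos, buf, off, n);
        pos += n;
        return n;
    }

    /** Closes the token source once, e.g. when the server drops the response. */
    @Override
    public void close() throws IOException {
        if (closed) {
            return;
        }
        closed = true;
        if (source != null) {
            try {
                source.close();
            } catch (IOException e) {
                throw e;
            } catch (Exception e) {
                throw new IOException(e);
            }
        }
    }
}
```

Under these assumptions the stream could be handed to `NanoHTTPD.newChunkedResponse(Response.Status.OK, "text/event-stream", stream)`. Whether NanoHTTPD closes the stream promptly on client disconnect would still need to be confirmed by the real-socket SSE integration test the TODO calls for.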
diff --git a/.github/workflows/publish.yml b/.github/workflows/publish.yml
@@ -820,10 +820,9 @@ jobs:
       - name: Build JARs
         # `assembly` additionally produces the fat jar-with-dependencies uber JAR
         # (llama-<version>-jar-with-dependencies.jar: library classes + Java runtime deps +
-        # default-platform native libs in one drop-on-classpath JAR; no Main-Class - it is a
-        # library). It lands in target/ and is uploaded in the `llama-jars` artifact below - a
-        # CI run artifact only, NOT a Maven Central / GitHub-Release asset. Documented as
-        # deliberate non-parity (BAF + jllama only) in workspace/crossrepostatus.md.
+        # default-platform native libs in one drop-on-classpath JAR, runnable via its
+        # LlamaServer Main-Class). It lands in target/ and is uploaded in the `llama-jars`
+        # artifact below - a CI run artifact only, not a Maven Central / GitHub-Release asset.
         run: mvn --batch-mode --no-transfer-progress -P release,cuda,opencl-android,assembly -Dmaven.test.skip=true -Dgpg.skip=true package
       - name: Upload JARs
         uses: actions/upload-artifact@v7
diff --git a/TODO.md b/TODO.md
@@ -55,6 +55,40 @@ These are JNI plumbing items for upstream API additions. Policy: add only after
 
   **Out of scope until evidence supports it**: actually implementing any of the above. This entry exists so that when someone asks "can I ship java-llama.cpp as a single 30 MB binary?" the answer points to a concrete investigation plan rather than restarting from zero.
 
+### OpenAI-compatible server: token streaming (SSE) + Java-8 HTTP layer
+
+The `net.ladenthin.llama.server.LlamaServer` MVP is **non-streaming**: every request calls
+the blocking `LlamaModel.handle*` method and returns the full JSON response in one shot. A
+client that sends `"stream": true` still receives a single response, not the incremental
+`text/event-stream` (SSE) `data: {chunk}\n\n` events the OpenAI API emits for streaming
+chat/completions. This is the main functional gap of the server today.
+
+The token source already exists — `LlamaModel.generateChat(InferenceParameters)` /
+`generate(...)` yield tokens incrementally through a Java `Iterator` (`LlamaIterable`). What
+is missing is an HTTP layer that emits SSE.
+
+**Find a Java-8-compatible HTTP layer with good SSE support (alternative to Javalin), or
+implement SSE on NanoHTTPD.** Javalin has a first-class `ctx.sse(...)` API but is **not
+usable here**: Javalin 5 requires Java 11 and Javalin 6 requires Java 17, while this repo
+targets Java 8; Javalin 4 (the last Java-8 release) is EOL. Options, in rough order of
+preference:
+- **Implement SSE on the existing NanoHTTPD** via `NanoHTTPD.newChunkedResponse(status,
+  "text/event-stream", InputStream)`, bridging a `LlamaIterable` to an `InputStream` that
+  writes `data: {chunk}\n\n` frames. No new dependency, stays Java-8 clean; likely the right
+  answer. Cost: the iterator→SSE bridge plus closing the `LlamaIterable` on client
+  disconnect.
+- **Undertow** — Java-8 compatible, has a server-sent-events handler, but a heavier
+  dependency tree.
+- **Spark Java** (Jetty 9) — Java-8 compatible; SSE support is limited/manual.
+- Avoid: Javalin 5/6 (Java 11/17), Javalin 4 (EOL), and the JDK `com.sun.net.httpserver`
+  (ArchUnit-banned `com.sun..`).
+
+Scope when implemented: honour `"stream": true` on `POST /v1/chat/completions` and
+`POST /v1/completions`, emit OpenAI-style SSE chunks terminated by `data: [DONE]`, close the
+underlying `LlamaIterable` on disconnect, and keep the non-streaming path as the default. Add
+a model-free routing test plus a real-socket SSE integration test (mirroring
+`OaiHttpServerIntegrationTest`).
+
 ## Open — cross-cutting (slice for this repo)
 
 - **jqwik pin policy** — see [`../workspace/policies/jqwik-prompt-injection.md`](../workspace/policies/jqwik-prompt-injection.md). `jqwik.version ≤ 1.9.3` is mandatory.
diff --git a/pom.xml b/pom.xml
@@ -992,12 +992,11 @@ SPDX-License-Identifier: MIT
 			<!--
 				Builds the fat jar-with-dependencies uber JAR: the library classes, the
 				default-platform native libs from src/main/resources, and all runtime Java
-				dependencies in one drop-on-classpath JAR. No Main-Class (this is a library,
-				not a CLI). Off by default; the CI `package` job activates it so the uber JAR
-				rides along in the `llama-jars` upload-artifact bundle (a CI run artifact only,
-				not a Maven Central / GitHub-Release asset). Documented in CLAUDE.md
-				"Build Commands" as `mvn -P assembly package` and as deliberate cross-repo
-				non-parity (BAF + jllama only) in workspace/crossrepostatus.md.
+				dependencies in one drop-on-classpath JAR, runnable via the LlamaServer
+				Main-Class (set below) to start the OpenAI-compatible HTTP server. Off by
+				default; the CI `package` job activates it so the uber JAR rides along in the
+				`llama-jars` upload-artifact bundle. Documented in CLAUDE.md "Build Commands"
+				as `mvn -P assembly package`.
 			-->
 			<id>assembly</id>
 			<build>