Add OpenAI-compatible HTTP server (LlamaServer) #242
Merged
Add a managed maven-assembly-plugin (3.8.0) and an `assembly` profile that builds llama-<version>-jar-with-dependencies.jar: the library classes, all Java runtime dependencies, and the default-platform native libs from src/main/resources in one drop-on-classpath JAR (no Main-Class - it is a library).
Activate it in the package job (-P release,cuda,opencl-android,assembly) so the uber JAR rides along in the existing `llama-jars` upload-artifact (a CI run artifact only, not a Maven Central / GitHub-Release asset).
Document the command in CLAUDE.md. Recorded as deliberate cross-repo non-parity (BAF + jllama only) in workspace/crossrepostatus.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UZbmBX5CjqVwPcaTS61im6
Introduce net.ladenthin.llama.server.LlamaServer, a runnable main class (and the
fat-jar Main-Class) that loads a GGUF model in-process and serves the
OpenAI-compatible endpoints over a tiny NanoHTTPD server:
POST /v1/chat/completions -> LlamaModel.handleChatCompletions
POST /v1/completions -> LlamaModel.handleCompletionsOai
POST /v1/embeddings -> LlamaModel.handleEmbeddings (needs --embedding)
GET /v1/models -> configured model alias
GET /health -> {"status":"ok"}
The handle* methods already return OAI-shaped JSON, so the server only forwards
request bodies. Design:
- OaiRouter (model-free, unit-tested) maps method+path+body to a response;
OaiHttpServer is the thin NanoHTTPD adapter; LlamaModelOaiBackend bridges to
LlamaModel; LlamaServerArgs parses --model/--host/--port/--ctx-size/
--n-gpu-layers/--threads/--embedding/--model-alias/--help.
- handleChatCompletions widened to public to match the other raw OAI handlers.
- NanoHTTPD is an <optional> compile dependency: bundled in the fat jar, not
inherited by library consumers (Java-8 clean, zero transitive deps).
- New `server` ArchUnit layer (the only layer allowed to access the Api root).
- spotbugs-exclude: PATH_TRAVERSAL_IN + CRLF_INJECTION_LOGS on the server
package (operator-supplied CLI input; same threat model as LlamaLoader), CC on
the flag switch (desugared String-switch artifact), EI_EXPOSE_REP2 on the
backend (non-owning model wrapper, mirrors Session).
Tests (model-free): LlamaServerArgsTest (10), OaiRouterTest (10),
OaiHttpServerIntegrationTest (real loopback socket + fake backend, 1). Verified:
spotless, compile (Error Prone/NullAway/Checker), spotbugs Max+Low, javadoc, and
the assembly fat jar (Main-Class set, NanoHTTPD bundled) all clean.
Docs: README "OpenAI-compatible HTTP server" + Features bullet; CLAUDE.md note.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UZbmBX5CjqVwPcaTS61im6
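The commit above describes OaiRouter as a model-free component that maps method+path+body to a response. A minimal sketch of that idea follows, assuming plain `Function<String, String>` handlers in place of the LlamaModel handle* methods. `RouterSketch`, its `Response` type, the error JSON and the `/v1/models` response shape are all hypothetical illustrations, not the PR's actual classes or output.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch; class and method names are illustrative, not the PR's API. */
final class RouterSketch {

    /** Minimal response value: an HTTP status code plus a JSON body. */
    static final class Response {
        final int status;
        final String body;

        Response(int status, String body) {
            this.status = status;
            this.body = body;
        }
    }

    // POST path -> handler taking the raw request body and returning raw JSON.
    private final Map<String, Function<String, String>> postRoutes = new LinkedHashMap<>();
    private final String modelAlias;

    RouterSketch(String modelAlias,
                 Function<String, String> chat,
                 Function<String, String> completions,
                 Function<String, String> embeddings) {
        this.modelAlias = modelAlias;
        postRoutes.put("/v1/chat/completions", chat);
        postRoutes.put("/v1/completions", completions);
        postRoutes.put("/v1/embeddings", embeddings);
    }

    Response route(String method, String rawPath, String body) {
        // Drop any query string so "/health?x=1" still matches "/health".
        int q = rawPath.indexOf('?');
        String path = q >= 0 ? rawPath.substring(0, q) : rawPath;

        if ("GET".equals(method) && "/health".equals(path)) {
            return new Response(200, "{\"status\":\"ok\"}");
        }
        if ("GET".equals(method) && "/v1/models".equals(path)) {
            // Assumed response shape; the alias is not JSON-escaped in this sketch.
            return new Response(200,
                    "{\"object\":\"list\",\"data\":[{\"id\":\"" + modelAlias + "\",\"object\":\"model\"}]}");
        }
        Function<String, String> handler = postRoutes.get(path);
        if (handler == null) {
            return new Response(404, error("unknown path"));
        }
        if (!"POST".equals(method)) {
            return new Response(405, error("use POST"));
        }
        if (body == null || body.isEmpty()) {
            return new Response(400, error("missing request body"));
        }
        // The handler already returns OAI-shaped JSON, so it is forwarded untouched.
        return new Response(200, handler.apply(body));
    }

    private static String error(String message) {
        return "{\"error\":{\"message\":\"" + message + "\"}}";
    }

    public static void main(String[] args) {
        // Fake handlers stand in for the model so the router runs without a GGUF file.
        RouterSketch router = new RouterSketch("demo-model",
                b -> "{\"handler\":\"chat\"}",
                b -> "{\"handler\":\"completions\"}",
                b -> "{\"handler\":\"embeddings\"}");
        Response r = router.route("POST", "/v1/chat/completions?trace=1", "{\"messages\":[]}");
        System.out.println(r.status + " " + r.body);
    }
}
```

Keeping the router free of both HTTP and model types is what lets it be unit-tested with fake handlers, which matches the "model-free, unit-tested" description in the commit.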
Build comments: drop the 'deliberate non-parity (BAF + jllama only)' restatement and the crossrepostatus.md pointer from the package job + assembly profile comments (that lives only in the cross-repo doc). Also correct the now-stale 'no Main-Class' wording in both: the assembly fat jar is runnable via its LlamaServer Main-Class.
TODO: add an item to implement OpenAI-style SSE token streaming for the server (stream:true) and to find a Java-8-compatible HTTP layer with SSE support, or implement SSE on the existing NanoHTTPD via chunked responses. Javalin (the SSE-capable option) is unusable here: v5 needs Java 11, v6 needs Java 17, v4 is EOL.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UZbmBX5CjqVwPcaTS61im6
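For context on the SSE TODO above: on the wire, OpenAI-style streaming sends each JSON chunk as a `data: <json>` line followed by a blank line, and ends the stream with `data: [DONE]`. The sketch below shows only that framing, assuming the chunks arrive as already-serialized single-line JSON strings. `SseFramingSketch` and its methods are hypothetical names, and the chunk payloads are placeholders rather than real model output.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of SSE framing for a stream of JSON chunks. */
final class SseFramingSketch {

    /** Frames one JSON chunk as a single SSE event. */
    static String event(String json) {
        // A raw line break inside the payload would split the event, so reject it here.
        if (json.indexOf('\n') >= 0 || json.indexOf('\r') >= 0) {
            throw new IllegalArgumentException("payload must be single-line JSON");
        }
        return "data: " + json + "\n\n";
    }

    /** Terminal sentinel that OpenAI-style clients wait for. */
    static String done() {
        return "data: [DONE]\n\n";
    }

    /** Renders a whole stream; a real server would write and flush event by event. */
    static String render(List<String> chunks) {
        StringBuilder out = new StringBuilder();
        for (String chunk : chunks) {
            out.append(event(chunk));
        }
        return out.append(done()).toString();
    }

    public static void main(String[] args) {
        // Placeholder payloads, not the actual chat.completion.chunk schema.
        System.out.print(render(Arrays.asList("{\"delta\":\"Hel\"}", "{\"delta\":\"lo\"}")));
    }
}
```

Because the framing is pure string work, it can be tested without any HTTP layer, whichever server ends up carrying the bytes.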
bernardladenthin pushed a commit that referenced this pull request on Jun 18, 2026
Make PR #240 mergeable on top of the NanoHTTPD OpenAI server that landed via #242.
Both OpenAI-compatible server implementations now coexist in net.ladenthin.llama.server, pending a "best of both" consolidation (tracked in TODO.md):
- OpenAiCompatServer (PR #240): JDK com.sun.net.httpserver, streaming SSE with delta.tool_calls, no new runtime dependency.
- LlamaServer (#242): NanoHTTPD, non-streaming, fat-jar Main-Class, plus /v1/completions, /v1/embeddings and /health.
Conflicts resolved:
- server/package-info.java (add/add): documents both servers + pending merge.
- README.md, CLAUDE.md: keep both server sections under one heading.
- TODO.md: add a consolidation task; note SSE is already solved by OpenAiCompatServer and that com.sun.net.httpserver is the supported jdk.httpserver module, not an internal com.sun.. API.
Auto-merged and verified consistent: LlamaModel.java (distinct native methods on each side), publish.yml, pom.xml (NanoHTTPD + assembly), spotbugs-exclude.xml, and LlamaArchitectureTest.java. main's Server layer already permits this session's server classes (they touch only the Api root + Parameters), and the noInternalJdkImports com.sun.net.httpserver exception merges alongside.
Verified: mvn compile + test-compile clean; 64 model-free tests pass (LlamaArchitectureTest + both servers' unit/HTTP tests, integration test self-skips without a model); javadoc jar builds clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014L2dLbAtwdq7C6a2gFRsQQ
bernardladenthin pushed a commit that referenced this pull request on Jun 20, 2026
…p NanoHTTPD
Two interim OpenAI-compatible servers coexisted in net.ladenthin.llama.server (PR #240's JDK com.sun.net.httpserver streaming server on top of #242's NanoHTTPD blocking server). Settle on one: keep the JDK + SSE-streaming core, absorb the NanoHTTPD server's extra routes / CLI / fat-jar entry point, then delete it.
Survivor: OpenAiCompatServer (dependency-free, embeddable, fat-jar Main-Class).
- Streaming chat via SSE with delta.tool_calls + prefill heartbeats (unchanged).
- Ported routes: POST /v1/completions, POST /v1/embeddings, GET /health.
- Broadened the model-free test seam ChatBackend -> OpenAiBackend (+ completions, embeddings); LlamaModelChatBackend -> LlamaModelBackend forwards the two new routes to handleCompletionsOai / handleEmbeddings.
- New testable CLI parser OpenAiServerCli (short/long/alias flags, --help, validation) replacing the inline arg map and the deleted LlamaServerArgs; produces ModelParameters + OpenAiServerConfig.
Deleted NanoHTTPD impl: LlamaServer, LlamaServerArgs, LlamaServerConfig, OaiHttpServer, OaiRouter, OaiBackend, OaiResponse, LlamaModelOaiBackend (+ OaiRouterTest, LlamaServerArgsTest, OaiHttpServerIntegrationTest).
Reconciliation:
- pom.xml: drop org.nanohttpd dependency + version; assembly Main-Class -> OpenAiCompatServer.
- spotbugs-exclude.xml: retarget CC_CYCLOMATIC_COMPLEXITY to OpenAiServerCli.parse; drop the LlamaModelOaiBackend EI_EXPOSE_REP2 entry (survivor is package-private, like the old LlamaModelChatBackend, which needed none).
- LlamaArchitectureTest Server layer + com.sun.net.httpserver exception and module-info `requires jdk.httpserver` unchanged (still correct for the survivor).
- LlamaModel javadoc link, README, CLAUDE.md, TODO.md, publish.yml comment updated; removed the consolidation block and the now-moot "implement SSE" TODO (its premise that com.sun.net.httpserver is ArchUnit-banned was wrong: it is the supported, exported jdk.httpserver module).
C++ (jllama.cpp / json_helpers.hpp / wrap_stream_chunk + its tests) unchanged: the streaming path survives.
Verification (model-free): mvn compile test-compile; targeted tests (LlamaArchitectureTest, OpenAiRequestMapperTest, OpenAiSseFormatterTest, ChatStreamChunkParserTest, OpenAiCompatServerHttpTest, OpenAiServerCliTest) all green; javadoc:jar clean; spotless:check clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF
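To illustrate why the JDK's `com.sun.net.httpserver` can carry both the plain routes and SSE without an extra dependency, here is a rough, self-contained sketch. It is not the OpenAiCompatServer code: `JdkSseServerSketch`, the `/stream` path, the heartbeat text and the payloads are all hypothetical. It relies on the documented behaviour that `sendResponseHeaders(200, 0)` selects chunked transfer encoding, so events can be flushed as they are produced, and it sends a heartbeat as an SSE comment line (a line starting with `:`).

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URL;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: a JDK-only server with a JSON route and an SSE route. */
final class JdkSseServerSketch {

    static HttpServer start() throws IOException {
        // Port 0 asks the OS for a free ephemeral port on the loopback interface.
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/health", ex -> {
            byte[] body = "{\"status\":\"ok\"}".getBytes(StandardCharsets.UTF_8);
            ex.getResponseHeaders().set("Content-Type", "application/json");
            ex.sendResponseHeaders(200, body.length);
            try (OutputStream out = ex.getResponseBody()) {
                out.write(body);
            }
        });
        server.createContext("/stream", JdkSseServerSketch::stream);
        server.start();
        return server;
    }

    private static void stream(HttpExchange ex) throws IOException {
        ex.getResponseHeaders().set("Content-Type", "text/event-stream");
        ex.getResponseHeaders().set("Cache-Control", "no-cache");
        ex.sendResponseHeaders(200, 0); // 0 = length unknown, use chunked encoding
        try (OutputStream out = ex.getResponseBody()) {
            send(out, ": heartbeat\n\n");               // SSE comment, ignored by clients
            send(out, "data: {\"delta\":\"Hel\"}\n\n"); // placeholder payloads
            send(out, "data: {\"delta\":\"lo\"}\n\n");
            send(out, "data: [DONE]\n\n");
        }
    }

    private static void send(OutputStream out, String frame) throws IOException {
        out.write(frame.getBytes(StandardCharsets.UTF_8));
        out.flush(); // push each event to the client immediately
    }

    /** Reads a whole response body over a real loopback connection. */
    static String fetch(int port, String path) throws IOException {
        URL url = new URL("http://127.0.0.1:" + port + path);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection(Proxy.NO_PROXY);
        try (InputStream in = conn.getInputStream()) {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[1024];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return new String(buf.toByteArray(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }

    /** Starts the server, fetches both routes, and always stops the server again. */
    static String[] demo() {
        try {
            HttpServer server = start();
            try {
                int port = server.getAddress().getPort();
                return new String[] {fetch(port, "/health"), fetch(port, "/stream")};
            } finally {
                server.stop(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        for (String body : demo()) {
            System.out.println(body);
        }
    }
}
```

The `demo()` helper mirrors the loopback-socket style of testing mentioned in these commits: a real socket, but no model behind it.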


Summary
- OpenAI-compatible HTTP server (`LlamaServer`) as the `Main-Class` of the fat JAR, enabling users to serve inference via the standard `/v1/chat/completions`, `/v1/completions`, `/v1/embeddings`, and `/v1/models` endpoints.
- `OaiRouter` (testable routing logic independent of the HTTP layer), `OaiHttpServer` (NanoHTTPD adapter), `LlamaServerArgs` (command-line parsing), and `LlamaModelOaiBackend` (inference delegation).
- `assembly` profile to build a fat JAR with all dependencies bundled, runnable via `java -jar llama-<version>-jar-with-dependencies.jar --model model.gguf --port 8080`.
Test plan
- `LlamaServerArgs` (flag parsing, defaults, validation, error messages)
- `OaiRouter` (endpoint dispatch, method/body preconditions, error handling, query string stripping)
- `OaiHttpServer` (real loopback socket, NanoHTTPD adapter, request/response round-trip)
Related issues / PRs
Closes the OpenAI-compatible server feature gap documented in TODO.md (non-streaming MVP).
Checklist
- `CONTRIBUTING.md` and `CODE_OF_CONDUCT.md`
https://claude.ai/code/session_01UZbmBX5CjqVwPcaTS61im6
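The `--model model.gguf --port 8080` invocation in the Summary implies a small flag parser with defaults and validation. The following is a minimal sketch of such a parser, covering only a subset of the flags named in this PR. `ServerArgsSketch` is a hypothetical class, and the default host and port, the required `--model` rule and the error messages are assumptions made for illustration, not the behaviour of `LlamaServerArgs`.

```java
/** Hypothetical sketch of a flag parser; defaults and messages are assumptions. */
final class ServerArgsSketch {
    String model;
    String host = "127.0.0.1"; // assumed default
    int port = 8080;           // assumed default
    boolean embedding;
    boolean help;

    static ServerArgsSketch parse(String... args) {
        ServerArgsSketch out = new ServerArgsSketch();
        for (int i = 0; i < args.length; i++) {
            String flag = args[i];
            switch (flag) {
                case "--help":
                    out.help = true;
                    break;
                case "--embedding":
                    out.embedding = true;
                    break;
                case "--model":
                    out.model = value(args, ++i, flag);
                    break;
                case "--host":
                    out.host = value(args, ++i, flag);
                    break;
                case "--port":
                    out.port = port(value(args, ++i, flag));
                    break;
                default:
                    throw new IllegalArgumentException("unknown flag: " + flag);
            }
        }
        // Assumption: a model path is mandatory unless the user only asked for help.
        if (!out.help && out.model == null) {
            throw new IllegalArgumentException("--model is required");
        }
        return out;
    }

    /** Returns the value following a flag, or fails if the flag was the last token. */
    private static String value(String[] args, int index, String flag) {
        if (index >= args.length) {
            throw new IllegalArgumentException(flag + " needs a value");
        }
        return args[index];
    }

    private static int port(String text) {
        int port;
        try {
            port = Integer.parseInt(text);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("--port must be an integer: " + text, e);
        }
        if (port < 0 || port > 65535) {
            throw new IllegalArgumentException("--port out of range: " + port);
        }
        return port;
    }

    public static void main(String[] args) {
        ServerArgsSketch parsed = parse("--model", "model.gguf", "--port", "8080");
        System.out.println(parsed.model + " on " + parsed.host + ":" + parsed.port);
    }
}
```

Throwing `IllegalArgumentException` with a readable message keeps the parser testable on its own; a `main` method can catch it, print usage, and exit.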