Add OpenAI-compatible HTTP endpoint for local model serving #240
Merged
Expose a loaded LlamaModel over an OpenAI-compatible HTTP API so editors that
speak the OpenAI protocol (e.g. VS Code Copilot "Custom Endpoint") can drive a
local GGUF model in-process, with no extra runtime dependency (the JDK's built-in
com.sun.net.httpserver).
Native (jllama.cpp, json_helpers.hpp):
- requestChatCompletionStream + receiveChatCompletionChunk: streaming chat that
formats each result as an OpenAI chat.completion.chunk (incl. streamed
delta.tool_calls) via TASK_RESPONSE_TYPE_OAI_CHAT. Uniform {data,stop} envelope
built by the new pure helper wrap_stream_chunk (+5 C++ tests; 445 total).
Java:
- LlamaModel.streamChatCompletion(params, sink) drives the native chunk stream;
ChatStreamChunkParser parses the envelope.
- New net.ladenthin.llama.server package: OpenAiCompatServer (POST
/v1/chat/completions streaming via SSE + non-streaming, GET /v1/models, SSE
heartbeats to survive CPU prefill, optional bearer auth, binds 127.0.0.1),
OpenAiServerConfig, OpenAiRequestMapper (messages/tools forwarded verbatim),
OpenAiSseFormatter, plus a ChatBackend seam for model-free HTTP tests and a
CLI launcher.
- module-info: export the server package, requires jdk.httpserver.
Tests: pure-Java mapper/SSE/parser unit tests + a full HTTP test driven over a
socket with a fake backend (no model/native lib). Architecture rule refined to
allow the supported com.sun.net.httpserver API. Runnable OpenAiServerExample and
README/CLAUDE/TODO documentation.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014L2dLbAtwdq7C6a2gFRsQQ
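The SSE framing described above (chunk events, heartbeats during prefill, a final `[DONE]`) can be illustrated with a minimal sketch. It is not the PR's OpenAiSseFormatter: the class name `SseFramingSketch`, its method names, and the `keep-alive` heartbeat text are assumptions made for illustration. Only the `data:` framing, the blank-line event terminator, and the leading-colon comment line come from the Server-Sent Events format itself.

```java
/** Hypothetical sketch of SSE framing; not the PR's OpenAiSseFormatter. */
public final class SseFramingSketch {

    private SseFramingSketch() {}

    /** Frames one JSON payload as a single SSE event. */
    public static String dataEvent(String json) {
        StringBuilder sb = new StringBuilder();
        // A payload with embedded newlines needs one "data:" line per payload line.
        for (String line : json.split("\n", -1)) {
            sb.append("data: ").append(line).append('\n');
        }
        // A blank line terminates the event.
        return sb.append('\n').toString();
    }

    /** A comment line (leading ':') that SSE clients ignore; it only keeps the socket busy. */
    public static String heartbeat() {
        return ": keep-alive\n\n";
    }

    /** The terminator that OpenAI-style streams send after the last chunk. */
    public static String done() {
        return "data: [DONE]\n\n";
    }

    public static void main(String[] args) {
        // Assumed example payload in the chat.completion.chunk shape.
        String chunk = "{\"object\":\"chat.completion.chunk\","
                + "\"choices\":[{\"index\":0,\"delta\":{\"content\":\"Hi\"}}]}";
        System.out.print(heartbeat());
        System.out.print(dataEvent(chunk));
        System.out.print(done());
    }
}
```

A heartbeat written as a comment line is invisible to the client's event handler, which is presumably why it is a safe way to keep the connection open while a slow CPU prefill has not yet produced a first token.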
Add .github/workflows/clang-format.yml: install clang-format 22.1.5 (pinned via pip for reproducibility across CI and local checkouts) and fail the build via `clang-format --dry-run --Werror` over all hand-written C++ (src/main/cpp + src/test/cpp). The generated JNI header src/main/cpp/jllama.h (from `javac -h`) is intentionally excluded.
Reformat the whole C++ tree with that version so the check is green, and document the pinned version plus the bump procedure in CLAUDE.md. Whitespace-only: all 445 C++ tests still pass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014L2dLbAtwdq7C6a2gFRsQQ
Reuse the reasoning model the CI pipeline already downloads (TestConstants.REASONING_MODEL_PATH = models/Qwen3-0.6B-Q4_K_M.gguf), so no extra download is needed. Qwen3-0.6B is instruct-tuned and tool-calling capable, so it exercises the real native chat + streaming path (including the tools/use_jinja path) end-to-end over a socket.
Self-skips via Assume when the model file is absent, matching the existing model-gated tests, so a model-free `mvn test` is unaffected.
Assertions are structural (valid chat.completion, stream emits chunks + [DONE], a tools request returns a valid message object) because a 0.6B model's wording and whether it elects to call a tool are non-deterministic; the deterministic chunk and tool-call plumbing stays covered by OpenAiCompatServerHttpTest with a fake backend.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014L2dLbAtwdq7C6a2gFRsQQ
Both OpenAiCompatServerHttpTest and OpenAiCompatServerIntegrationTest duplicated the post/get/read/readAll helpers and the Response holder. Move them into a new abstract OpenAiServerTestSupport (intentionally not named *Test, so the harness never runs it on its own); both classes now extend it and call the inherited post(port, path, body, auth) / get(port, path, auth).
Behaviour is unchanged: the fake-backend HTTP tests pass and the model-gated integration test still self-skips when the model is absent.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014L2dLbAtwdq7C6a2gFRsQQ
…273)
OpenAiCompatServer's CLI main() called Integer.parseInt on --ctx, --gpu-layers, --parallel and --port without guarding NumberFormatException, so a non-numeric value (e.g. "--port abc") crashed with a raw stack trace, flagged by CodeQL as "missing catch of NumberFormatException" (5 alerts).
Consolidate the numeric parsing into one try/catch that prints a clear message and returns (no System.exit, because the noSystemExit architecture rule forbids it; this mirrors the existing missing-"--model" usage path), and parse --ctx once instead of twice.
Verified: compiles clean under -Werror / NullAway / Checker; LlamaArchitectureTest (noSystemExit) passes.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014L2dLbAtwdq7C6a2gFRsQQ
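The consolidated parsing described in this commit might look roughly like the sketch below. It is an illustration under assumptions, not the code from the PR: the class `NumericArgsSketch`, the `parseNumeric` helper, and the `Map`-based argument representation are all hypothetical. What it shows is the pattern itself: one try/catch around every `Integer.parseInt`, a readable message on failure, and a return value instead of `System.exit`.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of guarded numeric flag parsing; names are illustrative only. */
public final class NumericArgsSketch {

    private NumericArgsSketch() {}

    /**
     * Parses each listed flag that is present in {@code raw}.
     * Returns an empty Optional (after printing a message) if any value is not an integer,
     * so the caller can simply return from main() without calling System.exit.
     */
    public static Optional<Map<String, Integer>> parseNumeric(Map<String, String> raw, String... flags) {
        Map<String, Integer> parsed = new LinkedHashMap<>();
        String current = "";
        try {
            for (String flag : flags) {
                current = flag; // remembered so the error message can name the offending flag
                String value = raw.get(flag);
                if (value != null) {
                    parsed.put(flag, Integer.parseInt(value));
                }
            }
        } catch (NumberFormatException e) {
            System.err.println("Invalid numeric value for " + current + ": " + raw.get(current));
            return Optional.empty();
        }
        return Optional.of(parsed);
    }

    public static void main(String[] args) {
        Map<String, String> raw = new LinkedHashMap<>();
        raw.put("--port", "abc"); // the failing input quoted in the commit message
        Optional<Map<String, Integer>> result =
                parseNumeric(raw, "--ctx", "--gpu-layers", "--parallel", "--port");
        if (!result.isPresent()) {
            return; // usage error already reported; no System.exit
        }
        System.out.println(result.get());
    }
}
```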
…sort
The clang-format enforcement commit ran with SortIncludes: CaseSensitive, which alphabetically reordered includes and moved jni_helpers.hpp / json_helpers.hpp ahead of the server-*.h headers that define the `json` alias they depend on. CI then failed to build jllama with "'json' does not name a type" cascading through json_helpers.hpp / jni_helpers.hpp / jllama.cpp on the manylinux, Linux aarch64, Android and C++-test compilers. (A local build masks it: the local toolchain resolves `json` regardless of include order.)
Set SortIncludes: Never in .clang-format (the project has order-sensitive includes, documented in CLAUDE.md) and restore the required order (server-*.h + utils.hpp before the helper headers) in jllama.cpp and the affected C++ test files. Document the constraint in CLAUDE.md.
clang-format --dry-run --Werror stays clean; the affected targets rebuild and the C++ tests pass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014L2dLbAtwdq7C6a2gFRsQQ
The model-download steps used `curl -L --fail` with no retry, so a single transient HTTP 429 (HuggingFace rate-limiting) failed the whole job, e.g. the default codellama-7b.Q2_K.gguf download in run 27778671637.
Add `--retry 5 --retry-all-errors` to all 31 model-download curls (the Linux bash blocks and the Windows pwsh block). curl honors the server's Retry-After header and backs off on 429/5xx/connection errors; `--fail` still fails the step only after the retries are exhausted.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014L2dLbAtwdq7C6a2gFRsQQ
The Win32 (x86) C++ test job intermittently failed at build-time gtest_discover_tests. llama/ggml/mtmd are linked statically into one large jllama_test binary; on 32-bit Windows its startup plus --gtest_list_tests enumeration sits near the default 5s discovery timeout on shared CI runners. The same b9682 binary discovered within 5s in the #239 merge run but was killed at the 5s timeout in this run (process still alive, empty output: a timeout, not a crash); the b9682 upgrade and 5 newly added tests nudged a marginal case over the limit. x64, Linux and macOS finish well under the default and are unaffected.
Raise DISCOVERY_TIMEOUT to 120s (a maximum, so fast platforms still return immediately), which keeps full C++ test coverage on x86 rather than skipping the binary there.
Verified locally: 445/445 C++ tests still pass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014L2dLbAtwdq7C6a2gFRsQQ
Make PR #240 mergeable on top of the NanoHTTPD OpenAI server that landed via #242. Both OpenAI-compatible server implementations now coexist in net.ladenthin.llama.server, pending a "best of both" consolidation (tracked in TODO.md):
- OpenAiCompatServer (PR #240): JDK com.sun.net.httpserver, streaming SSE with delta.tool_calls, no new runtime dependency.
- LlamaServer (#242): NanoHTTPD, non-streaming, fat-jar Main-Class, plus /v1/completions, /v1/embeddings and /health.
Conflicts resolved:
- server/package-info.java (add/add): documents both servers + pending merge.
- README.md, CLAUDE.md: keep both server sections under one heading.
- TODO.md: add a consolidation task; note SSE is already solved by OpenAiCompatServer and that com.sun.net.httpserver is the supported jdk.httpserver module, not an internal com.sun.. API.
Auto-merged and verified consistent: LlamaModel.java (distinct native methods on each side), publish.yml, pom.xml (NanoHTTPD + assembly), spotbugs-exclude.xml, and LlamaArchitectureTest.java. The Server layer on main already permits this session's server classes (they touch only the Api root + Parameters), and the noInternalJdkImports com.sun.net.httpserver exception merges alongside.
Verified: mvn compile + test-compile clean; 64 model-free tests pass (LlamaArchitectureTest + both servers' unit/HTTP tests, integration test self-skips without a model); javadoc jar builds clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014L2dLbAtwdq7C6a2gFRsQQ
bernardladenthin pushed a commit that referenced this pull request on Jun 20, 2026
…p NanoHTTPD
Two interim OpenAI-compatible servers coexisted in net.ladenthin.llama.server (PR #240's JDK com.sun.net.httpserver streaming server on top of #242's NanoHTTPD blocking server). Settle on one: keep the JDK + SSE-streaming core, absorb the NanoHTTPD server's extra routes / CLI / fat-jar entry point, then delete it.
Survivor: OpenAiCompatServer (dependency-free, embeddable, fat-jar Main-Class).
- Streaming chat via SSE with delta.tool_calls + prefill heartbeats (unchanged).
- Ported routes: POST /v1/completions, POST /v1/embeddings, GET /health.
- Broadened the model-free test seam ChatBackend -> OpenAiBackend (+ completions, embeddings); LlamaModelChatBackend -> LlamaModelBackend forwards the two new routes to handleCompletionsOai / handleEmbeddings.
- New testable CLI parser OpenAiServerCli (short/long/alias flags, --help, validation) replacing the inline arg map and the deleted LlamaServerArgs; produces ModelParameters + OpenAiServerConfig.
Deleted NanoHTTPD impl: LlamaServer, LlamaServerArgs, LlamaServerConfig, OaiHttpServer, OaiRouter, OaiBackend, OaiResponse, LlamaModelOaiBackend (+ OaiRouterTest, LlamaServerArgsTest, OaiHttpServerIntegrationTest).
Reconciliation:
- pom.xml: drop org.nanohttpd dependency + version; assembly Main-Class -> OpenAiCompatServer.
- spotbugs-exclude.xml: retarget CC_CYCLOMATIC_COMPLEXITY to OpenAiServerCli.parse; drop the LlamaModelOaiBackend EI_EXPOSE_REP2 entry (survivor is package-private, like the old LlamaModelChatBackend, which needed none).
- LlamaArchitectureTest Server layer + com.sun.net.httpserver exception and module-info `requires jdk.httpserver` unchanged (still correct for the survivor).
- LlamaModel javadoc link, README, CLAUDE.md, TODO.md, publish.yml comment updated; removed the consolidation block and the now-moot "implement SSE" TODO (its premise that com.sun.net.httpserver is ArchUnit-banned was wrong: it is the supported, exported jdk.httpserver module).
C++ (jllama.cpp / json_helpers.hpp / wrap_stream_chunk + its tests) unchanged: the streaming path survives.
Verification (model-free): mvn compile test-compile; targeted tests (LlamaArchitectureTest, OpenAiRequestMapperTest, OpenAiSseFormatterTest, ChatStreamChunkParserTest, OpenAiCompatServerHttpTest, OpenAiServerCliTest) all green; javadoc:jar clean; spotless:check clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF
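To make the "dependency-free" claim concrete, the following sketch serves a GET /health route from the JDK's `com.sun.net.httpserver`, bound to 127.0.0.1 with an optional bearer token. It is not the surviving OpenAiCompatServer: the class `HealthRouteSketch`, the JSON bodies, and the status codes chosen for the error cases are assumptions. Only the use of the JDK server, the loopback bind, the optional bearer auth, and the /health path are taken from the commit messages above.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of a loopback-only health route; not the PR's OpenAiCompatServer. */
public final class HealthRouteSketch {

    private HealthRouteSketch() {}

    /** Starts a server on 127.0.0.1 with an ephemeral port; a null token disables auth. */
    public static HttpServer start(String token) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/health", exchange -> handleHealth(exchange, token));
        server.start();
        return server;
    }

    private static void handleHealth(HttpExchange exchange, String token) throws IOException {
        try {
            if (!"GET".equals(exchange.getRequestMethod())) {
                send(exchange, 405, "{\"error\":\"method not allowed\"}");
                return;
            }
            String auth = exchange.getRequestHeaders().getFirst("Authorization");
            if (token != null && !("Bearer " + token).equals(auth)) {
                send(exchange, 401, "{\"error\":\"unauthorized\"}");
                return;
            }
            send(exchange, 200, "{\"status\":\"ok\"}"); // body shape is an assumption
        } finally {
            exchange.close();
        }
    }

    private static void send(HttpExchange exchange, int status, String json) throws IOException {
        byte[] body = json.getBytes(StandardCharsets.UTF_8);
        exchange.getResponseHeaders().set("Content-Type", "application/json");
        exchange.sendResponseHeaders(status, body.length);
        try (OutputStream os = exchange.getResponseBody()) {
            os.write(body);
        }
    }

    /** Issues GET /health against the local server and returns the HTTP status code. */
    public static int probe(int port, String token) throws IOException {
        HttpURLConnection conn = (HttpURLConnection)
                URI.create("http://127.0.0.1:" + port + "/health").toURL().openConnection();
        if (token != null) {
            conn.setRequestProperty("Authorization", "Bearer " + token);
        }
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = start("secret");
        try {
            int port = server.getAddress().getPort();
            System.out.println("with token:    " + probe(port, "secret"));
            System.out.println("without token: " + probe(port, null));
        } finally {
            server.stop(0);
        }
    }
}
```

The token comparison here uses plain `String.equals` for brevity; a server exposed beyond loopback would likely want a constant-time comparison such as `MessageDigest.isEqual`.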