Add OpenAI-compatible HTTP endpoint for local model serving by bernardladenthin · Pull Request #240 · bernardladenthin/java-llama.cpp

bernardladenthin · 2026-06-18T17:20:50Z

Expose a loaded LlamaModel over an OpenAI-compatible HTTP API so editors that
speak the OpenAI protocol (e.g. VS Code Copilot "Custom Endpoint") can drive a
local GGUF model in-process, with no extra runtime dependency (the JDK's built-in
com.sun.net.httpserver).

Native (jllama.cpp, json_helpers.hpp):

requestChatCompletionStream + receiveChatCompletionChunk: streaming chat that
formats each result as an OpenAI chat.completion.chunk (incl. streamed
delta.tool_calls) via TASK_RESPONSE_TYPE_OAI_CHAT. Uniform {data,stop} envelope
built by the new pure helper wrap_stream_chunk (+5 C++ tests; 445 total).

Java:

LlamaModel.streamChatCompletion(params, sink) drives the native chunk stream;
ChatStreamChunkParser parses the envelope.
New net.ladenthin.llama.server package: OpenAiCompatServer (POST
/v1/chat/completions streaming via SSE + non-streaming, GET /v1/models, SSE
heartbeats to survive CPU prefill, optional bearer auth, binds 127.0.0.1),
OpenAiServerConfig, OpenAiRequestMapper (messages/tools forwarded verbatim),
OpenAiSseFormatter, plus a ChatBackend seam for model-free HTTP tests and a
CLI launcher.
module-info: export the server package, requires jdk.httpserver.
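
For readers who have not used the JDK's built-in server, here is a minimal sketch of the loopback bind, the optional bearer check and a GET /v1/models route. It is not the code in this PR: the class name `LoopbackModelsServerSketch`, its helper methods, the model id `local-gguf` and the exact JSON body are all assumptions made for illustration.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch only; not the PR's OpenAiCompatServer. */
final class LoopbackModelsServerSketch {

    /** True when no token is configured, or the header is exactly "Bearer <token>". */
    static boolean isAuthorized(String authorizationHeader, String expectedToken) {
        if (expectedToken == null || expectedToken.isEmpty()) {
            return true; // bearer auth is optional
        }
        return ("Bearer " + expectedToken).equals(authorizationHeader);
    }

    /** Minimal model-list body; modelId is assumed to need no JSON escaping. */
    static String modelsJson(String modelId) {
        return "{\"object\":\"list\",\"data\":[{\"id\":\"" + modelId
                + "\",\"object\":\"model\"}]}";
    }

    /** Starts a server bound to the loopback address only; port 0 picks a free port. */
    static HttpServer start(int port, String modelId, String token) throws IOException {
        HttpServer server = HttpServer.create(
                new InetSocketAddress(InetAddress.getLoopbackAddress(), port), 0);
        server.createContext("/v1/models", exchange -> {
            try {
                if (!"GET".equals(exchange.getRequestMethod())) {
                    exchange.sendResponseHeaders(405, -1); // -1: no response body
                    return;
                }
                String auth = exchange.getRequestHeaders().getFirst("Authorization");
                if (!isAuthorized(auth, token)) {
                    exchange.sendResponseHeaders(401, -1);
                    return;
                }
                byte[] body = modelsJson(modelId).getBytes(StandardCharsets.UTF_8);
                exchange.getResponseHeaders().set("Content-Type", "application/json");
                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(body);
                }
            } finally {
                exchange.close();
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = start(0, "local-gguf", null); // assumed model id, no auth
        System.out.println("listening on " + server.getAddress());
        server.stop(0);
    }
}
```

Keeping `isAuthorized` and `modelsJson` as pure static helpers means they can be unit-tested without opening a socket, in the same spirit as the model-free tests described here.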

Tests: pure-Java mapper/SSE/parser unit tests + a full HTTP test driven over a
socket with a fake backend (no model/native lib). Architecture rule refined to
allow the supported com.sun.net.httpserver API. Runnable OpenAiServerExample and
README/CLAUDE/TODO documentation.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014L2dLbAtwdq7C6a2gFRsQQ
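
As a rough illustration of the SSE framing the description refers to, the sketch below renders chunk JSON as `data:` events, ends the stream with the `[DONE]` sentinel, and uses an SSE comment line as a heartbeat. The class `SseFramesSketch` and its method names are hypothetical, and sending heartbeats as comment lines is an assumption about how a server might do it, not a statement of what OpenAiSseFormatter does.

```java
import java.util.List;

/** Hypothetical sketch of SSE framing; not the PR's OpenAiSseFormatter. */
final class SseFramesSketch {

    /** One event: a single "data:" line plus the blank line that ends the event. */
    static String dataFrame(String singleLineJson) {
        // Assumes the JSON contains no raw newlines; multi-line payloads
        // would need one "data:" line per line of payload.
        return "data: " + singleLineJson + "\n\n";
    }

    /** Terminal sentinel used by OpenAI-style streaming responses. */
    static String doneFrame() {
        return "data: [DONE]\n\n";
    }

    /** A line starting with ':' is an SSE comment; clients ignore it. */
    static String heartbeatFrame() {
        return ": heartbeat\n\n";
    }

    /** Renders all chunks followed by the terminal sentinel. */
    static String render(List<String> chunkJsons) {
        StringBuilder out = new StringBuilder();
        for (String chunk : chunkJsons) {
            out.append(dataFrame(chunk));
        }
        return out.append(doneFrame()).toString();
    }

    public static void main(String[] args) {
        System.out.print(heartbeatFrame());
        System.out.print(render(List.of("{\"object\":\"chat.completion.chunk\"}")));
    }
}
```

A comment frame keeps the connection active during a long prefill without giving the client a payload to parse.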


Add .github/workflows/clang-format.yml: install clang-format 22.1.5 (pinned via pip for reproducibility across CI and local checkouts) and fail the build via `clang-format --dry-run --Werror` over all hand-written C++ (src/main/cpp + src/test/cpp). The generated JNI header src/main/cpp/jllama.h (from `javac -h`) is intentionally excluded. Reformat the whole C++ tree with that version so the check is green, and document the pinned version plus the bump procedure in CLAUDE.md. Whitespace-only — all 445 C++ tests still pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014L2dLbAtwdq7C6a2gFRsQQ

Reuse the reasoning model the CI pipeline already downloads (TestConstants.REASONING_MODEL_PATH = models/Qwen3-0.6B-Q4_K_M.gguf) — no extra download. Qwen3-0.6B is instruct-tuned and tool-calling capable, so it exercises the real native chat + streaming path (including the tools/use_jinja path) end-to-end over a socket. Self-skips via Assume when the model file is absent, matching the existing model-gated tests, so a model-free `mvn test` is unaffected. Assertions are structural (valid chat.completion, stream emits chunks + [DONE], a tools request returns a valid message object) because a 0.6B model's wording and whether it elects to call a tool are non-deterministic; the deterministic chunk and tool-call plumbing stays covered by OpenAiCompatServerHttpTest with a fake backend. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014L2dLbAtwdq7C6a2gFRsQQ
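
To make "structural" concrete, the check below accepts any stream that emits at least one chat.completion.chunk payload and ends with `[DONE]`, without looking at the generated wording. It is only a sketch: `SseStructureCheckSketch` and its methods are hypothetical names, and the real integration test may assert differently.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical structural check for an SSE response body. */
final class SseStructureCheckSketch {

    /** Collects the payload of every "data:" line; comment and blank lines are skipped. */
    static List<String> dataPayloads(String sseBody) {
        List<String> payloads = new ArrayList<>();
        for (String line : sseBody.split("\r?\n")) {
            if (line.startsWith("data:")) {
                payloads.add(line.substring("data:".length()).trim());
            }
        }
        return payloads;
    }

    /** True when there is at least one chunk and the final payload is [DONE]. */
    static boolean looksLikeCompletedStream(String sseBody) {
        List<String> payloads = dataPayloads(sseBody);
        if (payloads.size() < 2) {
            return false; // need one chunk plus the sentinel
        }
        if (!"[DONE]".equals(payloads.get(payloads.size() - 1))) {
            return false;
        }
        for (int i = 0; i < payloads.size() - 1; i++) {
            // Checks only the object type, never the model's wording.
            if (!payloads.get(i).contains("\"chat.completion.chunk\"")) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String body = ": heartbeat\n\n"
                + "data: {\"object\":\"chat.completion.chunk\"}\n\n"
                + "data: [DONE]\n\n";
        System.out.println(looksLikeCompletedStream(body));
    }
}
```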

Both OpenAiCompatServerHttpTest and OpenAiCompatServerIntegrationTest duplicated the post/get/read/readAll helpers and the Response holder. Move them into a new abstract OpenAiServerTestSupport (intentionally not named *Test, so the harness never runs it on its own); both classes now extend it and call the inherited post(port, path, body, auth) / get(port, path, auth). Behaviour is unchanged: the fake-backend HTTP tests pass and the model-gated integration test still self-skips when the model is absent. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014L2dLbAtwdq7C6a2gFRsQQ

…273) OpenAiCompatServer's CLI main() called Integer.parseInt on --ctx, --gpu-layers, --parallel and --port without guarding NumberFormatException, so a non-numeric value (e.g. "--port abc") crashed with a raw stack trace — flagged by CodeQL as "missing catch of NumberFormatException" (5 alerts). Consolidate the numeric parsing into one try/catch that prints a clear message and returns (no System.exit — the noSystemExit architecture rule forbids it; this mirrors the existing missing-"--model" usage path), and parse --ctx once instead of twice. Verified: compiles clean under -Werror / NullAway / Checker; LlamaArchitectureTest (noSystemExit) passes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014L2dLbAtwdq7C6a2gFRsQQ
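
The shape of that fix can be sketched as below: every numeric flag goes through one try/catch, a bad value produces a plain message, and the caller returns instead of calling System.exit. The class `NumericFlagsSketch`, the map-based input and the message text are assumptions for illustration, not the code that landed in main().

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of consolidated numeric flag parsing. */
final class NumericFlagsSketch {

    /**
     * Parses every raw flag value as an int inside a single try/catch.
     * An empty result means a message was printed and the caller should return.
     */
    static Optional<Map<String, Integer>> parseNumericFlags(Map<String, String> raw) {
        Map<String, Integer> parsed = new LinkedHashMap<>();
        String current = "";
        try {
            for (Map.Entry<String, String> entry : raw.entrySet()) {
                current = entry.getKey() + " " + entry.getValue();
                parsed.put(entry.getKey(), Integer.parseInt(entry.getValue()));
            }
        } catch (NumberFormatException e) {
            System.err.println("Invalid numeric value: " + current);
            return Optional.empty(); // no System.exit; the caller simply returns
        }
        return Optional.of(parsed);
    }

    public static void main(String[] args) {
        Map<String, String> raw = new LinkedHashMap<>();
        raw.put("--ctx", "4096");
        raw.put("--port", "abc"); // non-numeric on purpose
        Optional<Map<String, Integer>> flags = parseNumericFlags(raw);
        if (!flags.isPresent()) {
            return; // mirrors a usage-style early return
        }
        System.out.println(flags.get());
    }
}
```

Returning an empty Optional keeps the decision to stop in main(), so the parser can be tested without terminating the JVM.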

…sort The clang-format enforcement commit ran with SortIncludes: CaseSensitive, which alphabetically reordered includes and moved jni_helpers.hpp / json_helpers.hpp ahead of the server-*.h headers that define the `json` alias they depend on. CI then failed to build jllama with "'json' does not name a type" cascading through json_helpers.hpp / jni_helpers.hpp / jllama.cpp on the manylinux, Linux aarch64, Android and C++-test compilers. (A local build masks it — the local toolchain resolves `json` regardless of include order.) Set SortIncludes: Never in .clang-format (the project has order-sensitive includes, documented in CLAUDE.md) and restore the required order — server-*.h + utils.hpp before the helper headers — in jllama.cpp and the affected C++ test files. Document the constraint in CLAUDE.md. clang-format --dry-run --Werror stays clean; the affected targets rebuild and the C++ tests pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014L2dLbAtwdq7C6a2gFRsQQ

The model-download steps used `curl -L --fail` with no retry, so a single transient HTTP 429 (HuggingFace rate-limiting) failed the whole job — e.g. the default codellama-7b.Q2_K.gguf download in run 27778671637. Add `--retry 5 --retry-all-errors` to all 31 model-download curls (the Linux bash blocks and the Windows pwsh block). curl honors the server's Retry-After header and backs off on 429/5xx/connection errors; `--fail` still fails the step only after the retries are exhausted. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014L2dLbAtwdq7C6a2gFRsQQ

The Win32 (x86) C++ test job intermittently failed at build-time gtest_discover_tests. llama/ggml/mtmd are linked statically into one large jllama_test binary; on 32-bit Windows its startup plus --gtest_list_tests enumeration sits near the default 5s discovery timeout on shared CI runners. The same b9682 binary discovered within 5s in the #239 merge run but was killed at the 5s timeout in this run (process still alive, empty output — a timeout, not a crash); the b9682 upgrade and 5 newly added tests nudged a marginal case over the limit. x64, Linux and macOS finish well under the default and are unaffected. Raise DISCOVERY_TIMEOUT to 120s (a maximum, so fast platforms still return immediately), which keeps full C++ test coverage on x86 rather than skipping the binary there. Verified locally: 445/445 C++ tests still pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014L2dLbAtwdq7C6a2gFRsQQ

Make PR #240 mergeable on top of the NanoHTTPD OpenAI server that landed via #242. Both OpenAI-compatible server implementations now coexist in net.ladenthin.llama.server, pending a "best of both" consolidation (tracked in TODO.md):

- OpenAiCompatServer (PR #240): JDK com.sun.net.httpserver, streaming SSE with delta.tool_calls, no new runtime dependency.
- LlamaServer (#242): NanoHTTPD, non-streaming, fat-jar Main-Class, plus /v1/completions, /v1/embeddings and /health.

Conflicts resolved:

- server/package-info.java (add/add): documents both servers + pending merge.
- README.md, CLAUDE.md: keep both server sections under one heading.
- TODO.md: add a consolidation task; note SSE is already solved by OpenAiCompatServer and that com.sun.net.httpserver is the supported jdk.httpserver module, not an internal com.sun.* API.

Auto-merged and verified consistent: LlamaModel.java (distinct native methods on each side), publish.yml, pom.xml (NanoHTTPD + assembly), spotbugs-exclude.xml, and LlamaArchitectureTest.java: main's Server layer already permits this session's server classes (they touch only the Api root + Parameters), and the noInternalJdkImports com.sun.net.httpserver exception merges alongside.

Verified: mvn compile + test-compile clean; 64 model-free tests pass (LlamaArchitectureTest + both servers' unit/HTTP tests, integration test self-skips without a model); javadoc jar builds clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014L2dLbAtwdq7C6a2gFRsQQ

…p NanoHTTPD

Two interim OpenAI-compatible servers coexisted in net.ladenthin.llama.server (PR #240's JDK com.sun.net.httpserver streaming server on top of #242's NanoHTTPD blocking server). Settle on one: keep the JDK + SSE-streaming core, absorb the NanoHTTPD server's extra routes / CLI / fat-jar entry point, then delete it.

Survivor: OpenAiCompatServer (dependency-free, embeddable, fat-jar Main-Class).

- Streaming chat via SSE with delta.tool_calls + prefill heartbeats (unchanged).
- Ported routes: POST /v1/completions, POST /v1/embeddings, GET /health.
- Broadened the model-free test seam ChatBackend -> OpenAiBackend (+ completions, embeddings); LlamaModelChatBackend -> LlamaModelBackend forwards the two new routes to handleCompletionsOai / handleEmbeddings.
- New testable CLI parser OpenAiServerCli (short/long/alias flags, --help, validation) replacing the inline arg map and the deleted LlamaServerArgs; produces ModelParameters + OpenAiServerConfig.

Deleted NanoHTTPD impl: LlamaServer, LlamaServerArgs, LlamaServerConfig, OaiHttpServer, OaiRouter, OaiBackend, OaiResponse, LlamaModelOaiBackend (+ OaiRouterTest, LlamaServerArgsTest, OaiHttpServerIntegrationTest).

Reconciliation:

- pom.xml: drop org.nanohttpd dependency + version; assembly Main-Class -> OpenAiCompatServer.
- spotbugs-exclude.xml: retarget CC_CYCLOMATIC_COMPLEXITY to OpenAiServerCli.parse; drop the LlamaModelOaiBackend EI_EXPOSE_REP2 entry (survivor is package-private, like the old LlamaModelChatBackend, which needed none).
- LlamaArchitectureTest Server layer + com.sun.net.httpserver exception and module-info `requires jdk.httpserver` unchanged (still correct for the survivor).
- LlamaModel javadoc link, README, CLAUDE.md, TODO.md, publish.yml comment updated; removed the consolidation block and the now-moot "implement SSE" TODO (its premise that com.sun.net.httpserver is ArchUnit-banned was wrong: it is the supported, exported jdk.httpserver module).

C++ (jllama.cpp / json_helpers.hpp / wrap_stream_chunk + its tests) unchanged: the streaming path survives.

Verification (model-free): mvn compile test-compile; targeted tests (LlamaArchitectureTest, OpenAiRequestMapperTest, OpenAiSseFormatterTest, ChatStreamChunkParserTest, OpenAiCompatServerHttpTest, OpenAiServerCliTest) all green; javadoc:jar clean; spotless:check clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF

claude added 4 commits June 18, 2026 16:53

bernardladenthin temporarily deployed to startgate June 18, 2026 17:20 — with GitHub Actions Inactive

github-advanced-security AI found potential problems Jun 18, 2026

bernardladenthin temporarily deployed to startgate June 18, 2026 17:34 — with GitHub Actions Inactive

bernardladenthin temporarily deployed to startgate June 18, 2026 17:50 — with GitHub Actions Inactive

bernardladenthin temporarily deployed to startgate June 18, 2026 18:06 — with GitHub Actions Inactive

bernardladenthin temporarily deployed to startgate June 18, 2026 18:59 — with GitHub Actions Inactive

bernardladenthin had a problem deploying to startgate June 18, 2026 23:36 — with GitHub Actions Error

bernardladenthin merged commit af13cf0 into main Jun 18, 2026
9 of 12 checks passed

bernardladenthin deleted the claude/cool-hypatia-m7kcu3 branch June 18, 2026 23:37

bernardladenthin mentioned this pull request Jun 19, 2026

Consolidate OpenAI server: unify implementations, add multi-protocol support #243

Merged

7 tasks

bernardladenthin merged 9 commits into main from claude/cool-hypatia-m7kcu3

3 participants
