Add OpenAI-compatible HTTP endpoint for local model serving #240

Merged: bernardladenthin merged 9 commits into main from claude/cool-hypatia-m7kcu3 on Jun 18, 2026

@bernardladenthin (Owner)

Expose a loaded LlamaModel over an OpenAI-compatible HTTP API so editors that
speak the OpenAI protocol (e.g. VS Code Copilot "Custom Endpoint") can drive a
local GGUF model in-process, with no extra runtime dependency (the JDK's built-in
com.sun.net.httpserver).

Native (jllama.cpp, json_helpers.hpp):

  • requestChatCompletionStream + receiveChatCompletionChunk: streaming chat that
    formats each result as an OpenAI chat.completion.chunk (incl. streamed
    delta.tool_calls) via TASK_RESPONSE_TYPE_OAI_CHAT. Uniform {data,stop} envelope
    built by the new pure helper wrap_stream_chunk (+5 C++ tests; 445 total).

Java:

  • LlamaModel.streamChatCompletion(params, sink) drives the native chunk stream;
    ChatStreamChunkParser parses the envelope.
  • New net.ladenthin.llama.server package: OpenAiCompatServer (POST
    /v1/chat/completions streaming via SSE + non-streaming, GET /v1/models, SSE
    heartbeats to survive CPU prefill, optional bearer auth, binds 127.0.0.1),
    OpenAiServerConfig, OpenAiRequestMapper (messages/tools forwarded verbatim),
    OpenAiSseFormatter, plus a ChatBackend seam for model-free HTTP tests and a
    CLI launcher.
  • module-info: export the server package, requires jdk.httpserver.
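
For readers unfamiliar with the wire format, the SSE framing that a streaming /v1/chat/completions endpoint emits can be sketched as below. This is a minimal illustration under assumptions, not the project's OpenAiSseFormatter: the class name SseFrames, its method names, and the heartbeat text are hypothetical. Only the `data:` line prefix, the blank-line event terminator, the `:` comment syntax, and the `[DONE]` sentinel come from the SSE format and the OpenAI streaming convention.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not the project's OpenAiSseFormatter. */
final class SseFrames {
    private SseFrames() {}

    /** Wraps a JSON payload as one SSE event (one "data:" line per payload line). */
    static String data(String json) {
        StringBuilder sb = new StringBuilder();
        // A raw newline would end the field early, so each payload line gets its own prefix.
        for (String line : json.split("\n", -1)) {
            sb.append("data: ").append(line).append('\n');
        }
        return sb.append('\n').toString(); // the blank line terminates the event
    }

    /** Terminal sentinel that OpenAI-style streaming clients wait for. */
    static String done() {
        return "data: [DONE]\n\n";
    }

    /** Comment frame: lines starting with ':' are ignored by SSE parsers. */
    static String heartbeat() {
        return ": heartbeat\n\n";
    }

    /** SSE streams are UTF-8 encoded on the wire. */
    static byte[] utf8(String frame) {
        return frame.getBytes(StandardCharsets.UTF_8);
    }
}
```

A comment frame is a natural fit for the prefill heartbeat because it keeps the connection busy without producing an event the client has to handle.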

Tests: pure-Java mapper/SSE/parser unit tests + a full HTTP test driven over a
socket with a fake backend (no model/native lib). Architecture rule refined to
allow the supported com.sun.net.httpserver API. Runnable OpenAiServerExample and
README/CLAUDE/TODO documentation.
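
A model-free HTTP test of the kind described above might look roughly like the following sketch. It is illustrative only: LoopbackServerSketch, its Backend interface, and the probe helper are hypothetical names, not the project's ChatBackend or OpenAiServerTestSupport. Only the com.sun.net.httpserver calls and the 127.0.0.1 binding reflect what the description states.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

final class LoopbackServerSketch {
    /** Hypothetical seam: the HTTP layer sees only this interface, so tests can fake it. */
    interface Backend {
        String modelsJson();
    }

    /** A null apiKey disables auth; otherwise the header must match "Bearer <key>". */
    static boolean authorized(String header, String apiKey) {
        if (apiKey == null) return true;
        if (header == null) return false;
        byte[] expected = ("Bearer " + apiKey).getBytes(StandardCharsets.UTF_8);
        // MessageDigest.isEqual avoids an early exit at the first differing byte.
        return MessageDigest.isEqual(expected, header.getBytes(StandardCharsets.UTF_8));
    }

    static HttpServer start(Backend backend, String apiKey) throws IOException {
        // Port 0 asks the OS for a free port; 127.0.0.1 keeps the server off the network.
        InetSocketAddress addr = new InetSocketAddress(InetAddress.getByName("127.0.0.1"), 0);
        HttpServer server = HttpServer.create(addr, 0);
        server.createContext("/v1/models", exchange -> {
            String header = exchange.getRequestHeaders().getFirst("Authorization");
            if (authorized(header, apiKey)) {
                send(exchange, 200, backend.modelsJson());
            } else {
                send(exchange, 401, "{\"error\":\"unauthorized\"}");
            }
        });
        server.start();
        return server;
    }

    private static void send(HttpExchange exchange, int status, String body) throws IOException {
        byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
        exchange.getResponseHeaders().set("Content-Type", "application/json");
        exchange.sendResponseHeaders(status, bytes.length);
        try (OutputStream out = exchange.getResponseBody()) {
            out.write(bytes);
        }
    }

    /** Starts a server with a fake backend, issues one GET, and returns the HTTP status. */
    static int probe(String apiKey, String sentKey) {
        try {
            HttpServer server = start(() -> "{\"object\":\"list\",\"data\":[]}", apiKey);
            try {
                URI uri = URI.create("http://127.0.0.1:" + server.getAddress().getPort() + "/v1/models");
                HttpURLConnection c = (HttpURLConnection) uri.toURL().openConnection(Proxy.NO_PROXY);
                c.setConnectTimeout(5000);
                c.setReadTimeout(5000);
                if (sentKey != null) c.setRequestProperty("Authorization", "Bearer " + sentKey);
                return c.getResponseCode();
            } finally {
                server.stop(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the backend is an interface, the same handler code runs against a lambda in tests and against the real model in production, which is what lets the HTTP tests run without a model or the native library.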

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014L2dLbAtwdq7C6a2gFRsQQ

claude added 4 commits June 18, 2026 16:53
Add .github/workflows/clang-format.yml: install clang-format 22.1.5 (pinned via
pip for reproducibility across CI and local checkouts) and fail the build via
`clang-format --dry-run --Werror` over all hand-written C++ (src/main/cpp +
src/test/cpp). The generated JNI header src/main/cpp/jllama.h (from `javac -h`)
is intentionally excluded.

Reformat the whole C++ tree with that version so the check is green, and document
the pinned version plus the bump procedure in CLAUDE.md. Whitespace-only — all
445 C++ tests still pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014L2dLbAtwdq7C6a2gFRsQQ

Reuse the reasoning model the CI pipeline already downloads
(TestConstants.REASONING_MODEL_PATH = models/Qwen3-0.6B-Q4_K_M.gguf) — no extra
download. Qwen3-0.6B is instruct-tuned and tool-calling capable, so it exercises
the real native chat + streaming path (including the tools/use_jinja path)
end-to-end over a socket. Self-skips via Assume when the model file is absent,
matching the existing model-gated tests, so a model-free `mvn test` is unaffected.

Assertions are structural (valid chat.completion, stream emits chunks + [DONE], a
tools request returns a valid message object) because a 0.6B model's wording and
whether it elects to call a tool are non-deterministic; the deterministic chunk
and tool-call plumbing stays covered by OpenAiCompatServerHttpTest with a fake
backend.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014L2dLbAtwdq7C6a2gFRsQQ

Both OpenAiCompatServerHttpTest and OpenAiCompatServerIntegrationTest duplicated
the post/get/read/readAll helpers and the Response holder. Move them into a new
abstract OpenAiServerTestSupport (intentionally not named *Test, so the harness
never runs it on its own); both classes now extend it and call the inherited
post(port, path, body, auth) / get(port, path, auth).

Behaviour is unchanged: the fake-backend HTTP tests pass and the model-gated
integration test still self-skips when the model is absent.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014L2dLbAtwdq7C6a2gFRsQQ
Comment thread src/main/java/net/ladenthin/llama/server/OpenAiCompatServer.java Fixed
…273)

OpenAiCompatServer's CLI main() called Integer.parseInt on --ctx, --gpu-layers,
--parallel and --port without guarding NumberFormatException, so a non-numeric
value (e.g. "--port abc") crashed with a raw stack trace — flagged by CodeQL as
"missing catch of NumberFormatException" (5 alerts).

Consolidate the numeric parsing into one try/catch that prints a clear message and
returns (no System.exit — the noSystemExit architecture rule forbids it; this
mirrors the existing missing-"--model" usage path), and parse --ctx once instead
of twice.

Verified: compiles clean under -Werror / NullAway / Checker; LlamaArchitectureTest
(noSystemExit) passes.
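
The consolidated parsing described above could be sketched as follows. This is an illustration under assumptions rather than the code in OpenAiCompatServer.main(): the class NumericArgsSketch, the Map-based input, and the error wording are hypothetical. The four flag names and the "report and return, no System.exit" behaviour are taken from the commit message.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

final class NumericArgsSketch {
    // The four integer-valued flags named in the commit message.
    private static final String[] NUMERIC_FLAGS = {"--ctx", "--gpu-layers", "--parallel", "--port"};

    /**
     * Parses every numeric flag inside one try/catch. On bad input it prints a
     * message and returns empty, so the caller can return from main() without System.exit.
     */
    static Optional<Map<String, Integer>> parseNumeric(Map<String, String> raw) {
        Map<String, Integer> parsed = new LinkedHashMap<>();
        String current = "";
        try {
            for (String flag : NUMERIC_FLAGS) {
                current = flag; // remembered so the error message can name the offender
                String value = raw.get(flag);
                if (value != null) {
                    parsed.put(flag, Integer.parseInt(value.trim()));
                }
            }
        } catch (NumberFormatException e) {
            System.err.println("Invalid integer for " + current + ": " + raw.get(current));
            return Optional.empty();
        }
        return Optional.of(parsed);
    }
}
```

Each flag is parsed exactly once, and a caller would simply return when the result is empty, mirroring the missing-"--model" usage path.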

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014L2dLbAtwdq7C6a2gFRsQQ
…sort

The clang-format enforcement commit ran with SortIncludes: CaseSensitive, which
alphabetically reordered includes and moved jni_helpers.hpp / json_helpers.hpp
ahead of the server-*.h headers that define the `json` alias they depend on. CI
then failed to build jllama with "'json' does not name a type" cascading through
json_helpers.hpp / jni_helpers.hpp / jllama.cpp on the manylinux, Linux aarch64,
Android and C++-test compilers. (A local build masks it — the local toolchain
resolves `json` regardless of include order.)

Set SortIncludes: Never in .clang-format (the project has order-sensitive includes,
documented in CLAUDE.md) and restore the required order — server-*.h + utils.hpp
before the helper headers — in jllama.cpp and the affected C++ test files. Document
the constraint in CLAUDE.md. clang-format --dry-run --Werror stays clean; the
affected targets rebuild and the C++ tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014L2dLbAtwdq7C6a2gFRsQQ

The model-download steps used `curl -L --fail` with no retry, so a single
transient HTTP 429 (HuggingFace rate-limiting) failed the whole job — e.g. the
default codellama-7b.Q2_K.gguf download in run 27778671637. Add
`--retry 5 --retry-all-errors` to all 31 model-download curls (the Linux bash
blocks and the Windows pwsh block). curl honors the server's Retry-After header
and backs off on 429/5xx/connection errors; `--fail` still fails the step only
after the retries are exhausted.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014L2dLbAtwdq7C6a2gFRsQQ

The Win32 (x86) C++ test job intermittently failed at build-time
gtest_discover_tests. llama/ggml/mtmd are linked statically into one
large jllama_test binary; on 32-bit Windows its startup plus
--gtest_list_tests enumeration sits near the default 5s discovery
timeout on shared CI runners. The same b9682 binary discovered within
5s in the #239 merge run but was killed at the 5s timeout in this run
(process still alive, empty output — a timeout, not a crash); the b9682
upgrade and 5 newly added tests nudged a marginal case over the limit.
x64, Linux and macOS finish well under the default and are unaffected.

Raise DISCOVERY_TIMEOUT to 120s (a maximum, so fast platforms still
return immediately), which keeps full C++ test coverage on x86 rather
than skipping the binary there. Verified locally: 445/445 C++ tests
still pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014L2dLbAtwdq7C6a2gFRsQQ

Make PR #240 mergeable on top of the NanoHTTPD OpenAI server that landed
via #242. Both OpenAI-compatible server implementations now coexist in
net.ladenthin.llama.server, pending a "best of both" consolidation
(tracked in TODO.md):

- OpenAiCompatServer (PR #240): JDK com.sun.net.httpserver, streaming SSE
  with delta.tool_calls, no new runtime dependency.
- LlamaServer (#242): NanoHTTPD, non-streaming, fat-jar Main-Class, plus
  /v1/completions, /v1/embeddings and /health.

Conflicts resolved:
- server/package-info.java (add/add): documents both servers + pending merge.
- README.md, CLAUDE.md: keep both server sections under one heading.
- TODO.md: add a consolidation task; note SSE is already solved by
  OpenAiCompatServer and that com.sun.net.httpserver is the supported
  jdk.httpserver module, not an internal com.sun.* API.

Auto-merged and verified consistent: LlamaModel.java (distinct native
methods on each side), publish.yml, pom.xml (NanoHTTPD + assembly),
spotbugs-exclude.xml, and LlamaArchitectureTest.java — main's Server
layer already permits this session's server classes (they touch only the
Api root + Parameters), and the noInternalJdkImports com.sun.net.httpserver
exception merges alongside.

Verified: mvn compile + test-compile clean; 64 model-free tests pass
(LlamaArchitectureTest + both servers' unit/HTTP tests, integration test
self-skips without a model); javadoc jar builds clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014L2dLbAtwdq7C6a2gFRsQQ
@bernardladenthin merged commit af13cf0 into main on Jun 18, 2026
9 of 12 checks passed
@bernardladenthin deleted the claude/cool-hypatia-m7kcu3 branch on June 18, 2026 23:37
bernardladenthin pushed a commit that referenced this pull request on Jun 20, 2026
…p NanoHTTPD

Two interim OpenAI-compatible servers coexisted in net.ladenthin.llama.server
(PR #240's JDK com.sun.net.httpserver streaming server on top of #242's NanoHTTPD
blocking server). Settle on one: keep the JDK + SSE-streaming core, absorb the
NanoHTTPD server's extra routes / CLI / fat-jar entry point, then delete it.

Survivor: OpenAiCompatServer (dependency-free, embeddable, fat-jar Main-Class).
- Streaming chat via SSE with delta.tool_calls + prefill heartbeats (unchanged).
- Ported routes: POST /v1/completions, POST /v1/embeddings, GET /health.
- Broadened the model-free test seam ChatBackend -> OpenAiBackend (+ completions,
  embeddings); LlamaModelChatBackend -> LlamaModelBackend forwards the two new
  routes to handleCompletionsOai / handleEmbeddings.
- New testable CLI parser OpenAiServerCli (short/long/alias flags, --help,
  validation) replacing the inline arg map and the deleted LlamaServerArgs;
  produces ModelParameters + OpenAiServerConfig.
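
As a rough sketch of what a testable short/long flag parser can look like (not the actual OpenAiServerCli): the alias table below is hypothetical. Only the long names --model, --ctx, --port and --help appear in this PR's text; the short forms, the class name FlagAliasSketch, and the Map result are invented for illustration, and the real parser is described as producing ModelParameters + OpenAiServerConfig instead.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class FlagAliasSketch {
    // Hypothetical alias table: every spelling maps to one canonical long name.
    private static final Map<String, String> CANONICAL = Map.of(
            "-m", "--model", "--model", "--model",
            "-c", "--ctx", "--ctx", "--ctx",
            "-p", "--port", "--port", "--port",
            "-h", "--help", "--help", "--help");

    /** Returns canonical flag -> raw value; throws IllegalArgumentException on bad input. */
    static Map<String, String> parse(String... argv) {
        Map<String, String> out = new LinkedHashMap<>();
        for (int i = 0; i < argv.length; i++) {
            String canonical = CANONICAL.get(argv[i]);
            if (canonical == null) {
                throw new IllegalArgumentException("Unknown option: " + argv[i]);
            }
            if (canonical.equals("--help")) {
                out.put(canonical, "true"); // boolean flag, takes no value
                continue;
            }
            if (i + 1 >= argv.length) {
                throw new IllegalArgumentException("Missing value for " + argv[i]);
            }
            out.put(canonical, argv[++i]); // consume the value token
        }
        return out;
    }
}
```

Keeping the parser a pure function of its arguments is what makes it unit-testable without starting a server; main() can catch the exception, print usage, and return.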

Deleted NanoHTTPD impl: LlamaServer, LlamaServerArgs, LlamaServerConfig,
OaiHttpServer, OaiRouter, OaiBackend, OaiResponse, LlamaModelOaiBackend
(+ OaiRouterTest, LlamaServerArgsTest, OaiHttpServerIntegrationTest).

Reconciliation:
- pom.xml: drop org.nanohttpd dependency + version; assembly Main-Class ->
  OpenAiCompatServer.
- spotbugs-exclude.xml: retarget CC_CYCLOMATIC_COMPLEXITY to OpenAiServerCli.parse;
  drop the LlamaModelOaiBackend EI_EXPOSE_REP2 entry (survivor is package-private,
  like the old LlamaModelChatBackend, which needed none).
- LlamaArchitectureTest Server layer + com.sun.net.httpserver exception and
  module-info `requires jdk.httpserver` unchanged (still correct for the survivor).
- LlamaModel javadoc link, README, CLAUDE.md, TODO.md, publish.yml comment updated;
  removed the consolidation block and the now-moot "implement SSE" TODO (its premise
  that com.sun.net.httpserver is ArchUnit-banned was wrong: it is the supported,
  exported jdk.httpserver module).

C++ (jllama.cpp / json_helpers.hpp / wrap_stream_chunk + its tests) unchanged: the
streaming path survives.

Verification (model-free): mvn compile test-compile; targeted tests
(LlamaArchitectureTest, OpenAiRequestMapperTest, OpenAiSseFormatterTest,
ChatStreamChunkParserTest, OpenAiCompatServerHttpTest, OpenAiServerCliTest) all
green; javadoc:jar clean; spotless:check clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF
3 participants