Consolidate OpenAI server: unify implementations, add multi-protocol support #243
Merged
bernardladenthin pushed a commit that referenced this pull request on Jun 19, 2026
OpenAiServerEmbeddingsIntegrationTest loaded CodeLlama-7B with enableEmbedding() only, which leaves the pooling type NONE (CodeLlama's GGUF reports pooling = -1). The OpenAI /v1/embeddings path (LlamaModel.handleEmbeddings with oaicompat=true) rejects pooling NONE, so both test methods received HTTP 500 instead of 200 (Java Tests Ubuntu job on PR #243). Set .setPoolingType(PoolingType.MEAN) so CodeLlama produces a single pooled sentence vector the OAI endpoint can return (MEAN/LAST both work for decoder-only models, per LlamaEmbeddingsTest). The low-level LlamaModel#embed path is unaffected. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF
…p NanoHTTPD
Two interim OpenAI-compatible servers coexisted in net.ladenthin.llama.server (PR #240's JDK com.sun.net.httpserver streaming server on top of #242's NanoHTTPD blocking server). Settle on one: keep the JDK + SSE-streaming core, absorb the NanoHTTPD server's extra routes / CLI / fat-jar entry point, then delete it.
Survivor: OpenAiCompatServer (dependency-free, embeddable, fat-jar Main-Class).
- Streaming chat via SSE with delta.tool_calls + prefill heartbeats (unchanged).
- Ported routes: POST /v1/completions, POST /v1/embeddings, GET /health.
- Broadened the model-free test seam ChatBackend -> OpenAiBackend (+ completions, embeddings); LlamaModelChatBackend -> LlamaModelBackend forwards the two new routes to handleCompletionsOai / handleEmbeddings.
- New testable CLI parser OpenAiServerCli (short/long/alias flags, --help, validation) replacing the inline arg map and the deleted LlamaServerArgs; produces ModelParameters + OpenAiServerConfig.
Deleted NanoHTTPD impl: LlamaServer, LlamaServerArgs, LlamaServerConfig, OaiHttpServer, OaiRouter, OaiBackend, OaiResponse, LlamaModelOaiBackend (+ OaiRouterTest, LlamaServerArgsTest, OaiHttpServerIntegrationTest).
Reconciliation:
- pom.xml: drop org.nanohttpd dependency + version; assembly Main-Class -> OpenAiCompatServer.
- spotbugs-exclude.xml: retarget CC_CYCLOMATIC_COMPLEXITY to OpenAiServerCli.parse; drop the LlamaModelOaiBackend EI_EXPOSE_REP2 entry (survivor is package-private, like the old LlamaModelChatBackend, which needed none).
- LlamaArchitectureTest Server layer + com.sun.net.httpserver exception and module-info `requires jdk.httpserver` unchanged (still correct for the survivor).
- LlamaModel javadoc link, README, CLAUDE.md, TODO.md, publish.yml comment updated; removed the consolidation block and the now-moot "implement SSE" TODO (its premise that com.sun.net.httpserver is ArchUnit-banned was wrong: it is the supported, exported jdk.httpserver module).
C++ (jllama.cpp / json_helpers.hpp / wrap_stream_chunk + its tests) unchanged: the streaming path survives.
Verification (model-free): mvn compile test-compile; targeted tests (LlamaArchitectureTest, OpenAiRequestMapperTest, OpenAiSseFormatterTest, ChatStreamChunkParserTest, OpenAiCompatServerHttpTest, OpenAiServerCliTest) all green; javadoc:jar clean; spotless:check clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF
…ends
Implements the XS+S recommendations from the IDE/agent backend investigation,
targeting agentic tool-calling (Qwen) and local autocomplete:
XS:
- POST /infill route (FIM autocomplete: llama.vscode/Twinny/Tabby/Continue) —
forwards verbatim to the existing native handleInfill; FIM tokens applied
server-side from GGUF metadata. New OpenAiBackend.infill + LlamaModelBackend.
- Tolerant routing: every route also reachable without the /v1 prefix.
- cache_prompt defaulted true in the chat mapper (KV-prefix reuse for IDE latency).
- C++ regression guard (#20198): assert tool_calls.function.arguments is a JSON
STRING, not an object — passes against pinned b9682, so agentic tool-calling is
wire-correct for the OpenAI SDK / Roo Code / Copilot agent.
S:
- stream_options.include_usage passthrough: OpenAiRequestMapper forwards the
stream_options object verbatim (new InferenceParameters.withStreamOptions) so
the native server emits the trailing usage chunk OpenAI clients expect.
- cached_tokens safety net: OpenAiSseFormatter.ensureUsageCachedTokens guarantees
usage.prompt_tokens_details.cached_tokens is present on the streamed usage chunk,
fixing the documented Copilot custom-endpoint crash (microsoft/vscode #273482)
regardless of upstream. Applied in the SSE path; token-delta chunks pass through
unparsed.
- CORS: a com.sun.net.httpserver Filter answers OPTIONS preflights with 204 +
Access-Control-Allow-{Origin,Methods,Headers} and stamps Allow-Origin on every
response. New OpenAiServerConfig.corsAllowOrigin (default "*").
Tests: +infill/alias/CORS HTTP tests, +stream_options mapper test, +5
ensureUsageCachedTokens unit tests, +1 C++ arguments-as-string guard. Full server
+ json + arch suite green (77 model-free tests); C++ tool-call/stream suite green.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF
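The CORS behaviour described in this commit (204 on OPTIONS preflights, Allow-Origin on every response) could be sketched roughly as follows. This is an illustration only: the class name CorsFilterSketch, the concrete Allow-Methods / Allow-Headers values, and the probePreflight helper are assumptions and are not taken from the PR; only the com.sun.net.httpserver API calls are real.

```java
import com.sun.net.httpserver.Filter;
import com.sun.net.httpserver.Headers;
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.nio.charset.StandardCharsets;

/** Hypothetical CORS filter sketch; not the project's actual class. */
final class CorsFilterSketch extends Filter {
    private final String allowOrigin;

    CorsFilterSketch(String allowOrigin) {
        this.allowOrigin = allowOrigin;
    }

    @Override
    public void doFilter(HttpExchange exchange, Filter.Chain chain) throws IOException {
        Headers headers = exchange.getResponseHeaders();
        // Stamp Allow-Origin on every response, preflight or not.
        headers.set("Access-Control-Allow-Origin", allowOrigin);
        if ("OPTIONS".equalsIgnoreCase(exchange.getRequestMethod())) {
            // Assumed header values; the PR does not list the exact ones.
            headers.set("Access-Control-Allow-Methods", "GET, POST, OPTIONS");
            headers.set("Access-Control-Allow-Headers", "Content-Type, Authorization");
            // A length of -1 means "no response body", as a 204 requires.
            exchange.sendResponseHeaders(204, -1);
            exchange.close();
            return;
        }
        chain.doFilter(exchange);
    }

    @Override
    public String description() {
        return "CORS preflight + Allow-Origin stamping (sketch)";
    }

    /** Starts a throwaway server on an ephemeral port and sends one preflight. */
    static String probePreflight() {
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/health", ex -> {
                byte[] body = "{\"status\":\"ok\"}".getBytes(StandardCharsets.UTF_8);
                ex.sendResponseHeaders(200, body.length);
                ex.getResponseBody().write(body);
                ex.close();
            }).getFilters().add(new CorsFilterSketch("*"));
            server.start();
            try {
                int port = server.getAddress().getPort();
                HttpURLConnection conn = (HttpURLConnection)
                        URI.create("http://127.0.0.1:" + port + "/health").toURL().openConnection();
                conn.setRequestMethod("OPTIONS");
                return conn.getResponseCode() + " " + conn.getHeaderField("Access-Control-Allow-Origin");
            } finally {
                server.stop(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Answering the preflight inside the filter keeps the route handlers unaware of CORS, which matches the "thin handler" direction of the rest of the PR.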
… docs
Continues the IDE/agent backend work (Medium items + documentation):
- POST /v1/rerank (+ /rerank, /reranking): RAG document reranking. Native
handleRerank (made public, consistent with the other handle* methods) returns
{document,index,score}; OaiRerankSupport reshapes it into the OpenAI rerank
response with sorted {index, relevance_score}, top_n, and a `data` alias of
`results` (Continue #6478). New OpenAiBackend.rerank + LlamaModelBackend.rerank.
- response_format passthrough (json_object / json_schema) for OpenAI structured
outputs (new InferenceParameters.withResponseFormat; mapper forwards verbatim).
- Vision: --mmproj CLI flag (image_url content parts already pass through verbatim).
- CLI: --reranking (enableReranking), --mmproj (setMmproj) on OpenAiServerCli.
Docs:
- New docs/feature-investigation-ide-agent-backend.md (the deep-research report +
an implementation-status preamble).
- README endpoints table + notes (rerank/infill, CORS, /v1-less aliases,
response_format, the Copilot inline-completion limitation), CLAUDE.md server bullet,
package-info, and TODO.md (DONE list + the deferred decisions: Ollama emulation,
Anthropic /v1/messages + OpenAI /v1/responses shims, Continue native /completion,
per-model FIM registry, /props).
Tests: +OaiRerankSupportTest (10), +rerank HTTP route, +response_format mapper test,
+--reranking/--mmproj CLI tests. Full server+json+arch suite green (138 tests);
javadoc + spotless clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF
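A rough sketch of the rerank reshaping described above: native {document,index,score} rows become sorted {index, relevance_score} entries, capped by top_n, with `data` aliasing `results`. The class RerankReshapeSketch, its method names, and the use of plain Map/List in place of a JSON tree are assumptions for illustration; the real OaiRerankSupport may differ.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the reshaping step; Maps play the role of JSON objects. */
final class RerankReshapeSketch {

    /** Builds one native-style row (the document text is omitted for brevity). */
    static Map<String, Object> row(int index, double score) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("index", index);
        row.put("score", score);
        return row;
    }

    /** Reshapes native rows; a topN of 0 or less is treated as "no cap" (assumption). */
    static Map<String, Object> reshape(List<Map<String, Object>> nativeRows, int topN) {
        List<Map<String, Object>> results = new ArrayList<>();
        for (Map<String, Object> row : nativeRows) {
            Map<String, Object> item = new LinkedHashMap<>();
            item.put("index", row.get("index"));
            item.put("relevance_score", row.get("score"));
            results.add(item);
        }
        // Highest relevance first.
        results.sort(Comparator.comparingDouble(
                (Map<String, Object> m) -> ((Number) m.get("relevance_score")).doubleValue())
                .reversed());
        if (topN > 0 && topN < results.size()) {
            results = new ArrayList<>(results.subList(0, topN));
        }
        Map<String, Object> response = new LinkedHashMap<>();
        response.put("results", results);
        // Same list under a second key for clients that read `data`.
        response.put("data", results);
        return response;
    }
}
```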
Adds an Ollama-compatible surface so Copilot's built-in Ollama provider and Ollama-hardcoded tools can drive the local model, translating to/from the internal OpenAI chat path (no second inference path):
- GET /api/version, GET /api/tags, POST /api/show: discovery; /api/show advertises capabilities (completion/tools/insert [+vision when --mmproj]) and context length, which Copilot reads to enable tools/vision and size prompts.
- POST /api/chat: non-streaming (single JSON) and streaming (NDJSON, one object per line, terminated by a "done":true line). Request options (num_predict→max_tokens, temperature/top_p/top_k/seed/stop) and `format` (json / schema → response_format) are mapped; Ollama tool-call arguments (object) ↔ OpenAI (JSON string) are converted both ways.
- ToolCallDeltaAccumulator: reusable helper that reconstructs whole tool calls from OpenAI streaming delta.tool_calls fragments (shared by the non-OpenAI shims that deliver tool calls whole). Streamed tool calls are emitted on the Ollama done line.
- OpenAiServerConfig.supportsVision (set by --mmproj) feeds the /api/show vision flag.
All pure translation lives in OllamaApiSupport + ToolCallDeltaAccumulator (model-free unit-tested); the server handlers are thin and reuse the OpenAiBackend seam.
Tests: +OllamaApiSupportTest (12), +ToolCallDeltaAccumulatorTest (3), +Ollama HTTP route tests (version/tags/show/chat non-stream + NDJSON stream). Server+arch suite green.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF
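To illustrate what reconstructing whole tool calls from delta.tool_calls fragments involves, here is a minimal sketch. It assumes the usual OpenAI streaming shape (each fragment carries an index, and id / name / arguments pieces may be absent on later fragments); the class ToolCallAccumulatorSketch and its flat string output are hypothetical and are not the project's ToolCallDeltaAccumulator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical accumulator sketch keyed by the fragment's tool-call index. */
final class ToolCallAccumulatorSketch {

    private static final class Call {
        String id;
        String name;
        final StringBuilder arguments = new StringBuilder();
    }

    // TreeMap keeps calls ordered by index regardless of arrival order.
    private final Map<Integer, Call> byIndex = new TreeMap<>();

    /** Feeds one fragment; null parts mean "not present in this fragment". */
    void accept(int index, String id, String name, String argumentsFragment) {
        Call call = byIndex.computeIfAbsent(index, i -> new Call());
        if (id != null) {
            call.id = id;
        }
        if (name != null) {
            call.name = name;
        }
        if (argumentsFragment != null) {
            // Arguments arrive as pieces of one JSON string; concatenate in order.
            call.arguments.append(argumentsFragment);
        }
    }

    /** Returns each completed call rendered as "id name arguments". */
    List<String> finish() {
        List<String> out = new ArrayList<>();
        for (Call call : byIndex.values()) {
            out.add(call.id + " " + call.name + " " + call.arguments);
        }
        return out;
    }
}
```

A shim that must deliver tool calls whole (for example on the Ollama done line) would call accept for every streamed fragment and finish once the stream ends.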
…shims
Adds two more client-protocol surfaces over the internal OpenAI chat core (pure translation, no second inference path; tool calls reconstructed from OpenAI delta.tool_calls via the shared ToolCallDeltaAccumulator):
Anthropic Messages (POST /v1/messages):
- Request: system string/blocks → system message; content blocks (text / tool_use / tool_result) flattened to OpenAI messages (a user tool_result → a role:"tool" message); Anthropic tools (input_schema) → OpenAI function tools; tool_choice auto/any mapped.
- Non-streaming response: text + tool_use content blocks, stop_reason mapping (tool_calls→tool_use, length→max_tokens), usage.
- Streaming: the Anthropic SSE event sequence via AnthropicStreamTranslator (message_start → text content block start/delta/stop → tool_use blocks → message_delta → message_stop), with heartbeats.
OpenAI Responses (POST /v1/responses):
- Request: instructions → system; input string/array (message / function_call / function_call_output items) → OpenAI messages; flat function tools → nested.
- Non-streaming response: a `response` object whose output holds a message item (output_text) + one function_call item per tool call, with usage.
- Streaming: the Responses SSE event sequence via ResponsesStreamTranslator (response.created → output_item.added/content_part.added → output_text.delta* → *.done → function_call items → response.completed), with monotonic sequence_number, and heartbeats.
Both surfaces are reachable with and without the /v1 prefix and behind the CORS filter.
Docs (README endpoints table, CLAUDE.md server bullet, package-info, TODO) updated; the two items are moved from "deferred" to done.
Tests: +AnthropicApiSupportTest (6), +AnthropicStreamTranslatorTest (4), +ResponsesApiSupportTest (5), +ResponsesStreamTranslatorTest (4), +Anthropic/Responses HTTP route tests (non-stream + stream). Full server+json+arch suite green (251 tests); javadoc + spotless clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF
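The stop_reason mapping named in the Anthropic bullet can be sketched as a small pure function. Only the two mappings stated above (tool_calls→tool_use, length→max_tokens) come from the commit message; the "end_turn" fallback for every other finish_reason, and the class name, are assumptions.

```java
/** Hypothetical sketch of the finish_reason -> stop_reason translation. */
final class AnthropicStopReasonSketch {

    static String toStopReason(String finishReason) {
        if (finishReason == null) {
            return "end_turn"; // assumption: treat a missing reason as a normal stop
        }
        switch (finishReason) {
            case "tool_calls":
                return "tool_use";   // stated in the commit message
            case "length":
                return "max_tokens"; // stated in the commit message
            default:
                return "end_turn";   // assumption for "stop" and anything unknown
        }
    }
}
```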
Fills the remaining tractable IDE-backend endpoints (those not blocked on a native streaming-completion path):
- GET /props (llama.cpp-native): reports default_generation_settings.n_ctx and a modalities block (vision from --mmproj), which autocomplete clients such as llama.vscode read to size their context window. OpenAiSseFormatter.propsJson; unauthenticated like /health.
- POST /api/generate (Ollama-native prompt completion / FIM): maps to the native /v1/completions handler, or to /infill when a `suffix` is present (FIM). Options (num_predict→max_tokens, temperature/top_p/top_k/seed/stop) are mapped. Non-streaming returns one JSON; stream:true returns NDJSON (a content line + a done line). Generation completes before emission (documented as a single content chunk), since there is no streaming raw-completion path (tracked in TODO as the shared blocker for streaming /v1/completions, token-streaming /api/generate, and Continue's native /completion).
Docs (README, CLAUDE.md, package-info, TODO) updated; TODO now records the streaming raw-completion JNI path as the one remaining blocker and trims the items it gates.
Tests: +/props + /api/generate translator unit tests (OllamaApiSupportTest, OpenAiSseFormatterTest) and HTTP route tests (props, generate non-stream/FIM/stream). Full server+json+arch suite green (261 tests); javadoc + spotless clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF
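The /api/generate translation rules above (option renaming and the `suffix` switch to /infill) might look roughly like this. The class OllamaGenerateSketch, the Map-based request model, and the choice to drop unrecognised options are assumptions made for the sketch; the real OllamaApiSupport is not shown in this PR text.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of the /api/generate request translation. */
final class OllamaGenerateSketch {

    /** A request carrying a `suffix` is treated as FIM and routed to /infill. */
    static String targetRoute(Map<String, Object> request) {
        return request.get("suffix") != null ? "/infill" : "/v1/completions";
    }

    /** Maps Ollama `options` onto OpenAI-style parameter names. */
    static Map<String, Object> mapOptions(Map<String, Object> options) {
        Map<String, Object> out = new LinkedHashMap<>();
        if (options == null) {
            return out;
        }
        for (Map.Entry<String, Object> entry : options.entrySet()) {
            switch (entry.getKey()) {
                case "num_predict":
                    out.put("max_tokens", entry.getValue());
                    break;
                case "temperature":
                case "top_p":
                case "top_k":
                case "seed":
                case "stop":
                    out.put(entry.getKey(), entry.getValue());
                    break;
                default:
                    // Assumption: options outside the documented list are dropped.
                    break;
            }
        }
        return out;
    }
}
```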
Make TODO.md reflect the shipped state accurately and capture every remaining item:
- Intro now lists the complete surface: the OpenAI routes plus /props and the three alternative protocol surfaces (Ollama /api/*, Anthropic /v1/messages, OpenAI /v1/responses); previously it stopped at /health.
- Add the one genuinely-open item that was missing: live end-to-end validation of the Ollama/Anthropic/Responses surfaces against a real model + real clients (today they are covered only by model-free unit + fake-backend HTTP tests; only the OpenAI chat path has a gated integration test).
Done items remain marked DONE; the deferred follow-ups (streaming raw-completion JNI path and what it gates, incremental tool-call streaming, per-model FIM registry, multi-model registry, Gemma 4 validation) are unchanged.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF
…cols
CI's test-java-linux-x86_64 job already has the three ingredients in one place: the Linux x86_64 native lib (downloaded artifact), the models (incl. Qwen3-0.6B), and `mvn test`. OpenAiCompatServerIntegrationTest therefore already round-trips the OpenAI chat path over a real socket. Extend that same gated fixture (same loaded Qwen3 model, self-skips when absent) to smoke the new surfaces end-to-end:
- Ollama /api/chat (non-stream + NDJSON stream) and discovery (/api/version, /api/tags, /api/show).
- Anthropic /v1/messages (non-stream + SSE event stream).
- OpenAI /v1/responses (non-stream + SSE event stream).
- /props.
Assertions are structural only (markers the translators always emit: Ollama "done":true, Anthropic event: message_start/message_stop, Responses event: response.created/response.completed, response object shape), so they are robust to the tiny model's wording, matching the existing chat round-trip's approach.
Embeddings / rerank / infill round-trips still need their own server fixtures in the matching mode (models already in CI); tracked in TODO.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF
…_result-only turn
Low-hanging unit-test coverage on the recently-added protocol translators (all pure,
model-free), plus one correctness fix surfaced by a new test:
Fix:
- AnthropicApiSupport: a user turn carrying only tool_result blocks now emits exactly
one role:"tool" message instead of that plus a spurious empty {"role":"user",
"content":""}. Guarded with a hadToolResult flag; assistant tool-call turns still
carry null content.
New / extended tests:
- OpenAiServerConfigTest (new): builder defaults, isAuthenticationEnabled (null/empty/
set), CORS + vision knobs, and the security contract that toString() never leaks the
API key (only authEnabled boolean).
- OpenAiServerCli: --mmproj now asserted to flip toServerConfig().supportsVision.
- AnthropicApiSupport: system-as-blocks concatenation + stop_sequences mapping,
tool_choice "any" -> "required", and the fixed tool_result-only branch.
- OllamaApiSupport: format-as-schema -> response_format json_schema; options.stop
forwarding.
- ResponsesApiSupport: no-instructions (no system message), assistant message item,
and non-function tools dropped.
- OpenAiCompatServerHttpTest: GET on the new POST routes (/v1/rerank, /v1/messages,
/v1/responses, /api/chat, /api/generate) returns 405 (shared requirePostJson preamble).
Server + arch suite green (150 model-free tests); javadoc + spotless clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF
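The tool_result-only fix can be pictured with a simplified flattening routine. Content blocks are modelled here as {type, payload} string pairs and messages as "role:content" strings; both simplifications, the class name, and the ordering of tool messages before the user text are assumptions, with only the hadToolResult guard taken from the commit message.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of flattening one Anthropic user turn. */
final class ToolResultTurnSketch {

    /** Each block is {type, payload}; type is "text" or "tool_result". */
    static List<String> flattenUserTurn(List<String[]> blocks) {
        List<String> messages = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        boolean hadToolResult = false;
        for (String[] block : blocks) {
            if ("tool_result".equals(block[0])) {
                hadToolResult = true;
                messages.add("tool:" + block[1]);
            } else if ("text".equals(block[0])) {
                text.append(block[1]);
            }
        }
        // Without the hadToolResult guard, a tool_result-only turn would also
        // emit an empty user message here.
        if (text.length() > 0 || !hadToolResult) {
            messages.add("user:" + text);
        }
        return messages;
    }
}
```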
…ompletion/infill/generate
Completes the live end-to-end coverage of the IDE-backend surfaces. Each fixture boots
a real server over a real socket in the matching model mode, reuses a model CI already
downloads, self-skips when absent, and asserts structural shapes only:
- OpenAiServerEmbeddingsIntegrationTest (CodeLlama-7B + enableEmbedding): POST
/v1/embeddings returns an OpenAI {object:list, data:[{object:embedding, embedding:[…]}]}
shape; also covers the bare /embeddings alias.
- OpenAiServerRerankIntegrationTest (jina-reranker + enableReranking): POST /v1/rerank
returns sorted {index, relevance_score} results capped by top_n, with the `data` alias.
- OpenAiServerCompletionIntegrationTest (CodeLlama-7B): POST /v1/completions, /infill, and
Ollama /api/generate (plain + FIM via `suffix`) — CodeLlama is FIM-capable per
LlamaModelTest#testGenerateInfill.
Also: add TestConstants.RERANKING_MODEL_PATH and route RerankingModelTest through it
(removes the duplicated literal). Used Java-8-safe idioms throughout.
These run in the same CI job that already round-trips the OpenAI chat path, so the
Ollama/Anthropic/Responses/embeddings/rerank/completion surfaces are now all validated
end-to-end against real models; only manual editor-client validation remains (TODO).
Server + arch suite green (integration fixtures self-skip without models locally);
javadoc + spotless clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF
…nsolidated server
The SonarQube "Build and analyze" job runs spotbugs at effort=Max/threshold=Low with fb-contrib + findsecbugs, which flagged 30 Low-priority findings across the new net.ladenthin.llama.server.* classes. Resolved as follows.
Fixed in code:
- Add Lombok @ToString to the stateful infra classes (AnthropicStreamTranslator, ResponsesStreamTranslator, ToolCallDeltaAccumulator, LlamaModelBackend, OpenAiCompatServer), the project's established remedy for IMC_IMMATURE_CLASS_NO_TOSTRING (see Java8CompatibilityHelper). @ToString on the server is leak-safe: it renders the config via OpenAiServerConfig.toString(), which already redacts the API key.
- Add @EqualsAndHashCode to the immutable OpenAiServerConfig value type (IMC_IMMATURE_CLASS_NO_EQUALS).
- printReady(): println('[')/println(']') instead of length-1 strings (UCPM).
Suppressed (documented in spotbugs-exclude.xml) as by-design / false positive on protocol-infrastructure code, mirroring the existing server PATH_TRAVERSAL/CRLF and OpenAiServerCli.parse entries:
- IMPROPER_UNICODE: equalsIgnoreCase on ASCII HTTP method tokens (RFC 7230/7231).
- LEST: the UncheckedIOException -> IOException unwrap rethrows the original cause.
- WEM: input-validation precondition guards (same shape as ChatMessage.requireNonNull).
- MDM: main() blocks forever to keep the JVM alive for the daemon HTTP threads.
- NOS: per-request stream write lock passed as a parameter (not a shared field).
- MRC: sseDone()/heartbeat() are self-documenting SSE protocol-token accessors.
- PRMC: ResponsesApiSupport.dataObject() is a fresh-node factory, not cacheable.
- IMC_NO_EQUALS on the identity-managed ToolCallDeltaAccumulator.
Verified locally: mvn spotbugs:check -> BUILD SUCCESS (0 bugs); spotless clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF
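The leak-safety argument rests on the config's toString() exposing only an authEnabled boolean. A minimal sketch of that contract, assuming a hypothetical ServerConfigSketch class with just a port and an API key (the real OpenAiServerConfig has more fields and a builder):

```java
/** Hypothetical config sketch showing a toString() that never prints the key. */
final class ServerConfigSketch {
    private final int port;
    private final String apiKey;

    ServerConfigSketch(int port, String apiKey) {
        this.port = port;
        this.apiKey = apiKey;
    }

    /** Authentication is on only when a non-empty key is configured. */
    boolean isAuthenticationEnabled() {
        return apiKey != null && !apiKey.isEmpty();
    }

    @Override
    public String toString() {
        // Render a boolean instead of the secret itself.
        return "ServerConfigSketch{port=" + port
                + ", authEnabled=" + isAuthenticationEnabled() + "}";
    }
}
```

Any enclosing class whose generated toString() delegates to this one inherits the redaction, which is the property the commit relies on.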
…p opaque-field toString
Two static-analysis findings on the new server code, both surfaced once the spotbugs fix let the Maven build reach the sonar:sonar step for the first time:
1. SonarCloud quality gate (E Reliability on New Code, S2095 BLOCKER): the four streaming handlers (streamChat / streamOllamaChat / streamAnthropic / streamResponses) opened os = exchange.getResponseBody() and closed it via the closeQuietly() helper in finally. Sonar's S2095 does not trace closes through a helper, so it saw a possibly-unclosed resource. Fixed by closing os directly in the finally, under the per-stream write lock so the close still never races an in-flight heartbeat write; the closeQuietly helper is removed. Also moved heartbeatExecutor.scheduleAtFixedRate inside the try so a scheduling failure can no longer leak the stream (a real, if rare, pre-existing leak).
2. CodeQL "Use of default toString()" on LlamaModelBackend (non-blocking alert): the @ToString added in the previous commit rendered fields whose classes only inherit Object.toString (the request mapper and native model handle; HttpServer and the CORS Filter on the server). Dropped @ToString from LlamaModelBackend and OpenAiCompatServer (opaque-resource/service classes where a generated toString only emits identity hashes) and suppressed IMC_IMMATURE_CLASS_NO_TOSTRING for them with rationale. The translator/accumulator classes keep @ToString (their fields render meaningful state and are CodeQL-clean).
Verified locally: spotbugs:check -> 0 bugs; 67 server unit tests pass (incl. the 32 OpenAiCompatServerHttpTest streaming-path tests over real sockets); javadoc + spotless clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF
…rCloud S2445)
The SonarCloud quality gate kept failing with "E Reliability Rating on New Code" even after the previous commit reworked the streaming close, and the gate was unchanged across that push: the signal that the Blocker reliability bug was the synchronization TARGET, not the close. The streaming write helpers synchronized on a writeLock passed as a method parameter (and, after the last commit, on a method-local Object), which SonarCloud flags as java:S2445 "Blocks should be synchronized on read-only fields", a Blocker reliability bug. (fb-contrib's NOS flagged the same code; that suppression masked it from spotbugs but Sonar evaluates it independently.)
Fix: introduce a small per-request AutoCloseable ResponseStream that owns the response OutputStream and a private final lock, and serializes writeStrict / writeQuietly / close on that owned lock. The four streaming handlers drive it via try-with-resources, so the stream is closed (under the lock, after the heartbeat is cancelled) on every path, which also satisfies S2095 and lets the now-stale NOS suppression be removed. Per-request locking is preserved (each request has its own lock), so independent concurrent streams never serialize against each other.
Verified locally: spotbugs:check -> 0 bugs; 48 server unit tests pass incl. the 32 OpenAiCompatServerHttpTest streaming-path tests over real sockets; spotless clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF
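A per-request stream wrapper of the kind described in the fix could look like the sketch below. The method names writeStrict / writeQuietly / close come from the commit message; the class name ResponseStreamSketch, the `closed` flag, and the demo helper are assumptions, and the real ResponseStream may handle errors differently.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: owns the stream and a private final lock (S2445-friendly). */
final class ResponseStreamSketch implements AutoCloseable {
    private final OutputStream out;
    private final Object lock = new Object(); // read-only field, one per request
    private boolean closed;                   // assumption: guards writes after close

    ResponseStreamSketch(OutputStream out) {
        this.out = out;
    }

    /** Writes a chunk and propagates I/O failures to the caller. */
    void writeStrict(String chunk) throws IOException {
        synchronized (lock) {
            if (closed) {
                throw new IOException("stream already closed");
            }
            out.write(chunk.getBytes(StandardCharsets.UTF_8));
            out.flush();
        }
    }

    /** Writes a chunk (e.g. a heartbeat) and reports failure instead of throwing. */
    boolean writeQuietly(String chunk) {
        synchronized (lock) {
            if (closed) {
                return false;
            }
            try {
                out.write(chunk.getBytes(StandardCharsets.UTF_8));
                out.flush();
                return true;
            } catch (IOException e) {
                return false;
            }
        }
    }

    @Override
    public void close() throws IOException {
        synchronized (lock) {
            if (!closed) {
                closed = true;
                out.close();
            }
        }
    }

    /** Usage demo: try-with-resources closes the stream on every path. */
    static String demo() {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        ResponseStreamSketch stream = new ResponseStreamSketch(sink);
        try (ResponseStreamSketch s = stream) {
            s.writeStrict("data: {\"delta\":\"hi\"}\n\n");
            s.writeQuietly(": heartbeat\n\n");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        boolean lateWrite = stream.writeQuietly(": heartbeat\n\n"); // after close
        return new String(sink.toByteArray(), StandardCharsets.UTF_8) + "|late=" + lateWrite;
    }
}
```

Because the lock is an instance field rather than a parameter or local, two concurrent requests each get their own wrapper and never contend with each other.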
…bjllama
First step toward driving the OpenAI-compatible server natively from JNI, shipped inside libjllama rather than as a standalone llama-server executable (a JNI .so/.dll/.dylib loads anywhere a JVM runs; a separate binary does not, which is the whole point of preferring the JNI path here). This commit only makes the HTTP layer build and link; there is no JNI route wiring yet.
What changed (CMakeLists.txt):
- Compile tools/server/server-http.cpp (the upstream server_http_context HTTP transport) and vendor/cpp-httplib/httplib.cpp directly into jllama, on all platforms (the getifaddrs API-24 gate cpp-httplib needs on Android is already satisfied by the existing __ANDROID_UNAVAILABLE_SYMBOLS_ARE_WEAK__ define).
- <cpp-httplib/httplib.h> already resolves via llama-common's vendor/ include dir, whose bundled nlohmann/json is the same 3.12.0 as our FetchContent copy, so nothing is shadowed and no extra include dir is required for it.
- Mirror upstream's cpp-httplib tuning defines (payload/URI/backlog limits, TCP_NODELAY) on jllama so httplib.cpp and the server-http.cpp that includes httplib.h agree on the inline behaviour those macros control.
- Silence httplib.cpp warnings (-w / /w), matching upstream's own target.
- Link ws2_32 on MinGW (MSVC auto-links it via a pragma in httplib.h).
- No SSL: CPPHTTPLIB_OPENSSL_SUPPORT is left undefined (plain HTTP for now; bind localhost or front with a TLS proxy).
WebUI stub (src/main/cpp/webui_stub/ui.h):
- server-http.cpp does #include "ui.h", the asset table that tools/ui (llama-ui) normally GENERATES via the llama-ui-embed host tool. We do not ship the Svelte WebUI (it needs npm or a prebuilt-asset download), so this header supplies the exact "empty asset table" interface embed.cpp emits for n_assets == 0: the llama_ui_asset struct plus llama_ui_find_asset / llama_ui_use_gzip / llama_ui_get_assets.
- LLAMA_UI_HAS_ASSETS is intentionally left undefined, so every static-asset-serving block in server-http.cpp compiles out; the single unguarded use iterates the (empty) asset list. Header-only (.h) so it is outside the clang-format glob, which only covers *.cpp/*.hpp.
server.cpp (standalone main() + route wiring) stays excluded; wiring those routes to a JNI entry point is the next step.
Verified locally (Linux x86_64):
- cmake --build --target jllama -> [100%] Built target jllama (clean).
- libjllama.so contains server_http_context::init/start/stop (T) and ~1.8k httplib symbols, with zero undefined server-http/httplib symbols.
- NativeLibraryLoadSmokeTest: Tests run: 1, Failures: 0, Skipped: 0 (the larger lib still loads and JNI_OnLoad resolves every referenced Java class).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF
…uilds
Implements the build-once-share-artifact approach for embedding the llama.cpp Svelte WebUI into libjllama, so the in-process server (server-http.cpp) serves the real UI instead of the empty-asset stub. The repo commits no build outputs, so the WebUI is produced per-pipeline and never checked in (same policy as the native libs).
CI (.github/workflows/publish.yml):
- New build-webui job (ubuntu; the only job that runs npm): resolves the pinned b<nnnn> tag from CMakeLists.txt GIT_TAG, sparse-checks-out ggml-org/llama.cpp@<tag> tools/ui, runs the upstream Svelte build (npm ci && npm run build), gzips dist/ (LLAMA_UI_GZIP parity), builds the self-contained llama-ui-embed host tool (plain C++17, no npm) and runs it to produce the platform-independent webui-generated/ ui.cpp + ui.h, uploaded as the webui-generated artifact.
- All 10 release-artifact build jobs now use needs:[startgate, build-webui] and download the artifact into webui-generated/ before building. npm never runs in the dockcross cross-compilers (no node) or per-platform; it runs only once, in one job.
CMake (CMakeLists.txt "WebUI assets" block):
- When webui-generated/ui.cpp + ui.h are present, compile ui.cpp in and add its dir to the include path; the generated ui.h #defines LLAMA_UI_HAS_ASSETS, activating server-http.cpp's static-asset routes (compiled out under the stub). When absent, fall back to the empty-asset stub webui_stub/ui.h so local builds and any job without the artifact still build and run (no embedded UI). The WebUI version auto-follows GIT_TAG, so a llama.cpp bump needs no extra step.
webui-generated/ is git-ignored; CLAUDE.md documents the pipeline + a local recipe.
Verified locally (Linux x86_64) with the real llama-ui-embed tool (no npm) on a synthetic 9-asset dist: generated ui.cpp/ui.h carry LLAMA_UI_HAS_ASSETS + use_gzip (9 assets); jllama rebuilds with server-http.cpp's asset routes compiled in, ui.cpp compiled, libjllama.so linked (llama_ui_get_assets/find_asset defined, 0 undefined); NativeLibraryLoadSmokeTest passes; removing webui-generated/ -> CMake reports the stub fallback and jllama still builds. publish.yml parses (pyyaml); exactly the 10 native build jobs gate on build-webui.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF
…/npm
The SonarCloud quality gate failed on exactly two conditions (Reliability E and Security C on new code), driven by 2 Blocker bugs + 25 Major security findings. (The ~45 other annotations are maintainability code smells that do not affect the gate.)
Reliability (java:S2095, OpenAiCompatServer.main L888/L889): Sonar did not trace that the LlamaModel and OpenAiCompatServer are closed by the shutdown hook, so it flagged them as never closed. Refactor main() to hold both in a try-with-resources; a two-latch shutdown keeps termination graceful and race-free (the hook signals stopRequested then waits on cleanedUp, so the JVM, which blocks until shutdown hooks return, does not halt until the close has run). This also closes the model if server startup throws, which the previous code did not.
Security (.github/workflows/publish.yml):
- npm ci -> npm ci --ignore-scripts in the build-webui job, so dependency lifecycle scripts do not run during install (the WebUI build still runs via `npm run build`).
- Every curl model-download now passes --proto =https --proto-redir =https, so neither the URL nor any redirect can downgrade to cleartext HTTP (the URLs are already https; this enforces it). 31 invocations hardened.
These are exactly the 2 reliability + 25 security issues SonarCloud listed, so both ratings should return to A.
Verified: mvn spotless:apply + compile clean; publish.yml parses.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF
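The two-latch shutdown can be sketched with two CountDownLatch instances. The latch names stopRequested and cleanedUp come from the commit message; everything else (the Resource stand-in for the model and server, the simulate helper, and starting the hook thread directly instead of registering it with Runtime.getRuntime().addShutdownHook) is an assumption made so the sketch can run without terminating the JVM.

```java
import java.util.concurrent.CountDownLatch;

/** Hypothetical simulation of the two-latch shutdown handshake. */
final class TwoLatchShutdownSketch {

    /** Stand-in for LlamaModel / OpenAiCompatServer; logs when it is closed. */
    static final class Resource implements AutoCloseable {
        private final String name;
        private final StringBuffer log;

        Resource(String name, StringBuffer log) {
            this.name = name;
            this.log = log;
        }

        @Override
        public void close() {
            log.append("close:").append(name).append(';');
        }
    }

    static String simulate() {
        final CountDownLatch stopRequested = new CountDownLatch(1);
        final CountDownLatch cleanedUp = new CountDownLatch(1);
        final StringBuffer log = new StringBuffer();

        // In a real main() this thread would be registered as a shutdown hook.
        Thread hook = new Thread(() -> {
            stopRequested.countDown(); // 1. ask main to stop
            awaitQuietly(cleanedUp);   // 2. block until main has closed everything
            log.append("hook-returned;");
        });

        Thread mainLike = new Thread(() -> {
            try (Resource model = new Resource("model", log);
                 Resource server = new Resource("server", log)) {
                awaitQuietly(stopRequested); // serve until shutdown is requested
            } finally {
                cleanedUp.countDown();       // runs after both resources are closed
            }
        });

        mainLike.start();
        hook.start();
        joinQuietly(mainLike);
        joinQuietly(hook);
        return log.toString();
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private static void joinQuietly(Thread thread) {
        try {
            thread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Since try-with-resources closes in reverse declaration order and the finally block runs afterwards, the hook cannot return before both closes have completed.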
c431600 to cf65c3d
…ble surfaces
#244 made the chat core honor parallel_tool_calls, but only the OpenAI /v1/chat/completions surface forwarded it; the alternative protocol surfaces (which translate into that same chat core) silently dropped the equivalent flag. Close the gap:
- Anthropic /v1/messages (AnthropicApiSupport.toOpenAiChatRequest): map tool_choice.disable_parallel_tool_use=true -> parallel_tool_calls=false (default stays parallel when unset/false).
- OpenAI Responses /v1/responses (ResponsesApiSupport.toOpenAiChatRequest): forward parallel_tool_calls, and also forward tool_choice (string form), which was being dropped entirely; both now reach the shared OpenAiRequestMapper.
Tests:
- AnthropicApiSupportTest / ResponsesApiSupportTest: unit-cover the new mappings (set, and omitted-when-absent).
- OpenAiServerToolCallingIntegrationTest (new): real-model end-to-end over HTTP using the Qwen2.5-1.5B tool model #244 wired into CI. tool_choice="required" forces a call, so it deterministically asserts the server returns a well-formed tool_calls array (arguments as a JSON string, llama.cpp #20198) and that parallel_tool_calls=false travels HTTP -> mapper -> native intact. Self-skips when the model is absent.
Verified locally: spotless, compile, spotbugs clean; model-free translator tests pass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF
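The Anthropic side of this mapping could be sketched as below. The disable_parallel_tool_use rule and the omitted-when-absent behaviour come from this commit, and the "any" -> "required" rule from the earlier test commit; the class name, the Map-based request model, and passing "auto" through unchanged are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of translating an Anthropic tool_choice object. */
final class ParallelToolUseSketch {

    /** Returns only the OpenAI fields that should be set; absent inputs add nothing. */
    static Map<String, Object> mapToolChoice(Map<String, Object> anthropicToolChoice) {
        Map<String, Object> openAi = new LinkedHashMap<>();
        if (anthropicToolChoice == null) {
            return openAi;
        }
        Object type = anthropicToolChoice.get("type");
        if ("any".equals(type)) {
            openAi.put("tool_choice", "required");
        } else if ("auto".equals(type)) {
            openAi.put("tool_choice", "auto"); // assumption: passed through unchanged
        }
        // Only an explicit true disables parallel calls; unset/false keeps the default.
        if (Boolean.TRUE.equals(anthropicToolChoice.get("disable_parallel_tool_use"))) {
            openAi.put("parallel_tool_calls", Boolean.FALSE);
        }
        return openAi;
    }
}
```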
…mpiler cache
Two complementary fixes for the macOS build, behind a new `use_cache`
workflow_dispatch input (default true). Phase 1 = macOS only.
BUILD_JOBS knob (previously investigated, never landed): build.sh now honors
$BUILD_JOBS for the cmake -j level (default = all cores; portable nproc/sysctl
detection), and the 3 macOS build jobs set BUILD_JOBS=2. GitHub's ~7 GB macOS arm64
runners OOM under -j$(nproc) when the 16.6k-line httplib.cpp co-schedules with the model
TUs; the runner is then killed as SIGTERM/143 ("received a shutdown signal") — not a
real timeout. Capping concurrent compiles bounds peak memory.
sccache -> Depot Cache (WebDAV): build.sh routes the compiler through sccache
(-DCMAKE_*_COMPILER_LAUNCHER) only when USE_CACHE=true AND sccache + a cache token are
present, then prints `sccache --show-stats`. The 3 macOS jobs brew-install sccache and
set SCCACHE_WEBDAV_ENDPOINT=https://cache.depot.dev +
SCCACHE_WEBDAV_TOKEN=${{ secrets.DEPOT_TOKEN }}. Because llama.cpp is pinned, the ~280
upstream object files are content-identical every run, so a warm cache recompiles only
changed files — staying -O3, bit-identical and release-safe. Depot's cache is shared
across all branches, so every branch builds incrementally (and warm builds also cut the
macOS memory pressure further).
Safety: inert until the DEPOT_TOKEN secret exists and on fork PRs (secrets hidden) —
those just compile normally; the install step is continue-on-error and use_cache=false
forces a clean from-scratch build. build.sh gating verified locally across all four
cases (warm / Linux-untouched / no-token / explicit-off).
Phase 2 (later): dockcross Linux/Android/CUDA (needs the token + sccache binary passed
into the container), Windows, and the Linux-host test-cpp job.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF
… CLAUDE.md
- README: a small "Build cache:" badge group crediting Depot (sccache → Depot Cache), matching the existing tool-badge style.
- CLAUDE.md: a "CI build cache & parallelism (sccache + Depot)" section for maintainers, covering the BUILD_JOBS knob (macOS -j2 / OOM rationale), the sccache WebDAV → Depot wiring (SCCACHE_WEBDAV_ENDPOINT/TOKEN, the DEPOT_TOKEN secret), content-addressed / pinned-tag hit behavior, release-safety, the use_cache flag + fork-PR/no-token inertness, and the phase-1 (macOS) vs phase-2 (dockcross/Windows/Linux-host) rollout.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF
…st builds
Extends the shared sccache -> Depot compiler cache (phase 1 was macOS-only) to the time-consuming cross-compiles so they build incrementally on a warm cache too.
- build.sh: when caching is requested but no sccache is on PATH (the dockcross manylinux/Android containers and Linux hosts don't ship it; macOS uses brew), fetch the static musl sccache v0.8.2 binary (validated: runs + has WebDAV support). Best-effort and inert-safe: any fetch/network failure leaves sccache absent and the build proceeds uncached.
- publish.yml: the 4 dockcross jobs (manylinux_2_28 CUDA, manylinux2014, Linux aarch64, Android aarch64) and the Linux-host C++ test job set USE_CACHE + SCCACHE_WEBDAV_ENDPOINT/TOKEN. The dockcross jobs also set DOCKCROSS_ARGS="-e SCCACHE_WEBDAV_ENDPOINT -e SCCACHE_WEBDAV_TOKEN -e USE_CACHE" so the wrapper forwards them into the container ($FINAL_ARGS, sourced from DOCKCROSS_ARGS, is injected into docker run).
CUDA note: only the C/C++ launcher is set, so the gcc-compiled bulk (134 model TUs + ggml + httplib) caches; the nvcc .cu kernels still compile normally (sccache's nvcc support is limited), so the speedup there is large but not total.
Inert-safe as before: no token / fork PRs / use_cache=false -> normal uncached build; artifacts stay -O3 and bit-identical.
Verified locally: build.sh gating across no-token / present / fetch-fail / disabled; sccache v0.8.2 download; YAML parses; DOCKCROSS_ARGS only on the 4 container jobs.
Phase 2b TODO: the Android-OpenCL job (separate build_opencl_android.sh) and the Windows jobs (build.bat + MSVC via sccache-action).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF
…ity gate)
The phase-2 sccache fetch in build.sh used `curl -fsSL` (which follows redirects via -L) without --proto =https --proto-redir =https, tripping the same "Not enforcing HTTPS / redirections to insecure websites" Major hotspot the model-download curls were already hardened against, which dropped the New-Code Security Rating to C and failed the gate. Add the proto flags so neither the URL nor the GitHub release redirect can downgrade to cleartext. Verified the download still succeeds through the github.com -> objects.githubusercontent.com (HTTPS) redirect.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF
…crashed in-container)
The phase-2 cross-compile caching broke the dockcross builds: inside the manylinux container the fetched sccache panicked while wrapping the cross-compiler (log: a Rust backtrace + "Run with SCCACHE_LOG=debug", failing ggml.c.o), and because sccache is the compiler launcher that fails the whole build. The build.sh "inert-safe" guard only covers sccache being *absent*, not present-but-crashing.
Remove the sccache env from the 4 dockcross jobs (manylinux_2_28 CUDA, manylinux2014, Linux aarch64, Android aarch64) and the Linux-host C++ test job. With no token reaching those jobs, build.sh skips the fetch and they compile normally (uncached, green) again.
macOS caching (phase 1) is unaffected and stays: it uses native clang + brew sccache and is proven working/fast. The build.sh fetch logic remains but is dormant without a token.
Dockcross caching is deferred: it needs an sccache-in-container health check (probe-compile before trusting the launcher) and SCCACHE_LOG=debug diagnosis of the cross-compiler panic.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF



Summary
This PR consolidates the two independent OpenAI-compatible server implementations (`OpenAiCompatServer` from PR #240 and `LlamaServer` from #242) into a single unified codebase. The winning implementation is `OpenAiCompatServer` (JDK `com.sun.net.httpserver`, streaming SSE, zero extra dependencies), extended with:
- Multi-protocol support: Ollama (`/api/chat`, `/api/generate`, `/api/tags`, `/api/version`), Anthropic Messages (`POST /v1/messages`), OpenAI Responses (`POST /v1/responses`), and llama.cpp-native (`POST /infill` for fill-in-the-middle)
- `POST /v1/completions` (text), `POST /v1/embeddings`, `POST /v1/rerank`, `GET /health` (liveness probe)
- `OpenAiServerCli` replaces `LlamaServerArgs`, with support for the `--embedding`, `--reranking`, and `--mmproj` flags
- `stream_options.include_usage` passthrough, `cached_tokens` safety net, CORS / `OPTIONS` preflight
- `response_format` support for JSON schema validation

The NanoHTTPD dependency is removed; the server now runs on the JDK's built-in HTTP server with zero extra runtime dependencies.
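For readers unfamiliar with the JDK's built-in server, a liveness route on `com.sun.net.httpserver` can look roughly like the sketch below. It is illustrative only: the class name `HealthEndpointSketch`, the loopback binding, and the `{"status":"ok"}` body are assumptions, not the actual `OpenAiCompatServer` code or its response format.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

/** Hypothetical minimal server; shows only the JDK API, not the project's routes. */
final class HealthEndpointSketch {

    /** Starts a server on the loopback interface; port 0 lets the OS pick a free port. */
    static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", port), 0);
        server.createContext("/health", exchange -> {
            try {
                if (!"GET".equals(exchange.getRequestMethod())) {
                    exchange.sendResponseHeaders(405, -1); // -1 means no response body
                    return;
                }
                // Assumed payload; the real server's body may differ.
                byte[] body = "{\"status\":\"ok\"}".getBytes(StandardCharsets.UTF_8);
                exchange.getResponseHeaders().set("Content-Type", "application/json");
                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(body);
                }
            } finally {
                exchange.close();
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = start(0);
        System.out.println("listening on port " + server.getAddress().getPort());
        server.stop(0);
    }
}
```

Because `HttpServer` lives in the `jdk.httpserver` module, a modular build needs `requires jdk.httpserver` but no third-party dependency.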
Test plan
- `OpenAiServerCliTest`, `OllamaApiSupportTest`, `AnthropicApiSupportTest`, `ResponsesApiSupportTest`, `OaiRerankSupportTest`, `OpenAiSseFormatterTest`, `OpenAiRequestMapperTest`
- `OpenAiCompatServerHttpTest`, `OpenAiCompatServerIntegrationTest`, `OpenAiServerCompletionIntegrationTest`, `OpenAiServerEmbeddingsIntegrationTest`, `OpenAiServerRerankIntegrationTest`
- `README.md`, `TODO.md`, `docs/feature-investigation-ide-agent-backend.md`, `package-info.java`
Related issues / PRs
Closes the "TWO implementations to consolidate" item in `TODO.md`. Consolidates PR #240 (`OpenAiCompatServer`) and #242 (`LlamaServer` / NanoHTTPD) into a single unified server.
Checklist
- `CONTRIBUTING.md` and `CODE_OF_CONDUCT.md`
https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF