feat(server): rerank + structured outputs + vision flag (M batch) and docs
Continues the IDE/agent backend work (Medium items + documentation):
- POST /v1/rerank (+ /rerank, /reranking): RAG document reranking. Native
handleRerank (made public, consistent with the other handle* methods) returns
{document,index,score}; OaiRerankSupport reshapes it into the OpenAI rerank
response with sorted {index, relevance_score}, top_n, and a `data` alias of
`results` (Continue #6478). New OpenAiBackend.rerank + LlamaModelBackend.rerank.
- response_format passthrough (json_object / json_schema) for OpenAI structured
outputs (new InferenceParameters.withResponseFormat; mapper forwards verbatim).
- Vision: --mmproj CLI flag (image_url content parts already pass through verbatim).
- CLI: --reranking (enableReranking), --mmproj (setMmproj) on OpenAiServerCli.
Docs:
- New docs/feature-investigation-ide-agent-backend.md (the deep-research report +
an implementation-status preamble).
- README endpoints table + notes (rerank/infill, CORS, /v1-less aliases,
response_format, the Copilot inline-completion limitation), CLAUDE.md server bullet,
package-info, and TODO.md (DONE list + the deferred decisions: Ollama emulation,
Anthropic /v1/messages + OpenAI /v1/responses shims, Continue native /completion,
per-model FIM registry, /props).
Tests: +OaiRerankSupportTest (10), +rerank HTTP route, +response_format mapper test,
+--reranking/--mmproj CLI tests. Full server+json+arch suite green (138 tests);
javadoc + spotless clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF
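
For readers unfamiliar with the rerank reshaping described above, here is a minimal sketch of the idea: native `{document,index,score}` entries become `{index, relevance_score}` entries, `top_n` is applied, and the same list is exposed under both `results` and `data`. `RerankShapeSketch`, `NativeHit` and `reshape` are hypothetical names for illustration, not the project's `OaiRerankSupport` API. The descending sort order and the "null means no limit" handling of `top_n` are assumptions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch only; this is not the project's OaiRerankSupport. */
final class RerankShapeSketch {

    /** One native result, mirroring the {document,index,score} shape from the commit message. */
    static final class NativeHit {
        final String document;
        final int index;
        final double score;

        NativeHit(String document, int index, double score) {
            this.document = document;
            this.index = index;
            this.score = score;
        }
    }

    /**
     * Builds an OpenAI-style rerank response as plain maps.
     * Assumptions: highest score first; a null topN means "return everything".
     */
    static Map<String, Object> reshape(List<NativeHit> hits, Integer topN) {
        List<NativeHit> sorted = new ArrayList<>(hits);
        // List.sort is stable, so equal scores keep their native order.
        sorted.sort(Comparator.comparingDouble((NativeHit h) -> h.score).reversed());

        int limit = sorted.size();
        if (topN != null && topN >= 0) {
            limit = Math.min(topN, sorted.size());
        }

        List<Map<String, Object>> results = new ArrayList<>();
        for (NativeHit hit : sorted.subList(0, limit)) {
            Map<String, Object> entry = new LinkedHashMap<>();
            entry.put("index", hit.index);            // position in the caller's document list
            entry.put("relevance_score", hit.score);  // renamed from the native "score"
            results.add(entry);
        }

        Map<String, Object> response = new LinkedHashMap<>();
        response.put("results", results);
        response.put("data", results); // alias for clients that read `data` instead of `results`
        return response;
    }
}
```

Keeping the shaping step free of any model or HTTP dependency is what would let it be unit-tested in isolation, in line with the "pure rerank request/response shaping" description in the CLAUDE.md change.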
CLAUDE.md: 2 additions & 2 deletions
@@ -472,8 +472,8 @@ If the local check passes (`BUILD SUCCESS`), the `mvn package` job in
-`LlamaLoader` — Extracts the platform-specific native library from the JAR to a temp directory, or finds it on `java.library.path`.
-`OSInfo` — Detects OS and architecture for library resolution.
-**`server` package — OpenAI-compatible HTTP endpoint (a single implementation).**
-
-`server.OpenAiCompatServer` — built only on the JDK's `com.sun.net.httpserver` (no new dependency), both embeddable and the fat-jar `Main-Class`. Serves `POST /v1/chat/completions` (streaming via SSE + non-streaming), `POST /v1/completions`, `POST /v1/embeddings`, `GET /v1/models` and `GET /health`, so editors that speak the OpenAI protocol (e.g. VS Code Copilot "Custom Endpoint") can drive a local model. Streaming chat uses the native OAI chunk path (`LlamaModel.streamChatCompletion` → `requestChatCompletionStream` / `receiveChatCompletionChunk` + the C++ `wrap_stream_chunk` helper), preserving `delta.tool_calls`; completions/embeddings forward verbatim to `LlamaModel.handleCompletionsOai` / `handleEmbeddings`.
-
- Supporting classes: `OpenAiServerConfig` (builder; optional bearer auth; binds `127.0.0.1`), `OpenAiServerCli` (testable CLI arg parser → `ModelParameters` + `OpenAiServerConfig`), `OpenAiRequestMapper` (OAI chat request → `InferenceParameters`), `OpenAiSseFormatter` (SSE/models/error JSON), and the model-free test seam `OpenAiBackend`/`ChunkSink` + `LlamaModelBackend`. The streaming envelope is parsed by `json.ChatStreamChunkParser`.
+
- `server.OpenAiCompatServer` — built only on the JDK's `com.sun.net.httpserver` (no new dependency), both embeddable and the fat-jar `Main-Class`. Serves `POST /v1/chat/completions` (streaming via SSE + non-streaming), `POST /v1/completions`, `POST /v1/embeddings`, `POST /v1/rerank`, `POST /infill`, `GET /v1/models` and `GET /health` (every route is also reachable without the `/v1` prefix), so editors that speak the OpenAI protocol (e.g. VS Code Copilot "Custom Endpoint", Cline, Roo Code, Continue) can drive a local model. Streaming chat uses the native OAI chunk path (`LlamaModel.streamChatCompletion` → `requestChatCompletionStream` / `receiveChatCompletionChunk` + the C++ `wrap_stream_chunk` helper), preserving `delta.tool_calls`; completions/embeddings/infill forward verbatim to the matching `LlamaModel.handle*`; rerank reshapes `handleRerank` into the OAI `results`/`data` shape. The chat mapper forwards `stream_options` and `response_format` and defaults `cache_prompt=true`; a CORS `Filter` answers `OPTIONS` preflights; `OpenAiSseFormatter.ensureUsageCachedTokens` guarantees `usage.prompt_tokens_details.cached_tokens` on the streamed usage chunk (Copilot crash fix, microsoft/vscode #273482). **Agentic tool-calling is the primary target**; a C++ guard (`test_server.cpp`) pins `tool_calls.function.arguments` as a JSON string (llama.cpp #20198).
+
- Supporting classes: `OpenAiServerConfig` (builder; optional bearer auth; binds `127.0.0.1`; `corsAllowOrigin`), `OpenAiServerCli` (testable CLI arg parser → `ModelParameters` + `OpenAiServerConfig`; flags incl. `--mmproj`/`--embedding`/`--reranking`), `OpenAiRequestMapper` (OAI chat request → `InferenceParameters`), `OpenAiSseFormatter` (SSE/models/error JSON + usage normalization), `OaiRerankSupport` (pure rerank request/response shaping), and the model-free test seam `OpenAiBackend`/`ChunkSink` + `LlamaModelBackend`. The streaming envelope is parsed by `json.ChatStreamChunkParser`.
- The `server` package is a dedicated top layer in the ArchUnit `layeredArchitecture` rule (the only layer allowed to access the root `Api`); `noInternalJdkImports` carries an explicit exception for the supported `com.sun.net.httpserver` (the exported `jdk.httpserver` module, which `module-info.java` `requires`). See README "OpenAI-compatible HTTP server".
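
The usage normalization mentioned in the `OpenAiCompatServer` bullet can be pictured with a small sketch: given a parsed `usage` object, make sure `prompt_tokens_details.cached_tokens` is present without overwriting a value the backend already reported. `UsageNormalizationSketch` and `ensureCachedTokens` are hypothetical names, the map-based representation stands in for whatever JSON model the project uses, and defaulting the missing field to `0` is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch only; not the project's OpenAiSseFormatter.ensureUsageCachedTokens. */
final class UsageNormalizationSketch {

    /**
     * Returns a copy of the usage map in which prompt_tokens_details.cached_tokens
     * is always present. Assumption: a missing value is filled with 0.
     */
    @SuppressWarnings("unchecked")
    static Map<String, Object> ensureCachedTokens(Map<String, Object> usage) {
        Map<String, Object> out = new LinkedHashMap<String, Object>(usage);

        Map<String, Object> details = new LinkedHashMap<String, Object>();
        Object existing = out.get("prompt_tokens_details");
        if (existing instanceof Map) {
            // Copy rather than mutate, so the caller's map is left untouched.
            details.putAll((Map<String, Object>) existing);
        }

        // Keep a value the backend already supplied; only fill the gap.
        details.putIfAbsent("cached_tokens", 0);

        out.put("prompt_tokens_details", details);
        return out;
    }
}
```

Filling in the field on the server side means a client that reads it unconditionally never sees it missing, whichever backend path produced the usage chunk.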