
Commit 601b9de

feat(server): rerank + structured outputs + vision flag (M batch) and docs
Continues the IDE/agent backend work (Medium items + documentation):

- POST /v1/rerank (+ /rerank, /reranking): RAG document reranking. Native handleRerank (made public, consistent with the other handle* methods) returns {document,index,score}; OaiRerankSupport reshapes it into the OpenAI rerank response with sorted {index, relevance_score}, top_n, and a `data` alias of `results` (Continue #6478). New OpenAiBackend.rerank + LlamaModelBackend.rerank.
- response_format passthrough (json_object / json_schema) for OpenAI structured outputs (new InferenceParameters.withResponseFormat; mapper forwards verbatim).
- Vision: --mmproj CLI flag (image_url content parts already pass through verbatim).
- CLI: --reranking (enableReranking), --mmproj (setMmproj) on OpenAiServerCli.

Docs:

- New docs/feature-investigation-ide-agent-backend.md (the deep-research report + an implementation-status preamble).
- README endpoints table + notes (rerank/infill, CORS, /v1-less aliases, response_format, the Copilot inline-completion limitation), CLAUDE.md server bullet, package-info, and TODO.md (DONE list + the deferred decisions: Ollama emulation, Anthropic /v1/messages + OpenAI /v1/responses shims, Continue native /completion, per-model FIM registry, /props).

Tests: +OaiRerankSupportTest (10), +rerank HTTP route, +response_format mapper test, +--reranking/--mmproj CLI tests. Full server+json+arch suite green (138 tests); javadoc + spotless clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF
1 parent d9e38cf commit 601b9de

17 files changed

Lines changed: 717 additions & 31 deletions
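
The rerank reshaping described in the commit message (native `{document,index,score}` hits turned into sorted `{index, relevance_score}` entries, limited by `top_n`, and exposed under both `results` and `data`) could be sketched roughly as follows. This is an illustration only, not the project's `OaiRerankSupport`: the class `RerankReshapeSketch`, the `NativeHit` holder and the `reshape` method are hypothetical names, and treating `topN <= 0` as "no limit" is an assumption.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; not the project's OaiRerankSupport. */
final class RerankReshapeSketch {

    /** One native hit: position of the document in the request plus its score. */
    static final class NativeHit {
        final int index;
        final double score;

        NativeHit(int index, double score) {
            this.index = index;
            this.score = score;
        }
    }

    /**
     * Sorts hits by descending score, keeps at most topN entries
     * (assumption: topN <= 0 keeps all), and exposes the same list
     * under both "results" and "data".
     */
    static Map<String, Object> reshape(List<NativeHit> hits, int topN) {
        List<NativeHit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingDouble((NativeHit h) -> h.score).reversed());

        int limit = topN > 0 ? Math.min(topN, sorted.size()) : sorted.size();
        List<Map<String, Object>> results = new ArrayList<>();
        for (NativeHit hit : sorted.subList(0, limit)) {
            Map<String, Object> entry = new LinkedHashMap<>();
            entry.put("index", hit.index);
            entry.put("relevance_score", hit.score);
            results.add(entry);
        }

        Map<String, Object> response = new LinkedHashMap<>();
        response.put("results", results);
        response.put("data", results); // alias of "results"
        return response;
    }
}
```

Keeping `data` as a second reference to the same list, rather than a copy, is one simple way to guarantee the two keys never diverge; whether the real implementation does this is not stated in the commit.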

CLAUDE.md

Lines changed: 2 additions & 2 deletions
@@ -472,8 +472,8 @@ If the local check passes (`BUILD SUCCESS`), the `mvn package` job in
 - `LlamaLoader` — Extracts the platform-specific native library from the JAR to a temp directory, or finds it on `java.library.path`.
 - `OSInfo` — Detects OS and architecture for library resolution.
 - **`server` package — OpenAI-compatible HTTP endpoint (a single implementation).**
-- `server.OpenAiCompatServer` — built only on the JDK's `com.sun.net.httpserver` (no new dependency), both embeddable and the fat-jar `Main-Class`. Serves `POST /v1/chat/completions` (streaming via SSE + non-streaming), `POST /v1/completions`, `POST /v1/embeddings`, `GET /v1/models` and `GET /health`, so editors that speak the OpenAI protocol (e.g. VS Code Copilot "Custom Endpoint") can drive a local model. Streaming chat uses the native OAI chunk path (`LlamaModel.streamChatCompletion``requestChatCompletionStream` / `receiveChatCompletionChunk` + the C++ `wrap_stream_chunk` helper), preserving `delta.tool_calls`; completions/embeddings forward verbatim to `LlamaModel.handleCompletionsOai` / `handleEmbeddings`.
-- Supporting classes: `OpenAiServerConfig` (builder; optional bearer auth; binds `127.0.0.1`), `OpenAiServerCli` (testable CLI arg parser → `ModelParameters` + `OpenAiServerConfig`), `OpenAiRequestMapper` (OAI chat request → `InferenceParameters`), `OpenAiSseFormatter` (SSE/models/error JSON), and the model-free test seam `OpenAiBackend`/`ChunkSink` + `LlamaModelBackend`. The streaming envelope is parsed by `json.ChatStreamChunkParser`.
+- `server.OpenAiCompatServer` — built only on the JDK's `com.sun.net.httpserver` (no new dependency), both embeddable and the fat-jar `Main-Class`. Serves `POST /v1/chat/completions` (streaming via SSE + non-streaming), `POST /v1/completions`, `POST /v1/embeddings`, `POST /v1/rerank`, `POST /infill`, `GET /v1/models` and `GET /health` (every route is also reachable without the `/v1` prefix), so editors that speak the OpenAI protocol (e.g. VS Code Copilot "Custom Endpoint", Cline, Roo Code, Continue) can drive a local model. Streaming chat uses the native OAI chunk path (`LlamaModel.streamChatCompletion` → `requestChatCompletionStream` / `receiveChatCompletionChunk` + the C++ `wrap_stream_chunk` helper), preserving `delta.tool_calls`; completions/embeddings/infill forward verbatim to the matching `LlamaModel.handle*`; rerank reshapes `handleRerank` into the OAI `results`/`data` shape. The chat mapper forwards `stream_options` and `response_format` and defaults `cache_prompt=true`; a CORS `Filter` answers `OPTIONS` preflights; `OpenAiSseFormatter.ensureUsageCachedTokens` guarantees `usage.prompt_tokens_details.cached_tokens` on the streamed usage chunk (Copilot crash fix, microsoft/vscode #273482). **Agentic tool-calling is the primary target**; a C++ guard (`test_server.cpp`) pins `tool_calls.function.arguments` as a JSON string (llama.cpp #20198).
+- Supporting classes: `OpenAiServerConfig` (builder; optional bearer auth; binds `127.0.0.1`; `corsAllowOrigin`), `OpenAiServerCli` (testable CLI arg parser → `ModelParameters` + `OpenAiServerConfig`; flags incl. `--mmproj`/`--embedding`/`--reranking`), `OpenAiRequestMapper` (OAI chat request → `InferenceParameters`), `OpenAiSseFormatter` (SSE/models/error JSON + usage normalization), `OaiRerankSupport` (pure rerank request/response shaping), and the model-free test seam `OpenAiBackend`/`ChunkSink` + `LlamaModelBackend`. The streaming envelope is parsed by `json.ChatStreamChunkParser`.
 - The `server` package is a dedicated top layer in the ArchUnit `layeredArchitecture` rule (the only layer allowed to access the root `Api`); `noInternalJdkImports` carries an explicit exception for the supported `com.sun.net.httpserver` (the exported `jdk.httpserver` module, which `module-info.java` `requires`). See README "OpenAI-compatible HTTP server".
 
 **Native layer** (`src/main/cpp/`):
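
The usage normalization named in the CLAUDE.md bullet above (guaranteeing `usage.prompt_tokens_details.cached_tokens` on the streamed usage chunk) can be illustrated with a small map-based sketch. It is not the project's `OpenAiSseFormatter.ensureUsageCachedTokens`: the class `UsageNormalizerSketch` and the method `ensureCachedTokens` are hypothetical, the usage object is modelled as a plain `Map`, and defaulting the missing field to `0` is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch; not the project's OpenAiSseFormatter. */
final class UsageNormalizerSketch {

    /**
     * Ensures usage.prompt_tokens_details.cached_tokens exists.
     * A missing field is filled with 0 (assumed default); an existing
     * value is left untouched. Mutates and returns the given map.
     */
    @SuppressWarnings("unchecked")
    static Map<String, Object> ensureCachedTokens(Map<String, Object> usage) {
        Object details = usage.get("prompt_tokens_details");
        Map<String, Object> detailsMap;
        if (details instanceof Map) {
            detailsMap = (Map<String, Object>) details;
        } else {
            // No details object yet: create one and attach it.
            detailsMap = new LinkedHashMap<>();
            usage.put("prompt_tokens_details", detailsMap);
        }
        detailsMap.putIfAbsent("cached_tokens", 0);
        return usage;
    }
}
```

Filling the field instead of rejecting the chunk keeps clients that read `cached_tokens` unconditionally from failing on a usage object that lacks it.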

README.md

Lines changed: 16 additions & 3 deletions
@@ -409,12 +409,24 @@ serves:
 | `POST /v1/chat/completions` | `LlamaModel.streamChatCompletion` (streaming SSE) / `chatComplete` (blocking) |
 | `POST /v1/completions` | `LlamaModel.handleCompletionsOai` |
 | `POST /v1/embeddings` (requires `--embedding`) | `LlamaModel.handleEmbeddings` |
+| `POST /v1/rerank` (requires `--reranking`) | `LlamaModel.handleRerank` (reshaped to `results`/`data`) |
+| `POST /infill` | `LlamaModel.handleInfill` (fill-in-the-middle autocomplete) |
 | `GET /v1/models` | the configured model id |
 | `GET /health` | static `{"status":"ok"}` (unauthenticated) |
 
 Chat completions support **streaming via Server-Sent Events** and non-streaming, forwarding
-`messages`/`tools` verbatim. The streaming path carries `delta.tool_calls`, so agent/tool-calling
-clients work. Completions and embeddings are non-streaming (the full JSON result per request).
+`messages`/`tools` verbatim. The streaming path carries `delta.tool_calls` and (with
+`stream_options.include_usage`) a trailing `usage` chunk, so **agent/tool-calling clients work**
+this is the recommended surface for VS Code Copilot agent mode, Cline, Roo Code and Continue.
+`response_format` (`json_object` / `json_schema`) is forwarded for structured outputs. Completions,
+embeddings, rerank and infill are non-streaming.
+
+Every route is also reachable **without the `/v1` prefix**, the server answers **CORS preflight**
+(`OPTIONS`) and stamps `Access-Control-Allow-Origin` (so browser/webview clients work), and
+`POST /infill` is the llama.cpp-native FIM endpoint for local ghost-text autocomplete plugins
+(llama.vscode, Twinny, Tabby, Continue's `llama.cpp` provider). Note: GitHub Copilot's **inline**
+completions cannot be served by any local endpoint — only its chat/agent surfaces — so use one of
+those autocomplete plugins for ghost text.
 
 Embed it in your app:
 
@@ -442,7 +454,8 @@ java -cp target/llama-<version>.jar net.ladenthin.llama.server.OpenAiCompatServe
 ```
 
 Run with `--help` for the full option list (`-m/--model`, `--host`, `-p/--port`, `-c/--ctx-size`,
-`-ngl/--n-gpu-layers`, `-t/--threads`, `--parallel`, `--model-id`, `--api-key`, `--embedding`).
+`-ngl/--n-gpu-layers`, `-t/--threads`, `--parallel`, `--model-id`, `--api-key`, `--mmproj`,
+`--embedding`, `--reranking`).
 
 Verify with curl (streaming chat):
 
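
The README change above says `response_format` (`json_object` / `json_schema`) is forwarded for structured outputs. From a client's point of view, a request using it might be built like this. This is a sketch under assumptions: it uses the JDK 11+ `java.net.http` client, `StructuredOutputRequestSketch` is a hypothetical class, `local-model` is a placeholder model id, and the base URL passed in (host and port) depends on how the server was started.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Hypothetical client-side sketch; not part of the commit. */
final class StructuredOutputRequestSketch {

    /** Chat request body asking for a JSON object reply; "local-model" is a placeholder id. */
    static String body() {
        return "{"
            + "\"model\":\"local-model\","
            + "\"messages\":[{\"role\":\"user\",\"content\":\"Reply with a JSON object describing Java.\"}],"
            + "\"response_format\":{\"type\":\"json_object\"}"
            + "}";
    }

    /** Builds (but does not send) a POST to /v1/chat/completions under the given base URL. */
    static HttpRequest request(String baseUrl) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/v1/chat/completions"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body()))
            .build();
    }
}
```

To send it, pass the result to `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`. If the server was started with `--api-key`, an `Authorization: Bearer ...` header would presumably be needed as well.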
TODO.md

Lines changed: 46 additions & 23 deletions
@@ -15,30 +15,53 @@ cross-cutting initiative.
 
 ### OpenAI-compatible HTTP endpoint (shipped; follow-ups open)
 
-`net.ladenthin.llama.server.OpenAiCompatServer` is the single OpenAI-compatible server. It exposes
-`POST /v1/chat/completions` (streaming via SSE + non-streaming), `POST /v1/completions`,
-`POST /v1/embeddings`, `GET /v1/models` and `GET /health` over the JDK's built-in
-`com.sun.net.httpserver` (no new dependency), and is the fat-jar `Main-Class`. Streaming chat uses the
-native OAI chunk path (`requestChatCompletionStream` / `receiveChatCompletionChunk` + the C++
-`wrap_stream_chunk` helper), preserving `delta.tool_calls` for agent mode; completions/embeddings
-forward verbatim to `LlamaModel.handleCompletionsOai` / `handleEmbeddings`. The CLI is parsed by the
-testable `OpenAiServerCli`. (Consolidated from the two interim implementations — PR #240's JDK +
-streaming server and #242's NanoHTTPD server — by keeping the JDK/streaming core, porting the extra
-routes + a fuller CLI + the fat-jar entry point onto it, and deleting the NanoHTTPD impl + its
-`org.nanohttpd` dependency.) Follow-ups, deferred until requested:
-
-- **Multi-model registry.** Only one model id is advertised/served today; support several models
-chosen by the request `model` field (and listed in `/v1/models`).
-- **`stream_options.include_usage` passthrough** so the final streamed `usage` chunk is emitted
-(needs a generic raw-param passthrough on `InferenceParameters`, or explicit mapping).
-- **Streaming `/v1/completions`.** The chat route streams; `/v1/completions` is non-streaming today
-(a `"stream": true` body still returns one full JSON object). Honour SSE there too if a client needs
-it.
-- **Additional `apiType`s.** VS Code "Custom Endpoint" also offers Anthropic `messages` and OpenAI
-`responses`; only `chat-completions` is implemented.
+`net.ladenthin.llama.server.OpenAiCompatServer` is the single OpenAI-compatible server (JDK
+`com.sun.net.httpserver`, no new dependency, fat-jar `Main-Class`). It exposes
+`POST /v1/chat/completions` (streaming SSE + non-streaming), `/v1/completions`, `/v1/embeddings`,
+`/v1/rerank`, `/infill`, `GET /v1/models` and `GET /health` (every route also reachable without `/v1`).
+The CLI is parsed by the testable `OpenAiServerCli`. (Consolidated from PR #240's JDK + streaming
+server and #242's NanoHTTPD server; NanoHTTPD + its dependency deleted.)
+
+**IDE/agent backend hardening — DONE** (from the deep-research investigation
+[`docs/feature-investigation-ide-agent-backend.md`](docs/feature-investigation-ide-agent-backend.md);
+primary goal: agentic tool-calling with Qwen):
+
+- Agentic tool-calling verified wire-correct: C++ guard pins `tool_calls.function.arguments` as a JSON
+**string** (not object) at b9682 (llama.cpp #20198), plus the existing `finish_reason:"tool_calls"`
+test.
+- `stream_options.include_usage` forwarded (new `InferenceParameters.withStreamOptions`) so the trailing
+usage chunk is emitted, and `OpenAiSseFormatter.ensureUsageCachedTokens` guarantees
+`usage.prompt_tokens_details.cached_tokens` (fixes the Copilot custom-endpoint crash, vscode #273482).
+- `response_format` (`json_object`/`json_schema`) forwarded for structured outputs.
+- `POST /infill` (FIM autocomplete for llama.vscode/Twinny/Tabby/Continue) → native `handleInfill`.
+- `POST /v1/rerank` (RAG) → `handleRerank` reshaped to `results`/`data` (`OaiRerankSupport`).
+- CORS preflight + `Access-Control-Allow-Origin`; bare-path (no `/v1`) aliases; `cache_prompt=true`
+default; `--mmproj` (vision), `--embedding`, `--reranking` CLI flags.
+
+**Open follow-ups (deferred — need a decision before building):**
+
+- **Ollama native-API emulation** (`GET /api/version`, `/api/tags`, `POST /api/show`, `/api/chat`,
+`/api/generate`). Unlocks Copilot's built-in *Ollama* provider on older VS Code and tools hard-coded
+to Ollama's endpoints. Downgraded by the research because the OpenAI Custom Endpoint provider reached
+VS Code Stable in 1.122 (May 2026), so the clean OpenAI surface already covers current Copilot
+chat/agent. Medium effort + a second protocol to maintain.
+- **Anthropic `POST /v1/messages` + OpenAI `POST /v1/responses` shims** for Copilot's other `apiType`s
+and Claude-shaped clients (Claude Code). The native layer already emits the Anthropic shape
+(`server_task_result_*::to_json_anthropic`, exercised in `test_server.cpp`); the gap is the HTTP
+routes + request translation. `chat-completions` suffices for Qwen agentic, so this is medium–large
+and deferred.
+- **Continue's native `llama.cpp` provider** posts to `POST /completion` (singular) expecting the
+*native* (non-OAI) completion shape; we return OAI shapes. Add a `/completion` route → native
+`handleCompletions` if Continue's `llama.cpp` (not `openai`) provider must be supported.
+- **Per-model FIM template registry** (Qwen/CodeLlama/DeepSeek v1&V2/StarCoder2/Codestral) — only needed
+if we also expose `/v1/completions`-with-`suffix` FIM; `/infill` applies the model's FIM tokens
+server-side, so this is lower value.
+- **`/props` (or `/v1/models`) context-length + capability reporting** so clients can auto-size prompts
+and light up tools/vision without manual config.
+- **Streaming `/v1/completions`** (the chat route streams; `/v1/completions` is non-streaming today).
+- **Multi-model registry.** Only one model id is advertised/served today.
 - **Gemma 4 tool-calling validation.** Confirm the pinned llama.cpp (`b9682`) includes the Gemma 4
-tool-call parser fixes (landed upstream ~Apr 2026); if not, bump per the upgrade procedure so
-streamed/blocking `tool_calls` come through for Gemma 4 GGUFs.
+tool-call parser fixes; if not, bump per the upgrade procedure.
 
 ### llama.cpp upstream feature exposure (queued, deferred by policy)
 
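
The DONE list above mentions a CORS preflight answer plus an `Access-Control-Allow-Origin` stamp on responses. With the JDK's `com.sun.net.httpserver`, that behaviour could be sketched as a `Filter` along these lines. It is an illustration, not the filter shipped in this commit: `CorsFilterSketch` and `isPreflight` are hypothetical names, and the `Access-Control-Allow-Methods` / `Access-Control-Allow-Headers` values and the `204` preflight status are assumptions.

```java
import com.sun.net.httpserver.Filter;
import com.sun.net.httpserver.HttpExchange;
import java.io.IOException;

/** Hypothetical sketch; not the CORS filter shipped in this commit. */
final class CorsFilterSketch extends Filter {

    private final String allowOrigin;

    CorsFilterSketch(String allowOrigin) {
        this.allowOrigin = allowOrigin;
    }

    /** A preflight is recognised here purely by its HTTP method. */
    static boolean isPreflight(String method) {
        return "OPTIONS".equalsIgnoreCase(method);
    }

    @Override
    public void doFilter(HttpExchange exchange, Chain chain) throws IOException {
        // Stamp every response so browser/webview clients accept it.
        exchange.getResponseHeaders().set("Access-Control-Allow-Origin", allowOrigin);
        if (isPreflight(exchange.getRequestMethod())) {
            // Assumed header values; answer the preflight without a body.
            exchange.getResponseHeaders().set("Access-Control-Allow-Methods", "GET, POST, OPTIONS");
            exchange.getResponseHeaders().set("Access-Control-Allow-Headers", "Authorization, Content-Type");
            exchange.sendResponseHeaders(204, -1);
            exchange.close();
            return;
        }
        // Any other request continues to the route handler.
        chain.doFilter(exchange);
    }

    @Override
    public String description() {
        return "CORS sketch";
    }
}
```

Such a filter would be attached per route with `server.createContext(path, handler).getFilters().add(new CorsFilterSketch("*"))`; answering the preflight inside the filter means it never reaches the route handler.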