
Commit d9e38cf

feat(server): add /infill + tool/usage correctness for IDE agent backends
Implements the XS+S recommendations from the IDE/agent backend investigation, targeting agentic tool-calling (Qwen) and local autocomplete:

XS:
- POST /infill route (FIM autocomplete: llama.vscode/Twinny/Tabby/Continue): forwards verbatim to the existing native handleInfill; FIM tokens applied server-side from GGUF metadata. New OpenAiBackend.infill + LlamaModelBackend.
- Tolerant routing: every route also reachable without the /v1 prefix.
- cache_prompt defaulted true in the chat mapper (KV-prefix reuse for IDE latency).
- C++ regression guard (#20198): assert tool_calls.function.arguments is a JSON STRING, not an object. Passes against pinned b9682, so agentic tool-calling is wire-correct for the OpenAI SDK / Roo Code / Copilot agent.

S:
- stream_options.include_usage passthrough: OpenAiRequestMapper forwards the stream_options object verbatim (new InferenceParameters.withStreamOptions) so the native server emits the trailing usage chunk OpenAI clients expect.
- cached_tokens safety net: OpenAiSseFormatter.ensureUsageCachedTokens guarantees usage.prompt_tokens_details.cached_tokens is present on the streamed usage chunk, fixing the documented Copilot custom-endpoint crash (microsoft/vscode #273482) regardless of upstream. Applied in the SSE path; token-delta chunks pass through unparsed.
- CORS: a com.sun.net.httpserver Filter answers OPTIONS preflights with 204 + Access-Control-Allow-{Origin,Methods,Headers} and stamps Allow-Origin on every response. New OpenAiServerConfig.corsAllowOrigin (default "*").

Tests: +infill/alias/CORS HTTP tests, +stream_options mapper test, +5 ensureUsageCachedTokens unit tests, +1 C++ arguments-as-string guard. Full server + json + arch suite green (77 model-free tests); C++ tool-call/stream suite green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JdLpWD8nedY7LwNnHefZLF
1 parent fa3afb5 commit d9e38cf

13 files changed

Lines changed: 366 additions & 11 deletions

src/main/java/net/ladenthin/llama/parameters/InferenceParameters.java

Lines changed: 14 additions & 0 deletions
@@ -58,6 +58,7 @@ public final class InferenceParameters extends JsonParameters {
     private static final String PARAM_INPUT_PREFIX = "input_prefix";
     private static final String PARAM_INPUT_SUFFIX = "input_suffix";
     private static final String PARAM_CACHE_PROMPT = "cache_prompt";
+    private static final String PARAM_STREAM_OPTIONS = "stream_options";
     private static final String PARAM_N_PREDICT = "n_predict";
     private static final String PARAM_TOP_K = "top_k";
     private static final String PARAM_TOP_P = "top_p";
@@ -438,6 +439,19 @@ public InferenceParameters withJsonSchema(String schema) {
         return withRaw(PARAM_JSON_SCHEMA, schema);
     }
 
+    /**
+     * Returns a new request with the OpenAI streaming {@code stream_options} object replaced. Passing
+     * {@code {"include_usage":true}} makes the native server emit a trailing {@code usage} chunk after
+     * the stream completes (with an empty {@code choices} array), which OpenAI clients — notably the
+     * VS&nbsp;Code Copilot custom endpoint — rely on for token accounting.
+     *
+     * @param streamOptionsJson the {@code stream_options} object as a JSON-encoded string
+     * @return a new instance; this instance is unchanged
+     */
+    public InferenceParameters withStreamOptions(String streamOptionsJson) {
+        return withRaw(PARAM_STREAM_OPTIONS, streamOptionsJson);
+    }
+
     /**
      * Returns a new request with the repetition-penalty prompt-portion override replaced.
      *

src/main/java/net/ladenthin/llama/server/LlamaModelBackend.java

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -80,4 +80,11 @@ public String embeddings(JsonNode request) {
         // oaiCompat=true so the response uses the OpenAI {"object":"list","data":[{embedding}]} shape.
         return model.handleEmbeddings(request.toString(), true);
     }
+
+    @Override
+    public String infill(JsonNode request) {
+        // The native /infill handler parses the body itself (input_prefix/input_suffix/...) and applies
+        // the model's FIM tokens from GGUF metadata; forward verbatim.
+        return model.handleInfill(request.toString());
+    }
 }

src/main/java/net/ladenthin/llama/server/OpenAiBackend.java

Lines changed: 15 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -60,4 +60,19 @@ interface OpenAiBackend {
      * @throws IOException if generation fails in a way the caller should surface as a server error
      */
     String embeddings(JsonNode request) throws IOException;
+
+    /**
+     * Run a (non-streaming) fill-in-the-middle completion ({@code POST /infill}). The request body is
+     * forwarded verbatim to the native llama.cpp infill handler, which applies the model's FIM control
+     * tokens server-side from GGUF metadata — so callers send raw {@code input_prefix} /
+     * {@code input_suffix} (and optional {@code input_extra} / {@code prompt}). This is the endpoint
+     * that drives local ghost-text autocomplete clients (llama.vscode, llama.vim, Twinny, Tabby,
+     * Continue's {@code llama.cpp} provider).
+     *
+     * @param request the parsed llama.cpp {@code /infill} request (typically {@code input_prefix} +
+     *     {@code input_suffix})
+     * @return the infill response serialized as JSON (clients read the {@code "content"} field)
+     * @throws IOException if generation fails in a way the caller should surface as a server error
+     */
+    String infill(JsonNode request) throws IOException;
 }
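For anyone exercising the new route by hand, here is a minimal sketch of the body a client might post to `/infill`. The field names `input_prefix`, `input_suffix` and `n_predict` come from the Javadoc and parameter constants in this commit; the class and its helpers (`InfillRequestSketch`, `jsonString`, `infillBody`) are hypothetical and exist only for illustration. The hand-rolled escaper is there to keep the sketch dependency-free; real code would more likely use Jackson.

```java
final class InfillRequestSketch {

    /** Encodes a Java string as a JSON string literal (hypothetical helper, not part of the commit). */
    static String jsonString(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters must be \\u-escaped in JSON.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }

    /** Builds a raw-prefix/raw-suffix infill body; the server is expected to add the FIM tokens. */
    static String infillBody(String prefix, String suffix, int nPredict) {
        return "{\"input_prefix\":" + jsonString(prefix)
                + ",\"input_suffix\":" + jsonString(suffix)
                + ",\"n_predict\":" + nPredict + "}";
    }

    public static void main(String[] args) {
        // The text before and after the cursor, sent without any FIM control tokens.
        System.out.println(infillBody("int x = ", ";\n", 16));
    }
}
```

The resulting string would be sent as the `POST` body with `Content-Type: application/json`; per the Javadoc above, the completion is then read from the response's `"content"` field.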

src/main/java/net/ladenthin/llama/server/OpenAiCompatServer.java

Lines changed: 77 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -7,7 +7,9 @@
 import com.fasterxml.jackson.core.JsonProcessingException;
 import com.fasterxml.jackson.databind.JsonNode;
 import com.fasterxml.jackson.databind.ObjectMapper;
+import com.sun.net.httpserver.Filter;
 import com.sun.net.httpserver.HttpExchange;
+import com.sun.net.httpserver.HttpHandler;
 import com.sun.net.httpserver.HttpServer;
 import java.io.IOException;
 import java.io.InputStream;
@@ -71,6 +73,12 @@ public final class OpenAiCompatServer implements AutoCloseable {
     /** The embeddings route. */
     public static final String PATH_EMBEDDINGS = "/v1/embeddings";
 
+    /**
+     * The fill-in-the-middle (autocomplete) route. Deliberately the llama.cpp-native bare path (no
+     * {@code /v1}) so ghost-text clients such as llama.vscode and Tabby reach it unchanged.
+     */
+    public static final String PATH_INFILL = "/infill";
+
     /** The model-list route. */
     public static final String PATH_MODELS = "/v1/models";

@@ -94,6 +102,7 @@ public final class OpenAiCompatServer implements AutoCloseable {
     private final OpenAiServerConfig config;
     private final OpenAiBackend backend;
     private final HttpServer http;
+    private final Filter corsFilter;
     private final ExecutorService requestExecutor;
     private final ScheduledExecutorService heartbeatExecutor;

@@ -122,12 +131,21 @@ public OpenAiCompatServer(LlamaModel model, OpenAiServerConfig config) throws IO
         this.requestExecutor = Executors.newCachedThreadPool(namedFactory("jllama-openai-http"));
         this.heartbeatExecutor = Executors.newScheduledThreadPool(1, namedFactory("jllama-openai-hb"));
         this.http = HttpServer.create(new InetSocketAddress(config.getHost(), config.getPort()), 0);
-        http.createContext("/", this::handleNotFound);
-        http.createContext(PATH_HEALTH, this::handleHealth);
-        http.createContext(PATH_MODELS, this::handleModels);
-        http.createContext(PATH_CHAT_COMPLETIONS, this::handleChatCompletions);
-        http.createContext(PATH_COMPLETIONS, this::handleCompletions);
-        http.createContext(PATH_EMBEDDINGS, this::handleEmbeddings);
+        this.corsFilter = buildCorsFilter(config.getCorsAllowOrigin());
+        register("/", this::handleNotFound);
+        register(PATH_HEALTH, this::handleHealth);
+        // Each route is registered under its canonical path and a bare alias (clients disagree on
+        // whether to include the /v1 prefix), so both forms resolve to the same handler.
+        register(PATH_MODELS, this::handleModels);
+        register("/models", this::handleModels);
+        register(PATH_CHAT_COMPLETIONS, this::handleChatCompletions);
+        register("/chat/completions", this::handleChatCompletions);
+        register(PATH_COMPLETIONS, this::handleCompletions);
+        register("/completions", this::handleCompletions);
+        register(PATH_EMBEDDINGS, this::handleEmbeddings);
+        register("/embeddings", this::handleEmbeddings);
+        register(PATH_INFILL, this::handleInfill);
+        register("/v1/infill", this::handleInfill);
         http.setExecutor(requestExecutor);
     }

@@ -159,6 +177,42 @@ public void close() {
         heartbeatExecutor.shutdownNow();
     }
 
+    /**
+     * Register {@code handler} for {@code path} with the CORS filter attached. Centralised so the
+     * cross-cutting CORS/preflight wiring applies uniformly to every route (including the catch-all).
+     */
+    private void register(String path, HttpHandler handler) {
+        http.createContext(path, handler).getFilters().add(corsFilter);
+    }
+
+    /**
+     * Build a CORS filter that stamps {@code Access-Control-Allow-Origin} on every response and answers
+     * {@code OPTIONS} preflights with {@code 204} + the allowed methods/headers — so browser- and
+     * webview-based clients (which preflight an {@code Authorization} header) are not blocked.
+     */
+    private static Filter buildCorsFilter(String allowOrigin) {
+        return new Filter() {
+            @Override
+            public void doFilter(HttpExchange exchange, Chain chain) throws IOException {
+                exchange.getResponseHeaders().set("Access-Control-Allow-Origin", allowOrigin);
+                if ("OPTIONS".equalsIgnoreCase(exchange.getRequestMethod())) {
+                    exchange.getResponseHeaders().set("Access-Control-Allow-Methods", "GET, POST, OPTIONS");
+                    exchange.getResponseHeaders().set("Access-Control-Allow-Headers", "Content-Type, Authorization");
+                    exchange.getResponseHeaders().set("Access-Control-Max-Age", "86400");
+                    exchange.sendResponseHeaders(204, -1);
+                    exchange.close();
+                    return;
+                }
+                chain.doFilter(exchange);
+            }
+
+            @Override
+            public String description() {
+                return "CORS preflight + Access-Control-Allow-Origin";
+            }
+        };
+    }
+
     // ----- handlers -----
 
     private void handleChatCompletions(HttpExchange exchange) throws IOException {
@@ -204,6 +258,17 @@ private void handleEmbeddings(HttpExchange exchange) throws IOException {
         }
     }
 
+    private void handleInfill(HttpExchange exchange) throws IOException {
+        try {
+            JsonNode request = requirePostJson(exchange);
+            if (request != null) {
+                completeNonStreaming(exchange, request, backend::infill);
+            }
+        } finally {
+            exchange.close();
+        }
+    }
+
     /**
      * Run a non-streaming request through {@code producer} and write its JSON body, translating an
      * {@link IllegalArgumentException} to {@code 400} and any other failure to {@code 500}.
@@ -236,7 +301,12 @@ private void streamChat(HttpExchange exchange, JsonNode request) throws IOExcept
                 config.getHeartbeatMillis(),
                 TimeUnit.MILLISECONDS);
         try {
-            backend.stream(request, chunkJson -> writeStrict(os, writeLock, OpenAiSseFormatter.sseData(chunkJson)));
+            backend.stream(
+                    request,
+                    chunkJson -> writeStrict(
+                            os,
+                            writeLock,
+                            OpenAiSseFormatter.sseData(OpenAiSseFormatter.ensureUsageCachedTokens(chunkJson))));
             writeStrict(os, writeLock, OpenAiSseFormatter.sseDone());
         } catch (IllegalArgumentException e) {
             writeQuietly(
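The alias and CORS behaviour in this file can be checked end to end with nothing but the JDK. The sketch below is not the commit's test code: `CorsAliasProbe`, `cors` and `preflight` are hypothetical names, the stub handler stands in for `handleInfill`, and the filter is a cut-down copy of `buildCorsFilter` (without `Access-Control-Max-Age`). It assumes Java 11+ for `java.net.http` and that loopback sockets are available.

```java
import com.sun.net.httpserver.Filter;
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpHandler;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

final class CorsAliasProbe {

    /** Cut-down copy of the commit's CORS filter, for illustration only. */
    static Filter cors(String allowOrigin) {
        return new Filter() {
            @Override
            public void doFilter(HttpExchange exchange, Chain chain) throws IOException {
                exchange.getResponseHeaders().set("Access-Control-Allow-Origin", allowOrigin);
                if ("OPTIONS".equalsIgnoreCase(exchange.getRequestMethod())) {
                    exchange.getResponseHeaders().set("Access-Control-Allow-Methods", "GET, POST, OPTIONS");
                    exchange.getResponseHeaders().set("Access-Control-Allow-Headers", "Content-Type, Authorization");
                    exchange.sendResponseHeaders(204, -1); // -1: no response body
                    exchange.close();
                    return; // the preflight never reaches the route handler
                }
                chain.doFilter(exchange);
            }

            @Override
            public String description() {
                return "CORS sketch";
            }
        };
    }

    /** Sends an OPTIONS preflight to {@code path}; returns "<status> <Access-Control-Allow-Origin>". */
    static String preflight(String path) {
        HttpServer server = null;
        try {
            // Port 0 lets the OS pick a free loopback port.
            server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            HttpHandler stub = exchange -> {
                exchange.sendResponseHeaders(200, -1);
                exchange.close();
            };
            // Canonical path plus alias, each with the filter attached, as register(...) does.
            for (String route : new String[] {"/infill", "/v1/infill"}) {
                server.createContext(route, stub).getFilters().add(cors("*"));
            }
            server.start();
            URI uri = URI.create("http://127.0.0.1:" + server.getAddress().getPort() + path);
            HttpRequest request = HttpRequest.newBuilder(uri)
                    .version(HttpClient.Version.HTTP_1_1)
                    .method("OPTIONS", HttpRequest.BodyPublishers.noBody())
                    .build();
            HttpResponse<Void> response =
                    HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.discarding());
            return response.statusCode() + " "
                    + response.headers().firstValue("Access-Control-Allow-Origin").orElse("<missing>");
        } catch (IOException | InterruptedException e) {
            throw new IllegalStateException(e);
        } finally {
            if (server != null) {
                server.stop(0);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(preflight("/infill"));
        System.out.println(preflight("/v1/infill"));
    }
}
```

Both paths should answer the preflight identically, which is the point of attaching the filter inside a single `register` helper rather than per handler.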

src/main/java/net/ladenthin/llama/server/OpenAiRequestMapper.java

Lines changed: 13 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -40,7 +40,12 @@ InferenceParameters toInferenceParameters(JsonNode request) {
             throw new IllegalArgumentException("'messages' must be a non-empty array");
         }
 
-        InferenceParameters params = InferenceParameters.empty().withMessagesJson(messages.toString());
+        // cache_prompt=true reuses the slot's KV prefix across turns — the standard llama.cpp-server
+        // default and what IDE clients rely on for acceptable repeated-prefix latency. OpenAI requests
+        // never carry this llama.cpp-specific flag, so defaulting it here is safe.
+        InferenceParameters params = InferenceParameters.empty()
+                .withMessagesJson(messages.toString())
+                .withCachePrompt(true);
 
         JsonNode tools = request.path("tools");
         if (tools.isArray() && tools.size() > 0) {
@@ -86,6 +91,13 @@ InferenceParameters toInferenceParameters(JsonNode request) {
             params = params.withStopStrings(stops);
         }
 
+        // Forward stream_options verbatim (e.g. {"include_usage":true}) so the native server emits the
+        // trailing usage chunk the OpenAI streaming protocol — and the Copilot custom endpoint — expect.
+        JsonNode streamOptions = request.path("stream_options");
+        if (streamOptions.isObject()) {
+            params = params.withStreamOptions(streamOptions.toString());
+        }
+
         return params;
     }
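Seen from the client side, the effect of forwarding `stream_options` is one extra SSE frame before the `[DONE]` sentinel, with an empty `choices` array and a populated `usage` object. The following is a rough sketch of how a client might spot that frame. `SseUsageScan` and its methods are hypothetical, and the substring checks assume compact JSON without whitespace around `:`; a real client would parse each payload properly.

```java
import java.util.ArrayList;
import java.util.List;

final class SseUsageScan {

    /** Extracts the payload of every "data:" line, stopping at the [DONE] sentinel. */
    static List<String> dataPayloads(String sseText) {
        List<String> payloads = new ArrayList<>();
        for (String line : sseText.split("\n")) {
            if (!line.startsWith("data:")) {
                continue; // blank frame separators and any non-data lines
            }
            String payload = line.substring("data:".length()).trim();
            if (payload.equals("[DONE]")) {
                break;
            }
            payloads.add(payload);
        }
        return payloads;
    }

    /** Crude test for the trailing usage chunk: empty choices plus a usage object. */
    static boolean isUsageChunk(String payload) {
        return payload.contains("\"choices\":[]") && payload.contains("\"usage\":{");
    }

    public static void main(String[] args) {
        String stream = "data: {\"choices\":[{\"delta\":{\"content\":\"hi\"}}]}\n\n"
                + "data: {\"choices\":[],\"usage\":{\"prompt_tokens\":3}}\n\n"
                + "data: [DONE]\n\n";
        for (String payload : dataPayloads(stream)) {
            System.out.println(isUsageChunk(payload) + " " + payload);
        }
    }
}
```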

src/main/java/net/ladenthin/llama/server/OpenAiServerConfig.java

Lines changed: 32 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -32,13 +32,21 @@ public final class OpenAiServerConfig {
     /** Default Server-Sent-Events heartbeat interval, in milliseconds. */
     public static final long DEFAULT_HEARTBEAT_MILLIS = 15_000L;
 
+    /**
+     * Default {@code Access-Control-Allow-Origin} value: {@code "*"}. Browser- and webview-based clients
+     * send a CORS preflight and require this header; {@code "*"} is the pragmatic default for a server
+     * that binds loopback and authenticates with a bearer token (not cookies).
+     */
+    public static final String DEFAULT_CORS_ALLOW_ORIGIN = "*";
+
     private final String host;
     private final int port;
     private final @Nullable String apiKey;
     private final String modelId;
     private final int maxInputTokens;
     private final int maxOutputTokens;
     private final long heartbeatMillis;
+    private final String corsAllowOrigin;
 
     private OpenAiServerConfig(Builder builder) {
         this.host = builder.host;
@@ -48,6 +56,7 @@ private OpenAiServerConfig(Builder builder) {
         this.maxInputTokens = builder.maxInputTokens;
         this.maxOutputTokens = builder.maxOutputTokens;
         this.heartbeatMillis = builder.heartbeatMillis;
+        this.corsAllowOrigin = builder.corsAllowOrigin;
     }
 
     /**
@@ -122,6 +131,15 @@ public long getHeartbeatMillis() {
         return heartbeatMillis;
     }
 
+    /**
+     * The {@code Access-Control-Allow-Origin} value sent on every response and CORS preflight.
+     *
+     * @return the allowed CORS origin
+     */
+    public String getCorsAllowOrigin() {
+        return corsAllowOrigin;
+    }
+
     /**
      * Whether bearer-token authentication is enabled (an API key is configured).
      *
@@ -152,6 +170,8 @@ public String toString() {
                 + maxOutputTokens
                 + ", heartbeatMillis="
                 + heartbeatMillis
+                + ", corsAllowOrigin="
+                + corsAllowOrigin
                 + '}';
     }

@@ -165,6 +185,7 @@ public static final class Builder {
         private int maxInputTokens = DEFAULT_MAX_INPUT_TOKENS;
         private int maxOutputTokens = DEFAULT_MAX_OUTPUT_TOKENS;
         private long heartbeatMillis = DEFAULT_HEARTBEAT_MILLIS;
+        private String corsAllowOrigin = DEFAULT_CORS_ALLOW_ORIGIN;
 
         private Builder() {}

@@ -245,6 +266,17 @@ public Builder heartbeatMillis(long heartbeatMillis) {
             return this;
         }
 
+        /**
+         * Sets the {@code Access-Control-Allow-Origin} value (CORS).
+         *
+         * @param corsAllowOrigin the allowed origin (e.g. {@code "*"} or a specific scheme/host/port)
+         * @return this builder
+         */
+        public Builder corsAllowOrigin(String corsAllowOrigin) {
+            this.corsAllowOrigin = corsAllowOrigin;
+            return this;
+        }
+
         /**
          * Builds the immutable configuration.
          *

src/main/java/net/ladenthin/llama/server/OpenAiSseFormatter.java

Lines changed: 42 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -4,9 +4,11 @@
 
 package net.ladenthin.llama.server;
 
+import com.fasterxml.jackson.databind.JsonNode;
 import com.fasterxml.jackson.databind.ObjectMapper;
 import com.fasterxml.jackson.databind.node.ArrayNode;
 import com.fasterxml.jackson.databind.node.ObjectNode;
+import java.io.IOException;
 import org.jspecify.annotations.Nullable;
 
 /**
@@ -73,6 +75,46 @@ static String errorJson(String message, String type, @Nullable String code) {
         return root.toString();
     }
 
+    /**
+     * Guarantee a streamed chunk's usage object carries {@code usage.prompt_tokens_details.cached_tokens}.
+     *
+     * <p>When {@code stream_options.include_usage} is set, the OpenAI streaming protocol emits a trailing
+     * usage chunk. The VS&nbsp;Code Copilot custom endpoint throws
+     * {@code Cannot read properties of undefined (reading 'cached_tokens')} (microsoft/vscode #273482) if
+     * {@code usage.prompt_tokens_details.cached_tokens} is missing, and upstream llama.cpp does not always
+     * populate it. This fills a default of {@code 0} when absent. Token-delta chunks (which carry no
+     * non-null usage object) are returned unchanged and unparsed, so the streaming hot path is untouched.
+     *
+     * @param chunkJson one {@code chat.completion.chunk} serialized as JSON
+     * @return the chunk JSON with {@code cached_tokens} guaranteed present inside any non-null usage object
+     */
+    static String ensureUsageCachedTokens(String chunkJson) {
+        // Fast path: only the trailing usage chunk carries a non-null usage object — skip the rest unparsed.
+        if (!chunkJson.contains("\"usage\"") || chunkJson.contains("\"usage\":null")) {
+            return chunkJson;
+        }
+        try {
+            JsonNode root = OBJECT_MAPPER.readTree(chunkJson);
+            if (!root.isObject() || !root.path("usage").isObject()) {
+                return chunkJson;
+            }
+            ObjectNode usage = (ObjectNode) root.get("usage");
+            JsonNode details = usage.path("prompt_tokens_details");
+            if (details.isObject()) {
+                if (details.has("cached_tokens")) {
+                    return chunkJson; // already correct — emit verbatim
+                }
+                ((ObjectNode) details).put("cached_tokens", 0);
+            } else {
+                usage.putObject("prompt_tokens_details").put("cached_tokens", 0);
+            }
+            return root.toString();
+        } catch (IOException e) {
+            // Never break a live stream over a formatting nicety.
+            return chunkJson;
+        }
+    }
+
     /**
      * Build the {@code GET /v1/models} body advertising a single model.
      *
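The fast path in `ensureUsageCachedTokens` is a pure substring test, so its behaviour on different chunk shapes is easy to tabulate. The sketch below isolates just that predicate under a hypothetical name (`UsageFastPath.needsParse`); it does not touch Jackson and is not the commit's code. One detail worth noticing: a chunk serialized as `"usage": null` (with a space) is not caught by the fast path, but in the method above it is still returned unchanged because the parsed `usage` node is not an object.

```java
final class UsageFastPath {

    /**
     * Mirrors the fast-path condition: true only when the chunk mentions "usage"
     * and does not contain the compact form "usage":null.
     */
    static boolean needsParse(String chunkJson) {
        return chunkJson.contains("\"usage\"") && !chunkJson.contains("\"usage\":null");
    }

    public static void main(String[] args) {
        String[] chunks = {
            "{\"choices\":[{\"delta\":{\"content\":\"hi\"}}]}",                // no usage key
            "{\"choices\":[{\"delta\":{\"content\":\"hi\"}}],\"usage\":null}", // compact null
            "{\"choices\":[],\"usage\":{\"prompt_tokens\":3}}",                // trailing usage chunk
            "{\"choices\":[],\"usage\": null}"                                 // spaced null: parsed, then left as-is
        };
        for (String chunk : chunks) {
            System.out.println(needsParse(chunk) + " " + chunk);
        }
    }
}
```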

src/main/java/net/ladenthin/llama/server/package-info.java

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -21,10 +21,16 @@
  *       so streamed {@code delta.tool_calls} are preserved for agent-mode tool use.</li>
  *   <li>{@code POST /v1/completions} and {@code POST /v1/embeddings} — non-streaming, forwarding the
  *       request body to the matching {@code LlamaModel.handle*} method.</li>
+ *   <li>{@code POST /infill} — non-streaming fill-in-the-middle for local ghost-text autocomplete
+ *       clients (llama.vscode, Twinny, Tabby); the model's FIM tokens are applied server-side.</li>
  *   <li>{@code GET /v1/models} — advertises the configured model id.</li>
  *   <li>{@code GET /health} — unauthenticated liveness probe.</li>
  * </ul>
  *
+ * <p>Every route is also reachable without the {@code /v1} prefix, answers CORS preflight
+ * ({@code OPTIONS}) requests, and stamps {@code Access-Control-Allow-Origin} on responses so
+ * browser/webview clients are not blocked.</p>
+ *
  * <p>The HTTP surface is decoupled from the model behind {@link net.ladenthin.llama.server.OpenAiBackend}
  * (production implementation {@link net.ladenthin.llama.server.LlamaModelBackend}) so routing,
  * authentication, SSE framing and heartbeats are unit-testable with a fake backend — no socket and no
