feat(server): stream POST /v1/completions token-by-token (no new native code)

vaijurao · claude · vaijurao · commit cf635fdfdd39 · 2026-06-21T10:50:38.000-07:00
The native streaming raw-completion path already exists (requestCompletion / receiveCompletionJson,
exposed as LlamaModel.generate(InferenceParameters) -> LlamaIterable), so streaming /v1/completions is
pure server wiring — no JNI/C++ change:

- OpenAiRequestMapper.toCompletionParameters maps an OpenAI completion request (prompt + sampling) to
  InferenceParameters; the shared sampling/cache/output fields are factored into applyCommonFields
  (reused by the chat mapper, whose tests confirm no behaviour change).
- OpenAiBackend.streamCompletions(request, sink) (default throws UnsupportedOperationException) +
  LlamaModelBackend impl drives generate() and emits one OpenAI text_completion chunk per token,
  mapping StopReason -> finish_reason (length / stop); a sink IOException cancels the native task via
  LlamaIterable.close().
- OpenAiSseFormatter.completionChunk builds the text_completion chunk; OpenAiCompatServer.handleCompletions
  branches on stream:true to a new streamCompletions SSE handler (mirrors streamChat: heartbeats, [DONE],
  graceful client-disconnect).
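
For orientation, here is a rough sketch of the wire format the bullets above describe: one
text_completion chunk per token, framed as SSE data events and followed by [DONE]. It is a
self-contained illustration, not the project's code: the class name, the hand-rolled JSON quoting and
the StopReason values other than MAX_TOKENS are assumptions (the real OpenAiSseFormatter builds the
chunk with Jackson), and the "data: ...\n\n" framing is standard SSE rather than copied from the
formatter.

```java
/** Illustrative stand-in for the chunk formatter and token loop; not the project's classes. */
public class CompletionSseSketch {

    /** Hypothetical enum: only MAX_TOKENS appears in the commit, the other values are assumed. */
    public enum StopReason { NONE, EOS, STOP_STRING, MAX_TOKENS }

    /** Same rule as the commit: "length" on the token cap, otherwise "stop". */
    public static String finishReason(StopReason reason) {
        return reason == StopReason.MAX_TOKENS ? "length" : "stop";
    }

    /** Minimal JSON string quoting so the sketch needs no JSON library. */
    public static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"': sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }

    /** One text_completion chunk; finishReason is null on every chunk except the last. */
    public static String chunk(String id, long created, String model, String text, String finishReason) {
        String finish = finishReason == null ? "null" : quote(finishReason);
        return "{\"id\":" + quote(id)
                + ",\"object\":\"text_completion\""
                + ",\"created\":" + created
                + ",\"model\":" + quote(model)
                + ",\"choices\":[{\"text\":" + quote(text)
                + ",\"index\":0,\"logprobs\":null,\"finish_reason\":" + finish + "}]}";
    }

    /** Standard SSE framing: a data line terminated by a blank line. */
    public static String sseData(String json) {
        return "data: " + json + "\n\n";
    }

    public static String sseDone() {
        return "data: [DONE]\n\n";
    }

    /** Emit one chunk per token, attach the finish reason to the last one, then [DONE]. */
    public static String stream(String[] tokens, StopReason last) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            String finish = (i == tokens.length - 1) ? finishReason(last) : null;
            out.append(sseData(chunk("cmpl-demo", 0L, "llama", tokens[i], finish)));
        }
        return out.append(sseDone()).toString();
    }

    public static void main(String[] args) {
        System.out.print(stream(new String[] {"he", "llo"}, StopReason.EOS));
    }
}
```

In the commit itself the [DONE] marker is written by the server handler, not the backend, so a
backend only ever hands chunk JSON to the sink.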

Verified: new streaming HTTP test + 16 mapper + 36 server + 40 adjacent server tests green; Spotless +
Javadoc clean. TODO.md updated: corrects the stale "new native method needed" note, marks /v1/completions
done, and adds a grounded future-modality (audio/image OUTPUT) design note — llama.cpp generates text only,
so that surface stays a documented extension point rather than speculative dead code.

Remaining consumers (same pattern, follow-ups): token-streaming Ollama /api/generate and Continue's
native POST /completion.
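
The part of the pattern those follow-ups would most likely share is the token loop: iterate the
stream, hand each piece to a sink, and on a sink IOException leave the loop so that closing the
stream cancels the work. A minimal sketch of that shape follows. TokenStream and Sink are
hypothetical stand-ins for LlamaIterable and ChunkSink, and the "closed" flag only simulates
cancelling a native task.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Illustrative only: mirrors the deferred-IOException loop from the commit with fake types. */
public class CancelOnDisconnectSketch {

    /** Hypothetical sink, modelled on the ChunkSink used in the diff. */
    public interface Sink {
        void accept(String chunk) throws IOException;
    }

    /** Hypothetical stand-in for LlamaIterable: iterable, and close() stops further output. */
    public static final class TokenStream implements Iterable<String>, AutoCloseable {
        private final List<String> tokens;
        public boolean closed;
        public int produced;

        public TokenStream(List<String> tokens) {
            this.tokens = tokens;
        }

        @Override
        public Iterator<String> iterator() {
            final Iterator<String> it = tokens.iterator();
            return new Iterator<String>() {
                @Override
                public boolean hasNext() {
                    return !closed && it.hasNext();
                }

                @Override
                public String next() {
                    produced++;
                    return it.next();
                }
            };
        }

        @Override
        public void close() {
            closed = true; // the real iterable would cancel the in-flight native task here
        }
    }

    /** Relay tokens to the sink; remember a sink failure, let close() run, then rethrow it. */
    public static void relay(TokenStream stream, Sink sink) throws IOException {
        IOException sinkFailure = null;
        try (TokenStream s = stream) {
            for (String token : s) {
                try {
                    sink.accept(token);
                } catch (IOException e) {
                    sinkFailure = e;
                    break;
                }
            }
        }
        if (sinkFailure != null) {
            throw sinkFailure;
        }
    }

    public static void main(String[] args) {
        TokenStream stream = new TokenStream(Arrays.asList("a", "b", "c"));
        try {
            relay(stream, chunk -> {
                if (chunk.equals("b")) {
                    throw new IOException("client went away");
                }
                System.out.println("sent " + chunk);
            });
        } catch (IOException e) {
            System.out.println("stopped after " + stream.produced + " tokens, closed=" + stream.closed);
        }
    }
}
```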

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
diff --git a/TODO.md b/TODO.md
@@ -153,13 +153,25 @@ primary goal: agentic tool-calling with Qwen):
 
 **Open follow-ups (deferred):**
 
-- **Streaming raw-completion path (the shared blocker).** A new native streaming method
-  (`requestCompletionStream` alongside the existing chat one) is needed before these can be done
-  token-incrementally: (a) **streaming `/v1/completions`**, (b) **token-streaming `/api/generate`**
-  (today it computes the full text then emits one NDJSON content line), and (c) **Continue's native
-  `llama.cpp` provider** which streams `POST /completion` in the native (non-OAI) shape. Until then these
-  either run non-streaming or emit a single content chunk. JNI + C++ work; the agentic-chat goal does
-  not need it.
+- **Streaming raw-completion path — IN PROGRESS (no new native method needed).** The earlier premise was
+  wrong: a streaming raw-completion JNI path **already exists** (`requestCompletion`/`receiveCompletionJson`,
+  exposed as `LlamaModel.generate(InferenceParameters) → LlamaIterable`), so this is **Java-only server
+  wiring**, not JNI/C++. Progress: **(a) streaming `POST /v1/completions` — DONE** (`OpenAiRequestMapper`
+  `toCompletionParameters` + `OpenAiBackend.streamCompletions` driving `generate()` + an
+  `OpenAiSseFormatter.completionChunk` `text_completion` chunk + the `streamCompletions` SSE handler;
+  HTTP test green). **Remaining:** (b) **token-streaming Ollama `/api/generate`** (translate the
+  `text_completion` chunks to NDJSON, mirroring the chat→Ollama translator) and (c) **Continue's native
+  `POST /completion`** route in the llama.cpp-native streaming shape (`{"content":…,"stop":…}` per chunk).
+- **Future *output* modalities (audio / image) — design note, not yet actionable.** llama.cpp's server
+  produces **text** (plus embeddings/rerank); it does **not** generate images or audio output, so there is
+  no engine behind a TTS/image-gen response today and building that API surface now would be dead code.
+  When/if it becomes real, the integration points are already isolated: a new `OpenAiBackend.stream*`
+  primitive + an `OpenAiSseFormatter.*Chunk` formatter per modality, wired into a per-route handler — the
+  exact shape the text `streamCompletions` path now establishes. Two concrete future hooks: (1) llama.cpp's
+  **OuteTTS** audio path (if it lands in the embedded server) → an `/v1/audio/speech`-style route emitting
+  audio chunks; (2) routing image/audio generation to an **external** model behind the same server (the
+  binding would proxy, not generate). Keep `LlamaOutput`/chunk formatters modality-neutral so neither
+  requires reworking the streaming core.
 - **Incremental tool-call streaming on the alternative surfaces.** Ollama/Anthropic/Responses emit each
   tool call *whole* at end-of-stream (reconstructed by `ToolCallDeltaAccumulator`) rather than streaming
   argument fragments. Fine for clients that apply tool calls after generation; revisit if a client needs
diff --git a/src/main/java/net/ladenthin/llama/server/LlamaModelBackend.java b/src/main/java/net/ladenthin/llama/server/LlamaModelBackend.java
@@ -6,8 +6,12 @@
 
 import com.fasterxml.jackson.databind.JsonNode;
 import java.io.IOException;
+import java.util.UUID;
+import net.ladenthin.llama.LlamaIterable;
 import net.ladenthin.llama.LlamaModel;
 import net.ladenthin.llama.parameters.InferenceParameters;
+import net.ladenthin.llama.value.LlamaOutput;
+import net.ladenthin.llama.value.StopReason;
 
 /**
  * Production {@link OpenAiBackend} that runs requests against a loaded {@link LlamaModel}.
@@ -80,6 +84,36 @@ public String completions(JsonNode request) {
         return model.handleCompletionsOai(request.toString());
     }
 
+    @Override
+    public void streamCompletions(JsonNode request, ChunkSink sink) throws IOException {
+        InferenceParameters params = mapper.toCompletionParameters(request);
+        String modelId = request.path("model").asText("llama");
+        String id = "cmpl-" + UUID.randomUUID().toString().replace("-", "");
+        long created = System.currentTimeMillis() / 1000L;
+        // Relays a sink IOException (client disconnect) out of the token loop; try-with-resources then
+        // cancels the in-flight native task via LlamaIterable.close().
+        IOException sinkFailure = null;
+        try (LlamaIterable it = model.generate(params)) {
+            for (LlamaOutput out : it) {
+                String finishReason = out.stop ? completionFinishReason(out.stopReason) : null;
+                try {
+                    sink.accept(OpenAiSseFormatter.completionChunk(id, created, modelId, out.text, finishReason));
+                } catch (IOException e) {
+                    sinkFailure = e;
+                    break;
+                }
+            }
+        }
+        if (sinkFailure != null) {
+            throw sinkFailure;
+        }
+    }
+
+    /** Map a {@link StopReason} to the OpenAI {@code finish_reason} ("length" on the token cap, else "stop"). */
+    private static String completionFinishReason(StopReason reason) {
+        return reason == StopReason.MAX_TOKENS ? "length" : "stop";
+    }
+
     @Override
     public String embeddings(JsonNode request) {
         // oaiCompat=true so the response uses the OpenAI {"object":"list","data":[{embedding}]} shape.
diff --git a/src/main/java/net/ladenthin/llama/server/OpenAiBackend.java b/src/main/java/net/ladenthin/llama/server/OpenAiBackend.java
@@ -62,6 +62,20 @@ default String metrics() throws IOException {
      */
     String completions(JsonNode request) throws IOException;
 
+    /**
+     * Run a <em>streaming</em> text completion ({@code POST /v1/completions} with {@code stream:true}),
+     * delivering each OpenAI {@code text_completion} chunk to {@code sink} in order. Implementations must
+     * not emit the terminating {@code [DONE]} marker; the caller adds it. The default throws
+     * {@link UnsupportedOperationException}; backends that support streaming completions override it.
+     *
+     * @param request the parsed OpenAI {@code /v1/completions} request (must contain {@code "prompt"})
+     * @param sink receiver for each streamed chunk's JSON
+     * @throws IOException if a chunk cannot be delivered or generation fails
+     */
+    default void streamCompletions(JsonNode request, ChunkSink sink) throws IOException {
+        throw new UnsupportedOperationException("streaming /v1/completions is not supported by this backend");
+    }
+
     /**
      * Generate embeddings ({@code POST /v1/embeddings}). Requires the model to have been loaded in
      * embedding mode; otherwise the native call fails and the caller surfaces a server error.
diff --git a/src/main/java/net/ladenthin/llama/server/OpenAiCompatServer.java b/src/main/java/net/ladenthin/llama/server/OpenAiCompatServer.java
@@ -42,7 +42,8 @@
  *   <li>{@code POST /v1/chat/completions} — streaming (Server-Sent Events) and non-streaming chat
  *       completions, forwarded faithfully (messages/tools verbatim; streamed {@code delta.tool_calls}
  *       preserved).</li>
- *   <li>{@code POST /v1/completions} — non-streaming text completion.</li>
+ *   <li>{@code POST /v1/completions} — text completion, streaming (Server-Sent Events, token by token
+ *       via {@code generate(...)}) when {@code stream:true} and non-streaming otherwise.</li>
  *   <li>{@code POST /v1/embeddings} — embeddings (requires the model to be loaded in embedding
  *       mode).</li>
  *   <li>{@code GET /v1/models} — advertises the single configured model.</li>
@@ -302,13 +303,48 @@ private void handleCompletions(HttpExchange exchange) throws IOException {
         try {
             JsonNode request = requirePostJson(exchange);
             if (request != null) {
-                completeNonStreaming(exchange, request, backend::completions);
+                if (request.path("stream").asBoolean(false)) {
+                    streamCompletions(exchange, request);
+                } else {
+                    completeNonStreaming(exchange, request, backend::completions);
+                }
             }
         } finally {
             exchange.close();
         }
     }
 
+    private void streamCompletions(HttpExchange exchange, JsonNode request) throws IOException {
+        exchange.getResponseHeaders().set("Content-Type", CONTENT_TYPE_SSE);
+        exchange.getResponseHeaders().set("Cache-Control", "no-cache");
+        exchange.sendResponseHeaders(HTTP_OK, 0);
+        try (ResponseStream out = new ResponseStream(exchange.getResponseBody())) {
+            ScheduledFuture<?> heartbeat = null;
+            try {
+                heartbeat = heartbeatExecutor.scheduleAtFixedRate(
+                        () -> out.writeQuietly(OpenAiSseFormatter.heartbeat()),
+                        config.getHeartbeatMillis(),
+                        config.getHeartbeatMillis(),
+                        TimeUnit.MILLISECONDS);
+                backend.streamCompletions(request, chunkJson -> out.writeStrict(OpenAiSseFormatter.sseData(chunkJson)));
+                out.writeStrict(OpenAiSseFormatter.sseDone());
+            } catch (IllegalArgumentException e) {
+                out.writeQuietly(
+                        OpenAiSseFormatter.sseData(OpenAiSseFormatter.errorJson(message(e), ERROR_TYPE_REQUEST, null)));
+            } catch (IOException e) {
+                LOG.debug("client disconnected during stream", e);
+            } catch (RuntimeException e) {
+                LOG.warn("streaming completion failed", e);
+                out.writeQuietly(
+                        OpenAiSseFormatter.sseData(OpenAiSseFormatter.errorJson(message(e), ERROR_TYPE_SERVER, null)));
+            } finally {
+                if (heartbeat != null) {
+                    heartbeat.cancel(false);
+                }
+            }
+        }
+    }
+
     private void handleEmbeddings(HttpExchange exchange) throws IOException {
         try {
             JsonNode request = requirePostJson(exchange);
diff --git a/src/main/java/net/ladenthin/llama/server/OpenAiRequestMapper.java b/src/main/java/net/ladenthin/llama/server/OpenAiRequestMapper.java
@@ -48,20 +48,9 @@ InferenceParameters toInferenceParameters(JsonNode request) {
                 .withMessagesJson(messages.toString())
                 .withCachePrompt(true);
 
-        // Preserve llama.cpp extensions when advanced clients opt into them.
-        JsonNode cachePrompt = request.path("cache_prompt");
-        if (cachePrompt.isBoolean()) {
-            params = params.withCachePrompt(cachePrompt.asBoolean());
-        }
-        JsonNode cacheReuse = request.path("n_cache_reuse");
-        if (cacheReuse.isIntegralNumber()) {
-            params = params.withCacheReuse(cacheReuse.asInt());
-        }
-        JsonNode slotId = request.path("id_slot");
-        if (slotId.isIntegralNumber()) {
-            params = params.withSlotId(slotId.asInt());
-        }
+        params = applyCommonFields(params, request);
 
+        // Tools are chat-only.
         JsonNode tools = request.path("tools");
         if (tools.isArray() && tools.size() > 0) {
             params = params.withToolsJson(tools.toString()).withUseChatTemplate(true);
@@ -75,6 +64,51 @@ InferenceParameters toInferenceParameters(JsonNode request) {
             }
         }
 
+        return params;
+    }
+
+    /**
+     * Translate an OpenAI {@code /v1/completions} request (a raw {@code prompt} string) into
+     * {@link InferenceParameters} for the streaming {@code generate} path.
+     *
+     * @param request the parsed OpenAI completion request object
+     * @return inference parameters carrying the prompt and mapped sampling options
+     * @throws IllegalArgumentException if {@code prompt} is missing or not a string
+     */
+    InferenceParameters toCompletionParameters(JsonNode request) {
+        JsonNode prompt = request.path("prompt");
+        if (!prompt.isTextual()) {
+            throw new IllegalArgumentException("'prompt' must be a string");
+        }
+        InferenceParameters params =
+                InferenceParameters.empty().withPrompt(prompt.asText()).withCachePrompt(true);
+        return applyCommonFields(params, request);
+    }
+
+    /**
+     * Apply the sampling / KV-cache / output-shaping fields shared by chat and completion requests
+     * (temperature, top_p/top_k, seed, penalties, max tokens, stop, stream_options, response_format,
+     * plus the llama.cpp cache extensions). Tools and messages/prompt are handled by the callers.
+     *
+     * @param params the parameters to extend
+     * @param request the parsed OpenAI request object
+     * @return a new instance with the recognised fields applied
+     */
+    private InferenceParameters applyCommonFields(InferenceParameters params, JsonNode request) {
+        // Preserve llama.cpp extensions when advanced clients opt into them.
+        JsonNode cachePrompt = request.path("cache_prompt");
+        if (cachePrompt.isBoolean()) {
+            params = params.withCachePrompt(cachePrompt.asBoolean());
+        }
+        JsonNode cacheReuse = request.path("n_cache_reuse");
+        if (cacheReuse.isIntegralNumber()) {
+            params = params.withCacheReuse(cacheReuse.asInt());
+        }
+        JsonNode slotId = request.path("id_slot");
+        if (slotId.isIntegralNumber()) {
+            params = params.withSlotId(slotId.asInt());
+        }
+
         JsonNode temperature = request.path("temperature");
         if (temperature.isNumber()) {
             params = params.withTemperature((float) temperature.asDouble());
diff --git a/src/main/java/net/ladenthin/llama/server/OpenAiSseFormatter.java b/src/main/java/net/ladenthin/llama/server/OpenAiSseFormatter.java
@@ -134,6 +134,37 @@ static String modelsJson(String modelId) {
         return root.toString();
     }
 
+    /**
+     * Build one OpenAI {@code text_completion} streaming chunk for {@code POST /v1/completions}.
+     *
+     * @param id the completion id, stable across the whole stream
+     * @param created the creation timestamp in epoch seconds
+     * @param model the served model id
+     * @param text the incremental token text carried by this chunk
+     * @param finishReason the finish reason on the final chunk, or {@code null} for intermediate chunks
+     * @return the chunk serialized as JSON
+     */
+    static String completionChunk(String id, long created, String model, String text, @Nullable String finishReason) {
+        ObjectNode choice = OBJECT_MAPPER.createObjectNode();
+        choice.put("text", text);
+        choice.put("index", 0);
+        choice.putNull("logprobs");
+        if (finishReason == null) {
+            choice.putNull("finish_reason");
+        } else {
+            choice.put("finish_reason", finishReason);
+        }
+        ArrayNode choices = OBJECT_MAPPER.createArrayNode();
+        choices.add(choice);
+        ObjectNode root = OBJECT_MAPPER.createObjectNode();
+        root.put("id", id);
+        root.put("object", "text_completion");
+        root.put("created", created);
+        root.put("model", model);
+        root.set("choices", choices);
+        return root.toString();
+    }
+
     /**
      * Build the llama.cpp-native {@code GET /props} body. Autocomplete clients (e.g. llama.vscode) read
      * {@code default_generation_settings.n_ctx} from here to size their context window, and newer clients
diff --git a/src/test/java/net/ladenthin/llama/server/OpenAiCompatServerHttpTest.java b/src/test/java/net/ladenthin/llama/server/OpenAiCompatServerHttpTest.java
@@ -54,6 +54,18 @@ public void streamingReturnsSseChunksThenDone() throws IOException {
         }
     }
 
+    @Test
+    public void streamingCompletionsReturnsTextCompletionChunksThenDone() throws IOException {
+        try (OpenAiCompatServer server = new OpenAiCompatServer(new FakeBackend(), config()).start()) {
+            String body = "{\"stream\":true,\"prompt\":\"hi\"}";
+            Response response = post(server.getPort(), "/v1/completions", body, "");
+            assertThat(response.code, is(200));
+            assertThat(response.body, containsString("data: "));
+            assertThat(response.body, containsString("text_completion"));
+            assertThat(response.body, containsString("data: [DONE]"));
+        }
+    }
+
     @Test
     public void streamingEmitsHeartbeatsDuringAGap() throws IOException {
         OpenAiServerConfig cfg = OpenAiServerConfig.builder()
@@ -457,6 +469,12 @@ public String completions(JsonNode request) {
             return "{\"object\":\"text_completion\",\"choices\":[{\"text\":\"hello\"}]}";
         }
 
+        @Override
+        public void streamCompletions(JsonNode request, ChunkSink sink) throws IOException {
+            sink.accept("{\"object\":\"text_completion\",\"choices\":[{\"text\":\"he\",\"finish_reason\":null}]}");
+            sink.accept("{\"object\":\"text_completion\",\"choices\":[{\"text\":\"llo\",\"finish_reason\":\"stop\"}]}");
+        }
+
         @Override
         public String embeddings(JsonNode request) {
             return "{\"object\":\"list\",\"data\":[{\"object\":\"embedding\",\"embedding\":[0.1,0.2]}]}";