server: real-time reasoning interruption via control endpoint (#23971)
* server: real-time reasoning interruption via control endpoint
Builds on the manual reasoning budget trigger from #23949. Adds a
CONTROL task that mirrors the CANCEL path on the live slot and calls
common_sampler_reasoning_budget_force to end thinking mid-generation.
POST /v1/chat/completions/control takes { id_slot, action }; the opt-in
reasoning_control flag arms the budget sampler on demand. Works in both
router and single model mode. Minimal WebUI button as a skeleton for
further UI work.
* ui: track reasoning phase via explicit streaming state
Add isReasoning to the chat store, mirroring the isLoading pattern:
per conversation map, private setter, public accessor and reactive
export. Set from the stream callbacks, true on reasoning chunks, false
on the first content chunk, reset on stream end and resynced on
conversation switch. The skip button now keys off isReasoning so it
shows only during the thinking phase, not the whole generation.
* ui: extract control endpoint and action into constants
Move the chat completion routes, the slots route and the reasoning
control action out of chat.service into api-endpoints and a dedicated
control-actions module. No behavior change, drops the magic strings so
the control protocol has a single source of truth.
* server: target reasoning control by completion id
Address @ngxson review on the control endpoint.
Switch from id_slot to the chat completion id to avoid a TOCTOU: the
slot can be reassigned between the lookup and the control request, so
matching the live completion (oaicompat_cmpl_id) is safe and a finished
one simply matches nothing. Rename the action to reasoning_end, guard
it on the reasoning_control flag of the target slot, and reduce the
response to {success} with an optional message.
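
The race described above can be illustrated with a toy model. This is only a sketch, not the server code: the `Slot` class and both functions are hypothetical names, and the ids are made up. It shows why a request keyed by slot can hit the wrong completion after the slot is reassigned, while a request keyed by completion id simply matches nothing.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Slot:
    completion_id: Optional[str] = None  # None while the slot is idle
    reasoning_control: bool = False      # opt-in flag from the completion request
    reasoning_ended: bool = False


def control_by_slot(slots: Dict[int, Slot], id_slot: int) -> bool:
    """Racy variant: acts on whatever completion currently owns the slot."""
    slot = slots[id_slot]
    if slot.completion_id is None:
        return False
    slot.reasoning_ended = True
    return True


def control_by_completion(slots: Dict[int, Slot], completion_id: str) -> bool:
    """Safe variant: acts only on the live completion with a matching id."""
    if not completion_id:
        return False
    for slot in slots.values():
        if slot.completion_id == completion_id and slot.reasoning_control:
            slot.reasoning_ended = True
            return True
    return False  # finished or unknown completion: nothing matches


# The client saw completion "cmpl-A" on slot 0, but before its control
# request arrives the slot has been reassigned to "cmpl-B".
slots = {0: Slot(completion_id="cmpl-B", reasoning_control=True)}
hit_wrong_target = control_by_slot(dict(slots), 0)    # would end B's reasoning
matched_stale_id = control_by_completion(slots, "cmpl-A")
```

In this toy run the slot-based call would act on `cmpl-B`, which the client never meant to touch, whereas the id-based call returns `False` and leaves every slot alone.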
* ui: target reasoning control by completion id
Keep the streamed completion id on the message and post it back to the
control endpoint instead of probing /slots. Drops the slot discovery
and the TOCTOU that came with it. Action renamed to reasoning_end,
response read as {success}.
* server: address review from @ngxson
Move the control fields into task_params and drop the redundant
comments on the control path.
* server: document the reasoning control endpoint
* Update tools/ui/src/lib/types/database.d.ts
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* ui: rename cmplId to completionId
Per @allozaur review, clearer name for the streamed completion id.
* ui: wire completion id capture through the agentic flow
The webui streams through the agentic flow, which relayed onModel but
not onCompletionId, so the completion id never reached the message and
the control request was never sent. Relay it through the flow and its
callbacks type, declare id on the chunk type, and log an explicit error
when the button fires without a usable id.
* ui: target reasoning control model from the message
The model is a property of the completion, so read it from the streaming
message like the id, not from the model dropdown which is unrelated UI
state. Makes the request self-consistent by construction instead of just
unlikely to drift.
---------
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
tools/server/README.md: 18 additions & 0 deletions
@@ -1244,6 +1244,8 @@ The `response_format` parameter supports both plain JSON output (e.g. `{"type":
`reasoning_format`: The reasoning format to be parsed. If set to `none`, it will output the raw generated text.
+`reasoning_control`: Arms realtime reasoning control for this completion so it can be ended early via `/v1/chat/completions/control`. Defaults to `false`.
+
`generation_prompt`: The generation prompt that was prefilled in by the template. Prepended to model output before parsing.
`parse_tool_calls`: Whether to parse the generated tool call.
@@ -1350,6 +1352,22 @@ The server supports parsing and returning reasoning via the `reasoning_content`
Reasoning input (preserve reasoning in history) is also supported by some specific templates. For more details, please refer to [PR#18994](https://github.com/ggml-org/llama.cpp/pull/18994).
+### POST `/v1/chat/completions/control`: Control a running chat completion in real time
+
+Acts on an in-flight completion identified by its `id` (the `id` field streamed back by `/v1/chat/completions`). The request is processed in parallel with the SSE stream, so the client sends it while still reading tokens.
+
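
A minimal client-side sketch of the first half of this flow: opting in on the original request and picking the completion `id` out of the stream. It assumes OpenAI-style SSE framing (`data: {json}` lines and a final `data: [DONE]`); the helper names and the example id are hypothetical, and no network call is made.

```python
import json
from typing import Optional


def build_chat_request(messages: list, model: Optional[str] = None) -> dict:
    """Body for POST /v1/chat/completions that opts in to reasoning control."""
    body = {
        "messages": messages,
        "stream": True,             # the id is read from the streamed chunks
        "reasoning_control": True,  # defaults to false; needed for reasoning_end
    }
    if model is not None:
        body["model"] = model
    return body


def completion_id_from_sse_line(line: str) -> Optional[str]:
    """Return the completion id carried by one SSE line, or None.

    Assumption: chunks arrive as 'data: {json}' and the stream ends with
    'data: [DONE]'. Anything else (comments, blank lines) is ignored.
    """
    if not line.startswith("data:"):
        return None
    payload = line[len("data:"):].strip()
    if not payload or payload == "[DONE]":
        return None
    try:
        chunk = json.loads(payload)
    except json.JSONDecodeError:
        return None
    cid = chunk.get("id") if isinstance(chunk, dict) else None
    return cid if isinstance(cid, str) and cid else None
```

A client would typically keep the first non-`None` id it sees and reuse it for any control request sent while the stream is still open.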
+*Options:*
+
+`id`: (Required) The chat completion id to act on. A completion that has already finished matches nothing and the call is a no-op.
+
+`action`: (Required) The control action to perform. Currently the only supported value is `reasoning_end`, which forces the end of the current reasoning block so the model moves on to the final answer. Requires `reasoning_control: true` on the original completion request.
+
+`model`: (Required in router mode) The model name, used to route the request to the right instance. Ignored in single model mode.
+
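
The options above map onto a small JSON body. The following is a sketch of how a client might assemble it; `build_control_request` and the constants are hypothetical names, and the id in the usage line is a placeholder.

```python
from typing import Optional

CONTROL_ROUTE = "/v1/chat/completions/control"
SUPPORTED_ACTIONS = {"reasoning_end"}  # the only value documented above


def build_control_request(completion_id: str,
                          action: str = "reasoning_end",
                          model: Optional[str] = None) -> dict:
    """Assemble the JSON body for a control request."""
    if not completion_id:
        raise ValueError("id is required")
    if action not in SUPPORTED_ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    body = {"id": completion_id, "action": action}
    if model is not None:
        body["model"] = model  # required in router mode, ignored otherwise
    return body


# Placeholder id; a real client would use the id captured from the stream.
example_body = build_control_request("chatcmpl-abc", model="my-model")
```

Validating the action on the client is a convenience only; the server remains the authority on which actions it accepts.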
+**Response format**
+
+Returns a JSON object with a boolean `success` field, and an optional `message` field describing the reason when `success` is `false`.
+
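
A short sketch of reading that response on the client. `parse_control_response` is a hypothetical helper; the text above does not say which `success` value a no-op on an already finished completion returns, so the sketch only reports what the server sent.

```python
import json
from typing import Tuple


def parse_control_response(raw: str) -> Tuple[bool, str]:
    """Return (success, message) from the control endpoint's JSON body.

    A missing or non-string `message` is reported as an empty string, and
    anything other than a literal true `success` counts as failure.
    """
    data = json.loads(raw)
    if not isinstance(data, dict):
        return False, ""
    success = data.get("success") is True
    message = data.get("message", "")
    return success, message if isinstance(message, str) else ""
```

For example, `parse_control_response('{"success": true}')` yields `(True, "")`, and a failure body with a `message` yields `False` together with that message text.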
### POST `/v1/responses`: OpenAI-compatible Responses API