Commit 354ebac

server: real-time reasoning interruption via control endpoint (#23971)
* server: real-time reasoning interruption via control endpoint

Builds on the manual reasoning budget trigger from #23949. Adds a CONTROL task that mirrors the CANCEL path on the live slot and calls common_sampler_reasoning_budget_force to end thinking mid-generation. POST /v1/chat/completions/control with { id_slot, action }, opt-in reasoning_control arms the budget sampler on demand. Router and single model. Minimal WebUI button as a skeleton for further UI work.

* ui: track reasoning phase via explicit streaming state

Add isReasoning to the chat store, mirroring the isLoading pattern: per conversation map, private setter, public accessor and reactive export. Set from the stream callbacks, true on reasoning chunks, false on the first content chunk, reset on stream end and resynced on conversation switch. The skip button now keys off isReasoning so it shows only during the thinking phase, not the whole generation.

* ui: extract control endpoint and action into constants

Move the chat completion routes, the slots route and the reasoning control action out of chat.service into api-endpoints and a dedicated control-actions module. No behavior change, drops the magic strings so the control protocol has a single source of truth.

* server: target reasoning control by completion id

Address @ngxson review on the control endpoint. Switch from id_slot to the chat completion id to avoid a TOCTOU: the slot can be reassigned between the lookup and the control request, so matching the live completion (oaicompat_cmpl_id) is safe and a finished one simply matches nothing. Rename the action to reasoning_end, guard it on the reasoning_control flag of the target slot, and reduce the response to {success} with an optional message.

* ui: target reasoning control by completion id

Keep the streamed completion id on the message and post it back to the control endpoint instead of probing /slots. Drops the slot discovery and the TOCTOU that came with it. Action renamed to reasoning_end, response read as {success}.

* server: address review from @ngxson

Move the control fields into task_params and drop the redundant comments on the control path.

* server: document the reasoning control endpoint

* Update tools/ui/src/lib/types/database.d.ts

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* ui: rename cmplId to completionId

Per @allozaur review, clearer name for the streamed completion id.

* ui: wire completion id capture through the agentic flow

The webui streams through the agentic flow, which relayed onModel but not onCompletionId, so the completion id never reached the message and the control request was never sent. Relay it through the flow and its callbacks type, declare id on the chunk type, and log an explicit error when the button fires without a usable id.

* ui: target reasoning control model from the message

The model is a property of the completion, so read it from the streaming message like the id, not from the model dropdown which is unrelated UI state. Makes the request self-consistent by construction instead of just unlikely to drift.

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
1 parent 1fd5f48 commit 354ebac

22 files changed

Lines changed: 277 additions & 6 deletions

common/common.h

Lines changed: 1 addition & 0 deletions
@@ -277,6 +277,7 @@ struct common_params_sampling {
     std::vector<llama_token> reasoning_budget_end; // end tag token sequence
     std::vector<llama_token> reasoning_budget_forced; // forced sequence (message + end tag)
     std::string reasoning_budget_message; // message injected before end tag when budget exhausted
+    bool reasoning_control = false; // create the budget sampler on demand so reasoning can be ended at runtime

     bool backend_sampling = false;

common/sampling.cpp

Lines changed: 1 addition & 1 deletion
@@ -293,7 +293,7 @@ struct common_sampler * common_sampler_init(const struct llama_model * model, st
     }

     // reasoning budget sampler (skip when budget is unlimited unless a lazy grammar is active, which needs rbudget for thinking-block suppression)
-    if (!params.reasoning_budget_start.empty() && !params.reasoning_budget_end.empty() && (params.grammar_lazy || params.reasoning_budget_tokens >= 0)) {
+    if (!params.reasoning_budget_start.empty() && !params.reasoning_budget_end.empty() && (params.grammar_lazy || params.reasoning_budget_tokens >= 0 || params.reasoning_control)) {
         rbudget = common_reasoning_budget_init(
             vocab,
             params.reasoning_budget_start,
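
The updated condition above can be read as a small predicate. The following is a minimal sketch rather than the project's code: `sampling_params_sketch` and `needs_reasoning_budget_sampler` are hypothetical names, plain `int` stands in for `llama_token`, and the `-1` default for an unlimited budget is an assumption inferred from the `>= 0` check.

```cpp
#include <vector>

// Hypothetical stand-in for the fields of common_params_sampling that the
// condition reads; plain ints replace llama_token in this sketch.
struct sampling_params_sketch {
    std::vector<int> reasoning_budget_start;
    std::vector<int> reasoning_budget_end;
    int  reasoning_budget_tokens = -1;    // assumed: negative means unlimited
    bool grammar_lazy            = false;
    bool reasoning_control       = false; // the flag added by this commit
};

// Mirrors the updated condition: both tag sequences must be known, and at
// least one of the three triggers must hold.
inline bool needs_reasoning_budget_sampler(const sampling_params_sketch & p) {
    const bool has_tags = !p.reasoning_budget_start.empty() && !p.reasoning_budget_end.empty();
    const bool wanted   = p.grammar_lazy || p.reasoning_budget_tokens >= 0 || p.reasoning_control;
    return has_tags && wanted;
}
```

Under these assumptions, a request with an unlimited budget and no lazy grammar creates the sampler only when `reasoning_control` is set, and never when the start or end tag sequence is missing.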

tools/server/README.md

Lines changed: 18 additions & 0 deletions
@@ -1244,6 +1244,8 @@ The `response_format` parameter supports both plain JSON output (e.g. `{"type":

 `reasoning_format`: The reasoning format to be parsed. If set to `none`, it will output the raw generated text.

+`reasoning_control`: Arms realtime reasoning control for this completion so it can be ended early via `/v1/chat/completions/control`. Defaults to `false`.
+
 `generation_prompt`: The generation prompt that was prefilled in by the template. Prepended to model output before parsing.

 `parse_tool_calls`: Whether to parse the generated tool call.
@@ -1350,6 +1352,22 @@ The server supports parsing and returning reasoning via the `reasoning_content`

 Reasoning input (preserve reasoning in history) is also supported by some specific templates. For more details, please refer to [PR#18994](https://github.com/ggml-org/llama.cpp/pull/18994).

+### POST `/v1/chat/completions/control`: Control a running chat completion in real time
+
+Acts on an in-flight completion identified by its `id` (the `id` field streamed back by `/v1/chat/completions`). The request is processed in parallel with the SSE stream, so the client sends it while still reading tokens.
+
+*Options:*
+
+`id`: (Required) The chat completion id to act on. A completion that has already finished matches nothing and the call is a no-op.
+
+`action`: (Required) The control action to perform. Currently the only supported value is `reasoning_end`, which forces the end of the current reasoning block so the model moves on to the final answer. Requires `reasoning_control: true` on the original completion request.
+
+`model`: (Required in router mode) The model name, used to route the request to the right instance. Ignored in single model mode.
+
+**Response format**
+
+Returns a JSON object with a boolean `success` field, and an optional `message` field describing the reason when `success` is `false`.
+
 ### POST `/v1/responses`: OpenAI-compatible Responses API

 *Options:*
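
To make the documented request shape concrete, here is a minimal sketch of a client-side helper that builds the JSON body for `/v1/chat/completions/control`. The names `json_escape` and `build_reasoning_end_body` are hypothetical and not part of the server or the WebUI. The sketch uses hand-rolled string building to stay dependency-free, and sending the body over HTTP is left out.

```cpp
#include <cstdio>
#include <string>

// Escape a string so it can be embedded inside a JSON string literal.
inline std::string json_escape(const std::string & s) {
    std::string out;
    for (unsigned char c : s) {
        if (c == '"' || c == '\\') {
            out += '\\';
            out += static_cast<char>(c);
        } else if (c < 0x20) {
            // control characters are written as \u00XX
            char buf[8];
            std::snprintf(buf, sizeof(buf), "\\u%04x", static_cast<unsigned>(c));
            out += buf;
        } else {
            out += static_cast<char>(c);
        }
    }
    return out;
}

// Hypothetical helper: body for POST /v1/chat/completions/control.
// `model` is only emitted when non-empty, matching "required in router mode".
inline std::string build_reasoning_end_body(const std::string & completion_id,
                                            const std::string & model = "") {
    std::string body = "{\"id\":\"" + json_escape(completion_id) + "\",\"action\":\"reasoning_end\"";
    if (!model.empty()) {
        body += ",\"model\":\"" + json_escape(model) + "\"";
    }
    body += "}";
    return body;
}
```

The `completion_id` argument is meant to be the `id` value taken from the streamed chunks of the original request, which must have been sent with `reasoning_control: true` for the action to succeed.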

tools/server/server-common.cpp

Lines changed: 1 addition & 0 deletions
@@ -1132,6 +1132,7 @@ json oaicompat_chat_params_parse(
             llama_params["reasoning_budget_start_tag"] = chat_params.thinking_start_tag;
             llama_params["reasoning_budget_end_tag"] = chat_params.thinking_end_tag;
             llama_params["reasoning_budget_message"] = opt.reasoning_budget_message;
+            llama_params["reasoning_control"] = json_value(body, "reasoning_control", false);
         }
     }

tools/server/server-context.cpp

Lines changed: 82 additions & 0 deletions
@@ -1263,6 +1263,20 @@ struct server_context_impl {
         return nullptr;
     }

+    server_slot * get_slot_by_cmpl_id(const std::string & cmpl_id) {
+        if (cmpl_id.empty()) {
+            return nullptr;
+        }
+
+        for (server_slot & slot : slots) {
+            if (slot.is_processing() && slot.task && slot.task->params.oaicompat_cmpl_id == cmpl_id) {
+                return &slot;
+            }
+        }
+
+        return nullptr;
+    }
+
     server_slot * get_available_slot(const server_task & task) {
         server_slot * ret = nullptr;

@@ -2114,6 +2128,37 @@ struct server_context_impl {
                         }
                     }
                 } break;
+            case SERVER_TASK_TYPE_CONTROL:
+                {
+                    auto res = std::make_unique<server_task_result_control>();
+                    res->id = task.id;
+
+                    server_slot * slot = get_slot_by_cmpl_id(task.params.control_cmpl_id);
+                    if (slot == nullptr) {
+                        res->success = false;
+                        res->message = "no active completion for this id";
+                        queue_results.send(std::move(res));
+                        break;
+                    }
+
+                    if (task.params.control_action == "reasoning_end") {
+                        // the budget sampler only exists when reasoning control was armed
+                        if (!slot->task->params.sampling.reasoning_control) {
+                            res->success = false;
+                            res->message = "reasoning control not enabled for this completion";
+                            queue_results.send(std::move(res));
+                            break;
+                        }
+                        // act on the live slot mid generation, never defer
+                        common_sampler_reasoning_budget_force(slot->smpl.get());
+                        res->success = true;
+                    } else {
+                        res->success = false;
+                        res->message = "unknown control action";
+                    }
+
+                    queue_results.send(std::move(res));
+                } break;
             case SERVER_TASK_TYPE_NEXT_RESPONSE:
                 {
                     // do nothing
@@ -4266,6 +4311,43 @@ void server_routes::init_routes() {
             TASK_RESPONSE_TYPE_OAI_CHAT);
     };

+    this->post_control = [this](const server_http_req & req) {
+        auto res = create_response();
+        const json body = json::parse(req.body);
+
+        const std::string cmpl_id = json_value(body, "id", std::string());
+        const std::string action = json_value(body, "action", std::string());
+        if (cmpl_id.empty()) {
+            res->error(format_error_response("missing completion id", ERROR_TYPE_INVALID_REQUEST));
+            return res;
+        }
+        if (action != "reasoning_end") {
+            res->error(format_error_response("unknown control action", ERROR_TYPE_INVALID_REQUEST));
+            return res;
+        }
+
+        auto & rd = res->rd;
+        {
+            server_task task(SERVER_TASK_TYPE_CONTROL);
+            task.id = rd.get_new_id();
+            task.params.control_cmpl_id = cmpl_id;
+            task.params.control_action = action;
+            rd.post_task(std::move(task));
+        }
+
+        auto result = rd.next(req.should_stop);
+        if (!result) {
+            GGML_ASSERT(req.should_stop());
+            return res;
+        }
+        if (result->is_error()) {
+            res->error(result->to_json());
+            return res;
+        }
+        res->ok(result->to_json());
+        return res;
+    };
+
     this->post_responses_oai = [this](const server_http_req & req) {
         auto res = create_response();
         std::vector<raw_buffer> files;
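
The lookup and dispatch added in this file can be modeled in isolation to show why targeting the completion id sidesteps the slot-reassignment race. What follows is a simplified sketch with hypothetical types (`slot_sketch`, `control_result_sketch`) and a boolean standing in for the call to `common_sampler_reasoning_budget_force`. The real `server_slot` and task queue carry far more state.

```cpp
#include <string>
#include <vector>

// Hypothetical, heavily simplified slot.
struct slot_sketch {
    bool        processing        = false;
    std::string completion_id;
    bool        reasoning_control = false;
    bool        reasoning_forced  = false; // stands in for the sampler call
};

struct control_result_sketch {
    bool        success = false;
    std::string message; // only set when success is false
};

// Find the slot currently generating the given completion, if any.
inline slot_sketch * find_slot(std::vector<slot_sketch> & slots, const std::string & completion_id) {
    if (completion_id.empty()) {
        return nullptr;
    }
    for (slot_sketch & slot : slots) {
        if (slot.processing && slot.completion_id == completion_id) {
            return &slot;
        }
    }
    return nullptr;
}

// Same decision order as the CONTROL case: slot lookup, action, armed flag.
inline control_result_sketch handle_control(std::vector<slot_sketch> & slots,
                                            const std::string & completion_id,
                                            const std::string & action) {
    control_result_sketch res;
    slot_sketch * slot = find_slot(slots, completion_id);
    if (slot == nullptr) {
        res.message = "no active completion for this id";
        return res;
    }
    if (action != "reasoning_end") {
        res.message = "unknown control action";
        return res;
    }
    if (!slot->reasoning_control) {
        res.message = "reasoning control not enabled for this completion";
        return res;
    }
    slot->reasoning_forced = true;
    res.success = true;
    return res;
}
```

Because the match is on the completion id and on the slot still processing, a request that arrives after the completion has finished, or after the slot was handed to another request, finds no slot and fails cleanly without touching unrelated work.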

tools/server/server-context.h

Lines changed: 1 addition & 0 deletions
@@ -110,6 +110,7 @@ struct server_routes {
     server_http_context::handler_t post_completions;
     server_http_context::handler_t post_completions_oai;
     server_http_context::handler_t post_chat_completions;
+    server_http_context::handler_t post_control;
     server_http_context::handler_t post_responses_oai;
     server_http_context::handler_t post_transcriptions_oai;
     server_http_context::handler_t post_anthropic_messages;

tools/server/server-task.cpp

Lines changed: 1 addition & 0 deletions
@@ -499,6 +499,7 @@ task_params server_task::params_from_json_cmpl(
         const auto end_tag = json_value(data, "reasoning_budget_end_tag", std::string());
         const auto message = json_value(data, "reasoning_budget_message", std::string());
         params.sampling.reasoning_budget_tokens = budget;
+        params.sampling.reasoning_control = json_value(data, "reasoning_control", false);

         if (!start_tag.empty()) {
             params.sampling.reasoning_budget_start = common_tokenize(vocab, start_tag, false, true);

tools/server/server-task.h

Lines changed: 18 additions & 0 deletions
@@ -19,6 +19,7 @@ enum server_task_type {
     SERVER_TASK_TYPE_RERANK,
     SERVER_TASK_TYPE_INFILL,
     SERVER_TASK_TYPE_CANCEL,
+    SERVER_TASK_TYPE_CONTROL,
     SERVER_TASK_TYPE_NEXT_RESPONSE,
     SERVER_TASK_TYPE_METRICS,
     SERVER_TASK_TYPE_SLOT_SAVE,
@@ -84,6 +85,10 @@ struct task_params {
     std::string oaicompat_model;
     std::string oaicompat_cmpl_id;

+    // realtime control (SERVER_TASK_TYPE_CONTROL)
+    std::string control_action;
+    std::string control_cmpl_id;
+
     // per-request parameters for chat parsing
     common_chat_parser_params chat_parser_params;

@@ -551,6 +556,19 @@ struct server_task_result_slot_erase : server_task_result {
     virtual json to_json() override;
 };

+struct server_task_result_control : server_task_result {
+    bool success = false;
+    std::string message; // optional detail when success is false
+
+    virtual json to_json() override {
+        json out = json { { "success", success } };
+        if (!message.empty()) {
+            out["message"] = message;
+        }
+        return out;
+    }
+};
+
 struct server_task_result_get_lora : server_task_result {
     struct lora {
         common_adapter_lora_info info;

tools/server/server.cpp

Lines changed: 2 additions & 0 deletions
@@ -149,6 +149,7 @@ int llama_server(int argc, char ** argv) {
         routes.post_completions = models_routes->proxy_post;
         routes.post_completions_oai = models_routes->proxy_post;
         routes.post_chat_completions = models_routes->proxy_post;
+        routes.post_control = models_routes->proxy_post;
         routes.post_responses_oai = models_routes->proxy_post;
         routes.post_transcriptions_oai = models_routes->proxy_post;
         routes.post_anthropic_messages = models_routes->proxy_post;
@@ -185,6 +186,7 @@ int llama_server(int argc, char ** argv) {
     ctx_http.post("/v1/completions", ex_wrapper(routes.post_completions_oai));
     ctx_http.post("/chat/completions", ex_wrapper(routes.post_chat_completions));
     ctx_http.post("/v1/chat/completions", ex_wrapper(routes.post_chat_completions));
+    ctx_http.post("/v1/chat/completions/control", ex_wrapper(routes.post_control));
     ctx_http.post("/v1/responses", ex_wrapper(routes.post_responses_oai));
     ctx_http.post("/responses", ex_wrapper(routes.post_responses_oai));
     ctx_http.post("/v1/audio/transcriptions", ex_wrapper(routes.post_transcriptions_oai));

tools/ui/src/lib/components/app/chat/ChatForm/ChatForm.svelte

Lines changed: 1 addition & 0 deletions
@@ -541,6 +541,7 @@
 canSend={canSubmit}
 {disabled}
 {isLoading}
+isReasoning={chatStore.isReasoning}
 {isRecording}
 {showAddButton}
 {showModelSelector}
