You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
feat(reasoning): honor per-request reasoning_effort on chat completions (#10082)
The OpenAI `reasoning_effort` field only reached the prompt template; it
never toggled the backend's thinking. Map it onto
ReasoningConfig.DisableReasoning (which becomes the enable_thinking gRPC
metadata) in the request merge, so reasoning_effort="none" disables
reasoning per request: the use case from #10072 (run a single Qwen3-style
model and turn reasoning off for low-latency tasks while keeping it on
for others).
Effort levels (minimal/low/medium/high) enable thinking unless the model
config explicitly disabled it (reasoning.disable: true wins and is never
re-enabled by a request); "none" always disables.
Closes#10072
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Copy file name to clipboardExpand all lines: docs/content/advanced/model-configuration.md
+4-1Lines changed: 4 additions & 1 deletion
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -412,7 +412,10 @@ These load-time options control how the backend parses `<think>` reasoning block
412
412
| `prefill_assistant` | bool | `true` | When `false`, the trailing assistant message is not pre-filled by the chat template. |
413
413
414
414
{{% notice note %}}
415
-
This is the load-time reasoning configuration. The orthogonal per-request `enable_thinking` chat-template kwarg (set via the YAML `reasoning.disable` field) toggles thinking on/off per call without restarting the model.
415
+
This is the load-time reasoning configuration. The orthogonal per-request `enable_thinking` chat-template kwarg toggles thinking on/off per call without restarting the model. It can be driven either by the YAML `reasoning.disable` field (model default) or per request via the OpenAI `reasoning_effort` field on `/v1/chat/completions`:
416
+
417
+
- `reasoning_effort: "none"` disables thinking for that request (`enable_thinking=false`) - useful to run a single reasoning model like Qwen3 for low-latency tasks while still enabling reasoning on other requests.
418
+
- `reasoning_effort: "minimal"| "low" | "medium" | "high"` enables thinking, unless the model config explicitly set `reasoning.disable: true` (an operator's explicit disable wins and is never re-enabled by a request).
0 commit comments