
Commit 4a2cc64

localai-bot and mudler authored
feat(reasoning): honor per-request reasoning_effort on chat completions (#10082)
The OpenAI `reasoning_effort` field only reached the prompt template; it never toggled the backend's thinking. Map it onto ReasoningConfig.DisableReasoning (which becomes the enable_thinking gRPC metadata) in the request merge, so reasoning_effort="none" disables reasoning per request: the use case from #10072 (run a single Qwen3-style model and turn reasoning off for low-latency tasks while keeping it on for others).

Effort levels (minimal/low/medium/high) enable thinking unless the model config explicitly disabled it (reasoning.disable: true wins and is never re-enabled by a request); "none" always disables.

Closes #10072

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
1 parent 4647770 commit 4a2cc64

4 files changed

Lines changed: 191 additions & 1 deletion


core/backend/options_internal_test.go

Lines changed: 33 additions & 0 deletions
@@ -4,6 +4,7 @@ import (
 	"encoding/json"
 
 	"github.com/mudler/LocalAI/core/config"
+	"github.com/mudler/LocalAI/pkg/reasoning"
 
 	. "github.com/onsi/ginkgo/v2"
 	. "github.com/onsi/gomega"
@@ -42,3 +43,35 @@ var _ = Describe("grpcModelOpts EngineArgs", func() {
 		Expect(opts.EngineArgs).To(BeEmpty())
 	})
 })
+
+// Guards the DisableReasoning -> enable_thinking metadata conversion that the
+// per-request reasoning_effort feature (issue #10072) relies on: the request
+// merge sets ReasoningConfig.DisableReasoning, and gRPCPredictOpts is where it
+// becomes the gRPC PredictOptions.Metadata the backend reads.
+var _ = Describe("gRPCPredictOpts enable_thinking metadata", func() {
+	// withReasoning builds a fully-defaulted config (gRPCPredictOpts dereferences
+	// many pointer fields) and overrides only the reasoning toggle.
+	withReasoning := func(disable *bool) config.ModelConfig {
+		cfg := config.ModelConfig{}
+		cfg.SetDefaults()
+		cfg.ReasoningConfig = reasoning.Config{DisableReasoning: disable}
+		return cfg
+	}
+	disabled := true
+	enabled := false
+
+	It("emits enable_thinking=false when reasoning is disabled", func() {
+		opts := gRPCPredictOpts(withReasoning(&disabled), "/tmp/models")
+		Expect(opts.Metadata).To(HaveKeyWithValue("enable_thinking", "false"))
+	})
+
+	It("emits enable_thinking=true when reasoning is enabled", func() {
+		opts := gRPCPredictOpts(withReasoning(&enabled), "/tmp/models")
+		Expect(opts.Metadata).To(HaveKeyWithValue("enable_thinking", "true"))
+	})
+
+	It("omits enable_thinking when reasoning is unset", func() {
+		opts := gRPCPredictOpts(withReasoning(nil), "/tmp/models")
+		Expect(opts.Metadata).ToNot(HaveKey("enable_thinking"))
+	})
+})
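
The three cases asserted in this test file amount to a tri-state conversion: a nil pointer means "unset" and emits no key, while a non-nil pointer is inverted into the `enable_thinking` string. The sketch below restates that behavior in isolation. `thinkingMetadata` is a hypothetical helper written for illustration; it is not the body of `gRPCPredictOpts`, which this commit page does not show.

```go
package main

import "fmt"

// thinkingMetadata is a hypothetical stand-in for the conversion the tests
// above guard. It is an illustration, not LocalAI's actual implementation.
//
// disableReasoning == nil  -> no enable_thinking key (backend default applies)
// *disableReasoning true   -> enable_thinking = "false"
// *disableReasoning false  -> enable_thinking = "true"
func thinkingMetadata(disableReasoning *bool) map[string]string {
	md := map[string]string{}
	if disableReasoning == nil {
		return md
	}
	if *disableReasoning {
		md["enable_thinking"] = "false"
	} else {
		md["enable_thinking"] = "true"
	}
	return md
}

func main() {
	disabled, notDisabled := true, false
	fmt.Println(thinkingMetadata(&disabled))    // map[enable_thinking:false]
	fmt.Println(thinkingMetadata(&notDisabled)) // map[enable_thinking:true]
	fmt.Println(thinkingMetadata(nil))          // map[]
}
```

Using a `*bool` rather than a plain `bool` is what lets the "unset" case stay distinct from an explicit `false`, which is why the tests check for the absence of the key separately.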

core/http/middleware/request.go

Lines changed: 20 additions & 0 deletions
@@ -310,6 +310,26 @@ func mergeOpenAIRequestAndModelConfig(config *config.ModelConfig, input *schema.
 		config.Temperature = input.Temperature
 	}
 
+	// Map the per-request reasoning_effort onto the reasoning toggle the
+	// backend reads (enable_thinking metadata, set in gRPCPredictOpts).
+	// "none" disables thinking for this request - the use case from #10072,
+	// running a single Qwen3-style model and turning reasoning off per
+	// request. Any explicit effort level enables thinking, UNLESS the model
+	// config explicitly disabled it (DisableReasoning==true wins): an
+	// operator who deliberately turned reasoning off should not be overridden
+	// by a request. A value of "none" always disables, since that never
+	// conflicts with a config that also disables.
+	switch strings.ToLower(input.ReasoningEffort) {
+	case "none":
+		disable := true
+		config.ReasoningConfig.DisableReasoning = &disable
+	case "minimal", "low", "medium", "high":
+		if config.ReasoningConfig.DisableReasoning == nil || !*config.ReasoningConfig.DisableReasoning {
+			enable := false
+			config.ReasoningConfig.DisableReasoning = &enable
+		}
+	}
+
 	// Collapse the modern max_completion_tokens alias into the
 	// legacy Maxtokens field so downstream code reads exactly one.
 	// MaxCompletionTokens wins on conflict — it's the canonical

core/http/middleware/request_test.go

Lines changed: 134 additions & 0 deletions
@@ -597,3 +597,137 @@ var _ = Describe("SetModelAndConfig tool_choice parsing (chat completions)", fun
 		})
 	})
 })
+
+// These tests cover the per-request reasoning_effort -> enable_thinking mapping.
+// The merge lives in mergeOpenAIRequestAndModelConfig (called from
+// SetOpenAIRequest), so they drive the full middleware chain like the
+// production /v1/chat/completions route does. The block builds its own app per
+// test so the model config can be varied (some cases need reasoning.disable set
+// in the model YAML to assert that an explicit config disable wins).
+//
+// Mapping under test (issue #10072):
+//   - reasoning_effort=none -> DisableReasoning=true
+//   - reasoning_effort=low/medium/high -> DisableReasoning=false, UNLESS the
+//     model config explicitly set true
+//   - empty / unrecognized -> no change
+var _ = Describe("SetModelAndConfig reasoning_effort parsing (chat completions)", func() {
+	var modelDir string
+
+	BeforeEach(func() {
+		var err error
+		modelDir, err = os.MkdirTemp("", "localai-test-models-*")
+		Expect(err).ToNot(HaveOccurred())
+	})
+
+	AfterEach(func() {
+		_ = os.RemoveAll(modelDir)
+	})
+
+	// buildApp writes a model config with the given YAML body and returns an app
+	// plus a pointer to the captured per-request config.
+	buildApp := func(cfgYAML string) (*echo.Echo, **config.ModelConfig) {
+		Expect(os.WriteFile(filepath.Join(modelDir, "test-model.yaml"), []byte(cfgYAML), 0644)).To(Succeed())
+
+		ss := &system.SystemState{Model: system.Model{ModelsPath: modelDir}}
+		appConfig := config.NewApplicationConfig()
+		appConfig.SystemState = ss
+		mcl := config.NewModelConfigLoader(modelDir)
+		ml := model.NewModelLoader(ss)
+		re := NewRequestExtractor(mcl, ml, appConfig)
+
+		captured := new(*config.ModelConfig)
+		app := echo.New()
+		app.POST("/v1/chat/completions",
+			func(c echo.Context) error {
+				if cfg, ok := c.Get(CONTEXT_LOCALS_KEY_MODEL_CONFIG).(*config.ModelConfig); ok {
+					*captured = cfg
+				}
+				return c.String(http.StatusOK, "ok")
+			},
+			re.SetModelAndConfig(func() schema.LocalAIRequest { return new(schema.OpenAIRequest) }),
+			func(next echo.HandlerFunc) echo.HandlerFunc {
+				return func(c echo.Context) error {
+					if err := re.SetOpenAIRequest(c); err != nil {
+						return err
+					}
+					return next(c)
+				}
+			},
+		)
+		return app, captured
+	}
+
+	chatReq := func(effort string) string {
+		return `{"model":"test-model",` +
+			`"messages":[{"role":"user","content":"hi"}],` +
+			`"reasoning_effort":` + effort + `}`
+	}
+
+	plainCfg := "name: test-model\nbackend: llama-cpp\n"
+
+	It("disables thinking for reasoning_effort=none", func() {
+		app, captured := buildApp(plainCfg)
+		rec := postJSON(app, "/v1/chat/completions", chatReq(`"none"`))
+
+		Expect(rec.Code).To(Equal(http.StatusOK))
+		Expect(*captured).ToNot(BeNil())
+		Expect((*captured).ReasoningConfig.DisableReasoning).ToNot(BeNil())
+		Expect(*(*captured).ReasoningConfig.DisableReasoning).To(BeTrue())
+	})
+
+	It("enables thinking for reasoning_effort=high when config is unset", func() {
+		app, captured := buildApp(plainCfg)
+		rec := postJSON(app, "/v1/chat/completions", chatReq(`"high"`))
+
+		Expect(rec.Code).To(Equal(http.StatusOK))
+		Expect(*captured).ToNot(BeNil())
+		Expect((*captured).ReasoningConfig.DisableReasoning).ToNot(BeNil())
+		Expect(*(*captured).ReasoningConfig.DisableReasoning).To(BeFalse())
+	})
+
+	It("enables thinking for reasoning_effort=high when config explicitly set false", func() {
+		app, captured := buildApp(plainCfg + "reasoning:\n disable: false\n")
+		rec := postJSON(app, "/v1/chat/completions", chatReq(`"high"`))
+
+		Expect(rec.Code).To(Equal(http.StatusOK))
+		Expect(*captured).ToNot(BeNil())
+		Expect((*captured).ReasoningConfig.DisableReasoning).ToNot(BeNil())
+		Expect(*(*captured).ReasoningConfig.DisableReasoning).To(BeFalse())
+	})
+
+	It("config wins: reasoning_effort=high cannot re-enable when config explicitly disabled", func() {
+		app, captured := buildApp(plainCfg + "reasoning:\n disable: true\n")
+		rec := postJSON(app, "/v1/chat/completions", chatReq(`"high"`))
+
+		Expect(rec.Code).To(Equal(http.StatusOK))
+		Expect(*captured).ToNot(BeNil())
+		Expect((*captured).ReasoningConfig.DisableReasoning).ToNot(BeNil())
+		Expect(*(*captured).ReasoningConfig.DisableReasoning).To(BeTrue())
+	})
+
+	It("is a no-op when reasoning_effort is empty", func() {
+		app, captured := buildApp(plainCfg)
+		rec := postJSON(app, "/v1/chat/completions",
+			`{"model":"test-model","messages":[{"role":"user","content":"hi"}]}`)
+
+		Expect(rec.Code).To(Equal(http.StatusOK))
+		Expect(*captured).ToNot(BeNil())
+		Expect((*captured).ReasoningConfig.DisableReasoning).To(BeNil())
+	})
+
+	It("is case-insensitive (None disables, HIGH enables)", func() {
+		app, captured := buildApp(plainCfg)
+		rec := postJSON(app, "/v1/chat/completions", chatReq(`"None"`))
+		Expect(rec.Code).To(Equal(http.StatusOK))
+		Expect(*captured).ToNot(BeNil())
+		Expect((*captured).ReasoningConfig.DisableReasoning).ToNot(BeNil())
+		Expect(*(*captured).ReasoningConfig.DisableReasoning).To(BeTrue())
+
+		app2, captured2 := buildApp(plainCfg)
+		rec2 := postJSON(app2, "/v1/chat/completions", chatReq(`"HIGH"`))
+		Expect(rec2.Code).To(Equal(http.StatusOK))
+		Expect(*captured2).ToNot(BeNil())
+		Expect((*captured2).ReasoningConfig.DisableReasoning).ToNot(BeNil())
+		Expect(*(*captured2).ReasoningConfig.DisableReasoning).To(BeFalse())
+	})
+})

docs/content/advanced/model-configuration.md

Lines changed: 4 additions & 1 deletion
@@ -412,7 +412,10 @@ These load-time options control how the backend parses `<think>` reasoning block
 | `prefill_assistant` | bool | `true` | When `false`, the trailing assistant message is not pre-filled by the chat template. |
 
 {{% notice note %}}
-This is the load-time reasoning configuration. The orthogonal per-request `enable_thinking` chat-template kwarg (set via the YAML `reasoning.disable` field) toggles thinking on/off per call without restarting the model.
+This is the load-time reasoning configuration. The orthogonal per-request `enable_thinking` chat-template kwarg toggles thinking on/off per call without restarting the model. It can be driven either by the YAML `reasoning.disable` field (model default) or per request via the OpenAI `reasoning_effort` field on `/v1/chat/completions`:
+
+- `reasoning_effort: "none"` disables thinking for that request (`enable_thinking=false`) - useful to run a single reasoning model like Qwen3 for low-latency tasks while still enabling reasoning on other requests.
+- `reasoning_effort: "minimal" | "low" | "medium" | "high"` enables thinking, unless the model config explicitly set `reasoning.disable: true` (an operator's explicit disable wins and is never re-enabled by a request).
 {{% /notice %}}
 
 ### Multimodal Backend Options
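
From a client's point of view, the documented behavior comes down to one extra JSON field on the chat completion body. The sketch below builds such bodies with the standard library only, so it runs without a server. The struct types, the `buildBody` helper, and the model name `my-qwen3-model` are assumptions made for illustration; only the `model`, `messages`, and `reasoning_effort` field names come from the requests used in this commit's tests.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// message and chatRequest are hypothetical client-side types that carry only
// the fields used in this commit's test requests.
type message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model    string    `json:"model"`
	Messages []message `json:"messages"`
	// omitempty leaves the field out entirely, which the server treats as
	// "no change" to the model's configured reasoning toggle.
	ReasoningEffort string `json:"reasoning_effort,omitempty"`
}

// buildBody returns the JSON body for a single-message chat completion.
func buildBody(model, prompt, effort string) ([]byte, error) {
	return json.Marshal(chatRequest{
		Model:           model,
		Messages:        []message{{Role: "user", Content: prompt}},
		ReasoningEffort: effort,
	})
}

func main() {
	// "none" for a low-latency call, "high" for a reasoning call, "" to defer
	// to the model config. The model name is a placeholder.
	for _, effort := range []string{"none", "high", ""} {
		body, err := buildBody("my-qwen3-model", "hi", effort)
		if err != nil {
			panic(err)
		}
		fmt.Println(string(body))
	}
}
```

Each printed body could then be sent as a POST to `/v1/chat/completions` on a LocalAI instance. Whether thinking is actually enabled for the `high` request still depends on the model's `reasoning.disable` setting, as the note above describes.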
