common: skip reasoning budget sampler when no budget is requested (ggml-org#21870)

berkidem · mengqin · commit a9ed7967d84c · 2026-04-20T05:30:04.000-07:00
* common: skip reasoning budget sampler when no budget is requested

After I added thinking_start_tag / thinking_end_tag for gemma4 in ggml-org#21697, the reasoning budget sampler gets unconditionally created even when no budget is configured (the default -1). The same applies to kimi_k2, lfm2, lfm2_5, and ministral_3 which also set these tags.

The budget gets converted to INT_MAX, so the sampler never actually forces any tokens but still runs per-token checks (start tag matching in IDLE state, token-to-piece conversion + UTF-8 checks in COUNTING state).

More importantly, the mere existence of the sampler (non-null rbudget) disables backend sampling. Backend sampling lets the GPU select tokens directly, avoiding a full logits transfer from GPU to CPU every token. This could explain the 30% speed regression reported in ggml-org#21784 (98 t/s to 70 t/s on Vulkan).

So I added a reasoning_budget_tokens >= 0 check to the sampler creation condition. When the budget is unlimited, the sampler is not created, backend sampling stays enabled, and no per-token overhead is added. When a budget is explicitly set (0, 128, 1024, etc.), the sampler is created and works as before.

* common: preserve rbudget when grammar is lazy

Following up on the review feedback on ggml-org#21870: keep the reasoning budget sampler when grammar_lazy is true, so the thinking-block grammar suppression from ggml-org#20970 still works when tools are in use. This way, we only skip the sampler when both no budget is set AND grammar is not lazy.

diff --git a/common/sampling.cpp b/common/sampling.cpp
@@ -287,8 +287,8 @@ struct common_sampler * common_sampler_init(const struct llama_model * model, st
         }
     }
 
-    // reasoning budget sampler
-    if (!params.reasoning_budget_start.empty() && !params.reasoning_budget_end.empty()) {
+    // reasoning budget sampler (skip when budget is unlimited unless a lazy grammar is active, which needs rbudget for thinking-block suppression)
+    if (!params.reasoning_budget_start.empty() && !params.reasoning_budget_end.empty() && (params.grammar_lazy || params.reasoning_budget_tokens >= 0)) {
         rbudget = common_reasoning_budget_init(
             vocab,
             params.reasoning_budget_start,
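
For reviewers who want to check the new condition in isolation, here is a minimal sketch of the same predicate. The `budget_params` struct and the `should_create_rbudget` helper are hypothetical names made up for illustration; they are not part of `common/sampling.cpp`. Only the four field names and the boolean expression are taken from the diff above.

```cpp
#include <string>

// Hypothetical stand-in holding only the fields that the condition reads.
// The real code reads these from the sampling params passed to common_sampler_init.
struct budget_params {
    std::string reasoning_budget_start;      // thinking start tag, empty if the template sets none
    std::string reasoning_budget_end;        // thinking end tag, empty if the template sets none
    int         reasoning_budget_tokens = -1; // -1 means no budget requested (the default)
    bool        grammar_lazy = false;         // true when a lazy grammar is in use (e.g. tools)
};

// Hypothetical helper mirroring the condition introduced by this change.
static bool should_create_rbudget(const budget_params & p) {
    const bool has_tags   = !p.reasoning_budget_start.empty() && !p.reasoning_budget_end.empty();
    const bool has_budget = p.reasoning_budget_tokens >= 0; // 0 is an explicit budget, not "unlimited"

    // Keep the sampler for lazy grammars even without a budget,
    // since thinking-block grammar suppression relies on it.
    return has_tags && (p.grammar_lazy || has_budget);
}
```

Under this sketch, a model with both tags set and the default budget of -1 gets no sampler unless `grammar_lazy` is true; an explicit budget of 0 or more always creates it; and a model without both tags never gets one, as before.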
