Skip to content

Commit a9ed796

Browse files
berkidemmengqin
authored andcommitted
common: skip reasoning budget sampler when no budget is requested (ggml-org#21870)
* common: skip reasoning budget sampler when no budget is requested After I added thinking_start_tag / thinking_end_tag for gemma4 in ggml-org#21697, the reasoning budget sampler gets unconditionally created even when no budget is configured (the default -1). The same applies to kimi_k2, lfm2, lfm2_5, and ministral_3 which also set these tags. The budget gets converted to INT_MAX, so the sampler never actually forces any tokens but still runs per-token checks (start tag matching in IDLE state, token-to-piece conversion + UTF-8 checks in COUNTING state). More importantly, the mere existence of the sampler (non-null rbudget) disables backend sampling. Backend sampling lets the GPU select tokens directly, avoiding a full logits transfer from GPU to CPU every token. This could explain the 30% speed regression reported in ggml-org#21784 (98 t/s to 70 t/s on Vulkan). So I added a reasoning_budget_tokens >= 0 check to the sampler creation condition. When the budget is unlimited, the sampler is not created, backend sampling stays enabled, and no per-token overhead is added. When a budget is explicitly set (0, 128, 1024, etc.), the sampler is created and works as before. * common: preserve rbudget when grammar is lazy Following up on the review feedback on ggml-org#21870: keep the reasoning budget sampler when grammar_lazy is true, so the thinking-block grammar suppression from ggml-org#20970 still works when tools are in use. This way, we only skip the sampler when both no budget is set AND grammar is not lazy.
1 parent 79aa8e6 commit a9ed796

1 file changed

Lines changed: 2 additions & 2 deletions

File tree

common/sampling.cpp

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -287,8 +287,8 @@ struct common_sampler * common_sampler_init(const struct llama_model * model, st
287287
}
288288
}
289289

290-
// reasoning budget sampler
291-
if (!params.reasoning_budget_start.empty() && !params.reasoning_budget_end.empty()) {
290+
// reasoning budget sampler (skip when budget is unlimited unless a lazy grammar is active, which needs rbudget for thinking-block suppression)
291+
if (!params.reasoning_budget_start.empty() && !params.reasoning_budget_end.empty() && (params.grammar_lazy || params.reasoning_budget_tokens >= 0)) {
292292
rbudget = common_reasoning_budget_init(
293293
vocab,
294294
params.reasoning_budget_start,

0 commit comments

Comments
 (0)