feat(Inference): disable extended-thinking budget for fast/snap-judgment calls #1348
Open
caraka wants to merge 1 commit into
…ent calls

Every `claude --print` subprocess silently spends an extended-thinking budget, adding ~10s+ of latency to snap-judgment calls regardless of model tier (measured ~14.9s vs ~3.2s on a classification payload). Add a `thinking?: boolean` option to `inference()` and a `--no-thinking` CLI flag that set `MAX_THINKING_TOKENS=0`; `level: 'fast'` implies it. Other levels unchanged unless they opt out.
Problem
Every `claude --print` subprocess spawned by `Inference.ts` silently spends an extended-thinking budget. On snap-judgment payloads (classification, tab titles, session naming) this adds ~10s+ of pure latency with negligible quality benefit, and the thinking budget, not the model tier, dominates the cost.

Measured on a classification-shaped payload, across three configurations:

- `claude --print --model haiku` (default)
- `claude --print --model haiku` with `MAX_THINKING_TOKENS=0`
- `--model sonnet`, `MAX_THINKING_TOKENS=0`

Downgrading the model barely helps; disabling the thinking budget is what collapses the latency.
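A comparison like the one above can be timed with a small harness. The following is a minimal sketch, not the script used for this PR: `timeCommand` and `TimedRun` are hypothetical names, and it assumes a Node.js runtime where `spawnSync` is available.

```typescript
import { spawnSync } from "node:child_process";

// Result of one timed subprocess run (hypothetical shape, for illustration only).
interface TimedRun {
  ms: number;            // wall-clock duration in milliseconds
  status: number | null; // exit code, or null if the process could not be run
}

// Run a command once, layering extra environment variables on top of the
// current environment, and report how long it took.
function timeCommand(
  command: string,
  args: string[],
  extraEnv: Record<string, string> = {},
): TimedRun {
  const start = Date.now();
  const result = spawnSync(command, args, {
    env: { ...process.env, ...extraEnv },
    encoding: "utf8",
  });
  return { ms: Date.now() - start, status: result.status };
}

// Hypothetical usage; `prompt` is a placeholder for the classification payload,
// and passing it as a positional argument is an assumption:
// const baseline = timeCommand("claude", ["--print", "--model", "haiku", prompt]);
// const noThinking = timeCommand(
//   "claude",
//   ["--print", "--model", "haiku", prompt],
//   { MAX_THINKING_TOKENS: "0" },
// );
```

A single run per configuration is noisy; repeating each run a few times and comparing medians would give a steadier picture.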
Change
Adds a `thinking?: boolean` option to `inference()` and a `--no-thinking` CLI flag. When `thinking: false` (or `--no-thinking`) is passed, the call sets `MAX_THINKING_TOKENS=0` in the subprocess env, disabling the extended-thinking budget.

`level: 'fast'` implies it: `fast` is for snap tasks (quick generation, basic classification) by definition, so it no longer pays the thinking tax. `standard` and `smart` are unchanged unless they opt out explicitly.

Single file, +22/−2, no new dependencies. Parses/transpiles clean; CLI usage string updated.
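The rule described above can be sketched roughly as follows. This is an illustration, not the actual diff: `thinkingEnabled`, `buildSubprocessEnv`, and `ThinkingOptions` are hypothetical names, and letting an explicit `thinking: true` override the `fast` default is an assumption the description does not confirm.

```typescript
type InferenceLevel = "fast" | "standard" | "smart";

// Hypothetical subset of the inference() options relevant to this change.
interface ThinkingOptions {
  level: InferenceLevel;
  thinking?: boolean; // undefined means "use the level's default"
}

// Decide whether the extended-thinking budget stays on.
// Assumption: an explicit `thinking` value always wins over the level default.
function thinkingEnabled(opts: ThinkingOptions): boolean {
  if (opts.thinking !== undefined) return opts.thinking;
  return opts.level !== "fast"; // `fast` disables thinking by default
}

// Build the subprocess environment, setting MAX_THINKING_TOKENS=0 only when
// thinking is disabled; otherwise the base environment is passed through as-is.
function buildSubprocessEnv(
  opts: ThinkingOptions,
  baseEnv: Record<string, string | undefined>,
): Record<string, string | undefined> {
  const env = { ...baseEnv };
  if (!thinkingEnabled(opts)) {
    env.MAX_THINKING_TOKENS = "0";
  }
  return env;
}
```

Copying `baseEnv` rather than mutating it keeps the override scoped to the one subprocess, so later calls at other levels are unaffected.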
Reproduction
Note
The one behavior change is the `fast`-level default (`fast` now disables thinking). Happy to make that opt-in instead if you'd prefer zero default change; it's a one-line tweak.