feat(Inference): disable extended-thinking budget for fast/snap-judgment calls #1348

Open
caraka wants to merge 1 commit into danielmiessler:main from caraka:feat/inference-disable-thinking-for-fast-calls

caraka commented Jun 13, 2026

Problem

Every `claude --print` subprocess spawned by Inference.ts silently spends an extended-thinking budget. On snap-judgment payloads (classification, tab titles, session naming) this adds ~10s or more of latency with negligible quality benefit, and the thinking budget, not the model tier, accounts for most of that latency.

Measured on a classification-shaped payload:

| call | latency |
| --- | --- |
| `claude --print --model haiku` (default) | ~14.9s |
| same call, `MAX_THINKING_TOKENS=0` | ~3.2s |
| `--model sonnet`, `MAX_THINKING_TOKENS=0` | ~3.6s |

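Downgrading the model barely helps: with thinking disabled, sonnet and haiku land within ~0.4s of each other. Disabling the thinking budget is what collapses the latency (~14.9s to ~3.2s on haiku).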

Change

  • Add a thinking?: boolean option to inference() and a --no-thinking CLI flag. When thinking: false (or --no-thinking), the call sets MAX_THINKING_TOKENS=0 in the subprocess env, disabling the extended-thinking budget.
  • level: 'fast' implies it — fast is for snap tasks (quick generation, basic classification) by definition, so it no longer pays the thinking tax. standard and smart are unchanged unless they opt out explicitly.
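
The rule described in the two bullets above can be sketched roughly as follows. This is a minimal illustration, not the actual diff: `InferenceOptions`, `shouldDisableThinking`, `buildSubprocessEnv`, and `applyCliFlags` are hypothetical names, and letting an explicit `thinking: true` override the `fast` default is an assumption the PR text does not confirm.

```typescript
// Hypothetical sketch; names and structure are illustrative, not taken from Inference.ts.
type InferenceLevel = "fast" | "standard" | "smart";

interface InferenceOptions {
  level?: InferenceLevel;
  thinking?: boolean;
}

// Decide whether the subprocess should run with the thinking budget disabled.
function shouldDisableThinking(options: InferenceOptions): boolean {
  // Assumption: an explicit `thinking` value always wins, so `thinking: true`
  // would re-enable thinking even on a fast call.
  if (options.thinking !== undefined) {
    return !options.thinking;
  }
  // With no explicit value, only the fast level opts out by default.
  return options.level === "fast";
}

// Build the env for the `claude --print` subprocess without mutating the base env.
function buildSubprocessEnv(
  options: InferenceOptions,
  baseEnv: Record<string, string | undefined>,
): Record<string, string | undefined> {
  const env: Record<string, string | undefined> = { ...baseEnv };
  if (shouldDisableThinking(options)) {
    env["MAX_THINKING_TOKENS"] = "0";
  }
  return env;
}

// Map the --no-thinking CLI flag onto the same option.
function applyCliFlags(argv: string[], options: InferenceOptions): InferenceOptions {
  return argv.includes("--no-thinking") ? { ...options, thinking: false } : options;
}
```

Keeping the decision in one pure function means the CLI flag and the programmatic option share a single code path, and the `standard` and `smart` levels pass the parent env through untouched unless they opt out.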

Single file, +22/−2, no new dependencies. Parses/transpiles clean; CLI usage string updated.

Reproduction

time claude --print --model haiku --tools '' --output-format text \
  --system-prompt 'Classify the message as A or B. Output one letter.' <<< 'hello'

time MAX_THINKING_TOKENS=0 claude --print --model haiku --tools '' --output-format text \
  --system-prompt 'Classify the message as A or B. Output one letter.' <<< 'hello'

Note

The one behavior change is the fast-level default (fast now disables thinking). Happy to make that opt-in instead if you'd prefer zero default change — it's a one-line tweak.

…ent calls

Every `claude --print` subprocess silently spends an extended-thinking budget, adding ~10s+ of latency to snap-judgment calls regardless of model tier (measured ~14.9s vs ~3.2s on a classification payload). Add a thinking?: boolean option to inference() and a --no-thinking CLI flag that set MAX_THINKING_TOKENS=0; level: 'fast' implies it. Other levels unchanged unless they opt out.
caraka marked this pull request as draft June 13, 2026 17:08
caraka marked this pull request as ready for review June 13, 2026 17:10