feat(Inference): disable extended-thinking budget for fast/snap-judgment calls by caraka · Pull Request #1348 · danielmiessler/Personal_AI_Infrastructure

caraka · 2026-06-13T17:07:25Z

Problem

Every `claude --print` subprocess spawned by `Inference.ts` silently spends an extended-thinking budget. On snap-judgment payloads (classification, tab titles, session naming) this adds ~10s+ of pure latency with negligible quality benefit, and the thinking budget, not the model tier, dominates the cost.

Measured on a classification-shaped payload:

| call | latency |
| --- | --- |
| `claude --print --model haiku` (default) | ~14.9s |
| same call, `MAX_THINKING_TOKENS=0` | ~3.2s |
| `--model sonnet`, `MAX_THINKING_TOKENS=0` | ~3.6s |

Downgrading the model barely helps (with thinking off, sonnet to haiku saves only ~0.4s); disabling the thinking budget is what collapses the latency (~11.7s saved on haiku).

Change

Add a `thinking?: boolean` option to `inference()` and a `--no-thinking` CLI flag. When `thinking: false` (or `--no-thinking`) is set, the call sets `MAX_THINKING_TOKENS=0` in the subprocess env, disabling the extended-thinking budget.
`level: 'fast'` implies `thinking: false`: fast is for snap tasks (quick generation, basic classification) by definition, so it no longer pays the thinking tax. `standard` and `smart` are unchanged unless they opt out explicitly.
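
For reviewers who want the rule in one place, here is a minimal sketch of the option resolution described above. It is not the actual diff: `InferenceLevel`, `InferenceOptions`, `thinkingEnabled`, and `buildSubprocessEnv` are hypothetical names for illustration. It also assumes that an explicit `thinking` value overrides the level default and that an unset level behaves like a non-fast level, neither of which the description spells out.

```typescript
// Hypothetical types and helpers; names are illustrative, not taken from Inference.ts.
type InferenceLevel = "fast" | "standard" | "smart";

interface InferenceOptions {
  level?: InferenceLevel;
  thinking?: boolean; // undefined means "use the default for this level"
}

type Env = Record<string, string | undefined>;

// Assumption: an explicit `thinking` value wins; otherwise only 'fast' opts out.
function thinkingEnabled(opts: InferenceOptions): boolean {
  if (opts.thinking !== undefined) {
    return opts.thinking;
  }
  return opts.level !== "fast";
}

// Returns a copy of baseEnv, adding MAX_THINKING_TOKENS=0 when thinking is off.
// The input object is never mutated.
function buildSubprocessEnv(opts: InferenceOptions, baseEnv: Env): Env {
  const env: Env = { ...baseEnv };
  if (!thinkingEnabled(opts)) {
    env["MAX_THINKING_TOKENS"] = "0";
  }
  return env;
}
```

In a caller, the returned object would be passed as the subprocess environment (for example, built from `process.env`), so `standard` and `smart` calls inherit whatever the parent environment already has.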

Single file, +22/−2, no new dependencies. Parses/transpiles clean; CLI usage string updated.

Reproduction

time claude --print --model haiku --tools '' --output-format text \
  --system-prompt 'Classify the message as A or B. Output one letter.' <<< 'hello'

time MAX_THINKING_TOKENS=0 claude --print --model haiku --tools '' --output-format text \
  --system-prompt 'Classify the message as A or B. Output one letter.' <<< 'hello'

Note

The one behavior change is the fast-level default (fast now disables thinking). Happy to make that opt-in instead if you'd prefer zero default change — it's a one-line tweak.

…ent calls Every `claude --print` subprocess silently spends an extended-thinking budget, adding ~10s+ of latency to snap-judgment calls regardless of model tier (measured ~14.9s vs ~3.2s on a classification payload). Add a thinking?: boolean option to inference() and a --no-thinking CLI flag that set MAX_THINKING_TOKENS=0; level: 'fast' implies it. Other levels unchanged unless they opt out.

caraka marked this pull request as draft June 13, 2026 17:08

caraka marked this pull request as ready for review June 13, 2026 17:10

caraka wants to merge 1 commit into danielmiessler:main from caraka:feat/inference-disable-thinking-for-fast-calls
