Skip to content

Extended thinking silently dropped on tool-continuation turns (thinking_blocks not threaded through history) #87

@triscacezar-droid

Description

@triscacezar-droid

Symptom

ml-intern requests extended thinking via thinking={"type": "adaptive"} (see agent/core/llm_params.py:153), but the assistant messages it stores in history never include thinking_blocks or reasoning_content. On every tool-continuation turn, LiteLLM logs:

LiteLLM:WARNING: Dropping 'thinking' param because the last assistant message
with tool_calls has no thinking_blocks. The model won't use extended thinking
for this turn.

This silently disables extended thinking for ~every turn after the first tool call — effectively turning Opus into non-thinking-mode Opus for the bulk of an agent's work, which is a meaningful reasoning-quality degradation for what ml-intern is used for.

Reproduction

Any agent run that goes through a tool call:

ml-intern "run bash 'date +%Y'. then tell me the year"

…produces the warning on turn 2. With a thinking-worthy prompt, the model does produce thinking on turn 1, but it's discarded before turn 2, so the warning still fires.

Root cause

In agent/core/agent_loop.py:

  1. _call_llm_streaming and _call_llm_non_streaming don't capture thinking_blocks / reasoning_content from responses.
  2. LLMResult has no fields for them.
  3. The three Message(role="assistant", ...) construction sites in the loop drop the thinking state entirely.

So the next acompletion call sees a tool-call-bearing assistant message with no thinking_blocks, and LiteLLM strips the thinking param to avoid an API error.

Proposed fix

~30 lines, no behavior change for non-thinking models:

  • Add thinking_blocks + reasoning_content fields to LLMResult.
  • Streaming path: collect raw chunks during iteration; after the stream finishes, call litellm.stream_chunk_builder(chunks) to get a reassembled ModelResponse, and pull message.thinking_blocks / message.reasoning_content off it. Best-effort — wrap in try/except so unfamiliar providers just fall back to no thinking reassembly.
  • Non-streaming path: read them directly off response.choices[0].message.
  • Attach both to every Message(role="assistant", ...) in the loop (truncation-hint site, no-tool-calls site, with-tool-calls site).

Verified locally: stream_chunk_builder handles Anthropic adaptive-thinking deltas correctly (1 thinking block + full reasoning_content reassembled from the streamed chunks); the warning disappears on turns where the model actually produced thinking; trivial prompts where adaptive thinking legitimately skips still show the warning because there's genuinely nothing to attach — which is semantically correct.

PR to follow.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions