Skip to content

Improve <think>/<thinking> tag identification in streaming responses#40

Open
mludvig wants to merge 6 commits into
danny-avila:mainfrom
mludvig:feat/nova-thinking
Open

Improve <think>/<thinking> tag identification in streaming responses#40
mludvig wants to merge 6 commits into
danny-avila:mainfrom
mludvig:feat/nova-thinking

Conversation

@mludvig
Copy link
Copy Markdown

@mludvig mludvig commented Dec 10, 2025

Some models like Amazon Nova send their thoughts wrapped in .. or .. tags. It is not guaranteed that these strings arrive as single streaming events, in fact more often than not they don't and we may receive '<', 'thinking', '>' tokens separately.

This patch introduces a state machine that deals with such a fragmented stream, identifies the tags and properly emits the thoughts content as ContentTypes.THINK for a correct frontend rendering. Fixes the output from at least Amazon Nova and quite likely from other models that emit these tags too.

Copilot AI review requested due to automatic review settings December 10, 2025 01:23
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR replaces the previous batch-processing approach to detecting <think> and <thinking> tags with a streaming state machine that handles tags split across multiple chunks. This is essential for models like Amazon Nova that emit these tags in fragments rather than complete units.

Key changes:

  • Implements a 4-state machine ('normal', 'buffering_open', 'thinking', 'buffering_close') to detect and strip thinking tags from streaming content
  • Removes the old parseThinkingContent function that required complete strings
  • Adds state tracking fields (thinkingState, tagBuffer) to AgentContext for maintaining parser state across chunks

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File Description
src/stream.ts Removed old batch parseThinkingContent function and replaced conditional logic with streaming state machine that processes content character-by-character while buffering potential tag sequences
src/specs/fragmented-thinking.test.ts Added new test file to verify the state machine correctly handles thinking tags split by whitespace boundaries using the fake streaming model
src/agents/AgentContext.ts Added thinkingState and tagBuffer fields to maintain state machine context across streaming chunks

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/specs/fragmented-thinking.test.ts
Comment thread src/stream.ts
Comment thread src/specs/fragmented-thinking.test.ts Outdated
Comment thread src/specs/fragmented-thinking.test.ts Outdated
Comment thread src/agents/AgentContext.ts Outdated
@mludvig
Copy link
Copy Markdown
Author

mludvig commented Dec 10, 2025

@danny-avila copilot comments resolved

@mludvig
Copy link
Copy Markdown
Author

mludvig commented Jan 27, 2026

Hi @danny-avila
Any news on this one? Would like to have Amazon Nova thinking properly supported.
Thanks!

Add state machine to detect <thinking> and <think> tags that arrive
split across multiple streaming chunks. This fixes Amazon Nova and similar
models that send tokens like '<thinking', '>', content, '</', 'thinking',
'>' as separate stream events.

The state machine buffers content starting with '<' and waits for
enough characters to determine if it's a thinking tag before routing
to TEXT or THINK content types.
Verifies that <thinking> tags are correctly detected and stripped from
streamed content, with thinking content routed to THINK type and regular
text routed to TEXT type.
The state machine now strips <think>/<thinking> tags from content,
so update the test to expect capture group 1 (content inside tags)
instead of capture group 0 (entire match including tags).
Ensures partial tags aren't lost when stream ends mid-buffer
@mludvig mludvig force-pushed the feat/nova-thinking branch from 0a493ef to 6ec4814 Compare January 27, 2026 05:01
danny-avila added a commit that referenced this pull request May 5, 2026
Audit verified: all 5 valid (real bypass shapes / API contract
violations / OOM risks).

P1 #37 — destructive path normalization. Patterns like \`rm -rf
\$HOME/\`, \`rm -rf ~/\`, \`rm -rf "\$HOME/"\` slipped past the bare
+ quoted destructive guards because the trailing slash broke the
end-anchor / quote-pair shapes. Extracted a shared
\`DESTRUCTIVE_TARGET\` (\`(?:\\/|~|\\\$\\{?HOME\\}?|\\.)\\/?\`) used
by both pattern lists so spelling equivalences are kept consistent.
9 tests pinned (all the spelling variants + a benign no-regression).

P2 #38 — \`resolveWorkspacePathSafe\` used host \`fs/promises.realpath\`
instead of the configured \`WorkspaceFS.realpath\`. On a custom or
remote engine the host realpath would fail and silently fall back
to lexical containment, leaving the symlink-escape clamp
ineffective. \`realpathOrSelf\` and \`realpathOfPathOrAncestor\`
now take the realpath impl as a parameter; \`resolveWorkspacePathSafe\`
threads \`getWorkspaceFS(config).realpath\` through both.

P2 #39 — direct-path \`additionalContexts\` were silently swallowed.
Hosts that returned \`additionalContext\` from PreToolUse /
PostToolUse / PostToolUseFailure for direct tools (which is every
local-engine tool) had their context discarded — broken hook API
contract. Added \`RunToolBatchContext.additionalContextsSink\`;
\`runDirectToolWithLifecycleHooks\` pushes hook contexts into it;
\`run()\` materializes the accumulated strings as a single
\`HumanMessage\` appended to outputs, matching the event-driven
path's \`injected[]\` shape. PostToolUseFailure was also changed
from fire-and-forget to await so its contexts are captured (the
hook is still observational w.r.t. the tool result).

P2 #40 — syntax-check probe cache was keyed only on spawn backend.
Same shape as P1 #34 (rg cache). Now nested
\`WeakMap<spawn, Map<envHash, ProbeCache>>\`. Stable JSON over
sorted env entries.

P2 #41 — fallback grep read each candidate file fully via
\`readFile\` then \`split('\\n')\`. The wall-clock budget only
checked between files, so a single multi-GB log could OOM the
process even with the regex DoS guards in place. Added a per-file
\`FALLBACK_GREP_MAX_FILE_BYTES = 5 MiB\` cap (stat first, skip with
sentinel if oversize) plus a deadline re-check after each read.
Hosts needing larger files should install ripgrep.

830 tests passing across all suites (was 817), lint baseline unchanged.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants