[Bugfix] DelegatingParser: strip buffered end-token text from phase B (stacks on #41068)#1
Closed
marcostephan wants to merge 3 commits into
Closed
Conversation
6ac2610 to
ef9d764
Compare
… in streaming When stop sequences set output_text_buffer_length > 0, token IDs arrive in delta_token_ids before their text is flushed into delta_text. Without a guard, find() returns -1 and the reasoning/content split is silently corrupted. Add text-presence checks before both find() calls in extract_reasoning_streaming: - </think> end token path (line 215) - <|tool_calls_section_begin|> section start path (line 223) Return None (wait for flush) when the token ID is present but the text is not, matching the fix pattern from PR vllm-project#39044 (BaseThinkingReasoningParser / DeepSeekR1) and PR vllm-project#40352 (Step3ReasoningParser). Fixes vllm-project#41067 Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Keyi Li <likey6688@gmail.com>
ef9d764 to
afaade6
Compare
Stacks on top of vllm-project#41068. PR vllm-project#41068 fixes the parser-side off-by-one when the </think> token id arrives ahead of its rendered text in delta_text. A second leak survives downstream: after the parser returns None on the boundary chunk, DelegatingParser.parse_delta sets state.reasoning_ended = True and routes the next chunks through the tool branch — which has no awareness of the still-buffered end-token text and emits it verbatim as delta.content over the next 1..N chunks. Empirically: 8/8 streamed responses leaked the literal '</think>' into delta.content on a real Kimi-K2.5 deployment with PR vllm-project#41068's parser change applied alone. Per-delta capture from a real request: delta[N]: reasoning='T' (last reasoning delta; parser returns None) delta[N+1]: content=' ' delta[N+2]: content='</thin' <- buffered </think> flushing delta[N+3]: content='k> ' delta[N+4]: content='I will...' Add four StreamState fields (end_token_skip_buffer, end_token_skip_armed, end_token_skip_done, end_token_skip_literal) plus a small post-boundary skip block in DelegatingParser.parse_delta and a helper _pending_end_token_literal that returns whichever terminator's text is still buffered. On boundary handoff: * The helper checks both </think> and (for parsers that expose it) <|tool_calls_section_begin|>. It returns the literal whose token id is in delta_token_ids but whose text has not yet flushed into delta_text, or None when the boundary text was fully visible. * If the helper returns a literal, the skip arms with that literal stored in state and the boundary chunk's delta_text seeded into the buffer (so a partial prefix of the literal that already flushed isn't lost). * If None, the skip stays disarmed (no buffered tail follows). While armed, subsequent chunks accumulate delta_text into the buffer. When the stored literal becomes visible, the buffer is split: * For </think>, the literal is consumed (downstream content begins after it). * For <|tool_calls_section_begin|>, the literal is preserved in the cleaned text — KimiK2ToolParser uses it to identify the section boundary, matching the immediate-text path in KimiK2ReasoningParser.extract_reasoning_streaming where 'content = delta_text[tool_index:]' keeps the literal. The tool-section branch is opt-in via getattr so it stays a no-op for parsers that don't expose those attributes (DeepSeek-R1, Step3p5, Ernie45, etc. all still benefit from the </think> half). Tests added in tests/reasoning/test_kimi_k2_reasoning_parser.py: * test_delegating_parser_strips_buffered_end_token — primary scenario exercised end-to-end through DelegatingParser.parse_delta. * test_delegating_parser_skip_disarmed_when_parser_splits_in_one_chunk — when </think> id and text arrive together, the skip stays disarmed; otherwise it would over-truncate post-boundary content. * test_delegating_parser_boundary_at_end_of_delta_text — when </think> lands at the very end of delta_text with no content after, the skip must still stay disarmed (buffered tail won't appear). * test_delegating_parser_handles_buffered_tool_section_end — when reasoning ends implicitly via <|tool_calls_section_begin|> under buffering, the skip waits for the section literal (not </think>), preserves it in cleaned text, and routes the in-section payload to the tool parser instead of leaking it as content. * test_delegating_parser_recovers_partial_end_token_in_boundary_chunk — the boundary chunk's delta_text can end mid-literal (e.g. '</th' on the boundary chunk, 'ink>' on the next). The buffer must be seeded with the boundary chunk's delta_text so the partial prefix isn't lost. All five rely on a small _drive helper, a Chunk dataclass, and a wrapped_kimi_parser fixture that composes KimiK2ReasoningParser + KimiK2ToolParser via _WrappedParser. Verification: uv venv --python 3.12 && uv pip install … PYTHONPATH=. .venv/bin/python -m pytest \ --confcutdir=tests/reasoning -p no:cacheprovider \ tests/reasoning/test_kimi_k2_reasoning_parser.py -v # → 15 passed in 6.07s (10 pre-existing + 5 new from this commit) # Reverted vllm/parser/abstract_parser.py to PR vllm-project#41068's HEAD; reran # the new tests: # → 5 failed (each on the literal leaking into content or the # post-boundary content being suppressed) pre-commit run --files vllm/parser/abstract_parser.py \ tests/reasoning/test_kimi_k2_reasoning_parser.py # → ruff-check, ruff-format, typos, mypy, SPDX headers all pass Production rollout: Kimi-K2.5-INT4 P/D disagg deployment, vLLM v0.20.0 + PR vllm-project#41068's parser fix + this commit applied as a runtime overlay. 8/8 leaks pre-fix → 0/25 leaks post-fix (10 direct vLLM port-forward, 15 via service mesh) on the previously-100%-leaking prompt and stop config. Output integrity hand-checked: reasoning ends cleanly at the natural stop point, content begins immediately with the model's actual answer. Duplicate-work check (gh pr list --search '41067 in:body' --state open) returns only PR vllm-project#41068, which patches the parser-side guard. No open PR addresses the DelegatingParser-level leak this commit fixes. This work is AI-assisted. The change was authored end-to-end by Claude under human direction; the human submitter has reviewed every changed line and confirms the analysis, the diff, and the test results above. Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Marco Stephan <marco@magic.dev>
de0b079 to
dd18796
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
TL;DR
Stacks on top of vllm-project/vllm#41068. PR vllm-project#41068 fixes the parser-side off-by-one when the
</think>token id arrives ahead of its rendered text. A second leak survives downstream of that fix and this PR addresses it.After PR vllm-project#41068's parser returns
Noneon the boundary chunk,DelegatingParser.parse_deltasetsstate.reasoning_ended = Trueand routes the next chunks through the tool branch, which has no awareness of the still-buffered end-token text and emits it verbatim asdelta.contentover the next 1..N chunks. Empirically: 8/8 streamed responses leaked the literal</think>intodelta.contenton a real Kimi-K2.5 deployment with PR vllm-project#41068's parser change applied alone.Branch structure
git diff jasonkeyil/fix/kimi-k2-streaming-end-token-buffer→ +330 lines in 2 files (vllm/parser/abstract_parser.py,tests/reasoning/test_kimi_k2_reasoning_parser.py), zero overlap with vllm-project#41068's hunks.Reproduction (
@sha-292411e, with PR vllm-project#41068 applied alone)Full assembled response:
content=' </think> I will deconstruct the word "reasoning"...'— literal</think>clearly visible to the client.Fix
Four new
StreamStatefields (end_token_skip_buffer,end_token_skip_armed,end_token_skip_done,end_token_skip_literal) plus a small post-boundary skip block inDelegatingParser.parse_deltaand a helper_pending_end_token_literal.On boundary handoff:
</think>and (for parsers that expose it)<|tool_calls_section_begin|>. It returns the literal whose token id is indelta_token_idsbut whose text has not yet flushed intodelta_text, orNonewhen the boundary text was fully visible.delta_textseeded into the buffer (so a partial prefix of the literal that already flushed isn't lost).None, the skip stays disarmed (no buffered tail follows; the parser already split everything in this delta).While armed, subsequent chunks accumulate
delta_textinto the buffer. When the stored literal becomes visible, the buffer is split:</think>, the literal is consumed (downstream content begins after it).<|tool_calls_section_begin|>, the literal is preserved in the cleaned text —KimiK2ToolParseruses it to identify the section boundary, matching the immediate-text path inKimiK2ReasoningParser.extract_reasoning_streamingwhich keeps the literal viacontent = delta_text[tool_index:].The tool-section branch is opt-in via
getattrso it stays a no-op for parsers that don't expose those attributes (DeepSeek-R1, Step3p5, Ernie45, etc. all still benefit from the</think>half).Tests (
tests/reasoning/test_kimi_k2_reasoning_parser.py, 5 new tests)test_delegating_parser_strips_buffered_end_token</think>flush over multiple chunks doesn't leak as contenttest_delegating_parser_skip_disarmed_when_parser_splits_in_one_chunk</think>id and text arrive together (otherwise over-truncates)test_delegating_parser_boundary_at_end_of_delta_text</think>lands at end ofdelta_textwith no content aftertest_delegating_parser_handles_buffered_tool_section_end<|tool_calls_section_begin|>: skip waits for the section literal (not</think>), preserves it in cleaned text, routes in-section payload to tool parsertest_delegating_parser_recovers_partial_end_token_in_boundary_chunkdelta_textcan end mid-literal; buffer must be seeded with boundary delta_textAll five rely on a small
_drivehelper, aChunkdataclass, and awrapped_kimi_parserfixture that composesKimiK2ReasoningParser+KimiK2ToolParservia_WrappedParser.Test commands run + results
Per AGENTS.md:
Production rollout
Kimi-K2.5-INT4 P/D disagg, vLLM v0.20.0 + PR vllm-project#41068's parser fix + this commit, applied as a runtime overlay:
Output integrity hand-checked post-fix: reasoning ends cleanly at the natural stop point, content begins immediately with the model's actual answer (no leading
</think>, no missing chars, no gap deltas).Duplicate-work check
AI assistance disclosure
Per AGENTS.md: this work is AI-assisted. The change was authored end-to-end by Claude under human direction; the human submitter has reviewed every changed line and confirms the analysis, the diff, and the test results above.
What this PR does NOT change
extract_reasoning()on the full string.DelegatingParser.parse_deltatoken-id handling:extract_content_idsalready strips the end-token id fromcurrent_token_idson handoff. Only the text path needed fixing.reasoning_end_stris configured: skip stays disarmed (no-op).