Skip to content

[Bugfix] KimiK2ReasoningParser: guard against buffered end-token in streaming#41068

Merged
sfeng33 merged 3 commits into
vllm-project:mainfrom
JasonKeyiL:fix/kimi-k2-streaming-end-token-buffer
May 4, 2026
Merged

[Bugfix] KimiK2ReasoningParser: guard against buffered end-token in streaming#41068
sfeng33 merged 3 commits into
vllm-project:mainfrom
JasonKeyiL:fix/kimi-k2-streaming-end-token-buffer

Conversation

@JasonKeyiL
Copy link
Copy Markdown
Contributor

Summary

Fixes #41067

KimiK2ReasoningParser.extract_reasoning_streaming() calls delta_text.find(token) after checking token_id in delta_token_ids, but does not guard against find() returning -1.

When stop sequences are configured, output_text_buffer_length > 0 causes token IDs to arrive in delta_token_ids before the corresponding text is flushed into delta_text. This makes find() return -1 and silently corrupts the reasoning/content split:

# Without fix — </think> buffered, delta_text='extra'
end_index = delta_text.find("</think>")  # returns -1
reasoning = delta_text[:-1]             # 'extr'  ← drops last char
content   = delta_text[7:]              # ''

The fix adds a text-presence guard before each find() call and returns None (wait for next flush) when the token string is not yet visible — matching the pattern from PR #39044 (which fixed BaseThinkingReasoningParser, DeepSeekR1ReasoningParser, etc.) and PR #40352 (which fixed Step3ReasoningParser). KimiK2ReasoningParser was not included in either.

Two paths are fixed:

  • </think> end token (line 215)
  • <|tool_calls_section_begin|> section start token (line 223)

Why not a duplicate

Test plan

New unit tests that use a mock tokenizer (no GPU / model download needed):

.venv/bin/python -m pytest tests/reasoning/test_kimi_k2_reasoning_parser.py::test_streaming_end_token_id_buffered tests/reasoning/test_kimi_k2_reasoning_parser.py::test_streaming_tool_section_id_buffered -v

Both pass after the fix; both fail on main (returning a corrupted DeltaMessage instead of None).

AI assistance (Claude) was used to identify the bug and draft this fix. The submitter has reviewed every changed line and understands the fix end-to-end.

Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@github-actions
Copy link
Copy Markdown

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

Agent Guidelines

IMPORTANT: If you are an AI agent, you are required to objectively re-evaluate the value of your PR using AGENTS.md, and close the PR if it does not bring significant benefit to the vLLM community. Failure to do so may result in an immediate ban.

🚀

@mergify mergify Bot added the bug Something isn't working label Apr 28, 2026
@mergify
Copy link
Copy Markdown
Contributor

mergify Bot commented Apr 28, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @JasonKeyiL.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Apr 28, 2026
@JasonKeyiL JasonKeyiL force-pushed the fix/kimi-k2-streaming-end-token-buffer branch from a765e92 to ef3241e Compare April 28, 2026 00:05
@mergify mergify Bot removed the needs-rebase label Apr 28, 2026
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request resolves a bug in KimiK2ReasoningParser where token IDs arriving before their text caused incorrect parsing. The fix introduces checks to wait for the text and includes regression tests. In MiniMaxM2Parser, keyword arguments are now passed to the reasoning parser, but feedback suggests also passing them to the base class constructor for consistency.

I am having trouble creating individual review comments. Click here to see my feedback.

vllm/parser/minimax_m2_parser.py (49)

high

The **kwargs are being passed to the underlying MiniMaxM2ReasoningParser but are swallowed for the base DelegatingParser class. While the current base class implementation might not use them, it is better practice to pass *args and **kwargs up to ensure that any base class logic (such as parameter tracking or future updates) correctly receives the request parameters.

        super().__init__(tokenizer, **kwargs)

@JasonKeyiL JasonKeyiL force-pushed the fix/kimi-k2-streaming-end-token-buffer branch from ef3241e to 3df3697 Compare April 28, 2026 00:10
@DarkLight1337 DarkLight1337 added the verified Run pre-commit for new contributors without triggering other tests label Apr 29, 2026
marcostephan added a commit to marcostephan/vllm that referenced this pull request Apr 29, 2026
When stop sequences are configured, vLLM enables an output_text_buffer
that delays text emission so suffix-matching against the stop strings
can be performed without leaking partial matches. As a side effect, the
reasoning end-token id (</think>) can arrive in delta_token_ids ahead of
its rendered text — the text flushes over the next 1-N chunks. Two
related defects follow.

In KimiK2ReasoningParser.extract_reasoning_streaming, the unguarded
delta_text.find(self._end_token) returns -1 on the boundary chunk and
the slice over-truncates reasoning while seeding content with whatever
chars happened to be in delta_text.

After the boundary chunk, DelegatingParser.parse_delta sets
state.reasoning_ended=True and routes subsequent chunks through the
tool branch, which has no awareness of the still-buffered end-token
text. The buffered </think> then flushes verbatim into delta.content
over the next 1-N chunks. This bypasses the reasoning parser entirely
and is therefore not caught by patches confined to that class.

Reproduced on a real serving deployment with max_tokens=512,
temperature=0, stop=['<|im_end|>', '\n\nUser:'], single-turn request:
8/8 streamed responses leaked the literal '</think>' into delta.content.
Per-delta capture showed the boundary as

  delta[N]:   reasoning='T'         (last reasoning delta)
  delta[N+1]: content=' '
  delta[N+2]: content='</thin'      (buffered </think> flushing)
  delta[N+3]: content='k> '
  delta[N+4]: content='I will...'   (real content begins)

Fix:

* KimiK2ReasoningParser: when the end-token id is in delta_token_ids
  but the literal is not in delta_text, return None and let the
  buffered text flush. Same guard for the implicit
  <|tool_calls_section_begin|> reasoning end.

* DelegatingParser.parse_delta: add a post-boundary skip phase. When
  the parser returned None on the boundary handoff (i.e. the end-token
  text is still buffered), accumulate phase-B delta_text until the
  end-token literal appears, then strip everything up through it and
  forward the remainder to the tool parser. Three new StreamState
  fields gate the mechanism so it remains a no-op on streams where the
  parser already split delta_text in a single chunk and on streams
  without a reasoning_end_str. The skip applies to any ReasoningParser
  exposing a non-empty reasoning_end_str, so DeepSeek-R1, Step3,
  Step3p5, Ernie45 and similar parsers benefit without further changes.

Tests added in tests/reasoning/test_kimi_k2_reasoning_parser.py:

* test_streaming_end_token_id_arrives_before_text — parser-level guard
  for </think>.
* test_streaming_tool_section_id_arrives_before_text — same guard for
  the implicit tool-section reasoning end.
* test_delegating_parser_does_not_leak_buffered_end_token_text —
  end-to-end through DelegatingParser, verifies no </think> leak and
  post-boundary content survives.
* test_delegating_parser_disarmed_when_parser_splits_in_one_chunk —
  the skip mechanism stays disarmed when delta_text already contains
  </think> on the boundary chunk; would over-truncate otherwise.
* test_delegating_parser_handles_torch_tensor_token_ids — exercises
  the same flow with delta_token_ids constructed as torch tensors and
  converted via .tolist(), mirroring the engine code path.

Verification:

* uv venv + uv pip install per AGENTS.md, then
  PYTHONPATH=. python -m pytest --confcutdir=tests/reasoning -p no:cacheprovider \
      tests/reasoning/test_kimi_k2_reasoning_parser.py -v
  → 13 passed in 6.08s.
* Reverted both implementation files to HEAD~1 (pre-fix), reran the
  five new tests → all five fail with AssertionError: "Leak: '</think>'
  in content" (or AttributeError on the disarm test referencing the
  new StreamState fields).
* pre-commit run on the three changed files: ruff-check, ruff-format,
  typos, mypy, SPDX headers all pass.
* Production serving deployment: 8/8 leaks pre-fix → 0/25 leaks
  post-fix (10 direct vLLM port-forward + 15 via service mesh, on the
  previously-100%-leaking prompt and stop config). Non-streaming
  remained leak-free (structurally unaffected).

This work is AI-assisted. The change was authored end-to-end by Claude
under human direction; the human submitter has reviewed every changed
line and confirms the analysis, the diff, and the test results above.
No open PR was found addressing the same DelegatingParser-level leak
(`gh pr list --repo vllm-project/vllm --search '41067 in:body'` returns
only vllm-project#41068, which patches only the parser-side guard and does not
address the buffered-flush race in the tool-branch handoff).

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Marco Stephan <marco@magic.dev>
… in streaming

When stop sequences set output_text_buffer_length > 0, token IDs arrive in
delta_token_ids before their text is flushed into delta_text. Without a guard,
find() returns -1 and the reasoning/content split is silently corrupted.

Add text-presence checks before both find() calls in extract_reasoning_streaming:
- </think> end token path (line 215)
- <|tool_calls_section_begin|> section start path (line 223)

Return None (wait for flush) when the token ID is present but the text is not,
matching the fix pattern from PR vllm-project#39044 (BaseThinkingReasoningParser / DeepSeekR1)
and PR vllm-project#40352 (Step3ReasoningParser).

Fixes vllm-project#41067

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Keyi Li <likey6688@gmail.com>
@JasonKeyiL JasonKeyiL force-pushed the fix/kimi-k2-streaming-end-token-buffer branch from 8a304a4 to 093c83a Compare April 30, 2026 01:51
@JasonKeyiL
Copy link
Copy Markdown
Contributor Author

Hi @sfeng33 , would you be mind take a look when you have a moment?
much appreciate it!

@sfeng33 sfeng33 self-assigned this Apr 30, 2026
marcostephan added a commit to marcostephan/vllm that referenced this pull request Apr 30, 2026
Stacks on top of vllm-project#41068 (vendored at this branch's base): that PR fixes
the parser-side off-by-one when the end-token id arrives ahead of its
rendered text in delta_text, but a second leak survives downstream.
After the parser returns None on the boundary chunk,
DelegatingParser.parse_delta sets state.reasoning_ended = True and
routes the next chunks through the tool branch — which has no awareness
of the still-buffered end-token text and emits it verbatim as
delta.content over the next 1..N chunks. With vllm-project#41068 alone applied, 8/8
streamed responses on a real Kimi-K2.5 deployment leaked the literal
'</think>' into delta.content (max_tokens=512, temperature=0,
stop=['<|im_end|>', '\n\nUser:'], single-turn).

Add three StreamState fields and a small post-boundary skip block in
DelegatingParser.parse_delta. When the parser returns None on the
boundary handoff (i.e. the end-token text is still buffered) the skip
arms; subsequent chunks accumulate delta_text until the end-token
literal becomes visible, are stripped through it, and the remainder is
forwarded to the tool parser. The skip stays disarmed when the parser
already split delta_text in the same chunk (no buffered tail follows).
The mechanism applies to any ReasoningParser exposing a non-empty
reasoning_end_str, so DeepSeek-R1, Step3, Step3p5, Ernie45 and similar
parsers benefit without further changes.

Tests added in tests/reasoning/test_kimi_k2_reasoning_parser.py:

- test_delegating_parser_strips_buffered_end_token (parametrized over
  list and torch tensor.tolist() input shapes — engine call site
  produces token_ids via as_list(output.token_ids) on a torch tensor)
- test_delegating_parser_skip_disarmed_when_parser_splits_in_one_chunk

Verification:

  uv venv --python 3.12 && uv pip install …
  PYTHONPATH=. .venv/bin/python -m pytest \
      --confcutdir=tests/reasoning -p no:cacheprovider \
      tests/reasoning/test_kimi_k2_reasoning_parser.py -v
  # → 13 passed in 4.40s (10 pre-existing + 3 new from this commit)

  # Reverted vllm/parser/abstract_parser.py to PR vllm-project#41068 HEAD, reran
  # the three new tests → all three fail (AssertionError on the literal
  # </think> appearing in content, AttributeError on the disarm test
  # referencing the new StreamState fields).

  pre-commit run --files vllm/parser/abstract_parser.py \
      tests/reasoning/test_kimi_k2_reasoning_parser.py
  # → ruff-check, ruff-format, typos, mypy, SPDX headers all pass

Production rollout (Kimi-K2.5-INT4 P/D disagg, vLLM nightly
fe9c3d6 + vllm-project#41068's parser fix + this commit, applied as a
runtime overlay): 8/8 leaks pre-fix → 0/25 leaks post-fix (10 direct
vLLM port-forward + 15 via service mesh, on the previously-100%-
leaking prompt and stop config). Output integrity hand-checked:
reasoning ends cleanly at the natural stop point, content begins
immediately with the model's actual answer, no missing chars.

This work is AI-assisted. The change was authored end-to-end by
Claude under human direction; the human submitter has reviewed every
changed line and confirms the analysis, the diff, and the test
results above. `gh pr list --repo vllm-project/vllm --search '41067
in:body'` returns only vllm-project#41068 (which patches the parser-side guard)
— no open PR addresses the DelegatingParser-level leak this commit
fixes.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Marco Stephan <marco@magic.dev>
@marcostephan
Copy link
Copy Markdown

marcostephan commented Apr 30, 2026

Hey, confirming this fix is needed. We hit the exact same bug on a Kimi-K2.5-INT4 disaggregated deployment running the upstream nightly (vllm-openai:cu130-nightly-fe9c3d6c5f66). 8/8 streamed responses leaked the literal </think> into delta.content on max_tokens=512, temperature=0, stop=['<|im_end|>', '\n\nUser:'], single-turn requests.

I spent a day digging into this with Claude. One thing worth surfacing: this PR's guard is necessary, but on its own doesn't fully fix the leak. We still got 8/8 leaks with just this PR's logic applied as an overlay on our serving image.

The second-stage leak is in DelegatingParser.parse_delta: after extract_reasoning_streaming returns None on the boundary chunk (which is exactly what your new guard makes it do), state.reasoning_ended flips True and subsequent chunks go through the tool branch. The buffered </think> then flushes a few chars at a time while reasoning is "ended", and the tool branch doesn't know those chars are the lagging end-token, so it emits them as delta.content. Per-delta capture from a real request:

delta[N]:   reasoning='T'         (last reasoning delta; parser returns None)
delta[N+1]: content=' '
delta[N+2]: content='</thin'      <- buffered </think> flushing
delta[N+3]: content='k> '
delta[N+4]: content='I will...'

The follow-up patch is a small post-boundary skip in DelegatingParser.parse_delta: 3 new StreamState fields plus a buffer-and-strip block. With both fixes applied, 0/25 leaks across our production verification on the prompt that previously leaked 100%. Same logic should also help DeepSeek-R1, Step3, Step3p5, Ernie45 and any other parser with a non-empty reasoning_end_str.

Cleanest would probably be me opening a PR into your branch with the additional fix so it all lands as one upstream PR. Otherwise this can land as-is and I'll send a follow-up against main, or we roll both into a fresh PR with you as co-author. Whatever works for you.

Branch with the follow-up is here for reference: marcostephan#1 (+174 lines on top of your branch: the skip block plus 3 new tests parametrized over list and torch.tensor.tolist() token-id shapes).

Happy either way. Let me know what you prefer.

marcostephan added a commit to marcostephan/vllm that referenced this pull request Apr 30, 2026
Stacks on top of vllm-project#41068. PR vllm-project#41068 fixes the parser-side off-by-one
when the </think> token id arrives ahead of its rendered text in
delta_text. A second leak survives downstream: after the parser
returns None on the boundary chunk, DelegatingParser.parse_delta sets
state.reasoning_ended = True and routes the next chunks through the
tool branch — which has no awareness of the still-buffered end-token
text and emits it verbatim as delta.content over the next 1..N chunks.

Empirically: 8/8 streamed responses leaked the literal '</think>' into
delta.content on a real Kimi-K2.5 deployment with PR vllm-project#41068's parser
change applied alone. Per-delta capture from a real request:

  delta[N]:   reasoning='T'         (last reasoning delta; parser returns None)
  delta[N+1]: content=' '
  delta[N+2]: content='</thin'      <- buffered </think> flushing
  delta[N+3]: content='k> '
  delta[N+4]: content='I will...'

Add four StreamState fields (end_token_skip_buffer,
end_token_skip_armed, end_token_skip_done, end_token_skip_literal)
plus a small post-boundary skip block in DelegatingParser.parse_delta
and a helper _pending_end_token_literal that returns whichever
terminator's text is still buffered.

On boundary handoff:

* The helper checks both </think> and (for parsers that expose it)
  <|tool_calls_section_begin|>. It returns the literal whose token id
  is in delta_token_ids but whose text has not yet flushed into
  delta_text, or None when the boundary text was fully visible.
* If the helper returns a literal, the skip arms with that literal
  stored in state and the boundary chunk's delta_text seeded into the
  buffer (so a partial prefix of the literal that already flushed
  isn't lost).
* If None, the skip stays disarmed (no buffered tail follows).

While armed, subsequent chunks accumulate delta_text into the buffer.
When the stored literal becomes visible, the buffer is split:

* For </think>, the literal is consumed (downstream content begins
  after it).
* For <|tool_calls_section_begin|>, the literal is preserved in the
  cleaned text — KimiK2ToolParser uses it to identify the section
  boundary, matching the immediate-text path in
  KimiK2ReasoningParser.extract_reasoning_streaming where
  'content = delta_text[tool_index:]' keeps the literal.

The tool-section branch is opt-in via getattr so it stays a no-op for
parsers that don't expose those attributes (DeepSeek-R1, Step3p5,
Ernie45, etc. all still benefit from the </think> half).

Tests added in tests/reasoning/test_kimi_k2_reasoning_parser.py:

* test_delegating_parser_strips_buffered_end_token — primary scenario
  exercised end-to-end through DelegatingParser.parse_delta.
* test_delegating_parser_skip_disarmed_when_parser_splits_in_one_chunk
  — when </think> id and text arrive together, the skip stays
  disarmed; otherwise it would over-truncate post-boundary content.
* test_delegating_parser_boundary_at_end_of_delta_text — when </think>
  lands at the very end of delta_text with no content after, the skip
  must still stay disarmed (buffered tail won't appear).
* test_delegating_parser_handles_buffered_tool_section_end — when
  reasoning ends implicitly via <|tool_calls_section_begin|> under
  buffering, the skip waits for the section literal (not </think>),
  preserves it in cleaned text, and routes the in-section payload to
  the tool parser instead of leaking it as content.
* test_delegating_parser_recovers_partial_end_token_in_boundary_chunk
  — the boundary chunk's delta_text can end mid-literal (e.g. '</th'
  on the boundary chunk, 'ink>' on the next). The buffer must be
  seeded with the boundary chunk's delta_text so the partial prefix
  isn't lost.

All five rely on a small _drive helper, a Chunk dataclass, and a
wrapped_kimi_parser fixture that composes KimiK2ReasoningParser +
KimiK2ToolParser via _WrappedParser.

Verification:

  uv venv --python 3.12 && uv pip install …
  PYTHONPATH=. .venv/bin/python -m pytest \
      --confcutdir=tests/reasoning -p no:cacheprovider \
      tests/reasoning/test_kimi_k2_reasoning_parser.py -v
  # → 15 passed in 6.07s (10 pre-existing + 5 new from this commit)

  # Reverted vllm/parser/abstract_parser.py to PR vllm-project#41068's HEAD; reran
  # the new tests:
  # → 5 failed (each on the literal leaking into content or the
  #   post-boundary content being suppressed)

  pre-commit run --files vllm/parser/abstract_parser.py \
      tests/reasoning/test_kimi_k2_reasoning_parser.py
  # → ruff-check, ruff-format, typos, mypy, SPDX headers all pass

Production rollout: Kimi-K2.5-INT4 P/D disagg deployment, vLLM v0.20.0
+ PR vllm-project#41068's parser fix + this commit applied as a runtime overlay.
8/8 leaks pre-fix → 0/25 leaks post-fix (10 direct vLLM port-forward,
15 via service mesh) on the previously-100%-leaking prompt and stop
config. Output integrity hand-checked: reasoning ends cleanly at the
natural stop point, content begins immediately with the model's
actual answer.

Duplicate-work check (gh pr list --search '41067 in:body' --state open)
returns only PR vllm-project#41068, which patches the parser-side guard. No open
PR addresses the DelegatingParser-level leak this commit fixes.

This work is AI-assisted. The change was authored end-to-end by
Claude under human direction; the human submitter has reviewed every
changed line and confirms the analysis, the diff, and the test
results above.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Marco Stephan <marco@magic.dev>
@JasonKeyiL
Copy link
Copy Markdown
Contributor Author

Hi @marcostephan, option 1 works great for me,
feel free to open PR into this branch with additional fix, I'd happy to review and merge it in!

@JasonKeyiL
Copy link
Copy Markdown
Contributor Author

Hi @sfeng33,
this pr has been opened for few days, could we approve and merge this first? and have @marcostephan create separate pr against main for the additional fix he identified.
thanks!
cc: @marcostephan

@JasonKeyiL
Copy link
Copy Markdown
Contributor Author

Hi @aarnphm @chaunceyjiang @bbrowning , could you take a look? thanks!

@marcostephan
Copy link
Copy Markdown

@JasonKeyiL Opened a pr in your fork. Also happy to create another one upstream if it gets merged before

@ehfd
Copy link
Copy Markdown
Contributor

ehfd commented May 4, 2026

Awesome! We could merge the fork PR before upstream.

Copy link
Copy Markdown
Collaborator

@sfeng33 sfeng33 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the fix!

@sfeng33 sfeng33 added the ready ONLY add when PR is ready to merge/full CI is needed label May 4, 2026
@sfeng33 sfeng33 enabled auto-merge (squash) May 4, 2026 15:59
@sfeng33 sfeng33 merged commit 712ad02 into vllm-project:main May 4, 2026
48 checks passed
@JasonKeyiL
Copy link
Copy Markdown
Contributor Author

Thank you! @sfeng33

@JasonKeyiL
Copy link
Copy Markdown
Contributor Author

Hi @marcostephan, looks like its getting merged, can you create one upstream?

marcostephan added a commit to marcostephan/vllm that referenced this pull request May 5, 2026
…al end-token

Follow-up to the previous commit on this branch, addressing review
feedback from gemini-code-assist on PR vllm-project#41717.

When the boundary chunk's delta_text contains reasoning content
immediately followed by a partial `</think>` literal (e.g.
`"final_thought</th"` with the end-token id in delta_token_ids), the
Kimi parser returns None per vllm-project#41068 (literal not yet visible) and the
entire `"final_thought</th"` lands in `end_token_skip_buffer`. On the
next chunk that completes the literal, the prior split

    cleaned = state.end_token_skip_buffer[content_start:]

discarded everything before the literal — silently dropping the
`"final_thought"` prefix from the response.

Fix: capture the pre-literal slice into `recovered_reasoning_prefix`
during skip resolution and merge it into `delta_message.reasoning`
after the tool branch runs (the tool branch returns a fresh
DeltaMessage, so the merge has to happen post-branch to avoid being
overwritten).

Test added: test_delegating_parser_recovers_reasoning_prefix_buffered_with_partial_end_token
- Boundary chunk: `"final_thought</th"` + end-token id.
- Verifies `final_thought` ends up in delta.reasoning, not dropped or
  leaked into delta.content.
- Confirmed failing on the prior commit (drops `final_thought`),
  passing here.

Verification:

    PYTHONPATH=. .venv/bin/python -m pytest \
        --confcutdir=tests/reasoning -p no:cacheprovider \
        tests/reasoning/test_kimi_k2_reasoning_parser.py -v
    # 16 passed in 6.55s (15 prior + 1 new)

    pre-commit run --files vllm/parser/abstract_parser.py \
        tests/reasoning/test_kimi_k2_reasoning_parser.py
    # ruff, format, typos, mypy, SPDX all pass

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Marco Stephan <marco@magic.dev>
chaojun-zhang pushed a commit to chaojun-zhang/vllm that referenced this pull request May 6, 2026
…treaming (vllm-project#41068)

Signed-off-by: Keyi Li <likey6688@gmail.com>
Co-authored-by: Keyi Li <likey6688@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Flora Feng <4florafeng@gmail.com>
Copilot AI pushed a commit to hongbolv/vllm that referenced this pull request May 7, 2026
…treaming (vllm-project#41068)

Signed-off-by: Keyi Li <likey6688@gmail.com>
Co-authored-by: Keyi Li <likey6688@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Flora Feng <4florafeng@gmail.com>
Co-authored-by: hongbolv <33214277+hongbolv@users.noreply.github.com>
ikaadil pushed a commit to ikaadil/vllm that referenced this pull request May 7, 2026
…treaming (vllm-project#41068)

Signed-off-by: Keyi Li <likey6688@gmail.com>
Co-authored-by: Keyi Li <likey6688@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Flora Feng <4florafeng@gmail.com>
Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>
weifang231 pushed a commit to weifang231/eb-vllm that referenced this pull request May 13, 2026
…treaming (vllm-project#41068)

Signed-off-by: Keyi Li <likey6688@gmail.com>
Co-authored-by: Keyi Li <likey6688@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Flora Feng <4florafeng@gmail.com>
mfylcek pushed a commit to mfylcek/vllm that referenced this pull request May 19, 2026
…treaming (vllm-project#41068)

Signed-off-by: Keyi Li <likey6688@gmail.com>
Co-authored-by: Keyi Li <likey6688@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Flora Feng <4florafeng@gmail.com>
jhu960213 pushed a commit to jhu960213/vllm that referenced this pull request May 20, 2026
…treaming (vllm-project#41068)

Signed-off-by: Keyi Li <likey6688@gmail.com>
Co-authored-by: Keyi Li <likey6688@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Flora Feng <4florafeng@gmail.com>
marcostephan added a commit to marcostephan/vllm that referenced this pull request May 27, 2026
Stacks on top of vllm-project#41068. PR vllm-project#41068 fixes the parser-side off-by-one
when the </think> token id arrives ahead of its rendered text in
delta_text. A second leak survives downstream: after the parser
returns None on the boundary chunk, DelegatingParser.parse_delta sets
state.reasoning_ended = True and routes the next chunks through the
tool branch — which has no awareness of the still-buffered end-token
text and emits it verbatim as delta.content over the next 1..N chunks.

Empirically: 8/8 streamed responses leaked the literal '</think>' into
delta.content on a real Kimi-K2.5 deployment with PR vllm-project#41068's parser
change applied alone. Per-delta capture from a real request:

  delta[N]:   reasoning='T'         (last reasoning delta; parser returns None)
  delta[N+1]: content=' '
  delta[N+2]: content='</thin'      <- buffered </think> flushing
  delta[N+3]: content='k> '
  delta[N+4]: content='I will...'

Add four StreamState fields (end_token_skip_buffer,
end_token_skip_armed, end_token_skip_done, end_token_skip_literal)
plus a small post-boundary skip block in DelegatingParser.parse_delta
and a helper _pending_end_token_literal that returns whichever
terminator's text is still buffered.

On boundary handoff:

* The helper checks both </think> and (for parsers that expose it)
  <|tool_calls_section_begin|>. It returns the literal whose token id
  is in delta_token_ids but whose text has not yet flushed into
  delta_text, or None when the boundary text was fully visible.
* If the helper returns a literal, the skip arms with that literal
  stored in state and the boundary chunk's delta_text seeded into the
  buffer (so a partial prefix of the literal that already flushed
  isn't lost).
* If None, the skip stays disarmed (no buffered tail follows).

While armed, subsequent chunks accumulate delta_text into the buffer.
When the stored literal becomes visible, the buffer is split:

* For </think>, the literal is consumed (downstream content begins
  after it).
* For <|tool_calls_section_begin|>, the literal is preserved in the
  cleaned text — KimiK2ToolParser uses it to identify the section
  boundary, matching the immediate-text path in
  KimiK2ReasoningParser.extract_reasoning_streaming where
  'content = delta_text[tool_index:]' keeps the literal.

The tool-section branch is opt-in via getattr so it stays a no-op for
parsers that don't expose those attributes (DeepSeek-R1, Step3p5,
Ernie45, etc. all still benefit from the </think> half).

Tests added in tests/reasoning/test_kimi_k2_reasoning_parser.py:

* test_delegating_parser_strips_buffered_end_token — primary scenario
  exercised end-to-end through DelegatingParser.parse_delta.
* test_delegating_parser_skip_disarmed_when_parser_splits_in_one_chunk
  — when </think> id and text arrive together, the skip stays
  disarmed; otherwise it would over-truncate post-boundary content.
* test_delegating_parser_boundary_at_end_of_delta_text — when </think>
  lands at the very end of delta_text with no content after, the skip
  must still stay disarmed (buffered tail won't appear).
* test_delegating_parser_handles_buffered_tool_section_end — when
  reasoning ends implicitly via <|tool_calls_section_begin|> under
  buffering, the skip waits for the section literal (not </think>),
  preserves it in cleaned text, and routes the in-section payload to
  the tool parser instead of leaking it as content.
* test_delegating_parser_recovers_partial_end_token_in_boundary_chunk
  — the boundary chunk's delta_text can end mid-literal (e.g. '</th'
  on the boundary chunk, 'ink>' on the next). The buffer must be
  seeded with the boundary chunk's delta_text so the partial prefix
  isn't lost.

All five rely on a small _drive helper, a Chunk dataclass, and a
wrapped_kimi_parser fixture that composes KimiK2ReasoningParser +
KimiK2ToolParser via _WrappedParser.

Verification:

  uv venv --python 3.12 && uv pip install …
  PYTHONPATH=. .venv/bin/python -m pytest \
      --confcutdir=tests/reasoning -p no:cacheprovider \
      tests/reasoning/test_kimi_k2_reasoning_parser.py -v
  # → 15 passed in 6.07s (10 pre-existing + 5 new from this commit)

  # Reverted vllm/parser/abstract_parser.py to PR vllm-project#41068's HEAD; reran
  # the new tests:
  # → 5 failed (each on the literal leaking into content or the
  #   post-boundary content being suppressed)

  pre-commit run --files vllm/parser/abstract_parser.py \
      tests/reasoning/test_kimi_k2_reasoning_parser.py
  # → ruff-check, ruff-format, typos, mypy, SPDX headers all pass

Production rollout: Kimi-K2.5-INT4 P/D disagg deployment, vLLM v0.20.0
+ PR vllm-project#41068's parser fix + this commit applied as a runtime overlay.
8/8 leaks pre-fix → 0/25 leaks post-fix (10 direct vLLM port-forward,
15 via service mesh) on the previously-100%-leaking prompt and stop
config. Output integrity hand-checked: reasoning ends cleanly at the
natural stop point, content begins immediately with the model's
actual answer.

Duplicate-work check (gh pr list --search '41067 in:body' --state open)
returns only PR vllm-project#41068, which patches the parser-side guard. No open
PR addresses the DelegatingParser-level leak this commit fixes.

This work is AI-assisted. The change was authored end-to-end by
Claude under human direction; the human submitter has reviewed every
changed line and confirms the analysis, the diff, and the test
results above.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Marco Stephan <marco@magic.dev>
marcostephan added a commit to marcostephan/vllm that referenced this pull request May 27, 2026
…al end-token

Follow-up to the previous commit on this branch, addressing review
feedback from gemini-code-assist on PR vllm-project#41717.

When the boundary chunk's delta_text contains reasoning content
immediately followed by a partial `</think>` literal (e.g.
`"final_thought</th"` with the end-token id in delta_token_ids), the
Kimi parser returns None per vllm-project#41068 (literal not yet visible) and the
entire `"final_thought</th"` lands in `end_token_skip_buffer`. On the
next chunk that completes the literal, the prior split

    cleaned = state.end_token_skip_buffer[content_start:]

discarded everything before the literal — silently dropping the
`"final_thought"` prefix from the response.

Fix: capture the pre-literal slice into `recovered_reasoning_prefix`
during skip resolution and merge it into `delta_message.reasoning`
after the tool branch runs (the tool branch returns a fresh
DeltaMessage, so the merge has to happen post-branch to avoid being
overwritten).

Test added: test_delegating_parser_recovers_reasoning_prefix_buffered_with_partial_end_token
- Boundary chunk: `"final_thought</th"` + end-token id.
- Verifies `final_thought` ends up in delta.reasoning, not dropped or
  leaked into delta.content.
- Confirmed failing on the prior commit (drops `final_thought`),
  passing here.

Verification:

    PYTHONPATH=. .venv/bin/python -m pytest \
        --confcutdir=tests/reasoning -p no:cacheprovider \
        tests/reasoning/test_kimi_k2_reasoning_parser.py -v
    # 16 passed in 6.55s (15 prior + 1 new)

    pre-commit run --files vllm/parser/abstract_parser.py \
        tests/reasoning/test_kimi_k2_reasoning_parser.py
    # ruff, format, typos, mypy, SPDX all pass

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Marco Stephan <marco@magic.dev>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working ready ONLY add when PR is ready to merge/full CI is needed verified Run pre-commit for new contributors without triggering other tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: KimiK2ReasoningParser silently corrupts streaming output when stop sequences buffer text

5 participants