Skip to content

[Bugfix] DelegatingParser: strip buffered end-token text from phase B (stacks on #41068)#1

Closed
marcostephan wants to merge 3 commits into
mainfrom
fix/kimi-reasoning-parser-streaming-buffered-end-token
Closed

[Bugfix] DelegatingParser: strip buffered end-token text from phase B (stacks on #41068)#1
marcostephan wants to merge 3 commits into
mainfrom
fix/kimi-reasoning-parser-streaming-buffered-end-token

Conversation

@marcostephan
Copy link
Copy Markdown
Owner

@marcostephan marcostephan commented Apr 29, 2026

TL;DR

Stacks on top of vllm-project/vllm#41068. PR vllm-project#41068 fixes the parser-side off-by-one when the </think> token id arrives ahead of its rendered text. A second leak survives downstream of that fix and this PR addresses it.

After PR vllm-project#41068's parser returns None on the boundary chunk, DelegatingParser.parse_delta sets state.reasoning_ended = True and routes the next chunks through the tool branch, which has no awareness of the still-buffered end-token text and emits it verbatim as delta.content over the next 1..N chunks. Empirically: 8/8 streamed responses leaked the literal </think> into delta.content on a real Kimi-K2.5 deployment with PR vllm-project#41068's parser change applied alone.

Branch structure

1f08d3c (PR #41068 HEAD: Merge main into fix/kimi-k2-streaming-end-token-buffer)
└── dd18796 (this PR: DelegatingParser post-boundary skip)

git diff jasonkeyil/fix/kimi-k2-streaming-end-token-buffer → +330 lines in 2 files (vllm/parser/abstract_parser.py, tests/reasoning/test_kimi_k2_reasoning_parser.py), zero overlap with vllm-project#41068's hunks.

Reproduction (@sha-292411e, with PR vllm-project#41068 applied alone)

delta[N]:   reasoning='T'         (last reasoning delta — N+1's boundary chunk has end-token id, parser returns None)
delta[N+1]: content=' '
delta[N+2]: content='</thin'      ← buffered </think> flushing through the tool branch
delta[N+3]: content='k> '
delta[N+4]: content='I will...'   (real content begins)

Full assembled response: content=' </think> I will deconstruct the word "reasoning"...' — literal </think> clearly visible to the client.

Fix

Four new StreamState fields (end_token_skip_buffer, end_token_skip_armed, end_token_skip_done, end_token_skip_literal) plus a small post-boundary skip block in DelegatingParser.parse_delta and a helper _pending_end_token_literal.

On boundary handoff:

  • The helper checks both </think> and (for parsers that expose it) <|tool_calls_section_begin|>. It returns the literal whose token id is in delta_token_ids but whose text has not yet flushed into delta_text, or None when the boundary text was fully visible.
  • If the helper returns a literal, the skip arms with that literal stored in state and the boundary chunk's delta_text seeded into the buffer (so a partial prefix of the literal that already flushed isn't lost).
  • If None, the skip stays disarmed (no buffered tail follows; the parser already split everything in this delta).

While armed, subsequent chunks accumulate delta_text into the buffer. When the stored literal becomes visible, the buffer is split:

  • For </think>, the literal is consumed (downstream content begins after it).
  • For <|tool_calls_section_begin|>, the literal is preserved in the cleaned text — KimiK2ToolParser uses it to identify the section boundary, matching the immediate-text path in KimiK2ReasoningParser.extract_reasoning_streaming which keeps the literal via content = delta_text[tool_index:].

The tool-section branch is opt-in via getattr so it stays a no-op for parsers that don't expose those attributes (DeepSeek-R1, Step3p5, Ernie45, etc. all still benefit from the </think> half).

Tests (tests/reasoning/test_kimi_k2_reasoning_parser.py, 5 new tests)

Test Documents
test_delegating_parser_strips_buffered_end_token Primary scenario: buffered </think> flush over multiple chunks doesn't leak as content
test_delegating_parser_skip_disarmed_when_parser_splits_in_one_chunk Skip stays disarmed when </think> id and text arrive together (otherwise over-truncates)
test_delegating_parser_boundary_at_end_of_delta_text Skip stays disarmed when </think> lands at end of delta_text with no content after
test_delegating_parser_handles_buffered_tool_section_end Implicit reasoning end via <|tool_calls_section_begin|>: skip waits for the section literal (not </think>), preserves it in cleaned text, routes in-section payload to tool parser
test_delegating_parser_recovers_partial_end_token_in_boundary_chunk Boundary chunk's delta_text can end mid-literal; buffer must be seeded with boundary delta_text

All five rely on a small _drive helper, a Chunk dataclass, and a wrapped_kimi_parser fixture that composes KimiK2ReasoningParser + KimiK2ToolParser via _WrappedParser.

Test commands run + results

Per AGENTS.md:

uv venv --python 3.12
uv pip install …  # transformers, pytest, torch, openai_harmony, etc.
PYTHONPATH=. .venv/bin/python -m pytest \
    --confcutdir=tests/reasoning -p no:cacheprovider \
    tests/reasoning/test_kimi_k2_reasoning_parser.py -v
# → 15 passed in 6.07s (10 pre-existing + 5 new from this PR)

# Regression-coverage proof: revert the abstract_parser.py change to PR
# #41068's HEAD, rerun the new tests:
git checkout jasonkeyil/fix/kimi-k2-streaming-end-token-buffer -- vllm/parser/abstract_parser.py
PYTHONPATH=. .venv/bin/python -m pytest \
    --confcutdir=tests/reasoning -p no:cacheprovider \
    tests/reasoning/test_kimi_k2_reasoning_parser.py -k delegating
# → 5 failed (each with AssertionError on the literal leaking into content
#   or the post-boundary content being suppressed)

pre-commit run --files vllm/parser/abstract_parser.py \
    tests/reasoning/test_kimi_k2_reasoning_parser.py
# → ruff-check, ruff-format, typos, mypy, SPDX headers all pass

Production rollout

Kimi-K2.5-INT4 P/D disagg, vLLM v0.20.0 + PR vllm-project#41068's parser fix + this commit, applied as a runtime overlay:

Path Pre-fix (PR vllm-project#41068 alone) Post-fix (PR vllm-project#41068 + this PR)
Streaming, direct vLLM port-forward 8/8 leaks 0/10 leaks
Streaming, via service mesh 5/15 leaks 0/15 leaks
Non-streaming 0/N (structurally unaffected) 0/10 leaks

Output integrity hand-checked post-fix: reasoning ends cleanly at the natural stop point, content begins immediately with the model's actual answer (no leading </think>, no missing chars, no gap deltas).

Duplicate-work check

gh pr list --repo vllm-project/vllm --search '41067 in:body' --state open
# Returns only PR #41068. No open PR addresses the DelegatingParser-level
# leak this PR fixes — verified empirically by the 8/8 reproduction
# under PR #41068's parser change applied alone.

AI assistance disclosure

Per AGENTS.md: this work is AI-assisted. The change was authored end-to-end by Claude under human direction; the human submitter has reviewed every changed line and confirms the analysis, the diff, and the test results above.

What this PR does NOT change

  • Non-streaming path: structurally unaffected; still uses extract_reasoning() on the full string.
  • DelegatingParser.parse_delta token-id handling: extract_content_ids already strips the end-token id from current_token_ids on handoff. Only the text path needed fixing.
  • Behavior when no reasoning parser or no reasoning_end_str is configured: skip stays disarmed (no-op).

@marcostephan marcostephan force-pushed the fix/kimi-reasoning-parser-streaming-buffered-end-token branch from 6ac2610 to ef9d764 Compare April 29, 2026 14:29
Keyi Li and others added 2 commits April 29, 2026 18:51
… in streaming

When stop sequences set output_text_buffer_length > 0, token IDs arrive in
delta_token_ids before their text is flushed into delta_text. Without a guard,
find() returns -1 and the reasoning/content split is silently corrupted.

Add text-presence checks before both find() calls in extract_reasoning_streaming:
- </think> end token path (line 215)
- <|tool_calls_section_begin|> section start path (line 223)

Return None (wait for flush) when the token ID is present but the text is not,
matching the fix pattern from PR vllm-project#39044 (BaseThinkingReasoningParser / DeepSeekR1)
and PR vllm-project#40352 (Step3ReasoningParser).

Fixes vllm-project#41067

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Keyi Li <likey6688@gmail.com>
@marcostephan marcostephan force-pushed the fix/kimi-reasoning-parser-streaming-buffered-end-token branch from ef9d764 to afaade6 Compare April 30, 2026 07:16
@marcostephan marcostephan changed the title [Bugfix] kimi_k2: stop </think> leak when output_text_buffer is active [Bugfix] DelegatingParser: strip buffered end-token text from phase B (stacks on #41068) Apr 30, 2026
Stacks on top of vllm-project#41068. PR vllm-project#41068 fixes the parser-side off-by-one
when the </think> token id arrives ahead of its rendered text in
delta_text. A second leak survives downstream: after the parser
returns None on the boundary chunk, DelegatingParser.parse_delta sets
state.reasoning_ended = True and routes the next chunks through the
tool branch — which has no awareness of the still-buffered end-token
text and emits it verbatim as delta.content over the next 1..N chunks.

Empirically: 8/8 streamed responses leaked the literal '</think>' into
delta.content on a real Kimi-K2.5 deployment with PR vllm-project#41068's parser
change applied alone. Per-delta capture from a real request:

  delta[N]:   reasoning='T'         (last reasoning delta; parser returns None)
  delta[N+1]: content=' '
  delta[N+2]: content='</thin'      <- buffered </think> flushing
  delta[N+3]: content='k> '
  delta[N+4]: content='I will...'

Add four StreamState fields (end_token_skip_buffer,
end_token_skip_armed, end_token_skip_done, end_token_skip_literal)
plus a small post-boundary skip block in DelegatingParser.parse_delta
and a helper _pending_end_token_literal that returns whichever
terminator's text is still buffered.

On boundary handoff:

* The helper checks both </think> and (for parsers that expose it)
  <|tool_calls_section_begin|>. It returns the literal whose token id
  is in delta_token_ids but whose text has not yet flushed into
  delta_text, or None when the boundary text was fully visible.
* If the helper returns a literal, the skip arms with that literal
  stored in state and the boundary chunk's delta_text seeded into the
  buffer (so a partial prefix of the literal that already flushed
  isn't lost).
* If None, the skip stays disarmed (no buffered tail follows).

While armed, subsequent chunks accumulate delta_text into the buffer.
When the stored literal becomes visible, the buffer is split:

* For </think>, the literal is consumed (downstream content begins
  after it).
* For <|tool_calls_section_begin|>, the literal is preserved in the
  cleaned text — KimiK2ToolParser uses it to identify the section
  boundary, matching the immediate-text path in
  KimiK2ReasoningParser.extract_reasoning_streaming where
  'content = delta_text[tool_index:]' keeps the literal.

The tool-section branch is opt-in via getattr so it stays a no-op for
parsers that don't expose those attributes (DeepSeek-R1, Step3p5,
Ernie45, etc. all still benefit from the </think> half).

Tests added in tests/reasoning/test_kimi_k2_reasoning_parser.py:

* test_delegating_parser_strips_buffered_end_token — primary scenario
  exercised end-to-end through DelegatingParser.parse_delta.
* test_delegating_parser_skip_disarmed_when_parser_splits_in_one_chunk
  — when </think> id and text arrive together, the skip stays
  disarmed; otherwise it would over-truncate post-boundary content.
* test_delegating_parser_boundary_at_end_of_delta_text — when </think>
  lands at the very end of delta_text with no content after, the skip
  must still stay disarmed (buffered tail won't appear).
* test_delegating_parser_handles_buffered_tool_section_end — when
  reasoning ends implicitly via <|tool_calls_section_begin|> under
  buffering, the skip waits for the section literal (not </think>),
  preserves it in cleaned text, and routes the in-section payload to
  the tool parser instead of leaking it as content.
* test_delegating_parser_recovers_partial_end_token_in_boundary_chunk
  — the boundary chunk's delta_text can end mid-literal (e.g. '</th'
  on the boundary chunk, 'ink>' on the next). The buffer must be
  seeded with the boundary chunk's delta_text so the partial prefix
  isn't lost.

All five rely on a small _drive helper, a Chunk dataclass, and a
wrapped_kimi_parser fixture that composes KimiK2ReasoningParser +
KimiK2ToolParser via _WrappedParser.

Verification:

  uv venv --python 3.12 && uv pip install …
  PYTHONPATH=. .venv/bin/python -m pytest \
      --confcutdir=tests/reasoning -p no:cacheprovider \
      tests/reasoning/test_kimi_k2_reasoning_parser.py -v
  # → 15 passed in 6.07s (10 pre-existing + 5 new from this commit)

  # Reverted vllm/parser/abstract_parser.py to PR vllm-project#41068's HEAD; reran
  # the new tests:
  # → 5 failed (each on the literal leaking into content or the
  #   post-boundary content being suppressed)

  pre-commit run --files vllm/parser/abstract_parser.py \
      tests/reasoning/test_kimi_k2_reasoning_parser.py
  # → ruff-check, ruff-format, typos, mypy, SPDX headers all pass

Production rollout: Kimi-K2.5-INT4 P/D disagg deployment, vLLM v0.20.0
+ PR vllm-project#41068's parser fix + this commit applied as a runtime overlay.
8/8 leaks pre-fix → 0/25 leaks post-fix (10 direct vLLM port-forward,
15 via service mesh) on the previously-100%-leaking prompt and stop
config. Output integrity hand-checked: reasoning ends cleanly at the
natural stop point, content begins immediately with the model's
actual answer.

Duplicate-work check (gh pr list --search '41067 in:body' --state open)
returns only PR vllm-project#41068, which patches the parser-side guard. No open
PR addresses the DelegatingParser-level leak this commit fixes.

This work is AI-assisted. The change was authored end-to-end by
Claude under human direction; the human submitter has reviewed every
changed line and confirms the analysis, the diff, and the test
results above.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Marco Stephan <marco@magic.dev>
@marcostephan marcostephan force-pushed the fix/kimi-reasoning-parser-streaming-buffered-end-token branch from de0b079 to dd18796 Compare April 30, 2026 10:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants