[Bugfix] DelegatingParser: strip buffered end-token text from phase B (stacks on #41068) by marcostephan · Pull Request #1 · marcostephan/vllm

marcostephan · 2026-04-29T14:06:57Z

TL;DR

Stacks on top of vllm-project/vllm#41068. PR vllm-project#41068 fixes the parser-side off-by-one when the </think> token id arrives ahead of its rendered text. A second leak survives downstream of that fix and this PR addresses it.

After PR vllm-project#41068's parser returns None on the boundary chunk, DelegatingParser.parse_delta sets state.reasoning_ended = True and routes the next chunks through the tool branch, which has no awareness of the still-buffered end-token text and emits it verbatim as delta.content over the next 1..N chunks. Empirically: 8/8 streamed responses leaked the literal </think> into delta.content on a real Kimi-K2.5 deployment with PR vllm-project#41068's parser change applied alone.

Branch structure

1f08d3c (PR #41068 HEAD: Merge main into fix/kimi-k2-streaming-end-token-buffer)
└── dd18796 (this PR: DelegatingParser post-boundary skip)

git diff jasonkeyil/fix/kimi-k2-streaming-end-token-buffer → +330 lines in 2 files (vllm/parser/abstract_parser.py, tests/reasoning/test_kimi_k2_reasoning_parser.py), zero overlap with vllm-project#41068's hunks.

Reproduction (`@sha-292411e`, with PR vllm-project#41068 applied alone)

delta[N]:   reasoning='T'         (last reasoning delta — N+1's boundary chunk has end-token id, parser returns None)
delta[N+1]: content=' '
delta[N+2]: content='</thin'      ← buffered </think> flushing through the tool branch
delta[N+3]: content='k> '
delta[N+4]: content='I will...'   (real content begins)

Full assembled response: content=' </think> I will deconstruct the word "reasoning"...' — literal </think> clearly visible to the client.

Fix

Four new StreamState fields (end_token_skip_buffer, end_token_skip_armed, end_token_skip_done, end_token_skip_literal) plus a small post-boundary skip block in DelegatingParser.parse_delta and a helper _pending_end_token_literal.

On boundary handoff:

The helper checks both </think> and (for parsers that expose it) <|tool_calls_section_begin|>. It returns the literal whose token id is in delta_token_ids but whose text has not yet flushed into delta_text, or None when the boundary text was fully visible.
If the helper returns a literal, the skip arms with that literal stored in state and the boundary chunk's delta_text seeded into the buffer (so a partial prefix of the literal that already flushed isn't lost).
If None, the skip stays disarmed (no buffered tail follows; the parser already split everything in this delta).

While armed, subsequent chunks accumulate delta_text into the buffer. When the stored literal becomes visible, the buffer is split:

For </think>, the literal is consumed (downstream content begins after it).
For <|tool_calls_section_begin|>, the literal is preserved in the cleaned text — KimiK2ToolParser uses it to identify the section boundary, matching the immediate-text path in KimiK2ReasoningParser.extract_reasoning_streaming which keeps the literal via content = delta_text[tool_index:].

The tool-section branch is opt-in via getattr so it stays a no-op for parsers that don't expose those attributes (DeepSeek-R1, Step3p5, Ernie45, etc. all still benefit from the </think> half).

Tests (`tests/reasoning/test_kimi_k2_reasoning_parser.py`, 5 new tests)

Test	Documents
`test_delegating_parser_strips_buffered_end_token`	Primary scenario: buffered `</think>` flush over multiple chunks doesn't leak as content
`test_delegating_parser_skip_disarmed_when_parser_splits_in_one_chunk`	Skip stays disarmed when `</think>` id and text arrive together (otherwise over-truncates)
`test_delegating_parser_boundary_at_end_of_delta_text`	Skip stays disarmed when `</think>` lands at end of `delta_text` with no content after
`test_delegating_parser_handles_buffered_tool_section_end`	Implicit reasoning end via `<\|tool_calls_section_begin\|>`: skip waits for the section literal (not `</think>`), preserves it in cleaned text, routes in-section payload to tool parser
`test_delegating_parser_recovers_partial_end_token_in_boundary_chunk`	Boundary chunk's `delta_text` can end mid-literal; buffer must be seeded with boundary delta_text

All five rely on a small _drive helper, a Chunk dataclass, and a wrapped_kimi_parser fixture that composes KimiK2ReasoningParser + KimiK2ToolParser via _WrappedParser.

Test commands run + results

Per AGENTS.md:

uv venv --python 3.12
uv pip install …  # transformers, pytest, torch, openai_harmony, etc.
PYTHONPATH=. .venv/bin/python -m pytest \
    --confcutdir=tests/reasoning -p no:cacheprovider \
    tests/reasoning/test_kimi_k2_reasoning_parser.py -v
# → 15 passed in 6.07s (10 pre-existing + 5 new from this PR)

# Regression-coverage proof: revert the abstract_parser.py change to PR
# #41068's HEAD, rerun the new tests:
git checkout jasonkeyil/fix/kimi-k2-streaming-end-token-buffer -- vllm/parser/abstract_parser.py
PYTHONPATH=. .venv/bin/python -m pytest \
    --confcutdir=tests/reasoning -p no:cacheprovider \
    tests/reasoning/test_kimi_k2_reasoning_parser.py -k delegating
# → 5 failed (each with AssertionError on the literal leaking into content
#   or the post-boundary content being suppressed)

pre-commit run --files vllm/parser/abstract_parser.py \
    tests/reasoning/test_kimi_k2_reasoning_parser.py
# → ruff-check, ruff-format, typos, mypy, SPDX headers all pass

Production rollout

Kimi-K2.5-INT4 P/D disagg, vLLM v0.20.0 + PR vllm-project#41068's parser fix + this commit, applied as a runtime overlay:

Path	Pre-fix (PR vllm-project#41068 alone)	Post-fix (PR vllm-project#41068 + this PR)
Streaming, direct vLLM port-forward	8/8 leaks	0/10 leaks
Streaming, via service mesh	5/15 leaks	0/15 leaks
Non-streaming	0/N (structurally unaffected)	0/10 leaks

Output integrity hand-checked post-fix: reasoning ends cleanly at the natural stop point, content begins immediately with the model's actual answer (no leading </think>, no missing chars, no gap deltas).

Duplicate-work check

gh pr list --repo vllm-project/vllm --search '41067 in:body' --state open
# Returns only PR #41068. No open PR addresses the DelegatingParser-level
# leak this PR fixes — verified empirically by the 8/8 reproduction
# under PR #41068's parser change applied alone.

AI assistance disclosure

Per AGENTS.md: this work is AI-assisted. The change was authored end-to-end by Claude under human direction; the human submitter has reviewed every changed line and confirms the analysis, the diff, and the test results above.

What this PR does NOT change

Non-streaming path: structurally unaffected; still uses extract_reasoning() on the full string.
DelegatingParser.parse_delta token-id handling: extract_content_ids already strips the end-token id from current_token_ids on handoff. Only the text path needed fixing.
Behavior when no reasoning parser or no reasoning_end_str is configured: skip stays disarmed (no-op).

… in streaming When stop sequences set output_text_buffer_length > 0, token IDs arrive in delta_token_ids before their text is flushed into delta_text. Without a guard, find() returns -1 and the reasoning/content split is silently corrupted. Add text-presence checks before both find() calls in extract_reasoning_streaming: - </think> end token path (line 215) - <|tool_calls_section_begin|> section start path (line 223) Return None (wait for flush) when the token ID is present but the text is not, matching the fix pattern from PR vllm-project#39044 (BaseThinkingReasoningParser / DeepSeekR1) and PR vllm-project#40352 (Step3ReasoningParser). Fixes vllm-project#41067 Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Keyi Li <likey6688@gmail.com>

Stacks on top of vllm-project#41068. PR vllm-project#41068 fixes the parser-side off-by-one when the </think> token id arrives ahead of its rendered text in delta_text. A second leak survives downstream: after the parser returns None on the boundary chunk, DelegatingParser.parse_delta sets state.reasoning_ended = True and routes the next chunks through the tool branch — which has no awareness of the still-buffered end-token text and emits it verbatim as delta.content over the next 1..N chunks. Empirically: 8/8 streamed responses leaked the literal '</think>' into delta.content on a real Kimi-K2.5 deployment with PR vllm-project#41068's parser change applied alone. Per-delta capture from a real request: delta[N]: reasoning='T' (last reasoning delta; parser returns None) delta[N+1]: content=' ' delta[N+2]: content='</thin' <- buffered </think> flushing delta[N+3]: content='k> ' delta[N+4]: content='I will...' Add four StreamState fields (end_token_skip_buffer, end_token_skip_armed, end_token_skip_done, end_token_skip_literal) plus a small post-boundary skip block in DelegatingParser.parse_delta and a helper _pending_end_token_literal that returns whichever terminator's text is still buffered. On boundary handoff: * The helper checks both </think> and (for parsers that expose it) <|tool_calls_section_begin|>. It returns the literal whose token id is in delta_token_ids but whose text has not yet flushed into delta_text, or None when the boundary text was fully visible. * If the helper returns a literal, the skip arms with that literal stored in state and the boundary chunk's delta_text seeded into the buffer (so a partial prefix of the literal that already flushed isn't lost). * If None, the skip stays disarmed (no buffered tail follows). While armed, subsequent chunks accumulate delta_text into the buffer. When the stored literal becomes visible, the buffer is split: * For </think>, the literal is consumed (downstream content begins after it). * For <|tool_calls_section_begin|>, the literal is preserved in the cleaned text — KimiK2ToolParser uses it to identify the section boundary, matching the immediate-text path in KimiK2ReasoningParser.extract_reasoning_streaming where 'content = delta_text[tool_index:]' keeps the literal. The tool-section branch is opt-in via getattr so it stays a no-op for parsers that don't expose those attributes (DeepSeek-R1, Step3p5, Ernie45, etc. all still benefit from the </think> half). Tests added in tests/reasoning/test_kimi_k2_reasoning_parser.py: * test_delegating_parser_strips_buffered_end_token — primary scenario exercised end-to-end through DelegatingParser.parse_delta. * test_delegating_parser_skip_disarmed_when_parser_splits_in_one_chunk — when </think> id and text arrive together, the skip stays disarmed; otherwise it would over-truncate post-boundary content. * test_delegating_parser_boundary_at_end_of_delta_text — when </think> lands at the very end of delta_text with no content after, the skip must still stay disarmed (buffered tail won't appear). * test_delegating_parser_handles_buffered_tool_section_end — when reasoning ends implicitly via <|tool_calls_section_begin|> under buffering, the skip waits for the section literal (not </think>), preserves it in cleaned text, and routes the in-section payload to the tool parser instead of leaking it as content. * test_delegating_parser_recovers_partial_end_token_in_boundary_chunk — the boundary chunk's delta_text can end mid-literal (e.g. '</th' on the boundary chunk, 'ink>' on the next). The buffer must be seeded with the boundary chunk's delta_text so the partial prefix isn't lost. All five rely on a small _drive helper, a Chunk dataclass, and a wrapped_kimi_parser fixture that composes KimiK2ReasoningParser + KimiK2ToolParser via _WrappedParser. Verification: uv venv --python 3.12 && uv pip install … PYTHONPATH=. .venv/bin/python -m pytest \ --confcutdir=tests/reasoning -p no:cacheprovider \ tests/reasoning/test_kimi_k2_reasoning_parser.py -v # → 15 passed in 6.07s (10 pre-existing + 5 new from this commit) # Reverted vllm/parser/abstract_parser.py to PR vllm-project#41068's HEAD; reran # the new tests: # → 5 failed (each on the literal leaking into content or the # post-boundary content being suppressed) pre-commit run --files vllm/parser/abstract_parser.py \ tests/reasoning/test_kimi_k2_reasoning_parser.py # → ruff-check, ruff-format, typos, mypy, SPDX headers all pass Production rollout: Kimi-K2.5-INT4 P/D disagg deployment, vLLM v0.20.0 + PR vllm-project#41068's parser fix + this commit applied as a runtime overlay. 8/8 leaks pre-fix → 0/25 leaks post-fix (10 direct vLLM port-forward, 15 via service mesh) on the previously-100%-leaking prompt and stop config. Output integrity hand-checked: reasoning ends cleanly at the natural stop point, content begins immediately with the model's actual answer. Duplicate-work check (gh pr list --search '41067 in:body' --state open) returns only PR vllm-project#41068, which patches the parser-side guard. No open PR addresses the DelegatingParser-level leak this commit fixes. This work is AI-assisted. The change was authored end-to-end by Claude under human direction; the human submitter has reviewed every changed line and confirms the analysis, the diff, and the test results above. Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Marco Stephan <marco@magic.dev>

marcostephan force-pushed the fix/kimi-reasoning-parser-streaming-buffered-end-token branch from 6ac2610 to ef9d764 Compare April 29, 2026 14:29

Keyi Li and others added 2 commits April 29, 2026 18:51

Merge branch 'main' into fix/kimi-k2-streaming-end-token-buffer

1f08d3c

marcostephan force-pushed the fix/kimi-reasoning-parser-streaming-buffered-end-token branch from ef9d764 to afaade6 Compare April 30, 2026 07:16

marcostephan changed the title ~~[Bugfix] kimi_k2: stop </think> leak when output_text_buffer is active~~ [Bugfix] DelegatingParser: strip buffered end-token text from phase B (stacks on #41068) Apr 30, 2026

marcostephan mentioned this pull request Apr 30, 2026

[Bugfix] KimiK2ReasoningParser: guard against buffered end-token in streaming vllm-project/vllm#41068

Merged

marcostephan force-pushed the fix/kimi-reasoning-parser-streaming-buffered-end-token branch from de0b079 to dd18796 Compare April 30, 2026 10:16

marcostephan closed this Apr 30, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bugfix] DelegatingParser: strip buffered end-token text from phase B (stacks on #41068)#1

[Bugfix] DelegatingParser: strip buffered end-token text from phase B (stacks on #41068)#1
marcostephan wants to merge 3 commits into
mainfrom
fix/kimi-reasoning-parser-streaming-buffered-end-token

marcostephan commented Apr 29, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

marcostephan commented Apr 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

TL;DR

Branch structure

Reproduction (@sha-292411e, with PR vllm-project#41068 applied alone)

Fix

Tests (tests/reasoning/test_kimi_k2_reasoning_parser.py, 5 new tests)

Test commands run + results

Production rollout

Duplicate-work check

AI assistance disclosure

What this PR does NOT change

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

marcostephan commented Apr 29, 2026 •

edited

Loading

Reproduction (`@sha-292411e`, with PR vllm-project#41068 applied alone)

Tests (`tests/reasoning/test_kimi_k2_reasoning_parser.py`, 5 new tests)