fix(disagg): unstuck decode aborts under prealloc pressure #25561

Merged: ShangmingCai merged 3 commits into sgl-project:main from whybeyoung:fix_abort_multi on May 18, 2026

Conversation

@whybeyoung
Collaborator

@whybeyoung whybeyoung commented May 18, 2026

Fix three abort-handling bugs that caused aborted decode requests to linger until WAITING_TIMEOUT instead of being released immediately.

  1. decode.py _update_handshake_waiters: skip the early-return when any receiver has been flipped to KVPoll.Failed (e.g. by an abort), so aborted reqs are not held until transfer begins.

  2. scheduler.py abort_request (DECODE): in addition to calling kv_receiver.abort(), mark req.finished_reason = FINISH_ABORT for reqs in prealloc/transfer queues, so pop_preallocated / pop_transferred actually drop them.

  3. tokenizer_manager.py abort_request: always forward to the scheduler when tokenizer_worker_num > 1 (the local rid_to_state is per-worker and load balancing may route abort to a non-owner). Add a guard against empty rid being treated as a startswith-prefix match for every request.

CC @ShangmingCai
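
The empty-rid guard in item 3 follows from a Python string property: `s.startswith("")` is true for every string, so a prefix-based abort with an empty rid would select every tracked request. A minimal sketch of that failure mode, assuming a hypothetical `match_rids` helper and a plain dict standing in for `rid_to_state` (neither is the actual tokenizer_manager.py code):

```python
from typing import Dict, List


def match_rids(rid_to_state: Dict[str, object], rid: str) -> List[str]:
    """Return the tracked rids that an abort for `rid` would select.

    Hypothetical helper for illustration; prefix matching is assumed
    from the PR description, not copied from tokenizer_manager.py.
    """
    if not rid:
        # Guard: an empty prefix would otherwise match every request.
        return []
    return [r for r in rid_to_state if r.startswith(rid)]


def match_rids_unguarded(rid_to_state: Dict[str, object], rid: str) -> List[str]:
    """Same matching without the guard, to show the failure mode."""
    return [r for r in rid_to_state if r.startswith(rid)]


states = {"req-1": object(), "req-2": object(), "other": object()}
print(match_rids_unguarded(states, ""))  # every rid matches the empty prefix
print(match_rids(states, ""))            # guarded: nothing is selected
print(match_rids(states, "req-"))        # ordinary prefix match
```

The guard returns early rather than raising, on the assumption that an empty rid should be a no-op and not an error; the real code may treat it differently.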


CI States

Latest PR Test (Base): Run #26017657590
Latest PR Test (Extra): ⚠️ Not enabled — add run-ci-extra label to opt in.

Fix three abort-handling bugs that caused aborted decode requests to
linger until WAITING_TIMEOUT (~15min) instead of being released
immediately.

1. decode.py _update_handshake_waiters: skip the early-return when any
   receiver has been flipped to KVPoll.Failed (e.g. by an abort), so
   aborted reqs are not held until transfer begins.

2. scheduler.py abort_request (DECODE): in addition to calling
   kv_receiver.abort(), mark req.finished_reason = FINISH_ABORT for
   reqs in prealloc/transfer queues, so pop_preallocated /
   pop_transferred actually drop them.

3. tokenizer_manager.py abort_request: always forward to the scheduler
   when tokenizer_worker_num > 1 (the local rid_to_state is per-worker
   and load balancing may route abort to a non-owner). Add a guard
   against empty rid being treated as a startswith-prefix match for
   every request.
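
The early-return change in item 1 can be pictured with a toy model. This is a sketch under stated assumptions, not the decode.py implementation: `Poll` is a stand-in for `KVPoll`, `update_waiters` and its `skip_on_failed` flag are hypothetical, and the early-return condition ("nothing is ready yet") is assumed, since the PR text does not spell it out.

```python
from enum import Enum, auto
from typing import List, Tuple


class Poll(Enum):
    """Toy stand-in for KVPoll; only the states this sketch needs."""
    WAITING = auto()
    READY = auto()
    FAILED = auto()


def update_waiters(polls: List[Poll], skip_on_failed: bool) -> Tuple[List[int], List[int]]:
    """Return (indices to advance, indices to release).

    Assumed early-return condition: nothing is READY yet. With
    skip_on_failed=True, a FAILED entry bypasses that early return.
    """
    any_ready = any(p is Poll.READY for p in polls)
    any_failed = any(p is Poll.FAILED for p in polls)
    if not any_ready and not (skip_on_failed and any_failed):
        return [], []  # early return: nothing is processed this tick
    advance = [i for i, p in enumerate(polls) if p is Poll.READY]
    release = [i for i, p in enumerate(polls) if p is Poll.FAILED]
    return advance, release


polls = [Poll.WAITING, Poll.FAILED, Poll.WAITING]
print(update_waiters(polls, skip_on_failed=False))  # aborted entry stays queued
print(update_waiters(polls, skip_on_failed=True))   # aborted entry is released
```

With the flag off, the failed entry is never visited while its neighbours are still waiting, which is the "held until transfer begins" behaviour the description mentions.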

@whybeyoung
Collaborator Author

whybeyoung commented May 18, 2026

/tag-and-rerun-ci

Comment thread python/sglang/srt/managers/tokenizer_manager.py Outdated
Comment on lines +3663 to +3664
if not isinstance(decode_req.req.finished_reason, FINISH_ABORT):
    decode_req.req.finished_reason = FINISH_ABORT()
Collaborator

nit: could be redundant since we have prepare_abort in PD module
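
For context on the lines under review, here is a toy sketch of why marking `finished_reason` lets a queue drop an aborted request, and why the `isinstance` check makes a repeated abort harmless. `FinishAbort`, `Req`, `mark_aborted`, and `pop_ready` are hypothetical stand-ins; whether `pop_preallocated` / `pop_transferred` filter exactly this way is an assumption drawn from the PR description.

```python
from dataclasses import dataclass
from typing import List, Optional


class FinishAbort:
    """Toy stand-in for FINISH_ABORT."""


@dataclass
class Req:
    rid: str
    finished_reason: Optional[FinishAbort] = None


def mark_aborted(queue: List[Req], rid: str) -> None:
    """Mark every queued request with this rid as aborted (idempotent)."""
    for req in queue:
        if req.rid == rid and not isinstance(req.finished_reason, FinishAbort):
            req.finished_reason = FinishAbort()


def pop_ready(queue: List[Req]) -> List[Req]:
    """Drop aborted entries in place and return the surviving requests."""
    survivors = [r for r in queue if not isinstance(r.finished_reason, FinishAbort)]
    queue[:] = survivors
    return survivors


queue = [Req("a"), Req("b")]
mark_aborted(queue, "a")
first = queue[0].finished_reason
mark_aborted(queue, "a")  # second abort leaves the existing marker untouched
print(queue[0].finished_reason is first)
print([r.rid for r in pop_ready(queue)])
```

If another code path already sets the marker, as the nit suggests, the guard simply makes the extra assignment a no-op.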

Collaborator

@ShangmingCai ShangmingCai left a comment

LGTM

@sgl-project sgl-project deleted a comment from github-actions Bot May 18, 2026
@ShangmingCai
Collaborator

CI has passed.

@ShangmingCai ShangmingCai merged commit d1acd62 into sgl-project:main May 18, 2026
187 of 199 checks passed
Shunkangz pushed a commit to Shunkangz/sglang that referenced this pull request May 27, 2026

2 participants