Commit d257bf1

[None][fix] Guard transceiver session short-circuit against ADP+PP consensus deadlock
The sticky-role short-circuits in KvCacheTransceiverV2.check_context_transfer_status and check_gen_transfer_status (sticky markers from #14042) returned early when a transceiver had never opened a send/recv session, to skip the per-iter consensus allgather on pure-role transceivers.

Under pipeline parallelism (and ADP request sharding) the per-rank _ever_had_send_session / _ever_had_recv_session markers flip asymmetrically across PP stages, so some ranks short-circuited and skipped the _ctx_consensus pp_allgather barrier while peers entered it, deadlocking the collective. This was observed as a hang in tests/unittest/disaggregated/test_cache_transceiver_single_process.py::test_cache_transceiver[*-tp4_pp2_dp_both] (8-rank ADP+PP2).

Guard the short-circuit with `not self._ctx_need_pp_sync` (pp_size == 1) so it only fires when there is no cross-stage consensus barrier: preserves the pure-role tp_allgather-skip optimization for PP=1 while doing the full symmetric consensus whenever pipeline parallelism is enabled.

Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
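The failure mode described in the commit message can be reproduced in miniature without any GPU or distributed runtime. The following is a minimal sketch, not the TensorRT-LLM code: `run_step` and its `guarded` flag are hypothetical names, each thread stands in for one PP rank, and a `threading.Barrier` with a timeout stands in for the `pp_allgather` collective. A rank that waits at the barrier while a peer returns early is reported as "hung".

```python
import threading


def run_step(ever_had_session, need_pp_sync, guarded, timeout=0.5):
    """Simulate one status check on every rank and return each rank's outcome.

    ever_had_session: per-rank sticky marker (hypothetical stand-in for
        _ever_had_send_session / _ever_had_recv_session).
    need_pp_sync: True when a cross-stage consensus barrier exists (PP > 1).
    guarded: False models the old early return, True models the guarded one.
    """
    num_ranks = len(ever_had_session)
    barrier = threading.Barrier(num_ranks)  # stand-in for the collective
    outcome = [None] * num_ranks

    def rank(i):
        skip = not ever_had_session[i]
        if guarded:
            # Only short-circuit when no cross-stage barrier is required.
            skip = skip and not need_pp_sync
        if skip:
            outcome[i] = "skipped"
            return
        try:
            barrier.wait(timeout=timeout)
            outcome[i] = "synced"
        except threading.BrokenBarrierError:
            # A peer never arrived; a real collective would block forever.
            outcome[i] = "hung"

    threads = [threading.Thread(target=rank, args=(i,)) for i in range(num_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcome


if __name__ == "__main__":
    markers = [True, False]  # markers differ across two PP stages
    print("unguarded:", run_step(markers, need_pp_sync=True, guarded=False))
    print("guarded:  ", run_step(markers, need_pp_sync=True, guarded=True))
```

In this toy model the unguarded variant leaves the first rank waiting on a barrier its peer never enters, while the guarded variant makes every rank enter the barrier whenever `need_pp_sync` is set.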
1 parent bf923a1 commit d257bf1

1 file changed

Lines changed: 11 additions & 4 deletions

tensorrt_llm/_torch/disaggregation/transceiver.py

@@ -465,8 +465,12 @@ def request_and_receive_async(self, req: LlmRequest):
     def check_context_transfer_status(
         self, at_least_request_num: Optional[int], mark_complete: bool = False
     ):
-        # Skip the tp_allgather in _ctx_consensus when this transceiver never sends (pure GEN role).
-        if not self._ever_had_send_session:
+        # Skip the consensus collectives when this transceiver never sends (pure GEN role).
+        # Guarded with pp_size==1 (not _ctx_need_pp_sync): under pipeline parallelism the
+        # per-rank send marker flips asymmetrically across PP stages, so short-circuiting here
+        # would let some ranks skip the pp_allgather barrier while peers enter it -> deadlock
+        # (e.g. ADP+PP tp4_pp2_dp_both). With PP=1 there is no cross-stage consensus barrier.
+        if not self._ever_had_send_session and not self._ctx_need_pp_sync:
             return [], []
         block_all = at_least_request_num is None
         wait_num = at_least_request_num if not block_all else 0
@@ -521,8 +525,11 @@ def check_context_transfer_status(
         return completed, failed
 
     def check_gen_transfer_status(self, at_least_request_num: Optional[int]):
-        # Skip the allgather in _gen_consensus when this transceiver never receives (pure CTX role).
-        if not self._ever_had_recv_session:
+        # Skip the consensus collectives when this transceiver never receives (pure CTX role).
+        # Guarded with pp_size==1 (not _ctx_need_pp_sync): see check_context_transfer_status --
+        # under PP the per-rank recv marker flips asymmetrically across stages, so an early
+        # return would desync the consensus barrier; only short-circuit when PP is absent.
+        if not self._ever_had_recv_session and not self._ctx_need_pp_sync:
             return [], [], []
         block_all = at_least_request_num is None
         wait_num = at_least_request_num if not block_all else 0
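The invariant the guard restores is that every PP stage reaches the same "enter the collective or return early" decision whenever a cross-stage barrier exists. A minimal sketch of that property check follows; the function names are hypothetical and the two predicates only mirror the old and new `if` conditions from the diff, not the surrounding transceiver logic.

```python
from itertools import product


def legacy_short_circuit(ever_had_session, need_pp_sync):
    # Old condition: return early based on the per-rank marker alone.
    return not ever_had_session


def guarded_short_circuit(ever_had_session, need_pp_sync):
    # New condition: return early only when no cross-stage barrier exists.
    return not ever_had_session and not need_pp_sync


def find_divergent(predicate, need_pp_sync, num_stages=2):
    """List marker assignments for which the stages disagree on returning early."""
    divergent = []
    for markers in product([False, True], repeat=num_stages):
        decisions = {predicate(m, need_pp_sync) for m in markers}
        if len(decisions) > 1:
            divergent.append(markers)
    return divergent


if __name__ == "__main__":
    print("legacy, PP>1: ", find_divergent(legacy_short_circuit, need_pp_sync=True))
    print("guarded, PP>1:", find_divergent(guarded_short_circuit, need_pp_sync=True))
```

Under these assumptions the legacy predicate diverges for every mixed marker assignment, whereas the guarded predicate never returns early when `need_pp_sync` is true, so no assignment can split the stages.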
