Commit d257bf1
[None][fix] Guard transceiver session short-circuit against ADP+PP consensus deadlock
The sticky-role short-circuits in KvCacheTransceiverV2.check_context_transfer_status
and check_gen_transfer_status (sticky markers from #14042) returned early when a
transceiver had never opened a send/recv session, to skip the per-iteration consensus
allgather on pure-role transceivers.
Under pipeline parallelism (and ADP request sharding) the per-rank
_ever_had_send_session / _ever_had_recv_session markers flip asymmetrically across
PP stages, so some ranks short-circuited and skipped the _ctx_consensus pp_allgather
barrier while peers entered it, deadlocking the collective. This was observed as a
hang in tests/unittest/disaggregated/test_cache_transceiver_single_process.py
::test_cache_transceiver[*-tp4_pp2_dp_both] (8-rank ADP+PP2).
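The failure mode described above can be illustrated with a small, self-contained sketch. It is not the transceiver code: `ranks_entering_collective` and `would_deadlock` are hypothetical helpers, and the list of booleans stands in for the per-rank `_ever_had_send_session` / `_ever_had_recv_session` markers. The assumption is that a rank whose marker is unset returns early and never joins the collective, so the collective hangs whenever only some of the ranks join.

```python
from typing import List


def ranks_entering_collective(ever_had_session: List[bool]) -> List[int]:
    """Return the ranks that do NOT short-circuit and therefore join the collective.

    Hypothetical model: a rank with an unset marker returns early.
    """
    return [rank for rank, flag in enumerate(ever_had_session) if flag]


def would_deadlock(ever_had_session: List[bool]) -> bool:
    """A collective hangs when a non-empty, proper subset of ranks joins it."""
    entering = ranks_entering_collective(ever_had_session)
    return 0 < len(entering) < len(ever_had_session)


# Markers that flipped on one PP stage but not the other: only rank 0 joins.
asymmetric = [True, False]
# Markers that agree on every rank: all ranks join, or all ranks skip.
all_set = [True, True]
all_unset = [False, False]
```

In this model only the asymmetric case is hazardous; when every rank agrees, skipping the collective is safe, which is what the original optimization relied on.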
Guard the short-circuit with `not self._ctx_need_pp_sync` (that is, pp_size == 1) so that it
fires only when there is no cross-stage consensus barrier. This preserves the pure-role
tp_allgather-skip optimization for PP=1 and runs the full symmetric consensus
whenever pipeline parallelism is enabled.
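A minimal sketch of the guarded condition, under stated assumptions: `should_short_circuit` and `ranks_skipping_consensus` are hypothetical names, and `need_pp_sync` is assumed to be equivalent to `pp_size > 1`, as the parenthetical in the message suggests. The real methods in KvCacheTransceiverV2 are not reproduced here.

```python
from typing import List


def should_short_circuit(ever_had_session: bool, need_pp_sync: bool) -> bool:
    """Skip the consensus only if this rank never opened a session AND
    there is no cross-stage (PP) barrier that peers might be waiting in."""
    return (not ever_had_session) and (not need_pp_sync)


def ranks_skipping_consensus(ever_had_session: List[bool], pp_size: int) -> List[int]:
    """Return the ranks that would return early for the given per-rank markers."""
    need_pp_sync = pp_size > 1  # assumption: mirrors _ctx_need_pp_sync
    return [
        rank
        for rank, flag in enumerate(ever_had_session)
        if should_short_circuit(flag, need_pp_sync)
    ]


# With PP enabled, no rank returns early, however the markers differ.
skipped_pp2 = ranks_skipping_consensus([True, False, False, True], pp_size=2)
# With PP=1 and no session ever opened, every rank still takes the fast path.
skipped_pp1 = ranks_skipping_consensus([False, False], pp_size=1)
```

The design choice is to trade the fast path for safety only where the markers can diverge: with pipeline parallelism on, every rank pays for the consensus, so no rank can be left waiting in the barrier alone.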
Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
1 parent bf923a1, commit d257bf1
1 file changed
Lines changed: 11 additions & 4 deletions
| Hunk | Original file lines | New file lines | Removed (original lines) | Added (new lines) |
|---|---|---|---|---|
| 1 | 465-472 | 465-476 | 468-469 | 468-473 |
| 2 | 521-528 | 525-535 | 524-525 | 528-532 |