Commit 3569b0a
fix vllm-disagg deadlock: stop router after rank 0 container exits
The vllm-router runs as a separate container on node 0. After node 0's
main container finishes the benchmark and exits, decode nodes remain
stuck waiting for the router port to close. The router cleanup in
job.slurm can't run until srun completes, but srun can't complete
because decode nodes are blocked — deadlock.
Fix: skip exec on rank 0 for vllm-disagg so the srun bash script
continues after docker exits and can stop the router container,
allowing decode nodes to detect the port closure and exit.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
1 parent 51c92a7 commit 3569b0a
1 file changed
Lines changed: 14 additions & 5 deletions
@@ -427,7 +427,16 @@ (1 line removed, 10 lines added; line contents not preserved)
@@ -468,11 +477,11 @@ (4 lines removed, 4 lines added; line contents not preserved)
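The fix described in the commit message can be illustrated with a minimal bash sketch. It is not the code from this commit: the function names (`should_exec`, `run_node`, `stop_router`), the `DOCKER` and `ROUTER_CONTAINER` variables, and the container name `vllm-router` are all hypothetical, chosen only to show why skipping `exec` on rank 0 lets the script reach the router cleanup.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: all names here are assumptions, not taken from job.slurm.

# Return 0 if the script may `exec` the container command, 1 otherwise.
# `exec` replaces the bash process, so nothing after it would ever run.
should_exec() {
    local mode="$1" rank="$2"
    if [ "$mode" = "vllm-disagg" ] && [ "$rank" -eq 0 ]; then
        return 1   # rank 0 must keep bash alive to stop the router afterwards
    fi
    return 0
}

# Stop the router container (container name is an assumption).
stop_router() {
    "${DOCKER:-docker}" stop "${ROUTER_CONTAINER:-vllm-router}" || true
}

# Run the main container command for this node.
# Usage: run_node <mode> <rank> <command...>
run_node() {
    local mode="$1" rank="$2" rc=0
    shift 2
    if should_exec "$mode" "$rank"; then
        exec "$@"          # other ranks: bash is replaced by the command
    fi
    "$@" || rc=$?          # rank 0: run as a child so control returns here
    stop_router            # closing the router port unblocks decode nodes
    return "$rc"
}
```

On rank 0 a call might look like `run_node vllm-disagg 0 docker run ...`, with the actual container arguments elided here. Because the command runs as a child process rather than through `exec`, the script regains control when the container exits and can stop the router before srun returns.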