Commit 3569b0a

ichbinblau and claude committed
fix vllm-disagg deadlock: stop router after rank 0 container exits
The vllm-router runs as a separate container on node 0. After node 0's main container finishes the benchmark and exits, decode nodes remain stuck waiting for the router port to close. The router cleanup in job.slurm can't run until srun completes, but srun can't complete because decode nodes are blocked: a deadlock.

Fix: skip exec on rank 0 for vllm-disagg so the srun bash script continues after docker exits and can stop the router container, allowing decode nodes to detect the port closure and exit.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
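
The effect of skipping `exec` can be seen in isolation with a small sketch. It is not the real job.slurm: `run_rank` is a hypothetical helper, and `sh -c "echo ..."` stands in for the `docker run` invocation. With `exec`, the shell is replaced by the child process, so nothing after that line runs; with an empty `MAYBE_EXEC`, the shell survives and can perform cleanup.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the per-rank launcher; names are illustrative only.
run_rank() {
    local rank="$1"
    bash -c '
        rank="$1"
        MAYBE_EXEC=exec
        if [[ "$rank" == "0" ]]; then
            MAYBE_EXEC=      # keep this shell alive so cleanup can run afterwards
        fi
        # "sh -c ..." stands in for "docker run ..."
        $MAYBE_EXEC sh -c "echo main container on rank $rank"
        # Reached only when exec was skipped.
        echo "cleanup on rank $rank"
    ' _ "$rank"
}

rank0_out="$(run_rank 0)"
rank1_out="$(run_rank 1)"
printf 'rank 0:\n%s\n' "$rank0_out"
printf 'rank 1:\n%s\n' "$rank1_out"
```

On rank 0 both lines are printed; on any other rank the cleanup line never appears, because `exec` replaced the shell before it could be reached.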
1 parent 51c92a7 commit 3569b0a

1 file changed

Lines changed: 14 additions & 5 deletions

benchmarks/multi_node/amd_utils/job.slurm

@@ -427,7 +427,16 @@ if [[ \"$ENGINE\" == \"vllm-disagg\" && \"$ROUTER_TYPE\" == \"vllm-router\" && \
 --log-level info 2>&1 | tee /run_logs/slurm_job-${SLURM_JOB_ID}/vllm_router_\$(hostname).log \"
 fi
 
-exec \$DOCKER_CMD run \
+# Skip exec on vllm-disagg rank 0 so we can stop the router after the main
+# container exits. Without this, decode nodes block forever waiting for the
+# router port to close (the router is a separate container).
+MAYBE_EXEC=exec
+if [[ \"$ENGINE\" == \"vllm-disagg\" && \"$ROUTER_TYPE\" == \"vllm-router\" && \"\$SLURM_PROCID\" == \"0\" ]]; then
+MAYBE_EXEC=
+set +e
+fi
+
+\$MAYBE_EXEC \$DOCKER_CMD run \
 --init \
 --stop-timeout 10 \
 --device /dev/dri \
@@ -468,11 +477,11 @@ exec \$DOCKER_CMD run \
 '"$RUN_FILE_FULL"' 2>&1 | tee /run_logs/slurm_job-'\"\$SLURM_JOB_ID\"'/server_\$(hostname).log
 '
 
+# Only reached when exec was skipped (vllm-disagg rank 0)
 DOCKER_EXIT_CODE=\$?
-if [[ \$DOCKER_EXIT_CODE -ne 0 ]]; then
-echo \"ERROR: docker exited rc=\$DOCKER_EXIT_CODE on \$(hostname)\"
-exit \$DOCKER_EXIT_CODE
-fi
+echo \"[rank 0] Main container exited (rc=\$DOCKER_EXIT_CODE). Stopping vllm-router...\"
+\$DOCKER_CMD rm -f \"$ROUTER_CONT_NAME\" 2>/dev/null || true
+exit \$DOCKER_EXIT_CODE
 "
 
 if [[ "${KEEP_CONTAINERS}" != "1" ]]; then
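
The second hunk captures the main container's exit code, runs a best-effort router cleanup, and then exits with the captured code. The following is a minimal sketch of that pattern under stated assumptions: `fake_main_container`, `stop_router`, and `run_with_cleanup` are hypothetical stand-ins for `docker run`, `docker rm -f`, and the rank 0 tail of the script, and the leading `set -e` is assumed rather than confirmed by the diff.

```shell
#!/usr/bin/env bash
set -e  # assumption: the surrounding script may run with errexit enabled

# Hypothetical stand-ins for "docker run ..." and "docker rm -f <router>".
fake_main_container() { return 3; }
stop_router() { echo "router stopped"; return 1; }  # may fail, e.g. already removed

run_with_cleanup() {
    set +e                            # do not abort when the main command fails
    fake_main_container
    local rc=$?                       # capture immediately, before any other command
    stop_router 2>/dev/null || true   # best-effort cleanup that never masks rc
    return "$rc"
}

cleanup_out="$(run_with_cleanup)" && final_rc=0 || final_rc=$?
echo "cleanup output: $cleanup_out"
echo "propagated rc: $final_rc"
```

The `|| true` keeps a failed cleanup from replacing the main command's status, so the caller still sees the original exit code.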
