[docs] Slurm: add guide for running Ray inside Docker containers#63221
…-run in Docker

When ray symmetric-run finishes, ray stop sends SIGTERM to each Ray process and waits for them via psutil.wait_procs. Reaping the resulting zombies is the parent's job. On a normal Linux host PID 1 is systemd, which reaps. Inside a containerized Slurm compute node PID 1 is slurmd, which does not register a SIGCHLD handler, so the processes stay zombies, psutil.wait_procs treats them as still alive, and ray stop reports "Stopped 0 out of N".

Document the deployment-layer fix: give the container a real init (tini / dumb-init / docker run --init / compose init: true), and link to the root-cause analysis and confirmation in PR ray-project#62591.

Refs:
- ray-project#62591 (comment)
- ray-project#62591 (comment)

Signed-off-by: Future-Outlier <eric901201@gmail.com>
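The zombie behavior described in this commit message can be reproduced outside Ray. The following is a minimal sketch, assuming a Linux host with /proc mounted; the helper names proc_state and zombie_before_and_after_reap are made up for illustration and are not part of Ray or psutil.

```python
import os
import time
from typing import Optional, Tuple


def proc_state(pid: int) -> Optional[str]:
    """Return the one-letter state from /proc/<pid>/stat, or None if the entry is gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            stat = f.read()
    except (FileNotFoundError, ProcessLookupError):
        return None
    # The state letter is the first field after the ")" that closes the command name.
    return stat[stat.rindex(")") + 2]


def zombie_before_and_after_reap() -> Tuple[Optional[str], Optional[str]]:
    """Fork a child that exits at once; report its state before and after waitpid."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately

    # Parent: poll until the exited child shows up as a zombie ("Z"),
    # deliberately without calling waitpid yet.
    deadline = time.monotonic() + 5.0
    state = proc_state(pid)
    while state != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)
        state = proc_state(pid)

    os.waitpid(pid, 0)  # reap: this is what removes the process-table entry
    return state, proc_state(pid)


if __name__ == "__main__":
    print(zombie_before_and_after_reap())
```

The first value is the state observed while the parent has not reaped the child, and the second is the state after waitpid. A tool that decides liveness by checking whether /proc/<pid> still exists would count the un-reaped child as alive, which is the situation the commit message describes for ray stop.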
Code Review
This pull request adds a new section to the SLURM user guide documentation regarding running Ray inside Docker containers. It explains the necessity of using a proper init process (like tini or dumb-init) as PID 1 to ensure Ray processes are correctly reaped and do not become zombies. The review feedback suggests improving the Docker Compose example by adding the -D flag to slurmd and using a stable version number, clarifying the output behavior of ray stop, and verifying the accuracy of the PR and comment links provided in the references.
- Drop the imprecise "Docker's built-in --init" wording; --init is a flag that injects tini, not an init process itself.
- Replace the unverifiable claim about kernel signal delivery to zombies with the documented fact that only waitpid(2) removes a zombie from the process table (cited from wait(2)).
- Add inline citations to wait(2), signal(7), and the psutil source for wait_pid_posix's /proc/<pid> polling fallback (which is what makes ray stop classify zombies as still alive).
- Link the SIGKILL escalation to its actual call site in python/ray/scripts/scripts.py.
- Add tini and dumb-init repository references with their stated purpose (zombie reaping; dumb-init also handles Linux's PID 1 signal special case).
- Group References by topic: Linux semantics, container init runtimes, ray stop tooling, and PR ray-project#62591 discussion.

Signed-off-by: Future-Outlier <eric901201@gmail.com>
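To illustrate the zombie-reaping role that this commit attributes to tini and dumb-init, here is a rough sketch of the core loop of a reaping parent. It is not how either tool is implemented: the function name run_as_mini_init is hypothetical, signal forwarding is left out entirely, and orphaned processes are only reparented to this process when it actually runs as PID 1 (or as a subreaper).

```python
import os
import sys
from typing import List


def run_as_mini_init(argv: List[str]) -> int:
    """Run argv as a child and reap every exited child until argv itself exits."""
    main_pid = os.fork()
    if main_pid == 0:
        # Child: replace this process with the requested command.
        try:
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)  # only reached if exec failed

    while True:
        # Block until any child exits; calling wait is what clears its zombie entry.
        pid, status = os.wait()
        if pid == main_pid:
            # The main command is done, so propagate its exit code.
            return os.waitstatus_to_exitcode(status)
        # Any other pid was an inherited or extra child; it is now reaped.


if __name__ == "__main__":
    sys.exit(run_as_mini_init(sys.argv[1:] or ["true"]))
```

A real init also forwards signals such as SIGTERM to the main command, which is one reason to use an existing tool over a script like this.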
/gemini review
Code Review
This pull request adds a new section to the Slurm documentation explaining how to handle zombie processes when running Ray inside Docker containers, specifically recommending the use of an init process like tini or dumb-init. The review feedback suggests standardizing the capitalization of "Slurm" throughout the text, updating log line numbers in examples to match the current source code, and using a realistic stable version number in the provided Docker Compose example.
Spell "Slurm" (proper noun) consistently in the four occurrences inside the new "Running inside Docker containers" section. Pre-existing "SLURM" occurrences elsewhere in this file are intentionally left untouched. Signed-off-by: Future-Outlier <eric901201@gmail.com>
d4d8bb3 to cb50a11
This is a LOT of text to put in the docs. Can we keep most of it in this issue, keep the docs much shorter, and link this issue for those who want to learn more?
Signed-off-by: Future-Outlier <eric901201@gmail.com>
updated
need someone from core to approve - looks good, thanks |
…-project#63221)

related issue: ray-project#62390

Due to the issue described in ray-project#62591 (comment), we have to add an init process as PID 1 in our Docker container.

Doc link: https://anyscale-ray--63221.com.readthedocs.build/en/63221/cluster/vms/user-guides/community/slurm.html#troubleshooting

---------

Signed-off-by: Future-Outlier <eric901201@gmail.com>
related issue: #62390
Due to the issue described in #62591 (comment), we have to add an init process as PID 1 in our Docker container.
Doc link: https://anyscale-ray--63221.com.readthedocs.build/en/63221/cluster/vms/user-guides/community/slurm.html#troubleshooting
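One quick way to see whether a container already has an init process as PID 1 is to read /proc/1/comm from inside it. The check below is only a sketch: the set of names is an assumption about how common init programs appear there (for example, that docker run --init shows up as docker-init), and the helper names are hypothetical, so adjust both for your image.

```python
from pathlib import Path
from typing import Optional

# Assumed process names for common init programs; extend this for your image.
KNOWN_INIT_NAMES = {"tini", "dumb-init", "docker-init", "systemd", "init"}


def looks_like_init(comm: str) -> bool:
    """Return True if the given process name is in the assumed init list."""
    return comm.strip() in KNOWN_INIT_NAMES


def pid1_name() -> Optional[str]:
    """Return the name of PID 1 as reported by /proc/1/comm, or None if unreadable."""
    try:
        return Path("/proc/1/comm").read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    name = pid1_name()
    if name is None:
        print("could not read /proc/1/comm")
    elif looks_like_init(name):
        print(f"PID 1 is {name}: looks like an init")
    else:
        print(f"PID 1 is {name}: exited Ray processes may stay zombies")
```

If PID 1 turns out to be slurmd or the container's main command, that matches the setup this PR describes, and adding an init as PID 1 is the documented fix.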