[docs] Slurm: add guide for running Ray inside Docker containers #63221

Merged: MengjinYan merged 10 commits into ray-project:master from Future-Outlier:docs/symmetric-run-docker-init on May 12, 2026
@Future-Outlier
Member

@Future-Outlier Future-Outlier commented May 8, 2026

related issue: #62390

Due to the issue described in #62591 (comment), we have to add an init process as PID 1 in our Docker container.

Doc link: https://anyscale-ray--63221.com.readthedocs.build/en/63221/cluster/vms/user-guides/community/slurm.html#troubleshooting

…-run in Docker

When ray symmetric-run finishes, ray stop sends SIGTERM to each Ray
process and waits for them via psutil.wait_procs. Reaping the
resulting zombies is the parent's job. On a normal Linux host, PID 1
is systemd, which reaps. Inside a containerized Slurm compute node,
PID 1 is slurmd, which does not register a SIGCHLD handler, so the
processes stay zombies, psutil.wait_procs treats them as still alive,
and ray stop reports "Stopped 0 out of N".
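
The zombie behavior described above can be reproduced with the standard library alone. The sketch below is Linux-only and is not Ray's or psutil's code: the helper names (proc_state, demo_zombie) are made up for illustration, and the /proc/<pid> existence check is only a stand-in for a liveness probe. It forks a child that exits at once, shows that the unreaped child is in state Z while its /proc entry still exists, and then shows that the entry disappears once the parent calls waitpid.

```python
import os
import signal
import time


def proc_state(pid: int) -> str:
    """Return the one-letter process state from /proc/<pid>/stat (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # The state field is the first token after the parenthesised command name.
    return data.rsplit(")", 1)[1].split()[0]


def demo_zombie() -> tuple[str, bool, bool]:
    """Fork a child, let it die unreaped, then reap it.

    Returns (state while unreaped, /proc entry exists before reaping,
    /proc entry exists after reaping).
    """
    # Make sure exited children are not discarded automatically
    # (that would happen if SIGCHLD were inherited as SIG_IGN).
    signal.signal(signal.SIGCHLD, signal.SIG_DFL)

    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately, nobody has waited for us yet

    # Parent: poll until the child has exited and turned into a zombie.
    deadline = time.monotonic() + 5
    while proc_state(pid) != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)

    state = proc_state(pid)
    exists_before = os.path.exists(f"/proc/{pid}")  # a naive probe says "alive"

    os.waitpid(pid, 0)  # reaping removes the entry from the process table
    exists_after = os.path.exists(f"/proc/{pid}")
    return state, exists_before, exists_after
```

On a Linux machine, demo_zombie() is expected to return ("Z", True, False): the dead child keeps looking alive to a /proc-based probe until something reaps it, which is the situation ray stop runs into when PID 1 never calls wait.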

Document the deployment-layer fix: give the container a real init
(tini / dumb-init / docker run --init / compose init: true), and link
to the root-cause analysis and confirmation in PR ray-project#62591.
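
A quick way to check whether a container already has an init as PID 1 is to read the name of process 1 from /proc. The sketch below is an illustration only: the function names are hypothetical, and the KNOWN_INITS set is an assumption that should be adjusted to your image ("docker-init" is the name tini usually runs under when injected by docker run --init).

```python
from pathlib import Path

# Assumed names of PID 1 programs that reap orphaned children.
# This list is illustrative, not exhaustive.
KNOWN_INITS = {"tini", "dumb-init", "docker-init", "systemd", "init"}


def pid1_name(proc_root: str = "/proc") -> str:
    """Return the command name of PID 1 as reported by <proc_root>/1/comm."""
    return Path(proc_root, "1", "comm").read_text().strip()


def has_reaping_init(name: str) -> bool:
    """True if the given PID 1 name is one of the assumed init programs."""
    return name in KNOWN_INITS


if __name__ == "__main__":
    name = pid1_name()
    if has_reaping_init(name):
        print(f"PID 1 is {name}: orphaned Ray processes should be reaped")
    else:
        print(f"PID 1 is {name}: consider running the container with an init")
```

Running this inside the compute-node container before starting Ray gives an early warning when PID 1 is slurmd or another program that is not on the assumed list.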

Refs:
- ray-project#62591 (comment)
- ray-project#62591 (comment)

Signed-off-by: Future-Outlier <eric901201@gmail.com>
@Future-Outlier Future-Outlier requested a review from a team as a code owner May 8, 2026 07:40
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request adds a new section to the SLURM user guide documentation regarding running Ray inside Docker containers. It explains the necessity of using a proper init process (like tini or dumb-init) as PID 1 to ensure Ray processes are correctly reaped and do not become zombies. The review feedback suggests improving the Docker Compose example by adding the -D flag to slurmd and using a stable version number, clarifying the output behavior of ray stop, and verifying the accuracy of the PR and comment links provided in the references.

3 outdated comment threads on doc/source/cluster/vms/user-guides/community/slurm.rst
- Drop the imprecise "Docker's built-in --init" wording; --init is a flag
  that injects tini, not an init process itself.
- Replace the unverifiable claim about kernel signal delivery to zombies
  with the documented fact that only waitpid(2) removes a zombie from the
  process table (cited from wait(2)).
- Add inline citations to wait(2), signal(7), and the psutil source for
  wait_pid_posix's /proc/<pid> polling fallback (which is what makes
  ray stop classify zombies as still alive).
- Link the SIGKILL escalation to its actual call site in
  python/ray/scripts/scripts.py.
- Add tini and dumb-init repository references with their stated purpose
  (zombie reaping; dumb-init also handles Linux's PID 1 signal special
  case).
- Group References by topic: Linux semantics, container init runtimes,
  ray stop tooling, and PR ray-project#62591 discussion.
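
For readers unfamiliar with what "zombie reaping" means in practice, the core of it is a non-blocking waitpid loop. The following is a conceptual sketch only, not the implementation of tini or dumb-init, and the function names are invented for illustration: reap_children collects every child that has already exited, and spawn_and_reap exercises it with a few short-lived children. It requires a POSIX system.

```python
import os
import signal
import time


def reap_children() -> list[int]:
    """Collect every child that has already exited, without blocking."""
    reaped = []
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append(pid)
    return reaped


def spawn_and_reap(n: int = 3) -> tuple[list[int], list[int]]:
    """Fork n children that exit at once, then reap them all."""
    # Keep the default SIGCHLD disposition so exited children stay
    # in the process table until we wait for them.
    signal.signal(signal.SIGCHLD, signal.SIG_DFL)

    children = []
    for _ in range(n):
        pid = os.fork()
        if pid == 0:
            os._exit(0)  # child: exit immediately
        children.append(pid)

    reaped: list[int] = []
    deadline = time.monotonic() + 5
    while len(reaped) < n and time.monotonic() < deadline:
        reaped.extend(reap_children())
        time.sleep(0.01)
    return children, reaped
```

A real init typically runs such a loop in response to SIGCHLD for as long as the container lives; a PID 1 that never does this leaves every orphaned process as a zombie.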

Signed-off-by: Future-Outlier <eric901201@gmail.com>
@Future-Outlier
Member Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request adds a new section to the Slurm documentation explaining how to handle zombie processes when running Ray inside Docker containers, specifically recommending the use of an init process like tini or dumb-init. The review feedback suggests standardizing the capitalization of "Slurm" throughout the text, updating log line numbers in examples to match the current source code, and using a realistic stable version number in the provided Docker Compose example.

7 comment threads on doc/source/cluster/vms/user-guides/community/slurm.rst (6 outdated)
Spell "Slurm" (proper noun) consistently in the four occurrences inside
the new "Running inside Docker containers" section. Pre-existing
"SLURM" occurrences elsewhere in this file are intentionally left
untouched.

Signed-off-by: Future-Outlier <eric901201@gmail.com>
@Future-Outlier Future-Outlier force-pushed the docs/symmetric-run-docker-init branch from d4d8bb3 to cb50a11 Compare May 8, 2026 08:42
@Future-Outlier Future-Outlier changed the title from "[docs] Slurm: require init as PID 1 for symmetric-run in Docker" to "[docs] Slurm: add guide for running Ray inside Docker containers" May 8, 2026
@Future-Outlier Future-Outlier added the "go" label (add ONLY when ready to merge, run all tests) May 8, 2026
@ray-gardener ray-gardener Bot added the "docs" (An issue or change related to documentation) and "core" (Issues that should be addressed in Ray Core) labels May 8, 2026
@richardliaw
Contributor

this is a LOT of text to put in the docs; can we just keep most of the text in this issue and keep the docs much shorter, and link this issue for those who want to learn more??

Contributor

@richardliaw richardliaw left a comment

^

Signed-off-by: Future-Outlier <eric901201@gmail.com>
@Future-Outlier
Member Author

updated
cc @richardliaw to merge, tks!

@richardliaw
Contributor

need someone from core to approve - looks good, thanks

@MengjinYan MengjinYan merged commit 260738a into ray-project:master May 12, 2026
6 checks passed
dancingactor pushed a commit to dancingactor/ray that referenced this pull request May 13, 2026
am-kinetica pushed a commit to kineticadb/ray that referenced this pull request May 14, 2026
Signed-off-by: anindyam1969 <amukherjee@kinetica.com>
Lucas61000 pushed a commit to Lucas61000/ray that referenced this pull request May 15, 2026
alexandrplashchinsky pushed a commit to alexandrplashchinsky/ray-alex that referenced this pull request May 29, 2026
Signed-off-by: Alexandr Plashchinsky <alexandr.plashchinsky@alexandrplashchinsky-H765G66H9V.local>
Labels: core, docs, go
