Move slime + Megatron-LM sources to /opt for non-root AML jobs #5046

Merged: neilyan-msft merged 5 commits into main from users/neilyan/fix-slime-root-perms-20260515 on May 18, 2026

Conversation

@neilyan-msft
Contributor

Why

The slime curated image mcr.microsoft.com/azureml/curated/slime-pytorch-2.9-cuda12.8:2 is unusable from Foundry/Vienna Training Block Slime SFT jobs because AML/Singularity runs the job as uid=9000(aiscuser), but /root is mode 700. Reading source files such as /root/slime/train.py and /root/slime/slime/__init__.py raises PermissionError, importlib.util.find_spec("slime") fails, and there is no readable train.py for the compiled command. Passing resources.dockerArgs="--user root" is not a viable workaround: the actual job still runs as aiscuser.
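The failure mode above comes down to directory traversal: a file can itself be world-readable and still be unreachable if any ancestor directory lacks the world-execute bit. A minimal sketch of that check follows. The helper name `untraversable_ancestors` is hypothetical and is not part of the image or of smoke_test.py. It looks only at mode bits, so it reports what an unrelated non-root user would hit even when the script itself runs as root.

```python
import os
import stat
from pathlib import Path


def untraversable_ancestors(path):
    """Return ancestor directories of `path` that lack the world-execute bit.

    Hypothetical helper for illustration. Ancestors are inspected from the
    filesystem root downward; inspection stops at the first directory that
    cannot be stat'ed (for example because a parent already blocks access
    or the path does not exist).
    """
    resolved = Path(os.path.realpath(path))
    blocked = []
    for parent in reversed(resolved.parents):
        try:
            mode = parent.stat().st_mode
        except OSError:
            break  # cannot look any deeper from this account
        if not mode & stat.S_IXOTH:
            blocked.append(str(parent))
    return blocked


if __name__ == "__main__":
    # Paths taken from the PR description; either may be absent locally.
    for candidate in ("/root/slime/train.py", "/opt/slime/train.py"):
        print(candidate, "->", untraversable_ancestors(candidate))
```

On an image where /root is mode 700, the first path would be expected to report /root as a blocking ancestor, while a 755 /opt would not appear in the list.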

What

Relocate the editable Slime + Megatron-LM trees from /root to /opt (mode 755) so non-root jobs can read them.

  • Clone slime to /opt/slime, Megatron-LM to /opt/Megatron-LM.
  • Update the Megatron patch path, both pip install -e invocations, the int4_qat in-tree install, WORKDIR, and PYTHONPATH to /opt.
  • chmod -R a+rX /opt/slime /opt/Megatron-LM so a defensive umask cannot strip world read+traverse.
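For readers unfamiliar with the capital `X` in `chmod -R a+rX`, here is a rough Python approximation of its effect: read is added for everyone, and execute is added for everyone only on directories and on files that already have some execute bit set. This is an illustrative sketch, not what the Dockerfile runs (the Dockerfile uses the `chmod` command itself), and the function name `add_read_and_traverse` is made up for this example.

```python
import os
import stat

READ_ALL = stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH
EXEC_ALL = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH


def _fix(path):
    """Apply the a+rX rule to one path, skipping symlinks."""
    if os.path.islink(path):
        return
    mode = stat.S_IMODE(os.stat(path).st_mode)
    new_mode = mode | READ_ALL
    # Capital X: execute only for directories or already-executable files.
    if os.path.isdir(path) or mode & EXEC_ALL:
        new_mode |= EXEC_ALL
    if new_mode != mode:
        os.chmod(path, new_mode)


def add_read_and_traverse(root):
    """Approximate `chmod -R a+rX root` (hypothetical helper)."""
    _fix(root)
    # Top-down walk: each directory is fixed before os.walk descends into it.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            _fix(os.path.join(dirpath, name))
```

Under this rule a 700 directory becomes 755, a 600 source file becomes 644, and a 700 script becomes 755, which is why plain data files do not become executable.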

Validation in the image

  • smoke_test.py now asserts slime.__file__ resolves under /opt/slime and that /opt/slime, /opt/slime/train.py, /opt/slime/slime/__init__.py, and /opt/Megatron-LM are world-readable and (for directories) world-traversable.
  • The build runs a non-root check, runuser -u nobody -- python, confirming that import slime and importlib.util.find_spec("slime") both succeed.
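The permission assertions described above might look roughly like the following. This is a sketch only: the actual contents of smoke_test.py are not shown in this PR page, and the names `REQUIRED_PATHS` and `world_access_problems` are assumptions made for illustration. The four paths are the ones listed in the description.

```python
import os
import stat

# Paths named in the PR description; the constant name is hypothetical.
REQUIRED_PATHS = (
    "/opt/slime",
    "/opt/slime/train.py",
    "/opt/slime/slime/__init__.py",
    "/opt/Megatron-LM",
)


def world_access_problems(paths):
    """Return a list of messages for paths that others cannot read or enter.

    Checks mode bits rather than calling os.access, so the result is the
    same whether the check itself runs as root or as a non-root user.
    """
    problems = []
    for path in paths:
        try:
            mode = os.stat(path).st_mode
        except OSError as exc:
            problems.append(f"{path}: cannot stat ({exc.strerror})")
            continue
        if not mode & stat.S_IROTH:
            problems.append(f"{path}: not world-readable")
        if stat.S_ISDIR(mode) and not mode & stat.S_IXOTH:
            problems.append(f"{path}: directory not world-traversable")
    return problems
```

A smoke test could then assert that `world_access_problems(REQUIRED_PATHS)` is empty and print the returned messages on failure.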

Acceptance criteria addressed

  • python -c "import slime" succeeds from a non-root user.
  • python /opt/slime/train.py (or cd /opt/slime && python train.py) is accessible — train.py is now world-readable.
  • Training Block Slime SFT command bootstrap no longer fails with PermissionError or missing train.py.

Compatibility note

Any customer command that assumed /root/slime will need to point at /opt/slime instead. The image version will bump on merge.

The slime curated image (slime-pytorch-2.9-cuda12.8:2) was unusable by
Foundry/Vienna Training Block Slime SFT jobs running on AML/Singularity
as uid 9000 (aiscuser). The Dockerfile cloned slime into /root/slime and
Megatron-LM into /root/Megatron-LM, then pip installed each editably.
Because /root is mode 700, the non-root job user could not:

- read /root/slime/train.py or /root/slime/slime/__init__.py
- resolve import slime (the editable .pth points into /root/slime)
- pick up Megatron-LM via PYTHONPATH=/root/Megatron-LM
- run python train.py ... (no readable entrypoint)

Passing resources.dockerArgs="--user root" was not a viable workaround
because the AML job still launches as aiscuser regardless.

Relocate both editable trees to /opt:

- Clone slime into /opt/slime and Megatron-LM into /opt/Megatron-LM.
- Update the Megatron patch path, both pip install -e invocations, and
  the int4_qat in-tree install to use /opt/slime.
- Update PYTHONPATH and WORKDIR to /opt.
- chmod -R a+rX /opt/slime /opt/Megatron-LM so a defensive umask cannot
  strip world read+traverse from the source trees.

Validation additions:

- smoke_test.py now asserts that slime.__file__ resolves under
  /opt/slime, and that /opt/slime, /opt/slime/train.py,
  /opt/slime/slime/__init__.py, and /opt/Megatron-LM are world-readable
  and (for directories) world-traversable.
- The build now also runs runuser -u nobody -- python against an
  importlib-based slime import check, giving end-to-end confidence that
  a non-root user can import slime before the image is published.
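
An importlib-based check of this kind could be sketched as below. The exact script the build passes to `runuser -u nobody -- python` is not shown here, so the function name `module_resolves_under` and its structure are assumptions; only `importlib.util.find_spec` and the /opt/slime location come from the PR text.

```python
import importlib.util
import os


def module_resolves_under(module_name, expected_root):
    """Return True if `module_name` is importable from under `expected_root`.

    Hypothetical helper: uses find_spec to locate the module without
    relying on it having been imported already, then compares the resolved
    file location against the expected source tree.
    """
    spec = importlib.util.find_spec(module_name)
    if spec is None or spec.origin is None:
        return False  # not found, or a namespace package with no file
    origin = os.path.realpath(spec.origin)
    root = os.path.realpath(expected_root)
    return os.path.commonpath([origin, root]) == root
```

Run as a non-root user, a call such as `module_resolves_under("slime", "/opt/slime")` returning False would indicate either that the editable install is not visible to that user or that it still points somewhere other than /opt/slime.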

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions

github-actions Bot commented May 15, 2026

Test Results for assets-test

0 tests   0 ✅  0s ⏱️
0 suites  0 💤
0 files    0 ❌

Results for commit 7ec0911.

♻️ This comment has been updated with latest results.

yeshsurya previously approved these changes May 15, 2026
@neilyan-msft neilyan-msft merged commit e1dc530 into main May 18, 2026
38 checks passed
@neilyan-msft neilyan-msft deleted the users/neilyan/fix-slime-root-perms-20260515 branch May 18, 2026 16:35
3 participants