Add slime-train PATH shim for cwd-agnostic invocation #5066
Closed
neilyan-msft wants to merge 1 commit into
Slime's entrypoint is the repo-root train.py script (not a Python
module), so `python train.py` resolves against the caller's cwd.
AML/Singularity sets cwd to a per-job working directory (e.g.
/scratch/azureml/cr/j/{uuid}/exe/wd/) that does not contain
train.py, so the slime curated environment can only be exercised
end-to-end if the consumer hard-codes the absolute path
/opt/slime/train.py.
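To make the cwd dependence concrete, here is a small Python sketch (not part of this PR) that contrasts the two launch styles. `script_resolves` models `python <script>`, where a relative path is looked up in the working directory. `module_resolves` models `python -m <module>`, which looks the module up on `sys.path` via `importlib.util.find_spec`. Both helper names are hypothetical, and temporary directories stand in for `/opt/slime` and the AML per-job working directory.

```python
import importlib.util
import os
import tempfile


def script_resolves(script: str, cwd: str) -> bool:
    """Return True if `python <script>` launched from `cwd` would find the file.

    A relative script path is joined onto the working directory; an absolute
    path is returned unchanged by os.path.join, so it ignores the cwd.
    """
    return os.path.isfile(os.path.join(cwd, script))


def module_resolves(module: str) -> bool:
    """Return True if `python -m <module>` would find the module.

    Lookup goes through sys.path, so an installed package is found no matter
    which directory the job starts in.
    """
    return importlib.util.find_spec(module) is not None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as job_wd:
        # Stand-in for the slime source tree (the real image uses /opt/slime).
        entrypoint = os.path.join(src, "train.py")
        open(entrypoint, "w").close()
        print(script_resolves("train.py", src))     # True: cwd is the source tree
        print(script_resolves("train.py", job_wd))  # False: per-job working dir
        print(script_resolves(entrypoint, job_wd))  # True: absolute path ignores cwd
        print(module_resolves("json"))              # True: found via sys.path
```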
Install a small `slime-train` shim at /usr/local/bin/slime-train
that cds into /opt/slime and execs the slime entrypoint. This lets
consumers invoke slime via a stable PATH-resolvable name regardless
of the job working directory, and keeps any relative artefact paths
the slime entrypoint resolves at runtime anchored to the slime
source tree.
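The shim in this PR is a shell script. As a rough illustration only, the same logic can be written in Python. `build_invocation` and `main` are hypothetical names; the `/opt/slime` layout and the `python /opt/slime/train.py "$@"` command come from this description.

```python
import os
import sys
from typing import List, Sequence, Tuple

SLIME_ROOT = "/opt/slime"  # slime source tree inside the image, per this PR
ENTRYPOINT = os.path.join(SLIME_ROOT, "train.py")


def build_invocation(args: Sequence[str], python: str = "python") -> Tuple[str, List[str]]:
    """Return (working_dir, argv) for the exec step.

    Mirrors `cd /opt/slime && exec python /opt/slime/train.py "$@"`: every
    caller argument is forwarded unchanged and in order.
    """
    return SLIME_ROOT, [python, ENTRYPOINT, *args]


def main() -> None:
    cwd, argv = build_invocation(sys.argv[1:])
    # cd first so relative artefact paths resolve against the slime source tree.
    os.chdir(cwd)
    # Replace this process (like the shell `exec` builtin): the training run
    # keeps the shim's PID, so signals and the exit status reach it directly.
    os.execvp(argv[0], argv)
```

`os.execvp` searches `PATH` for `python`, just as the shell would. Keeping argv construction in a separate pure function makes it testable without actually launching training.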
Extend the smoke test to assert that the shim is present, is
world-readable and world-executable, and references
/opt/slime/train.py.
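The following minimal sketch shows the kind of assertions described above. It assumes a hypothetical `check_slime_shim` helper, and the real `smoke_test.py` may be structured differently. "World readable+executable" is read here as the `o+r` and `o+x` permission bits, checked with the standard `stat` module.

```python
import os
import stat

SHIM_PATH = "/usr/local/bin/slime-train"
ENTRYPOINT = "/opt/slime/train.py"


def check_slime_shim(path: str = SHIM_PATH, entrypoint: str = ENTRYPOINT) -> None:
    """Raise AssertionError unless the shim exists, is world-readable and
    world-executable, and references the slime entrypoint."""
    assert os.path.isfile(path), f"{path} is missing"
    mode = os.stat(path).st_mode
    # Permission bits for "other" users; 0755 sets both of these.
    assert mode & stat.S_IROTH, f"{path} is not world-readable (mode {oct(mode & 0o777)})"
    assert mode & stat.S_IXOTH, f"{path} is not world-executable (mode {oct(mode & 0o777)})"
    with open(path, encoding="utf-8") as f:
        body = f.read()
    assert entrypoint in body, f"{path} does not reference {entrypoint}"
```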
Test Results for assets-test: 0 tests, 0 ✅, 0s ⏱️. Results for commit ecc2418.
Author comment:
Closing. Switching to a consumer-side fix in Vienna MFE (hard-coding /opt/slime/train.py in TrainingRecipeCompiler) instead of adding a PATH shim to the asset. The slime asset stays cwd-coupled, but that contract already exists implicitly, and slime is the only direct-script framework.
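For comparison, the consumer-side fix amounts to emitting the absolute entrypoint path in the job command. The TrainingRecipeCompiler code is not shown here, so this minimal sketch uses a hypothetical `compose_slime_command` helper and `shlex.join` to build a shell-safe command string.

```python
import shlex
from typing import List, Sequence

# Absolute path of the slime entrypoint inside the curated image.
SLIME_ENTRYPOINT = "/opt/slime/train.py"


def compose_slime_command(args: Sequence[str], python: str = "python") -> str:
    """Build a job command whose entrypoint lookup does not depend on the cwd.

    Hypothetical stand-in for the consumer-side change: the absolute path
    replaces the cwd-relative `python train.py`.
    """
    argv: List[str] = [python, SLIME_ENTRYPOINT, *args]
    # Quote each argument so values with spaces survive the job's shell.
    return shlex.join(argv)
```

Unlike the shim, this does not `cd` into `/opt/slime`, so any relative paths resolved at runtime still resolve against the job working directory. That is the cwd coupling the comment above accepts.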
Summary
Adds a small `slime-train` shim at `/usr/local/bin/slime-train` so consumers can launch the slime training entrypoint from any working directory.
Motivation
Slime's entrypoint is the repo-root `train.py` script (not a Python module), so `python train.py` resolves against the caller's cwd. AML/Singularity sets cwd to a per-job working directory (e.g. `/scratch/azureml/cr/j/{uuid}/exe/wd/`) that does not contain `train.py`. As a result, jobs that compose their command as `python train.py ...` against this curated environment fail immediately because the interpreter cannot find `train.py` (observed end-to-end on canary, jobs `tb-slime-{sft,grpo}-gsm8k-qwen05b-05191606`, after the `/root/slime` -> `/opt/slime` move in #5046 unblocked image readability).
Change
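- `/usr/local/bin/slime-train` (mode 0755), which `cd`s to `/opt/slime` and `exec`s `python /opt/slime/train.py "$@"`. The `cd` keeps any relative artefact paths the slime entrypoint resolves at runtime anchored to the slime source tree.
- `smoke_test.py` now asserts that the shim is present, world-readable and world-executable, and references `/opt/slime/train.py`.
Why not change consumers instead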
Every other curated training environment (TRL, VERL, OpenRLHF, NeMo-RL, TorchForge) uses `python -m <module>` and is therefore cwd-independent. Slime is the only one whose entrypoint is a bare script file, so the cwd contract has to be solved either by hard-coding the absolute path in every consumer or by giving the asset a PATH-resolvable name. The latter keeps the layout knowledge inside the asset that owns it.
Validation
Local Python parse check on `smoke_test.py`. The image build and smoke test will validate the shim during the asset CI build.
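The description does not say which tool performed the local parse check. One plausible sketch uses the standard `ast` module; `python -m py_compile smoke_test.py` is a stdlib command-line alternative. `parses` and `parse_file` are hypothetical names.

```python
import ast


def parses(source: str, filename: str = "<string>") -> bool:
    """Return True if `source` is syntactically valid Python.

    Syntax only: nothing is imported or executed, so runtime problems such as
    a missing or non-executable shim are not detected here.
    """
    try:
        ast.parse(source, filename=filename)
    except (SyntaxError, ValueError):  # ValueError: null bytes on older Pythons
        return False
    return True


def parse_file(path: str) -> bool:
    """Parse-check a file on disk, e.g. parse_file("smoke_test.py")."""
    with open(path, encoding="utf-8") as f:
        return parses(f.read(), filename=path)
```

Because a parse check cannot exercise the shim itself, the image build and smoke test in CI remain the real validation.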