docs: add SkyPilot Kubernetes tutorial #1667
Signed-off-by: Zeel <desaizeel2128@gmail.com>
Hi @zeel2104, thank you for writing this up. So that I can test this myself, could you share which provider you used for testing? Thank you.
Hi @akoumpa, I haven’t validated this on a provider yet. Locally, I did verify the related SkyPilot launcher path by running the targeted unit tests in tests/unit_tests/_cli/test_app.py. I can validate the SkyPilot CLI path from a Linux VM, including sky startup, local launcher behavior, and Kubernetes client-side checks, but full end-to-end validation of the tutorial still requires access to a real GPU Kubernetes cluster/provider.
/ok to test 1fca216
jgerh left a comment
Completed a tech pubs review of .md files and added a few copyedits and style adjustments to align with our style guide.
/ok to test 0f8cad1
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
Signed-off-by: Zeel <desaizeel2128@gmail.com>
Signed-off-by: Zeel <desaizeel2128@gmail.com>
Branch updated from 0f8cad1 to 48663a4
@jgerh
Thank you @zeel2104, looks good 🙇
What does this PR do?
Adds a beginner-friendly SkyPilot + Kubernetes tutorial for NeMo AutoModel, including ready-to-run single-node and two-node example configs, and fixes SkyPilot launcher env-var interpolation so ${HF_TOKEN} works as documented.
Changelog
- Add docs/launcher/skypilot-kubernetes.md with a step-by-step SkyPilot + Kubernetes tutorial
- Add single-node example config at examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes.yaml
- Add two-node example config at examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes_2nodes.yaml
- Link the new tutorial from docs/launcher/overview.md, docs/launcher/skypilot.md, and docs/index.md
- Resolve SkyPilot launcher env vars in nemo_automodel/cli/app.py
- Add unit test coverage for SkyPilot env-var resolution in tests/unit_tests/_cli/test_app.py
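For reviewers unfamiliar with the env-var fix, here is a minimal sketch of the kind of `${VAR}` resolution described above. It assumes the launcher's env vars arrive as a flat dict of strings. `resolve_env_vars` is a hypothetical name, and this is not the code in nemo_automodel/cli/app.py. Raising on an unset variable is a design choice of this sketch, so a missing HF_TOKEN fails fast instead of being forwarded literally as the string "${HF_TOKEN}".

```python
import os
import re
from typing import Mapping

# Matches ${NAME} placeholders where NAME follows shell identifier rules.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_env_vars(
    env: Mapping[str, str], source: Mapping[str, str] = os.environ
) -> dict[str, str]:
    """Hypothetical helper: expand ${NAME} references in launcher env values.

    Values are looked up in `source` (the local environment by default).
    Raises KeyError if a referenced variable is not set there.
    """
    resolved: dict[str, str] = {}
    for key, value in env.items():

        def _lookup(match: re.Match) -> str:
            name = match.group(1)
            if name not in source:
                raise KeyError(f"{key} references unset variable {name}")
            return source[name]

        # Substitute every ${NAME} occurrence; plain values pass through unchanged.
        resolved[key] = _PLACEHOLDER.sub(_lookup, str(value))
    return resolved


if __name__ == "__main__":
    fake_environ = {"HF_TOKEN": "hf_example"}
    config_env = {"HF_TOKEN": "${HF_TOKEN}", "WANDB_MODE": "offline"}
    print(resolve_env_vars(config_env, fake_environ))
```

Passing `source` explicitly, as in the usage example, keeps the function easy to unit-test without touching the real environment.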
Before your PR is "Ready for review"
Pre-checks:
- Targeted SkyPilot launcher tests passed locally
- automodel --help works locally
- Real SkyPilot + Kubernetes validation could not be completed on native Windows because SkyPilot requires the Unix-only resource module
- Hardware validation is still needed on Linux
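Because the Windows blocker comes from a missing standard-library module, a quick preflight check can tell a contributor whether their machine can import it before they try a local SkyPilot run. This is a minimal sketch assuming the `resource` import is the blocker that matters here (SkyPilot may have other platform requirements); `has_unix_resource_module` is a hypothetical helper, not part of SkyPilot or NeMo AutoModel.

```python
import importlib.util
import platform


def has_unix_resource_module() -> bool:
    """Hypothetical preflight: True if the stdlib `resource` module is importable.

    Python documents `resource` as Unix-only, so this returns False on native Windows.
    """
    return importlib.util.find_spec("resource") is not None


if __name__ == "__main__":
    if has_unix_resource_module():
        print(f"{platform.system()}: resource module available")
    else:
        print(f"{platform.system()}: resource module missing; run on Linux instead")
```

`find_spec` only locates the module without importing it, so the check has no side effects.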
Additional Information