docs: add SkyPilot Kubernetes tutorial #1667

Merged
akoumpa merged 5 commits into NVIDIA-NeMo:main from zeel2104:zeel2104/skypilot-k8s-tutorial
Apr 19, 2026
Conversation

@zeel2104 zeel2104 commented Apr 3, 2026

What does this PR do?

Adds a beginner-friendly SkyPilot + Kubernetes tutorial for NeMo AutoModel, including ready-to-run single-node and two-node example configs, and fixes SkyPilot launcher env-var interpolation so ${HF_TOKEN} works as documented.

Changelog

  • Add docs/launcher/skypilot-kubernetes.md with a step-by-step SkyPilot + Kubernetes tutorial
  • Add single-node example config at examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes.yaml
  • Add two-node example config at examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes_2nodes.yaml
  • Link the new tutorial from docs/launcher/overview.md, docs/launcher/skypilot.md, and docs/index.md
  • Resolve SkyPilot launcher env vars in nemo_automodel/cli/app.py
  • Add unit test coverage for SkyPilot env-var resolution in tests/unit_tests/_cli/test_app.py
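
For readers unfamiliar with this kind of fix, here is a minimal sketch of what `${HF_TOKEN}`-style interpolation over a loaded config could look like. It is not the code in nemo_automodel/cli/app.py; the helper name `resolve_env_vars`, the `env_vars` config key, and the choice to leave unknown variables untouched are all assumptions made for illustration.

```python
import os
import re
from collections.abc import Mapping
from typing import Any

# Matches ${NAME}, where NAME is a conventional environment-variable identifier.
_ENV_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_env_vars(value: Any, env: Mapping[str, str] = os.environ) -> Any:
    """Recursively replace ${NAME} placeholders in strings with values from env.

    Hypothetical helper for illustration only. Placeholders whose variable is
    not set are left unchanged rather than replaced with an empty string.
    """
    if isinstance(value, str):
        return _ENV_PATTERN.sub(lambda m: env.get(m.group(1), m.group(0)), value)
    if isinstance(value, Mapping):
        return {key: resolve_env_vars(item, env) for key, item in value.items()}
    if isinstance(value, list):
        return [resolve_env_vars(item, env) for item in value]
    return value


# Assumed config shape; the real launcher config may be structured differently.
config = {"env_vars": {"HF_TOKEN": "${HF_TOKEN}"}, "num_nodes": 2}
print(resolve_env_vars(config, {"HF_TOKEN": "hf_example"}))
```

Passing the environment in as a mapping, rather than reading `os.environ` directly inside the function, keeps the behavior easy to cover with a unit test.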

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?

- Targeted SkyPilot launcher tests passed locally
- automodel --help works locally
- Real SkyPilot + Kubernetes validation could not be completed from native Windows because SkyPilot requires the Unix-only resource module
- Hardware validation is still needed on Linux
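
As a rough illustration of the Windows limitation noted above: Python's standard-library `resource` module is only available on Unix-like platforms. The check below is a sketch that assumes nothing about how SkyPilot itself imports the module; the helper name `has_resource_module` is hypothetical.

```python
import importlib.util
import sys


def has_resource_module() -> bool:
    """Return True when the Unix-only stdlib `resource` module can be found."""
    return importlib.util.find_spec("resource") is not None


if has_resource_module():
    print("'resource' is available on this platform.")
else:
    # Expected on native Windows, where the module is not shipped.
    print(f"'resource' is unavailable on {sys.platform}; use a Unix-like environment.")
```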

Additional Information

Signed-off-by: Zeel <desaizeel2128@gmail.com>

copy-pr-bot bot commented Apr 3, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@chtruong814 chtruong814 added the needs-follow-up Issue needs follow-up label Apr 5, 2026
@akoumpa akoumpa added the docs-only With great power comes great responsibility. label Apr 6, 2026

akoumpa commented Apr 6, 2026

Hi @zeel2104, thank you for writing this up. So that I can test this myself, can you share the provider you used for testing?

Thank you.

zeel2104 commented Apr 6, 2026

Hi @akoumpa, I haven’t validated this on a provider yet. Locally, I did verify the related SkyPilot launcher path by running the targeted unit tests:

tests/unit_tests/_cli/test_app.py
tests/unit_tests/_cli/test_skypilot_app.py
tests/unit_tests/launcher/test_skypilot_config.py
tests/unit_tests/launcher/test_skypilot_utils.py
These passed locally, and automodel --help also works after reinstalling the package in my environment.
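
The targeted run described above can be sketched as a single pytest invocation over the four listed files. This is an assumption about how the tests were run, not a record of the exact command; `build_pytest_command` is a hypothetical helper, and the script only prints the command so it can be reviewed before running it from the repository root.

```python
import shlex
import sys

# Test files named in the comment above.
TEST_FILES = [
    "tests/unit_tests/_cli/test_app.py",
    "tests/unit_tests/_cli/test_skypilot_app.py",
    "tests/unit_tests/launcher/test_skypilot_config.py",
    "tests/unit_tests/launcher/test_skypilot_utils.py",
]


def build_pytest_command(paths: list[str]) -> list[str]:
    """Build an argv list that runs pytest on the given paths with this interpreter."""
    return [sys.executable, "-m", "pytest", *paths]


# Print a shell-quoted form of the command instead of executing it.
print(shlex.join(build_pytest_command(TEST_FILES)))
```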

I can validate the SkyPilot CLI path from a Linux VM, including sky startup, local launcher behavior, and Kubernetes client-side checks, but full end-to-end validation of the tutorial still requires access to a real GPU Kubernetes cluster/provider.
For that, do we have an internal testbed, a shared Kubernetes cluster, or cloud credits/access for something like GKE/EKS/AKS that contributors can use to validate SkyPilot + Kubernetes workflows?

@chtruong814 chtruong814 added needs-follow-up Issue needs follow-up and removed needs-follow-up Issue needs follow-up labels Apr 6, 2026

akoumpa commented Apr 8, 2026

/ok to test 1fca216

@chtruong814 chtruong814 removed the needs-follow-up Issue needs follow-up label Apr 8, 2026

@jgerh jgerh left a comment

Completed a tech pubs review of .md files and added a few copyedits and style adjustments to align with our style guide.

Comment thread docs/launcher/overview.md Outdated (5 threads)
Comment thread docs/launcher/skypilot.md Outdated (5 threads)

akoumpa commented Apr 10, 2026

/ok to test 0f8cad1

zeel2104 and others added 2 commits April 9, 2026 23:36
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
Signed-off-by: Zeel <desaizeel2128@gmail.com>
Signed-off-by: Zeel <desaizeel2128@gmail.com>
@zeel2104 zeel2104 force-pushed the zeel2104/skypilot-k8s-tutorial branch from 0f8cad1 to 48663a4 April 10, 2026 03:37
@zeel2104

@jgerh
Addressed in follow-up commits. I applied the requested copyedits in docs/launcher/overview.md and docs/launcher/skypilot.md.

akoumpa commented Apr 10, 2026

Thank you @zeel2104 looks good 🙇

@chtruong814 chtruong814 added waiting-for-customer Waiting for response from the original author and removed waiting-for-customer Waiting for response from the original author labels Apr 14, 2026
@chtruong814 chtruong814 added the waiting-on-customer Waiting on the original author to respond label Apr 18, 2026
@akoumpa akoumpa merged commit 08da90f into NVIDIA-NeMo:main Apr 19, 2026
2 checks passed

Labels

community-request docs-only With great power comes great responsibility. waiting-on-customer Waiting on the original author to respond

5 participants