Merged
39 changes: 39 additions & 0 deletions docs/source/guides/execution.md
@@ -7,6 +7,7 @@ Each execution of a single configured task requires an executor. Nemo-Run provid
- `run.DockerExecutor`
- `run.SlurmExecutor` with an optional `SSHTunnel` for executing on Slurm clusters from your local machine
- `run.SkypilotExecutor` (available under the optional feature `skypilot` in the Python package)
- `run.LeptonExecutor`

A tuple of a task and an executor forms an execution unit. A key goal of NeMo-Run is to let you mix and match tasks and executors to define execution units arbitrarily.

@@ -41,6 +42,7 @@ The packager support matrix is described below:
| SlurmExecutor | run.Packager, run.GitArchivePackager, run.PatternPackager, run.HybridPackager |
| SkypilotExecutor | run.Packager, run.GitArchivePackager, run.PatternPackager, run.HybridPackager |
| DGXCloudExecutor | run.Packager, run.GitArchivePackager, run.PatternPackager, run.HybridPackager |
| LeptonExecutor | run.Packager, run.GitArchivePackager, run.PatternPackager, run.HybridPackager |

`run.Packager` is a passthrough base packager.
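
The support matrix above can also be expressed as a small lookup, which is handy for validating a configuration before launching. This is only a sketch: it covers just the four rows visible in this excerpt, and `SUPPORTED_PACKAGERS` and `is_supported` are hypothetical names, not part of NeMo-Run.

```python
# Packagers listed for each executor in the visible rows of the matrix above.
_ALL_PACKAGERS = frozenset(
    {
        "run.Packager",
        "run.GitArchivePackager",
        "run.PatternPackager",
        "run.HybridPackager",
    }
)

# Hypothetical lookup table; rows not shown in this excerpt are deliberately omitted.
SUPPORTED_PACKAGERS = {
    "SlurmExecutor": _ALL_PACKAGERS,
    "SkypilotExecutor": _ALL_PACKAGERS,
    "DGXCloudExecutor": _ALL_PACKAGERS,
    "LeptonExecutor": _ALL_PACKAGERS,
}


def is_supported(executor_name: str, packager_name: str) -> bool:
    """Return True if the matrix lists `packager_name` for `executor_name`.

    Raises KeyError for executors whose rows are not included in this sketch.
    """
    return packager_name in SUPPORTED_PACKAGERS[executor_name]


print(is_supported("LeptonExecutor", "run.GitArchivePackager"))  # True
```

Raising `KeyError` for unlisted executors, rather than returning `False`, avoids implying that an executor missing from this sketch supports no packagers.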

@@ -264,3 +266,40 @@ def your_dgx_executor(nodes: int, gpus_per_node: int, container_image: str):
```

For a complete end-to-end example using DGX Cloud with NeMo, refer to the [NVIDIA DGX Cloud NeMo End-to-End Workflow Example](https://docs.nvidia.com/dgx-cloud/run-ai/latest/nemo-e2e-example.html).

#### LeptonExecutor

The `LeptonExecutor` uses the NVIDIA DGX Cloud Lepton Python SDK to launch distributed jobs. Through the SDK's API calls, it authenticates, identifies the target node group and resource shape, and submits the job specification, which is then launched as a batch job on the cluster.

Here's an example configuration:

```python
import nemo_run as run


def your_lepton_executor(nodes: int, gpus_per_node: int, container_image: str):
    # Ensure these are set correctly for your DGX Cloud Lepton environment.
    # You might fetch these from environment variables or a config file.
    resource_shape = "gpu.8xh100-80gb"  # Replace with your desired resource shape, which represents the number of GPUs in a pod
    node_group = "my-node-group"  # The node group to run the job in
    nemo_run_dir = "/nemo-workspace/nemo-run"  # The NeMo-Run directory where experiments are saved
    # Define the remote storage directory that will be mounted in the job pods.
    # Ensure the path specified here contains your NEMORUN_HOME.
    storage_path = "/nemo-workspace"  # The remote storage directory to mount in jobs
    mount_path = "/nemo-workspace"  # The path where the remote storage directory will be mounted inside the container

    executor = run.LeptonExecutor(
        resource_shape=resource_shape,
        node_group=node_group,
        container_image=container_image,
        nodes=nodes,
        nemo_run_dir=nemo_run_dir,
        gpus_per_node=gpus_per_node,
        mounts=[{"path": storage_path, "mount_path": mount_path}],
        # Optional: Add custom environment variables or PyTorch specs if needed.
        # common_envs() is not defined in this excerpt; it is assumed to be your
        # own helper that returns a dict of environment variables.
        env_vars=common_envs(),
        # packager=run.GitArchivePackager()  # Choose appropriate packager
    )
    return executor


# Example usage:
executor = your_lepton_executor(nodes=4, gpus_per_node=8, container_image="your-nemo-image")
```
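
The comments in the example above suggest fetching the cluster-specific values from environment variables or a config file. One way to do that is sketched below using only the standard library. The `LEPTON_*` variable names, the `LeptonSettings` class, and `load_lepton_settings` are assumptions made for illustration; NeMo-Run does not define them. The defaults mirror the placeholder paths used above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class LeptonSettings:
    """Cluster-specific values that the example above hard-codes."""

    resource_shape: str
    node_group: str
    nemo_run_dir: str
    storage_path: str
    mount_path: str

    def executor_kwargs(self) -> dict:
        """Keyword arguments named as in the example above."""
        return {
            "resource_shape": self.resource_shape,
            "node_group": self.node_group,
            "nemo_run_dir": self.nemo_run_dir,
            "mounts": [{"path": self.storage_path, "mount_path": self.mount_path}],
        }


def load_lepton_settings(env: Optional[Mapping[str, str]] = None) -> LeptonSettings:
    """Build settings from environment variables (hypothetical LEPTON_* names)."""
    env = os.environ if env is None else env
    required = ("LEPTON_RESOURCE_SHAPE", "LEPTON_NODE_GROUP")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError("missing required settings: " + ", ".join(missing))
    # Optional values fall back to the placeholder paths from the example above.
    storage_path = env.get("LEPTON_STORAGE_PATH", "/nemo-workspace")
    return LeptonSettings(
        resource_shape=env["LEPTON_RESOURCE_SHAPE"],
        node_group=env["LEPTON_NODE_GROUP"],
        nemo_run_dir=env.get("LEPTON_NEMO_RUN_DIR", storage_path + "/nemo-run"),
        storage_path=storage_path,
        mount_path=env.get("LEPTON_MOUNT_PATH", storage_path),
    )
```

With this in place, the executor could be built as `run.LeptonExecutor(container_image=..., nodes=..., gpus_per_node=..., **settings.executor_kwargs())`, keeping cluster details out of the source file.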
1 change: 1 addition & 0 deletions docs/source/guides/why-use-nemo-run.md
@@ -29,6 +29,7 @@ But once defined, it is seamless to launch your tasks. Currently, we support the
- LocalExecutor
- SlurmExecutor
- SkypilotExecutor
- LeptonExecutor

This means that you can launch your configured task on one Slurm cluster or another, on a Kubernetes cluster, on one cloud or another, or on all of them at the same time.

6 changes: 6 additions & 0 deletions docs/source/index.rst
@@ -38,6 +38,12 @@ will install Skypilot w all clouds

You can also manually install Skypilot from https://skypilot.readthedocs.io/en/latest/getting-started/installation.html

If using DGX Cloud Lepton, use the following command to install the Lepton CLI:

``pip install leptonai``

To authenticate with the DGX Cloud Lepton cluster, open the **Settings > Tokens** page in the DGX Cloud Lepton UI, copy the ``lep login`` command shown there, and run it in your terminal.
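
Before launching jobs, it can help to confirm that the install step above actually put the CLI on your ``PATH``. The snippet below is a minimal sketch using only the Python standard library; ``lepton_cli_available`` is a hypothetical helper name, and the check only confirms that a ``lep`` executable can be found, not that you are logged in.

```python
import shutil


def lepton_cli_available(executable: str = "lep") -> bool:
    """Return True if the given executable can be found on PATH."""
    return shutil.which(executable) is not None


if __name__ == "__main__":
    if lepton_cli_available():
        print("Found the lep CLI; run the lep login command from the Tokens page next.")
    else:
        print("lep not found on PATH; try: pip install leptonai")
```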

Make sure you have `pip` installed and configured properly.


4 changes: 3 additions & 1 deletion nemo_run/__init__.py
@@ -19,19 +19,20 @@

 from nemo_run import cli
 from nemo_run.api import autoconvert, dryrun_fn
+from nemo_run.cli.lazy import LazyEntrypoint, lazy_imports
 from nemo_run.config import Config, ConfigurableMixin, Partial, Script
 from nemo_run.core.execution.base import Executor, ExecutorMacros, import_executor
 from nemo_run.core.execution.dgxcloud import DGXCloudExecutor
 from nemo_run.core.execution.docker import DockerExecutor
 from nemo_run.core.execution.launcher import FaultTolerance, SlurmRay, SlurmTemplate, Torchrun
+from nemo_run.core.execution.lepton import LeptonExecutor
 from nemo_run.core.execution.local import LocalExecutor
 from nemo_run.core.execution.skypilot import SkypilotExecutor
 from nemo_run.core.execution.slurm import SlurmExecutor
 from nemo_run.core.packaging import GitArchivePackager, HybridPackager, Packager, PatternPackager
 from nemo_run.core.tunnel.client import LocalTunnel, SSHTunnel
 from nemo_run.devspace.base import DevSpace
 from nemo_run.help import help
-from nemo_run.cli.lazy import LazyEntrypoint, lazy_imports
 from nemo_run.package_info import __package_name__, __version__
 from nemo_run.run.api import run
 from nemo_run.run.experiment import Experiment
@@ -58,6 +59,7 @@
     "GitArchivePackager",
     "PatternPackager",
     "help",
+    "LeptonExecutor",
     "LocalExecutor",
     "LocalTunnel",
     "Packager",
11 changes: 9 additions & 2 deletions nemo_run/core/execution/__init__.py
@@ -13,9 +13,16 @@
# See the License for the specific language governing permissions and
# limitations under the License.

+from nemo_run.core.execution.dgxcloud import DGXCloudExecutor
+from nemo_run.core.execution.lepton import LeptonExecutor
 from nemo_run.core.execution.local import LocalExecutor
 from nemo_run.core.execution.skypilot import SkypilotExecutor
 from nemo_run.core.execution.slurm import SlurmExecutor
-from nemo_run.core.execution.dgxcloud import DGXCloudExecutor

-__all__ = ["LocalExecutor", "SlurmExecutor", "SkypilotExecutor", "DGXCloudExecutor"]
+__all__ = [
+    "LocalExecutor",
+    "SlurmExecutor",
+    "SkypilotExecutor",
+    "DGXCloudExecutor",
+    "LeptonExecutor",
+]