docs: add KubeflowExecutor section to execution guide
Documents PyTorchJob and TrainJob usage, configuration fields,
workdir sync, and packager support in docs/guides/execution.md.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
For a complete end-to-end example using DGX Cloud with NeMo, refer to the [NVIDIA DGX Cloud NeMo End-to-End Workflow Example](https://docs.nvidia.com/dgx-cloud/run-ai/latest/nemo-e2e-example.html).
#### KubeflowExecutor

The `KubeflowExecutor` integrates with the [Kubeflow Training Operator](https://github.com/kubeflow/training-operator) to run distributed training jobs on any Kubernetes cluster. It submits custom resources (CRs) directly through the Kubernetes API, so `kubectl` is not required.

Two job kinds are supported via the `job_kind` parameter:

- **`"PyTorchJob"`** (default): Training Operator v1 (`kubeflow.org/v1`)
- **`"TrainJob"`**: Training Operator v2 (`trainer.kubeflow.org/v1alpha1`)

Kubernetes configuration is loaded automatically: the local kubeconfig is tried first, with a fallback to in-cluster config when running inside a pod.
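
The fallback order described above can be sketched with the official `kubernetes` Python client. This is a minimal illustration, not the executor's actual code: `load_kubernetes_config` and its injectable loader arguments are hypothetical names, and catching a broad `Exception` is a simplification (the client signals a missing kubeconfig with `kubernetes.config.ConfigException`).

```python
from typing import Callable, Optional


def load_kubernetes_config(
    load_local: Optional[Callable[[], None]] = None,
    load_in_cluster: Optional[Callable[[], None]] = None,
) -> str:
    """Try the local kubeconfig first, then fall back to in-cluster config.

    Hypothetical helper for illustration; returns which source was used.
    """
    if load_local is None or load_in_cluster is None:
        # Imported lazily so the fallback logic can be exercised without a cluster.
        from kubernetes import config

        load_local = load_local or config.load_kube_config
        load_in_cluster = load_in_cluster or config.load_incluster_config

    try:
        load_local()  # reads ~/.kube/config or the file named by $KUBECONFIG
        return "kubeconfig"
    except Exception:
        # No usable local kubeconfig: assume we are running inside a pod.
        load_in_cluster()
        return "in-cluster"
```

Passing the loaders as arguments keeps the ordering logic testable on a machine that has neither a kubeconfig nor a service-account token.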

Here's an example configuration:

```python
import nemo_run as run

# PyTorchJob (default)
executor = run.KubeflowExecutor(
    namespace="runai-nemo-ci",
    image="nvcr.io/nvidian/nemo:nightly",
    num_nodes=3,  # total pods: 1 Master + (num_nodes - 1) Workers
    gpus_per_node=8,  # also sets nproc_per_node unless overridden explicitly
    # ... (remaining lines of this call are collapsed in the diff view)
)

# TrainJob (the start of this call is collapsed in the diff view;
# job_kind is inferred from the description above)
executor = run.KubeflowExecutor(
    job_kind="TrainJob",
    runtime_ref="torch-distributed",  # name of the ClusterTrainingRuntime
    namespace="runai-nemo-ci",
    image="nvcr.io/nvidian/nemo:nightly",
    num_nodes=3,
    gpus_per_node=8,
)
```
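
To make the `num_nodes` comment concrete, the sketch below builds a PyTorchJob manifest with one Master and `num_nodes - 1` Worker replicas. It illustrates the Training Operator v1 resource shape only: `build_pytorchjob` is a hypothetical helper, the job name is a placeholder, and the manifest the executor actually generates may differ.

```python
from typing import Any, Dict


def build_pytorchjob(
    name: str, namespace: str, image: str, num_nodes: int, gpus_per_node: int
) -> Dict[str, Any]:
    """Hypothetical helper: a PyTorchJob body with 1 Master + (num_nodes - 1) Workers."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")

    def replica(count: int) -> Dict[str, Any]:
        # Every replica type shares the same pod template in this sketch.
        return {
            "replicas": count,
            "restartPolicy": "Never",
            "template": {
                "spec": {
                    "containers": [
                        {
                            "name": "pytorch",  # container name the operator expects
                            "image": image,
                            "resources": {"limits": {"nvidia.com/gpu": gpus_per_node}},
                        }
                    ]
                }
            },
        }

    specs = {"Master": replica(1)}
    if num_nodes > 1:
        specs["Worker"] = replica(num_nodes - 1)

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "nprocPerNode": str(gpus_per_node),  # one process per GPU by default
            "pytorchReplicaSpecs": specs,
        },
    }


job = build_pytorchjob("demo-job", "runai-nemo-ci", "nvcr.io/nvidian/nemo:nightly", 3, 8)
```

A body like this could be submitted with the `kubernetes` client's `CustomObjectsApi().create_namespaced_custom_object(group="kubeflow.org", version="v1", namespace="runai-nemo-ci", plural="pytorchjobs", body=job)`, which matches the "no `kubectl` required" approach described earlier.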
`cancel(wait=True)` polls until both the CR and all associated pods are fully terminated before returning.
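
The blocking behavior can be pictured as a simple polling loop. The sketch below is an assumption about the general pattern, not the executor's implementation: `wait_until_terminated`, the two probe callables, and the timeout and interval values are all hypothetical.

```python
import time
from typing import Callable


def wait_until_terminated(
    cr_exists: Callable[[], bool],
    pods_remaining: Callable[[], int],
    timeout: float = 300.0,
    interval: float = 2.0,
) -> bool:
    """Poll until the custom resource is gone and no pods remain.

    Returns True on full termination, False if the timeout elapses first.
    """
    deadline = time.monotonic() + timeout
    while True:
        # Both conditions must hold: the CR is deleted and its pods are gone.
        if not cr_exists() and pods_remaining() == 0:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Checking pods as well as the CR matters because pod termination can lag behind deletion of the owning resource, so GPUs may still be held for a short time after the CR disappears.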
#### LeptonExecutor
The `LeptonExecutor` integrates with an NVIDIA DGX Cloud Lepton cluster's Python SDK to launch distributed jobs. It uses API calls behind the Lepton SDK to authenticate, identify the target node group and resource shapes, and submit the job specification, which is then launched as a batch job on the cluster.