feat(perf): add RunAIPlugin for NVIDIA Run:ai (RoCE/SR-IOV) Kubeflow fabric #4395
Open
koiker wants to merge 3 commits into
bf539b4 to
6a31618
ko3n1g
previously approved these changes
Jun 16, 2026
Contributor
Could you please check this path? The new launcher parameters are parsed and added to Please pass the parsed
yaoyu-33
approved these changes
Jun 21, 2026
Contributor
/ok to test d7de967
…fabric

Add a new CSP fabric plugin (`--csp runai`) for on-prem and colocation clusters managed by NVIDIA Run:ai. The plugin injects RoCE/GDR rail extended resources, Multus network-attachment annotations, a memory-backed /dev/shm, and an optional workspace PVC onto the KubeflowExecutor, following the same pattern as EKSEnvPlugin (EFA) and GKEEnvPlugin (gIB).

Changes:
- csp_plugins.py: new RunAIPlugin dataclass with setup() method
- argument_parser.py: add "runai" to --csp choices; add --runai_* args (extended_resources_json, annotations_json, pvc_claim_name, pvc_mount_path, large_shm, env_json)
- setup_experiment.py: import RunAIPlugin, wire --runai_* args to main() signature and CSP plugin selection, filter --runai_* from rank-local script args

Validated on an 8-node B300 NVL72 cluster (56 GPUs, RoCE fabric) running Nemotron 3 30B BF16 pretraining at ~698 TFLOP/s/GPU.

Signed-off-by: Rafael M. Koike <koike.rafael@gmail.com>
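For readers who want a concrete picture of the CLI surface this commit describes, here is a minimal sketch using only the standard library. The flag names follow the `--runai_*` pattern listed above; decoding the JSON flags at parse time, the defaults, and the helper names `build_parser` and `strip_runai_args` are assumptions for illustration, not the code in `argument_parser.py` or `setup_experiment.py`.

```python
import argparse
import json


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical stand-in for the parser in argument_parser.py."""
    parser = argparse.ArgumentParser()
    # "aws" and "gcp" appear in the PR description; other choices are omitted here.
    parser.add_argument("--csp", choices=["aws", "gcp", "runai"], default=None)
    # Assumption: JSON-valued flags are decoded while parsing.
    parser.add_argument("--runai_extended_resources_json", type=json.loads, default=None)
    parser.add_argument("--runai_annotations_json", type=json.loads, default=None)
    parser.add_argument("--runai_env_json", type=json.loads, default=None)
    parser.add_argument("--runai_pvc_claim_name", default=None)
    parser.add_argument("--runai_pvc_mount_path", default=None)
    parser.add_argument("--runai_large_shm", action="store_true")
    return parser


def strip_runai_args(argv: list[str]) -> list[str]:
    """Drop --runai_* flags (and their values) before forwarding argv to rank-local scripts."""
    kept = []
    skip_value = False
    for token in argv:
        if token.startswith("--runai_"):
            # "--flag=value" carries its value inline; otherwise the next
            # non-flag token is treated as this flag's value.
            skip_value = "=" not in token
            continue
        if skip_value and not token.startswith("--"):
            skip_value = False
            continue
        skip_value = False
        kept.append(token)
    return kept


args = build_parser().parse_args(
    ["--csp", "runai", "--runai_extended_resources_json", '{"nvidia.com/r0-p0": "1"}']
)
```

One caveat of this simple filter: a bare positional token placed directly after the boolean `--runai_large_shm` would be dropped as if it were that flag's value, so a real implementation would more likely filter by the parser's known destinations.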
Older nemo_run versions (e.g. 0.9.0rc0) expose `annotations` instead of `pod_annotations` on KubeflowExecutor. Fall back gracefully so the plugin works on clusters that haven't upgraded nemo_run yet. Signed-off-by: Rafael M. Koike <koike.rafael@gmail.com>
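The fallback described in this commit can be sketched roughly as follows. The two executor classes are hypothetical stand-ins for older and newer `KubeflowExecutor` versions, the helper name `apply_annotations` is invented for illustration, and the network names in the example value are placeholders; `k8s.v1.cni.cncf.io/networks` is the standard Multus annotation key.

```python
from dataclasses import dataclass, field


@dataclass
class OldExecutor:
    """Stand-in for an older KubeflowExecutor that exposes `annotations`."""
    annotations: dict = field(default_factory=dict)


@dataclass
class NewExecutor:
    """Stand-in for a newer KubeflowExecutor that exposes `pod_annotations`."""
    pod_annotations: dict = field(default_factory=dict)


def apply_annotations(executor, annotations: dict) -> str:
    """Merge annotations into whichever attribute the executor exposes.

    Returns the attribute name that was used. Keys already set on the
    executor are kept, so user-supplied values are not overwritten.
    """
    attr = "pod_annotations" if hasattr(executor, "pod_annotations") else "annotations"
    current = getattr(executor, attr, None) or {}
    setattr(executor, attr, {**annotations, **current})
    return attr


multus = {"k8s.v1.cni.cncf.io/networks": "rail-0,rail-1"}  # placeholder network names
old = OldExecutor(annotations={"owner": "existing"})
new = NewExecutor()
used_old = apply_annotations(old, {**multus, "owner": "plugin"})
used_new = apply_annotations(new, multus)
```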
Add --kubeflow_api_version {v1,v2} so the Kubeflow launch path can target the
Training Operator v1 PyTorchJob (kubeflow.org/v1) in addition to the v2
TrainJob. v1 is required on clusters running the v1 operator — notably
NVIDIA Run:ai. kubeflow_executor() picks nemo_run's PyTorchJobExecutor for v1
(clear error if the installed nemo_run predates it); since it inherits the
same dataclass fields, the existing field reconciliation works unchanged.
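As a rough illustration of the version switch described above (not the actual `kubeflow_executor()` code), the selection could look like the sketch below. The module is passed in as a parameter so the example runs without `nemo_run` installed; the stand-in classes, the subclass relationship between them, and the assumption that both executors are top-level attributes of the module are all hypothetical.

```python
from types import SimpleNamespace


def select_executor_class(run_module, api_version: str):
    """Pick the executor class for the requested Kubeflow API version."""
    if api_version == "v2":
        return run_module.KubeflowExecutor
    if api_version == "v1":
        cls = getattr(run_module, "PyTorchJobExecutor", None)
        if cls is None:
            # Clear error when the installed package predates the v1 executor.
            raise RuntimeError(
                "--kubeflow_api_version v1 needs a nemo_run build that provides "
                "PyTorchJobExecutor; please upgrade nemo_run."
            )
        return cls
    raise ValueError(f"unsupported Kubeflow API version: {api_version!r}")


# Hypothetical stand-ins for the real executor classes.
class KubeflowExecutor:
    pass


class PyTorchJobExecutor(KubeflowExecutor):
    pass


new_run = SimpleNamespace(KubeflowExecutor=KubeflowExecutor, PyTorchJobExecutor=PyTorchJobExecutor)
old_run = SimpleNamespace(KubeflowExecutor=KubeflowExecutor)
```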
Extend RunAIPlugin with optional scheduler_name (-> spec.schedulerName, e.g.
runai-scheduler, so raw PyTorchJob/TrainJob submissions gang-schedule under
Run:ai instead of the default scheduler) and a generic labels passthrough for
project/queue membership (key varies by Run:ai version). Both route through
pod_spec_overrides / pod_labels, which apply to v1 and v2 alike. Wire through
build_csp_plugin and add --runai_scheduler_name / --runai_labels_json.
Extend the CSP wiring test to assert the new values reach the plugin.
Note: requires bumping the nemo_run pin to a commit that provides
PyTorchJobExecutor and the pod_annotations/pod_labels/extra_resource_* fields.
Signed-off-by: Rafael M. Koike <koike.rafael@gmail.com>
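A minimal sketch of how the scheduler name and label passthrough might land on the executor, assuming `pod_spec_overrides` and `pod_labels` are plain dicts and that the override key is `schedulerName`. `FakeExecutor`, `apply_scheduling`, and the label key `example.com/project` are hypothetical; as the commit notes, the real project/queue label key varies by Run:ai version.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FakeExecutor:
    """Hypothetical stand-in exposing the two fields named in the commit message."""
    pod_spec_overrides: dict = field(default_factory=dict)
    pod_labels: dict = field(default_factory=dict)


def apply_scheduling(executor, scheduler_name: Optional[str] = None,
                     labels: Optional[dict] = None) -> None:
    """Set spec.schedulerName and pod labels without clobbering existing values."""
    if scheduler_name:
        executor.pod_spec_overrides.setdefault("schedulerName", scheduler_name)
    for key, value in (labels or {}).items():
        executor.pod_labels.setdefault(key, value)


executor = FakeExecutor(pod_labels={"team": "perf"})
apply_scheduling(
    executor,
    scheduler_name="runai-scheduler",
    labels={"example.com/project": "benchmarks", "team": "other"},  # placeholder label key
)
```

Because both fields sit on the pod template rather than on the job kind, the same values would apply whether the executor renders a v1 PyTorchJob or a v2 TrainJob.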
Contributor
/ok to test 7f05fb4
Contributor
Hi @koiker, could you please fix the lint error?
Summary
Adds `RunAIPlugin`, a new CSP fabric plugin for on-prem NVIDIA Run:ai clusters that use RoCE/SR-IOV networking with Multus and the Kubeflow Training Operator. This sits alongside the existing `EKSEnvPlugin` (EFA) and `GKEEnvPlugin` (gIB) and follows the same `nemo_run.Plugin` pattern.

What it does
- `csp_plugins.py`: New `RunAIPlugin` dataclass that configures the `KubeflowExecutor` with:
  - RoCE/GDR rail extended resources (`nvidia.com/r0-p0` … `r7-p0`)
  - Multus network-attachment annotations
  - A memory-backed `/dev/shm` volume (avoids the 64 MiB default that starves NCCL)
  - An optional workspace PVC
  - Environment variables (`TRANSFORMERS_OFFLINE`, `HF_HOME`)
  - A fallback for older `nemo_run` versions that expose `annotations` instead of `pod_annotations`
- `argument_parser.py`: Adds the `--csp runai` choice and six `--runai_*` CLI arguments for JSON-encoded extended resources, annotations, PVC config, and env vars.
- `setup_experiment.py`: Wires `RunAIPlugin` instantiation when `--csp runai` is passed, and filters `--runai_*` args from the rank-local script invocation.

Motivation
Run:ai is the workload scheduler on NVIDIA DGX Cloud and on-prem GB200/GB300 NVL72 deployments. Today, launching Megatron-Bridge benchmarks on these clusters requires manual K8s YAML or out-of-tree scripts. This PR makes Run:ai a first-class platform: `dgxc-benchmarking` recipes can use `--csp runai` the same way they use `--csp aws` or `--csp gcp`.

Validation
Tested on an NVIDIA B300 NVL72 cluster (8 nodes / 64 GPUs) with:
- `RunAIPlugin.setup()` correctly configures all executor fields (resources, annotations, volumes, env vars) without overwriting existing values
- `llmb-run` with the new `--csp runai` path: job submitted, tracked, and performance parsed successfully

Test plan
- `RunAIPlugin.setup()` unit test against `KubeflowExecutor` (nemo_run 0.9.0rc0)
- `setup_experiment.py` generates correct KubeflowExecutor config with `--csp runai`
- `llmb-run` with Run:ai backend
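To make the volume and env-var behaviour above concrete, here is a hedged sketch of what a `setup()` along these lines might do. It is not the plugin from this PR: `FakeExecutor`, its field names (`volumes`, `volume_mounts`, `env_vars`), the class name `RunAIPluginSketch`, the volume names, and the default mount path are all assumptions. The Kubernetes pieces are standard: an `emptyDir` with `medium: Memory` is a tmpfs-backed volume, and `persistentVolumeClaim.claimName` references an existing PVC.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FakeExecutor:
    """Hypothetical stand-in; the real KubeflowExecutor field names may differ."""
    volumes: list = field(default_factory=list)
    volume_mounts: list = field(default_factory=list)
    env_vars: dict = field(default_factory=dict)


@dataclass
class RunAIPluginSketch:
    large_shm: bool = True
    pvc_claim_name: Optional[str] = None
    pvc_mount_path: str = "/workspace"  # assumed default, for illustration only
    env: dict = field(default_factory=dict)

    def setup(self, executor) -> None:
        if self.large_shm:
            # tmpfs-backed emptyDir mounted over /dev/shm
            self._add_volume(executor, {"name": "dshm", "emptyDir": {"medium": "Memory"}}, "/dev/shm")
        if self.pvc_claim_name:
            pvc = {"name": "workspace", "persistentVolumeClaim": {"claimName": self.pvc_claim_name}}
            self._add_volume(executor, pvc, self.pvc_mount_path)
        for key, value in self.env.items():
            # Existing env vars win, matching the "without overwriting" behaviour.
            executor.env_vars.setdefault(key, value)

    @staticmethod
    def _add_volume(executor, volume: dict, mount_path: str) -> None:
        # Skip if something is already mounted at this path (also keeps setup() idempotent).
        if any(m["mountPath"] == mount_path for m in executor.volume_mounts):
            return
        executor.volumes.append(volume)
        executor.volume_mounts.append({"name": volume["name"], "mountPath": mount_path})


executor = FakeExecutor(env_vars={"HF_HOME": "/existing/cache"})
plugin = RunAIPluginSketch(
    pvc_claim_name="my-workspace",  # placeholder claim name
    env={"HF_HOME": "/workspace/hf", "TRANSFORMERS_OFFLINE": "1"},
)
plugin.setup(executor)
plugin.setup(executor)  # second call adds nothing new
```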