feat(perf): add RunAIPlugin for NVIDIA Run:ai (RoCE/SR-IOV) Kubeflow fabric by koiker · Pull Request #4395 · NVIDIA-NeMo/Megatron-Bridge

koiker · 2026-06-16T18:51:38Z

Summary

Adds RunAIPlugin, a new CSP fabric plugin for on-prem NVIDIA Run:ai clusters that use RoCE/SR-IOV networking with Multus and the Kubeflow Training Operator. This sits alongside the existing EKSEnvPlugin (EFA) and GKEEnvPlugin (gIB) and follows the same nemo_run.Plugin pattern.

What it does

csp_plugins.py: New RunAIPlugin dataclass that configures the KubeflowExecutor with:
- SR-IOV extended resources (e.g. 8 RoCE rails nvidia.com/r0-p0 … r7-p0)
- Multus network-attachment annotations
- Memory-backed /dev/shm volume (avoids the 64 MiB default that starves NCCL)
- Shared workspace PVC mount
- Extra environment variables (e.g. TRANSFORMERS_OFFLINE, HF_HOME)
- Backward compatibility with older nemo_run versions that expose annotations instead of pod_annotations
argument_parser.py: Adds --csp runai choice and six --runai_* CLI arguments for JSON-encoded extended resources, annotations, PVC config, and env vars.
setup_experiment.py: Wires RunAIPlugin instantiation when --csp runai is passed, and filters --runai_* args from the rank-local script invocation.
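The merge behaviour described above (fill in executor fields without overwriting what is already set) can be sketched as follows. This is a minimal illustration, not the code in this PR: `StubExecutor`, `RunAIPluginSketch`, their field names, and the example annotation value are hypothetical stand-ins for `KubeflowExecutor` and `RunAIPlugin`. Only the two resource names that appear in the description are used.

```python
from dataclasses import dataclass, field


@dataclass
class StubExecutor:
    """Hypothetical stand-in for KubeflowExecutor; field names are assumptions."""

    resources: dict = field(default_factory=dict)
    pod_annotations: dict = field(default_factory=dict)
    volumes: list = field(default_factory=list)
    volume_mounts: list = field(default_factory=list)
    env_vars: dict = field(default_factory=dict)


@dataclass
class RunAIPluginSketch:
    """Illustrative only; not the RunAIPlugin added by this PR."""

    extended_resources: dict = field(default_factory=dict)
    annotations: dict = field(default_factory=dict)
    env: dict = field(default_factory=dict)
    large_shm: bool = True

    def setup(self, executor: StubExecutor) -> None:
        # setdefault keeps any value the caller already configured.
        for key, value in self.extended_resources.items():
            executor.resources.setdefault(key, value)
        for key, value in self.annotations.items():
            executor.pod_annotations.setdefault(key, value)
        for key, value in self.env.items():
            executor.env_vars.setdefault(key, value)
        has_shm = any(v.get("name") == "dshm" for v in executor.volumes)
        if self.large_shm and not has_shm:
            # An emptyDir with medium "Memory" backs /dev/shm with RAM.
            executor.volumes.append({"name": "dshm", "emptyDir": {"medium": "Memory"}})
            executor.volume_mounts.append({"name": "dshm", "mountPath": "/dev/shm"})


executor = StubExecutor(env_vars={"HF_HOME": "/existing/cache"})
plugin = RunAIPluginSketch(
    extended_resources={"nvidia.com/r0-p0": 1, "nvidia.com/r7-p0": 1},
    annotations={"k8s.v1.cni.cncf.io/networks": "example-net"},  # example value
    env={"HF_HOME": "/workspace/hf", "TRANSFORMERS_OFFLINE": "1"},
)
plugin.setup(executor)
```

Using `setdefault` rather than `update` is one way to get the "without overwriting existing values" property listed under Validation; the pre-set `HF_HOME` survives while `TRANSFORMERS_OFFLINE` is added.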

Motivation

Run:ai is the workload scheduler on NVIDIA DGX Cloud and on-prem GB200/GB300 NVL72 deployments. Today, launching Megatron-Bridge benchmarks on these clusters requires manual K8s YAML or out-of-tree scripts. This PR makes Run:ai a first-class platform — dgxc-benchmarking recipes can use --csp runai the same way they use --csp aws or --csp gcp.

Validation

Tested on an NVIDIA B300 NVL72 cluster (8 nodes / 64 GPUs) with:

Unit test: RunAIPlugin.setup() correctly configures all executor fields (resources, annotations, volumes, env vars) without overwriting existing values
End-to-end: Nemotron 3 30B BF16 training via llmb-run with the new --csp runai path — job submitted, tracked, and performance parsed successfully

Test plan

RunAIPlugin.setup() unit test against KubeflowExecutor (nemo_run 0.9.0rc0)
Dry-run: setup_experiment.py generates correct KubeflowExecutor config with --csp runai
Live run: Nemotron 3 30B BF16 on 8×B300 NVL72 via llmb-run with Run:ai backend
CI: existing tests should pass unchanged (no Slurm/AWS/GCP code paths modified)

copy-pr-bot · 2026-06-16T18:51:42Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

yaoyu-33 · 2026-06-17T17:17:13Z

Could you please check this path? The new launcher parameters are parsed and added to main(), but they do not appear to be populated from the parsed CLI args in the bottom main(...) invocation. Because they all have defaults, --runai_* values look like they would be silently ignored, so --csp runai would only install the default plugin behavior.

Please pass the parsed args.runai_* values through and add a dry-run/unit assertion that RunAIPlugin receives the parsed resources/annotations/PVC/env values.
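A rough sketch of the kind of wiring and assertion being requested, assuming the flag names implied by the commit message (`--runai_extended_resources_json`, `--runai_pvc_claim_name`, and so on). The parser below is a reduced stand-in, and `plugin_kwargs_from_args` is a hypothetical helper, not a function in this repository.

```python
import argparse
import json


def build_parser() -> argparse.ArgumentParser:
    """Reduced stand-in parser; the real one defines many more options."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--csp", choices=["aws", "gcp", "runai"], default=None)
    # JSON-encoded options are decoded at parse time so bad input fails early.
    parser.add_argument("--runai_extended_resources_json", type=json.loads, default=None)
    parser.add_argument("--runai_annotations_json", type=json.loads, default=None)
    parser.add_argument("--runai_env_json", type=json.loads, default=None)
    parser.add_argument("--runai_pvc_claim_name", default=None)
    parser.add_argument("--runai_pvc_mount_path", default=None)
    return parser


def plugin_kwargs_from_args(args: argparse.Namespace) -> dict:
    """Collect every parsed --runai_* value, dropping the ones left unset."""
    prefix = "runai_"
    return {
        name[len(prefix):]: value
        for name, value in vars(args).items()
        if name.startswith(prefix) and value is not None
    }


args = build_parser().parse_args([
    "--csp", "runai",
    "--runai_extended_resources_json", '{"nvidia.com/r0-p0": 1}',
    "--runai_pvc_claim_name", "workspace",
])
kwargs = plugin_kwargs_from_args(args)

# The dry-run style assertion: parsed values must reach the plugin kwargs.
assert kwargs == {
    "extended_resources_json": {"nvidia.com/r0-p0": 1},
    "pvc_claim_name": "workspace",
}
```

Collecting the values by prefix means a newly added `--runai_*` flag cannot be dropped by a forgotten keyword in the `main(...)` call, which is the failure mode described in this comment.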

yaoyu-33 · 2026-06-21T03:56:32Z

/ok to test d7de967

…fabric

Add a new CSP fabric plugin (`--csp runai`) for on-prem and colocation clusters managed by NVIDIA Run:ai. The plugin injects RoCE/GDR rail extended resources, Multus network-attachment annotations, a memory-backed /dev/shm, and an optional workspace PVC onto the KubeflowExecutor, following the same pattern as EKSEnvPlugin (EFA) and GKEEnvPlugin (gIB).

Changes:
- csp_plugins.py: new RunAIPlugin dataclass with setup() method
- argument_parser.py: add "runai" to --csp choices; add --runai_* args (extended_resources_json, annotations_json, pvc_claim_name, pvc_mount_path, large_shm, env_json)
- setup_experiment.py: import RunAIPlugin, wire --runai_* args to main() signature and CSP plugin selection, filter --runai_* from rank-local script args

Validated on an 8-node B300 NVL72 cluster (64 GPUs, RoCE fabric) running Nemotron 3 30B BF16 pretraining at ~698 TFLOP/s/GPU.

Signed-off-by: Rafael M. Koike <koike.rafael@gmail.com>

Older nemo_run versions (e.g. 0.9.0rc0) expose `annotations` instead of `pod_annotations` on KubeflowExecutor. Fall back gracefully so the plugin works on clusters that haven't upgraded nemo_run yet.

Signed-off-by: Rafael M. Koike <koike.rafael@gmail.com>
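One way such a fallback could look, as a sketch under assumptions: the two executor classes below are hypothetical stubs that model only the field difference named in the commit message, and `merge_annotations` is an illustrative helper rather than the code in this PR.

```python
from dataclasses import dataclass, field


@dataclass
class NewStyleExecutor:
    """Stub modelling an executor that exposes pod_annotations."""

    pod_annotations: dict = field(default_factory=dict)


@dataclass
class OldStyleExecutor:
    """Stub modelling an older executor that exposes annotations."""

    annotations: dict = field(default_factory=dict)


def merge_annotations(executor, new_annotations: dict) -> str:
    """Merge into whichever annotation field exists; return the field name used."""
    for field_name in ("pod_annotations", "annotations"):
        if hasattr(executor, field_name):
            current = getattr(executor, field_name) or {}
            # Existing keys win so user-supplied annotations are not overwritten.
            setattr(executor, field_name, {**new_annotations, **current})
            return field_name
    raise AttributeError("executor exposes neither pod_annotations nor annotations")


old = OldStyleExecutor(annotations={"a": "keep"})
new = NewStyleExecutor()
used_old = merge_annotations(old, {"a": "x", "b": "y"})
used_new = merge_annotations(new, {"b": "y"})
```

Probing with `hasattr` keeps the plugin independent of a version check, so it works with whichever field the installed release provides.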

Add --kubeflow_api_version {v1,v2} so the Kubeflow launch path can target the Training Operator v1 PyTorchJob (kubeflow.org/v1) in addition to the v2 TrainJob. v1 is required on clusters running the v1 operator, notably NVIDIA Run:ai. kubeflow_executor() picks nemo_run's PyTorchJobExecutor for v1 (clear error if the installed nemo_run predates it); since it inherits the same dataclass fields, the existing field reconciliation works unchanged.

Extend RunAIPlugin with optional scheduler_name (-> spec.schedulerName, e.g. runai-scheduler, so raw PyTorchJob/TrainJob submissions gang-schedule under Run:ai instead of the default scheduler) and a generic labels passthrough for project/queue membership (key varies by Run:ai version). Both route through pod_spec_overrides / pod_labels, which apply to v1 and v2 alike. Wire through build_csp_plugin and add --runai_scheduler_name / --runai_labels_json. Extend the CSP wiring test to assert the new values reach the plugin.

Note: requires bumping the nemo_run pin to a commit that provides PyTorchJobExecutor and the pod_annotations/pod_labels/extra_resource_* fields.

Signed-off-by: Rafael M. Koike <koike.rafael@gmail.com>
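The version selection and scheduler override described in this commit might be sketched as below. Everything here is an assumption made for illustration: `select_executor_class` and `with_scheduler` are hypothetical helpers, the stub `KubeflowExecutor` class and the `SimpleNamespace` stand in for the real nemo_run module, and the mapping of v2 to `KubeflowExecutor` is inferred from the message rather than confirmed by it.

```python
from types import SimpleNamespace


def select_executor_class(api_version: str, module):
    """Return the executor class for a Kubeflow API version, or fail clearly."""
    class_names = {"v1": "PyTorchJobExecutor", "v2": "KubeflowExecutor"}
    if api_version not in class_names:
        raise ValueError(f"unsupported Kubeflow API version: {api_version!r}")
    executor_cls = getattr(module, class_names[api_version], None)
    if executor_cls is None:
        raise RuntimeError(
            f"{class_names[api_version]} is not available in the installed "
            f"nemo_run; upgrade it to use --kubeflow_api_version {api_version}"
        )
    return executor_cls


def with_scheduler(pod_spec_overrides: dict, scheduler_name=None) -> dict:
    """Return a copy of the overrides with schedulerName set when requested."""
    merged = dict(pod_spec_overrides)
    if scheduler_name:
        # setdefault leaves an explicitly configured scheduler untouched.
        merged.setdefault("schedulerName", scheduler_name)
    return merged


class KubeflowExecutor:
    """Stub class; stands in for the real executor."""


# Models an older install that lacks PyTorchJobExecutor.
old_nemo_run = SimpleNamespace(KubeflowExecutor=KubeflowExecutor)

v2_cls = select_executor_class("v2", old_nemo_run)
overrides = with_scheduler({}, "runai-scheduler")
```

Looking the class up by name at call time, instead of importing it at module load, is what allows the clear error when the installed package predates the v1 executor.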

yaoyu-33 · 2026-06-29T23:57:37Z

/ok to test 7f05fb4

suiyoubi · 2026-06-30T18:12:41Z

Hi @koiker, could you please fix the lint error?

koiker requested review from a team, erhoo82 and malay-nagda as code owners June 16, 2026 18:51

github-actions Bot added the community-request label Jun 16, 2026

koiker force-pushed the runai-csp-plugin branch 2 times, most recently from bf539b4 to 6a31618 June 16, 2026 18:56

ko3n1g previously approved these changes Jun 16, 2026


yaoyu-33 added labels Jun 16, 2026: area:perf (Performance optimizations and benchmarking), feature (New capabilities, enhancements, or enablement work), needs-more-tests (Requires additional L0 and L1 test coverage before merge), needs-review (PR is ready for code review and waiting on a reviewer)

koiker dismissed ko3n1g’s stale review via d7de967 June 18, 2026 19:06

yaoyu-33 approved these changes Jun 21, 2026


copy-pr-bot Bot temporarily deployed to public June 21, 2026 03:57 Inactive

copy-pr-bot Bot temporarily deployed to public June 21, 2026 06:19 Inactive

copy-pr-bot Bot temporarily deployed to public June 21, 2026 07:07 Inactive

koiker added 3 commits June 23, 2026 21:41

koiker force-pushed the runai-csp-plugin branch from e9bbce6 to 7f05fb4 June 24, 2026 01:42

yaoyu-33 added the ready-to-merge label (PR is approved, current, and only waiting for CI to pass before merge) and removed the needs-review label (PR is ready for code review and waiting on a reviewer) Jun 29, 2026

copy-pr-bot Bot temporarily deployed to public June 29, 2026 23:58 Inactive

copy-pr-bot Bot temporarily deployed to public June 30, 2026 00:31 Inactive

copy-pr-bot Bot temporarily deployed to public June 30, 2026 00:52 Inactive

suiyoubi added the waiting-on-customer label (Waiting on the original author to respond) Jun 30, 2026

koiker wants to merge 3 commits into NVIDIA-NeMo:main from koiker:runai-csp-plugin

koiker commented Jun 16, 2026

Uh oh!

copy-pr-bot Bot commented Jun 16, 2026

Uh oh!

yaoyu-33 commented Jun 17, 2026

Uh oh!

yaoyu-33 commented Jun 21, 2026

Uh oh!

yaoyu-33 commented Jun 29, 2026

Uh oh!

suiyoubi commented Jun 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Uh oh!

Conversation

koiker commented Jun 16, 2026

Summary

What it does

Motivation

Validation

Test plan

Uh oh!

copy-pr-bot Bot commented Jun 16, 2026

Uh oh!

yaoyu-33 commented Jun 17, 2026

Uh oh!

yaoyu-33 commented Jun 21, 2026

Uh oh!

yaoyu-33 commented Jun 29, 2026

Uh oh!

suiyoubi commented Jun 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants