feat(perf): add RunAIPlugin for NVIDIA Run:ai (RoCE/SR-IOV) Kubeflow fabric #4395

Open
koiker wants to merge 3 commits into NVIDIA-NeMo:main from koiker:runai-csp-plugin

koiker commented Jun 16, 2026

Summary

Adds RunAIPlugin, a new CSP fabric plugin for on-prem NVIDIA Run:ai clusters that use RoCE/SR-IOV networking with Multus and the Kubeflow Training Operator. This sits alongside the existing EKSEnvPlugin (EFA) and GKEEnvPlugin (gIB) and follows the same nemo_run.Plugin pattern.

What it does

  • csp_plugins.py: New RunAIPlugin dataclass that configures the KubeflowExecutor with:
    • SR-IOV extended resources (e.g. 8 RoCE rails, nvidia.com/r0-p0 through r7-p0)
    • Multus network-attachment annotations
    • Memory-backed /dev/shm volume (avoids the 64 MiB default that starves NCCL)
    • Shared workspace PVC mount
    • Extra environment variables (e.g. TRANSFORMERS_OFFLINE, HF_HOME)
    • Backward compatibility with older nemo_run versions that expose annotations instead of pod_annotations
  • argument_parser.py: Adds --csp runai choice and six --runai_* CLI arguments for JSON-encoded extended resources, annotations, PVC config, and env vars.
  • setup_experiment.py: Wires RunAIPlugin instantiation when --csp runai is passed, and filters --runai_* args from the rank-local script invocation.
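
For readers who want a concrete picture of the setup() behavior listed above, here is a minimal sketch. It is not the code in csp_plugins.py: `FakeKubeflowExecutor`, `RunAIPluginSketch`, and the field names `extra_resource_limits`, `volumes`, `volume_mounts`, and `env_vars` are assumptions made for illustration, as are the volume names and example values. Only `pod_annotations` is a field name mentioned in this PR. The sketch shows the merge-without-overwrite idea and a memory-backed emptyDir mounted at /dev/shm.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class FakeKubeflowExecutor:
    """Hypothetical stand-in for the real executor; field names are assumed."""
    extra_resource_limits: Dict[str, str] = field(default_factory=dict)
    pod_annotations: Dict[str, str] = field(default_factory=dict)
    volumes: List[Dict[str, Any]] = field(default_factory=list)
    volume_mounts: List[Dict[str, Any]] = field(default_factory=list)
    env_vars: Dict[str, str] = field(default_factory=dict)


@dataclass
class RunAIPluginSketch:
    """Illustrative plugin: merges fabric settings onto an executor."""
    extended_resources: Dict[str, str] = field(default_factory=dict)
    annotations: Dict[str, str] = field(default_factory=dict)
    pvc_claim_name: Optional[str] = None
    pvc_mount_path: str = "/workspace"
    large_shm: bool = True
    env: Dict[str, str] = field(default_factory=dict)

    def setup(self, executor: FakeKubeflowExecutor) -> None:
        # setdefault keeps any value the caller already placed on the executor.
        for key, value in self.extended_resources.items():
            executor.extra_resource_limits.setdefault(key, value)
        for key, value in self.annotations.items():
            executor.pod_annotations.setdefault(key, value)
        for key, value in self.env.items():
            executor.env_vars.setdefault(key, value)
        if self.large_shm:
            # A memory-backed emptyDir replaces the small default /dev/shm.
            self._add_volume(
                executor, {"name": "dshm", "emptyDir": {"medium": "Memory"}}, "/dev/shm"
            )
        if self.pvc_claim_name:
            pvc = {"claimName": self.pvc_claim_name}
            self._add_volume(
                executor,
                {"name": "workspace", "persistentVolumeClaim": pvc},
                self.pvc_mount_path,
            )

    @staticmethod
    def _add_volume(executor, volume: Dict[str, Any], mount_path: str) -> None:
        # Skip the volume if one with the same name is already configured.
        if any(v["name"] == volume["name"] for v in executor.volumes):
            return
        executor.volumes.append(volume)
        executor.volume_mounts.append({"name": volume["name"], "mountPath": mount_path})


# The executor already carries HF_HOME; the plugin must not clobber it.
executor = FakeKubeflowExecutor(env_vars={"HF_HOME": "/preset"})
RunAIPluginSketch(
    extended_resources={"nvidia.com/r0-p0": "1"},
    annotations={"k8s.v1.cni.cncf.io/networks": "default/r0-p0"},
    pvc_claim_name="workspace-pvc",
    env={"HF_HOME": "/workspace/hf", "TRANSFORMERS_OFFLINE": "1"},
).setup(executor)
```

Using `setdefault` rather than plain assignment is one way to get the "without overwriting existing values" property that the unit test in the Validation section checks for.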

Motivation

Run:ai is the workload scheduler on NVIDIA DGX Cloud and on-prem GB200/GB300 NVL72 deployments. Today, launching Megatron-Bridge benchmarks on these clusters requires manual K8s YAML or out-of-tree scripts. This PR makes Run:ai a first-class platform — dgxc-benchmarking recipes can use --csp runai the same way they use --csp aws or --csp gcp.

Validation

Tested on an NVIDIA B300 NVL72 cluster (8 nodes / 64 GPUs) with:

  • Unit test: RunAIPlugin.setup() correctly configures all executor fields (resources, annotations, volumes, env vars) without overwriting existing values
  • End-to-end: Nemotron 3 30B BF16 training via llmb-run with the new --csp runai path — job submitted, tracked, and performance parsed successfully

Test plan

  • RunAIPlugin.setup() unit test against KubeflowExecutor (nemo_run 0.9.0rc0)
  • Dry-run: setup_experiment.py generates correct KubeflowExecutor config with --csp runai
  • Live run: Nemotron 3 30B BF16 on 8×B300 NVL72 via llmb-run with Run:ai backend
  • CI: existing tests should pass unchanged (no Slurm/AWS/GCP code paths modified)

koiker requested review from a team, erhoo82 and malay-nagda as code owners June 16, 2026 18:51

copy-pr-bot Bot commented Jun 16, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

koiker force-pushed the runai-csp-plugin branch 2 times, most recently from bf539b4 to 6a31618 June 16, 2026 18:56
ko3n1g previously approved these changes Jun 16, 2026
yaoyu-33 added the area:perf (Performance optimizations and benchmarking), feature (New capabilities, enhancements, or enablement work), needs-more-tests (Requires additional L0 and L1 test coverage before merge), and needs-review (PR is ready for code review and waiting on a reviewer) labels Jun 16, 2026
@yaoyu-33

Contributor

Could you please check this path? The new launcher parameters are parsed and added to main(), but they do not appear to be populated from the parsed CLI args in the bottom main(...) invocation. Because they all have defaults, --runai_* values look like they would be silently ignored, so --csp runai would only install the default plugin behavior.

Please pass the parsed args.runai_* values through and add a dry-run/unit assertion that RunAIPlugin receives the parsed resources/annotations/PVC/env values.
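
To make the request above concrete, here is a rough sketch of the kind of wiring check being asked for. The flag names follow the --runai_* list in the commit message, but the argument types, the defaults, and the helper `plugin_kwargs_from_args` are assumptions for illustration and not the actual code in argument_parser.py or setup_experiment.py.

```python
import argparse
import json


def build_parser() -> argparse.ArgumentParser:
    """Parser sketch with a subset of the --runai_* flags (types/defaults assumed)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--csp", choices=["aws", "gcp", "runai"], default=None)
    # type=json.loads turns the JSON string into a dict at parse time.
    parser.add_argument("--runai_extended_resources_json", type=json.loads, default={})
    parser.add_argument("--runai_annotations_json", type=json.loads, default={})
    parser.add_argument("--runai_env_json", type=json.loads, default={})
    parser.add_argument("--runai_pvc_claim_name", default=None)
    return parser


def plugin_kwargs_from_args(args: argparse.Namespace) -> dict:
    """Hypothetical helper: forward every parsed --runai_* value to the plugin."""
    return {
        "extended_resources": args.runai_extended_resources_json,
        "annotations": args.runai_annotations_json,
        "env": args.runai_env_json,
        "pvc_claim_name": args.runai_pvc_claim_name,
    }


args = build_parser().parse_args(
    [
        "--csp", "runai",
        "--runai_extended_resources_json", '{"nvidia.com/r0-p0": "1"}',
        "--runai_env_json", '{"TRANSFORMERS_OFFLINE": "1"}',
        "--runai_pvc_claim_name", "workspace-pvc",
    ]
)
kwargs = plugin_kwargs_from_args(args)

# The assertion the review asks for: non-default values must reach the plugin.
assert kwargs["extended_resources"] == {"nvidia.com/r0-p0": "1"}
assert kwargs["pvc_claim_name"] == "workspace-pvc"
```

A test of this shape fails if the values fall back to their defaults, which is the silent-ignore case described in the comment.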

@yaoyu-33

Contributor

/ok to test d7de967

koiker added 3 commits June 23, 2026 21:41
…fabric

Add a new CSP fabric plugin (`--csp runai`) for on-prem and colocation
clusters managed by NVIDIA Run:ai. The plugin injects RoCE/GDR rail
extended resources, Multus network-attachment annotations, a
memory-backed /dev/shm, and an optional workspace PVC onto the
KubeflowExecutor — following the same pattern as EKSEnvPlugin (EFA)
and GKEEnvPlugin (gIB).

Changes:
- csp_plugins.py: new RunAIPlugin dataclass with setup() method
- argument_parser.py: add "runai" to --csp choices; add --runai_*
  args (extended_resources_json, annotations_json, pvc_claim_name,
  pvc_mount_path, large_shm, env_json)
- setup_experiment.py: import RunAIPlugin, wire --runai_* args to
  main() signature and CSP plugin selection, filter --runai_* from
  rank-local script args
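
As an illustration of the filtering step in the last item, the following is a minimal sketch of dropping --runai_* flags before arguments are forwarded to the rank-local script. The function name `strip_runai_args` and the token-scanning approach are assumptions. The real setup_experiment.py may filter differently.

```python
from typing import List


def strip_runai_args(argv: List[str]) -> List[str]:
    """Drop --runai_* flags (and their values) from an argv-style list."""
    kept: List[str] = []
    skip_value = False
    for token in argv:
        if skip_value:
            skip_value = False
            # The token after a --runai_* flag is its value unless it is a flag.
            if not token.startswith("--"):
                continue
        if token.startswith("--runai_"):
            # "--flag=value" carries its value inline; otherwise look at the next token.
            skip_value = "=" not in token
            continue
        kept.append(token)
    return kept


filtered = strip_runai_args(
    [
        "--model", "x",
        "--runai_env_json", '{"A": "1"}',
        "--runai_large_shm",
        "--gpus", "8",
        "--runai_pvc_claim_name=ws",
    ]
)
```

One limitation of this sketch: a value-less --runai_* flag followed directly by a positional argument would swallow that argument, so a parser-aware filter is safer when positionals are in play.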

Validated on an 8-node B300 NVL72 cluster (64 GPUs, RoCE fabric)
running Nemotron 3 30B BF16 pretraining at ~698 TFLOP/s/GPU.

Signed-off-by: Rafael M. Koike <koike.rafael@gmail.com>

Older nemo_run versions (e.g. 0.9.0rc0) expose `annotations` instead of
`pod_annotations` on KubeflowExecutor.  Fall back gracefully so the
plugin works on clusters that haven't upgraded nemo_run yet.

Signed-off-by: Rafael M. Koike <koike.rafael@gmail.com>
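
The fallback described in the commit above could look roughly like the sketch below. The helper name `merge_pod_annotations` and the use of `SimpleNamespace` objects as executor stand-ins are assumptions for illustration. Only the two field names, `pod_annotations` and `annotations`, come from the commit message.

```python
from types import SimpleNamespace
from typing import Dict


def merge_pod_annotations(executor, annotations: Dict[str, str]) -> str:
    """Merge annotations into whichever field the executor exposes.

    Prefers `pod_annotations` and falls back to `annotations` for older
    executors. Existing keys win over the new ones. Returns the field used.
    """
    for field_name in ("pod_annotations", "annotations"):
        if hasattr(executor, field_name):
            current = getattr(executor, field_name) or {}
            setattr(executor, field_name, {**annotations, **current})
            return field_name
    raise AttributeError("executor exposes neither pod_annotations nor annotations")


# Stand-ins for a newer and an older executor (hypothetical objects).
newer = SimpleNamespace(pod_annotations={})
older = SimpleNamespace(annotations={"team": "perf"})

multus = {"k8s.v1.cni.cncf.io/networks": "default/r0-p0", "team": "ignored"}
used_newer = merge_pod_annotations(newer, multus)
used_older = merge_pod_annotations(older, multus)
```

Raising when neither field exists keeps a misconfigured executor from silently losing its network attachments.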
Add --kubeflow_api_version {v1,v2} so the Kubeflow launch path can target the
Training Operator v1 PyTorchJob (kubeflow.org/v1) in addition to the v2
TrainJob. v1 is required on clusters running the v1 operator — notably
NVIDIA Run:ai. kubeflow_executor() picks nemo_run's PyTorchJobExecutor for v1
(clear error if the installed nemo_run predates it); since it inherits the
same dataclass fields, the existing field reconciliation works unchanged.

Extend RunAIPlugin with optional scheduler_name (-> spec.schedulerName, e.g.
runai-scheduler, so raw PyTorchJob/TrainJob submissions gang-schedule under
Run:ai instead of the default scheduler) and a generic labels passthrough for
project/queue membership (key varies by Run:ai version). Both route through
pod_spec_overrides / pod_labels, which apply to v1 and v2 alike. Wire through
build_csp_plugin and add --runai_scheduler_name / --runai_labels_json.
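
A minimal sketch of the scheduler and label passthrough described above. `pod_spec_overrides` and `pod_labels` are the field names given in the commit message, but treating both as plain dicts, the helper `apply_runai_scheduling`, and the label key `example.com/project` are assumptions (the commit notes that the real key varies by Run:ai version).

```python
from types import SimpleNamespace
from typing import Dict, Optional


def apply_runai_scheduling(
    executor,
    scheduler_name: Optional[str] = None,
    labels: Optional[Dict[str, str]] = None,
) -> None:
    """Route the scheduler name and labels onto the executor, keeping existing values."""
    if scheduler_name:
        # Ends up as spec.schedulerName on the pod template.
        executor.pod_spec_overrides.setdefault("schedulerName", scheduler_name)
    for key, value in (labels or {}).items():
        executor.pod_labels.setdefault(key, value)


# Hypothetical executor stand-in with dict-valued fields.
executor = SimpleNamespace(pod_spec_overrides={}, pod_labels={"owner": "perf-team"})
apply_runai_scheduling(
    executor,
    scheduler_name="runai-scheduler",
    labels={"example.com/project": "benchmarks", "owner": "someone-else"},
)
```

Because both values are optional, calling the helper with no arguments leaves the executor untouched, which matches the "optional" wording in the commit.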

Extend the CSP wiring test to assert the new values reach the plugin.

Note: requires bumping the nemo_run pin to a commit that provides
PyTorchJobExecutor and the pod_annotations/pod_labels/extra_resource_* fields.

Signed-off-by: Rafael M. Koike <koike.rafael@gmail.com>
koiker force-pushed the runai-csp-plugin branch from e9bbce6 to 7f05fb4 June 24, 2026 01:42
yaoyu-33 added the ready-to-merge (PR is approved, current, and only waiting for CI to pass before merge) label and removed the needs-review (PR is ready for code review and waiting on a reviewer) label Jun 29, 2026
@yaoyu-33

Contributor

/ok to test 7f05fb4

@suiyoubi

Contributor

Hi @koiker, could you please fix the lint error?

suiyoubi added the waiting-on-customer (Waiting on the original author to respond) label Jun 30, 2026

Labels

  • area:perf: Performance optimizations and benchmarking
  • community-request
  • feature: New capabilities, enhancements, or enablement work
  • needs-more-tests: Requires additional L0 and L1 test coverage before merge
  • ready-to-merge: PR is approved, current, and only waiting for CI to pass before merge
  • waiting-on-customer: Waiting on the original author to respond

5 participants