Self-deploying Soperator on any Kubernetes

Follow the steps below to deploy Soperator on Kubernetes clusters outside Nebius AI, including on-premises environments and other cloud providers.

Networking requirement

Important

Soperator requires a CNI that preserves the client source IP. The operator will not work if kube-proxy is configured in IPVS mode, or if you use CNI plugins such as kube-router or Antrea Proxy.

This operator has been tested with the Cilium network plugin running in kube-proxy replacement mode.
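As a rough pre-flight check for the IPVS case, the sketch below reads the `mode` field from the text of a kube-proxy configuration. It assumes you have already saved that configuration to a string (on kubeadm-style clusters it usually lives in the `kube-proxy` ConfigMap in `kube-system`, but your distribution may differ). The function names are hypothetical and not part of Soperator, and the check does not cover kube-router or Antrea Proxy.

```python
import re


def kube_proxy_mode(config_text: str) -> str:
    """Return the proxy mode declared in a kube-proxy config, or "" if unset.

    Assumption: the config is YAML with a `mode:` key on its own line.
    An empty value means kube-proxy falls back to its platform default.
    """
    match = re.search(
        r'^[ \t]*mode:[ \t]*["\']?([A-Za-z]*)["\']?[ \t]*$',
        config_text,
        re.MULTILINE,
    )
    return match.group(1).lower() if match else ""


def ipvs_mode_detected(config_text: str) -> bool:
    """True when kube-proxy is explicitly configured for IPVS mode."""
    return kube_proxy_mode(config_text) == "ipvs"


if __name__ == "__main__":
    sample = 'kind: KubeProxyConfiguration\nmode: "ipvs"\n'
    if ipvs_mode_detected(sample):
        print("kube-proxy is in IPVS mode; Soperator is not expected to work.")
```

A negative result only rules out the explicit IPVS setting. It does not prove that the CNI preserves the client source IP.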

Installation

  1. Decide which shared storage technology to use. At least one shared filesystem is necessary, because it stores the environment shared by Slurm nodes. The only thing Soperator requires is the name of the PersistentVolumeClaim (PVC). Consider using NFS as the simplest option, or something more advanced like OpenEBS or GlusterFS.
  2. Install the NVIDIA GPU Operator.
  3. If you use InfiniBand, install the NVIDIA Network Operator.
  4. Install Soperator by applying the helm/soperator Helm chart.
  5. Create a Slurm cluster in a namespace with the same name as the Slurm cluster by applying the helm/slurm-cluster Helm chart.
  6. Wait until the slurm.nebius.ai/SlurmCluster resource becomes Available.
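Steps 4 and 5 can be sketched as two `helm install` invocations. The snippet below only builds and prints the commands. It assumes you run them from a checkout of this repository, so that `helm/soperator` and `helm/slurm-cluster` resolve as local chart paths. The operator release name, the operator namespace `soperator-system`, and the use of the cluster name as the release name are illustrative assumptions. Any chart values you need (for example, the PVC name from step 1) are not shown.

```python
import shlex
from typing import List


def helm_install_commands(
    cluster_name: str,
    operator_namespace: str = "soperator-system",  # hypothetical namespace
) -> List[List[str]]:
    """Build the helm commands for steps 4 and 5 without running them."""
    install_operator = [
        "helm", "install", "soperator", "helm/soperator",
        "--namespace", operator_namespace, "--create-namespace",
    ]
    # Step 5: the namespace must have the same name as the Slurm cluster.
    install_cluster = [
        "helm", "install", cluster_name, "helm/slurm-cluster",
        "--namespace", cluster_name, "--create-namespace",
    ]
    return [install_operator, install_cluster]


if __name__ == "__main__":
    for argv in helm_install_commands("my-slurm"):
        print(shlex.join(argv))
```

Printing the commands instead of executing them lets you review and adjust them, for instance by adding `--values` with your own values file, before applying anything to the cluster.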

Notes and limitations

Warning

Although Soperator should be compatible with any Kubernetes installation in principle, we haven't tested it anywhere outside Nebius AI, so it's likely that something won't work out of the box or will require additional configuration.

If you run into problems, create an issue in this repository. We will help you install Soperator on your Kubernetes cluster and update these docs.