Auto RDMA device injection for GPU containers on Kubernetes (no privileged mode) #7914
jiusanzhou
started this conversation in
Show and tell
Hey DeepSpeed community! 👋
Running DeepSpeed distributed training on Kubernetes with InfiniBand/RoCE? You've probably hit RDMA device management pain:
- Containers need `/dev/infiniband/*` for NCCL RDMA transport, but don't have access by default
- The usual workarounds are `privileged: true` or manual device mounts, which are fragile and insecure

I built k8s-rdma-device-plugin to solve this:

- No `privileged: true` required
- Advertises `rdma.io/hca` to kubelet

With `gpuRdmaAutoInject=true`, any pod with `NVIDIA_VISIBLE_DEVICES` automatically gets the correct RDMA devices injected based on PCIe topology. Zero manual configuration needed.

Quick deploy
Repo: https://github.com/jiusanzhou/k8s-rdma-device-plugin
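
For anyone curious what "based on PCIe topology" can mean in practice, here is a rough sketch of the general idea, not the plugin's actual code. It assumes each device is described by its resolved sysfs path and ranks HCAs by how many leading path components (PCIe ancestors) they share with the GPU. The function names `common_depth` and `closest_hcas` and all example paths are hypothetical.

```python
from __future__ import annotations


def common_depth(a: str, b: str) -> int:
    """Count the leading path components two sysfs device paths share."""
    depth = 0
    for x, y in zip(a.strip("/").split("/"), b.strip("/").split("/")):
        if x != y:
            break
        depth += 1
    return depth


def closest_hcas(gpu_path: str, hca_paths: dict[str, str]) -> list[str]:
    """Rank HCA names nearest-first by shared PCIe ancestry with the GPU.

    Ties are broken by name so the ordering is deterministic.
    """
    return sorted(
        hca_paths,
        key=lambda name: (-common_depth(gpu_path, hca_paths[name]), name),
    )


# Made-up example paths. On a real node they could be obtained with
# os.path.realpath() on /sys/bus/pci/devices/<addr> for the GPU and on
# /sys/class/infiniband/<name>/device for each HCA (an assumption about
# the node layout, not something taken from the plugin).
gpu = "/sys/devices/pci0000:17/0000:17:00.0/0000:18:00.0/0000:19:00.0"
hcas = {
    "mlx5_0": "/sys/devices/pci0000:17/0000:17:00.0/0000:18:00.0/0000:1a:00.0",
    "mlx5_1": "/sys/devices/pci0000:85/0000:85:00.0/0000:86:00.0",
}

print(closest_hcas(gpu, hcas))
```

In this example `mlx5_0` sits under the same PCIe switch as the GPU, so it ranks ahead of `mlx5_1`, which hangs off a different root complex. A real implementation would likely also weigh NUMA node and switch type.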
Feedback and PRs welcome!