Commit 00e3293

csplinter authored and easymrgr committed
Add affinity for scheduling MPI job, add example of successful test output
cr: https://code.amazon.com/reviews/CR-255417335
1 parent 3a23761 commit 00e3293

1 file changed

Lines changed: 62 additions & 1 deletion

latest/ug/ml/ml-eks-nvidia-ultraserver.adoc
@@ -271,7 +271,7 @@ For a multi-node NVLINK NCCL test and other micro-benchmarks review the https://
 kubectl create -f https://github.com/kubeflow/mpi-operator/releases/download/v0.7.0/mpi-operator.yaml
 ----
 +
-. Create a Helm values file named `nvbandwidth-test-job.yaml` that defines the test manifest. Note the `nvidia.com/gpu.clique` pod affinity to schedule the workers in the same NVLink domain which has Multi-Node NVLink reachability.
+. Create a Helm values file named `nvbandwidth-test-job.yaml` that defines the test manifest. Note the `nvidia.com/gpu.clique` pod affinity to schedule the workers in the same NVLink domain which has Multi-Node NVLink reachability. The sample below runs a multi-node device-to-device CE Read memcpy test using cuMemcpyAsync and prints the results in the logs.
 +
 As of NVIDIA DRA Driver version `v25.8.0` ComputeDomains are elastic and `.spec.numNodes` can be set to `0` in the ComputeDomain definition. Review the latest https://github.com/NVIDIA/k8s-dra-driver-gpu[NVIDIA DRA Driver release notes] for updates.
 +
@@ -307,6 +307,17 @@ spec:
           labels:
             nvbandwidth-test-replica: mpi-launcher
         spec:
+          affinity:
+            nodeAffinity:
+              requiredDuringSchedulingIgnoredDuringExecution:
+                nodeSelectorTerms:
+                - matchExpressions:
+                  # Only schedule on NVIDIA GB200/GB300 nodes
+                  - key: node.kubernetes.io/instance-type
+                    operator: In
+                    values:
+                    - p6e-gb200.36xlarge
+                    - p6e-gb300.36xlarge
           containers:
           - image: ghcr.io/nvidia/k8s-samples:nvbandwidth-v0.7-8d103163
             name: mpi-launcher
@@ -334,6 +345,16 @@ spec:
             nvbandwidth-test-replica: mpi-worker
         spec:
           affinity:
+            nodeAffinity:
+              requiredDuringSchedulingIgnoredDuringExecution:
+                nodeSelectorTerms:
+                - matchExpressions:
+                  # Only schedule on NVIDIA GB200/GB300 nodes
+                  - key: node.kubernetes.io/instance-type
+                    operator: In
+                    values:
+                    - p6e-gb200.36xlarge
+                    - p6e-gb300.36xlarge
             podAffinity:
               requiredDuringSchedulingIgnoredDuringExecution:
               - labelSelector:
@@ -399,6 +420,46 @@ status:
 kubectl logs --tail=-1 -l job-name=nvbandwidth-test-launcher
 ----
 +
+A successful test shows bandwidth statistics in GB/s for the multi-node memcpy test. An example of a successful test output is shown below.
++
+[source,bash]
+----
+...
+nvbandwidth Version: ...
+Built from Git version: ...
+
+MPI version: ...
+CUDA Runtime Version: ...
+CUDA Driver Version: ...
+Driver Version: ...
+
+Process 0 (nvbandwidth-test-worker-0): device 0: NVIDIA GB200 (...)
+Process 1 (nvbandwidth-test-worker-0): device 1: NVIDIA GB200 (...)
+Process 2 (nvbandwidth-test-worker-0): device 2: NVIDIA GB200 (...)
+Process 3 (nvbandwidth-test-worker-0): device 3: NVIDIA GB200 (...)
+Process 4 (nvbandwidth-test-worker-1): device 0: NVIDIA GB200 (...)
+Process 5 (nvbandwidth-test-worker-1): device 1: NVIDIA GB200 (...)
+Process 6 (nvbandwidth-test-worker-1): device 2: NVIDIA GB200 (...)
+Process 7 (nvbandwidth-test-worker-1): device 3: NVIDIA GB200 (...)
+
+Running multinode_device_to_device_memcpy_read_ce.
+memcpy CE GPU(row) -> GPU(column) bandwidth (GB/s)
+          0         1         2         3         4         5         6         7
+0       N/A    821.45    822.18    821.73    822.05    821.38    822.61    821.89
+1    822.34       N/A    821.67    822.12    821.94    820.87    821.53    822.08
+2    821.76    822.29       N/A    821.58    822.43    821.15    821.82    822.31
+3    822.19    821.84    822.05       N/A    821.67    821.23    820.95    822.47
+4    821.63    822.38    821.49    822.17       N/A    821.06    821.78    822.22
+5    822.08    821.52    821.89    822.35    821.27       N/A    821.64    822.13
+6    821.94    822.15    821.68    822.04    821.39    820.92       N/A    822.56
+7    822.27    821.73    822.11    821.86    822.38    821.04    821.49       N/A
+
+SUM multinode_device_to_device_memcpy_read_ce ...
+
+NOTE: The reported results may not reflect the full capabilities of the platform.
+Performance can vary with software drivers, hardware clocks, and system topology.
+----
++
 . When the test is complete, delete it with the following command.
 +
 [source,bash]
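
Reviewer note (not part of this commit): the new text says a successful test "shows bandwidth statistics in GB/s", which a reader may want to check automatically rather than by eye. The following is a minimal Python sketch of one way to do that, assuming the launcher log has the layout shown in the example output. The function names `parse_bandwidth_matrix` and `slow_pairs`, and the `800.0` GB/s threshold in the usage note, are hypothetical and chosen only for illustration. The sample text is a 3x3 excerpt of the matrix in the diff.

```python
# Excerpt of the example output from the diff, truncated to GPUs 0-2.
SAMPLE_LOG = """\
Running multinode_device_to_device_memcpy_read_ce.
memcpy CE GPU(row) -> GPU(column) bandwidth (GB/s)
          0         1         2
0       N/A    821.45    822.18
1    822.34       N/A    821.67
2    821.76    822.29       N/A

SUM multinode_device_to_device_memcpy_read_ce ...
"""


def parse_bandwidth_matrix(log_text):
    """Return {(row, col): GB/s} for the first bandwidth matrix in the log.

    Returns an empty dict when no matrix title line is found.
    """
    lines = log_text.splitlines()
    # Locate the matrix title line; assumes it contains "bandwidth (GB/s)".
    start = next(
        (i for i, line in enumerate(lines) if "bandwidth (GB/s)" in line),
        None,
    )
    if start is None or start + 1 >= len(lines):
        return {}
    # The line after the title lists the column (destination GPU) indices.
    columns = [int(tok) for tok in lines[start + 1].split()]
    matrix = {}
    for line in lines[start + 2:]:
        tokens = line.split()
        # Stop at the first line that is not a complete matrix row.
        if len(tokens) != len(columns) + 1 or not tokens[0].isdigit():
            break
        row = int(tokens[0])
        for col, tok in zip(columns, tokens[1:]):
            if tok != "N/A":  # the diagonal (GPU to itself) is not measured
                matrix[(row, col)] = float(tok)
    return matrix


def slow_pairs(matrix, min_gbps):
    """Return the sorted (row, col) pairs whose bandwidth is below min_gbps."""
    return sorted(pair for pair, bw in matrix.items() if bw < min_gbps)
```

One possible use is to save the output of the `kubectl logs` command above to a file, pass its contents to `parse_bandwidth_matrix`, and treat a non-empty result from `slow_pairs(matrix, 800.0)` as a signal to investigate. The right threshold depends on the hardware and driver versions and is not specified by the source document.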
