Commit a2d4de4
fix: Add ml.p5e.48xlarge and ml.p5.48xlarge to EFA instance lists
Add ml.p5e.48xlarge to SM_EFA_NCCL_INSTANCES and SM_EFA_RDMA_INSTANCES.
Add ml.p5.48xlarge to SM_EFA_RDMA_INSTANCES (was missing).
Without these entries, NCCL hangs during distributed training
initialization on P5e instances due to missing EFA environment
variables (FI_PROVIDER, FI_EFA_USE_DEVICE_RDMA, RDMAV_FORK_SAFE).
Fixes #5491
1 parent 33bf993 commit a2d4de4
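The mechanism the commit message describes can be sketched roughly as follows. This is a minimal illustration, not the SDK's actual code: the helper name `efa_env_for_instance` is hypothetical, the lists are trimmed to the two instance types named in the message, and the variable values (`efa`, `1`, `1`) are the conventional settings, assumed here rather than taken from the diff.

```python
from typing import Dict

# Illustrative only: the real lists contain many more instance types.
SM_EFA_NCCL_INSTANCES = ["ml.p5.48xlarge", "ml.p5e.48xlarge"]
SM_EFA_RDMA_INSTANCES = ["ml.p5.48xlarge", "ml.p5e.48xlarge"]


def efa_env_for_instance(instance_type: str) -> Dict[str, str]:
    """Hypothetical helper: return the EFA variables for an instance type."""
    env: Dict[str, str] = {}
    if instance_type in SM_EFA_NCCL_INSTANCES:
        # Assumed value: select the EFA libfabric provider.
        env["FI_PROVIDER"] = "efa"
    if instance_type in SM_EFA_RDMA_INSTANCES:
        # Assumed values: enable device RDMA and fork-safe RDMA verbs.
        env["FI_EFA_USE_DEVICE_RDMA"] = "1"
        env["RDMAV_FORK_SAFE"] = "1"
    return env
```

Under this sketch, an instance type absent from both lists gets an empty mapping, which corresponds to the situation the commit fixes: no EFA variables are exported, so NCCL initialization can stall.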
File tree
5 files changed, +38 -0 lines changed
- sagemaker-core/src/sagemaker/core
  - modules/train/container_drivers/common
  - remote_function/runtime_environment
- sagemaker-train
  - src/sagemaker/train
    - container_drivers/common
    - remote_function/runtime_environment
  - tests/unit/train/container_drivers
Lines changed: 3 additions & 0 deletions
Additions at diff lines 53, 60, 61.
Lines changed: 3 additions & 0 deletions
Additions at diff lines 78, 85, 86.
Lines changed: 3 additions & 0 deletions
Additions at diff lines 53, 60, 61.
Lines changed: 3 additions & 0 deletions
Additions at diff lines 78, 85, 86.
Lines changed: 26 additions & 0 deletions
Additions at diff lines 20 and 150-174.