Commit 905bee0
fix(maca): stabilize multi-thread DDP on llama3/gpt2

The MACA runtime auto-cross-maps mcMalloc'd buffers as P2P-readonly
between sibling devices in the same process, so multi-thread DDP
(nthread_per_process>=4) crashed ~70% of the time during model upload with
"Writing to readonly page" on a 64MB buffer whose owner node was
missing from the mapped peer list.

llama3/main.cc: defer ProcessGroup creation until after model->To,
serialize model->To across DP threads with a process-wide mutex,
and barrier between upload and PG init so MCCL P2P registration
never overlaps with peer-thread allocations. Compute in-group
ranks via std::find on the rank topology so LoadFromLLMC still
sees the correct tp_rank before any PG exists.
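
The ordering described for llama3/main.cc can be sketched without any MACA dependency. This is a minimal illustration, not the code from this commit: `OneShotBarrier`, `InGroupRank`, `PhaseLog`, and `RunUploadThenInit` are hypothetical names, the upload and ProcessGroup steps are placeholders, and the barrier is hand-rolled so the sketch does not need C++20.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical one-shot barrier: every thread blocks until all n have arrived.
class OneShotBarrier {
public:
    explicit OneShotBarrier(int n) : remaining_(n) {}
    void ArriveAndWait() {
        std::unique_lock<std::mutex> lock(mu_);
        if (--remaining_ == 0) {
            cv_.notify_all();
        } else {
            cv_.wait(lock, [this] { return remaining_ == 0; });
        }
    }

private:
    std::mutex mu_;
    std::condition_variable cv_;
    int remaining_;
};

// Position of global_rank inside its group, or -1 if it is not a member.
// Needs only the rank topology, so it works before any process group exists.
int InGroupRank(const std::vector<int>& group_ranks, int global_rank) {
    auto it = std::find(group_ranks.begin(), group_ranks.end(), global_rank);
    return it == group_ranks.end() ? -1
                                   : static_cast<int>(it - group_ranks.begin());
}

struct PhaseLog {
    int max_concurrent_uploads;  // stays at 1 when uploads are serialized
    bool init_saw_all_uploads;   // true if no thread reached init early
};

PhaseLog RunUploadThenInit(int nthread) {
    assert(nthread > 0);
    std::mutex upload_mu;  // stands in for the process-wide upload mutex
    OneShotBarrier barrier(nthread);
    int active = 0, finished = 0, max_active = 0;  // guarded by upload_mu
    std::atomic<bool> all_uploaded_before_init{true};

    std::vector<std::thread> workers;
    for (int t = 0; t < nthread; ++t) {
        workers.emplace_back([&] {
            {
                std::lock_guard<std::mutex> upload_lock(upload_mu);
                ++active;  // placeholder for the model upload step
                max_active = std::max(max_active, active);
                std::this_thread::yield();
                --active;
                ++finished;
            }
            barrier.ArriveAndWait();  // no thread passes until all uploads end
            // Placeholder for process-group creation.
            if (finished != nthread) all_uploaded_before_init = false;
        });
    }
    for (auto& w : workers) w.join();
    return PhaseLog{max_active, all_uploaded_before_init.load()};
}
```

Calling `RunUploadThenInit(8)` mirrors the 8-thread validation setup: the mutex keeps uploads from overlapping each other, and the barrier keeps the init phase from overlapping any upload.
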
reducer.cc: switch FinalizeBackward to host-blocking
work->Synchronize() so the CPU bucket-rebuild can't race past an
in-flight AllReduce.
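
The effect of a host-blocking wait can be shown with standard futures. This is a sketch under stated assumptions: `FakeWork`, `LaunchFakeAllReduce`, and `FinalizeBucket` are hypothetical stand-ins, and the "all-reduce" simply scales the bucket as if every rank held identical gradients. It does not reproduce the reducer's real interface.

```cpp
#include <cassert>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical stand-in for an asynchronous collective handle.
struct FakeWork {
    std::future<void> done;
    // Blocks the calling host thread until the async work has finished.
    void Synchronize() { done.wait(); }
};

// Pretend all-reduce (sum): with identical gradients on every rank, the
// result is each element multiplied by world_size.
FakeWork LaunchFakeAllReduce(std::vector<float>& bucket, float world_size) {
    return FakeWork{std::async(std::launch::async, [&bucket, world_size] {
        for (float& g : bucket) g *= world_size;
    })};
}

// CPU-side step that must not touch the bucket while the collective runs.
float FinalizeBucket(std::vector<float>& bucket, float world_size) {
    assert(!bucket.empty());
    FakeWork work = LaunchFakeAllReduce(bucket, world_size);
    work.Synchronize();  // without this wait, the read below would race
    return std::accumulate(bucket.begin(), bucket.end(), 0.0f);
}
```

The design point is that the wait happens on the host thread itself, so any CPU-side bookkeeping placed after it is ordered after the collective.
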
maca_guard_impl.cc: setenv MACA_LAUNCH_BLOCKING=1 before mcInit(0)
in the ctor (setenv from main is too late since mcInit runs during
static init), and serialize mcMalloc/mcFree behind a global mutex.
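
The two changes in maca_guard_impl.cc follow a general pattern that can be sketched with POSIX and standard-library calls only. Everything below is illustrative: `FakeGuardImpl`, `FakeRuntimeInit`, and `AllocMutex` are hypothetical names, and `std::malloc`/`std::free` stand in for the device allocator. Only the environment variable name is taken from the commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>   // std::getenv, std::malloc, std::free
#include <cstring>   // std::strcmp
#include <mutex>
#include <stdlib.h>  // POSIX setenv

// One mutex for the whole process, created on first use.
std::mutex& AllocMutex() {
    static std::mutex mu;
    return mu;
}

class FakeGuardImpl {
public:
    FakeGuardImpl() {
        // Set the variable in the same constructor that initializes the
        // runtime, so the ordering holds even during static initialization.
        setenv("MACA_LAUNCH_BLOCKING", "1", /*overwrite=*/1);
        FakeRuntimeInit();
    }

    bool saw_blocking_flag() const { return saw_blocking_flag_; }

    // Allocation and release go through the same process-wide lock.
    void* Malloc(std::size_t bytes) {
        std::lock_guard<std::mutex> lock(AllocMutex());
        return std::malloc(bytes);  // placeholder for the device allocator
    }
    void Free(void* ptr) {
        std::lock_guard<std::mutex> lock(AllocMutex());
        std::free(ptr);  // placeholder for the device free call
    }

private:
    // Placeholder for runtime init: records what the environment held then.
    void FakeRuntimeInit() {
        const char* value = std::getenv("MACA_LAUNCH_BLOCKING");
        saw_blocking_flag_ = value != nullptr && std::strcmp(value, "1") == 0;
    }
    bool saw_blocking_flag_ = false;
};
```

A global allocation lock trades some throughput for a guarantee that no two threads are inside the allocator at once, which is usually acceptable when allocations are concentrated in model upload rather than the training loop.
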
llama3/gpt2 main.cc: std::_Exit(0) after training when device==maca
&& nthread_per_process>1 to bypass the broken static-destruction
chain — ProcessGroupMCCL intentionally skips mcclCommDestroy, and
the leaked MCCL/P2P buffers otherwise trip mxkwUnmapMemoryToGPU
and SIGABRT during teardown.
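
The exit path can be sketched as a small predicate plus a guarded call to `std::_Exit`. The helper names `ShouldSkipStaticTeardown` and `FinishTraining` are hypothetical, and the explicit flush is an addition of this sketch: `std::_Exit` runs neither static destructors nor `atexit` handlers, and whether it flushes open streams is implementation-defined.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <string>

// Condition from the commit message: MACA device with more than one
// thread per process.
bool ShouldSkipStaticTeardown(const std::string& device,
                              int nthread_per_process) {
    return device == "maca" && nthread_per_process > 1;
}

// Called after training completes. Returns normally on every other
// configuration so the usual teardown still runs there.
void FinishTraining(const std::string& device, int nthread_per_process) {
    assert(nthread_per_process >= 1);
    if (ShouldSkipStaticTeardown(device, nthread_per_process)) {
        std::fflush(nullptr);  // flush all output streams before leaving
        std::_Exit(0);         // terminate without static destruction
    }
}
```

Keeping the condition in its own function makes the bypass easy to test and easy to remove once teardown is fixed.
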

Validated: 20/20 passes on
  ./llama3 --device maca --nthread_per_process=8 --num_iteration=10
  --batch_size=10 --total_batch_size=5120
Single-card path (nthread_per_process=1) still passes.

1 parent 7c3b69d commit 905bee0
4 files changed
Lines changed: 120 additions & 20 deletions
File tree
- example
  - gpt2
  - llama3
- infini_train/src
  - core/runtime/maca
  - nn/parallel/ddp