Commit c157a41
authored
add stop+go tests to llama3 recipe, turn off async checkpointing for fp8 (#1494)
async dcp checkpointing is currently not working with fp8 model init, so
we need to detect this and switch back to synchronous checkpointing.
This also adds tests to ensure the dcp checkpoints are functional
Signed-off-by: Peter St. John <pstjohn@nvidia.com>1 parent 493fb3f commit c157a41
8 files changed
Lines changed: 459 additions & 872 deletions
File tree
- bionemo-recipes/recipes/llama3_native_te
- tests
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
34 | 34 | | |
35 | 35 | | |
36 | 36 | | |
| 37 | + | |
37 | 38 | | |
| 39 | + | |
38 | 40 | | |
39 | 41 | | |
40 | 42 | | |
| |||
115 | 117 | | |
116 | 118 | | |
117 | 119 | | |
| 120 | + | |
118 | 121 | | |
119 | | - | |
| 122 | + | |
| 123 | + | |
| 124 | + | |
| 125 | + | |
| 126 | + | |
| 127 | + | |
| 128 | + | |
| 129 | + | |
| 130 | + | |
| 131 | + | |
| 132 | + | |
| 133 | + | |
120 | 134 | | |
121 | 135 | | |
122 | 136 | | |
| |||
126 | 140 | | |
127 | 141 | | |
128 | 142 | | |
129 | | - | |
| 143 | + | |
130 | 144 | | |
131 | 145 | | |
132 | 146 | | |
| |||
221 | 235 | | |
222 | 236 | | |
223 | 237 | | |
| 238 | + | |
224 | 239 | | |
225 | 240 | | |
226 | 241 | | |
| |||
236 | 251 | | |
237 | 252 | | |
238 | 253 | | |
| 254 | + | |
239 | 255 | | |
240 | 256 | | |
241 | 257 | | |
| |||
322 | 338 | | |
323 | 339 | | |
324 | 340 | | |
| 341 | + | |
| 342 | + | |
| 343 | + | |
| 344 | + | |
| 345 | + | |
| 346 | + | |
| 347 | + | |
325 | 348 | | |
326 | 349 | | |
327 | 350 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
44 | 44 | | |
45 | 45 | | |
46 | 46 | | |
47 | | - | |
| 47 | + | |
48 | 48 | | |
49 | 49 | | |
50 | 50 | | |
| |||
75 | 75 | | |
76 | 76 | | |
77 | 77 | | |
78 | | - | |
| 78 | + | |
79 | 79 | | |
80 | 80 | | |
81 | 81 | | |
| |||
Lines changed: 51 additions & 0 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
19 | 19 | | |
20 | 20 | | |
21 | 21 | | |
| 22 | + | |
22 | 23 | | |
23 | 24 | | |
24 | 25 | | |
| |||
61 | 62 | | |
62 | 63 | | |
63 | 64 | | |
| 65 | + | |
| 66 | + | |
| 67 | + | |
| 68 | + | |
| 69 | + | |
| 70 | + | |
| 71 | + | |
| 72 | + | |
| 73 | + | |
| 74 | + | |
| 75 | + | |
| 76 | + | |
| 77 | + | |
| 78 | + | |
| 79 | + | |
| 80 | + | |
| 81 | + | |
| 82 | + | |
| 83 | + | |
| 84 | + | |
| 85 | + | |
| 86 | + | |
| 87 | + | |
| 88 | + | |
| 89 | + | |
| 90 | + | |
| 91 | + | |
| 92 | + | |
| 93 | + | |
| 94 | + | |
| 95 | + | |
| 96 | + | |
| 97 | + | |
| 98 | + | |
| 99 | + | |
| 100 | + | |
| 101 | + | |
| 102 | + | |
| 103 | + | |
| 104 | + | |
| 105 | + | |
| 106 | + | |
| 107 | + | |
| 108 | + | |
| 109 | + | |
| 110 | + | |
| 111 | + | |
| 112 | + | |
| 113 | + | |
| 114 | + | |
64 | 115 | | |
65 | 116 | | |
66 | 117 | | |
| |||
0 commit comments