Commit 7bb8137
Megatron LoRA correctness: align distributed semantics with Megatron and validate TP/EP/ETP/DP against an oracle (#619)
* megatron: integrate lora grad sync with finalize_model_grads
* megatron: harden sharded lora merge validation
* tests: add megatron lora oracle correctness harness
* Minor typing changes
* megatron: extend LoRA grad-sync semantics across tp/expert-tp
* megatron: add MoE routing replay core and unit tests
* megatron runtime/service: wire routing replay into training jobs
* oracle worker/trace: capture forward traces and emit replay bundles
* oracle harness/tests: refactor suite and add oracle-replay parity flow
* typing: clear blocking ty errors in oracle replay and LoRA paths
* megatron: reduce oracle variance with sequence grad accumulation
Use per-step micro-accumulation over multiple packed sequences so updates are less sensitive to sparse expert token assignment. Also make backend progress accounting accumulation-aware.
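A minimal sketch of the accumulation idea (hypothetical names; the real integration lives in the Megatron backend): gradients from several packed sequences are combined as a token-weighted average before one optimizer step, so a sequence with sparse expert token assignment cannot dominate the update.

```python
def accumulate_over_sequences(sequences, grad_fn):
    """Accumulate per-sequence gradients, weighting by token count.

    `grad_fn(seq)` is assumed to return (grads, num_tokens) for one
    packed sequence, where grads maps parameter names to mean
    gradients. The result is the token-weighted average gradient,
    matching a single pass over the concatenated batch.
    """
    total_tokens = 0
    acc = {}
    for seq in sequences:
        grads, n_tokens = grad_fn(seq)
        total_tokens += n_tokens
        for name, g in grads.items():
            # Scale each sequence's mean gradient back up to a token sum.
            acc[name] = acc.get(name, 0.0) + g * n_tokens
    # Normalize once by the global token count.
    return {name: g / total_tokens for name, g in acc.items()}
```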
* megatron lora: fix TP/EP export participation rules
Correct LoRA shard export behavior so non-zero TP ranks in EP/ETP topologies contribute when required, while still filtering replicated-only entries.
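The participation rule can be sketched as a small predicate (hypothetical signature, not the actual Megatron API): ranks holding a genuinely sharded slice always export it, while replicated parameters are exported from a single rank to avoid duplicates.

```python
def should_export_shard(tp_rank: int, is_tp_sharded: bool,
                        is_expert_param: bool, etp_rank: int = 0) -> bool:
    """Decide whether this rank contributes a LoRA shard to the export.

    Sketch of the rule described above: parameters sharded along TP
    (or expert-TP for expert weights) are rank-unique, so every rank
    must contribute its slice; purely replicated parameters are
    exported once, from rank 0 of the relevant group.
    """
    rank = etp_rank if is_expert_param else tp_rank
    if is_tp_sharded:
        return True   # each rank holds a unique slice
    return rank == 0  # replicated: one copy is enough
```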
* oracle trace: canonicalize MoE outputs across arbitrary topologies
Move normalization logic into ForwardTraceCapture so saved traces are canonicalized toward world-size-1 semantics (expert row identity/order and ETP fc1 layout).
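A toy illustration of the expert-row canonicalization (simplified; the real capture works on tensors, not lists): shards arriving in per-rank local order are reassembled by global expert id, so a trace from any EP topology compares equal to a world-size-1 trace.

```python
def canonicalize_expert_outputs(shards):
    """Reassemble per-rank expert shards into world-size-1 order.

    `shards` is assumed to be a list of (global_expert_ids, rows)
    pairs, one per expert-parallel rank, with rows in local order.
    The canonical trace lists rows by global expert id, exactly as a
    single-GPU run would produce them.
    """
    by_expert = {}
    for expert_ids, rows in shards:
        for eid, row in zip(expert_ids, rows):
            by_expert[eid] = row
    # Emit rows in global expert-id order, independent of topology.
    return [by_expert[eid] for eid in sorted(by_expert)]
```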
* oracle harness: stabilize scoring and expand sensitivity mutations
Rework oracle pass/fail evaluation with per-phase functions, layer-averaged metrics, deterministic init, expanded sensitivity mutations, and smaller Adam epsilon for tiny-gradient regimes.
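The layer-averaged scoring can be reduced to a few lines (hypothetical helper; thresholds in the real harness come from the sensitivity mutations): averaging over layers keeps one noisy layer from flipping the verdict, unlike a worst-layer criterion.

```python
def layer_averaged_score(per_layer_errors, threshold):
    """Score a run by the mean relative error across layers.

    Returns (mean_error, passed). Averaging rather than taking the
    max makes the verdict robust to a single noisy layer; the
    threshold is assumed to be calibrated so that the sensitivity
    mutations (deliberately broken runs) still fail.
    """
    mean_err = sum(per_layer_errors) / len(per_layer_errors)
    return mean_err, mean_err <= threshold
```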
* oracle tests: write suite output tables to log files
Redirect suite stdout/stderr into local correctness/sensitivity logs and make skip/report messaging point to those artifacts instead of terminal output.
* Add correct data parallelism.
* Fix per-token DP normalization in Megatron training
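The per-token DP fix amounts to normalizing by the global token count rather than the local one. A dependency-free simulation of the all-reduce (plain Python sums stand in for `dist.all_reduce`):

```python
def dp_token_normalized_loss(rank_loss_sums, rank_token_counts):
    """Simulate per-token loss normalization across DP ranks.

    Each rank sums token losses locally; an all-reduce (emulated here
    by Python sums) yields the global loss sum and global token
    count, and every rank divides by the same global count. Dividing
    by the *local* count would bias updates toward ranks that happen
    to hold fewer tokens.
    """
    global_loss = sum(rank_loss_sums)       # all-reduce(SUM) over losses
    global_tokens = sum(rank_token_counts)  # all-reduce(SUM) over counts
    return global_loss / global_tokens
```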
* Expand the oracle harness for DP correctness checks
* Clean up type errors in Megatron correctness changes
* The test harness was working, but real training surfaced a few errors; most are now fixed.
* Cut over Megatron LoRA to QuACK
* Delete held packed tensors so their directory can be removed.
Also includes small typing changes.
* Fuse LoRA scale into QuACK grouped GEMM
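The fusion rests on a simple identity: since the LoRA update is typically `(x @ A) @ B * (alpha / r)`, the scalar can be folded into `B` once ahead of time, removing the elementwise multiply from the grouped-GEMM epilogue. A minimal sketch (lists instead of tensors, hypothetical helper name):

```python
def fuse_lora_scale(B, alpha, r):
    """Fold the LoRA scale alpha/r into the B matrix ahead of time.

    (x @ A) @ (B * s) == ((x @ A) @ B) * s for scalar s, so
    pre-scaling B gives an identical result while the grouped GEMM
    no longer needs a separate scaling pass over its output.
    """
    s = alpha / r
    return [[v * s for v in row] for row in B]
```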
* Avoid grad_out copy in QuACK LoRA backward
* Fuse MoE FC1 gate and up LoRA paths
* Tune QuACK low-rank tiles and rank contract
* Inline FC1 QuACK dual call
* Revert unnecessary Python 3.12 requirement.
* Create LoRA without instantiating the full model by using the meta device.
* Fix routing replay for torch 2.10.0.
* Update Megatron dependencies for transformers v5 change.
* Update Megatron tests for the new LoRA kernel and average grads across experts for stability.
* Limit max build jobs when building the uv cache.
* Fix CI uv cache build robustness
* Tune CI uv cache build concurrency
* Fix CI Apex cache contract

1 parent: dc8c338
File tree — 28 files changed, +10,901 −2,718 lines:

- .github/workflows
- dev
- scripts/ci
- src/art
  - dev
  - local
  - megatron
  - tinker
  - unsloth
- tests
  - integration
  - unit