Commit d3cf6af
feat: add CI/CD DDP step + multi-GPU support with sync_preconditioner()
CI/CD (.github/workflows/ci.yml):
- Add DDP test step (gloo/CPU, 2 processes) after main pytest run
Multi-GPU / DDP (scao/optimizer.py, scao/preconditioner.py):
- Add _broadcast_precond() function: broadcasts all preconditioner state
(eigenfactors U_l/S_l/U_r/S_r, EMA accumulators, int8 scale factors,
step counter, adaptive rank k) from rank 0 to all other ranks via
torch.distributed.broadcast; handles Kronecker / block-diagonal / diagonal
modes and rank mismatch after checkpoint loading
- Add SCAO.sync_preconditioner(process_group=None): broadcasts exp_avg,
exp_avg_sq, step, and preconditioner tensors for every parameter; emits
RuntimeWarning if dist is not initialised; no-op during single-GPU training
- Add DDP section to module docstring: recommended async_precond=False,
checkpoint-resume pattern, torchrun usage
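
The broadcast-from-rank-0 pattern described above can be sketched roughly as follows. This is an illustration, not the code from this commit: `broadcast_state_from_rank0` is a hypothetical helper, and it assumes every rank already holds tensors of matching shape and dtype, so the rank-mismatch handling mentioned above is left out. The only PyTorch calls used are `torch.distributed.is_available`, `is_initialized`, and `broadcast`.

```python
import warnings

import torch
import torch.distributed as dist


def broadcast_state_from_rank0(tensors, process_group=None):
    """Overwrite each tensor in `tensors` in place with rank 0's copy.

    Hypothetical helper for illustration. Returns True if a broadcast ran,
    False if torch.distributed was not initialised (nothing is changed then).
    """
    if not (dist.is_available() and dist.is_initialized()):
        # Mirrors the behaviour described in the commit message: warn, then no-op.
        warnings.warn(
            "torch.distributed is not initialised; skipping sync",
            RuntimeWarning,
        )
        return False
    for t in tensors:
        # src=0 is the global rank whose values every other rank receives.
        dist.broadcast(t, src=0, group=process_group)
    return True
```

Called outside a distributed run, the helper only warns and leaves its inputs untouched, which keeps single-process scripts working unchanged.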
DDP tests (scao/tests/test_ddp.py):
- test_ddp_converges: 2-process gloo/CPU, quadratic loss, verify loss
decreases over 30 steps on both ranks
- test_sync_preconditioner: inject zeroed U_l on rank 1, call sync, verify
both ranks have identical U_l norm via dist.all_gather
Multi-GPU benchmark (scripts/bench_ddp.py):
- torchrun-compatible script: NCCL (GPU) or gloo (CPU) backend auto-selected
- AdamW vs SCAO vs SCAO+int8 comparison with DDP-wrapped GPT model
- Per-GPU batch size, world_size-scaled throughput reporting
- Saves results_ddp_<scale>.csv and _curves.csv (rank 0 only)
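
As a rough illustration of the backend choice and throughput scaling listed above, under assumptions not confirmed by this commit: `pick_backend` and `global_throughput` are hypothetical names, and throughput is assumed to be measured in tokens per second across all ranks.

```python
def pick_backend(cuda_available: bool) -> str:
    """Choose NCCL when GPUs are present, gloo otherwise.

    In a real script the argument would typically be torch.cuda.is_available().
    """
    return "nccl" if cuda_available else "gloo"


def global_throughput(per_gpu_batch: int, seq_len: int,
                      world_size: int, step_seconds: float) -> float:
    """Tokens per second summed over all ranks (assumed definition).

    Each rank processes per_gpu_batch * seq_len tokens per step, so the
    global figure is that amount multiplied by world_size.
    """
    return per_gpu_batch * seq_len * world_size / step_seconds


print(pick_backend(False))                 # gloo
print(global_throughput(8, 128, 2, 0.5))   # 4096.0
```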
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1 parent 81b60b3 commit d3cf6af
5 files changed
Lines changed: 710 additions & 3 deletions
File tree
- .github/workflows
- scao
  - tests
- scripts