Commit c23cecf
fix(kubeflow): resolve rank-0 and last rank before forwarding logs

On first attach, the GROUP_RANK pod map is empty until the torchrun workers
finish rendezvous, so _forward_to_stdout fell back to rank-0-only. The last
rank's early per-step loss/throughput lines (replayed via --tail=-1) were
written to log-allranks but never forwarded to stdout. As a result, the CI
log silently dropped the beginning of the run until a re-attach ~120s later,
by which point --since-time skips the replayed history.

Poll on the first attach until both rank 0 and the last rank resolve before
forwarding, capped at 600s (then fall back). The wait is gated on a
non-empty pod list, so it is a no-op when pods can't be listed (no kubectl
/ unit tests) and engages only for real runs.
Signed-off-by: oliver könig <okoenig@nvidia.com>
1 parent 981e6f9 commit c23cecf
1 file changed
Lines changed: 30 additions & 0 deletions
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| 609-611 | 609-611 | unchanged context |
| | 612-621 | + (10 lines added) |
| 612-615 | 622-625 | unchanged context |
| | 626-645 | + (20 lines added) |
| 616-618 | 646-648 | unchanged context |
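The polling behavior described in the commit message could look roughly like the sketch below. It is not the code from this commit: the function name `wait_for_edge_ranks`, its parameters, and the injected `list_pods` / `resolve_rank_map` callables are all hypothetical, and the 5s poll interval is an assumption. Only the 600s cap, the non-empty-pod-list gate, and the "rank 0 and last rank" condition come from the message.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple


def wait_for_edge_ranks(
    list_pods: Callable[[], List[str]],            # hypothetical: returns pod names
    resolve_rank_map: Callable[[], Dict[int, str]],  # hypothetical: GROUP_RANK -> pod
    last_rank: int,
    timeout_s: float = 600.0,                      # cap stated in the commit message
    poll_interval_s: float = 5.0,                  # assumed; not stated in the source
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[Optional[str], Optional[str]]:
    """Return (rank0_pod, last_rank_pod); an entry is None if it never resolved."""
    # Gate: if no pods can be listed (no kubectl, unit tests), skip the wait.
    if not list_pods():
        return None, None

    deadline = clock() + timeout_s
    while True:
        rank_map = resolve_rank_map()
        rank0 = rank_map.get(0)
        last = rank_map.get(last_rank)
        if rank0 is not None and last is not None:
            return rank0, last
        if clock() >= deadline:
            # Cap reached: hand back whatever resolved so the caller can fall back.
            return rank0, last
        sleep(poll_interval_s)
```

Injecting `clock` and `sleep` keeps the loop testable without real waiting. A caller would start forwarding only once both entries are non-None, and otherwise fall back to forwarding whatever did resolve.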