Commit 2b344dc
fix(kubeflow): wait for rank-0/last to resolve, never fall back to completion-index
The first-attach barrier capped the wait at 600s and then forwarded with the
completion-index heuristic, which streams the wrong rank. A job can legitimately
sit Pending (starved for nodes) far longer than 600s, so it would time out and
mis-forward. Drop the timeout/fallback: keep polling while the job is alive and
stop only when it reaches a terminal state. --tail=-1 on first attach replays
history, so waiting loses nothing.
Signed-off-by: oliver könig <okoenig@nvidia.com>
1 parent c23cecf commit 2b344dc
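
The behaviour described in the commit message can be sketched roughly as below. This is a minimal illustration and not the code from this commit: `wait_for_target_pod`, `get_job_state`, `resolve_target_pod`, and the `TERMINAL_STATES` set are all hypothetical names chosen for the example. It shows only the control flow: no deadline, no fallback, and exit only when the target pod resolves or the job is terminal.

```python
import time
from typing import Callable, Optional

# Assumed set of terminal job states; the real values depend on the job controller.
TERMINAL_STATES = frozenset({"Succeeded", "Failed"})


def wait_for_target_pod(
    get_job_state: Callable[[], str],
    resolve_target_pod: Callable[[], Optional[str]],
    poll_interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll until the target pod resolves or the job reaches a terminal state.

    Both callables are hypothetical hooks: `resolve_target_pod` returns the
    name of the pod to stream from (or None while it is unknown), and
    `get_job_state` returns the current job phase as a string.
    """
    while True:
        pod = resolve_target_pod()
        if pod is not None:
            # Target resolved: the caller can attach to this pod.
            return pod
        if get_job_state() in TERMINAL_STATES:
            # The job ended before the target appeared: stop without guessing.
            return None
        # The job is still alive (for example Pending): keep waiting, no deadline.
        sleep(poll_interval)
```

Returning `None` on a terminal state, instead of a guessed pod, mirrors the intent of dropping the completion-index fallback: the caller never streams from a pod that was not positively identified.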
1 file changed
Lines changed: 10 additions & 12 deletions
| Original file lines | New file lines | Change |
|---|---|---|
| 620 | | 1 line removed |
| 630-633 | 629-634 | 4 lines removed, 6 lines added |
| 635-641 | 636-639 | 7 lines removed, 4 lines added |