Skip to content

Commit 2b344dc

Browse files
committed
fix(kubeflow): wait for rank-0/last to resolve, never fall back to completion-index
The first-attach barrier capped the wait at 600s and then forwarded with the completion-index heuristic, which streams the wrong rank. A job can legitimately sit Pending (starved for nodes) far longer than 600s, so it would time out and mis-forward. Drop the timeout/fallback: keep polling while the job is alive and stop only when it reaches a terminal state. --tail=-1 on first attach replays history, so waiting loses nothing. Signed-off-by: oliver könig <okoenig@nvidia.com>
1 parent c23cecf commit 2b344dc

1 file changed

Lines changed: 10 additions & 12 deletions

File tree

nemo_run/core/execution/kubeflow.py

Lines changed: 10 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -617,7 +617,6 @@ def _forward_to_stdout(log_line: str, pod_index: dict[str, int]) -> bool:
617617
# stdout, silently dropping the beginning of the run from the CI log.
618618
# Poll until both are resolved, capped so a run that never exposes
619619
# GROUP_RANK still streams (with the completion-index fallback).
620-
rank_resolve_timeout_s = 600.0
621620
rank_resolve_poll_s = 5.0
622621
since_time: Optional[str] = None
623622
while True:
@@ -627,18 +626,17 @@ def _forward_to_stdout(log_line: str, pod_index: dict[str, int]) -> bool:
627626
# First attach: wait until BOTH rank 0 and the last rank are
628627
# resolved before forwarding, so the last rank's early per-step
629628
# lines (replayed via --tail=-1) reach stdout instead of only
630-
# log-allranks. Only wait while pods are actually listable; an
631-
# empty list (no kubectl / unit tests) skips the wait and streams
632-
# with the existing completion-index fallback.
633-
resolve_deadline = time.time() + rank_resolve_timeout_s
629+
# log-allranks. Never fall back to the completion-index heuristic
630+
# — it forwards the wrong rank. The job may sit Pending (waiting
631+
# for nodes) or be mid-rendezvous, so keep waiting while it is
632+
# alive; --tail=-1 on first attach replays history, so nothing is
633+
# lost by waiting. Stop only if the job reaches a terminal state
634+
# (or pods aren't listable at all — e.g. no kubectl / unit tests).
634635
while pod_index and not {0, last_group_rank} <= set(group_rank_map.values()):
635-
if time.time() >= resolve_deadline:
636-
logger.warning(
637-
"rank 0 / last rank (%d) not both resolved within %.0fs; "
638-
"forwarding with completion-index fallback",
639-
last_group_rank,
640-
rank_resolve_timeout_s,
641-
)
636+
if self.status(job_name) in (
637+
KubeflowJobState.SUCCEEDED,
638+
KubeflowJobState.FAILED,
639+
):
642640
break
643641
time.sleep(rank_resolve_poll_s)
644642
pod_index = _pod_index_map()

0 commit comments

Comments
 (0)