Commit 7495a29

fix(kubeflow): tail only first and last node logs
KubeflowExecutor.fetch_logs streamed every replica pod (--max-log-requests num_nodes). Each pod's stdout already includes all of that node's local ranks, so following all replicas multiplies aggregate log volume by num_nodes and overruns CI/runner job-log size limits at scale (e.g. a 16-node job exceeded GitLab's 128MB cap).

Restrict the selector to completion indices 0 and num_nodes-1 (first + last node), which is enough for rank-0 driver output and the far end of the world for spotting straggler/per-rank failures.

Signed-off-by: oliver könig <okoenig@nvidia.com>
1 parent 1ad956b commit 7495a29

1 file changed

Lines changed: 13 additions & 2 deletions

nemo_run/core/execution/kubeflow.py

@@ -331,7 +331,18 @@ def fetch_logs(
         until pods are running (up to 10 minutes). Otherwise it returns the last
         *lines* lines from a single ``kubectl logs`` call.
         """
-        label_selector = f"jobset.sigs.k8s.io/jobset-name={job_name}"
+        # Tail only the first and last node of the job. Each pod's stdout
+        # already carries all of that node's local ranks, so following every
+        # replica multiplies the aggregate output by num_nodes and can blow
+        # past CI / runner job-log size limits at scale. The first and last
+        # completion indices are enough to surface rank-0 driver output and the
+        # opposite end of the world for spotting straggler / per-rank failures.
+        last_index = max(self.num_nodes - 1, 0)
+        node_indices = "0" if last_index == 0 else f"0,{last_index}"
+        label_selector = (
+            f"jobset.sigs.k8s.io/jobset-name={job_name},"
+            f"batch.kubernetes.io/job-completion-index in ({node_indices})"
+        )
         cmd = [
             "kubectl",
             "logs",
@@ -342,7 +353,7 @@ def fetch_logs(
             "--tail",
             str(lines),
             "--max-log-requests",
-            str(self.num_nodes),
+            str(1 if last_index == 0 else 2),
         ]
         if stream:
             cmd.append("-f")
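
For reviewers who want to check the selector logic outside the executor, the change can be sketched as a standalone helper. This is only an illustration under stated assumptions: the function name `first_last_log_target` and the job name `demo-job` are hypothetical and do not appear in nemo_run; the label keys and the index arithmetic are copied from the diff above.

```python
from __future__ import annotations


def first_last_log_target(job_name: str, num_nodes: int) -> tuple[str, int]:
    """Return (label_selector, max_log_requests) for the first and last node.

    Hypothetical helper that mirrors the logic added to fetch_logs in this
    commit; it is not part of nemo_run.
    """
    # Clamp to 0 so a single-node job (or num_nodes <= 0) selects index 0 only.
    last_index = max(num_nodes - 1, 0)
    # Avoid listing index 0 twice when the first and last node are the same.
    node_indices = "0" if last_index == 0 else f"0,{last_index}"
    selector = (
        f"jobset.sigs.k8s.io/jobset-name={job_name},"
        f"batch.kubernetes.io/job-completion-index in ({node_indices})"
    )
    # One log stream per selected index: 1 for a single node, otherwise 2.
    max_log_requests = 1 if last_index == 0 else 2
    return selector, max_log_requests


if __name__ == "__main__":
    for n in (1, 2, 16):
        print(n, first_last_log_target("demo-job", n))
```

With these inputs, a 16-node job selects indices 0 and 15 and a single-node job selects index 0 alone, so the number of followed streams no longer grows with num_nodes.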
