Commit 7495a29
fix(kubeflow): tail only first and last node logs
KubeflowExecutor.fetch_logs streamed every replica pod
(--max-log-requests num_nodes). Each pod's stdout already includes all of that
node's local ranks, so following all replicas multiplies aggregate log volume
by num_nodes and overruns CI/runner job-log size limits at scale (e.g. a 16-node
job exceeded GitLab's 128MB cap). Restrict the selector to completion indices 0
and num_nodes-1 (first + last node) — enough for rank-0 driver output and the
far end of the world for spotting straggler/per-rank failures.
Signed-off-by: oliver könig <okoenig@nvidia.com>
1 parent 1ad956b commit 7495a29
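The selector change described in the commit message can be sketched roughly as follows. This is an illustrative sketch only, not the code from this commit: the helper name `first_last_selector` and the `base_selector` parameter are hypothetical, and it assumes the replica pods carry the standard Kubernetes indexed-Job label `batch.kubernetes.io/job-completion-index`. The actual label and call site used by `KubeflowExecutor.fetch_logs` are not visible in this capture.

```python
from __future__ import annotations


# Assumption: pods are labelled with the indexed-Job completion index.
COMPLETION_INDEX_LABEL = "batch.kubernetes.io/job-completion-index"


def first_last_selector(base_selector: str, num_nodes: int) -> tuple[str, int]:
    """Return (label selector, number of pods to follow) for first + last node.

    Hypothetical helper: narrows `base_selector` to completion indices
    0 and num_nodes - 1 with a set-based `in (...)` label requirement.
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be >= 1")

    # A single-node job has only index 0; the set removes the duplicate.
    indices = sorted({0, num_nodes - 1})
    index_expr = f"{COMPLETION_INDEX_LABEL} in ({','.join(str(i) for i in indices)})"

    selector = f"{base_selector},{index_expr}" if base_selector else index_expr
    return selector, len(indices)


def kubectl_logs_args(base_selector: str, num_nodes: int) -> list[str]:
    """Build a `kubectl logs` argument list that follows at most two pods."""
    selector, max_requests = first_last_selector(base_selector, num_nodes)
    return [
        "kubectl", "logs", "-f",
        "-l", selector,
        "--max-log-requests", str(max_requests),
    ]
```

With this approach the number of followed streams stays at two (or one for a single-node job) regardless of `num_nodes`, so aggregate log volume no longer scales with the node count.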
1 file changed
Lines changed: 13 additions & 2 deletions
Diff content was not preserved in this capture; only the hunk positions survive:

@@ -331,7 +331,18 @@  old line 334 removed; new lines 334-345 added (12 lines)
@@ -342,7 +353,7 @@   old line 345 removed; new line 356 added (1 line)