Commit 620d093
fix(kfpytorch): add environment property to pass elastic config via env vars
This fixes bug #1 where overriding nproc_per_node via with_overrides would not
change the number of processes for single-node elastic tasks.
For single-node elastic tasks (task_type='python-task'), the _execute method
reads PET_NPROC_PER_NODE, PET_NNODES, PET_MAX_RESTARTS, and PET_MONITOR_INTERVAL
from environment variables. However, these env vars were never set in the
task template during serialization, so an override had no effect at runtime.
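As a rough illustration of the read side described above, the sketch below parses the four PET_* variables from a mapping. The function name `read_elastic_config`, the `ElasticEnvConfig` type, and the fallback defaults are hypothetical and chosen only for this example; the plugin's actual `_execute` code is not shown in this commit page and may differ.

```python
from typing import Mapping, NamedTuple


class ElasticEnvConfig(NamedTuple):
    """Hypothetical container for the values read from PET_* variables."""

    nnodes: str  # kept as a string, since the value may not be a plain integer
    nproc_per_node: int
    max_restarts: int
    monitor_interval: float


def read_elastic_config(env: Mapping[str, str]) -> ElasticEnvConfig:
    # The fallback values are placeholders for illustration only,
    # not the plugin's real defaults.
    return ElasticEnvConfig(
        nnodes=env.get("PET_NNODES", "1"),
        nproc_per_node=int(env.get("PET_NPROC_PER_NODE", "1")),
        max_restarts=int(env.get("PET_MAX_RESTARTS", "0")),
        monitor_interval=float(env.get("PET_MONITOR_INTERVAL", "5")),
    )


if __name__ == "__main__":
    # With no PET_* variables present, only the fallbacks are used, which is
    # how an override could be lost when the variables were never set.
    print(read_elastic_config({}))
    print(read_elastic_config({"PET_NPROC_PER_NODE": "4"}))
```

In real use the mapping would be `os.environ`; a plain dict is passed here so the behaviour is easy to inspect.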
The fix adds an environment property override to PytorchElasticFunctionTask that
includes the elastic config as environment variables. This ensures that when
task_config is modified via with_overrides, the elastic configuration is
correctly passed to the pod via environment variables.
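The idea of the environment property can be sketched as follows. This is a minimal stand-in, not the code added to PytorchElasticFunctionTask: the classes `ElasticConfigSketch` and `ElasticTaskSketch`, their field names, and their defaults are assumptions made for illustration. The point it shows is that the environment is derived from task_config on each access, so a later change to task_config is reflected when the environment is read.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ElasticConfigSketch:
    """Hypothetical stand-in for the elastic task config."""

    nnodes: int = 1
    nproc_per_node: int = 1
    max_restarts: int = 0
    monitor_interval: int = 5


class ElasticTaskSketch:
    """Hypothetical stand-in for a task that exposes an `environment` property."""

    def __init__(
        self,
        task_config: ElasticConfigSketch,
        base_env: Optional[Dict[str, str]] = None,
    ) -> None:
        self.task_config = task_config
        self._base_env = dict(base_env or {})

    @property
    def environment(self) -> Dict[str, str]:
        # Recomputed on every access, so a task_config replaced after
        # construction is picked up the next time the environment is read.
        env = dict(self._base_env)
        env.update(
            {
                "PET_NNODES": str(self.task_config.nnodes),
                "PET_NPROC_PER_NODE": str(self.task_config.nproc_per_node),
                "PET_MAX_RESTARTS": str(self.task_config.max_restarts),
                "PET_MONITOR_INTERVAL": str(self.task_config.monitor_interval),
            }
        )
        return env


if __name__ == "__main__":
    task = ElasticTaskSketch(ElasticConfigSketch(), base_env={"FOO": "bar"})
    print(task.environment["PET_NPROC_PER_NODE"])
    # Simulate an override that swaps in a new task_config.
    task.task_config = ElasticConfigSketch(nproc_per_node=4)
    print(task.environment["PET_NPROC_PER_NODE"])
```

Computing the mapping lazily, rather than once in the constructor, is the design choice that lets an override made after task definition reach the pod.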
Combined with the previous fix (dynamic task_type property), this now fully
supports:
- Bug #1: single-node (1 proc) -> single-node (multiple procs) override
- Bug #2: single-node (1 proc) -> multi-node (multiple procs) override
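The two scenarios listed above depend on the task type following the current config. A minimal sketch of such a dynamic property is given below. The class `TaskTypeSketch` is hypothetical, and while 'python-task' for single-node tasks comes from this commit message, the multi-node value "pytorch" is an assumption, as is the rule that only `nnodes == 1` counts as single-node.

```python
class TaskTypeSketch:
    """Hypothetical stand-in showing a task_type derived from the node count."""

    SINGLE_NODE_TYPE = "python-task"  # stated in the commit message
    MULTI_NODE_TYPE = "pytorch"  # assumed value, for illustration only

    def __init__(self, nnodes: int = 1) -> None:
        self.nnodes = nnodes

    @property
    def task_type(self) -> str:
        # Evaluated on access, so changing nnodes after construction
        # switches the reported task type.
        if self.nnodes == 1:
            return self.SINGLE_NODE_TYPE
        return self.MULTI_NODE_TYPE


if __name__ == "__main__":
    task = TaskTypeSketch(nnodes=1)
    print(task.task_type)
    task.nnodes = 2  # simulate a single-node -> multi-node override
    print(task.task_type)
```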
Co-Authored-By: carlos@exa.ai <carlosmarques.personal@gmail.com>
1 parent 506d94e commit 620d093
2 files changed
Lines changed: 76 additions & 0 deletions
Lines changed: 27 additions & 0 deletions
First file: 27 lines added at new lines 367-393 (after original line 366).
Second file: 49 lines added at new lines 364-412 (after original line 363).