Commit 568a242
fix: catch transient sacct exceptions in SlurmTunnelScheduler.describe()
During long-running jobs (hours), a transient sacct failure (daemon hiccup,
invoke.UnexpectedExit from a non-zero exit code, etc.) would propagate
uncaught through describe() → runner.wait() → wait_and_exit(), killing
the wait loop and reporting EXIT_CODE_TRAINING=1 even though the Slurm
job was still running.
Wrap the sacct call in a try/except and return AppState.UNKNOWN on
failure. UNKNOWN is non-terminal in torchx so polling continues.
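
The pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: `AppState` here is a small stand-in enum for the torchx one, and `describe_state`, `wait`, `run_sacct`, `SLURM_TO_APP`, and `TERMINAL` are hypothetical names chosen for the example. The broad `except Exception` is likewise an assumption about how the failure is caught.

```python
import enum
from typing import Callable


class AppState(enum.Enum):
    """Stand-in for the torchx AppState enum (only the members used here)."""

    UNKNOWN = "UNKNOWN"
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"


# Hypothetical, partial mapping from sacct state strings to AppState.
SLURM_TO_APP = {
    "RUNNING": AppState.RUNNING,
    "COMPLETED": AppState.SUCCEEDED,
    "FAILED": AppState.FAILED,
}

# UNKNOWN is deliberately absent, so the poll loop keeps going on it.
TERMINAL = {AppState.SUCCEEDED, AppState.FAILED}


def describe_state(run_sacct: Callable[[], str]) -> AppState:
    """Query the job state, treating any sacct failure as UNKNOWN."""
    try:
        raw = run_sacct()
    except Exception:
        # Transient failure: report a non-terminal state instead of raising.
        return AppState.UNKNOWN
    return SLURM_TO_APP.get(raw.strip(), AppState.UNKNOWN)


def wait(run_sacct: Callable[[], str], max_polls: int) -> AppState:
    """Poll until a terminal state is seen or max_polls is exhausted."""
    state = AppState.UNKNOWN
    for _ in range(max_polls):
        state = describe_state(run_sacct)
        if state in TERMINAL:
            break
    return state
```

With this shape, a single failed sacct call costs one poll interval rather than ending the wait loop; the next successful call reports the real job state.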
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
1 parent 739409a commit 568a242
2 files changed
Lines changed: 30 additions & 3 deletions
File tree
- nemo_run/run/torchx_backend/schedulers
- test/run/torchx_backend/schedulers
Hunk 1 (diff body not preserved): original lines 243-245 replaced by new lines 243-251 (9 additions, 3 deletions); context lines 240-242 and 246-248 unchanged.
Hunk 2 (diff body not preserved): 21 lines added as new lines 383-403, between original lines 382 and 383; no deletions.