fix: treat DGXCloud UNKNOWN/transient status as PENDING to avoid false failures (#458) (#459)
* fix: treat DGXCloud UNKNOWN/transient status as PENDING to avoid false failures
When a job is submitted to DGXCloud, the API may transiently return a
non-200 response or an "Unknown" phase before the workload is fully
registered. Previously this was mapped to AppState.FAILED, causing
wait_and_exit() to treat the job as terminated immediately while the
pod was still starting up on the cluster.
- DGXCloudState.UNKNOWN now maps to AppState.PENDING in DGX_STATES
- executor.status() returns None (instead of DGXCloudState.UNKNOWN)
  on non-200 HTTP responses so transient API errors don't look like
  a real "Unknown" phase reported by the scheduler
- describe() fallback for unknown keys in DGX_STATES changed to PENDING
- Tests updated and added to cover all three code paths
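The mapping change above can be sketched roughly as follows. This is an illustration only: the two enums are simplified stand-ins defined here (the commit confirms only DGXCloudState.UNKNOWN, AppState.PENDING, AppState.FAILED and the DGX_STATES table), and to_app_state() is a hypothetical helper name, not a function from the codebase.

```python
from enum import Enum
from typing import Optional


class AppState(Enum):
    """Simplified stand-in for the scheduler-level app states."""
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"


class DGXCloudState(Enum):
    """Stand-in; members other than UNKNOWN are assumptions."""
    RUNNING = "Running"
    COMPLETED = "Completed"
    FAILED = "Failed"
    UNKNOWN = "Unknown"


# UNKNOWN maps to PENDING, so a workload that is still being
# registered is not reported as failed.
DGX_STATES = {
    DGXCloudState.RUNNING: AppState.RUNNING,
    DGXCloudState.COMPLETED: AppState.SUCCEEDED,
    DGXCloudState.FAILED: AppState.FAILED,
    DGXCloudState.UNKNOWN: AppState.PENDING,
}


def to_app_state(status: Optional[DGXCloudState]) -> AppState:
    """Hypothetical helper: None (e.g. a non-200 response) and any
    key missing from DGX_STATES both fall back to PENDING."""
    return DGX_STATES.get(status, AppState.PENDING)
```

With this shape, only a phase that is explicitly mapped to FAILED ends the wait loop as a failure.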
* fix: use type-specific endpoint for DGXCloud workload status
The status() method was calling GET /workloads/{job_id} (generic endpoint)
which returns 403 for distributed and training workloads. The correct
endpoints match the create paths: /workloads/distributed/{job_id} for
multi-node jobs and /workloads/trainings/{job_id} for single-node jobs.
This is consistent with how cancel() already uses /workloads/distributed/.
Adds test_status_distributed to verify the correct URL is used for
multi-node executors.
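A minimal sketch of the endpoint selection described above. The two paths come from this commit message; the helper name status_url(), the base_url parameter, and the use of nodes > 1 to detect a multi-node job are assumptions made for illustration.

```python
def status_url(base_url: str, job_id: str, nodes: int) -> str:
    """Build the type-specific status URL for a workload.

    Assumption: more than one node means a distributed workload;
    otherwise the workload is a single-node training.
    """
    kind = "distributed" if nodes > 1 else "trainings"
    return f"{base_url.rstrip('/')}/workloads/{kind}/{job_id}"
```

For example, status_url("https://example.invalid/api", "abc123", nodes=2) yields a URL ending in /workloads/distributed/abc123.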
* fix: read actualPhase from type-specific workload endpoints
The /workloads/distributed/{id} and /workloads/trainings/{id} endpoints
return actualPhase, not phase (which was the field on the generic
/workloads/{id} endpoint). This caused a KeyError crash immediately
after the 403 fix landed.
Now reads actualPhase first, falls back to phase for compatibility,
and returns None (which maps to PENDING) if neither field is present.
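The field lookup above might look something like this. The field names actualPhase and phase are from this commit message; extract_phase() is a hypothetical helper and the payload is assumed to be the already-decoded JSON body.

```python
from typing import Any, Mapping, Optional


def extract_phase(payload: Mapping[str, Any]) -> Optional[str]:
    """Return the workload phase, preferring actualPhase over phase.

    Returns None when neither field is present, so a missing field
    no longer raises KeyError.
    """
    for key in ("actualPhase", "phase"):
        value = payload.get(key)
        if value is not None:
            return value
    return None
```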
* add tests
* fix: store job_id explicitly to avoid separator collision in app_id parsing
When a role name ends with '_', the app_id string looks like:
    experiment___role_name____job_id
Splitting on '___' produces job_id = '_job_id' (spurious leading '_'),
causing the status/cancel/log_iter calls to use a wrong ID and get 404.
Fix: _save_job_dir now stores the actual job_id in the JSON record.
describe(), log_iter(), and _cancel_existing() all read job_id from
the stored record, falling back to app_id.split('___')[-1] for
backwards compatibility with existing saved jobs.
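The collision and the fallback can be reproduced with a short sketch. The job_id key and the split('___')[-1] fallback are taken from this commit message; resolve_job_id() and the shape of the record dict are assumptions for illustration.

```python
from typing import Any, Mapping


def resolve_job_id(record: Mapping[str, Any], app_id: str) -> str:
    """Prefer the job_id stored in the saved record; fall back to
    parsing app_id for records written before job_id was stored."""
    stored = record.get("job_id")
    if stored:
        return stored
    return app_id.split("___")[-1]


# Role name "worker_" ends with '_', so the separator is followed by
# a fourth underscore that ends up at the front of the parsed ID.
app_id = "exp" + "___" + "worker_" + "___" + "abc123"
parsed_only = app_id.split("___")[-1]
resolved = resolve_job_id({"job_id": "abc123"}, app_id)
```

Here parsed_only carries the spurious leading underscore, while resolved is the ID that was actually stored.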
* format
---------
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: oliver könig <okoenig@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>