fix: treat DGXCloud UNKNOWN/transient status as PENDING to avoid false failures by svcnvidia-nemo-ci · Pull Request #458 · NVIDIA-NeMo/Run

svcnvidia-nemo-ci · 2026-03-11T16:42:33Z

Problem

When a job is submitted to DGXCloud, wait_and_exit() would immediately return with FAILED even though the pod was still starting up on the cluster. This happened because:

executor.status() returns None if the auth token request fails, or DGXCloudState.UNKNOWN if the API returns a non-200 response (e.g. a transient HTTP 404 before the workload is registered).
DGXCloudState.UNKNOWN was mapped to AppState.FAILED in DGX_STATES.
TorchX treats AppState.FAILED as a terminal state, so runner.wait() returned immediately — while kubectl get pods showed the job still running.

The window is small but consistently reproducible: the DGXCloud API can take a few seconds to register a newly submitted workload, during which any status poll returns a transient non-200 or Unknown phase.

Fix

DGXCloudState.UNKNOWN → AppState.PENDING in DGX_STATES (was AppState.FAILED)
executor.status() now returns None on non-200 HTTP responses instead of DGXCloudState("Unknown"), so transient API errors are clearly distinguished from the scheduler explicitly reporting an Unknown phase
Fallback for unknown keys in DGX_STATES.get(...) changed from AppState.UNKNOWN to AppState.PENDING

Tests

test_unknown_state_maps_to_pending_not_failed — asserts the DGX_STATES mapping directly
test_describe_returns_pending_when_status_is_none — covers the auth/network failure path
test_describe_returns_pending_when_status_is_unknown — covers the transient 404 path
Updated test_status_error_response to assert None instead of DGXCloudState.UNKNOWN

🤖 Generated with Claude Code

…e failures When a job is submitted to DGXCloud, the API may transiently return a non-200 response or an "Unknown" phase before the workload is fully registered. Previously this was mapped to AppState.FAILED, causing wait_and_exit() to treat the job as terminated immediately while the pod was still starting up on the cluster. - DGXCloudState.UNKNOWN now maps to AppState.PENDING in DGX_STATES - executor.status() returns None (instead of DGXCloudState.UNKNOWN) on non-200 HTTP responses so transient API errors don't look like a real "Unknown" phase reported by the scheduler - describe() fallback for unknown keys in DGX_STATES changed to PENDING - Tests updated and added to cover all three code paths Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com>

The status() method was calling GET /workloads/{job_id} (generic endpoint) which returns 403 for distributed and training workloads. The correct endpoints match the create paths: /workloads/distributed/{job_id} for multi-node jobs and /workloads/trainings/{job_id} for single-node jobs. This is consistent with how cancel() already uses /workloads/distributed/. Adds test_status_distributed to verify the correct URL is used for multi-node executors. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com>

The /workloads/distributed/{id} and /workloads/trainings/{id} endpoints return actualPhase, not phase (which was the field on the generic /workloads/{id} endpoint). This caused a KeyError crash immediately after the 403 fix landed. Now reads actualPhase first, falls back to phase for compatibility, and returns None (PENDING) if neither field is present. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com>

Signed-off-by: oliver könig <okoenig@nvidia.com>

…arsing When a role name ends with '_', the app_id string looks like: experiment___role_name____job_id Splitting on '___' produces job_id = '_job_id' (spurious leading '_'), causing the status/cancel/log_iter calls to use a wrong ID and get 404. Fix: _save_job_dir now stores the actual job_id in the JSON record. describe(), log_iter(), and _cancel_existing() all read job_id from the stored record, falling back to app_id.split('___')[-1] for backwards compatibility with existing saved jobs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com>

Signed-off-by: oliver könig <okoenig@nvidia.com>

…e failures (#458) * fix: treat DGXCloud UNKNOWN/transient status as PENDING to avoid false failures When a job is submitted to DGXCloud, the API may transiently return a non-200 response or an "Unknown" phase before the workload is fully registered. Previously this was mapped to AppState.FAILED, causing wait_and_exit() to treat the job as terminated immediately while the pod was still starting up on the cluster. - DGXCloudState.UNKNOWN now maps to AppState.PENDING in DGX_STATES - executor.status() returns None (instead of DGXCloudState.UNKNOWN) on non-200 HTTP responses so transient API errors don't look like a real "Unknown" phase reported by the scheduler - describe() fallback for unknown keys in DGX_STATES changed to PENDING - Tests updated and added to cover all three code paths Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com> * fix: use type-specific endpoint for DGXCloud workload status The status() method was calling GET /workloads/{job_id} (generic endpoint) which returns 403 for distributed and training workloads. The correct endpoints match the create paths: /workloads/distributed/{job_id} for multi-node jobs and /workloads/trainings/{job_id} for single-node jobs. This is consistent with how cancel() already uses /workloads/distributed/. Adds test_status_distributed to verify the correct URL is used for multi-node executors. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com> * fix: read actualPhase from type-specific workload endpoints The /workloads/distributed/{id} and /workloads/trainings/{id} endpoints return actualPhase, not phase (which was the field on the generic /workloads/{id} endpoint). This caused a KeyError crash immediately after the 403 fix landed. Now reads actualPhase first, falls back to phase for compatibility, and returns None (PENDING) if neither field is present. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com> * add tests Signed-off-by: oliver könig <okoenig@nvidia.com> * fix: store job_id explicitly to avoid separator collision in app_id parsing When a role name ends with '_', the app_id string looks like: experiment___role_name____job_id Splitting on '___' produces job_id = '_job_id' (spurious leading '_'), causing the status/cancel/log_iter calls to use a wrong ID and get 404. Fix: _save_job_dir now stores the actual job_id in the JSON record. describe(), log_iter(), and _cancel_existing() all read job_id from the stored record, falling back to app_id.split('___')[-1] for backwards compatibility with existing saved jobs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com> * format Signed-off-by: oliver könig <okoenig@nvidia.com> --------- Signed-off-by: oliver könig <okoenig@nvidia.com> Co-authored-by: oliver könig <okoenig@nvidia.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>

…e failures (#458) (#459) * fix: treat DGXCloud UNKNOWN/transient status as PENDING to avoid false failures When a job is submitted to DGXCloud, the API may transiently return a non-200 response or an "Unknown" phase before the workload is fully registered. Previously this was mapped to AppState.FAILED, causing wait_and_exit() to treat the job as terminated immediately while the pod was still starting up on the cluster. - DGXCloudState.UNKNOWN now maps to AppState.PENDING in DGX_STATES - executor.status() returns None (instead of DGXCloudState.UNKNOWN) on non-200 HTTP responses so transient API errors don't look like a real "Unknown" phase reported by the scheduler - describe() fallback for unknown keys in DGX_STATES changed to PENDING - Tests updated and added to cover all three code paths * fix: use type-specific endpoint for DGXCloud workload status The status() method was calling GET /workloads/{job_id} (generic endpoint) which returns 403 for distributed and training workloads. The correct endpoints match the create paths: /workloads/distributed/{job_id} for multi-node jobs and /workloads/trainings/{job_id} for single-node jobs. This is consistent with how cancel() already uses /workloads/distributed/. Adds test_status_distributed to verify the correct URL is used for multi-node executors. * fix: read actualPhase from type-specific workload endpoints The /workloads/distributed/{id} and /workloads/trainings/{id} endpoints return actualPhase, not phase (which was the field on the generic /workloads/{id} endpoint). This caused a KeyError crash immediately after the 403 fix landed. Now reads actualPhase first, falls back to phase for compatibility, and returns None (PENDING) if neither field is present. * add tests * fix: store job_id explicitly to avoid separator collision in app_id parsing When a role name ends with '_', the app_id string looks like: experiment___role_name____job_id Splitting on '___' produces job_id = '_job_id' (spurious leading '_'), causing the status/cancel/log_iter calls to use a wrong ID and get 404. Fix: _save_job_dir now stores the actual job_id in the JSON record. describe(), log_iter(), and _cancel_existing() all read job_id from the stored record, falling back to app_id.split('___')[-1] for backwards compatibility with existing saved jobs. * format --------- Signed-off-by: oliver könig <okoenig@nvidia.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com> Co-authored-by: oliver könig <okoenig@nvidia.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

svcnvidia-nemo-ci temporarily deployed to public March 11, 2026 16:44 — with GitHub Actions Inactive

hemildesai previously approved these changes Mar 11, 2026

View reviewed changes

ko3n1g added the r0.8.0 Cherry-pick PR to the r0.8.0 release branch label Mar 11, 2026

ko3n1g dismissed hemildesai’s stale review via 1e18af5 March 11, 2026 18:36

ko3n1g temporarily deployed to public March 11, 2026 18:37 — with GitHub Actions Inactive

ko3n1g temporarily deployed to public March 11, 2026 18:50 — with GitHub Actions Inactive

ko3n1g force-pushed the ko3n1g/fix/dgxcloud-unknown-state-pending branch from cb65757 to c89ba4e Compare March 11, 2026 19:17

ko3n1g temporarily deployed to public March 11, 2026 19:18 — with GitHub Actions Inactive

ko3n1g temporarily deployed to public March 11, 2026 19:24 — with GitHub Actions Inactive

ko3n1g and others added 6 commits March 11, 2026 19:24

add tests

b4c77ed

Signed-off-by: oliver könig <okoenig@nvidia.com>

format

ebf072f

Signed-off-by: oliver könig <okoenig@nvidia.com>

ko3n1g force-pushed the ko3n1g/fix/dgxcloud-unknown-state-pending branch from db4756f to ebf072f Compare March 11, 2026 19:25

ko3n1g temporarily deployed to public March 11, 2026 19:26 — with GitHub Actions Inactive

hemildesai approved these changes Mar 11, 2026

View reviewed changes

ko3n1g merged commit b725412 into main Mar 11, 2026
24 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: treat DGXCloud UNKNOWN/transient status as PENDING to avoid false failures#458

fix: treat DGXCloud UNKNOWN/transient status as PENDING to avoid false failures#458
ko3n1g merged 6 commits into
mainfrom
ko3n1g/fix/dgxcloud-unknown-state-pending

svcnvidia-nemo-ci commented Mar 11, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

svcnvidia-nemo-ci commented Mar 11, 2026

Problem

Fix

Tests

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants