Skip to content

fix: treat DGXCloud UNKNOWN/transient status as PENDING to avoid false failures#458

Merged
ko3n1g merged 6 commits into
mainfrom
ko3n1g/fix/dgxcloud-unknown-state-pending
Mar 11, 2026
Merged

fix: treat DGXCloud UNKNOWN/transient status as PENDING to avoid false failures#458
ko3n1g merged 6 commits into
mainfrom
ko3n1g/fix/dgxcloud-unknown-state-pending

Conversation

@svcnvidia-nemo-ci

Copy link
Copy Markdown
Contributor

Problem

When a job is submitted to DGXCloud, wait_and_exit() would immediately return with FAILED even though the pod was still starting up on the cluster. This happened because:

  1. executor.status() returns None if the auth token request fails, or DGXCloudState.UNKNOWN if the API returns a non-200 response (e.g. a transient HTTP 404 before the workload is registered).
  2. DGXCloudState.UNKNOWN was mapped to AppState.FAILED in DGX_STATES.
  3. TorchX treats AppState.FAILED as a terminal state, so runner.wait() returned immediately — while kubectl get pods showed the job still running.

The window is small but consistently reproducible: the DGXCloud API can take a few seconds to register a newly submitted workload, during which any status poll returns a transient non-200 or Unknown phase.

Fix

  • DGXCloudState.UNKNOWNAppState.PENDING in DGX_STATES (was AppState.FAILED)
  • executor.status() now returns None on non-200 HTTP responses instead of DGXCloudState("Unknown"), so transient API errors are clearly distinguished from the scheduler explicitly reporting an Unknown phase
  • Fallback for unknown keys in DGX_STATES.get(...) changed from AppState.UNKNOWN to AppState.PENDING

Tests

  • test_unknown_state_maps_to_pending_not_failed — asserts the DGX_STATES mapping directly
  • test_describe_returns_pending_when_status_is_none — covers the auth/network failure path
  • test_describe_returns_pending_when_status_is_unknown — covers the transient 404 path
  • Updated test_status_error_response to assert None instead of DGXCloudState.UNKNOWN

🤖 Generated with Claude Code

hemildesai
hemildesai previously approved these changes Mar 11, 2026
@ko3n1g ko3n1g added the r0.8.0 Cherry-pick PR to the r0.8.0 release branch label Mar 11, 2026
@ko3n1g ko3n1g force-pushed the ko3n1g/fix/dgxcloud-unknown-state-pending branch from cb65757 to c89ba4e Compare March 11, 2026 19:17
ko3n1g and others added 6 commits March 11, 2026 19:24
…e failures

When a job is submitted to DGXCloud, the API may transiently return a
non-200 response or an "Unknown" phase before the workload is fully
registered. Previously this was mapped to AppState.FAILED, causing
wait_and_exit() to treat the job as terminated immediately while the
pod was still starting up on the cluster.

- DGXCloudState.UNKNOWN now maps to AppState.PENDING in DGX_STATES
- executor.status() returns None (instead of DGXCloudState.UNKNOWN)
  on non-200 HTTP responses so transient API errors don't look like
  a real "Unknown" phase reported by the scheduler
- describe() fallback for unknown keys in DGX_STATES changed to PENDING
- Tests updated and added to cover all three code paths

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
The status() method was calling GET /workloads/{job_id} (generic endpoint)
which returns 403 for distributed and training workloads. The correct
endpoints match the create paths: /workloads/distributed/{job_id} for
multi-node jobs and /workloads/trainings/{job_id} for single-node jobs.
This is consistent with how cancel() already uses /workloads/distributed/.

Adds test_status_distributed to verify the correct URL is used for
multi-node executors.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
The /workloads/distributed/{id} and /workloads/trainings/{id} endpoints
return actualPhase, not phase (which was the field on the generic
/workloads/{id} endpoint). This caused a KeyError crash immediately
after the 403 fix landed.

Now reads actualPhase first, falls back to phase for compatibility,
and returns None (PENDING) if neither field is present.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
…arsing

When a role name ends with '_', the app_id string looks like:
  experiment___role_name____job_id
Splitting on '___' produces job_id = '_job_id' (spurious leading '_'),
causing the status/cancel/log_iter calls to use a wrong ID and get 404.

Fix: _save_job_dir now stores the actual job_id in the JSON record.
describe(), log_iter(), and _cancel_existing() all read job_id from
the stored record, falling back to app_id.split('___')[-1] for
backwards compatibility with existing saved jobs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
@ko3n1g ko3n1g force-pushed the ko3n1g/fix/dgxcloud-unknown-state-pending branch from db4756f to ebf072f Compare March 11, 2026 19:25
@ko3n1g ko3n1g merged commit b725412 into main Mar 11, 2026
24 checks passed
svcnvidia-nemo-ci added a commit that referenced this pull request Mar 11, 2026
…e failures (#458)

* fix: treat DGXCloud UNKNOWN/transient status as PENDING to avoid false failures

When a job is submitted to DGXCloud, the API may transiently return a
non-200 response or an "Unknown" phase before the workload is fully
registered. Previously this was mapped to AppState.FAILED, causing
wait_and_exit() to treat the job as terminated immediately while the
pod was still starting up on the cluster.

- DGXCloudState.UNKNOWN now maps to AppState.PENDING in DGX_STATES
- executor.status() returns None (instead of DGXCloudState.UNKNOWN)
  on non-200 HTTP responses so transient API errors don't look like
  a real "Unknown" phase reported by the scheduler
- describe() fallback for unknown keys in DGX_STATES changed to PENDING
- Tests updated and added to cover all three code paths

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix: use type-specific endpoint for DGXCloud workload status

The status() method was calling GET /workloads/{job_id} (generic endpoint)
which returns 403 for distributed and training workloads. The correct
endpoints match the create paths: /workloads/distributed/{job_id} for
multi-node jobs and /workloads/trainings/{job_id} for single-node jobs.
This is consistent with how cancel() already uses /workloads/distributed/.

Adds test_status_distributed to verify the correct URL is used for
multi-node executors.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix: read actualPhase from type-specific workload endpoints

The /workloads/distributed/{id} and /workloads/trainings/{id} endpoints
return actualPhase, not phase (which was the field on the generic
/workloads/{id} endpoint). This caused a KeyError crash immediately
after the 403 fix landed.

Now reads actualPhase first, falls back to phase for compatibility,
and returns None (PENDING) if neither field is present.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>

* add tests

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix: store job_id explicitly to avoid separator collision in app_id parsing

When a role name ends with '_', the app_id string looks like:
  experiment___role_name____job_id
Splitting on '___' produces job_id = '_job_id' (spurious leading '_'),
causing the status/cancel/log_iter calls to use a wrong ID and get 404.

Fix: _save_job_dir now stores the actual job_id in the JSON record.
describe(), log_iter(), and _cancel_existing() all read job_id from
the stored record, falling back to app_id.split('___')[-1] for
backwards compatibility with existing saved jobs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>

* format

Signed-off-by: oliver könig <okoenig@nvidia.com>

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>
Co-authored-by: oliver könig <okoenig@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
ko3n1g added a commit that referenced this pull request Mar 11, 2026
…e failures (#458) (#459)

* fix: treat DGXCloud UNKNOWN/transient status as PENDING to avoid false failures

When a job is submitted to DGXCloud, the API may transiently return a
non-200 response or an "Unknown" phase before the workload is fully
registered. Previously this was mapped to AppState.FAILED, causing
wait_and_exit() to treat the job as terminated immediately while the
pod was still starting up on the cluster.

- DGXCloudState.UNKNOWN now maps to AppState.PENDING in DGX_STATES
- executor.status() returns None (instead of DGXCloudState.UNKNOWN)
  on non-200 HTTP responses so transient API errors don't look like
  a real "Unknown" phase reported by the scheduler
- describe() fallback for unknown keys in DGX_STATES changed to PENDING
- Tests updated and added to cover all three code paths




* fix: use type-specific endpoint for DGXCloud workload status

The status() method was calling GET /workloads/{job_id} (generic endpoint)
which returns 403 for distributed and training workloads. The correct
endpoints match the create paths: /workloads/distributed/{job_id} for
multi-node jobs and /workloads/trainings/{job_id} for single-node jobs.
This is consistent with how cancel() already uses /workloads/distributed/.

Adds test_status_distributed to verify the correct URL is used for
multi-node executors.




* fix: read actualPhase from type-specific workload endpoints

The /workloads/distributed/{id} and /workloads/trainings/{id} endpoints
return actualPhase, not phase (which was the field on the generic
/workloads/{id} endpoint). This caused a KeyError crash immediately
after the 403 fix landed.

Now reads actualPhase first, falls back to phase for compatibility,
and returns None (PENDING) if neither field is present.




* add tests



* fix: store job_id explicitly to avoid separator collision in app_id parsing

When a role name ends with '_', the app_id string looks like:
  experiment___role_name____job_id
Splitting on '___' produces job_id = '_job_id' (spurious leading '_'),
causing the status/cancel/log_iter calls to use a wrong ID and get 404.

Fix: _save_job_dir now stores the actual job_id in the JSON record.
describe(), log_iter(), and _cancel_existing() all read job_id from
the stored record, falling back to app_id.split('___')[-1] for
backwards compatibility with existing saved jobs.




* format



---------

Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: oliver könig <okoenig@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

r0.8.0 Cherry-pick PR to the r0.8.0 release branch

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants