feat: Stream DGXC logs #377

Merged: ko3n1g merged 45 commits into main from ko3n1g/feat/stream-dgxc-logs on Nov 20, 2025

ko3n1g commented Nov 5, 2025

Contributor

This adds a log streamer to the DGXCExecutor and ties it to the frontend via the torchx scheduler.

Since the DGXC endpoint doesn't expose logs, we need to go through the kube-apiserver to download the pod's container logs. Since that requires a token, I decided to go via the torchx scheduler, which allows instantiating the executor's state.
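
For readers unfamiliar with the approach described above, here is a minimal sketch of reading container logs from a kube-apiserver with a bearer token. It is not the code from this PR: the helper names, the host, and the way the token is obtained are all hypothetical. The only part taken from Kubernetes itself is the pod log endpoint (`/api/v1/namespaces/{namespace}/pods/{name}/log`) and its `container` and `follow` query parameters.

```python
import urllib.parse
import urllib.request
from typing import Iterator


def build_pod_log_request(
    api_server: str,
    namespace: str,
    pod: str,
    container: str,
    token: str,
    follow: bool = True,
) -> urllib.request.Request:
    """Build a GET request for the Kubernetes pod log endpoint (hypothetical helper)."""
    path = "/api/v1/namespaces/{}/pods/{}/log".format(
        urllib.parse.quote(namespace, safe=""),
        urllib.parse.quote(pod, safe=""),
    )
    # follow=true keeps the connection open and streams new lines as they arrive.
    query = urllib.parse.urlencode(
        {"container": container, "follow": str(follow).lower()}
    )
    url = f"{api_server.rstrip('/')}{path}?{query}"
    # Assumption: the cluster accepts a bearer token for authentication.
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def stream_pod_logs(
    request: urllib.request.Request, timeout: float = 30.0
) -> Iterator[str]:
    """Yield decoded log lines from the open HTTP response, one at a time."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        for raw_line in response:
            yield raw_line.decode("utf-8", errors="replace").rstrip("\n")
```

A caller would pass the result of `build_pod_log_request(...)` to `stream_pod_logs(...)` and iterate over it. Whether the kube-apiserver is reachable at all depends on how the installation is configured.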

I still need to:

  • test this e2e with a DGXC environment
  • add unit tests

Comment thread nemo_run/core/execution/dgxcloud.py Fixed
Comment thread nemo_run/run/torchx_backend/schedulers/dgxcloud.py Fixed
Comment thread nemo_run/core/execution/dgxcloud.py Fixed
Comment thread nemo_run/run/torchx_backend/schedulers/dgxcloud.py Fixed
Comment thread nemo_run/run/torchx_backend/schedulers/dgxcloud.py Fixed
Comment thread nemo_run/run/torchx_backend/schedulers/dgxcloud.py Fixed
Comment thread nemo_run/run/torchx_backend/schedulers/dgxcloud.py Fixed
Comment thread test/core/execution/test_dgxcloud.py Fixed
ko3n1g commented Nov 6, 2025

Contributor Author

hey @roclark, Hemil mentioned that you built the initial DGXC executor. Before diving too deep into the work, I wanted to prefetch your thoughts on this PR. Do you agree with the implementation design to fetch container logs?

roclark commented Nov 7, 2025

Contributor

Hey @ko3n1g, I haven't worked on the Run:ai pieces in a while, but I think this makes sense. My only question would be if the API server is something that is always exposed/available by default to users on installations or if that needs to be configured by an admin beforehand.

Comment thread nemo_run/run/logs.py Dismissed
Comment thread nemo_run/core/execution/dgxcloud.py Fixed
Signed-off-by: oliver könig <okoenig@nvidia.com>
Comment thread test/core/execution/test_dgxcloud.py Fixed
Comment thread nemo_run/core/execution/dgxcloud.py Outdated
from enum import Enum
from pathlib import Path
from typing import Any, Optional, Type
from typing import Any, Dict, Iterable, Optional

Contributor

Can we remove Dict and just use dict for the types?
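
As an illustration of this suggestion, here is a small sketch that uses the built-in `dict` as a generic type instead of `typing.Dict`, which works on Python 3.9 and later. The function `summarize_env` is hypothetical and does not appear in the PR.

```python
from typing import Any, Iterable, Optional


def summarize_env(
    env: dict[str, Any], keys: Optional[Iterable[str]] = None
) -> dict[str, str]:
    """Return the selected entries of env with their values converted to strings."""
    # The built-in dict is subscriptable since Python 3.9, so typing.Dict is not needed.
    selected = env.keys() if keys is None else keys
    return {key: str(env[key]) for key in selected if key in env}
```

`Any`, `Iterable`, and `Optional` are still imported from `typing` here; only `Dict` is replaced.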

Comment thread nemo_run/run/logs.py
role_name,
replica_id,
regex,
None,

Contributor

Why are these needed?

Contributor Author

The torchx call takes 8 arguments, so before this change we were passing should_tail and streams positionally into the since and until slots.

My code paths use streams, so I ran into this issue.
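
The misrouting described above can be reproduced with a stand-in function. This is a sketch only: `log_lines_like` is hypothetical, and its parameter order is assumed to mirror the torchx call being discussed (role name, replica id, regex, since, until, should_tail, streams).

```python
def log_lines_like(
    app_handle,
    role_name,
    k=0,
    regex=None,
    since=None,
    until=None,
    should_tail=False,
    streams=None,
):
    """Hypothetical stand-in that reports which slot each argument landed in."""
    return {
        "since": since,
        "until": until,
        "should_tail": should_tail,
        "streams": streams,
    }


# Six positional arguments: the last two fall into since and until.
buggy = log_lines_like("app", "trainer", 0, None, True, "combined")

# Passing None for since and until pushes the last two into the intended slots.
fixed = log_lines_like("app", "trainer", 0, None, None, None, True, "combined")
```

Passing `should_tail` and `streams` as keyword arguments would avoid the problem without the explicit `None` placeholders.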

@ko3n1g ko3n1g merged commit 6b2240e into main Nov 20, 2025
23 checks passed

4 participants