refactor data mover: switch to BatchJob with auto cleanup and sleep after every run#265

Merged
roclark merged 9 commits into NVIDIA-NeMo:main from prekshivyas:main
Jun 11, 2025

@prekshivyas

Contributor

No description provided.

…fter every run

Signed-off-by: prekshivyas <prekhsivyas@gmail.com>

@roclark roclark left a comment

Contributor

Thanks for putting this together, @prekshivyas! This looks good to me - just some small comments on extra wait states.

Comment thread nemo_run/core/execution/lepton.py Outdated
raise TimeoutError(f"Job {job_id} did not complete within {timeout} seconds.")
current_job = client.job.get(job_id)
current_job_status = current_job.status.state
if count > 0 and current_job_status in [LeptonJobState.Completed, LeptonJobState.Failed, LeptonJobState.Unknown]:

Contributor

Do we need the `count > 0` check? If the job is immediately in one of the acceptable states, I figured we could break out right away, but I suppose it could be in `Unknown` prior to running?

So I had put in some logs to check which states come up when the job is just scheduling or starting. Sometimes it would be `Unknown`, like in the initial tests, where I saw it come up more than 2 times, but in my later tests it would just come up as `Starting` and not `Unknown` anymore. Just to be sure and avoid this randomness, I put in a count check.

Contributor

That makes sense, thanks! We can keep the count check then. If we want to be super efficient, we could add another branch for `current_job_status == LeptonJobState.Unknown and count == 0` that sleeps and retries in that state, and only check for `Completed` and `Failed` without the count here, but I'm not too particular about it.
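
For illustration, the branch suggested above could be sketched roughly as follows. This is not the code that was merged: `JobState` is a hypothetical stand-in for `LeptonJobState` (only the states named in this thread), and `wait_for_job`, `get_state`, `poll_interval`, and `sleep_fn` are made-up names so the loop can run without a Lepton client. The sketch also assumes that an `Unknown` seen after the first poll is still treated as terminal, as in the original condition.

```python
import time
from enum import Enum


class JobState(Enum):
    # Hypothetical stand-in for LeptonJobState; only states mentioned in the thread.
    Unknown = "Unknown"
    Starting = "Starting"
    Running = "Running"
    Completed = "Completed"
    Failed = "Failed"


def wait_for_job(get_state, timeout=60.0, poll_interval=1.0, sleep_fn=time.sleep):
    """Poll get_state() until the job reaches a terminal state or the timeout elapses."""
    waited = 0.0
    count = 0
    while True:
        if waited >= timeout:
            raise TimeoutError(f"Job did not complete within {timeout} seconds.")
        state = get_state()
        if state is JobState.Unknown and count == 0:
            # First poll only: Unknown may just mean the job has not been scheduled yet,
            # so fall through to the sleep and retry.
            pass
        elif state in (JobState.Completed, JobState.Failed, JobState.Unknown):
            # Completed and Failed break out immediately, even on the first poll.
            # Unknown only reaches this branch after the first poll (assumption).
            return state
        sleep_fn(poll_interval)
        waited += poll_interval
        count += 1
```

With this shape, a job that is already `Completed` on the first poll returns without sleeping at all, while an initial `Unknown` costs one extra poll interval.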

Comment thread nemo_run/core/execution/lepton.py Outdated
if current_job_status != LeptonJobState.Completed:
raise RuntimeError(f"Job {job_id} failed with status: {current_job_status}")

time.sleep(sleep)

Contributor

I lean towards not putting a sleep here. If the data has been uploaded to the remote FS and the data-mover is marked as Completed, I'd say we continue immediately to the next stages to reduce downtime. Theoretically, we should only make it to this line if everything is ready to continue.

Hmm, I see! Yeah, we can remove the sleep, so I will do that now.

prekshivyas added 5 commits June 9, 2025 10:28
Signed-off-by: prekshivyas <prekhsivyas@gmail.com>
Signed-off-by: prekshivyas <prekhsivyas@gmail.com>
Signed-off-by: prekshivyas <prekhsivyas@gmail.com>
Signed-off-by: prekshivyas <prekhsivyas@gmail.com>
Signed-off-by: prekshivyas <prekhsivyas@gmail.com>
Signed-off-by: prekshivyas <prekhsivyas@gmail.com>
prekshivyas added 2 commits June 11, 2025 13:22
Signed-off-by: prekshivyas <prekhsivyas@gmail.com>
Signed-off-by: prekshivyas <prekhsivyas@gmail.com>

@roclark roclark left a comment

Contributor

LGTM, thanks!

@roclark roclark merged commit caf3f12 into NVIDIA-NeMo:main Jun 11, 2025
18 of 20 checks passed

2 participants