Skip to content

Introduce JOB_DISCONNECTED_RETRY_TIMEOUT#2627

Merged
r4victor merged 2 commits into
masterfrom
issue_2626_job_no_connectivity
May 14, 2025
Merged

Introduce JOB_DISCONNECTED_RETRY_TIMEOUT#2627
r4victor merged 2 commits into
masterfrom
issue_2626_job_no_connectivity

Conversation

@r4victor
Copy link
Copy Markdown
Collaborator

@r4victor r4victor commented May 13, 2025

Closes #2626

The PR changes process_running_jobs handling of ssh connection errors for pulling/running jobs. Previously, it failed jobs with no capacity after establishing ssh connection fails. Now, it keeps job active for at least JOB_DISCONNECTED_RETRY_TIMEOUT (2 minutes). This allows surviving temporary networking issues.

A possible regression is that dstack becomes slower to replace failed jobs/replicas. IMO, not a big problem since there are no expectations/guarantees on how quick the replacement is.

Later we can make timeout smarter depending on the job (e.g. short for spots, if retry enabled, etc). For now, keeping it as constant for simplicity.

Also introduced JobTerminationReason.INSTANCE_UNREACHABLE – not yet used because of backward compatibility.

@r4victor r4victor requested a review from jvstme May 14, 2025 07:06
else:
# No job_model.termination_reason set means ssh connection failed
if job_model.disconnected_at is None:
job_model.disconnected_at = common_utils.get_current_datetime()
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that runner_ssh_tunnel also does 3 blocking retries, and each can take 15 seconds. So this disconnected_at can be imprecise, since we only set it after the runner_ssh_tunnel retries.

But I'm not sure we need to do anything about it. I thought about disabling the runner_ssh_tunnel retries in favor of the new disconnected_at retries, but having retry logic in both places may further improve stability.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing out. Let's keep both! :)

@r4victor r4victor merged commit 4ff4d63 into master May 14, 2025
25 checks passed
@r4victor r4victor deleted the issue_2626_job_no_connectivity branch May 14, 2025 09:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Do not fail jobs immediately in case of connectivity issues

2 participants