[Bug]: _submit_job_to_runner is unrecoverably broken once any runner API call after /api/submit failed #3740

@un-def

Description

Steps to reproduce

Build dstack-runner with the following patch:

diff --git runner/internal/runner/api/http.go runner/internal/runner/api/http.go
index 34220acc6..dfd9db99c 100644
--- runner/internal/runner/api/http.go
+++ runner/internal/runner/api/http.go
@@ -130,6 +130,11 @@ func (s *Server) uploadCodePostHandler(w http.ResponseWriter, r *http.Request) (
 		return nil, &api.Error{Status: http.StatusConflict}
 	}

+	if !s.uploadCodeCalledOnce {
+		s.uploadCodeCalledOnce = true
+		return nil, &api.Error{Status: http.StatusInternalServerError}
+	}
+
 	r.Body = http.MaxBytesReader(w, r.Body, maxBodySize)

 	if err := s.executor.WriteRepoBlob(r.Body); err != nil {
diff --git runner/internal/runner/api/server.go runner/internal/runner/api/server.go
index 11b76d887..4872f38ef 100644
--- runner/internal/runner/api/server.go
+++ runner/internal/runner/api/server.go
@@ -27,6 +27,8 @@ type Server struct {
 	executor  executor.Executor
 	cancelRun context.CancelFunc

+	uploadCodeCalledOnce bool
+
 	metricsCollector *metrics.MetricsCollector

 	version string

It emulates a flaky failure of the /api/upload_code handler (e.g., a network issue): the first call fails, and all subsequent calls succeed.

Deploy this build and submit a run as usual.

Actual behaviour

Once _submit_job_to_runner fails in any runner API call other than /api/submit, all subsequent attempts are doomed to fail: the previous attempt already moved the runner's executor state to WaitCode or WaitRun, and /api/submit rejects the submission with 409 Conflict because the state is no longer WaitSubmit.
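The failure mode can be reproduced with a toy model of the runner's state machine. This is only a sketch: `FakeRunner`, `ApiError`, `submit_job`, `attempt_statuses`, and the `RUNNING` state are hypothetical names made up for illustration, and the real executor likely has more states and checks. Only the WaitSubmit, WaitCode, and WaitRun states and the fail-once upload handler are taken from this report.

```python
from enum import Enum


class State(Enum):
    WAIT_SUBMIT = "WaitSubmit"
    WAIT_CODE = "WaitCode"
    WAIT_RUN = "WaitRun"
    RUNNING = "Running"  # hypothetical terminal state for this toy model


class ApiError(Exception):
    """Hypothetical error carrying the HTTP status returned by the runner."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


class FakeRunner:
    """Toy runner: each endpoint is only accepted in one state."""

    def __init__(self):
        self.state = State.WAIT_SUBMIT
        self.upload_failed_once = False

    def submit(self):
        if self.state is not State.WAIT_SUBMIT:
            raise ApiError(409)  # wrong state: Conflict
        self.state = State.WAIT_CODE

    def upload_code(self):
        if self.state is not State.WAIT_CODE:
            raise ApiError(409)
        if not self.upload_failed_once:
            # Mirrors the patch above: the first call fails with 500.
            self.upload_failed_once = True
            raise ApiError(500)
        self.state = State.WAIT_RUN

    def run(self):
        if self.state is not State.WAIT_RUN:
            raise ApiError(409)
        self.state = State.RUNNING


def submit_job(runner):
    """Always restarts from /api/submit, like the retry described above."""
    runner.submit()
    runner.upload_code()
    runner.run()


def attempt_statuses(runner, attempts):
    """Return the failing HTTP status of each attempt (None on success)."""
    statuses = []
    for _ in range(attempts):
        try:
            submit_job(runner)
            statuses.append(None)
        except ApiError as e:
            statuses.append(e.status)
    return statuses


print(attempt_statuses(FakeRunner(), 3))
```

In this model the first attempt fails with 500 in the upload step, and every later attempt fails with 409 in the submit step, because the state never returns to WaitSubmit.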

If we ignore the 409 from the /api/submit call, the submission process recovers (for brevity, the patch below swallows every exception, not only 409):

--- src/dstack/_internal/server/background/pipeline_tasks/jobs_running.py
+++ src/dstack/_internal/server/background/pipeline_tasks/jobs_running.py
@@ -1342,16 +1342,19 @@ def _submit_job_to_runner(
     if runner_client.healthcheck() is None:
         return _SubmitJobToRunnerResult(success=success_if_not_available)

-    runner_client.submit_job(
-        run=run,
-        job=job,
-        cluster_info=cluster_info,
-        # Do not send all the secrets since interpolation is already done by the server.
-        # TODO: Passing secrets may be necessary for filtering out secret values from logs.
-        secrets={},
-        repo_credentials=repo_credentials,
-        instance_env=instance_env,
-    )
+    try:
+        runner_client.submit_job(
+            run=run,
+            job=job,
+            cluster_info=cluster_info,
+            # Do not send all the secrets since interpolation is already done by the server.
+            # TODO: Passing secrets may be necessary for filtering out secret values from logs.
+            secrets={},
+            repo_credentials=repo_credentials,
+            instance_env=instance_env,
+        )
+    except Exception:
+        pass
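A narrower variant of the same idea would tolerate only the 409 response and let every other error propagate. The sketch below is self-contained and does not use the real runner client: `HTTPStatusError` and `submit_tolerating_conflict` are hypothetical names, and it assumes the client's error exposes the HTTP status code, which this report does not confirm.

```python
class HTTPStatusError(Exception):
    """Hypothetical stand-in for the error raised by the runner client."""

    def __init__(self, status_code):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def submit_tolerating_conflict(submit):
    """Call submit(); treat 409 Conflict as "already submitted".

    Returns True if the runner accepted the submission and False if it
    answered 409. Any other error is re-raised unchanged.
    """
    try:
        submit()
    except HTTPStatusError as e:
        if e.status_code != 409:
            raise
        return False
    return True
```

Treating 409 as "already submitted" is itself an assumption: it only holds if the runner is still holding the same job from the earlier, partially completed attempt.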

Expected behaviour

No response

dstack version

0.20.15

Server logs

[12:00:33] DEBUG    dstack._internal.server.background.pipeline_tasks.base:357 Processing jobs item 741f2c6c-b400-46a1-a796-da44ebbed36b
           DEBUG    dstack._internal.server.background.pipeline_tasks.jobs_running:681 job(741f2c)task-0-0: process pulling job with shim, age=0:00:29.311805
           DEBUG    dstack._internal.server.background.pipeline_tasks.jobs_running:1329 job(741f2c)task-0-0: submitting job spec
           DEBUG    dstack._internal.server.background.pipeline_tasks.jobs_running:1330 job(741f2c)task-0-0: repo clone URL is None
           DEBUG    dstack._internal.server.background.pipeline_tasks.jobs_running:1355 job(741f2c)task-0-0: uploading file archive(s)
           DEBUG    dstack._internal.server.background.pipeline_tasks.jobs_running:1358 job(741f2c)task-0-0: uploading code
           DEBUG    dstack._internal.server.services.runner.ssh:106 Cannot connect to 192.168.122.75's API: 500 Server Error: Internal Server Error for url: http://localhost:39461/api/upload_code
           WARNING  dstack._internal.server.background.pipeline_tasks.jobs_running:906 job(741f2c)task-0-0: is unreachable, waiting for the instance to become reachable again, age=0:00:29.827807
           INFO     dstack._internal.server.services.events:205 Emitting event: Job became unreachable. Event targets: job(741f2c)task-0-0. Actor: system
[12:00:45] DEBUG    dstack._internal.server.background.pipeline_tasks.base:357 Processing jobs item 741f2c6c-b400-46a1-a796-da44ebbed36b
           DEBUG    dstack._internal.server.background.pipeline_tasks.jobs_running:681 job(741f2c)task-0-0: process pulling job with shim, age=0:00:41.777487
           DEBUG    dstack._internal.server.background.pipeline_tasks.jobs_running:1329 job(741f2c)task-0-0: submitting job spec
           DEBUG    dstack._internal.server.background.pipeline_tasks.jobs_running:1330 job(741f2c)task-0-0: repo clone URL is None
           DEBUG    dstack._internal.server.services.runner.ssh:106 Cannot connect to 192.168.122.75's API: 409 Client Error: Conflict for url: http://localhost:35811/api/submit
           WARNING  dstack._internal.server.background.pipeline_tasks.jobs_running:906 job(741f2c)task-0-0: is unreachable, waiting for the instance to become reachable again, age=0:00:42.311724
           <... and "409 Client Error" failure repeated again and again until provisioning timeout exceeded>

Additional information

No response

Labels

bug: Something isn't working
