[Bug]: _submit_job_to_runner is unrecoverably broken once any runner API call after /api/submit failed #3740
Open
Labels
bug (Something isn't working)
Description
Steps to reproduce
Build dstack-runner with the following patch:
diff --git runner/internal/runner/api/http.go runner/internal/runner/api/http.go
index 34220acc6..dfd9db99c 100644
--- runner/internal/runner/api/http.go
+++ runner/internal/runner/api/http.go
@@ -130,6 +130,11 @@ func (s *Server) uploadCodePostHandler(w http.ResponseWriter, r *http.Request) (
return nil, &api.Error{Status: http.StatusConflict}
}
+ if !s.uploadCodeCalledOnce {
+ s.uploadCodeCalledOnce = true
+ return nil, &api.Error{Status: http.StatusInternalServerError}
+ }
+
r.Body = http.MaxBytesReader(w, r.Body, maxBodySize)
if err := s.executor.WriteRepoBlob(r.Body); err != nil {
diff --git runner/internal/runner/api/server.go runner/internal/runner/api/server.go
index 11b76d887..4872f38ef 100644
--- runner/internal/runner/api/server.go
+++ runner/internal/runner/api/server.go
@@ -27,6 +27,8 @@ type Server struct {
executor executor.Executor
cancelRun context.CancelFunc
+ uploadCodeCalledOnce bool
+
metricsCollector *metrics.MetricsCollector
version string
This patch emulates a flaky failure of the /api/upload_code handler (e.g., a network issue): the first call fails, and all subsequent calls succeed.
Deploy this build and submit a run as usual.
Actual behaviour
Once _submit_job_to_runner fails in any runner API call other than /api/submit, all subsequent attempts are doomed to fail: the previous attempt has already moved the runner's executor state to WaitCode or WaitRun, and /api/submit rejects the submission with 409 Conflict because the state is no longer WaitSubmit.
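To illustrate the failure mode, here is a minimal toy model of the state machine described above. It is a sketch, not the actual dstack code: `FakeRunner`, `ApiError`, `State.RUNNING`, `submit_job_to_runner`, the `ignore_submit_conflict` flag, and `attempts_until_success` are hypothetical names, and the model only covers the case where /api/upload_code fails once.

```python
from enum import Enum


class State(Enum):
    WAIT_SUBMIT = "WaitSubmit"
    WAIT_CODE = "WaitCode"
    WAIT_RUN = "WaitRun"
    RUNNING = "Running"  # hypothetical name for the state after a successful run call


class ApiError(Exception):
    """Hypothetical stand-in for an HTTP error returned by the runner API."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


class FakeRunner:
    """Toy model of the runner's executor state machine (an assumption, not real code)."""

    def __init__(self):
        self.state = State.WAIT_SUBMIT
        self.upload_code_called_once = False

    def submit(self):
        # /api/submit is only accepted in WaitSubmit; otherwise 409 Conflict.
        if self.state is not State.WAIT_SUBMIT:
            raise ApiError(409)
        self.state = State.WAIT_CODE

    def upload_code(self):
        if self.state is not State.WAIT_CODE:
            raise ApiError(409)
        # Emulates the patch from "Steps to reproduce": fail the first call only.
        if not self.upload_code_called_once:
            self.upload_code_called_once = True
            raise ApiError(500)
        self.state = State.WAIT_RUN

    def run(self):
        if self.state is not State.WAIT_RUN:
            raise ApiError(409)
        self.state = State.RUNNING


def submit_job_to_runner(runner, ignore_submit_conflict=False):
    """One submission attempt; returns True on success, False if an API call failed."""
    try:
        try:
            runner.submit()
        except ApiError as e:
            # Optionally tolerate 409 from submit (job spec was already accepted).
            if not (ignore_submit_conflict and e.status == 409):
                raise
        runner.upload_code()
        runner.run()
    except ApiError:
        return False
    return True


def attempts_until_success(ignore_submit_conflict, max_attempts=5):
    """Retry against the same runner; return the successful attempt number or None."""
    runner = FakeRunner()
    for attempt in range(1, max_attempts + 1):
        if submit_job_to_runner(runner, ignore_submit_conflict):
            return attempt
    return None
```

In this model, `attempts_until_success(False)` never succeeds because every retry is rejected by `submit()`, while `attempts_until_success(True)` succeeds on the second attempt.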
If we ignore the 409 response from the /api/submit call, the submission process recovers:
--- src/dstack/_internal/server/background/pipeline_tasks/jobs_running.py
+++ src/dstack/_internal/server/background/pipeline_tasks/jobs_running.py
@@ -1342,16 +1342,19 @@ def _submit_job_to_runner(
if runner_client.healthcheck() is None:
return _SubmitJobToRunnerResult(success=success_if_not_available)
- runner_client.submit_job(
- run=run,
- job=job,
- cluster_info=cluster_info,
- # Do not send all the secrets since interpolation is already done by the server.
- # TODO: Passing secrets may be necessary for filtering out secret values from logs.
- secrets={},
- repo_credentials=repo_credentials,
- instance_env=instance_env,
- )
+ try:
+ runner_client.submit_job(
+ run=run,
+ job=job,
+ cluster_info=cluster_info,
+ # Do not send all the secrets since interpolation is already done by the server.
+ # TODO: Passing secrets may be necessary for filtering out secret values from logs.
+ secrets={},
+ repo_credentials=repo_credentials,
+ instance_env=instance_env,
+ )
+ except Exception:
+ pass
Expected behaviour
No response
dstack version
0.20.15
Server logs
[12:00:33] DEBUG dstack._internal.server.background.pipeline_tasks.base:357 Processing jobs item 741f2c6c-b400-46a1-a796-da44ebbed36b
DEBUG dstack._internal.server.background.pipeline_tasks.jobs_running:681 job(741f2c)task-0-0: process pulling job with shim, age=0:00:29.311805
DEBUG dstack._internal.server.background.pipeline_tasks.jobs_running:1329 job(741f2c)task-0-0: submitting job spec
DEBUG dstack._internal.server.background.pipeline_tasks.jobs_running:1330 job(741f2c)task-0-0: repo clone URL is None
DEBUG dstack._internal.server.background.pipeline_tasks.jobs_running:1355 job(741f2c)task-0-0: uploading file archive(s)
DEBUG dstack._internal.server.background.pipeline_tasks.jobs_running:1358 job(741f2c)task-0-0: uploading code
DEBUG dstack._internal.server.services.runner.ssh:106 Cannot connect to 192.168.122.75's API: 500 Server Error: Internal Server Error for url: http://localhost:39461/api/upload_code
WARNING dstack._internal.server.background.pipeline_tasks.jobs_running:906 job(741f2c)task-0-0: is unreachable, waiting for the instance to become reachable again, age=0:00:29.827807
INFO dstack._internal.server.services.events:205 Emitting event: Job became unreachable. Event targets: job(741f2c)task-0-0. Actor: system
[12:00:45] DEBUG dstack._internal.server.background.pipeline_tasks.base:357 Processing jobs item 741f2c6c-b400-46a1-a796-da44ebbed36b
DEBUG dstack._internal.server.background.pipeline_tasks.jobs_running:681 job(741f2c)task-0-0: process pulling job with shim, age=0:00:41.777487
DEBUG dstack._internal.server.background.pipeline_tasks.jobs_running:1329 job(741f2c)task-0-0: submitting job spec
DEBUG dstack._internal.server.background.pipeline_tasks.jobs_running:1330 job(741f2c)task-0-0: repo clone URL is None
DEBUG dstack._internal.server.services.runner.ssh:106 Cannot connect to 192.168.122.75's API: 409 Client Error: Conflict for url: http://localhost:35811/api/submit
WARNING dstack._internal.server.background.pipeline_tasks.jobs_running:906 job(741f2c)task-0-0: is unreachable, waiting for the instance to become reachable again, age=0:00:42.311724
<... and the "409 Client Error" failure repeats again and again until the provisioning timeout is exceeded>
Additional information
No response