Retry when Docker has internal timeouts behind MiniWDL #5511

@adamnovak

CI tests can fail intermittently, for no good reason, with errors like:

	[2026-04-30T22:47:19+0000] [MainThread] [C] [toil.worker] Worker crashed with traceback:
	Traceback (most recent call last):
	  File "/builds/databiosphere/toil/venv/lib/python3.14/site-packages/docker/api/client.py", line 275, in _raise_for_status
	    response.raise_for_status()
	    ~~~~~~~~~~~~~~~~~~~~~~~~~^^
	  File "/builds/databiosphere/toil/venv/lib/python3.14/site-packages/requests/models.py", line 1028, in raise_for_status
	    raise HTTPError(http_error_msg, response=self)
	requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: http+docker://localhost/v1.50/services/create
	
	The above exception was the direct cause of the following exception:
	
	Traceback (most recent call last):
	  File "/builds/databiosphere/toil/src/toil/worker.py", line 591, in workerScript
	    job._runner(
	    ~~~~~~~~~~~^
	        jobGraph=None,
	        ^^^^^^^^^^^^^^
	    ...<2 lines>...
	        defer=defer,
	        ^^^^^^^^^^^^
	    )
	    ^
	  File "/builds/databiosphere/toil/src/toil/job.py", line 3376, in _runner
	    returnValues = self._run(jobGraph=None, fileStore=fileStore)
	  File "/builds/databiosphere/toil/src/toil/job.py", line 3254, in _run
	    return self.run(fileStore)
	           ~~~~~~~~^^^^^^^^^^^
	  File "/builds/databiosphere/toil/src/toil/wdl/wdltoil.py", line 333, in decorated
	    return decoratee(*args, **kwargs)
	  File "/builds/databiosphere/toil/src/toil/wdl/wdltoil.py", line 4540, in run
	    task_container.run(miniwdl_logger, command_string)
	    ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	  File "/builds/databiosphere/toil/venv/lib/python3.14/site-packages/WDL/runtime/task_container.py", line 323, in run
	    exit_code = self._run(logger, terminating, command)
	  File "/builds/databiosphere/toil/venv/lib/python3.14/site-packages/WDL/runtime/backend/docker_swarm.py", line 233, in _run
	    svc = client.services.create(image_tag, **kwargs)
	  File "/builds/databiosphere/toil/venv/lib/python3.14/site-packages/docker/models/services.py", line 235, in create
	    service_id = self.client.api.create_service(**create_kwargs)
	  File "/builds/databiosphere/toil/venv/lib/python3.14/site-packages/docker/utils/decorators.py", line 32, in wrapper
	    return f(self, *args, **kwargs)
	  File "/builds/databiosphere/toil/venv/lib/python3.14/site-packages/docker/api/service.py", line 187, in create_service
	    return self._result(
	           ~~~~~~~~~~~~^
	        self._post_json(url, data=data, headers=headers), True
	        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	    )
	    ^
	  File "/builds/databiosphere/toil/venv/lib/python3.14/site-packages/docker/api/client.py", line 281, in _result
	    self._raise_for_status(response)
	    ~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^
	  File "/builds/databiosphere/toil/venv/lib/python3.14/site-packages/docker/api/client.py", line 277, in _raise_for_status
	    raise create_api_error_from_http_exception(e) from e
	          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
	  File "/builds/databiosphere/toil/venv/lib/python3.14/site-packages/docker/errors.py", line 39, in create_api_error_from_http_exception
	    raise cls(e, response=response, explanation=explanation) from e
	docker.errors.APIError: 500 Server Error for http+docker://localhost/v1.50/services/create: Internal Server Error ("rpc error: code = DeadlineExceeded desc = context deadline exceeded")

See: https://ucsc-ci.com/databiosphere/toil/-/jobs/108201/raw

We should add retry logic with exponential backoff, either in MiniWDL (preferred) or at the point where we call into MiniWDL, so that we try again when Docker just fails to create containers for transient reasons like this.
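
For discussion, here is a minimal sketch of what the retry could look like. It is not MiniWDL's or Toil's actual code. The names `retry_with_backoff`, `is_transient_docker_error`, and `TRANSIENT_MARKERS` are hypothetical, and so are the attempt count and delays. The check is duck-typed on a `status_code` attribute (which `docker.errors.APIError` provides) and on the `DeadlineExceeded` text from the traceback above, so the sketch runs without the `docker` package installed.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Hypothetical marker strings, copied from the error message in the traceback.
TRANSIENT_MARKERS = ("DeadlineExceeded", "context deadline exceeded")


def is_transient_docker_error(exc: BaseException) -> bool:
    """Guess whether an exception looks like a Docker-internal timeout."""
    # Assumption: the exception carries a status_code attribute, as
    # docker.errors.APIError does. Anything else is treated as non-transient.
    status = getattr(exc, "status_code", None)
    return status == 500 and any(m in str(exc) for m in TRANSIENT_MARKERS)


def retry_with_backoff(
    operation: Callable[[], T],
    should_retry: Callable[[BaseException], bool] = is_transient_docker_error,
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation(), retrying with exponential backoff on transient errors."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception as exc:
            # Give up on the last attempt or on errors that are not transient.
            if attempt == max_attempts - 1 or not should_retry(exc):
                raise
            # The delay cap doubles each attempt, up to max_delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: sleep a random time in [0, cap] so that many
            # workers hitting the same Docker daemon do not retry in lockstep.
            sleep(random.uniform(0, cap))
    raise AssertionError("unreachable")
```

On the Toil side this might wrap the call shown in the traceback, for example `retry_with_backoff(lambda: task_container.run(miniwdl_logger, command_string))`. Whether it is safe to call `run` a second time on the same task container is an open question here, which is one reason doing this inside MiniWDL around `client.services.create` seems preferable.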

┆Issue is synchronized with this Jira Story
┆Issue Number: TOIL-1832
