[Bug]: Job termination detaches a volume while it's in use by another job when blocks feature is used #3841

@un-def

Description

Steps to reproduce

type: volume
name: demo-volume
backend: gcp  # or aws
region: <REGION>
availability_zone: <AZ>
size: 10GB

type: fleet
name: demo-fleet
nodes: 1
backends: [gcp]  # or [aws]
regions: [<REGION>]
availability_zones: [<AZ>]
resources:
  cpu: 4..
  memory: 1GB..
  disk: 1GB..
  gpu: 0
blocks: auto

type: dev-environment
volumes:
  - demo-volume:/volume
init:
  - echo $DSTACK_JOB_ID > /volume/job_id
resources:
  cpu: 1..
  memory: 1GB..
  gpu: 0..
  disk: 1GB..

  • Create a fleet and a volume
  • dstack apply --name devenv-1 --fleet demo-fleet
    
  • ssh devenv-1 cat /volume/job_id
    f5ecabff-61d7-4914-9ca4-bc4043069a66
    
  • dstack apply --name devenv-2 --fleet demo-fleet --reuse
    Error (Volume error)
    Failed to attach volume: unexpected error
    
  • ssh devenv-1 cat /volume/job_id
    cat: /volume/job_id: Input/output error
    
  • dstack stop devenv-1
    (see the Server logs section for the error produced by this step)
    

Actual behaviour

The second job fails in Compute.attach_volume() because the volume is already in use (attached to the same instance by the first job). Then, while terminating the failed job, the server calls Compute.detach_volume(), which successfully detaches the volume from the instance even though the first job is still using it.
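
The sequence described above can be modelled with a small in-memory sketch. This is not dstack code: `Instance`, `attach_volume`, `terminate_job`, `run_scenario`, and the `guard` flag are hypothetical names introduced only for illustration. With `guard=False` the sketch mirrors the reported behaviour; with `guard=True` it shows one conceivable check (skip the detach while another job still uses the volume), which is an assumption and not a fix proposed in this report.

```python
from dataclasses import dataclass, field


class VolumeInUseError(Exception):
    """Raised when a volume is already attached to the instance."""


@dataclass
class Instance:
    # Volumes attached at the backend level (hypothetical model).
    attached: set[str] = field(default_factory=set)
    # Volume name -> ids of jobs that mounted it (hypothetical bookkeeping).
    users: dict[str, set[str]] = field(default_factory=dict)


def attach_volume(instance: Instance, volume: str, job_id: str) -> None:
    # Stand-in for the backend attach call: fails if already attached.
    if volume in instance.attached:
        raise VolumeInUseError(volume)
    instance.attached.add(volume)
    instance.users.setdefault(volume, set()).add(job_id)


def terminate_job(instance: Instance, volume: str, job_id: str, guard: bool) -> None:
    users = instance.users.get(volume, set())
    users.discard(job_id)
    if guard and users:
        # Assumed guard: another job on this instance still uses the volume.
        return
    # Stand-in for the backend detach call.
    instance.attached.discard(volume)


def run_scenario(guard: bool) -> bool:
    """Return True if the volume is still attached after the second job fails."""
    instance = Instance()
    attach_volume(instance, "demo-volume", "devenv-1")
    try:
        attach_volume(instance, "demo-volume", "devenv-2")
    except VolumeInUseError:
        # The failed job is terminated, which triggers the detach path.
        terminate_job(instance, "demo-volume", "devenv-2", guard)
    return "demo-volume" in instance.attached


print(run_scenario(guard=False))  # reported behaviour: volume gets detached
print(run_scenario(guard=True))   # assumed guard: volume stays attached
```

In this toy model the unguarded run leaves devenv-1 without its volume, which corresponds to the Input/output error seen in the reproduction steps.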

Expected behaviour

No response

dstack version

0.20.19

Server logs

ERROR    dstack._internal.server.background.pipeline_tasks.jobs_terminating:981 Got exception when detaching volume volume-gcp from instance gcp-0
Traceback (most recent call last):
  File "/home/def/dev/dstack/src/dstack/_internal/server/background/pipeline_tasks/jobs_terminating.py", line 940, in _detach_volume_from_job_instance
    await common.run_async(
    ...<4 lines>...
    )
  File "/home/def/dev/dstack/src/dstack/_internal/utils/common.py", line 50, in run_async
    return await asyncio.get_running_loop().run_in_executor(None, func_with_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/def/.local/share/uv/python/cpython-3.13.9-linux-x86_64-gnu/lib/python3.13/concurrent/futures/thread.py", line 59, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/def/dev/dstack/src/dstack/_internal/core/backends/gcp/compute.py", line 857, in detach_volume
    attachment_data = get_or_error(volume.get_attachment_data_for_instance(instance_id))
  File "/home/def/dev/dstack/src/dstack/_internal/utils/common.py", line 292, in get_or_error
    raise ValueError("Optional value is None")
ValueError: Optional value is None
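
The traceback suggests that when devenv-1 is stopped, no attachment data is found for the instance, presumably because the volume was already detached during the second job's termination. The sketch below reproduces that failure mode in isolation. The `get_or_error` body is inferred from the traceback's `raise` line; the `attachments` dict and this simplified `detach_volume` are hypothetical stand-ins, not the GCP backend code.

```python
from typing import Optional, TypeVar

T = TypeVar("T")


def get_or_error(value: Optional[T]) -> T:
    # Behaviour inferred from the traceback: None is turned into a ValueError.
    if value is None:
        raise ValueError("Optional value is None")
    return value


# Hypothetical stand-in for the volume's attachment records, keyed by
# instance id. It is empty here to model a volume that was already detached.
attachments: dict[str, dict] = {}


def detach_volume(instance_id: str) -> None:
    # Simplified stand-in: looking up attachment data for an instance with
    # no record yields None, so get_or_error raises.
    attachment_data = get_or_error(attachments.get(instance_id))
    attachments.pop(instance_id)


try:
    detach_volume("gcp-0")
except ValueError as exc:
    print(f"detach failed: {exc}")
```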

Additional information

No response

Labels: bug, volumes