Volume Snapshot Workflow in CloudStack

This document describes the end-to-end workflow for taking volume-level snapshots from the CloudStack Management Server, presented in the order in which CloudStack orchestrates the operation.

Overview

A volume snapshot in CloudStack captures the state of a disk at a point in time. The snapshot can be stored on primary storage, secondary storage, or replicated across zones and storage pools. The workflow involves multiple layers: API, orchestration, storage engine, and storage-specific strategy plugins.

Step-by-Step Workflow

Step 1 — API Entry Point: `CreateSnapshotCmd.execute()`

File: api/src/main/java/org/apache/cloudstack/api/command/user/snapshot/CreateSnapshotCmd.java

The user (or scheduler) calls the createSnapshot API. The command is a BaseAsyncCreateCmd, meaning snapshot allocation and execution happen in two separate phases (create and execute).

In the execute() phase:

snapshot = _volumeService.takeSnapshot(
    getVolumeId(), getPolicyId(), getEntityId(),
    getAccount(), getQuiescevm(), getLocationType(),
    getAsyncBackup(), getTags(), getZoneIds(),
    getStoragePoolIds(), useStorageReplication());

Key parameters available to the caller:

volumeId – the volume to snapshot
policyId – optional snapshot policy to apply
locationType – PRIMARY or SECONDARY
asyncBackup – whether to back up to secondary asynchronously
zoneIds – destination zones to copy the snapshot to
storagePoolIds – specific primary storage pools to copy the snapshot to
useStorageReplication – use native cross-zone storage replication (StorPool)

Step 2 — Allocation Phase: `VolumeApiServiceImpl.allocSnapshot()`

File: server/src/main/java/com/cloud/storage/VolumeApiServiceImpl.java

Before execute() is called, create() runs allocSnapshot() which:

Verifies the caller has access to the volume.
Validates resource limits (snapshot count, secondary storage quota).
Generates a snapshot name in the format <vmName>_<volumeName>_<timestamp>.
Creates a SnapshotVO record in the database in state Allocated.
Increments resource counters for the account (snapshot count, storage size).

The allocation returns the snapshot ID, which is then used by the execute() phase.
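
The allocation steps above can be sketched roughly as follows. This is a minimal illustration, not CloudStack's implementation: `SnapshotAllocator`, its in-memory counter, and the `yyyyMMddHHmmss` timestamp pattern are all assumptions made for the example.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the allocation phase; not CloudStack's real classes.
class SnapshotAllocator {
    private final Map<String, Integer> snapshotCount = new HashMap<>();
    private final int maxSnapshotsPerAccount;
    private long nextId = 1;

    SnapshotAllocator(int maxSnapshotsPerAccount) {
        this.maxSnapshotsPerAccount = maxSnapshotsPerAccount;
    }

    // Builds "<vmName>_<volumeName>_<timestamp>"; the timestamp pattern is an assumption.
    static String snapshotName(String vmName, String volumeName, LocalDateTime now) {
        String timestamp = now.format(DateTimeFormatter.ofPattern("yyyyMMddHHmmss"));
        return vmName + "_" + volumeName + "_" + timestamp;
    }

    // Checks the limit first, then reserves a slot and returns the new snapshot ID.
    long allocate(String account) {
        int used = snapshotCount.getOrDefault(account, 0);
        if (used >= maxSnapshotsPerAccount) {
            throw new IllegalStateException("Snapshot limit reached for account " + account);
        }
        snapshotCount.put(account, used + 1);
        return nextId++;
    }

    int used(String account) {
        return snapshotCount.getOrDefault(account, 0);
    }
}
```

Checking the limit before incrementing the counter means a rejected request leaves the account's usage untouched, which is the property the later rollback steps rely on.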

Step 3 — Validation and Path Selection: `VolumeApiServiceImpl.takeSnapshotInternal()`

File: server/src/main/java/com/cloud/storage/VolumeApiServiceImpl.java

takeSnapshotInternal() performs pre-flight checks before dispatching work:

Re-validates that the volume exists and is in the Ready state.
Rejects snapshots of volumes whose hypervisor type is External.
Resolves zoneIds and poolIds from snapshot policy details if a policyId is provided.
Validates each destination zone exists.
Checks that the caller has access to both the volume and (if attached) the VM.
If the storage pool is managed and locationType is unset, defaults to LocationType.PRIMARY.
Calls snapshotHelper.addStoragePoolsForCopyToPrimary() to resolve storage pool IDs when useStorageReplication is enabled.

Path selection based on VM attachment:

Volume attached to a VM?
├── YES → Serialize via VM Work Job Queue
│         (Step 4a — job queue path)
└── NO  → Direct execution
          (Step 4b — direct path)
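
Two of the decisions in this step, the managed-storage default and the path choice, reduce to small pure functions. The sketch below is illustrative only: `PreflightSketch` and both enums are hypothetical names, and it follows the Step 4a description in treating any attached VM as selecting the job queue path.

```java
// Illustrative only: these enums and PreflightSketch are hypothetical names.
enum LocationType { PRIMARY, SECONDARY }

enum ExecutionPath { JOB_QUEUE, DIRECT }

class PreflightSketch {
    // Managed storage with no explicit location falls back to PRIMARY.
    static LocationType resolveLocation(boolean poolIsManaged, LocationType requested) {
        return (poolIsManaged && requested == null) ? LocationType.PRIMARY : requested;
    }

    // A null VM ID means the volume is not attached to any VM.
    static ExecutionPath selectPath(Long attachedVmId) {
        return attachedVmId != null ? ExecutionPath.JOB_QUEUE : ExecutionPath.DIRECT;
    }
}
```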

Step 4a — Serialized Execution via VM Work Job Queue

File: server/src/main/java/com/cloud/storage/VolumeApiServiceImpl.java

When the volume is attached to a VM, CloudStack serializes the operation using the VM Work Job queue. This prevents concurrent conflicting operations on the same VM.

Outcome<Snapshot> outcome = takeVolumeSnapshotThroughJobQueue(
    vm.getId(), volumeId, policyId, snapshotId,
    account.getId(), quiesceVm, locationType,
    asyncBackup, zoneIds, poolIds);

A VmWorkTakeVolumeSnapshot work item is created and dispatched. The job framework eventually calls orchestrateTakeVolumeSnapshot(VmWorkTakeVolumeSnapshot work) from within the VM work job dispatcher.

If the current thread is already running inside the job dispatcher (re-entrant case), a placeholder work record is created and orchestrateTakeVolumeSnapshot() is called directly to avoid deadlock.
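
The effect of per-VM serialization can be approximated with one single-threaded executor per VM. This is only a sketch of the idea: `VmWorkQueue` is a hypothetical class, CloudStack's job framework persists work items and does considerably more, and the re-entrant case described above is not modelled.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

// Hypothetical per-VM work queue; not part of CloudStack.
class VmWorkQueue {
    private final Map<Long, ExecutorService> queues = new ConcurrentHashMap<>();

    // Jobs submitted for the same vmId run one at a time, in submission order.
    <T> CompletableFuture<T> submit(long vmId, Supplier<T> job) {
        ExecutorService queue =
                queues.computeIfAbsent(vmId, id -> Executors.newSingleThreadExecutor());
        return CompletableFuture.supplyAsync(job, queue);
    }

    // Lets worker threads exit once queued jobs have finished.
    void shutdown() {
        queues.values().forEach(ExecutorService::shutdown);
    }
}
```

Jobs for different VMs land on different executors and may run concurrently, while two operations on the same VM can never overlap.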

VmWorkTakeVolumeSnapshot carries:

// engine/components-api/src/main/java/com/cloud/vm/VmWorkTakeVolumeSnapshot.java
new VmWorkTakeVolumeSnapshot(userId, accountId, vmId, handlerName,
    volumeId, policyId, snapshotId, quiesceVm,
    locationType, asyncBackup, zoneIds, poolIds);

Step 4b — Direct Execution (Volume Not Attached to VM)

File: server/src/main/java/com/cloud/storage/VolumeApiServiceImpl.java

When the volume is not attached to a VM, a CreateSnapshotPayload is built and attached directly to the volume:

CreateSnapshotPayload payload = new CreateSnapshotPayload();
payload.setSnapshotId(snapshotId);
payload.setSnapshotPolicyId(policyId);
payload.setAccount(account);
payload.setQuiescevm(quiescevm);
payload.setLocationType(locationType);
payload.setAsyncBackup(asyncBackup);
payload.setZoneIds(zoneIds);
payload.setStoragePoolIds(poolIds);

volume.addPayload(payload);
return volService.takeSnapshot(volume);

Step 5 — Orchestration: `orchestrateTakeVolumeSnapshot()`

File: server/src/main/java/com/cloud/storage/VolumeApiServiceImpl.java

On the job queue path (including the re-entrant case), orchestrateTakeVolumeSnapshot() handles the final preparation, mirroring what the direct path in Step 4b does inline:

Re-validates the volume is still Ready.
Detects whether the volume is encrypted and attached to a running VM; rejects such snapshots unless the StorPool exception described below applies.
Builds the CreateSnapshotPayload with all execution parameters.
Attaches the payload to the volume.
Calls volService.takeSnapshot(volume) — delegating to SnapshotManagerImpl.

StorPool encrypted volume exception:

boolean isSnapshotOnStorPoolOnly =
    volume.getStoragePoolType() == StoragePoolType.StorPool &&
    SnapshotInfo.BackupSnapshotAfterTakingSnapshot.value();
// Allow live snapshot of encrypted volumes on StorPool primary storage
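
The rejection rule itself can be written as a small guard. In this sketch `EncryptedSnapshotGuard` is a hypothetical class and the three boolean flags stand in for state that CloudStack reads from the volume, the VM, and the flag computed above.

```java
// Hypothetical guard; the flags stand in for state read from the volume and VM.
class EncryptedSnapshotGuard {
    // Throws only when all three conditions for rejection hold at once.
    static void check(boolean volumeEncrypted, boolean vmRunning, boolean storPoolExceptionApplies) {
        if (volumeEncrypted && vmRunning && !storPoolExceptionApplies) {
            throw new UnsupportedOperationException(
                    "Snapshots of encrypted volumes are not supported while the VM is running");
        }
    }
}
```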

Step 6 — Strategy Selection and Snapshot Execution: `SnapshotManagerImpl.takeSnapshot()`

File: server/src/main/java/com/cloud/storage/snapshot/SnapshotManagerImpl.java

This is the core snapshot execution method:

Extracts CreateSnapshotPayload from the volume.
Determines whether to use the KVM file-based storage path.
Checks if backup to secondary storage is needed for this zone.
For KVM file-based storage with secondary backup, allocates an image store.
Selects the appropriate SnapshotStrategy via StorageStrategyFactory.getSnapshotStrategy(snapshot, TAKE).

Strategy selection priority (highest wins):

| Strategy | Priority | Handles |
|---|---|---|
| `StorPoolSnapshotStrategy` | HIGHEST (for DELETE/COPY) | DELETE, COPY on StorPool storage |
| `StorageSystemSnapshotStrategy` | HIGH | Managed storage (TAKE, DELETE) |
| `DefaultSnapshotStrategy` | DEFAULT | File-based hypervisor snapshots |
| `CephSnapshotStrategy` | HIGH | Ceph RBD snapshots |
| `ScaleIOSnapshotStrategy` | HIGH | ScaleIO/PowerFlex snapshots |

The manager then calls snapshotStrategy.takeSnapshot(snapshot), which returns a SnapshotInfo on primary storage.
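
Priority-based selection of this kind can be sketched as "ask every strategy, keep the highest answer". The types below (`Op`, `Priority`, `Strategy`, `FixedStrategy`, `StrategyPicker`) are simplified, hypothetical stand-ins; the real factory and priority enum in CloudStack may differ in both names and levels.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

enum Op { TAKE, DELETE, COPY }

// Simplified ranking for the sketch; declaration order doubles as priority order.
enum Priority { CANT_HANDLE, DEFAULT, HIGH, HIGHEST }

interface Strategy {
    String name();
    Priority canHandle(String storageType, Op op);
}

// Hypothetical strategy that claims one storage type (null means any) at a fixed priority.
record FixedStrategy(String name, String storageType, Priority priority) implements Strategy {
    public Priority canHandle(String requestedType, Op op) {
        boolean matches = storageType == null || storageType.equals(requestedType);
        return matches ? priority : Priority.CANT_HANDLE;
    }
}

class StrategyPicker {
    // Returns the strategy reporting the highest priority, skipping those that cannot handle the request.
    static Optional<Strategy> pick(List<Strategy> strategies, String storageType, Op op) {
        return strategies.stream()
                .filter(s -> s.canHandle(storageType, op) != Priority.CANT_HANDLE)
                .max(Comparator.comparing((Strategy s) -> s.canHandle(storageType, op)));
    }
}
```

With this shape, a storage plugin only has to report a higher priority for the operations it supports; the default strategy remains the fallback for everything else.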

Step 7 — Primary Storage Snapshot Creation: `SnapshotServiceImpl.takeSnapshot()`

File: engine/storage/snapshot/src/main/java/org/apache/cloudstack/storage/snapshot/SnapshotServiceImpl.java

The storage engine creates the snapshot on primary storage:

Creates a snapshot state object on the primary data store.
Fires Snapshot.Event.CreateRequested on the snapshot.
Fires Volume.Event.SnapshotRequested on the volume.
Issues an asynchronous command to the primary data store driver (PrimaryDataStoreDriver.takeSnapshot()).
Waits for the async callback via AsyncCallFuture<SnapshotResult>.
On success:
- Updates the physical size from the driver response.
- Publishes the EVENT_SNAPSHOT_ON_PRIMARY usage event.
- Fires Volume.Event.OperationSucceeded on the volume.
On failure:
- Fires Snapshot.Event.OperationFailed on the snapshot.
- Fires Volume.Event.OperationFailed on the volume.
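
The wait-then-transition pattern in this step can be sketched with `CompletableFuture` standing in for `AsyncCallFuture`. Everything here is an assumption made for illustration: `PrimarySnapshotSketch`, the simplified state enums, and the idea that the driver future completes with the physical size.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

// Simplified states for the sketch; CloudStack's state machines have more states.
enum VolumeState { READY, SNAPSHOTTING }

enum SnapState { CREATING, CREATED_ON_PRIMARY, ERROR }

class PrimarySnapshotSketch {
    VolumeState volume = VolumeState.READY;
    SnapState snapshot;
    long physicalSize;

    // driverCall stands in for the asynchronous driver request; it completes
    // with the snapshot's physical size or completes exceptionally.
    SnapState take(CompletableFuture<Long> driverCall) {
        snapshot = SnapState.CREATING;
        volume = VolumeState.SNAPSHOTTING;
        try {
            physicalSize = driverCall.join();      // block until the callback fires
            snapshot = SnapState.CREATED_ON_PRIMARY;
        } catch (CompletionException e) {
            snapshot = SnapState.ERROR;
        } finally {
            volume = VolumeState.READY;            // success and failure both release the volume
        }
        return snapshot;
    }
}
```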

Step 8 — Secondary Storage Backup Decision

File: server/src/main/java/com/cloud/storage/snapshot/SnapshotManagerImpl.java

After the snapshot is created on primary, CloudStack decides whether to back it up:

BackupSnapshotAfterTakingSnapshot == true?
├── YES
│   ├── KVM file-based → postSnapshotDirectlyToSecondary()
│   │   (snapshot already on secondary — update DB reference only)
│   └── Otherwise → backupSnapshotToSecondary()
│       ├── asyncBackup == true → schedule BackupSnapshotTask
│       └── asyncBackup == false → synchronous backupSnapshot() + postSnapshotCreation()
└── NO
    ├── storagePoolIds provided AND asyncBackup → schedule BackupSnapshotTask for pool copy
    └── Otherwise → markBackedUp() (snapshot stays on primary only)
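
The decision tree above maps directly onto a function of four flags. `BackupDecision` and `BackupAction` are hypothetical names used only to make the branches explicit; the real method reads these values from configuration and the snapshot payload.

```java
// Hypothetical outcome labels, one per leaf of the decision tree.
enum BackupAction {
    POST_DIRECTLY_TO_SECONDARY,
    SCHEDULE_ASYNC_BACKUP,
    SYNC_BACKUP,
    SCHEDULE_POOL_COPY,
    MARK_BACKED_UP
}

class BackupDecision {
    static BackupAction decide(boolean backupAfterSnapshot, boolean kvmFileBased,
                               boolean asyncBackup, boolean hasStoragePoolIds) {
        if (backupAfterSnapshot) {
            if (kvmFileBased) {
                return BackupAction.POST_DIRECTLY_TO_SECONDARY; // already on secondary
            }
            return asyncBackup ? BackupAction.SCHEDULE_ASYNC_BACKUP : BackupAction.SYNC_BACKUP;
        }
        if (hasStoragePoolIds && asyncBackup) {
            return BackupAction.SCHEDULE_POOL_COPY;
        }
        return BackupAction.MARK_BACKED_UP; // snapshot stays on primary only
    }
}
```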

BackupSnapshotTask (async retry runner):

Retries the backup up to the configured retry limit (see Key Configuration Parameters).
Once retries are exhausted, calls snapshotSrv.cleanupOnSnapshotBackupFailure() to remove the snapshot record.
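
A bounded retry loop with a final cleanup hook captures the behaviour described here. The sketch assumes the limit counts total attempts and omits the wait between attempts; `BackupRetrySketch` and its parameters are hypothetical.

```java
import java.util.function.BooleanSupplier;

class BackupRetrySketch {
    // Runs backupAttempt up to maxAttempts times and returns true on the first success.
    // If every attempt fails, cleanup runs once and false is returned.
    static boolean run(BooleanSupplier backupAttempt, int maxAttempts, Runnable cleanup) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (backupAttempt.getAsBoolean()) {
                return true;
            }
            // A real task would wait for the configured retry interval here.
        }
        cleanup.run();
        return false;
    }
}
```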

Step 9 — StorPool Cross-Zone Snapshot Copy: `StorPoolSnapshotStrategy.copySnapshot()`

File: plugins/storage/volume/storpool/src/main/java/org/apache/cloudstack/storage/snapshot/StorPoolSnapshotStrategy.java

When storagePoolIds are provided and the storage is StorPool, the snapshot is replicated natively between clusters:

Export the snapshot from the local StorPool cluster to the remote location using snapshotExport().
Persist recovery information (the exported name and location) in the snapshot_details table, so that partial cross-zone copies can be recovered.
Import the snapshot on the destination StorPool cluster using snapshotFromRemote().
Reconcile the snapshot on the remote cluster using snapshotReconcile().
Update the snapshot_store_ref.install_path in the database to reflect the destination path.
Invoke the async callback with success or failure.

Recovery detail saved:

// Stored so incomplete exports can be cleaned up later
String detail = "~" + snapshotName + ";" + location;
new SnapshotDetailsVO(snapshot.getId(), SP_RECOVERED_SNAPSHOT, detail, true);
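
The reason for persisting the detail before the import is easier to see in a sketch where the import fails. `RemoteStorage` and `CrossZoneCopySketch` are hypothetical: the method names echo the steps above but are not the StorPool API's real signatures, the details table is a plain map, and the install-path update and callback steps are left out.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical storage client; method names mirror the steps, not real API signatures.
interface RemoteStorage {
    void snapshotExport(String name, String location);
    void snapshotFromRemote(String name, String location);
    void snapshotReconcile(String name);
}

class CrossZoneCopySketch {
    // Stands in for the snapshot_details table.
    final Map<String, String> snapshotDetails = new HashMap<>();

    static String recoveryDetail(String snapshotName, String location) {
        return "~" + snapshotName + ";" + location;
    }

    // Returns true when all remote steps succeed, false otherwise.
    boolean copy(RemoteStorage api, String name, String location) {
        try {
            api.snapshotExport(name, location);
            // Recorded before the import so a failure below leaves a trace.
            snapshotDetails.put("SP_RECOVERED_SNAPSHOT", recoveryDetail(name, location));
            api.snapshotFromRemote(name, location);
            api.snapshotReconcile(name);
            return true;
        } catch (RuntimeException e) {
            return false; // the detail stays behind for later cleanup
        }
    }
}
```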

Step 10 — Post-Snapshot Processing: `postCreateSnapshot()` and Zone/Pool Copies

File: server/src/main/java/com/cloud/storage/snapshot/SnapshotManagerImpl.java

After snapshot creation (and optional backup):

postCreateSnapshot(): Updates snapshot policy retention — removes the oldest snapshot if the retention count is exceeded.
snapshotZoneDao.addSnapshotToZone(): Associates the snapshot with its origin zone.
Usage event: Publishes EVENT_SNAPSHOT_CREATE with the physical size of the snapshot.
Resource limit correction: For delta (incremental) snapshots, decrements the pre-allocated resource count by (volumeSize − snapshotPhysicalSize) since the actual snapshot is smaller than the volume.
copyNewSnapshotToZones() (synchronous backup path only): Copies the snapshot to secondary storage in additional destination zones.
copyNewSnapshotToZonesOnPrimary() (synchronous backup path only): Copies the snapshot to additional primary storage pools.
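
Two of these post-processing steps are plain arithmetic and can be shown in isolation. `PostSnapshotSketch` is a hypothetical helper; clamping the correction at zero is an assumption of the sketch, not something the workflow above states.

```java
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class PostSnapshotSketch {
    // Drops the oldest snapshot IDs until at most maxSnapshots remain; returns the removed IDs.
    static List<Long> applyRetention(Deque<Long> oldestFirst, int maxSnapshots) {
        List<Long> removed = new ArrayList<>();
        while (oldestFirst.size() > maxSnapshots) {
            removed.add(oldestFirst.pollFirst());
        }
        return removed;
    }

    // Bytes to release when a delta snapshot is smaller than the volume it was sized against.
    // Clamping at zero is an assumption of this sketch.
    static long resourceCorrection(long volumeSize, long snapshotPhysicalSize) {
        return Math.max(0L, volumeSize - snapshotPhysicalSize);
    }
}
```

For example, a 100 GiB volume whose delta snapshot occupies 12 GiB would have 88 GiB of the pre-allocated count released.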

Step 11 — Rollback on Failure

File: server/src/main/java/com/cloud/storage/snapshot/SnapshotManagerImpl.java

The outer try/catch in takeSnapshot() ensures resource cleanup on any failure:

} catch (CloudRuntimeException | UnsupportedOperationException cre) {
    ResourceType storeResourceType = getStoreResourceType(...);
    _resourceLimitMgr.decrementResourceCount(snapshotOwner.getId(), ResourceType.snapshot);
    _resourceLimitMgr.decrementResourceCount(snapshotOwner.getId(), storeResourceType, volumeSize);
    throw cre;
} catch (Exception e) {
    // Same resource rollback
    throw new CloudRuntimeException("Failed to create snapshot", e);
}
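
The same undo-then-rethrow pattern can be exercised in a self-contained form. In this sketch `AccountCounters` and `RollbackSketch` are hypothetical, and the counters are assumed to have been incremented earlier, during allocation.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Hypothetical per-account usage counters.
class AccountCounters {
    final AtomicLong snapshots = new AtomicLong();
    final AtomicLong storageBytes = new AtomicLong();
}

class RollbackSketch {
    // Runs the snapshot work; on failure, undoes the allocation-time increments and rethrows.
    static long takeSnapshot(AccountCounters counters, long volumeSize, LongSupplier work) {
        try {
            return work.getAsLong();
        } catch (RuntimeException e) {
            counters.snapshots.decrementAndGet();
            counters.storageBytes.addAndGet(-volumeSize);
            throw e;
        }
    }
}
```

Rethrowing after the decrement keeps the caller's error handling intact while ensuring a failed snapshot does not count against the account.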

Additional cleanup methods:

| Method | Trigger | Action |
|---|---|---|
| `cleanupVolumeDuringSnapshotFailure()` | Snapshot creation fails completely | Removes `snapshot_store_ref` entries (non-Destroyed) and deletes the `SnapshotVO` record |
| `cleanupOnSnapshotBackupFailure()` | Async backup exhausts all retries | Transitions snapshot state, removes async job MS_ID, deletes snapshot record |
| `StorPoolSnapshotStrategy.deleteSnapshot()` | Snapshot DELETE operation on StorPool | Calls StorPool API `snapshotDelete`, transitions state, cleans up DB |

Sequence Diagram (Text Form)

User/Scheduler
    │
    ▼
CreateSnapshotCmd.create()
    │ allocSnapshot() → SnapshotVO persisted (Allocated state)
    ▼
CreateSnapshotCmd.execute()
    │
    ▼
VolumeApiServiceImpl.takeSnapshot()
    │
    ▼
takeSnapshotInternal()
    │  validate volume, account, zones, policies
    │
    ├── [Volume attached to VM]
    │        takeVolumeSnapshotThroughJobQueue()
    │        VmWorkTakeVolumeSnapshot dispatched
    │        ← job queue serializes VM operations →
    │        orchestrateTakeVolumeSnapshot()
    │
    └── [Volume not attached]
             payload built inline in takeSnapshotInternal()
                                       │
                                       │ (both paths)
                                       │ build CreateSnapshotPayload
                                       │ volume.addPayload(payload)
                                       │ volService.takeSnapshot(volume)
                                       ▼
                               SnapshotManagerImpl.takeSnapshot()
                                       │
                                       │ StorageStrategyFactory.getSnapshotStrategy(TAKE)
                                       ▼
                               snapshotStrategy.takeSnapshot(snapshot)
                                       │
                                       ▼
                               SnapshotServiceImpl.takeSnapshot()
                                       │ PrimaryDataStoreDriver.takeSnapshot() [async]
                                       │ ← waits on AsyncCallFuture →
                                       │ snapshot created on primary storage
                                       ▼
                               Backup decision
                               ├── BackupSnapshotAfterTakingSnapshot=true
                               │       backupSnapshotToSecondary()  [sync or async]
                               └── BackupSnapshotAfterTakingSnapshot=false
                                       markBackedUp() / schedule pool copy
                                       ▼
                               postCreateSnapshot()
                                   snapshotZoneDao.addSnapshotToZone()
                                   UsageEventUtils.publishUsageEvent()
                                   _resourceLimitMgr.decrementResourceCount()
                                   copyNewSnapshotToZones()         [if zoneIds]
                                   copyNewSnapshotToZonesOnPrimary() [if poolIds]
                                       ▼
                               Return SnapshotInfo to caller

Key Classes and Their Roles

| Class | Package | Role |
|---|---|---|
| `CreateSnapshotCmd` | `api/.../command/user/snapshot` | API command entry point; two-phase create+execute |
| `VolumeApiServiceImpl` | `server/.../storage` | Validates, dispatches, and orchestrates snapshot requests |
| `VmWorkTakeVolumeSnapshot` | `engine/components-api/.../vm` | Work item for job queue; carries all snapshot parameters |
| `SnapshotManagerImpl` | `server/.../storage/snapshot` | Core business logic; strategy selection; resource accounting |
| `SnapshotHelper` | `server/.../snapshot` | Resolves storage pool IDs for cross-zone replication |
| `SnapshotServiceImpl` | `engine/storage/snapshot` | Interacts with primary data store driver asynchronously |
| `DefaultSnapshotStrategy` | `engine/storage/snapshot` | Hypervisor-based (file) snapshot implementation |
| `StorageSystemSnapshotStrategy` | `engine/storage/snapshot` | Managed storage native snapshot implementation |
| `StorPoolSnapshotStrategy` | `plugins/storage/volume/storpool` | StorPool native snapshot; handles DELETE and cross-zone COPY |
| `StorageStrategyFactory` | `engine/storage` | Selects the highest-priority strategy for each operation |

Key Configuration Parameters

| Parameter | Default | Description |
|---|---|---|
| `backup.snapshot.after.taking.snapshot` (`BackupSnapshotAfterTakingSnapshot`) | `true` | Whether to back up snapshot to secondary storage after creation |
| `snapshot.backup.retries` | `3` | Number of retry attempts for asynchronous snapshot backup |
| `snapshot.backup.retry.interval` | `300` (seconds) | Interval between retry attempts for async backup |
| `use.storage.replication` | `false` | Use native storage replication (e.g., StorPool cross-zone copy) instead of secondary storage copy |
| `snapshot.copy.multiply.exp.backoff` | not specified | Exponential backoff configuration for snapshot copy retries |

Rollback Summary

CloudStack implements rollback at multiple layers to maintain consistency:

Resource limit rollback — On any exception in SnapshotManagerImpl.takeSnapshot(), snapshot count and storage quotas are decremented back to their original values.
Volume state rollback — Volume.Event.OperationFailed is fired so the volume returns to Ready state.
Snapshot state machine — Snapshot transitions to Error or Destroyed so it can be cleaned up by the background expunge process.
Async backup failure cleanup — After exhausting all retries, cleanupOnSnapshotBackupFailure() runs in a transaction to delete the snapshot record and associated job metadata.
StorPool cross-zone recovery — The exported (but not yet imported) snapshot name is persisted in snapshot_details with the key SP_RECOVERED_SNAPSHOT, enabling manual or automated cleanup of partial cross-zone copies.
