Skip to content

glusterd/snapshot (ZFS): refuse snapshot delete when a dependent clone exists#4690

Open
ThalesBarretto wants to merge 1 commit into
gluster:develfrom
ThalesBarretto:snap-zfs-restore-clone-orphan
Open

glusterd/snapshot (ZFS): refuse snapshot delete when a dependent clone exists#4690
ThalesBarretto wants to merge 1 commit into
gluster:develfrom
ThalesBarretto:snap-zfs-restore-clone-orphan

Conversation

@ThalesBarretto

Copy link
Copy Markdown
Contributor

Fixes: #4689

Summary

On a ZFS-backed volume, gluster snapshot clone (and snapshot restore) build the backend with
zfs clone and never zfs promote, so the clone keeps origin == <snapshot>. A later
snapshot delete of that source snapshot runs a plain zfs destroy (no -R), which ZFS correctly
refuses while a dependent clone exists. glusterd logs that failure but does not propagate it — the
return code is overwritten by the subsequent directory-cleanup sys_rmdir — so the CLI reports
snap removed successfully (exit 0), glusterd drops the snapshot object, and the backend ZFS snapshot is
left behind, now untracked. See #4689 for the full root-cause walk.

This change makes the delete fail before any state change, with a clear, actionable error, instead of
silently reporting success.

Fix

  • Add an optional dependents method to the snapshot-provider vtable struct glusterd_snap_ops.
    Providers that do not track dependents leave the slot NULL (LVM is unchanged — designated-init).
  • Implement it for the ZFS provider (glusterd_zfs_snapshot_dependents): reconstruct the snapshot handle
    <dataset>@<volname>_<brick_num> (mirroring glusterd_zfs_snapshot_remove) and run
    zfs get -H -o value clones <snap_device>.
  • Gate the delete prevalidate (glusterd_snapshot_remove_prevalidate): walk the snap's volumes →
    local bricks (MY_UUID) and refuse the op (op_errstr + op_errno, return −1) when a dependent clone
    is confirmed.

Why prevalidate (not commit-phase error propagation): glusterd_snapshot_remove_commit marks the snap
for decommission before the brick-remove runs, so failing at commit time would leave a
half-decommissioned snapshot. Prevalidate refuses before any state is written. (The cluster-wide op-sm
lock serializes snapshot ops, so there is no prevalidate→commit TOCTOU.)

Conservative by design: *has_dependent is set true only on a positively-confirmed, non-empty
clones value. Every inconclusive result — dataset unresolvable, snapshot already gone, zfs not
runnable, output empty or - — returns "no dependent", so the check can never become a new way to block
a legitimate delete and preserves the existing tolerant "already gone" behavior.

Testing (GlusterFS 11.2 + ZFS 2.4.2, single node)

Built on-node and verified with a three-way control (swap only glusterd.so):

glusterd.so delete of a clone-held snapshot backend snapshot
stock reports success (exit 0), snap dropped orphaned
rebuilt, UNPATCHED (control) reports success (exit 0), snap dropped orphaned — rebuild alone doesn't change it
rebuilt, PATCHED refused (exit 1), snap retained retained

The unpatched-rebuild control reproduces identically to stock, so the patched result is attributable to
the source change. Positive controls on the patched build:

  • delete-with-dependent → refused, snap + backend retained, error names the clone (no state change);
  • independent snapshot (no clone) → still deletes and frees its backend snapshot (no regression);
  • after the dependency is removed → the snapshot deletes and the backend snapshot is genuinely gone
    (so only the dependent case is blocked).

The patch applies cleanly to both devel and the 11.2 source.

Known limitations / disclosures (deliberately out of scope here)

  • snapshot delete all becomes fail-fast at a clone-held snapshot. The CLI already issues one delete
    per snapshot and aborts on the first failure (gf_cli_snapshot_for_delete, by design — "Fail the
    operation if deleting one of the snapshots is failed"
    ). With this fix, a clone-held snapshot makes its
    delete fail in prevalidate, so delete all stops there: snapshots processed before it are deleted,
    the clone-held one and any after it remain (tested). This is the same fail-fast behavior as for any
    un-deletable snapshot — and strictly better than the previous silent-orphan-and-continue. Making
    delete all skip-and-continue past a refused snapshot is a possible follow-up, intentionally not done
    here to keep the change minimal.
  • Operator remedy needs backend action. gluster volume delete <clone> does not free the backend
    ZFS clone dataset (verified); the dependent clone has to be removed at the backend (e.g.
    zfs destroy <clone_dataset>) before the snapshot can be deleted. This is a related gap (the clone
    volume's backend dataset outliving volume delete) worth tracking separately; the error message here is
    deliberately worded as "remove the dependent clone" rather than prescribing volume delete.
  • LVM: unchanged (NULL vtable slot; the LVM TU compiles clean against the new struct). Not exercised
    on a live LVM brick.
  • Multi-brick / multi-peer: the check is per-brick and runs on each peer for its local bricks; any
    peer confirming a dependent aborts the op. Verified on a single brick; the per-brick logic is identical
    to the (multi-brick-correct) remove path.
  • auto-delete / restart-GC reach the removal sink via glusterd_snap_remove directly, bypassing this
    delete prevalidate; they are neither covered nor changed by this fix. Verified: soft-limit
    auto-delete of a clone-held oldest snapshot still drops it from glusterd and orphans the backend ZFS
    snapshot, hitting the same swallow in glusterd_snapshot_remove. The shared root cause is that swallow
    (every glusterd_snap_remove caller is exposed); a focused follow-up could guard the sink or make
    auto-delete skip clone-held snapshots (reusing this dependents() method). Kept out of this PR because
    it changes auto-delete's "always frees space" contract and warrants its own design + testing.
  • Synchronous backend probe: the check shells zfs get in prevalidate, so a hung pool blocks the
    delete in prevalidate. This is the same synchronous-zfs exposure glusterd already has in these
    snapshot ops (zfs list/zfs destroy), just surfaced earlier — not a new failure class.
  • Interaction with glusterd/snapshot: fix use-after-scope in glusterd_zfs_dataset() (CWE-562) #4688 (the glusterd_zfs_dataset use-after-scope fix): dependents() calls
    glusterd_zfs_dataset with the current signature and consumes the result immediately (the established
    pattern in create/remove/restore). If glusterd/snapshot: fix use-after-scope in glusterd_zfs_dataset() (CWE-562) #4688 lands first, rebase dependents() onto the caller-owned
    buffer signature.

Note

This is not data loss (no -R is issued); it is a correctness bug (success reported for a failed
operation) plus an invisible, accumulating consumer of pool space. zfs promote is not a safe
alternative (it re-parents the live brick onto the snapshot-clone, after which a zfs destroy -R of the
clone would also destroy the live brick) — see #4689.

…e exists

On a ZFS-backed volume, snapshot clone/restore build the backend with
`zfs clone` and never `zfs promote`, so the clone keeps origin == <snapshot>.
A later `snapshot delete` of that snapshot runs a plain `zfs destroy`, which
ZFS refuses while a dependent clone exists; glusterd logs the failure but the
return code is overwritten by the directory-cleanup rmdir, so it reports
success, drops the snapshot object, and leaves the backend ZFS snapshot
orphaned.

Refuse the delete in prevalidate, before any state change, when the backend
snapshot still has a dependent clone. Add an optional `dependents` method to
the snapshot-provider vtable; the ZFS provider implements it via
`zfs get -H -o value clones`. Providers that do not track dependents (LVM)
leave the slot NULL and are unchanged. The check is conservative: it blocks
only on a positively-confirmed dependent and tolerates every inconclusive
result (snapshot already gone, zfs not runnable, empty output), so it never
blocks a legitimate delete.

Fixes: gluster#4689
Signed-off-by: Thales Antunes de Oliveira Barretto <thales.barretto.git@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

glusterd snapshot (ZFS): snapshot delete reports success when the backend zfs destroy fails, leaving an untracked ZFS snapshot

1 participant