diff --git a/design/ceph/osd-replacement.md b/design/ceph/osd-replacement.md
new file mode 100644
index 000000000000..d3cc1aa51597
--- /dev/null
+++ b/design/ceph/osd-replacement.md
@@ -0,0 +1,322 @@
+# Design: Single OSD replacement with a shared metadata device
+
+Issue: [rook/rook#13240](https://github.com/rook/rook/issues/13240)
+
+## Problem
+
+When an OSD's data and metadata live on different devices (per `spec.storage` `metadataDevice` config in the CephCluster CR), Rook today cannot replace a single failed OSD on its own. The user must either re-provision all OSDs sharing the same metadata device or run a multi-step manual workflow including scaling down the operator to zero. Raw-mode OSDs (data and metadata on a single disk) follow a similar manual procedure today, with fewer steps.
+
+This design proposes a workflow to replace a single failed OSD in place — preserving its OSD ID — without affecting other OSDs sharing the same metadata device.
+
+## Notation
+
+- **User** - the human cluster admin who edits Kubernetes objects.
+- **Operator** - the Rook controller process (CephCluster controller).
+- **Data LV / data device** - the LV (or block device) holding an OSD's bulk data. One per OSD.
+- **DB LV / metadata device** - the LV holding the OSD's rocksdb (`block.db`). One per OSD; multiple OSDs can share the same metadata device.
+
+## User story
+
+A disk corresponding to `osd.5` fails on a node where five HDD OSDs share one NVMe metadata device. The user annotates the failed OSD's deployment:
+
+```sh
+kubectl -n rook-ceph annotate deployment rook-ceph-osd-5 \
+ rook.io/osd-replace=yes-really-replace-osd-5
+```
+
+Rook drains and destroys `osd.5` in Ceph (preserving its CRUSH position and OSD ID — the slot is marked `destroyed` in the OSDMap), removes the data and DB LVs, and deletes the OSD deployment. The user swaps the physical disk in the chassis at any later time — minutes or days — and the next prepare-job reconcile on that node pairs the destroyed slot with the new disk, provisioning a new OSD with the preserved OSD ID 5 and a fresh DB LV in the existing metadata VG. The other four OSDs on the same NVMe stay up the whole time.
+
+The flow works for both already-failed disks (Ceph has already auto-marked the OSD `out` after `mon_osd_down_out_interval`) and misbehaving-but-still-running disks (slow IO, bad blocks, not yet auto-marked out). Rook unconditionally runs `ceph osd out` during drain — idempotent for already-out OSDs, and required for `safe-to-destroy` to ever pass on healthy OSDs.
+
+## Constraints
+
+### Replacement is same-host
+
+The new disk must go to the same host as the destroyed OSD. For shared-metadata layouts this is a hard requirement: the metadata VG holding the new DB LV lives on a device attached to that host. For other layouts there is no hard constraint, but preserving the OSD ID across hosts does not buy anything — CRUSH remaps PGs on daemon start anyway. Cephadm takes the same defensive approach and does not allow [cross-host replacement](https://docs.ceph.com/en/latest/cephadm/services/osd/#replacing-an-osd).
+
+### Rook cannot tell a replacement disk from a new disk
+
+Rook cannot reliably distinguish a freshly added disk from a replacement disk. The user must therefore mark the failed OSD for replacement *before* swapping the disk, so Rook can drive cleanup and reserve the OSD slot for the incoming disk.
+
+### Storage device config must tolerate device swap
+
+Rook's `spec.storage` accepts several device-reference shapes (`useAllDevices`, `deviceFilter`, kernel names, and udev symlinks under `/dev/disk/by-path`, `by-id`, `by-uuid`). Only some of them resolve to the new disk after a swap: `useAllDevices` and `deviceFilter` tolerate any swap, `by-path` tolerates same-slot replacement, and the rest cannot. The replacement flow validates the affected OSD's references before drain and rejects the non-tolerant shapes; see [open question 5](#open-questions) for the full table and trade-offs.
+
+## Proposed flow
+
+This flow drives [Ceph's documented OSD-replacement procedure](https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/#replacing-an-osd) (`ok-to-stop` → `osd out` → `safe-to-destroy` → `osd destroy` → `lvm zap` → `lvm batch --osd-ids`) from inside the existing CephCluster controller. The design adds one new Kubernetes Job — the Replace Job — that destroys the OSD in Ceph and zaps its data on the host.
+
+State is read from Ceph and Kubernetes, not stored in Rook. Three markers carry it through the procedure:
+
+- **Annotation on the OSD Deployment.** Marks intent. Stays in place while the operator validates the request and runs `ok-to-stop`, `osd out`, and `safe-to-destroy`. Cleaned up automatically when the Deployment is deleted.
+- **[Suspended](https://kubernetes.io/blog/2021/04/12/introducing-suspended-jobs/#api-changes) Replace Job.** The operator captures the OSD's fsid, CV mode, data-device path, and encryption flag into the Job's env, creates the Job suspended, and deletes the OSD Deployment. The Job is resumed once the OSD pod terminates.
+- **Destroyed slot in the OSDMap.** The Replace Job runs `ceph osd destroy` to mark the OSD's slot as destroyed in the OSDMap and exits after cleanup is done. The user can physically swap the failed disk at any time after that — destroy and provision are decoupled. The existing prepare-job is extended to check the OSDMap for destroyed OSDs on the same node and reuse them when provisioning a new disk, the same primitive [cephadm uses for `orch osd rm --replace`](https://docs.ceph.com/en/latest/cephadm/services/osd/#replacing-an-osd).
+
+### Sequence
+
+```mermaid
+sequenceDiagram
+ autonumber
+ actor User
+ participant Dep as OSD Deployment
+ participant Op as Operator
+ participant J as Job
+ participant Ceph as Ceph
+
+ User->>Dep: kubectl annotate
+ Op->>Dep: watch
+ critical Validation - emit error and skip OSD if failed
+ Op->>Op: validate Cluster CR device names<br/>and annotation OSD id
+ Op->>Ceph: ceph osd dump (osd.5 exists, state != destroyed)
+ Op->>Ceph: ceph osd ok-to-stop 5
+ end
+ Op->>Ceph: ceph osd out 5 (idempotent)
+ Note left of Ceph: OSD up+out.<br/>PGs migrate to peers.
+ Op->>Ceph: ceph osd safe-to-destroy 5
+ Note left of Op: RequeueAfter 30s until pass.<br/>1h timeout.
+
+ Op->>Dep: read ROOK_OSD_UUID, ROOK_CV_MODE,<br/>ROOK_BLOCK_PATH, encrypted label
+ Op->>+J: create Replace Job<br/>(spec.suspend=true)<br/>OSD_ID, OSD_FSID, OSD_CV_MODE<br/>OSD_DATA_DEVICE, OSD_ENCRYPTED
+
+ Op->>Dep: delete<br/>(propagationPolicy=Foreground)
+ destroy Dep
+ Dep->>Op: deployment removed, reconcile triggered
+ Op->>J: patch spec.suspend=false
+ Note over J: Pod scheduled
+
+ J->>Ceph: ceph osd dump (if osd.5 == destroyed, jump to zap)
+ J->>Ceph: ceph osd safe-to-destroy 5 (re-check)
+ J->>Ceph: ceph osd down osd.5
+ J->>Ceph: ceph osd destroy osd.5 --yes-i-really-mean-it
+ Note over Ceph: OSD slot stays in CRUSH as destroyed.
+
+ J->>Ceph: (lvm only):<br/>ceph-volume lvm zap --osd-id 5 --destroy
+ Note over Ceph: walks ceph.osd_id LV tags.<br/>closes dm-crypt mappings.<br/>removes data and DB LVs
+ J->>Ceph: (raw mode + encrypted):<br/>cryptsetup close ceph-OSD_FSID-vdc-block-dmcrypt
+ J->>Ceph: (raw mode):<br/>ceph-volume lvm zap OSD_DATA_DEVICE --destroy
+ J-->>-Op: succeeded
+
+ Note over User,Ceph: user swaps the failed disk on the host.<br/>Any time after the Replace Job succeeds. No timeout.<br/>End of replace flow. The rest is done in normal OSD provision flow.
+
+ Op->>+J: next prepare-job reconcile on the node
+ J->>Ceph: ceph osd tree --states destroyed<br/>(checks free slots on the node)
+ J->>Ceph: ceph-volume lvm batch --no-auto /dev/newDataDisk \<br/>--db-devices /dev/metaDev --osd-ids 5 \<br/>[--dmcrypt] --yes --no-systemd
+ Note over J: handoff to existing flow<br/>prepare-job writes OSDInfo to CM
+ J-->>-Op: succeeded
+
+ create participant NewDep as new OSD Deployment
+ Op->>NewDep: create rook-ceph-osd-5
+ Op->>Ceph: poll ceph osd tree until osd.5 is up+in
+```
+
+### Step-by-step
+
+The walk-through uses the running example.
+
+#### 1. Trigger
+
+User sets the annotation `rook.io/osd-replace=yes-really-replace-osd-<ID>` on the OSD deployment. The numeric `<ID>` in the value must equal the deployment's `ceph-osd-id` label (copy-paste guard against operating on the wrong OSD).
+
+A predicate carve-out on the CephCluster controller's owned-Deployment watch fires the reconcile when this annotation transitions from absent to present, value changes, or present to absent. Other Deployment updates remain suppressed.
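+
+The carve-out above can be sketched as a pure function over the old and new annotation maps of a Deployment. This is only an illustration under assumptions: `replaceAnnotationChanged` is a hypothetical helper name, and the wiring into the controller's update predicate is omitted so the example needs nothing beyond the standard library.
+
+```go
+package main
+
+import "fmt"
+
+// Annotation key used by this design.
+const replaceAnnotation = "rook.io/osd-replace"
+
+// replaceAnnotationChanged is a hypothetical helper. It reports whether the
+// replace annotation was added, removed, or changed value between the old and
+// new versions of a Deployment. Every other update returns false.
+func replaceAnnotationChanged(oldAnn, newAnn map[string]string) bool {
+	oldVal, oldOK := oldAnn[replaceAnnotation]
+	newVal, newOK := newAnn[replaceAnnotation]
+	if oldOK != newOK {
+		return true // absent to present, or present to absent
+	}
+	return oldOK && oldVal != newVal // value change
+}
+
+func main() {
+	before := map[string]string{"other": "x"}
+	after := map[string]string{"other": "y", replaceAnnotation: "yes-really-replace-osd-5"}
+	fmt.Println(replaceAnnotationChanged(before, after)) // true
+	fmt.Println(replaceAnnotationChanged(after, after))  // false
+}
+```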
+
+#### 2. Validate
+
+Cheap upfront checks. On the first failed check the operator returns from the reconcile without requeuing; `ReportReconcileResult` emits a Warning event on the CephCluster CR. See [open question 6](#open-questions) on where to report the validation result.
+
+1. **Confirmation matches.** Annotation value equals `yes-really-replace-osd-<ID>` where `<ID>` equals the deployment's `ceph-osd-id` label.
+2. **Target OSD exists and is not already destroyed.** `ceph osd dump`: the OSD must exist and its `state` must not contain `destroyed`.
+3. **Deployment is host-based.** Label `ceph.rook.io/pvc` must be absent. See [Notes on Scope](#pvc-based-osd-replacement--separate-design).
+4. **Data-device reference is swap-tolerant.** The OSD's `spec.storage` reference must be `useAllDevices`, `deviceFilter`, or a `/dev/disk/by-path/...` path (same-slot replacement only). See [open question 5](#open-questions) on whether to make this configurable.
+5. **Drain admission.** `ceph osd ok-to-stop <ID>` must return OK. Failure means removing this OSD would push some PG below `min_size`; the user resolves the underlying cluster state before re-annotating.
+
+On all checks passing, the controller proceeds to drain.
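+
+As an illustration of check 1, the confirmation guard could look roughly like the sketch below. `checkConfirmation` and `confirmPrefix` are hypothetical names; the only facts taken from this design are the `yes-really-replace-osd-` prefix and the comparison against the `ceph-osd-id` label.
+
+```go
+package main
+
+import (
+	"fmt"
+	"strconv"
+	"strings"
+)
+
+const confirmPrefix = "yes-really-replace-osd-"
+
+// checkConfirmation is a hypothetical helper for validation check 1. The
+// annotation value must be the confirmation prefix followed by a numeric OSD
+// ID, and that ID must equal the Deployment's ceph-osd-id label.
+func checkConfirmation(annotationValue, osdIDLabel string) (int, error) {
+	if !strings.HasPrefix(annotationValue, confirmPrefix) {
+		return 0, fmt.Errorf("annotation value %q does not start with %q", annotationValue, confirmPrefix)
+	}
+	suffix := strings.TrimPrefix(annotationValue, confirmPrefix)
+	id, err := strconv.Atoi(suffix)
+	if err != nil || id < 0 {
+		return 0, fmt.Errorf("annotation value %q does not end in a numeric OSD ID", annotationValue)
+	}
+	// Compare as strings so that "05" never matches a label of "5".
+	if suffix != osdIDLabel {
+		return 0, fmt.Errorf("annotation names osd.%s but the deployment is osd.%s", suffix, osdIDLabel)
+	}
+	return id, nil
+}
+
+func main() {
+	id, err := checkConfirmation("yes-really-replace-osd-5", "5")
+	fmt.Println(id, err)
+	_, err = checkConfirmation("yes-really-replace-osd-5", "6")
+	fmt.Println(err)
+}
+```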
+
+#### 3. Drain
+
+Drain leaves the OSD `up` but with no PGs, then waits for `safe-to-destroy`.
+
+1. **`ceph osd out <ID>`.** Unconditional. Idempotent for already-out OSDs and required for `safe-to-destroy` to ever pass on a healthy OSD.
+2. **Poll `ceph osd safe-to-destroy <ID>` until OK.**
+
+Polling uses `RequeueAfter: 30s`, with a 1h cumulative timeout since drain start enforced via a wall-clock annotation `rook.io/osd-replace-started-at=<timestamp>` stamped on the deployment on the first drain reconcile.
+
+Retry on failure or expiry: the user removes the `rook.io/osd-replace` annotation and re-adds it. The annotation-flip predicate fires a fresh reconcile; the operator clears the stale `started-at` annotation on a fresh start.
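+
+The timeout bookkeeping can be pictured with the sketch below. It assumes the `started-at` annotation holds an RFC 3339 timestamp, which this design does not specify; `nextDrainAction` and `mustParse` are hypothetical helpers, and the two constants mirror the proposed 1h and 30s defaults.
+
+```go
+package main
+
+import (
+	"fmt"
+	"time"
+)
+
+const (
+	drainTimeout    = time.Hour        // cumulative limit since drain start
+	requeueInterval = 30 * time.Second // delay between safe-to-destroy polls
+)
+
+// mustParse is a small demo helper that parses an RFC 3339 timestamp.
+func mustParse(s string) time.Time {
+	t, err := time.Parse(time.RFC3339, s)
+	if err != nil {
+		panic(err)
+	}
+	return t
+}
+
+// nextDrainAction is a hypothetical helper. startedAt is the value of the
+// started-at annotation (RFC 3339 is an assumption of this sketch). It returns
+// the requeue delay while draining, or expired=true once the timeout elapsed.
+func nextDrainAction(startedAt string, now time.Time) (time.Duration, bool, error) {
+	start, err := time.Parse(time.RFC3339, startedAt)
+	if err != nil {
+		return 0, false, fmt.Errorf("parse started-at %q: %w", startedAt, err)
+	}
+	if now.Sub(start) >= drainTimeout {
+		return 0, true, nil
+	}
+	return requeueInterval, false, nil
+}
+
+func main() {
+	now := mustParse("2024-01-01T12:00:00Z")
+	requeue, expired, err := nextDrainAction("2024-01-01T11:30:00Z", now)
+	fmt.Println(requeue, expired, err)
+}
+```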
+
+#### 4. Spawn the Replace Job
+
+Before deleting the deployment, the operator reads a small set of immutable fields off it and writes them to the Job's env. Kubernetes Job env is immutable across pod restarts within the same Job, so the Job never needs to re-derive these from Ceph state that `osd destroy` will zero out.
+
+| Job env var | Source on the Deployment | Used by |
+|-------------------|-----------------------------------------------|----------------------------------------------------|
+| `OSD_ID`          | annotation value (`yes-really-replace-osd-<ID>`) | all teardown commands                           |
+| `OSD_FSID` | container env `ROOK_OSD_UUID` | raw-encrypted dm-crypt mapper name |
+| `OSD_CV_MODE` | container env `ROOK_CV_MODE` | Replace Job step 5 branch (lvm zap vs raw zap) |
+| `OSD_DATA_DEVICE` | container env `ROOK_BLOCK_PATH` | raw zap target |
+| `OSD_ENCRYPTED` | label `encrypted` on the OSD deployment | raw-encrypted gate for `cryptsetup close` |
+
+The Job is created with `spec.suspend=true` and a deterministic name (`rook-ceph-osd-replace-<ID>`). The suspended Job *is* the durable replacement marker until it completes. On retry, the operator deletes any prior completed/failed Job with the same name before creating a fresh one. The Job reuses the existing prepare-job pod scaffold.
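+
+The mapping in the table above can be sketched as follows. `osdDeploymentInfo` is a stand-in type invented for this example, not a Rook type, and treating a missing `encrypted` label as `false` is an assumption of the sketch.
+
+```go
+package main
+
+import "fmt"
+
+// osdDeploymentInfo holds the fields the operator reads off the OSD
+// Deployment. It is a stand-in for this sketch, not a Rook type.
+type osdDeploymentInfo struct {
+	AnnotationOSDID string            // numeric suffix of the annotation value
+	ContainerEnv    map[string]string // env of the OSD container
+	Labels          map[string]string // labels on the Deployment
+}
+
+// replaceJobEnv maps Deployment fields to Replace Job env vars as listed in
+// the table, and fails if a required value is missing.
+func replaceJobEnv(d osdDeploymentInfo) (map[string]string, error) {
+	env := map[string]string{
+		"OSD_ID":          d.AnnotationOSDID,
+		"OSD_FSID":        d.ContainerEnv["ROOK_OSD_UUID"],
+		"OSD_CV_MODE":     d.ContainerEnv["ROOK_CV_MODE"],
+		"OSD_DATA_DEVICE": d.ContainerEnv["ROOK_BLOCK_PATH"],
+		"OSD_ENCRYPTED":   d.Labels["encrypted"],
+	}
+	for _, key := range []string{"OSD_ID", "OSD_FSID", "OSD_CV_MODE", "OSD_DATA_DEVICE"} {
+		if env[key] == "" {
+			return nil, fmt.Errorf("cannot build Replace Job env: %s is empty", key)
+		}
+	}
+	if env["OSD_ENCRYPTED"] == "" {
+		env["OSD_ENCRYPTED"] = "false" // assumption: absent label means not encrypted
+	}
+	return env, nil
+}
+
+func main() {
+	env, err := replaceJobEnv(osdDeploymentInfo{
+		AnnotationOSDID: "5",
+		ContainerEnv: map[string]string{
+			"ROOK_OSD_UUID":   "example-fsid",
+			"ROOK_CV_MODE":    "lvm",
+			"ROOK_BLOCK_PATH": "/dev/sdb",
+		},
+		Labels: map[string]string{"encrypted": "true"},
+	})
+	fmt.Println(env, err)
+}
+```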
+
+#### 5. Delete the OSD deployment
+
+The operator deletes the OSD deployment with `propagationPolicy=Foreground` and returns from the reconcile without polling. Kubernetes holds the deployment in `Terminating` until the OSD pod is fully terminated and the daemon releases its hold on the data/DB LVs. (dm-crypt mappings stay open at this point — the Replace Job's step 5 closes them.)
+
+The owned-Deployment watch fires the next reconcile when the deployment is finally removed from the API.
+
+#### 6. Unsuspend the Replace Job
+
+On the deletion-triggered reconcile, the operator observes a suspended Job and the now-absent deployment. It patches `Job.spec.suspend=false`; Kubernetes schedules the pod and the Job container begins execution. Job env is fixed at create time, so the values read off the deployment in step 4 survive the unsuspend.
+
+#### 7. Replace Job execution
+
+Steps 1-4 are mode-agnostic and idempotent on pod restart:
+
+1. **State check.** `ceph osd dump`: if `osd.<ID>` state already contains `destroyed`, skip steps 2-4 and jump to step 5.
+2. **Defensive `safe-to-destroy <ID>` re-check.**
+3. **`ceph osd down osd.<ID>`.** Forces the mon view to `down`. Idempotent. Dodges heartbeat-lag `EBUSY` on the next step.
+4. **`ceph osd destroy osd.<ID> --yes-i-really-mean-it`.** The mon's `KVMonitor::do_osd_destroy` clears `dm-crypt/osd/<FSID>/*` and `daemon-private/osd.<ID>/*` keys ([KVMonitor.cc#L369-L387](https://github.com/ceph/ceph/blob/v19.2.2/src/mon/KVMonitor.cc#L369-L387)). CRUSH bucket and weight are preserved; no explicit `config-key rm` is needed.
+
+Step 5 is mode-specific:
+
+**LVM mode (`OSD_CV_MODE == lvm`).** `ceph-volume lvm zap --osd-id <ID> --destroy` walks `ceph.osd_id` LV tags, closes dm-crypt mappings, and removes data and DB LVs. Sibling DB LVs on a shared metadata VG are untouched: `zap --destroy` removes the parent VG only when at most one LV remains. `lvm zap --osd-id` is not idempotent, so it is guarded with `ceph-volume lvm list --format json` in case of Job retry.
+
+**Raw mode (`OSD_CV_MODE == raw`).** All required inputs are in the Job env.
+
+- If `OSD_ENCRYPTED == true`: `cryptsetup close ceph-<OSD_FSID>-<device>-block-dmcrypt` closes the mapping; the raw zap path does not handle dm-crypt teardown. The close is not idempotent, so it is guarded by `cryptsetup status` (exits 4 for inactive mappings, 0 when active).
+- `ceph-volume lvm zap <OSD_DATA_DEVICE> --destroy` runs `ceph-bluestore-tool zap-device` + `wipefs` + `dd`. This is naturally idempotent on a raw device path: a second run re-zeroes the first 10 MB and exits 0. No retry gate is needed for raw mode.
+
+Implementation note for raw-encrypted: `ROOK_BLOCK_PATH` may point at the dm-crypt mapper path (e.g., `/dev/mapper/ceph-<OSD_FSID>-<device>-block-dmcrypt`), not the underlying disk. The destroy logic must resolve the underlying device before close + zap (mirroring how the PVC branch of `DestroyOSD` resolves the real device at [remove.go#L272-L277](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/remove.go#L272-L277)).
+
+The destroy and zap logic lives in Go as an extension of the existing [`DestroyOSD`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/remove.go#L245-L290) function (today called by the OSD migration path). The Replace Job's container invokes a destroy-only entry point (a `rook ceph osd destroy --id <ID>` subcommand, or an env switch in `prepareOSD`) that skips `NewAgent` and `Provision`, runs the extended `DestroyOSD`, and exits. See [Why a separate Replace Job](#why-a-separate-replace-job).
+
+After the Job reaches `status.succeeded=1`, its marker role is done: the destroyed slot in `ceph osd tree` is now the steering signal. The operator may garbage-collect the Replace Job once `osd.<ID>` is observed `up+in` in the final reconcile.
+
+#### 8. Wait for disk swap
+
+The destroyed slot lives in Ceph's OSDMap. No Rook-side state is kept between destroy and provision.
+
+User contract: swap the failed disk only after the Replace Job completes (`Job.status.succeeded == 1`). No timeout — the slot can sit in destroyed state for days. There is no programmatic gate on early-swap by default; see [open question 1](#open-questions).
+
+#### 9. Provision and complete
+
+On the next prepare-job reconcile after the user swaps the disk, the prepare-job runs the existing discovery pass plus a new per-node pre-step:
+
+1. **Enumerate destroyed slots for this node.** `ceph osd tree --states destroyed --format json` filtered by the prepare-job's node name (read from `ROOK_NODE_NAME` env).
+2. **Discovery finds the new empty data device.**
+3. **Invoke `ceph-volume` with the destroyed slot's ID.** Per layout:
+ - **LVM with shared metadata device** (encrypted or not): `ceph-volume lvm batch --no-auto /dev/<new-data-disk> --db-devices /dev/<metadata-device> --osd-ids <ID> [--dmcrypt] --yes --no-systemd`.
+ - **LVM single-disk** (no separate metadata device): same command with `--db-devices` omitted.
+ - **Raw single-disk**: `ceph-volume raw prepare --bluestore --osd-id <ID> --data /dev/<new-data-disk> [--dmcrypt]`.
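+
+Step 1 of the list above could be implemented along the lines of the sketch below. The JSON field names (`nodes`, `type`, `status`, `children`) are assumptions about the shape of `ceph osd tree --format json` and should be verified against the Ceph release in use, as should the assumption that the `--states destroyed` filter keeps the parent host buckets. `destroyedOSDsOnHost` and `sampleTree` are hypothetical, and the sketch also assumes the CRUSH host name equals the node name.
+
+```go
+package main
+
+import (
+	"encoding/json"
+	"fmt"
+)
+
+// osdTree mirrors only the subset of the osd tree JSON used here. The field
+// names are assumptions of this sketch.
+type osdTree struct {
+	Nodes []struct {
+		ID       int    `json:"id"`
+		Name     string `json:"name"`
+		Type     string `json:"type"`
+		Status   string `json:"status"`
+		Children []int  `json:"children"`
+	} `json:"nodes"`
+}
+
+// sampleTree is hand-written demo input, not captured from a real cluster.
+const sampleTree = `{"nodes":[
+ {"id":-1,"name":"default","type":"root","children":[-3,-5]},
+ {"id":-3,"name":"node-1","type":"host","children":[4,5]},
+ {"id":4,"name":"osd.4","type":"osd","status":"up"},
+ {"id":5,"name":"osd.5","type":"osd","status":"destroyed"},
+ {"id":-5,"name":"node-2","type":"host","children":[7]},
+ {"id":7,"name":"osd.7","type":"osd","status":"destroyed"}]}`
+
+// destroyedOSDsOnHost returns the IDs of destroyed OSDs that are direct
+// children of the CRUSH host bucket named host.
+func destroyedOSDsOnHost(raw []byte, host string) ([]int, error) {
+	var tree osdTree
+	if err := json.Unmarshal(raw, &tree); err != nil {
+		return nil, fmt.Errorf("parse osd tree: %w", err)
+	}
+	destroyed := map[int]bool{}
+	for _, n := range tree.Nodes {
+		if n.Type == "osd" && n.Status == "destroyed" {
+			destroyed[n.ID] = true
+		}
+	}
+	var ids []int
+	for _, n := range tree.Nodes {
+		if n.Type != "host" || n.Name != host {
+			continue
+		}
+		for _, child := range n.Children {
+			if destroyed[child] {
+				ids = append(ids, child)
+			}
+		}
+	}
+	return ids, nil
+}
+
+func main() {
+	ids, err := destroyedOSDsOnHost([]byte(sampleTree), "node-1")
+	fmt.Println(ids, err)
+}
+```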
+
+Invocation notes for `lvm batch`:
+
+- **`--osd-ids <ID>` is what claims the destroyed slot.** Internally `ceph-volume` calls `ceph osd new <UUID> <ID>`, binding a fresh OSD UUID to the existing slot. The OSD ID, CRUSH bucket, and weight are preserved; only the UUID changes.
+- **The metadata device is passed as a raw block-device path** (e.g., `/dev/nvme0n1`), not as a specific LV. `ceph-volume` detects the existing metadata VG already on the device and creates a new DB LV inside it — sibling DB LVs (other OSDs sharing the metadata device) are untouched. This relies on the spec's `metadataDevice` still pointing at the same physical device the sibling OSDs use; see [open question 4](#open-questions) for the spec-drift risk.
+- **`--no-systemd` is required.** Without it, if `ceph-volume` hits an activate failure its rollback runs `osd purge-new`, which would destroy the slot we just claimed.
+- **Use `lvm batch /dev/<new-data-disk> ...`, not `lvm batch --data /dev/<new-data-disk> ...`.** `lvm batch` reads data devices as bare paths; the `--data` flag exists but collides with `--data-slots` / `--data-allocate-fraction`.
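+
+The invocation notes above translate into an argument builder roughly like the following sketch. `lvmBatchArgs` is a hypothetical helper and the device paths in `main` are examples only; the flags and their order are the ones listed in this section.
+
+```go
+package main
+
+import (
+	"fmt"
+	"strconv"
+	"strings"
+)
+
+// lvmBatchArgs is a hypothetical helper that assembles the ceph-volume
+// arguments for claiming a destroyed slot. The data device is passed as a
+// bare path (no --data flag), and --db-devices is omitted when the OSD has
+// no separate metadata device.
+func lvmBatchArgs(dataDevice, metadataDevice string, osdID int, encrypted bool) []string {
+	args := []string{"lvm", "batch", "--no-auto", dataDevice}
+	if metadataDevice != "" {
+		args = append(args, "--db-devices", metadataDevice)
+	}
+	args = append(args, "--osd-ids", strconv.Itoa(osdID))
+	if encrypted {
+		args = append(args, "--dmcrypt")
+	}
+	// --no-systemd is always passed, per the note on rollback above.
+	return append(args, "--yes", "--no-systemd")
+}
+
+func main() {
+	args := lvmBatchArgs("/dev/sdf", "/dev/nvme0n1", 5, false)
+	fmt.Println("ceph-volume " + strings.Join(args, " "))
+}
+```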
+
+The prepare-job writes OSDInfo to the existing per-node status CM `rook-ceph-osd-<node>-status`. The cluster controller's existing path takes over: [createOSDsForStatusMap](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/cluster/osd/status.go#L324) reads the CM and creates the daemon Deployment. The controller polls `ceph osd tree` each reconcile until `osd.<ID>` is `up` AND `in`; once observed, it emits an `OSDReplaceCompleted` event on the new deployment.
+
+### Coordination
+
+Replacements run serially per CephCluster as a simplifying choice, matching cephadm's `osd rm` queue and Rook's existing OSD migration. The serialization signal is observable in-band: "an annotated OSD deployment exists OR a Replace Job exists in the namespace". Additional annotated OSDs are deferred to subsequent reconciles.
+
+Per-OSD `safe-to-destroy` returns OK only when the OSD has zero PGs in any acting/up set AND either the cluster is `active+clean` or the OSD reports zero stored PGs. So drain progress can be affected by unrelated cluster state — a drain that times out may be stuck on cluster-wide noise (another OSD lagging, backfill from a different failure) rather than this OSD's status. Concurrent destroys of independently-safe OSDs are technically safe, but serial keeps the operational model simple.
+
+### Reconcile placement
+
+The replacement reconciler runs at the end of `osds.Start()` in the OSD subpackage (alongside the existing OSD migration code), after normal OSD provisioning. Since `osds.Start()` is itself the last step of `reconcileCephDaemons`, a `RequeueAfter` return from replacement does not skip earlier reconcile work (mon, mgr, OSD daemon reconcile). There is no synchronous polling inside the reconcile worker — each waiting step returns `RequeueAfter` and the controller-runtime queue handles re-entry.
+
+### Cancellation and retry
+
+Cancellation is done by removing the `rook.io/osd-replace` annotation.
+
+- **During validate or drain (before destroy).** Clean cancel: no Ceph or host state has been changed that the controller cannot leave as-is. The OSD has been marked `out`; the user re-`in`s it if they want to put the OSD back into service. The operator clears the `started-at` annotation. The Replace Job has not been created yet.
+- **During Replace Job execution.** Not honored. `ceph-volume lvm zap` and adjacent destroy steps cannot be safely interrupted mid-call (partial dm-crypt + half-zapped LV). The Replace Job runs to a terminal state. On Job failure, the operator may retry the Job (deterministic name; prior completed/failed Job deleted first) — destroy and zap sub-steps are idempotent.
+- **After destroy succeeds.** Not honored. The slot is already destroyed in the OSDMap; the only recovery is to either let the flow complete (provision a new OSD with the same ID) or run `ceph osd purge <ID>` manually outside Rook to retire the slot.
+
+Retry after a terminal failure (validation rejection, drain timeout, Job failure that the operator does not auto-retry): the user removes the `rook.io/osd-replace` annotation and re-adds it. The annotation-flip predicate fires a fresh reconcile; the operator clears the stale `started-at` annotation on a fresh start.
+
+## Notes on Scope
+
+### Multiple metadata devices on one node — works conditionally
+
+Rook supports per-device metadata-device pairing:
+
+```yaml
+nodes:
+- name: "node-1"
+ devices:
+ - name: "/dev/disk/by-path/...sda"
+ config: { metadataDevice: "nvme0n1" }
+ - name: "/dev/disk/by-path/...sdb"
+ config: { metadataDevice: "nvme0n1" }
+ - name: "/dev/disk/by-path/...sdc"
+ config: { metadataDevice: "nvme1n1" } # different metadata device on the same node
+```
+
+This setup requires exact `name` (or `fullpath`) references — the per-device `config` block can only be attached to a specific device entry, not to a regex match. Replacement of a single OSD on this setup works structurally: the destroyed slot keeps the OSD ID, `lvm batch --osd-ids` reuses it, and the new DB LV lands in the existing metadata VG, so the per-device `config` block targets the correct metadata device — subject to the same spec-drift caveats discussed in [open question 4](#open-questions). Two caveats:
+
+- **Device-name validation must permit exact entries** — see [open question 5](#open-questions).
+- **Same-slot replacement is required** — `by-path` resolves only when the new disk is in the original slot. Different-slot replacement stalls at provisioning.
+
+### Non-shared-metadata OSDs — same flow, simpler teardown
+
+The flow naturally covers OSDs without a separate metadata device:
+
+- **LVM single-disk:** the Replace Job's `lvm zap --osd-id` removes the single OSD LV. Provisioning uses `lvm batch` without `--db-devices`.
+- **`ceph-volume raw` OSDs:** the Replace Job's raw branch closes the dm-crypt mapping (if encrypted) and runs `lvm zap` on the raw device path. Provisioning uses `raw prepare --osd-id <ID>`.
+
+The CV mode (lvm vs raw) is already carried on the OSD deployment via `ROOK_CV_MODE`; the Replace Job branches on `OSD_CV_MODE` in step 5, and the prepare-job picks the matching provisioning command.
+
+### PVC-based OSD replacement — separate design
+
+PVC-backed OSDs use a different code path (raw mode via `GetCephVolumeRawOSDs`, separate destroy plumbing). Issue #13240 is host-based storage; PVC replacement is a separate design. The validate step rejects PVC-backed deployments by the presence of the `ceph.rook.io/pvc` label.
+
+## Implementation scope
+
+Rook has no automated flow for replacing a failed OSD today. The closest existing primitive is the OSD migration flow (`spec.storage.migration`), which recreates OSDs in place after encryption or store-type spec changes via a `ROOK_REPLACE_OSD=<ID>` env switch on the prepare-job that triggers `DestroyOSD` before provisioning. Migration is decoupled from disk replacement (no swap step) and `DestroyOSD` does not cover the lvm-mode shared-metadata case. Implementing the flow above requires two kinds of work: filling in capabilities missing from existing Rook code, and adding new logic.
+
+**Missing in existing Rook code:**
+
+1. **Annotation flips on an OSD Deployment do not enqueue a reconcile.** `WatchPredicateForNonCRDObject` filters out all Deployment update events at the `appsv1.Deployment` switch case ([predicate.go#L252-L256](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/controller/predicate.go#L252-L256)), so a `rook.io/osd-replace` annotation set by the user is invisible to the CephCluster controller. The replacement flow needs a carve-out: annotation transitions (absent to present, value change, present to absent) must trigger a reconcile, while other Deployment updates remain suppressed.
+
+2. **The prepare-job is not aware of destroyed OSD slots in the OSDMap.** Destroyed slots are exposed by `ceph osd tree --states destroyed`; the prepare-job needs to enumerate them by node and pass the IDs to `lvm batch` via a new `--osd-ids` argument ([volume.go#L597-L720](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/volume.go#L597-L720)).
+
+3. **`DestroyOSD` ([remove.go#L245-L290](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/remove.go#L245-L290)) doesn't handle shared-metadata layouts.** Needs extending to cover the shared-metadata DB LV. Reused by both migration and replacement after the extension.
+
+**New code:**
+
+All operator-side steps (validation, drain, Replace Job lifecycle) are added as a new reconcile step at the end of `reconcileCephDaemons`. See [Reconcile placement](#reconcile-placement).
+
+The Replace Job fully reuses the prepare-job's binary and pod scaffold. The new code adds a **destroy-only entry point**: a `rook ceph osd destroy --id <ID>` subcommand, or an env switch, that runs only the destroy logic and exits. The Job instance is per-OSD (`rook-ceph-osd-replace-<ID>`).
+
+Provisioning of the new disk hands back through the existing per-node status CM (`rook-ceph-osd-<node>-status`) consumed by `createOSDsForStatusMap`, so no new return path is needed. See [Why a separate Replace Job](#why-a-separate-replace-job) below for why the Replace Job is its own thing rather than folded into the prepare-job's `ROOK_REPLACE_OSD` path.
+
+### Why a separate Replace Job
+
+The existing OSD migration flow and this replacement flow both destroy an OSD and reuse its slot. They share the destroy primitive (`DestroyOSD`). Could they share the same Job?
+
+The two flows differ structurally in one place: when provision runs relative to destroy.
+
+- **Migration** is in-place: destroy and re-provision happen back-to-back in one prepare-job pod on the same disk. The disk is always present.
+- **Replacement** is deferred: destroy runs now, but re-provision waits minutes to days for the user to physically swap the failed disk for a new one. Cephadm's [`orch osd rm --replace`](https://docs.ceph.com/en/latest/cephadm/services/osd/#replacing-an-osd) uses the same deferred shape.
+
+If replacement reused migration's "destroy then provision" pod sequence, `Provision` would run immediately after `DestroyOSD` and claim the destroyed slot back onto the freshly-zapped failed disk — defeating the replacement (post-zap, the failed disk presents as an empty kernel device, indistinguishable from a fresh one).
+
+The hole could be closed by extending migration with a destroy-only branch plus conditional gating in `Provision`. That works, but it entangles two distinct lifecycles in one code path with branches at each divergence — at the cost of migration's current focused scope.
+
+Keeping replacement as its own short-lived Job (`rook-ceph-osd-replace-<ID>`) invoking a destroy-only entry point keeps migration's code path unmodified and gives per-OSD observability (`kubectl get jobs` shows which OSD is mid-replacement). The genuinely shared work, the `DestroyOSD` extension, lives below the Job orchestration layer and is reused by both flows. The `--osd-ids` wiring and destroyed-slot enumeration are added for replacement and not used by migration.
+
+## Open questions
+
+1. **Enforce the swap-after-Job contract programmatically or document only?** The flow defines a user contract: swap only after the Replace Job completes. Violation case: the prepare-job claims the new disk as a fresh OSD with a new ID, orphaning the in-flight replacement; recovery is manual `ceph osd purge`. Cephadm matches the contract approach (no programmatic gate). Alternative: suppress the prepare-job on the affected node while the `rook.io/osd-replace` annotation is set or the Replace Job exists, modeled on the existing `skipPreparePod` check in `startProvisioningOverPVCs` (no equivalent exists in `startProvisioningOverNodes` today). Cost: a few lines in `startProvisioningOverNodes`. Benefit: violation becomes impossible to trigger.
+
+2. **Auto-replace mode.** The proposed flow is always triggered explicitly by the user (annotation). Should there be a follow-up option for the operator to auto-annotate when a failed OSD and a fresh disk are detected on a node?
+
+3. **Default values.** Proposed: `safeToDestroyTimeout: 1h`; drain re-check interval `30s`; disk-swap wait has no timeout (the destroyed slot persists in the OSDMap indefinitely). Reasonable, or change them?
+
+4. **Metadata device on a shared-metadata node: spec vs sibling.** The current proposal re-reads `spec.storage[].config.metadataDevice` at provision time, which can drift from what the destroyed OSD actually used (user spec edits, or kernel-name renumbering on host reboot — kernel names are not persistent). Drift silently splits the shared-metadata layout. Alternative: derive the metadata device from a surviving sibling's DB LV (e.g., `ceph-volume lvm list --format json` filtered on `type=db`). Downside: when no sibling survives on the node, this must fall back to spec anyway.
+
+5. **Device-name validation.** Proposed: accept `useAllDevices`, `deviceFilter`, and `by-path` (same-slot only); reject kernel names, `by-id`, and `by-uuid` (see [persistent block device naming](https://wiki.archlinux.org/title/Persistent_block_device_naming) for background on which references survive a swap). Should this be configurable, more permissive (user takes responsibility for any name), or stricter (reject `by-path` too)?
+
+6. **Where to report the validation result.** The current proposal emits a Warning event on the CephCluster CR (matching Rook convention). Should we also emit on the OSD Deployment for discoverability (it's the object the user just annotated), or emit only on the Deployment?