From e4059275077cebc783936e5890f907675a51174c Mon Sep 17 00:00:00 2001 From: Artem Torubarov Date: Mon, 4 May 2026 13:10:48 +0200 Subject: [PATCH 01/12] docs: osd replacement design document Signed-off-by: Artem Torubarov --- osd-design.md | 538 ++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 538 insertions(+) create mode 100644 osd-design.md diff --git a/osd-design.md b/osd-design.md new file mode 100644 index 000000000000..bc2b5b52938d --- /dev/null +++ b/osd-design.md @@ -0,0 +1,538 @@ +# Design: Single OSD replacement with a shared metadata device + +Issue: [rook/rook#13240](https://github.com/rook/rook/issues/13240) + +## Problem + +When an OSD's data and metadata live on different devices (per `spec.storage` `metadataDevice` config in the CephCluster CR), Rook today cannot replace a single failed OSD on its own. The user must either re-provision all OSDs sharing the same metadata device or run a multi-step manual workflow including scaling down the operator to zero. Both are slow and error-prone. + +This design proposes a workflow to replace a single failed OSD in place — preserving its OSD ID — without affecting other OSDs sharing the same metadata device. + +## Notation + +- **User** - the human cluster admin who edits the CR. +- **Operator** - the Rook controller process. +- **Data LV / data device** - the LV (or block device) holding an OSD's bulk data. One per OSD. +- **DB LV / metadata device** - the LV holding the OSD's rocksdb (`block.db`). One per OSD; multiple OSDs can share the same metadata device. + +## User story + +A disk corresponding to `osd.5` fails on a node where five HDD OSDs share one NVMe metadata device. The user marks `osd.5` for replacement on the CephCluster CR, swaps the physical disk in the chassis, and walks away. Rook destroys `osd.5`, frees its DB LV slot on the NVMe, provisions a new OSD on the replacement disk *with the same OSD ID 5*, and the other four OSDs on the same NVMe stay up the whole time. 
+ +## Constraints + +Two facts about the environment shape every later choice in this design. + +### Rook cannot tell a replacement disk from a new disk + +When a fresh empty disk appears on a node, Rook gets no signal — from the kernel, from Ceph, from the disk itself — that says "I am the replacement for the OSD that just failed". The next CephCluster reconcile calls `startProvisioningOverNodes`, which spawns a prepare-job on each node. With `useAllDevices: true` (or a matching `deviceFilter`) the prepare-job auto-provisions a new OSD on the empty disk with a fresh ID; orphan resources for the failed OSD stay leaked. + +This is why the user must declare the intent first via `spec.storage.replaceOSD`, *before* swapping the disk. Swapping the disk first is unsafe: any reconcile trigger between the swap and the CR edit will auto-provision the new disk with a fresh ID, defeating the flow. + +### Storage device config must tolerate device swap + +Rook lets users identify OSD data devices in three ways via `spec.storage`: + +- `useAllDevices: true` — match any empty disk on the node. +- `deviceFilter: ""` — match disks whose `lsblk` properties match a regex (e.g., model, vendor). +- `nodes[].devices[].name: ""` — match a specific path or name. The value can be a kernel name (`vdb`), a raw path (`/dev/sdc`), or a udev symlink path (`/dev/disk/by-path/...`, `/dev/disk/by-id/...`). + +Each shape interacts differently with the Linux device-naming interfaces. The relevant guarantees: + +- **Kernel names** (`vdb`, `sdc`, `/dev/sdc`) are not persistent across reboot, hot-swap, or HBA topology changes — SCSI/SATA enumeration is allocation-order based. See [Arch Wiki: Persistent block device naming](https://wiki.archlinux.org/title/Persistent_block_device_naming). 
[Ceph's own admin docs](https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/#replacing-an-osd) use raw paths like `/dev/sdX` in their replacement examples, but the manual procedure can be re-validated at each step; an automated flow has fewer recovery options if the name has shifted. +- **`/dev/disk/by-path/...`** is built by udev rules from the sysfs port path. Same physical port → same `by-path` symlink (guaranteed). Different port → different `by-path`. So `by-path` survives a *same-slot* swap and breaks on a different-slot swap. Same-slot replacement is **not** a Rook or Ceph requirement: [Ceph upstream is silent on slot semantics](https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/#replacing-an-osd); cephadm's `ceph orch device replace` is slot-agnostic. +- **`/dev/disk/by-id/...`** identifies the disk by hardware serial / WWN. Different disk → different `by-id`. Useless for replacement (the new disk *is* a different disk). +- **`/dev/disk/by-uuid/...`** identifies the filesystem/LV UUID. The replacement disk has a fresh UUID after provisioning. Same as `by-id`: useless here. + +The shapes that tolerate any swap (same-slot or different-slot, any new disk) are `useAllDevices` and `deviceFilter` — both slot-and-disk-agnostic. `by-path` tolerates only same-slot replacement. Kernel names tolerate only the lucky case where the kernel happens to assign the same name. + +**Tension with per-device CR config.** Some legitimate Rook layouts *require* exact `name:` entries — notably per-device `config.metadataDevice` (`Documentation/CRDs/Cluster/ceph-cluster-crd.md:393-394`), which attaches a metadata-device pairing to a specific data device entry. There's no way to express "this data device pairs with that NVMe" via `useAllDevices` alone. Strictly rejecting all exact references in this flow would block these layouts. See "Multiple metadata devices on one node" in Out of scope. 
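The naming guarantees above can be summarized as a small classifier over exact `name:` references. This is a minimal sketch, not Rook code: `SwapTolerance`, its constants, and `classifyDeviceName` are hypothetical names, the sample device paths are illustrative, and treating every unrecognized string as a kernel name is a conservative assumption. `useAllDevices` and `deviceFilter` are not modeled because they are not `name:` references.

```go
package main

import (
	"fmt"
	"strings"
)

// SwapTolerance names which disk swaps an exact `name:` reference survives.
// The type and its constants are hypothetical, for illustration only.
type SwapTolerance string

const (
	SameSlotOnly   SwapTolerance = "same-slot-only"   // /dev/disk/by-path/...
	NeverMatches   SwapTolerance = "never-matches"    // by-id and by-uuid
	KernelNameLuck SwapTolerance = "kernel-name-luck" // vdb, sdc, /dev/sdc
)

// classifyDeviceName maps an exact device reference to its swap tolerance.
// Anything that is not a recognized udev symlink directory is treated as a
// kernel name, which is the least persistent category.
func classifyDeviceName(name string) SwapTolerance {
	switch {
	case strings.HasPrefix(name, "/dev/disk/by-path/"):
		return SameSlotOnly
	case strings.HasPrefix(name, "/dev/disk/by-id/"),
		strings.HasPrefix(name, "/dev/disk/by-uuid/"):
		return NeverMatches
	default:
		return KernelNameLuck
	}
}

func main() {
	// Sample references; the by-path and by-id values are made up.
	names := []string{
		"sdc",
		"/dev/sdc",
		"/dev/disk/by-path/pci-0000:00:1f.2-ata-1",
		"/dev/disk/by-id/wwn-0x5000c500a1b2c3d4",
	}
	for _, name := range names {
		fmt.Printf("%-45s %s\n", name, classifyDeviceName(name))
	}
}
```

A validation policy could then accept or reject a CR entry based on the returned category rather than on string matching scattered through the pre-checks.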
+ +The replacement flow's pre-check #5 enforces a validation policy — the default and configurability are open question U-10. The intent is that users with simple homogeneous-node setups (`useAllDevices` / `deviceFilter`) work transparently, while users on slot-stable hardware or per-device config can opt into a more permissive policy. + +## Current gaps + +Rook already has a same-device replacement flow for OSDs whose data and DB share one device. The user triggers it via `spec.storage.migration.confirmation` in the CR; the operator passes the OSD ID to the prepare-job pod via the `ROOK_REPLACE_OSD` env var, and the prepare-job calls `ceph-volume raw prepare --osd-id` to provision a new OSD reusing the destroyed slot. For the shared-metadata case, five gaps prevent that flow from working end-to-end: + +1. The replacement code path runs only in raw mode; LVM mode (required when a metadata device is configured) does not pass `--osd-id`, so the new OSD gets a new ID. (`initializeDevicesLVMMode`, [volume.go#L584-L844](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/volume.go#L584-L844)) +2. Destroy zaps only the data LV; the DB LV on the shared metadata disk stays as an orphan. (`DestroyOSD`, [remove.go#L244-L292](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/remove.go#L244-L292)) +3. The dm-crypt key in Ceph's config-key store is never removed, leading to LUKS collisions on retry of encrypted OSDs. (same `DestroyOSD` body — [remove.go#L244-L292](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/remove.go#L244-L292)) +4. Once any OSD is provisioned on a shared metadata disk, Rook's inventory excludes that disk from future discovery (the "has children" filter). (`DiscoverDevicesWithFilter`, [disk.go#L97-L111](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/clusterd/disk.go#L97-L111)) +5. 
`OSDInfo.MetadataPath` is never populated for LVM-mode OSDs (the parser walks only `[block]` entries from `ceph-volume lvm list`), so the operator has no record of which metadata disk a destroyed OSD used. (`GetCephVolumeLVMOSDs`, [volume.go#L1104-L1177](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/volume.go#L1104-L1177)) + +## Proposed flow + +This flow orchestrates [Ceph's documented OSD-replacement procedure](https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/#replacing-an-osd) (`safe-to-destroy` → `osd destroy` → `lvm zap` → `lvm prepare --osd-id` → `lvm activate`) across short-lived Kubernetes Jobs, with operator-side state for crash recovery and Rook-specific gates around auto-provisioning. cephadm — Ceph's container-orchestrator analogue — preserves OSD IDs by default ([cephadm OSD service docs](https://docs.ceph.com/en/latest/cephadm/services/osd/#replacing-an-osd)); this design follows the same convention. + +Two short-lived jobs — Destroy Job and Prepare Job — separated by the wait for the replacement disk. The operator owns all phase transitions and the wait; jobs are workers observed via `Job.status.succeeded`. + +>Why split into two jobs (vs. one job like the existing OSD migration flow)? +>- The disk-swap wait can take hours. Keeping a job pod alive across it is wasteful — the operator, not a job, should own the wait. +>- Destroy and prepare are independently retryable. If destroy succeeds and prepare fails, only prepare re-runs. + +One OSD per reconcile cycle, gated by `safe-to-destroy `. 
+
+### Sequence
+
+```mermaid
+sequenceDiagram
+    autonumber
+    actor User
+    participant CR as CephCluster CR
+    participant Op as Operator
+    participant Map as Replacement CM
+    participant OldPod as OSD pod 5 old
+    participant DJ as Destroy Job
+    participant PJ as Prepare Job
+    participant Ceph as Ceph
+    participant NewPod as OSD pod 5 new
+
+    User->>CR: set spec.storage.replaceOSD id=5
+    Op->>CR: read trigger
+    Op->>Ceph: ceph osd dump get fsid for osd.5
+    Op->>OldPod: read deployment env
+    Op->>Map: write phase=destroy-pending,<br/>layout
+    Op->>OldPod: delete deployment
+    Op->>OldPod: wait for pod termination
+    Op->>DJ: create
+    DJ->>Ceph: ceph osd destroy osd.5
+    DJ->>Ceph: ceph config-key exists, rm
+    DJ->>DJ: cryptsetup close db mapping
+    DJ->>DJ: lvremove db lv
+    DJ->>DJ: ceph-volume lvm zap data lv
+    Op->>DJ: observe Succeeded
+    Op->>Map: phase=prepare-pending,<br/>pending-db-lv-name
+    Note over User,Op: User swaps the failed disk<br/>(any time after CR edit)
+    Note over Op: Wait for replacement disk:<br/>- with rook-discover: watch local-device-NODE CM<br/>- without: spawn inventory Job, requeue 5m (U-9)
+    Op->>PJ: create
+    PJ->>PJ: lvcreate using persisted lv name
+    PJ->>Ceph: ceph-volume lvm prepare<br/>osd-id=5
+    Note over PJ: writes per-node status CM<br/>(rook-ceph-osd-NODE-status,<br/>existing CM, not Map)
+    Op->>PJ: observe Succeeded
+    Op->>Map: phase=deployment-pending
+    Op->>NewPod: create deployment with id 5
+    NewPod->>Ceph: lvm activate, join cluster
+    Op->>Map: phase=completed once Ready
+```
+
+### ConfigMaps and phase state
+
+Two ConfigMaps appear in the flow:
+
+1. **`osd-replacement-state`** — new in this design. Per-cluster, single-key. Lives in the operator namespace, owner-ref'd to the CephCluster (same lifecycle pattern as `osd-migration-config` at [`migrate.go#L42-L44`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/cluster/osd/migrate.go)). Created when validation persists the trigger (Step 3); transitioned through phases by the operator; deleted (or its single entry overwritten) when the user moves on to a different OSD or clears `replaceOSD`.
+2. **`rook-ceph-osd--status`** — existing per-node prepare-job output CM. The Prepare Job (Step 7) writes the new OSD's layout here; the existing reconcile path at [`status.go#L324`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/cluster/osd/status.go#L324) consumes it to create the daemon Deployment. This design does not change its shape or lifecycle.
+
+The replacement CM holds at most one entry per cluster, keyed by `osd-id`. Re-trigger of the same OSD (with a fresh `confirmation` string — see "Trigger already consumed" in Step 1 pre-checks) overwrites the entry's confirmation and resets phase. Trigger of a different OSD ID while one is in flight does not collide: the "no other replacement in progress" pre-check blocks it until the in-flight one completes. Collision on re-trigger is structurally impossible.
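The single-entry rule above can be read as a small decision function over the stored entry and the current spec. The sketch below is illustrative only: `Entry`, `Trigger`, `Decision`, and `resolveTrigger` are hypothetical names, and it assumes the confirmation string is stored with the entry at trigger time. It also assumes that any mismatch against an in-progress entry waits, which is one possible reading and does not model the timeout-recovery path.

```go
package main

import "fmt"

// Trigger mirrors spec.storage.replaceOSD (hypothetical type).
type Trigger struct {
	ID           int
	Confirmation string
}

// Entry is the single record held in the replacement ConfigMap
// (hypothetical type; only the fields needed here).
type Entry struct {
	ID           int
	Confirmation string
	Phase        string
}

// Decision is what the operator does with the spec on this reconcile.
type Decision string

const (
	Validate        Decision = "validate"           // run pre-checks, (re)write the entry
	ShortCircuit    Decision = "short-circuit"      // trigger already consumed or in flight
	WaitForInFlight Decision = "wait-for-in-flight" // another replacement owns the entry
)

// resolveTrigger compares the stored entry with the spec.
func resolveTrigger(entry *Entry, spec Trigger) Decision {
	if entry == nil {
		return Validate
	}
	if entry.ID == spec.ID && entry.Confirmation == spec.Confirmation {
		return ShortCircuit
	}
	if entry.Phase == "completed" {
		// Same OSD with a fresh confirmation, or a different OSD:
		// the completed entry is overwritten by the new trigger.
		return Validate
	}
	// Assumption: a mismatch against an in-progress entry never overwrites it.
	return WaitForInFlight
}

func main() {
	done := &Entry{ID: 5, Confirmation: "take-1", Phase: "completed"}
	busy := &Entry{ID: 5, Confirmation: "take-1", Phase: "prepare-pending"}

	fmt.Println(resolveTrigger(nil, Trigger{ID: 5, Confirmation: "take-1"}))
	fmt.Println(resolveTrigger(done, Trigger{ID: 5, Confirmation: "take-1"}))
	fmt.Println(resolveTrigger(done, Trigger{ID: 5, Confirmation: "take-2"}))
	fmt.Println(resolveTrigger(busy, Trigger{ID: 7, Confirmation: "other"}))
}
```

Because the entry is only ever overwritten from the `completed` phase, two triggers can never contend for the same record, which is the property the paragraph above calls structurally impossible.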
+ +**Phase state machine:** + +``` + ┌────────── (timeout) ──────────┐ + ▼ │ +(no entry) → destroy-pending → prepare-pending → deployment-pending → completed → (GC'd) + ▲ + └── (cancel via remove replaceOSD; only honored in waiting-for-disk substate) +``` + +- **destroy-pending**: operator deleted the OSD deployment, Destroy Job is in flight or about to start. +- **prepare-pending**: Destroy Job succeeded, two substates — *waiting for disk* (no Prepare Job yet, only `pending-db-lv-name` reserved) and *Job running* (`lvcreate` and `ceph-volume lvm prepare` in flight). +- **deployment-pending**: Prepare Job succeeded, operator is creating the new daemon Deployment. +- **completed**: new daemon Ready in Ceph; entry kept until the next spec change for audit, then GC'd. + +**Full example of the record:** + +```yaml +osd-id: 5 +node: node-1 # required for Destroy/Prepare Job NodeSelector; survives Step 4's deployment delete +phase: destroy-pending # destroy-pending → prepare-pending → deployment-pending → completed +data-lv: /dev/ceph-data-vg-5/osd-block-aaa... +db-lv: /dev/ceph-metadata-vg-1/osd-db-bbb... +metadata-source-device: nvme0n1 +metadata-vg: ceph-metadata-vg-1 +crush-device-class: hdd +database-size-mb: 4096 +encrypted: true +osd-fsid: 8b7e6c19-... 
+pending-db-lv-name: # populated when phase advances to prepare-pending +expected-disk-pending: false # set true while phase=prepare-pending; gates auto-provision skip per required change 6 +confirmation: # value from spec at trigger time; populated on phase=completed +new-fsid: # populated on phase=completed; for audit/diagnostics only, never for re-arming +completed-at: # populated on phase=completed +``` + +**Reconcile order on every cycle:** the OSD reconcile entry-point ([`Cluster.Start` in `osd.go#L255`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/cluster/osd/osd.go#L255)) gains a new first-step subroutine that runs **before** the existing `updateAndCreateOSDs` path: + +1. **GC** stale entries (rules in "Long-term state cleanup" below). +2. **Drive in-flight** entries forward via the state machine. +3. **Validate** any newly-set `spec.replaceOSD` and persist the entry on success. + +Only after this returns does `updateAndCreateOSDs` run. This ordering prevents auto-provisioning from racing a fresh trigger when the operator restarts after a CR edit (the "operator-down race"): a replacement disk inserted during operator downtime is held off via the `expected-disk-pending` flag in the record until the replacement Prepare Job claims it. + +### Step-by-step + +The walk-through uses a concrete example: + +``` +OSD ID: 5 +metadata VG: ceph-metadata-vg-1 +data device: /dev/sdc → /dev/sdh after swap +databaseSizeMB: 4096 +crush-device-class: hdd +encryption: on +``` + +#### Step 1 — User sets `replaceOSD` on the CR (diagram arrows 1-2) + +Typical trigger is a failed disk, but failure is not required — `safe-to-destroy` is the only gate, so the flow also covers proactive replacement of a healthy OSD. 
+ +```yaml +spec: + storage: + useAllNodes: true + useAllDevices: true # or use deviceFilter; an exact `name:` entry on osd.5's device would be rejected by pre-check #5 + replaceOSD: + id: 5 + confirmation: "yes-really-replace-osd-5" +``` + +`confirmation` is a free-form string the user picks. It does not encode the OSD ID; the example just embeds `5` for human clarity. To re-trigger replacement of the *same* OSD ID after a successful run, the user changes `confirmation` to a new string (e.g., `"yes-really-replace-osd-5-take-2"`). Same UX as `spec.storage.migration.confirmation` today. + +The user can swap the disk at any point after the edit succeeds — before, during, or after destroy. Step 5 tolerates a missing data PV. Only ordering rule: edit the CR first, then swap. + +If multiple OSDs need replacement, the user sets `replaceOSD`, waits for completion, then sets it again with a different ID. `replaceOSD` is an object, not a list — same shape as `spec.storage.migration` for consistency. Parallelism is open question U-2. + +**Pre-checks.** Each check runs on each reconcile when `spec.replaceOSD` is set. Possible outcomes per check: + +- **Continue** — advance to the next check. +- **Short-circuit** — no action this reconcile (idempotency / in-flight). +- **Terminal-reject** — set `ReplacementRejected` condition + Kubernetes Event via `opcontroller.UpdateCondition` ([`conditions.go#L35`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/controller/conditions.go#L35)); user must change spec to recover. +- **Transient-wait** — set a `WaitingFor*` condition; re-evaluate next reconcile, no spec change needed. + +1. **Trigger already consumed.** Replacement CM has a `phase: completed` entry whose `(osd-id, confirmation)` match the spec. Match is on `(id, confirmation)` only — *not* `new-fsid`. (Re-using fsid would silently destroy an OSD that the user manually purged and recreated outside this flow.) → **Short-circuit.** +2. 
**Trigger already in flight.** Replacement CM has an in-progress entry (any phase before `completed`) whose `(osd-id, confirmation)` match the spec. → **Short-circuit**; the state machine drives the existing entry forward instead of re-validating. +3. **OSD 5 exists** in the OSD map. → **Terminal-reject** if absent (wrong ID, user edits spec). +4. **`safe-to-destroy 5`** returns OK. The only safety gate; `down`/`out` alone is not sufficient because data may not have replicated to peers. → **Transient-wait** (`WaitingForSafeToDestroy`) while peers backfill — verified on Ceph v19.2.2 in [`osd-rep-log.md`](osd-rep-log.md) §1.2 that `safe-to-destroy` returns EBUSY in this state. Bounded escalation timeout (default 1h; see U-4) flips to terminal `SafeToDestroyTimeout` — backfill stuck for 1h+ warrants paging. +5. **Failed OSD's CR matching is swap-tolerant** — evaluated per the validation policy (default `strict`; see U-10 for the policy and configurability discussion): + - **`strict`** — reject if the failed OSD is matched by *any* exact `name:` entry in `spec.storage.nodes[*].devices[*]` on the OSD's node. The CR must match the failed OSD via `useAllDevices` or `deviceFilter`. Implementation: look up the failed OSD's data device from its deployment; scan the CR's `name:` entries on that node; reject if any resolves to that device. + - **`accept-by-path`** — reject only kernel-name-style references (`vdb`, `sdc`, `/dev/sdc`); accept `/dev/disk/by-path/...` references. The user takes responsibility for performing same-slot replacement. + - **`lenient`** — accept any CR shape. Mismatches surface as a Step 6 stall (`ReplacementDiskMissing` after the U-4 timeout). + + → **Terminal-reject** if the chosen policy rejects (spec must be made swap-tolerant before this flow can run). +6. **No unexpected OSD on the node** — catches the auto-provisioning race (a replacement disk was inserted before this trigger fired and Rook auto-provisioned a new OSD on it). 
Compare: + - `ceph osd metadata` filtered by hostname (already used by [`clusterdisruption/osd.go#L450`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/disruption/clusterdisruption/osd.go#L450)), + - vs. OSD Deployments owned by Rook on this node (`app=rook-ceph-osd`, filtered by `NodeSelector[k8sutil.LabelHostname()]`). + + Any OSD Ceph reports with no matching Rook Deployment is unexpected. → **Terminal-reject** (user removes the orphan before re-triggering). +7. **No other replacement is in progress** (different `osd-id`). → **Transient-wait** (`WaitingForInFlightReplacement`); self-clearing once the in-flight entry reaches `completed` and is GC'd. + +#### Step 2 — Capture layout + +The operator captures the OSD's layout from sources that do not require the failed data device. + +| Field | Source | Example | +| ------------------------- | -------------------------------------------------------------------------------------------- | ------------------------------------- | +| `osd-fsid` | `ceph osd dump --format json` | `8b7e6c19-...` | +| `osd-id` | OSD pod label `ceph-osd-id` | `5` | +| `node` | OSD deployment `Spec.Template.Spec.NodeSelector[k8sutil.LabelHostname()]` | `node-1` | +| `data-lv` | OSD deployment env `ROOK_BLOCK_PATH` | `/dev/ceph-data-vg-5/osd-block-aaa…` | +| `db-lv` | OSD deployment env `ROOK_METADATA_DEVICE` ¹ | `/dev/ceph-metadata-vg-1/osd-db-bbb…` | +| `metadata-source-device` | OSD deployment env `ROOK_METADATA_SOURCE_DEVICE` ² | `nvme0n1` | +| `crush-device-class` | OSD deployment env `ROOK_OSD_CRUSH_DEVICE_CLASS` | `hdd` | +| `metadata-vg` | `pvs --noheadings -o vg_name ` | `ceph-metadata-vg-1` | +| `database-size-mb` | `lvs --noheadings -o lv_size ` ÷ 1MiB ³ | `4096` | +| `encrypted` | LV tag `ceph.encrypted` on `` ³ | `true` | + +¹ Existing env, but populated only for raw-mode OSDs today. Required change #2 fixes the parser to populate it for LVM-mode OSDs as well. 
+² New env, added by required change #5. For OSDs whose deployment predates required change #5, this env is missing — see fallback below. +³ Read from the OSD's own DB LV (the metadata VG is by construction intact at Step 2: failure is on the data device, and Step 5's `lvremove` hasn't run yet). Live spec is *not* the source: a user-edited `spec.storage.config.databaseSizeMB` between original provisioning and replacement would size the new DB LV inconsistently with siblings, and `encrypted` is immutable per-OSD so a CR-level toggle cannot retroactively change it. If the OSD's own LV is missing for any reason, fall back to a surviving sibling LV in the same VG. + +**Fallback when `ROOK_METADATA_*` env vars are missing.** For deployments predating required change #5, the operator captures `db-lv` and `metadata-source-device` from a one-shot `ceph-volume lvm list --format json` Job on the OSD's node, via Rook's existing `cmdreporter` ([`cmdreporter.go`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/k8sutil/cmdreporter/cmdreporter.go) — same pattern used today for network/version detection). The pod profile mirrors the prepare-job's (privileged + `/dev`, `/run/lvm`, `/run/udev` mounts, NodeSelector pinned to the failed OSD's node). Output's `[db]` entry: `devices` field is the metadata source device, `tags.ceph.db_device` is the DB LV path. Correct even when the data device has physically failed — `ceph-volume lvm list` reads from VG metadata replicated on the metadata-VG's surviving PV. Verified empirically against the Lima cluster's output for a healthy shared-metadata OSD. If the Job fails or returns no entry for the target OSD, validation rejects with `LayoutCaptureFailed` (terminal — user investigates, e.g. metadata disk also failed → out of scope). + +#### Step 3 — Persist the replacement record (diagram arrow 5) + +Operator writes the replacement CM with `phase: destroy-pending` and the layout captured in Step 2. 
Field schema and lifecycle: see "ConfigMaps and phase state" above. From this point on, the record is the source of truth for retry — a crashed operator restarts and resumes from the persisted phase. + +#### Step 4 — Delete OSD deployment, wait for pod termination, create Destroy Job (diagram arrows 6-8) + +Operator calls `k8sutil.DeleteDeployment` ([`deployment.go#L388`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/k8sutil/deployment.go#L388)) on `rook-ceph-osd-5` — this deletes the deployment and polls until the pod is gone. Then it creates the Destroy Job populated with the layout. The pod-gone wait is required: while the daemon runs, it holds the DB-side LUKS mapping open and Step 5's `cryptsetup close` would fail. + +If the wait times out (transiently NotReady node), the operator sets `WaitingForOSDPodTermination` and re-checks on the next reconcile. The operator does NOT force-delete: a stuck pod on a NotReady node may still be holding the LUKS mapping when kubelet recovers; force-delete would diverge K8s and host state. + +**Host permanently down — out of scope.** If the host is genuinely gone (powered off, hardware failure), this flow cannot proceed: the Destroy Job's NodeSelector pins it to that node, and even a force-deleted OSD pod doesn't bring the kubelet back. The Destroy Job stays Pending. Replacement of an OSD on a permanently-dead host is a different workflow (node decommission, then OSD-out-and-purge, then re-add the host with fresh OSDs) — handled by existing Rook flows, not this design. The operator surfaces this case via a `ReplacementHostUnavailable` event after both the pod-termination wait and a Destroy-Job-Pending wait expire. + +#### Step 5 — Destroy Job (diagram arrows 9-13) + +Operator-owned phase stays `destroy-pending` until the Job reports `Succeeded`. 
The Job's container invokes `DestroyOSD` ([`remove.go#L244-L292`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/remove.go#L244-L292)) — the same Go function the existing migration flow already calls from [`cmd/rook/ceph/osd.go#L272`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/cmd/rook/ceph/osd.go). The bash below specifies the behavior `DestroyOSD` must implement after required change #3 lands (today it only does the first step and a partial last step). Each operation is idempotent on retry; no standalone shell script ships in the operator. + +```bash +# 5.1 Destroy in Ceph (preserves OSD ID 5 for reuse). +ceph osd destroy osd.5 --yes-i-really-mean-it # idempotent: already-destroyed → succeeds + +# 5.2 Remove dm-crypt key. On Ceph v19.2.2 (verified) `ceph osd destroy` already +# cleans the key and `config-key rm` on a missing key is itself idempotent +# (returns 0), so this whole step is typically a no-op. The explicit `exists` +# precheck is defensive: keeps the chain safe on older Ceph versions where +# rm's exit-code behavior on missing key has not been measured. +ceph config-key exists dm-crypt/osd/8b7e6c19-.../luks \ + && ceph config-key rm dm-crypt/osd/8b7e6c19-.../luks + +# 5.3 Close DB-side LUKS mapping. The cryptsetup arg is the device-mapper name, +# not the LUKS UUID. Enumerate children with TYPE explicit and pick the crypt +# child specifically — robust against future LV-stack shapes (snapshots, +# thin pools) that could produce additional non-crypt children. +# Precheck pattern (no || true): if the mapping is gone, do nothing; if it's +# present and close fails (busy device), the error bubbles up and the state +# machine retries. +DB_MAPPING=$(lsblk -nlo NAME,TYPE /dev/ceph-metadata-vg-1/osd-db-bbb... | awk '$2=="crypt"{print $1; exit}') +[ -n "$DB_MAPPING" ] && cryptsetup status "$DB_MAPPING" >/dev/null 2>&1 \ + && cryptsetup close "$DB_MAPPING" + +# 5.4 Free the DB slot. 
Precheck (no || true): real lvremove failures bubble up +# and the state machine retries. +lvs /dev/ceph-metadata-vg-1/osd-db-bbb... >/dev/null 2>&1 \ + && lvremove -f /dev/ceph-metadata-vg-1/osd-db-bbb... + +# 5.5 Zap the data LV (also handles the data-side dm-crypt mapping). +# Precheck mirrors 5.4: skip if the LV no longer exists. Real failures +# (zap returns non-zero with the LV present — partial wipe, busy device) +# bubble up via Job exit and are retried by the state machine. +lvs /dev/ceph-data-vg-5/osd-block-aaa... >/dev/null 2>&1 \ + && ceph-volume lvm zap /dev/ceph-data-vg-5/osd-block-aaa... --destroy +``` + +After the Job completes successfully, the operator advances the record to `phase: prepare-pending` and proceeds to Step 6. + +#### Step 6 — Pre-allocate DB LV name and wait for replacement disk (diagram arrows 14-15) + +Operator generates a fresh uuid for the new DB LV and persists it in the record (`pending-db-lv-name`) before Step 7.1's `lvcreate` runs. On retry, the same name is reused — no orphan DB LVs from retries. + +The operator then waits for the replacement disk to appear on the node. The operator pod has no `/dev` access; the existing prepare-job spawn (which would otherwise inventory the node) is *suppressed* for this node by change #6's `expected-disk-pending` flag — without that suppression, it would auto-provision the new disk with a fresh ID. So inventory needs a path that doesn't provision: + +- **If `rook-discover` is enabled:** operator watches the per-node `local-device-` CM. Reconcile is triggered on CM update via the hotplug-CM watch (`controller.go:279`). Latency: seconds (rook-discover's udev monitor) up to its `ROOK_DISCOVER_DEVICES_INTERVAL` (default 60 min) for the polling fallback. 
- **If `rook-discover` is disabled** (the operator's default): the operator returns `Result{RequeueAfter: 5m}` from each reconcile while in `prepare-pending` waiting-for-disk, and spawns a one-shot `ceph-volume inventory --format json` Job via the existing `cmdreporter` pattern (same one used for Step 2's older-OSD fallback). The Job runs node-side, writes its output to a result CM, and the operator reads it on the next reconcile. Latency ≈ `RequeueAfter` interval (5m) + Job pod startup. + +The 5-min `RequeueAfter` interval is a working default, not a load-bearing decision — see open question U-9. The wait blocks only this OSD's flow; other OSD reconcile work proceeds normally. + +While waiting, the operator sets `WaitingForReplacementDisk` on the CephCluster status. Default timeout 24h (U-4). On timeout the condition flips to `ReplacementDiskMissing` and polling stops. + +**Recovery from timeout — two paths:** + +1. **Insert the disk and bump `confirmation`** in the CR. Pre-checks re-run and the wait resumes. `pending-db-lv-name` is preserved across the cycle (Step 7.1's precheck handles the LV being either already-allocated or absent). +2. **Abandon** by removing `spec.storage.replaceOSD`. Per "Handling cancellation", removing the field in this substate is honored: the operator GCs the record; `osd.5` stays `destroyed` in the OSD map; user runs `ceph osd purge 5` manually if they want to remove the slot. + +#### Step 7 — Prepare Job (diagram arrows 16-18) + +Phase `prepare-pending`. The Job receives the record (including `pending-db-lv-name`) as env vars. + +```bash +# 7.1 Pre-allocate the DB LV using the persisted name. Idempotent on retry — +# if the LV already exists from a previous attempt, lvcreate is skipped. +lvs /dev/ceph-metadata-vg-1/osd-db-12cf3a91-... >/dev/null 2>&1 \ + || lvcreate -L 4096M -n osd-db-12cf3a91-... ceph-metadata-vg-1 --wipesignatures y + +# 7.2 Provision the new OSD with the preserved ID. 
+# --dmcrypt is conditional on the record's `encrypted` field; +# omitted for unencrypted OSDs. +ceph-volume lvm prepare \ + --bluestore [--dmcrypt] \ + --osd-id 5 \ + --data /dev/sdh \ + --block.db /dev/ceph-metadata-vg-1/osd-db-12cf3a91-... \ + --crush-device-class hdd +``` + +(The uuid in `osd-db-12cf3a91-...` is the operator-generated uuid from Step 6, not the OSD's fsid. ceph-volume assigns its own fsid during prepare and writes `ceph.osd_fsid` / `ceph.db_uuid` LV tags.) + +Prepare writes the new OSD's layout (data path, DB path, metadata source device) to the per-node status CM that Rook already uses to drive daemon creation. After the Job succeeds, the operator advances to `phase: deployment-pending`. + +#### Step 8 — Operator creates the new OSD deployment (diagram arrows 19-22) + +Reuses the existing reconcile path: `createOSDsForStatusMap` ([`status.go#L324`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/cluster/osd/status.go#L324)) sees the per-node status CM the Prepare Job wrote and creates the daemon Deployment from it. The new deployment carries `ROOK_METADATA_DEVICE` and `ROOK_METADATA_SOURCE_DEVICE` (no fallback `lvm list` job needed for a future replacement of this same OSD). + +#### Step 9 — Mark replacement complete (diagram arrow 23) + +Operator polls `ceph osd metadata `. Ready = a record returned with a non-empty fsid, `id` matching, and `hostname` matching the record's `node`. This single check covers both the up-in-Ceph signal and the new-fsid capture; `ceph osd metadata` is the source of truth, not K8s readiness-probe semantics. + +On Ready, the operator transitions the replacement CM entry from `phase: deployment-pending` to `phase: completed` and records `confirmation`, `new-fsid`, and `completed-at`. The entry is kept (not deleted) so the next reconcile sees the consumed trigger and short-circuits via pre-check #1. 
Same UX as `spec.storage.migration` today: the operator never mutates `spec.storage.replaceOSD`; the user clears the field manually when they want to move on. + +If the new OSD does not reach Ready, the record stays in its in-progress phase and the next reconcile resumes from there. + +### Idempotency / resume table + +| Phase on disk | Recovery on next reconcile | +| ----------------------------- | ------------------------------------------------------------------------------------------------------------------------- | +| no record | Validation re-evaluated. No destructive action taken yet. | +| `destroy-pending`, no Destroy Job exists | Operator re-issues the deployment delete (idempotent), waits for pod termination, creates the Destroy Job. | +| `destroy-pending`, Destroy Job in flight | Operator awaits Job; on retry, recreates the Job. All commands in Step 5 are idempotent via precheck patterns. | +| `prepare-pending`, no Prepare Job yet | Operator polls for replacement disk; once visible, creates Prepare Job. Same `pending-db-lv-name` reused — no new orphan. | +| `prepare-pending`, Prepare Job in flight | Operator awaits Job; on retry, recreates it. `lvcreate` skipped if LV exists (7.1 precheck). `lvm prepare --osd-id` reuses the destroyed slot. | +| `deployment-pending` | Existing per-node OSD-status reconcile creates the deployment. | +| `completed` (with consumed `confirmation` + `new-fsid`) | Flow done. Pre-check #1 (trigger already consumed) short-circuits subsequent reconciles until spec moves on; entry then GC'd per "Long-term state cleanup". | + +> **⚠️ Destroy is irreversible.** Once pre-checks pass and the operator persists the record (Step 3), `osd.5` will be destroyed on this reconcile cycle. There is no "are you sure?" preview surfacing the captured layout. If the user typed the wrong OSD ID, the wrong OSD is gone — recovery is via the cancellation table below, not by retracting the trigger. 
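
The resume table above can be read as a pure function from the persisted phase and the observed Job state to the operator's next action. A minimal sketch follows, assuming a hypothetical `next_action` helper, an illustrative Job-state vocabulary (`none`, `running`, `failed`, `succeeded`), and made-up action names; none of these are Rook identifiers. The `succeeded` transition out of `destroy-pending` is inferred from the phase order rather than stated in the table.

```bash
#!/usr/bin/env bash
# Hypothetical helper: maps (phase, job_state) to the next operator action,
# following the idempotency / resume table. Action names are illustrative.
next_action() {
  local phase="$1" job_state="${2:-none}"
  case "$phase" in
    "")
      # No record persisted yet: nothing destructive has happened.
      echo "validate" ;;
    destroy-pending)
      case "$job_state" in
        none)      echo "delete-deployment-then-create-destroy-job" ;;
        running)   echo "await-destroy-job" ;;
        failed)    echo "recreate-destroy-job" ;;
        succeeded) echo "advance-to-prepare-pending" ;;  # assumption: implied by phase order
        *)         return 1 ;;
      esac ;;
    prepare-pending)
      case "$job_state" in
        none)      echo "poll-for-disk-then-create-prepare-job" ;;
        running)   echo "await-prepare-job" ;;
        failed)    echo "recreate-prepare-job" ;;        # same pending-db-lv-name is reused
        succeeded) echo "advance-to-deployment-pending" ;;
        *)         return 1 ;;
      esac ;;
    deployment-pending)
      echo "create-deployment-from-status-cm" ;;
    completed)
      # Trigger already consumed: later reconciles do nothing for this entry.
      echo "short-circuit" ;;
    *)
      echo "unknown phase: $phase" >&2
      return 1 ;;
  esac
}
```

Because every branch depends only on persisted state and on what the operator can observe, an operator restart at any point would re-derive the same action, which is the property the table is meant to guarantee.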
+ +### Long-term state cleanup + +GC runs first on every reconcile cycle (see "Reconcile order" in "ConfigMaps and phase state"). It acts only on entries that are **not** in an in-progress phase — for in-progress entries, the cancellation table below governs, and GC removes one only where that table honors the cancel. This precedence prevents the user-changes-`replaceOSD.id`-mid-flight failure mode where mid-flight osd.5 would be destroyed, then GC'd, and left stuck `destroyed` with no replacement. + +GC rules: + +| Spec state | Entry phase | Action | +|---|---|---| +| `replaceOSD` unset | `completed` | GC the completed entry. | +| `replaceOSD` unset | `prepare-pending` (waiting-for-disk only) | GC per cancellation table below. | +| `replaceOSD` unset | any other in-progress phase | No action; cancellation table governs. | +| `replaceOSD` set; `(id, confirmation)` differ from a `completed` entry | `completed` | GC; treat spec as fresh trigger. | +| `replaceOSD` set; mismatch on in-progress entry | any in-progress phase | No action; in-flight flow runs to completion, then GC fires next cycle. | + +If the user changes `spec.storage.replaceOSD.id` from 5 to 7 mid-flight: osd.5's flow runs to `completed`; on the next reconcile, GC removes the entry (its `osd-id=5` ≠ `spec.storage.replaceOSD.id=7`); pre-checks then run for osd.7. The spec change is effectively queued. + +### Handling cancellation + +`replaceOSD` is a mutable spec field. Removing it (or changing its ID mid-flight) is a "cancel" intent. The operator's response depends on phase: + +| Phase | Cancel honored? | Effect | +|---|---|---| +| Pre-Step 3 (validation in flight, no record persisted yet) | Yes | Operator detects field is gone on next reconcile and stops. No state to unwind. | +| `destroy-pending` (Destroy Job in flight or about to start) | No | State record drives the flow forward. Destroy is short-lived; cancel is a no-op. 
| +| `prepare-pending`, waiting-for-disk (destroy complete; only `pending-db-lv-name` reserved, no `lvcreate` yet) | Yes | Operator GCs the record. `osd.5` stays `destroyed`; user runs `ceph osd purge 5` to remove the slot. **No orphan LV** (Step 6 only reserves the *name*; `lvcreate` runs in Step 7.1). **ID-preserving retry of osd.5 is unavailable after this cancel** — the original Deployment is gone (Step 4) and data + DB LVs are wiped (Step 5), so a future `replaceOSD: {id: 5}` trigger has no layout to capture in Step 2 and aborts with `LayoutCaptureFailed`. To re-add an OSD here, the user accepts a fresh ID. | +| `prepare-pending`, Prepare-Job-running (`lvcreate` may have run; `ceph-volume lvm prepare` may have started LUKS-formatting) | Only on Job failure | `ceph-volume lvm prepare` cannot be safely interrupted mid-call (partial dm-crypt + half-LUKS LV). Operator records the cancel intent and acts at Job exit: **on Job failure**, GC the record; the partially-allocated DB LV is left as a named orphan (`pending-db-lv-name`, easy to `lvremove`); osd.5 stays `destroyed`. **On Job success**, cancel is **not** honored — the new OSD is provisioned and joins the cluster. Removing the just-provisioned OSD is an `out`+`purge` workflow, not a rollback of this flow. | +| `deployment-pending` or `completed` | No | New OSD is already provisioned. The failed disk is replaced; cancel makes no sense. | + +## Required code changes + +Six changes. Items 1–3 and 5 are independent bug fixes worth landing regardless of this design; 4 and 6 are the new replacement flow. + +| # | Fix | File / lines | +|---|----|----| +| 1 | Inventory must include shared metadata disks. See "Change #1 details" below. | [disk.go#L97-L111](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/clusterd/disk.go#L97-L111) | +| 2 | Populate `OSDInfo.MetadataPath` and a new `OSDInfo.MetadataDevice` field for LVM-mode OSDs. 
The data is in `ceph-volume lvm list --format json`'s `[db]` section (LV `path` and source `devices`); the parser today walks only `[block]` entries. Forward-compat across rolling upgrade: `OSDInfo` uses standard `encoding/json` struct tags ([`osd.go#L113-L136`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/cluster/osd/osd.go#L113-L136)) with no `DisallowUnknownFields` policy. An old operator decoding a new prepare-job's status CM silently drops the new field. A new operator decoding an old CM gets the zero value (empty string), and Step 2's `lvm list` fallback handles that case. | [volume.go#L1104-L1177](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/volume.go#L1104-L1177) | +| 3 | `DestroyOSD` cleans up the DB LV and the dm-crypt config-key. Add `ceph config-key exists+rm`, `cryptsetup close `, and `lvremove -f ` (gated on `osdInfo.MetadataPath != ""`). Use precheck patterns (see Step 5) so genuine failures bubble up while already-clean state is tolerated. | [remove.go#L244-L292](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/remove.go#L244-L292) | +| 4 | Wire LVM-mode replacement through `lvm prepare --osd-id`. Today only raw mode at [volume.go#L548-L555](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/volume.go#L548-L555) adds `--osd-id`. When `a.replaceOSD != nil` *and* a metadata device is set, pre-allocate the DB LV with `lvcreate` (using the operator-persisted name) and call `lvm prepare --osd-id` instead of `lvm batch`. 
**Why this primitive over `purge` + `lvm batch`:** `lvm prepare --osd-id` claims a destroyed slot atomically (race-safe; no implicit reuse via mon's lowest-free allocation policy) and matches the existing same-device replacement flow at [volume.go#L548-L555](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/volume.go#L548-L555). Alternatives tested and discussed in [`osd-id-reuse-analysis.md`](osd-id-reuse-analysis.md) (U-7). | [volume.go](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/volume.go) | +| 5 | Pass `OSDInfo.MetadataDevice` to the OSD daemon deployment as a new `ROOK_METADATA_SOURCE_DEVICE` env var. Future destroys read the metadata layout from the deployment without a node-side rescan. | [spec.go#L950-L1010](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/cluster/osd/spec.go#L950-L1010) | +| 6 | New `osd-replacement-state` ConfigMap, Destroy/Prepare Job split, reconcile-order pinning. See "Change #6 details" below. | [migrate.go](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/cluster/osd/migrate.go), [osd.go](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/cluster/osd/osd.go) | + +#### Change #1 details — inventory bypass for Ceph-tagged shared metadata disks + +Today [`disk.go#L97-L111`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/clusterd/disk.go#L97-L111) calls `sys.ListDevicesChild` (lsblk-based) and skips any disk where `len(children) > 1`. Once any OSD's DB LV lands on a shared metadata disk, this filter incorrectly excludes the disk from inventory, so future OSDs can't get DB LV slots on it. 
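
A minimal sketch of the tag-based check this change proposes in place of the unconditional skip. It assumes two hypothetical helpers, `has_cluster_tag` and `disk_is_available`, and that the caller pipes in `lvs --noheadings -o lv_name,vg_name,lv_tags` output already restricted to the disk's children (how that restriction is done is left open here). The real change would live in Go inside `disk.go`; this only illustrates the decision.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds if any LV line on stdin carries the
# ceph.cluster_fsid tag for the given cluster FSID.
# Expected stdin: whitespace-separated "lv_name vg_name lv_tags" lines,
# with lv_tags comma-separated as lvs prints them.
has_cluster_tag() {
  local fsid="$1"
  awk -v want="ceph.cluster_fsid=${fsid}" '
    {
      n = split($3, tags, ",")
      for (i = 1; i <= n; i++) if (tags[i] == want) found = 1
    }
    END { exit(found ? 0 : 1) }
  '
}

# Hypothetical helper: decides whether inventory keeps a disk.
#   $1  number of lsblk children of the disk
#   $2  this cluster's FSID
#   stdin: lvs output for those children (only read when $1 > 1)
disk_is_available() {
  local children="$1" fsid="$2"
  if [ "$children" -le 1 ]; then
    return 0   # passes today's filter unchanged
  fi
  # More than one child: keep the disk only if a child LV is tagged as ours.
  has_cluster_tag "$fsid"
}
```

Exact tag comparison (rather than a substring match) keeps a different cluster's FSID with a shared prefix from matching. If `lvs` fails and produces no output, the check returns non-zero and the disk is skipped, which corresponds to the conservative fallback; logging is omitted from the sketch.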
+ +**Algorithm:** when `len(children) > 1`, run `lvs --noheadings -o lv_name,vg_name,lv_tags` filtered to those children and check for the `ceph.cluster_fsid=` LV tag — the same authoritative signal Rook already uses elsewhere ([volume.go#L85-L90](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/volume.go#L85-L90), [volume.go#L1130-L1135](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/volume.go#L1130-L1135)). If any child carries this cluster's FSID, treat the disk as available. + +**Edge cases:** + +- Mixed Ceph-tagged and untagged children → still bypass. Ceph LVs identify the disk as ours; untagged LVs would block c-v from using their PE anyway — no Rook-side issue. +- `lvs` returns error / EBUSY → conservative: fall back to today's skip behavior, log. +- No tagged children, `len > 1` → skip (today's behavior; foreign LVM or partition table). + +VG/LV name patterns are convention, not guarantee; tags are. Cost: one `lvs` per filtered disk (only when `len > 1`); Rook already shells out to `lvs` / `pvs` during inventory. + +#### Change #6 details — replacement state machine and reconcile-order pinning + +Adds a new ConfigMap `osd-replacement-state` (separate from the existing `osd-migration-config` to avoid breaking that flow's int-keyed reader). Schema, lifecycle, and phase state machine: see "ConfigMaps and phase state" above. Persists `pending-db-lv-name` so Prepare Job retry doesn't orphan DB LVs. Splits the single prepare-job model into a Destroy Job + Prepare Job (motivated by the disk-swap wait — see "Proposed flow" intro). + +**Reconcile-order pinning.** The OSD reconcile entry-point [`Cluster.Start`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/cluster/osd/osd.go#L255) gets a new first-step subroutine that runs **before** the existing `updateAndCreateOSDs` path. 
The subroutine reads `spec.storage.replaceOSD` and the replacement CM, then runs three steps in order: GC, driving in-flight entries, and validating the new spec. Only then does `updateAndCreateOSDs` run. + +The `expected-disk-pending: true` flag on the record (set while phase=`prepare-pending`) is what prevents the auto-provisioning race: in `updateAndCreateOSDs`, the prepare-job spawn for a node with an `expected-disk-pending` entry is skipped — the empty replacement disk on that node is held off until the replacement Prepare Job claims it. + +**Implementation pattern reuse:** + +- Replacement CM lifecycle: same shape as `osd-migration-config` ([`migrate.go#L42-L44`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/cluster/osd/migrate.go) — per-cluster, owner-ref'd to CephCluster, written via `k8sutil.CreateOrUpdateConfigMap`). +- Destroy Job pod profile: clones `c.provisionPodTemplateSpec` ([`provision_spec.go`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/cluster/osd/provision_spec.go)) — same image, mounts, NodeSelector, RBAC. +- Conditions: set via `opcontroller.UpdateCondition` ([`conditions.go#L35`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/controller/conditions.go#L35)). +- Bounded waits: `util.RetryWithTimeout` ([`retry.go#L57`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/util/retry.go#L57)) — already used by OSD migration. +- Pod-deletion wait: `k8sutil.DeleteDeployment` ([`deployment.go#L388`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/k8sutil/deployment.go#L388)). +- Device-name validation (raw kernel-name rejection): extend `c.validateOSDSettings` ([`osd.go#L189`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/cluster/osd/osd.go#L189)). 
+ +## Out of scope + +### Multiple metadata devices on one node — works conditionally + +Rook supports per-device metadata-device pairing (`Documentation/CRDs/Cluster/ceph-cluster-crd.md:393-394`): + +```yaml +nodes: +- name: "node-1" + devices: + - name: "/dev/disk/by-path/...sda" + config: { metadataDevice: "nvme0n1" } + - name: "/dev/disk/by-path/...sdb" + config: { metadataDevice: "nvme0n1" } + - name: "/dev/disk/by-path/...sdc" + config: { metadataDevice: "nvme1n1" } # different metadata device on the same node +``` + +This layout requires exact `name:` references — the per-device `config:` block can only be attached to a specific device entry, not to a regex match. Replacement of a single OSD on this layout works structurally (each OSD's `metadata-source-device` is captured in its record at destroy time), with two caveats: + +- **Validation policy must permit exact `name:`** entries — pre-check #5's `accept-by-path` or `lenient` mode (U-10). The default `strict` mode rejects this layout from the flow. +- **Same-slot replacement is required** — `by-path` resolves only when the new disk is in the original slot. Different-slot replacement stalls at Step 6. + +The broader multi-metadata-device feature work (improvements to per-device `metadataDevice` UX, multi-NVMe-per-node layouts) was scoped separately by maintainers in [#13240](https://github.com/rook/rook/issues/13240) (tracked by `zhucan`). This design does not add new logic for that layout — it just doesn't actively forbid replacement on it under permissive validation. + +### PVC-based OSD replacement — separate design + +PVC-backed OSDs use a different code path (raw mode via `GetCephVolumeRawOSDs`, separate destroy plumbing). Issue #13240 is host-based storage; PVC replacement is a separate design. + +### Permanently-down host — different workflow + +If the OSD's host is gone, this flow cannot proceed (Step 4 / Step 5 require the host). Existing Rook node-decommission + OSD-purge flow handles it. 
+ +## Open questions + +- **U-1 — Trigger surface.** *Decision: CR field `spec.storage.replaceOSD: {id, confirmation}`.* Matches the existing `spec.storage.migration` precedent (same UX — user clears the field manually after success), keeps trigger state in the CR, no separate object lifecycle. Alternatives below are listed for review discussion. + - Annotation on the OSD deployment (`rook-ceph.io/replace-osd: ""`). Operator removes the annotation on success — no spec mutation. Rejected: the OSD's deployment is *deleted* in Step 4 before destroy, so an annotation on it disappears mid-flight (state would have to migrate to the state-record CM anyway). + - New `CephOSDReplacement` CRD, one short-lived resource per intent, deleted on success. Rejected: new CRD is a heavier API surface for a feature that already fits the existing `spec.storage` shape; consistency with `spec.storage.migration` is more valuable than per-intent isolation. + +- **U-2 — Parallelism.** Issue #13240 names multi-disk-failure on a chassis as a real operational pain — replacing 4 disks means 4 sequential edits, each blocking on disk-swap wait + reconcile cadence (potentially hours per OSD). This design stays serial because `safe-to-destroy` and `lvm prepare --osd-id` are both naturally one-at-a-time. Two follow-up paths that don't break the serial-execution invariant: + - (a) **Widen the trigger surface to a list** (`replaceOSDs: [{id, confirmation}, ...]`) so the user records all intents upfront and the operator processes them in sequence without per-OSD CR edits. Cheap; removes most of the user-visible pain. + - (b) **N-per-reconcile execution** via N parallel Destroy/Prepare Jobs each running `lvm prepare --osd-id `, gated on cluster health and per-OSD `safe-to-destroy`. Bigger; needs careful PG-safety rules. 
The obvious-looking `lvm batch --osd-ids X Y Z --prepare` primitive does **not** work for shared-metadata setups (rejects the metadata VG outright — see U-7 / [`osd-id-reuse-analysis.md`](osd-id-reuse-analysis.md) Path E); (b) must use N parallel `lvm prepare --osd-id` invocations, not a single `lvm batch` call. + +- **U-3 — Auto-replace mode.** This design requires explicit user input. A follow-up could add an opt-in "auto-replace on disk swap" mode (e.g. `spec.storage.autoReplaceOSDs: true`): when an OSD is `down_in` and a new empty disk appears in its CR-managed slot, the operator runs the same flow without an explicit trigger. Extra checks (cluster health, PG state) would gate it. Deferred. + +- **U-4 — Configurability of the two timeouts.** Two distinct timers, different physical phenomena, different SLAs: + - `replacement-disk` wait (default 24h): time to swap a failed disk; covers walk-away-and-handle-tomorrow workflows. + - `safe-to-destroy` retry timeout (default 1h): time backfill is allowed to take after the user triggers replacement on a still-recovering OSD. 24h here would mask a stuck backfill and warrants paging. + Per-replacement override (`spec.storage.replaceOSD.timeoutSeconds`) handles different chassis-swap SLAs but adds API surface; operator-global only is simpler. + +- **U-5 — Faster wake on disk-swap.** With `rook-discover` enabled, latency floor is its udev-event delivery (seconds) up to `ROOK_DISCOVER_DEVICES_INTERVAL` (60 min). Without it, the wait re-checks every U-9 interval. Optional follow-up: treat udev "new disk" events on the node as reconcile triggers while a replacement is in progress — push from `rook-discover`, or a small sidecar deployed only while waiting. Optimization, not a correctness gap. 
+ +- **U-6 — State-store choice for the replacement record.** *Decision: ConfigMap `osd-replacement-state`.* Matches Rook's existing OSD-orchestration pattern (per-node status CMs, `osd-migration-config`); clean object lifecycle (create/delete); doesn't pollute CR `.status` with transient state-machine state (no precedent for that in Rook). Alternatives below are listed for review discussion. + - CR `.status.replaceOSD`. Pros: visible via `kubectl get cephcluster -o yaml`; integrates natively with Conditions; one fewer object lifecycle. Cons: mixing transient operational state with the CR's status block is unidiomatic in Rook; status updates may race with other status writers; no precedent in the codebase for state-machine state in `.status`. Conditions can still be used independently of where the record lives. + - Annotation on the CephCluster CR. Pros: simple, visible. Cons: limited to ~256KB total annotation size on the resource (shared with other consumers); awkward to update structurally; no precedent for state machines in annotations. + +- **U-7 — Approach for OSD-ID preservation: `destroy + prepare --osd-id` vs alternatives.** Full comparison in [`osd-id-reuse-analysis.md`](osd-id-reuse-analysis.md). Summary: + + - **Chosen:** `ceph osd destroy` + `lvm prepare --osd-id` + operator pre-allocates DB LV. This is [Ceph's documented replacement procedure](https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/#replacing-an-osd) — the design just orchestrates it across pods. Verified end-to-end on Ceph v19.2.2 (`osd-rep-log.md`). Race-safe: the destroyed slot can't be claimed by another OSD between destroy and prepare. Matches Rook's existing same-device replacement flow ([volume.go#L548-L555](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/volume.go#L548-L555)). 
+ - **Alternative — `purge` + `lvm prepare` without `--osd-id`** (used by SAP runbook + GH issue #13240 comment 3193842038): the slot is freed by `purge`, mon allocates lowest free which happens to be the just-freed ID. Implicit reuse; depends on mon allocation policy and a non-racy purge-prepare window. + - **Alternative — `purge` + `lvm batch --prepare`** (suggested by maintainers in #13240): same implicit-ID-reuse mechanism as above, with ceph-volume handling DB-LV allocation. Verified on Ceph v19.2.2 — works (`osd-id-reuse-analysis.md` Path C); shares the implicit-reuse race window with B. + - **Alternative — `destroy` + `lvm batch --osd-ids X --prepare`** (`lvm batch` does have `--osd-ids` plural): verified on Ceph v19.2.2 — does **not** work with shared metadata devices. ceph-volume's `--osd-ids` path rejects metadata VGs with existing free PE space ("1 fast devices were passed, but none are available"). Eliminated. + +- **U-9 — Wait-for-disk re-check pattern (interval and trigger).** Step 6's wait without `rook-discover` runs an inventory Job on each reconcile and re-queues. Two related questions: + - **Interval.** Default 5 min is a working starting point. Lower (1 min) cuts user-visible latency at the cost of more inventory-Job pods. Higher (15 min) is gentler. Likely cluster-config-tunable rather than hardcoded. + - **Self-requeue vs. user re-trigger.** Alternative: instead of `Result{RequeueAfter: ...}`, the operator could emit a `WaitingForReplacementDisk` event and require the user to bump `confirmation` once they've swapped the disk to nudge the next reconcile. Pro: no operator-side polling, no inventory-Job spam. Con: breaks the "set replaceOSD and walk away" UX from the user story; user has to come back. + - Proposed: `RequeueAfter` with a configurable interval (default 5 min). Decide during PR review. 
+ +- **U-10 — Device-matching validation policy for replacement.** Pre-check #5's policy is pluggable; the three policies (`strict`, `accept-by-path`, `lenient`) are defined in Step 1. Two questions for PR review: + + 1. **What should the default be?** + - `strict` is safest (pre-destroy rejection, clean UX), but adds adoption friction: existing kernel-name CRs and per-device `metadataDevice` layouts are blocked from this flow until migrated. [Ceph upstream](https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/#replacing-an-osd) is laxer (uses raw `/dev/sdX` in examples); Rook generally permits exact `name:` entries, so strict here is more conservative than the rest of the ecosystem. + - `accept-by-path` is the moderate middle. Kernel names rejected; by-path users can use the flow with same-slot discipline. + - `lenient` maximizes compatibility but defers diagnosis 24h into the flow (post-destroy stall). Recoverable but bad UX. + + 2. **Should the policy be configurable, and at what scope?** + - **Operator-global** (env var like `ROOK_OSD_REPLACEMENT_DEVICE_VALIDATION`) — one knob per cluster, easy to set; doesn't accommodate mixed CR shapes within a cluster. + - **Per-replacement** (`spec.storage.replaceOSD.deviceMatchingMode: `) — most flexible, additional CR API surface. + - **Hard-coded** with no override — simplest, no API surface, but no escape valve for users hitting the per-device-config tension or slot-stable hardware. + + - **Helper (orthogonal):** a one-shot tool that rewrites a user's `storage` spec to `useAllDevices` or `deviceFilter` would reduce strict-policy friction regardless of the default. + + No load-bearing recommendation; flagging for PR-review decision. 
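
To make the three candidate policies in U-10 concrete, the sketch below classifies a single CR device reference and applies a policy to it. The helpers `ref_kind` and `policy_allows` are hypothetical, and the policy semantics are one reading of the text, not the Step 1 definitions: `strict` admits only `useAllDevices` / `deviceFilter` matching (modelled here as an empty reference), `accept-by-path` additionally admits `/dev/disk/by-path/...`, and `lenient` admits everything.

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify one device reference from the CR.
# An empty string stands for "no exact name" (useAllDevices / deviceFilter).
ref_kind() {
  case "$1" in
    "")                                    echo "filter" ;;
    /dev/disk/by-path/*)                   echo "by-path" ;;
    /dev/disk/by-id/*|/dev/disk/by-uuid/*) echo "by-id" ;;
    *)                                     echo "kernel-name" ;;  # vdb, sdc, /dev/sdc
  esac
}

# Hypothetical helper: succeeds if the policy admits the reference.
#   $1  policy: strict | accept-by-path | lenient
#   $2  device reference as written in the CR ("" for filter-based matching)
policy_allows() {
  local policy="$1" kind
  kind="$(ref_kind "$2")"
  case "$policy" in
    strict)         [ "$kind" = "filter" ] ;;
    accept-by-path) [ "$kind" = "filter" ] || [ "$kind" = "by-path" ] ;;
    lenient)        true ;;
    *)              echo "unknown policy: $policy" >&2; return 2 ;;
  esac
}
```

Under this reading, a rejection from `strict` or `accept-by-path` happens at pre-check time, before anything is destroyed, whereas `lenient` always passes and leaves an unresolvable reference to surface later as a stalled wait for the replacement disk.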
+ +Coverage areas this design must validate (detailed scenarios in [`osd-test-scenarios.md`](osd-test-scenarios.md)): + +- **Happy path** on shared-metadata layouts: single OSD replaced while siblings stay up, both with and without `encryptedDevice: true`; multiple metadata devices on the same node (per-device config); same-device (raw-mode) regression. +- **Required-change validation**: new OSD deployment carries non-empty `ROOK_METADATA_DEVICE` / `ROOK_METADATA_SOURCE_DEVICE`; metadata VG with healthy siblings is now visible to inventory. +- **Crash recovery**: Destroy Job and Prepare Job killed mid-run; state-record-driven retry produces no orphan DB LVs across N retries. +- **Validation gates**: trigger after auto-provisioning is rejected; raw kernel-name device addressing is rejected before any destructive action. +- **Edge cases**: smaller replacement disk; pre-existing leaked DB LVs in the VG; encrypted-OSD dm-crypt key cleanup across Ceph versions. + +Manual verification on a Lima VM (2 simulated HDDs + 1 simulated NVMe with `databaseSizeMB: 1500`, dmcrypt on) before handoff to CI. From 559f29a363c2c2388b9b960a35e9fd45c2189ef6 Mon Sep 17 00:00:00 2001 From: Artem Torubarov Date: Tue, 5 May 2026 17:53:20 +0200 Subject: [PATCH 02/12] docs: osd replacement CRD schema Signed-off-by: Artem Torubarov --- osd-design.md | 553 +++++++++++++++++++++----------------------------- 1 file changed, 228 insertions(+), 325 deletions(-) diff --git a/osd-design.md b/osd-design.md index bc2b5b52938d..b9c7452f9bca 100644 --- a/osd-design.md +++ b/osd-design.md @@ -10,7 +10,7 @@ This design proposes a workflow to replace a single failed OSD in place — pres ## Notation -- **User** - the human cluster admin who edits the CR. +- **User** - the human cluster admin who edits the CR. - **Operator** - the Rook controller process. - **Data LV / data device** - the LV (or block device) holding an OSD's bulk data. One per OSD. 
- **DB LV / metadata device** - the LV holding the OSD's rocksdb (`block.db`). One per OSD; multiple OSDs can share the same metadata device. @@ -25,52 +25,45 @@ Two facts about the environment shape every later choice in this design. ### Rook cannot tell a replacement disk from a new disk -When a fresh empty disk appears on a node, Rook gets no signal — from the kernel, from Ceph, from the disk itself — that says "I am the replacement for the OSD that just failed". The next CephCluster reconcile calls `startProvisioningOverNodes`, which spawns a prepare-job on each node. With `useAllDevices: true` (or a matching `deviceFilter`) the prepare-job auto-provisions a new OSD on the empty disk with a fresh ID; orphan resources for the failed OSD stay leaked. +When a fresh empty disk appears on a node, Rook has no way to tell it's the replacement for a failed OSD. The next CephCluster reconcile calls `startProvisioningOverNodes`, which spawns a prepare-job on each node. With `useAllDevices: true` (or a matching `deviceFilter`) the prepare-job auto-provisions a new OSD on the empty disk with a fresh ID; orphan resources for the failed OSD stay leaked. -This is why the user must declare the intent first via `spec.storage.replaceOSD`, *before* swapping the disk. Swapping the disk first is unsafe: any reconcile trigger between the swap and the CR edit will auto-provision the new disk with a fresh ID, defeating the flow. +This is why the user must mark the OSD for replacement in the CR *before* swapping the disk. Otherwise, a reconcile triggered between the swap and the CR edit auto-provisions the new disk with a fresh ID instead of replacing osd.5. ### Storage device config must tolerate device swap -Rook lets users identify OSD data devices in three ways via `spec.storage`: +Rook lets users identify OSD data devices via `spec.storage`: - `useAllDevices: true` — match any empty disk on the node. 
-- `deviceFilter: ""` — match disks whose `lsblk` properties match a regex (e.g., model, vendor). -- `nodes[].devices[].name: ""` — match a specific path or name. The value can be a kernel name (`vdb`), a raw path (`/dev/sdc`), or a udev symlink path (`/dev/disk/by-path/...`, `/dev/disk/by-id/...`). +- `deviceFilter: ""` — match disks whose `lsblk` properties match a regex. +- `nodes[].devices[].name: ""` — match a specific path or name. Accepts a kernel name (`vdb`), a raw path (`/dev/sdc`), or a udev symlink (`/dev/disk/by-path/...`, `/dev/disk/by-id/...`). +- `nodes[].devices[].fullpath: ""` — explicit DevLinks match (`/dev/disk/by-id/...`, `/dev/disk/by-path/...`). Compared against discovered symlinks, not regex. -Each shape interacts differently with the Linux device-naming interfaces. The relevant guarantees: +Each shape interacts differently with the Linux device-naming interfaces: -- **Kernel names** (`vdb`, `sdc`, `/dev/sdc`) are not persistent across reboot, hot-swap, or HBA topology changes — SCSI/SATA enumeration is allocation-order based. See [Arch Wiki: Persistent block device naming](https://wiki.archlinux.org/title/Persistent_block_device_naming). [Ceph's own admin docs](https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/#replacing-an-osd) use raw paths like `/dev/sdX` in their replacement examples, but the manual procedure can be re-validated at each step; an automated flow has fewer recovery options if the name has shifted. -- **`/dev/disk/by-path/...`** is built by udev rules from the sysfs port path. Same physical port → same `by-path` symlink (guaranteed). Different port → different `by-path`. So `by-path` survives a *same-slot* swap and breaks on a different-slot swap. Same-slot replacement is **not** a Rook or Ceph requirement: [Ceph upstream is silent on slot semantics](https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/#replacing-an-osd); cephadm's `ceph orch device replace` is slot-agnostic. 
+- **Kernel names** (`vdb`, `sdc`, `/dev/sdc`) are not guaranteed to be persistent (see [Arch Wiki: Persistent block device naming](https://wiki.archlinux.org/title/Persistent_block_device_naming)). [Ceph's own admin docs](https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/#replacing-an-osd) use raw paths like `/dev/sdX` in their replacement examples, but the manual procedure can be re-validated at each step; an automated flow has fewer recovery options if the name has shifted. +- **`/dev/disk/by-path/...`** is built by udev rules from the sysfs port path. Same physical port → same `by-path` symlink. So `by-path` survives a *same-slot* swap and breaks on a different-slot swap. Same-slot replacement is **not** a Rook or Ceph requirement: [Ceph upstream is silent on slot semantics](https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/#replacing-an-osd); cephadm's `ceph orch device replace` is slot-agnostic. - **`/dev/disk/by-id/...`** identifies the disk by hardware serial / WWN. Different disk → different `by-id`. Useless for replacement (the new disk *is* a different disk). - **`/dev/disk/by-uuid/...`** identifies the filesystem/LV UUID. The replacement disk has a fresh UUID after provisioning. Same as `by-id`: useless here. -The shapes that tolerate any swap (same-slot or different-slot, any new disk) are `useAllDevices` and `deviceFilter` — both slot-and-disk-agnostic. `by-path` tolerates only same-slot replacement. Kernel names tolerate only the lucky case where the kernel happens to assign the same name. +The shapes that tolerate any swap (same-slot or different-slot, any new disk) are `useAllDevices` and `deviceFilter`. `by-path` tolerates only same-slot replacement. Kernel names tolerate only the lucky case where the kernel happens to assign the same name. `by-id`/`by-uuid` references in `name`/`fullpath` cannot work for a disk that hasn't been seen yet. 
-**Tension with per-device CR config.** Some legitimate Rook layouts *require* exact `name:` entries — notably per-device `config.metadataDevice` (`Documentation/CRDs/Cluster/ceph-cluster-crd.md:393-394`), which attaches a metadata-device pairing to a specific data device entry. There's no way to express "this data device pairs with that NVMe" via `useAllDevices` alone. Strictly rejecting all exact references in this flow would block these layouts. See "Multiple metadata devices on one node" in Out of scope. - -The replacement flow's pre-check #5 enforces a validation policy — the default and configurability are open question U-10. The intent is that users with simple homogeneous-node setups (`useAllDevices` / `deviceFilter`) work transparently, while users on slot-stable hardware or per-device config can opt into a more permissive policy. +The replacement flow must validate the affected OSD's CR references beforehand so the new disk is still resolvable under those references after the swap. ## Current gaps -Rook already has a same-device replacement flow for OSDs whose data and DB share one device. The user triggers it via `spec.storage.migration.confirmation` in the CR; the operator passes the OSD ID to the prepare-job pod via the `ROOK_REPLACE_OSD` env var, and the prepare-job calls `ceph-volume raw prepare --osd-id` to provision a new OSD reusing the destroyed slot. For the shared-metadata case, five gaps prevent that flow from working end-to-end: +Rook has no automated flow for replacing a failed OSD today. The closest existing primitive is the migration flow (`spec.storage.migration.confirmation`), which recreates OSDs in place after encryption or store-type spec changes: it destroys the OSD and re-prepares with `ceph-volume raw prepare --osd-id` via the `ROOK_REPLACE_OSD` env var. Migration only covers raw-mode OSDs; the shared-metadata case needs five additional fixes: 1. 
The replacement code path runs only in raw mode; LVM mode (required when a metadata device is configured) does not pass `--osd-id`, so the new OSD gets a new ID. (`initializeDevicesLVMMode`, [volume.go#L584-L844](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/volume.go#L584-L844)) 2. Destroy zaps only the data LV; the DB LV on the shared metadata disk stays as an orphan. (`DestroyOSD`, [remove.go#L244-L292](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/remove.go#L244-L292)) -3. The dm-crypt key in Ceph's config-key store is never removed, leading to LUKS collisions on retry of encrypted OSDs. (same `DestroyOSD` body — [remove.go#L244-L292](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/remove.go#L244-L292)) -4. Once any OSD is provisioned on a shared metadata disk, Rook's inventory excludes that disk from future discovery (the "has children" filter). (`DiscoverDevicesWithFilter`, [disk.go#L97-L111](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/clusterd/disk.go#L97-L111)) +3. The dm-crypt key in Ceph's config-key store is never removed, leading to LUKS collisions on retry of encrypted OSDs. (`DestroyOSD`, [remove.go#L244-L292](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/remove.go#L244-L292)) +4. **The prepare-pod can't find a shared metadata disk once it hosts a DB LV.** Rook's disk-discovery (`DiscoverDevicesWithFilter`, [disk.go#L97-L111](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/clusterd/disk.go#L97-L111)) skips any disk with `len(deviceChild) > 1` as a guard against claiming a user-partitioned disk. The first OSD's DB LV trips that filter, and the prepare-pod's `initializeDevicesLVMMode` then errors with `metadata device is not found`. 
Same root cause as upstream issues [#15868](https://github.com/rook/rook/issues/15868) and parts of [#17477](https://github.com/rook/rook/issues/17477). 5. `OSDInfo.MetadataPath` is never populated for LVM-mode OSDs (the parser walks only `[block]` entries from `ceph-volume lvm list`), so the operator has no record of which metadata disk a destroyed OSD used. (`GetCephVolumeLVMOSDs`, [volume.go#L1104-L1177](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/volume.go#L1104-L1177)) ## Proposed flow This flow orchestrates [Ceph's documented OSD-replacement procedure](https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/#replacing-an-osd) (`safe-to-destroy` → `osd destroy` → `lvm zap` → `lvm prepare --osd-id` → `lvm activate`) across short-lived Kubernetes Jobs, with operator-side state for crash recovery and Rook-specific gates around auto-provisioning. cephadm — Ceph's container-orchestrator analogue — preserves OSD IDs by default ([cephadm OSD service docs](https://docs.ceph.com/en/latest/cephadm/services/osd/#replacing-an-osd)); this design follows the same convention. -Two short-lived jobs — Destroy Job and Prepare Job — separated by the wait for the replacement disk. The operator owns all phase transitions and the wait; jobs are workers observed via `Job.status.succeeded`. - ->Why split into two jobs (vs. one job like the existing OSD migration flow)? ->- The disk-swap wait can take hours. Keeping a job pod alive across it is wasteful — the operator, not a job, should own the wait. ->- Destroy and prepare are independently retryable. If destroy succeeds and prepare fails, only prepare re-runs. - -One OSD per reconcile cycle, gated by `safe-to-destroy `. +The flow uses two short-lived jobs, a Destroy Job and a Prepare Job, separated by the wait for the replacement disk. The operator owns all phase transitions and the wait; jobs are workers observed via `Job.status.succeeded`. Replacements run serially cluster-wide. 
### Sequence @@ -78,261 +71,263 @@ One OSD per reconcile cycle, gated by `safe-to-destroy `. sequenceDiagram autonumber actor User - participant CR as CephCluster CR + participant CR as Rook CR participant Op as Operator - participant Map as Replacement CM - participant OldPod as OSD pod 5 old + participant OldPod as Old OSD pod participant DJ as Destroy Job participant PJ as Prepare Job participant Ceph as Ceph - participant NewPod as OSD pod 5 new - User->>CR: set spec.storage.replaceOSD id=5 + User->>CR: set replaceOSD id=5 Op->>CR: read trigger - Op->>Ceph: ceph osd dump get fsid for osd.5 + Op->>CR: write phase=Validating + Op->>Ceph: ceph osd dump (validate exists, get fsid) + Op->>Ceph: safe-to-destroy 5 Op->>OldPod: read deployment env - Op->>Map: write phase=destroy-pending,
layout + Op->>CR: update phase=Destroying + OSD info Op->>OldPod: delete deployment + destroy OldPod Op->>OldPod: wait for pod termination - Op->>DJ: create + Op->>+DJ: create DJ->>Ceph: ceph osd destroy osd.5 - DJ->>Ceph: ceph config-key exists, rm - DJ->>DJ: cryptsetup close db mapping + DJ->>Ceph: config-key rm dm-crypt key (if encrypted) + DJ->>DJ: cryptsetup close db mapping (if encrypted) DJ->>DJ: lvremove db lv DJ->>DJ: ceph-volume lvm zap data lv - Op->>DJ: observe Succeeded - Op->>Map: phase=prepare-pending,
pending-db-lv-name - Note over User,Op: User swaps the failed disk
(any time after CR edit) - Note over Op: Wait for replacement disk:
- with rook-discover: watch local-device-NODE CM
- without: spawn inventory Job, requeue 5m (U-9) - Op->>PJ: create - PJ->>PJ: lvcreate using persisted lv name - PJ->>Ceph: ceph-volume lvm prepare
osd-id=5 - Note over PJ: writes per-node status CM
(rook-ceph-osd-NODE-status,
existing CM, not Map) - Op->>PJ: observe Succeeded - Op->>Map: phase=deployment-pending + DJ-->>-Op: Succeeded + Op->>CR: phase=Waiting + Note over Op: wait for replacement disk (non-blocking) + Note over User,Op: User swaps the failed disk + Op->>CR: phase=Preparing + Op->>+PJ: create from recorded OSD info + PJ->>PJ: lvcreate using persisted name + PJ->>Ceph: ceph-volume lvm prepare --osd-id 5 + PJ->>PJ: write new OSD info to existing per-node status CM + PJ-->>-Op: Succeeded + create participant NewPod as New OSD pod Op->>NewPod: create deployment with id 5 NewPod->>Ceph: lvm activate, join cluster - Op->>Map: phase=completed once Ready + Op->>Ceph: ceph osd metadata 5 until Ready + Op->>CR: phase=Completed ``` -### ConfigMaps and phase state +### Open question: controller placement -Two ConfigMaps appear in the flow: +The diagram doesn't pick a concrete CR or controller for the replacement reconcile logic. Two candidates: extend the existing CephCluster controller (which already hosts `spec.storage.migration`), or introduce a separate `CephOSDReplace` CRD with its own controller. The design leans toward the separate CRD for the following reasons: -1. **`osd-replacement-state`** — new in this design. Per-cluster, single-key. Lives in the operator namespace, owner-ref'd to the CephCluster (same lifecycle pattern as `osd-migration-config` at [`migrate.go#L42-L44`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/cluster/osd/migrate.go)). Created when validation persists the trigger (Step 3); transitioned through phases by the operator; deleted (or its single entry overwritten) when the user moves on to a different OSD or clears `replaceOSD`. -2. **`rook-ceph-osd--status`** — existing per-node prepare-job output CM. 
The Prepare Job (Step 7) writes the new OSD's layout here; the existing reconcile path at [`status.go#L324`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/cluster/osd/status.go#L324) consumes it to create the daemon Deployment. This design does not change its shape or lifecycle. +1. **CephCluster's `Reconcile()` runs mon, mgr, and osd reconcile sequentially in one call** ([`cluster.go#L116-L160`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/cluster/cluster.go#L116-L160)). New long-running logic on the OSD path can interfere with mon/mgr reconcile for the same cluster. +2. **Replacement is long-running and multi-step**, so its state has to survive between reconciles. The cluster controller has no existing place to store sub-operation state — adding one (a side ConfigMap, or extending `CephCluster.status`) is part of the cost. +3. **Replacement reconciles need two outcomes the current cluster reconcile can't express**: terminal failure (bad CR rejected) and `RequeueAfter` (waiting for external events — disk inserted, Job done). Today `osd.Cluster.Start()` returns plain `error`; the parent reconcile has no way to learn "OSD step is mid-replacement, retry in N minutes." It's also unclear how a requeue would interact with components reconciled after the OSD step in the same `Reconcile` call. -The replacement CM holds at most one entry per cluster, keyed by `osd-id`. Re-trigger of the same OSD (with a fresh `confirmation` string — see "Trigger already consumed" in Step 1 pre-checks) overwrites the entry's confirmation and resets phase. Trigger of a different OSD ID while one is in flight does not collide: the "no other replacement in progress" pre-check blocks it until the in-flight one completes. Collision on re-trigger is structurally impossible. 
+Concrete shape of each candidate: -**Phase state machine:** +- **Extend the cluster controller** — state in either a side ConfigMap (`osd-replacement-state`, mirroring `osd-migration-config`) or `CephCluster.status`. Same UX as `spec.storage.migration`. +- **New `CephOSDReplace` CRD + dedicated controller** — state on `.status`. Independent reconcile goroutine; never touches the existing OSD path. Light coupling on the cluster side: skip auto-provisioning on affected nodes; surface an `OSDReplacementInProgress` condition. -``` - ┌────────── (timeout) ──────────┐ - ▼ │ -(no entry) → destroy-pending → prepare-pending → deployment-pending → completed → (GC'd) - ▲ - └── (cancel via remove replaceOSD; only honored in waiting-for-disk substate) -``` +The rest of this design proposes the CRD shape concretely. Using `spec.storage.replaceOSD` on CephCluster instead is a fallback that reuses the same state machine and step semantics — implications flagged inline where they differ. -- **destroy-pending**: operator deleted the OSD deployment, Destroy Job is in flight or about to start. -- **prepare-pending**: Destroy Job succeeded, two substates — *waiting for disk* (no Prepare Job yet, only `pending-db-lv-name` reserved) and *Job running* (`lvcreate` and `ceph-volume lvm prepare` in flight). -- **deployment-pending**: Prepare Job succeeded, operator is creating the new daemon Deployment. -- **completed**: new daemon Ready in Ceph; entry kept until the next spec change for audit, then GC'd. +### State -**Full example of the record:** +State lives on `CephOSDReplace.spec` (immutable post-create) and `.status` (operator-updated). Status surfacing follows the standard Kubernetes conditions pattern. Field details and sources inline as YAML comments below. For raw-mode OSDs, `dbLV`, `metadataSourceDevice`, `metadataVG`, and `databaseSizeMB` are absent and `dataLV` is the raw device path; the Destroy step skips DB cleanup and the Prepare step omits `--block.db`. 
```yaml -osd-id: 5 -node: node-1 # required for Destroy/Prepare Job NodeSelector; survives Step 4's deployment delete -phase: destroy-pending # destroy-pending → prepare-pending → deployment-pending → completed -data-lv: /dev/ceph-data-vg-5/osd-block-aaa... -db-lv: /dev/ceph-metadata-vg-1/osd-db-bbb... -metadata-source-device: nvme0n1 -metadata-vg: ceph-metadata-vg-1 -crush-device-class: hdd -database-size-mb: 4096 -encrypted: true -osd-fsid: 8b7e6c19-... -pending-db-lv-name: # populated when phase advances to prepare-pending -expected-disk-pending: false # set true while phase=prepare-pending; gates auto-provision skip per required change 6 -confirmation: # value from spec at trigger time; populated on phase=completed -new-fsid: # populated on phase=completed; for audit/diagnostics only, never for re-arming -completed-at: # populated on phase=completed +apiVersion: ceph.rook.io/v1 +kind: CephOSDReplace +metadata: + name: replace-osd-5 + namespace: rook-ceph +spec: # immutable post-create; re-replace = create a new CR + cephCluster: my-cluster # target cluster in this namespace + osdId: 5 + confirmation: yes-really-replace-osd-5 # must equal "yes-really-replace-osd-{osdId}"; typo guard against destroying the wrong OSD + autoOut: false # optional; if true, operator marks healthy OSD `out` automatically. Default: false (fail-fast on up+in) + safeToDestroyTimeout: 1h # optional; how long Validating tolerates EBUSY before Failed. Default: 1h + diskWaitTimeout: 24h # optional; how long Waiting tolerates a missing disk before Failed. 
Default: 24h + +status: + phase: Destroying # Pending | Validating | Destroying | Waiting | Preparing | Completed | Failed | Cancelled + conditions: + - type: Ready + status: "False" + reason: Destroying + message: Destroy Job in flight + observedGeneration: 1 + lastTransitionTime: "2026-05-05T12:00:00Z" + + # captured at the Validating → Destroying transition + osdInfo: + node: node-1 # OSD deployment NodeSelector; survives the deployment delete + dataLV: /dev/ceph-data-vg-5/osd-block-aaa... # OSD deployment env ROOK_BLOCK_PATH + dbLV: /dev/ceph-metadata-vg-1/osd-db-bbb... # OSD deployment env ROOK_METADATA_DEVICE; absent for raw-mode OSDs + metadataSourceDevice: nvme0n1 # OSD deployment env ROOK_METADATA_SOURCE_DEVICE; absent for raw-mode OSDs; from `ceph-volume lvm list` if env missing (see Capture step) + metadataVG: ceph-metadata-vg-1 # from `pvs --noheadings -o vg_name ` + crushDeviceClass: hdd # OSD deployment env ROOK_OSD_CRUSH_DEVICE_CLASS + databaseSizeMB: 4096 # from `lvs --noheadings -o lv_size ` ÷ 1MiB + encrypted: true # from LV tag `ceph.encrypted` on + osdFsid: 8b7e6c19-... # from `ceph osd dump --format json` + + # populated on phase=Completed + newFsid: "" # for audit only; never used for re-arming + completedAt: null ``` -**Reconcile order on every cycle:** the OSD reconcile entry-point ([`Cluster.Start` in `osd.go#L255`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/cluster/osd/osd.go#L255)) gains a new first-step subroutine that runs **before** the existing `updateAndCreateOSDs` path: +`metadataSourceDevice` and the env it's read from (`ROOK_METADATA_SOURCE_DEVICE`) are only present on OSDs deployed after required changes #2 and #5 land; older OSDs use the fallback in [Capture](#3-capture). U-8 explores eliminating the fallback at the source. -1. **GC** stale entries (rules in "Long-term state cleanup" below). -2. **Drive in-flight** entries forward via the state machine. -3. 
**Validate** any newly-set `spec.replaceOSD` and persist the entry on success. +**Cancel and re-replace.** Cancel = delete the CR; a finalizer runs the operator's cleanup (delete partially-allocated DB LV if any; leave the OSD `destroyed` for the user to `ceph osd purge` manually). Re-replacement of the same OSD = create a new CR with a different name. Terminal CRs (`Completed`, `Cancelled`, `Failed`) are inert — keep them as audit trail or delete them; the operator requires neither. -Only after this returns does `updateAndCreateOSDs` run. This ordering prevents auto-provisioning from racing a fresh trigger when the operator restarts after a CR edit (the "operator-down race"): a replacement disk inserted during operator downtime is held off via the `expected-disk-pending` flag in the record until the replacement Prepare Job claims it. +#### Coordination -### Step-by-step +Replacements run serially cluster-wide as a simplifying choice, matching cephadm's `osd rm` queue and Rook's existing OSD migration. Per-OSD `safe-to-destroy` only returns OK once the OSD is fully drained from every PG's acting set, so concurrent destroys of independently-safe OSDs are technically safe — but serial keeps the operational model simple. + +With a separate `CephOSDReplace` CRD, coordination is explicit: new CRs enter `Pending` and wait for any peer `CephOSDReplace` in the same namespace (targeting the same cluster) to reach a terminal phase. FIFO ordering by `(creationTimestamp, UID)` — the UID tiebreaker handles same-second creations. Race-safe under optimistic concurrency: two CRs that see each other both stay `Pending`; the older wakes first. -The walk-through uses a concrete example: +Extending CephCluster with a `spec.storage.replaceOSD` field needs no coordination logic — a single field admits only one in-flight replacement. 
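The FIFO rule for leaving `Pending` can be sketched in Go as follows. This is a minimal illustration under stated assumptions: the `replacement` struct is a hypothetical, trimmed-down view of a `CephOSDReplace` CR (creation time as Unix seconds, phase as a plain string), and the function names are invented for the example.

```go
package main

import "fmt"

// replacement is a hypothetical, trimmed-down view of a CephOSDReplace CR.
type replacement struct {
	Name    string
	UID     string
	Created int64 // creationTimestamp as Unix seconds
	Phase   string
}

// isTerminal reports whether a phase is one of the terminal phases.
func isTerminal(phase string) bool {
	switch phase {
	case "Completed", "Failed", "Cancelled":
		return true
	default:
		return false
	}
}

// earlier orders two CRs by (creationTimestamp, UID); the UID breaks
// ties between CRs created within the same second.
func earlier(a, b replacement) bool {
	if a.Created != b.Created {
		return a.Created < b.Created
	}
	return a.UID < b.UID
}

// mayLeavePending reports whether candidate may advance to Validating:
// no peer that sorts earlier is still in a non-terminal phase.
func mayLeavePending(candidate replacement, peers []replacement) bool {
	for _, p := range peers {
		if p.UID == candidate.UID || isTerminal(p.Phase) {
			continue
		}
		if earlier(p, candidate) {
			return false
		}
	}
	return true
}

func main() {
	peers := []replacement{
		{Name: "replace-osd-5", UID: "uid-b", Created: 1777982400, Phase: "Pending"},
		{Name: "replace-osd-7", UID: "uid-a", Created: 1777982400, Phase: "Pending"},
		{Name: "replace-osd-2", UID: "uid-c", Created: 1777980000, Phase: "Completed"},
	}
	for _, p := range peers {
		fmt.Printf("%s may leave Pending: %v\n", p.Name, mayLeavePending(p, peers))
	}
}
```

Because every reconcile evaluates the same total order, two CRs that observe each other cannot both advance; the later one keeps waiting until the earlier one reaches a terminal phase.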
+ +#### Phase state machine ``` -OSD ID: 5 -metadata VG: ceph-metadata-vg-1 -data device: /dev/sdc → /dev/sdh after swap -databaseSizeMB: 4096 -crush-device-class: hdd -encryption: on + Pending ─→ Validating ─→ Destroying ─→ Waiting ─→ Preparing ─→ Completed + │ │ + ▼ ▼ + Failed/Cancelled Failed/Cancelled ``` -#### Step 1 — User sets `replaceOSD` on the CR (diagram arrows 1-2) +(With the cluster-CR fallback, `Pending` is omitted — a single `spec.storage.replaceOSD` field admits only one in-flight replacement.) -Typical trigger is a failed disk, but failure is not required — `safe-to-destroy` is the only gate, so the flow also covers proactive replacement of a healthy OSD. +Per-phase behavior: -```yaml -spec: - storage: - useAllNodes: true - useAllDevices: true # or use deviceFilter; an exact `name:` entry on osd.5's device would be rejected by pre-check #5 - replaceOSD: - id: 5 - confirmation: "yes-really-replace-osd-5" -``` +| Phase | Normal exit | Transient failure (retried) | Terminal exit | +|---|---|---|---| +| (no record) | → `Pending` once trigger valid | — | — | +| `Pending` | → `Validating` once no earlier peer is in non-terminal phase | re-checks each reconcile while an earlier peer is in-flight | → `Cancelled` if user deletes the CR | +| `Validating` | → `Destroying` once all checks pass | `safe-to-destroy` returns EBUSY (peers backfilling) — re-checked each reconcile | → `Cancelled` on CR delete; → `Failed` with `reason=InvalidSpec` (target OSD missing, unstable device name, confirmation mismatch) or `reason=OSDStillIn` (up+in target without `autoOut: true`) or `reason=NotSafeToDestroy` (escalation timeout) | +| `Destroying` | → `Waiting` on Destroy Job success | Destroy Job retries on transient errors (Ceph unreachable, pod scheduling) | — | +| `Waiting` | → `Preparing` once replacement disk visible | inventory poll until disk visible | → `Cancelled` on CR delete; → `Failed` with `reason=ReplacementDiskMissing` after disk-swap wait expires | +| 
`Preparing` | → `Completed` when new daemon is Ready in Ceph | Prepare Job pod retries on transient errors; `lvcreate` precheck handles partial LV from a prior pod; Deployment creation retries | — | +| `Completed` | record persists for audit | — | terminal | +| `Cancelled`, `Failed` | record persists for audit | — | terminal until user-side action | -`confirmation` is a free-form string the user picks. It does not encode the OSD ID; the example just embeds `5` for human clarity. To re-trigger replacement of the *same* OSD ID after a successful run, the user changes `confirmation` to a new string (e.g., `"yes-really-replace-osd-5-take-2"`). Same UX as `spec.storage.migration.confirmation` today. +Phase is the operator's internal cursor. The user-visible signal is a single `Ready` condition: `True` only on `Completed`; `False` otherwise, with `reason` set to the current phase name (or a typed terminal reason for `Failed`: `InvalidSpec`, `OSDStillIn`, `NotSafeToDestroy`, `ReplacementDiskMissing`). This gives `kubectl wait --for=condition=Ready` semantics. Use `metav1.Condition` (with `observedGeneration`), not Rook's legacy `cephv1.Condition`. -The user can swap the disk at any point after the edit succeeds — before, during, or after destroy. Step 5 tolerates a missing data PV. Only ordering rule: edit the CR first, then swap. +#### Reconcile resume behavior -If multiple OSDs need replacement, the user sets `replaceOSD`, waits for completion, then sets it again with a different ID. `replaceOSD` is an object, not a list — same shape as `spec.storage.migration` for consistency. Parallelism is open question U-2. +| Phase on disk | Recovery on next reconcile | +|---|---| +| no record | Trigger re-evaluated. If valid, record created with `phase=Pending`. No destructive action taken yet. | +| `Pending` | Peer list re-evaluated; if no earlier-timestamp peer is in non-terminal phase, advance to `Validating`. No destructive action taken yet. 
| +| `Validating` | Checks re-run (idempotent); `safe-to-destroy` escalation clock continues from the record's creation timestamp. No destructive action taken yet. | +| `Destroying`, no Destroy Job | Operator re-issues the deployment delete (idempotent), waits for pod termination, creates the Destroy Job. | +| `Destroying`, Destroy Job in flight | Operator awaits Job; on retry, recreates it. All commands in Destroy step are idempotent via precheck patterns. | +| `Waiting` | Operator polls for replacement disk; once visible, advances to `Preparing` and creates the Prepare Job (UUID generated inline, baked into Job env so pod retries reuse it). | +| `Preparing`, Prepare Job in flight | Operator awaits Job; on retry, recreates it. `lvcreate` skipped if LV exists; `lvm prepare --osd-id` reuses the destroyed slot. | +| `Preparing`, Prepare Job done | Existing per-node OSD-status reconcile creates the Deployment; operator polls `ceph osd metadata` until Ready, then transitions to `Completed`. | +| `Completed` | Validation short-circuits; record deleted on next spec change. | -**Pre-checks.** Each check runs on each reconcile when `spec.replaceOSD` is set. Possible outcomes per check: +> **⚠️ Destroy is irreversible.** Once `Validating` passes (phase advances to `Destroying`), `osd.5` will be destroyed on this reconcile cycle. There is no preview surfacing the captured OSD info. If the user typed the wrong OSD ID, the wrong OSD is gone — recovery is via [Cancellation](#cancellation), not by retracting the trigger. -- **Continue** — advance to the next check. -- **Short-circuit** — no action this reconcile (idempotency / in-flight). -- **Terminal-reject** — set `ReplacementRejected` condition + Kubernetes Event via `opcontroller.UpdateCondition` ([`conditions.go#L35`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/controller/conditions.go#L35)); user must change spec to recover. 
-- **Transient-wait** — set a `WaitingFor*` condition; re-evaluate next reconcile, no spec change needed. +### Step-by-step -1. **Trigger already consumed.** Replacement CM has a `phase: completed` entry whose `(osd-id, confirmation)` match the spec. Match is on `(id, confirmation)` only — *not* `new-fsid`. (Re-using fsid would silently destroy an OSD that the user manually purged and recreated outside this flow.) → **Short-circuit.** -2. **Trigger already in flight.** Replacement CM has an in-progress entry (any phase before `completed`) whose `(osd-id, confirmation)` match the spec. → **Short-circuit**; the state machine drives the existing entry forward instead of re-validating. -3. **OSD 5 exists** in the OSD map. → **Terminal-reject** if absent (wrong ID, user edits spec). -4. **`safe-to-destroy 5`** returns OK. The only safety gate; `down`/`out` alone is not sufficient because data may not have replicated to peers. → **Transient-wait** (`WaitingForSafeToDestroy`) while peers backfill — verified on Ceph v19.2.2 in [`osd-rep-log.md`](osd-rep-log.md) §1.2 that `safe-to-destroy` returns EBUSY in this state. Bounded escalation timeout (default 1h; see U-4) flips to terminal `SafeToDestroyTimeout` — backfill stuck for 1h+ warrants paging. -5. **Failed OSD's CR matching is swap-tolerant** — evaluated per the validation policy (default `strict`; see U-10 for the policy and configurability discussion): - - **`strict`** — reject if the failed OSD is matched by *any* exact `name:` entry in `spec.storage.nodes[*].devices[*]` on the OSD's node. The CR must match the failed OSD via `useAllDevices` or `deviceFilter`. Implementation: look up the failed OSD's data device from its deployment; scan the CR's `name:` entries on that node; reject if any resolves to that device. - - **`accept-by-path`** — reject only kernel-name-style references (`vdb`, `sdc`, `/dev/sdc`); accept `/dev/disk/by-path/...` references. The user takes responsibility for performing same-slot replacement. 
 - **`lenient`** — accept any CR shape. Mismatches surface as a Step 6 stall (`ReplacementDiskMissing` after the U-4 timeout). +The walk-through uses the running example. - → **Terminal-reject** if the chosen policy rejects (spec must be made swap-tolerant before this flow can run). -6. **No unexpected OSD on the node** — catches the auto-provisioning race (a replacement disk was inserted before this trigger fired and Rook auto-provisioned a new OSD on it). Compare: - - `ceph osd metadata` filtered by hostname (already used by [`clusterdisruption/osd.go#L450`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/disruption/clusterdisruption/osd.go#L450)), - - vs. OSD Deployments owned by Rook on this node (`app=rook-ceph-osd`, filtered by `NodeSelector[k8sutil.LabelHostname()]`). +#### 1. Trigger — user creates a `CephOSDReplace` CR - Any OSD Ceph reports with no matching Rook Deployment is unexpected. → **Terminal-reject** (user removes the orphan before re-triggering). -7. **No other replacement is in progress** (different `osd-id`). → **Transient-wait** (`WaitingForInFlightReplacement`); self-clearing once the in-flight entry reaches `completed` and is GC'd. +The typical case is a failed disk: Ceph auto-marks the OSD `down` and `out` after `mon_osd_down_out_interval` (default 600s) and backfills; once the OSD is drained from every PG's acting set, `safe-to-destroy` clears and the flow proceeds. The user creates a `CephOSDReplace` CR and replaces the failed device in the datacenter. -#### Step 2 — Capture layout +For proactive replacement of a healthy (up+in) OSD, two options: -The operator captures the OSD's layout from sources that do not require the failed data device. +1. **Default — user marks `out` first.** With `autoOut: false` (default), Validating fails fast on a still-`in` OSD with `reason=OSDStillIn`. The user runs `ceph osd out ` themselves, waits for backfill, then re-applies the CR. +2. 
**Opt-in — `autoOut: true`.** Operator runs `ceph osd out ` at entry to Validating and loops `safe-to-destroy` through the full backfill. The `safeToDestroyTimeout` should be extended to fit the cluster's expected backfill time when using this. -| Field | Source | Example | -| ------------------------- | -------------------------------------------------------------------------------------------- | ------------------------------------- | -| `osd-fsid` | `ceph osd dump --format json` | `8b7e6c19-...` | -| `osd-id` | OSD pod label `ceph-osd-id` | `5` | -| `node` | OSD deployment `Spec.Template.Spec.NodeSelector[k8sutil.LabelHostname()]` | `node-1` | -| `data-lv` | OSD deployment env `ROOK_BLOCK_PATH` | `/dev/ceph-data-vg-5/osd-block-aaa…` | -| `db-lv` | OSD deployment env `ROOK_METADATA_DEVICE` ¹ | `/dev/ceph-metadata-vg-1/osd-db-bbb…` | -| `metadata-source-device` | OSD deployment env `ROOK_METADATA_SOURCE_DEVICE` ² | `nvme0n1` | -| `crush-device-class` | OSD deployment env `ROOK_OSD_CRUSH_DEVICE_CLASS` | `hdd` | -| `metadata-vg` | `pvs --noheadings -o vg_name ` | `ceph-metadata-vg-1` | -| `database-size-mb` | `lvs --noheadings -o lv_size ` ÷ 1MiB ³ | `4096` | -| `encrypted` | LV tag `ceph.encrypted` on `` ³ | `true` | +The disk can be swapped any time after the CR is applied — the Capture step tolerates a missing data PV. If multiple OSDs need replacement — open question U-2. -¹ Existing env, but populated only for raw-mode OSDs today. Required change #2 fixes the parser to populate it for LVM-mode OSDs as well. -² New env, added by required change #5. For OSDs whose deployment predates required change #5, this env is missing — see fallback below. -³ Read from the OSD's own DB LV (the metadata VG is by construction intact at Step 2: failure is on the data device, and Step 5's `lvremove` hasn't run yet). 
Live spec is *not* the source: a user-edited `spec.storage.config.databaseSizeMB` between original provisioning and replacement would size the new DB LV inconsistently with siblings, and `encrypted` is immutable per-OSD so a CR-level toggle cannot retroactively change it. If the OSD's own LV is missing for any reason, fall back to a surviving sibling LV in the same VG. +#### 2. Validate -**Fallback when `ROOK_METADATA_*` env vars are missing.** For deployments predating required change #5, the operator captures `db-lv` and `metadata-source-device` from a one-shot `ceph-volume lvm list --format json` Job on the OSD's node, via Rook's existing `cmdreporter` ([`cmdreporter.go`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/k8sutil/cmdreporter/cmdreporter.go) — same pattern used today for network/version detection). The pod profile mirrors the prepare-job's (privileged + `/dev`, `/run/lvm`, `/run/udev` mounts, NodeSelector pinned to the failed OSD's node). Output's `[db]` entry: `devices` field is the metadata source device, `tags.ceph.db_device` is the DB LV path. Correct even when the data device has physically failed — `ceph-volume lvm list` reads from VG metadata replicated on the metadata-VG's surviving PV. Verified empirically against the Lima cluster's output for a healthy shared-metadata OSD. If the Job fails or returns no entry for the target OSD, validation rejects with `LayoutCaptureFailed` (terminal — user investigates, e.g. metadata disk also failed → out of scope). +Phase `Validating`. Entered when a fresh trigger advances out of `Pending` and no matching record exists. The operator persists the record on entry (so the `safe-to-destroy` escalation clock — origin = record creation timestamp — survives reconciles), then runs the checks below each reconcile cycle until exit. Failures land on `phase=Failed` with a typed `reason` exposed via `.status.conditions`. 
-#### Step 3 — Persist the replacement record (diagram arrow 5) +1. **Confirmation matches.** `spec.confirmation` must equal `"yes-really-replace-osd-{spec.osdId}"`. → `Failed` with `reason=InvalidSpec` on mismatch (typo guard). +2. **Target OSD exists** in the OSD map. → `Failed` with `reason=InvalidSpec` if absent. +3. **Target OSD is destroyable.** If the OSD is `up && in`: with `spec.autoOut: false` (default), → `Failed` with `reason=OSDStillIn`. With `spec.autoOut: true`, the operator runs `ceph osd out ` once at entry and falls through to check 5. +4. **CR-level device matching is swap-tolerant.** → `Failed` with `reason=InvalidSpec` if the OSD's data device is referenced by an unstable name in the CR. Validation policy is configurable per U-10. +5. **`safe-to-destroy ` returns OK.** Returns EBUSY while any PG still has the OSD in its acting set — the only safety gate (`down`/`out` alone is not sufficient because data may not have replicated). EBUSY → stay in `Validating`, re-check next reconcile. `spec.safeToDestroyTimeout` exceeded → `Failed` with `reason=NotSafeToDestroy`. -Operator writes the replacement CM with `phase: destroy-pending` and the layout captured in Step 2. Field schema and lifecycle: see "ConfigMaps and phase state" above. From this point on, the record is the source of truth for retry — a crashed operator restarts and resumes from the persisted phase. +When all checks pass, the operator advances to `Destroying` and proceeds to [Capture](#3-capture). -#### Step 4 — Delete OSD deployment, wait for pod termination, create Destroy Job (diagram arrows 6-8) +There is no auto-provisioning-race detection. If the user mis-orders — inserts the disk before creating the `CephOSDReplace` CR — Rook auto-provisions a new OSD on it (with a fresh ID), and the subsequent replacement then stalls in the Wait step (no empty disk left to claim). 
The user resolves by purging the auto-provisioned OSD; the design relies on this observable failure mode rather than embedding pre-trigger detection. -Operator calls `k8sutil.DeleteDeployment` ([`deployment.go#L388`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/k8sutil/deployment.go#L388)) on `rook-ceph-osd-5` — this deletes the deployment and polls until the pod is gone. Then it creates the Destroy Job populated with the layout. The pod-gone wait is required: while the daemon runs, it holds the DB-side LUKS mapping open and Step 5's `cryptsetup close` would fail. +#### 3. Capture -If the wait times out (transiently NotReady node), the operator sets `WaitingForOSDPodTermination` and re-checks on the next reconcile. The operator does NOT force-delete: a stuck pod on a NotReady node may still be holding the LUKS mapping when kubelet recovers; force-delete would diverge K8s and host state. +Capture the failed OSD's info from sources that do not require the failed data device. Runs at the `Validating` → `Destroying` transition: the operator captures the fields below and atomically updates the existing record (sets `phase=Destroying` plus the captured fields). From this point on, the record carries OSD info; a crashed operator restarts and resumes from the persisted phase. + +The OSD's own DB LV is the source for `database-size-mb` and `encrypted` (the metadata VG is by construction intact at this step: failure is on the data device, and the destroy LV-removal hasn't run yet). Live spec is *not* the source: a user-edited `spec.storage.config.databaseSizeMB` between original provisioning and replacement would size the new DB LV inconsistently with siblings, and `encrypted` is immutable per-OSD so a CR-level toggle cannot retroactively change it. If the OSD's own LV is missing, fall back to a surviving sibling LV in the same VG. 
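The source-selection rule for `database-size-mb` and `encrypted` (own DB LV first, sibling LV in the same VG as fallback) might look like the sketch below. The `dbLV` type and `captureSource` function are hypothetical names introduced for illustration; how the LV list is obtained on the node is out of scope here.

```go
package main

import (
	"errors"
	"fmt"
)

// dbLV is a hypothetical view of one DB LV on the shared metadata VG.
type dbLV struct {
	OSDID     int
	VG        string
	Name      string
	SizeMB    int
	Encrypted bool
}

// captureSource returns the LV whose size and encryption flag are copied into
// the replacement record: the OSD's own DB LV if it still exists, otherwise
// the first surviving sibling in the same VG.
func captureSource(lvs []dbLV, vg string, osdID int) (dbLV, error) {
	var sibling *dbLV
	for i := range lvs {
		lv := &lvs[i]
		if lv.VG != vg {
			continue // LVs on other VGs say nothing about this metadata device
		}
		if lv.OSDID == osdID {
			return *lv, nil
		}
		if sibling == nil {
			sibling = lv
		}
	}
	if sibling != nil {
		return *sibling, nil
	}
	return dbLV{}, errors.New("no DB LV found in the metadata VG")
}

func main() {
	lvs := []dbLV{
		{OSDID: 3, VG: "ceph-metadata-vg-1", Name: "osd-db-3", SizeMB: 4096, Encrypted: true},
		{OSDID: 4, VG: "ceph-metadata-vg-1", Name: "osd-db-4", SizeMB: 4096, Encrypted: true},
	}
	// osd.5's own LV is missing, so a sibling supplies the values.
	src, err := captureSource(lvs, "ceph-metadata-vg-1", 5)
	fmt.Println(src.Name, src.SizeMB, src.Encrypted, err)
}
```

The sibling fallback assumes all DB LVs on one metadata VG were provisioned with the same size and encryption setting, which is the same assumption the paragraph above relies on.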
-**Host permanently down — out of scope.** If the host is genuinely gone (powered off, hardware failure), this flow cannot proceed: the Destroy Job's NodeSelector pins it to that node, and even a force-deleted OSD pod doesn't bring the kubelet back. The Destroy Job stays Pending. Replacement of an OSD on a permanently-dead host is a different workflow (node decommission, then OSD-out-and-purge, then re-add the host with fresh OSDs) — handled by existing Rook flows, not this design. The operator surfaces this case via a `ReplacementHostUnavailable` event after both the pod-termination wait and a Destroy-Job-Pending wait expire. +For OSDs whose deployment predates required changes #2 and #5, `ROOK_METADATA_DEVICE` and `ROOK_METADATA_SOURCE_DEVICE` are absent. A one-shot `ceph-volume lvm list --format json` Job on the OSD's node, via Rook's existing `cmdreporter`, fills both. Output's `[db]` entry: `devices` → metadata source device, `tags.ceph.db_device` → DB LV path. Correct even when the data device has physically failed (reads from VG metadata replicated on the metadata-VG's surviving PV). Verified on the Lima cluster. -#### Step 5 — Destroy Job (diagram arrows 9-13) +#### 4. Destroy -Operator-owned phase stays `destroy-pending` until the Job reports `Succeeded`. The Job's container invokes `DestroyOSD` ([`remove.go#L244-L292`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/remove.go#L244-L292)) — the same Go function the existing migration flow already calls from [`cmd/rook/ceph/osd.go#L272`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/cmd/rook/ceph/osd.go). The bash below specifies the behavior `DestroyOSD` must implement after required change #3 lands (today it only does the first step and a partial last step). Each operation is idempotent on retry; no standalone shell script ships in the operator. 
+Operator calls `k8sutil.DeleteDeployment` ([`deployment.go#L388`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/k8sutil/deployment.go#L388)) on `rook-ceph-osd-5` — this deletes the deployment and polls until the pod is gone. The pod-gone wait is required: while the daemon runs, it holds the DB-side LUKS mapping open and the next step's `cryptsetup close` would fail. + +If the wait times out (transiently NotReady node), the operator sets `WaitingForOSDPodTermination` and re-checks on the next reconcile. The operator does NOT force-delete: a stuck pod on a NotReady node may still be holding the LUKS mapping when kubelet recovers; force-delete would diverge K8s and host state. + +**Host permanently down — out of scope.** If the host is gone (powered off, hardware failure), this flow cannot proceed: the Destroy Job's NodeSelector pins it to that node, and force-deleting the OSD pod doesn't bring the kubelet back. The Destroy Job stays Pending. Replacement of an OSD on a permanently-dead host is a different workflow (node decommission, then OSD-out-and-purge, then re-add the host with fresh OSDs) — handled by existing Rook flows. + +The Destroy Job's container invokes `DestroyOSD` ([`remove.go#L244-L292`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/remove.go#L244-L292)) — the same Go function the existing migration flow already calls from [`cmd/rook/ceph/osd.go#L272`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/cmd/rook/ceph/osd.go). The bash below specifies the behavior `DestroyOSD` must implement after required change #3 lands (today it only does the first step and a partial last step). Each operation is idempotent on retry; no standalone shell script ships in the operator. ```bash -# 5.1 Destroy in Ceph (preserves OSD ID 5 for reuse). +# Destroy in Ceph (preserves OSD ID 5 for reuse). 
ceph osd destroy osd.5 --yes-i-really-mean-it # idempotent: already-destroyed → succeeds -# 5.2 Remove dm-crypt key. On Ceph v19.2.2 (verified) `ceph osd destroy` already -# cleans the key and `config-key rm` on a missing key is itself idempotent -# (returns 0), so this whole step is typically a no-op. The explicit `exists` -# precheck is defensive: keeps the chain safe on older Ceph versions where -# rm's exit-code behavior on missing key has not been measured. +# Remove dm-crypt key. On Ceph v19.2.2 (verified) `ceph osd destroy` already +# cleans the key and `config-key rm` on a missing key is itself idempotent +# (returns 0), so this whole step is typically a no-op. The explicit `exists` +# precheck is defensive: keeps the chain safe on older Ceph versions where +# rm's exit-code behavior on missing key has not been measured. ceph config-key exists dm-crypt/osd/8b7e6c19-.../luks \ && ceph config-key rm dm-crypt/osd/8b7e6c19-.../luks -# 5.3 Close DB-side LUKS mapping. The cryptsetup arg is the device-mapper name, -# not the LUKS UUID. Enumerate children with TYPE explicit and pick the crypt -# child specifically — robust against future LV-stack shapes (snapshots, -# thin pools) that could produce additional non-crypt children. -# Precheck pattern (no || true): if the mapping is gone, do nothing; if it's -# present and close fails (busy device), the error bubbles up and the state -# machine retries. +# Close DB-side LUKS mapping. Enumerate children with TYPE explicit and pick +# the crypt child specifically — robust against future LV-stack shapes. DB_MAPPING=$(lsblk -nlo NAME,TYPE /dev/ceph-metadata-vg-1/osd-db-bbb... | awk '$2=="crypt"{print $1; exit}') [ -n "$DB_MAPPING" ] && cryptsetup status "$DB_MAPPING" >/dev/null 2>&1 \ && cryptsetup close "$DB_MAPPING" -# 5.4 Free the DB slot. Precheck (no || true): real lvremove failures bubble up -# and the state machine retries. +# Free the DB slot. Real lvremove failures bubble up; state machine retries. 
lvs /dev/ceph-metadata-vg-1/osd-db-bbb... >/dev/null 2>&1 \ && lvremove -f /dev/ceph-metadata-vg-1/osd-db-bbb... -# 5.5 Zap the data LV (also handles the data-side dm-crypt mapping). -# Precheck mirrors 5.4: skip if the LV no longer exists. Real failures -# (zap returns non-zero with the LV present — partial wipe, busy device) -# bubble up via Job exit and are retried by the state machine. +# Zap the data LV (also handles the data-side dm-crypt mapping). lvs /dev/ceph-data-vg-5/osd-block-aaa... >/dev/null 2>&1 \ && ceph-volume lvm zap /dev/ceph-data-vg-5/osd-block-aaa... --destroy ``` -After Job completes successfully, operator advances record to `phase: prepare-pending` and does Step 6. +After the Job completes, operator advances the record to `phase: Waiting`. -#### Step 6 — Pre-allocate DB LV name and wait for replacement disk (diagram arrow 14) +#### 5. Wait for replacement disk -Operator generates a fresh uuid for the new DB LV and persists it in the record (`pending-db-lv-name`) before Step 7.1's `lvcreate` runs. On retry, the same name is reused — no orphan DB LVs from retries. +The wait is non-blocking (see [Open question: controller placement](#open-question-controller-placement) above). Each reconcile cycle either checks for the disk and yields, or finds it and creates the Prepare Job. Inventory needs a path that doesn't auto-provision — the standard prepare-job spawn for this node would otherwise claim the empty disk with a fresh ID. Two cases: -The operator then waits for the replacement disk to appear on the node. The operator pod has no `/dev` access; the existing prepare-job spawn (which would otherwise inventory the node) is *suppressed* for this node by change #6's `expected-disk-pending` flag — without that suppression, it would auto-provision the new disk with a fresh ID. So inventory needs a path that doesn't provision: +- **`rook-discover` enabled** — operator watches the per-node `local-device-` CM. 
Reconcile triggers on CM update via the hotplug-CM watch. Latency: seconds (rook-discover's udev monitor) up to its `ROOK_DISCOVER_DEVICES_INTERVAL` (default 60 min) for the polling fallback. +- **`rook-discover` disabled** (the operator's default) — the operator yields with a periodic re-check (default 5 min; see U-9), and on each cycle spawns a one-shot `ceph-volume inventory --format json` Job via `cmdreporter`. The Job is read-only — does not auto-provision — so it doesn't conflict with the replacement. -- **If `rook-discover` is enabled:** operator watches the per-node `local-device-` CM. Reconcile is triggered on CM update via the hotplug-CM watch (`controller.go:279`). Latency: seconds (rook-discover's udev monitor) up to its `ROOK_DISCOVER_DEVICES_INTERVAL` (default 60 min) for the polling fallback. -- **If `rook-discover` is disabled** (the operator's default): the operator returns `Result{RequeueAfter: 5m}` from each reconcile while in `prepare-pending` waiting-for-disk, and spawns a one-shot `ceph-volume inventory --format json` Job via the existing `cmdreporter` pattern (same one used for Step 2's older-OSD fallback). The Job runs node-side, writes its output to a result CM, and the operator reads it on the next reconcile. Latency ≈ `RequeueAfter` interval (5m) + Job pod startup. - -The 5-min `RequeueAfter` interval is a working default, not a load-bearing decision — see open question U-9. The wait blocks only this OSD's flow; other OSD reconcile work proceeds normally. - -While waiting, the operator sets `WaitingForReplacementDisk` on the CephCluster status. Default timeout 24h (U-4). On timeout the condition flips to `ReplacementDiskMissing` and polling stops. +While waiting, the operator surfaces a `WaitingForReplacementDisk` condition. Default timeout `spec.diskWaitTimeout` (24h). On timeout the condition flips to `ReplacementDiskMissing` and polling stops. **Recovery from timeout — two paths:** -1. 
**Insert the disk and bump `confirmation`** in the CR. Pre-checks re-run and the wait resumes. `pending-db-lv-name` is preserved across the cycle (Step 7.1's precheck handles the LV being either already-allocated or absent). -2. **Abandon** by removing `spec.storage.replaceOSD`. Per "Handling cancellation", removing the field in this substate is honored: the operator GCs the record; `osd.5` stays `destroyed` in the OSD map; user runs `ceph osd purge 5` manually if they want to remove the slot. +1. **Insert the disk and create a new CR** for the same OSD ID (re-replacement). The new CR enters `Validating` fresh; the failed CR is inert. +2. **Abandon** by deleting the CR. Per [Cancellation](#cancellation), this substate honors cancel: the finalizer cleans up; `osd.5` stays `destroyed`; user runs `ceph osd purge 5` manually if they want to remove the slot. -#### Step 7 — Prepare Job (diagram arrows 15-17) +#### 6. Prepare -Phase `prepare-pending`. The Job receives the record (including `pending-db-lv-name`) as env vars. +Phase `Preparing` (entered when the replacement disk is visible). The operator generates a fresh UUID for the new DB LV and bakes it into the Job spec as an env var (same pattern as `ROOK_REPLACE_OSD` in `provision_spec.go:317,322`); pod retries within the Job's backoff reuse the same env. The Job performs: ```bash -# 7.1 Pre-allocate the DB LV using the persisted name. Idempotent on retry — -# if the LV already exists from a previous attempt, lvcreate is skipped. +# Pre-allocate the DB LV using the persisted name. Idempotent on retry — +# if the LV already exists from a previous attempt, lvcreate is skipped. lvs /dev/ceph-metadata-vg-1/osd-db-12cf3a91-... >/dev/null 2>&1 \ || lvcreate -L 4096M -n osd-db-12cf3a91-... ceph-metadata-vg-1 --wipesignatures y -# 7.2 Provision the new OSD with the preserved ID. -# --dmcrypt is conditional on the record's `encrypted` field; -# omitted for unencrypted OSDs. +# Provision the new OSD with the preserved ID. 
+# --dmcrypt is conditional on the record's `encrypted` field; +# omitted for unencrypted OSDs. ceph-volume lvm prepare \ --bluestore [--dmcrypt] \ --osd-id 5 \ @@ -341,113 +336,51 @@ ceph-volume lvm prepare \ --crush-device-class hdd ``` -(The uuid in `osd-db-12cf3a91-...` is the operator-generated uuid from Step 6, not the OSD's fsid. ceph-volume assigns its own fsid during prepare and writes `ceph.osd_fsid` / `ceph.db_uuid` LV tags.) - -Prepare writes the new OSD's layout (data path, DB path, metadata source device) to the per-node status CM that Rook already uses to drive daemon creation. After the Job succeeds, operator advances to `phase: deployment-pending`. +The UUID in `osd-db-12cf3a91-...` is the operator-generated UUID baked into the Prepare Job spec, not the OSD's fsid. ceph-volume assigns its own fsid during prepare and writes `ceph.osd_fsid` / `ceph.db_uuid` LV tags. -#### Step 8 — Operator creates the new OSD deployment (diagram arrows 18-20) +Prepare writes the new OSD's info to the per-node prepare-job status CM that Rook already uses to drive daemon creation. The phase stays `Preparing` while the operator creates the Deployment and waits for the new daemon to become Ready in Ceph. +#### 7. 
Activate -#### Step 9 — Mark replacement complete (diagram arrow 21) +Reuses the existing reconcile path: `createOSDsForStatusMap` ([`status.go#L324`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/cluster/osd/status.go#L324)) sees the per-node status CM the Prepare Job wrote and creates the daemon Deployment from it. The new deployment carries `ROOK_METADATA_DEVICE` and `ROOK_METADATA_SOURCE_DEVICE` directly — no fallback Job needed for any future replacement of this OSD. -Operator polls `ceph osd metadata `. Ready = a record returned with a non-empty fsid, `id` matching, and `hostname` matching the record's `node`. This single check covers both the up-in-Ceph signal and the new-fsid capture; `ceph osd metadata` is the source of truth, not K8s readiness-probe semantics. +#### 8. Complete -On Ready, the operator transitions the replacement CM entry from `phase: deployment-pending` to `phase: completed` and records `confirmation`, `new-fsid`, and `completed-at`. The entry is kept (not deleted) so the next reconcile sees the consumed trigger and short-circuits via pre-check #1. Same UX as `spec.storage.migration` today: the operator never mutates `spec.replaceOSD`; the user clears the field manually when they want to move on. +Like the disk wait, the wait for the new daemon to join Ceph is non-blocking. Each reconcile cycle calls `ceph osd metadata `. Ready = a record returned with a non-empty fsid, `id` matching, and `hostname` matching the record's `node`. This single check covers both the up-in-Ceph signal and the new-fsid capture. -If the new OSD does not reach Ready, the record stays in its in-progress phase and the next reconcile resumes from there. +On Ready, the operator transitions to `phase: Completed` and records `newFsid` and `completedAt`. The record persists for audit; the user keeps or deletes the CR at their leisure (the next reconcile short-circuits on the terminal phase). 
-### Idempotency / resume table +### Cancellation -| Phase on disk | Recovery on next reconcile | -| ----------------------------- | ------------------------------------------------------------------------------------------------------------------------- | -| no record | Validation re-evaluated. No destructive action taken yet. | -| `destroy-pending`, no Destroy Job exists | Operator re-issues the deployment delete (idempotent), waits for pod termination, creates the Destroy Job. | -| `destroy-pending`, Destroy Job in flight | Operator awaits Job; on retry, recreates the Job. All commands in Step 5 are idempotent via precheck patterns. | -| `prepare-pending`, no Prepare Job yet | Operator polls for replacement disk; once visible, creates Prepare Job. Same `pending-db-lv-name` reused — no new orphan. | -| `prepare-pending`, Prepare Job in flight | Operator awaits Job; on retry, recreates it. `lvcreate` skipped if LV exists (7.1 precheck). `lvm prepare --osd-id` reuses the destroyed slot. | -| `deployment-pending` | Existing per-node OSD-status reconcile creates the deployment. | -| `completed` (with consumed `confirmation` + `new-fsid`) | Flow done. Pre-check #1 (trigger already consumed) short-circuits subsequent reconciles until spec moves on; entry then GC'd per "Long-term state cleanup". | - -> **⚠️ Destroy is irreversible.** Once pre-checks pass and the operator persists the record (Step 3), `osd.5` will be destroyed on this reconcile cycle. There is no "are you sure?" preview surfacing the captured layout. If the user typed the wrong OSD ID, the wrong OSD is gone — recovery is via the cancellation table below, not by retracting the trigger. - -### Long-term state cleanup - -GC runs first on every reconcile cycle (see "Reconcile order" in "ConfigMaps and phase state"). It only acts on entries that are **not** in an in-progress phase — for in-progress entries, the cancellation table below governs; GC does not touch them. 
This precedence prevents the user-changes-`replaceOSD.id`-mid-flight failure mode where mid-flight osd.5 would be destroyed, then GC'd, and stuck `destroyed` with no replacement. - -GC rules: - -| Spec state | Entry phase | Action | -|---|---|---| -| `replaceOSD` unset | `completed` | GC the completed entry. | -| `replaceOSD` unset | `prepare-pending` (waiting-for-disk only) | GC per cancellation table below. | -| `replaceOSD` unset | any other in-progress phase | No action; cancellation table governs. | -| `replaceOSD` set; `(id, confirmation)` differ from a `completed` entry | `completed` | GC; treat spec as fresh trigger. | -| `replaceOSD` set; mismatch on in-progress entry | any in-progress phase | No action; in-flight flow runs to completion, then GC fires next cycle. | - -If the user changes `spec.replaceOSD.id` from 5 to 7 mid-flight: osd.5's flow runs to `completed`; on the next reconcile, GC removes the entry (its `osd-id=5` ≠ `spec.id=7`); pre-checks then run for osd.7. The spec change is effectively queued. - -### Handling cancellation - -`replaceOSD` is a mutable spec field. Removing it (or changing its ID mid-flight) is a "cancel" intent. The operator's response depends on phase: +Cancel = delete the `CephOSDReplace` CR. A finalizer runs the operator's per-phase response: | Phase | Cancel honored? | Effect | |---|---|---| -| Pre-Step 3 (validation in flight, no record persisted yet) | Yes | Operator detects field is gone on next reconcile and stops. No state to unwind. | -| `destroy-pending` (Destroy Job in flight or about to start) | No | State record drives the flow forward. Destroy is short-lived; cancel is a no-op. | -| `prepare-pending`, waiting-for-disk (destroy complete; only `pending-db-lv-name` reserved, no `lvcreate` yet) | Yes | Operator GCs the record. `osd.5` stays `destroyed`; user runs `ceph osd purge 5` to remove the slot. **No orphan LV** (Step 6 only reserves the *name*; `lvcreate` runs in Step 7.1). 
**ID-preserving retry of osd.5 is unavailable after this cancel** — the original Deployment is gone (Step 4) and data + DB LVs are wiped (Step 5), so a future `replaceOSD: {id: 5}` trigger has no layout to capture in Step 2 and aborts with `LayoutCaptureFailed`. To re-add an OSD here, the user accepts a fresh ID. | -| `prepare-pending`, Prepare-Job-running (`lvcreate` may have run; `ceph-volume lvm prepare` may have started LUKS-formatting) | Only on Job failure | `ceph-volume lvm prepare` cannot be safely interrupted mid-call (partial dm-crypt + half-LUKS LV). Operator records the cancel intent and acts at Job exit: **on Job failure**, GC the record; the partially-allocated DB LV is left as a named orphan (`pending-db-lv-name`, easy to `lvremove`); osd.5 stays `destroyed`. **On Job success**, cancel is **not** honored — the new OSD is provisioned and joins the cluster. Removing the just-provisioned OSD is an `out`+`purge` workflow, not a rollback of this flow. | -| `deployment-pending` or `completed` | No | New OSD is already provisioned. The failed disk is replaced; cancel makes no sense. | +| `Pending` | Yes | Finalizer removes the CR. Nothing has happened. | +| `Validating` | Yes | Finalizer removes the CR. Nothing destructive has happened. (If `autoOut: true` was set and the operator marked the OSD `out`, the OSD stays `out` — user marks `in` manually if they want to recover the cluster's original layout.) | +| `Destroying` | No | State record drives the flow forward. Destroy is short-lived; cancel is a no-op. | +| `Waiting` (destroy complete; no Prepare Job yet) | Yes | Finalizer removes the CR. `osd.5` stays `destroyed`; user runs `ceph osd purge 5` to remove the slot. **No orphan LV**. **ID-preserving retry of osd.5 is unavailable after this cancel** — the original Deployment is gone (Destroy step) and data + DB LVs are wiped, so a future `CephOSDReplace` for the same ID has no OSD info to capture and aborts with `OSDInfoCaptureFailed`. 
To re-add an OSD here, the user accepts a fresh ID. | +| `Preparing`, Prepare Job running (`lvcreate` may have run; `ceph-volume lvm prepare` may have started LUKS-formatting) | Only on Job failure | `ceph-volume lvm prepare` cannot be safely interrupted mid-call (partial dm-crypt + half-LUKS LV). Operator records the cancel intent and acts at Job exit: **on Job failure**, finalizer removes the CR; the partially-allocated DB LV is left as a named orphan (UUID is in the failed Job's env); osd.5 stays `destroyed`. **On Job success**, cancel is **not** honored — the new OSD is provisioned and joins the cluster. Removing the just-provisioned OSD is an `out`+`purge` workflow, not a rollback of this flow. | +| `Preparing` (post-Job, awaiting daemon Ready) or `Completed` | No | New OSD is already provisioned. Cancel makes no sense. | ## Required code changes -Six changes. Items 1–3 and 5 are independent bug fixes worth landing regardless of this design; 4 and 6 are the new replacement flow. - -| # | Fix | File / lines | -|---|----|----| -| 1 | Inventory must include shared metadata disks. See "Change #1 details" below. | [disk.go#L97-L111](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/clusterd/disk.go#L97-L111) | -| 2 | Populate `OSDInfo.MetadataPath` and a new `OSDInfo.MetadataDevice` field for LVM-mode OSDs. The data is in `ceph-volume lvm list --format json`'s `[db]` section (LV `path` and source `devices`); the parser today walks only `[block]` entries. Forward-compat across rolling upgrade: `OSDInfo` uses standard `encoding/json` struct tags ([`osd.go#L113-L136`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/cluster/osd/osd.go#L113-L136)) with no `DisallowUnknownFields` policy. An old operator decoding a new prepare-job's status CM silently drops the new field. A new operator decoding an old CM gets the zero value (empty string), and Step 2's `lvm list` fallback handles that case. 
| [volume.go#L1104-L1177](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/volume.go#L1104-L1177) | -| 3 | `DestroyOSD` cleans up the DB LV and the dm-crypt config-key. Add `ceph config-key exists+rm`, `cryptsetup close `, and `lvremove -f ` (gated on `osdInfo.MetadataPath != ""`). Use precheck patterns (see Step 5) so genuine failures bubble up while already-clean state is tolerated. | [remove.go#L244-L292](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/remove.go#L244-L292) | -| 4 | Wire LVM-mode replacement through `lvm prepare --osd-id`. Today only raw mode at [volume.go#L548-L555](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/volume.go#L548-L555) adds `--osd-id`. When `a.replaceOSD != nil` *and* a metadata device is set, pre-allocate the DB LV with `lvcreate` (using the operator-persisted name) and call `lvm prepare --osd-id` instead of `lvm batch`. **Why this primitive over `purge` + `lvm batch`:** `lvm prepare --osd-id` claims a destroyed slot atomically (race-safe; no implicit reuse via mon's lowest-free allocation policy) and matches the existing same-device replacement flow at [volume.go#L548-L555](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/volume.go#L548-L555). Alternatives tested and discussed in [`osd-id-reuse-analysis.md`](osd-id-reuse-analysis.md) (U-7). | [volume.go](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/volume.go) | -| 5 | Pass `OSDInfo.MetadataDevice` to the OSD daemon deployment as a new `ROOK_METADATA_SOURCE_DEVICE` env var. Future destroys read the metadata layout from the deployment without a node-side rescan. 
| [spec.go#L950-L1010](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/cluster/osd/spec.go#L950-L1010) | -| 6 | New `osd-replacement-state` ConfigMap, Destroy/Prepare Job split, reconcile-order pinning. See "Change #6 details" below. | [migrate.go](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/cluster/osd/migrate.go), [osd.go](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/cluster/osd/osd.go) | - -#### Change #1 details — inventory bypass for Ceph-tagged shared metadata disks - -Today [`disk.go#L97-L111`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/clusterd/disk.go#L97-L111) calls `sys.ListDevicesChild` (lsblk-based) and skips any disk where `len(children) > 1`. Once any OSD's DB LV lands on a shared metadata disk, this filter incorrectly excludes the disk from inventory, so future OSDs can't get DB LV slots on it. - -**Algorithm:** when `len(children) > 1`, run `lvs --noheadings -o lv_name,vg_name,lv_tags` filtered to those children and check for the `ceph.cluster_fsid=` LV tag — the same authoritative signal Rook already uses elsewhere ([volume.go#L85-L90](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/volume.go#L85-L90), [volume.go#L1130-L1135](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/volume.go#L1130-L1135)). If any child carries this cluster's FSID, treat the disk as available. - -**Edge cases:** - -- Mixed Ceph-tagged and untagged children → still bypass. Ceph LVs identify the disk as ours; untagged LVs would block c-v from using their PE anyway — no Rook-side issue. -- `lvs` returns error / EBUSY → conservative: fall back to today's skip behavior, log. -- No tagged children, `len > 1` → skip (today's behavior; foreign LVM or partition table). 
- -VG/LV name patterns are convention, not guarantee; tags are. Cost: one `lvs` per filtered disk (only when `len > 1`); Rook already shells out to `lvs` / `pvs` during inventory. - -#### Change #6 details — replacement state machine and reconcile-order pinning - -Adds a new ConfigMap `osd-replacement-state` (separate from the existing `osd-migration-config` to avoid breaking that flow's int-keyed reader). Schema, lifecycle, and phase state machine: see "ConfigMaps and phase state" above. Persists `pending-db-lv-name` so Prepare Job retry doesn't orphan DB LVs. Splits the single prepare-job model into a Destroy Job + Prepare Job (motivated by the disk-swap wait — see "Proposed flow" intro). - -**Reconcile-order pinning.** The OSD reconcile entry-point [`Cluster.Start`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/cluster/osd/osd.go#L255) gets a new first-step subroutine that runs **before** the existing `updateAndCreateOSDs` path. The subroutine reads `spec.replaceOSD` and the replacement CM, runs GC → drives in-flight entries → validates new spec. Only then does `updateAndCreateOSDs` run. +Six changes. Items 1–3 and 5 are independent bug fixes that map 1:1 to the five gaps in [Current gaps](#current-gaps) and are worth landing regardless of this design. Item 4 wires LVM-mode replacement; item 6 is the new orchestration. Implementation-level details (algorithms, edge cases, pattern reuse) are tracked separately from the design. -The `expected-disk-pending: true` flag on the record (set while phase=`prepare-pending`) is the wire that prevents the auto-provisioning race: in `updateAndCreateOSDs`, the prepare-job spawn for a node with an `expected-disk-pending` entry is skipped — the empty replacement disk on that node is held off until the replacement Prepare Job claims it. 
- -**Implementation pattern reuse:** - -- Replacement CM lifecycle: same shape as `osd-migration-config` ([`migrate.go#L42-L44`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/cluster/osd/migrate.go) — per-cluster, owner-ref'd to CephCluster, written via `k8sutil.CreateOrUpdateConfigMap`). -- Destroy Job pod profile: clones `c.provisionPodTemplateSpec` ([`provision_spec.go`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/cluster/osd/provision_spec.go)) — same image, mounts, NodeSelector, RBAC. -- Conditions: set via `opcontroller.UpdateCondition` ([`conditions.go#L35`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/controller/conditions.go#L35)). -- Bounded waits: `util.RetryWithTimeout` ([`retry.go#L57`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/util/retry.go#L57)) — already used by OSD migration. -- Pod-deletion wait: `k8sutil.DeleteDeployment` ([`deployment.go#L388`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/k8sutil/deployment.go#L388)). -- Device-name validation (raw kernel-name rejection): extend `c.validateOSDSettings` ([`osd.go#L189`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/cluster/osd/osd.go#L189)). +| # | Fix | File / lines | +|---|---|---| +| 1 | Inventory must include shared metadata disks. The current `len(children) > 1` filter excludes any disk hosting a Ceph LV. Algorithm: when the filter would trigger, check children for the `ceph.cluster_fsid=` LV tag — same authoritative signal Rook already uses elsewhere — and bypass the filter if any child carries this cluster's FSID. 
| [disk.go#L97-L111](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/clusterd/disk.go#L97-L111) | +| 2 | Populate `OSDInfo.MetadataPath` and a new `OSDInfo.MetadataDevice` field for LVM-mode OSDs. The data is in `ceph-volume lvm list --format json`'s `[db]` section (LV `path` and source `devices`); the parser today walks only `[block]` entries. Forward-compat across rolling upgrade: standard `encoding/json` decode with no `DisallowUnknownFields` policy — old operator silently drops the new field; new operator decoding an old CM gets the zero value (which the Capture fallback handles). | [volume.go#L1104-L1177](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/volume.go#L1104-L1177) | +| 3 | `DestroyOSD` cleans up the DB LV and the dm-crypt config-key. Add `ceph config-key exists+rm`, `cryptsetup close `, and `lvremove -f ` (gated on `osdInfo.MetadataPath != ""`). Use the precheck patterns from the Destroy step so genuine failures bubble up while already-clean state is tolerated. | [remove.go#L244-L292](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/remove.go#L244-L292) | +| 4 | Wire LVM-mode replacement through `lvm prepare --osd-id`. Today only raw mode at [volume.go#L548-L555](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/volume.go#L548-L555) adds `--osd-id`. When `a.replaceOSD != nil` and a metadata device is set, pre-allocate the DB LV with `lvcreate` (using the operator-persisted name) and call `lvm prepare --osd-id` instead of `lvm batch`. `lvm prepare --osd-id` claims a destroyed slot atomically (race-safe; no implicit reuse via mon's lowest-free allocation policy) and matches the existing same-device replacement flow. 
| [volume.go](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/volume.go) | +| 5 | Pass `OSDInfo.MetadataDevice` to the OSD daemon deployment as a new `ROOK_METADATA_SOURCE_DEVICE` env var. Future destroys read the metadata info from the deployment without a node-side rescan. | [spec.go#L950-L1010](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/cluster/osd/spec.go#L950-L1010) | +| 6 | Orchestration — `CephOSDReplace` CRD definition + dedicated controller implementing the state machine (Pending queue, Validating with confirmation/up+in/safe-to-destroy checks, Destroy/Prepare Job split, non-blocking disk wait). Cluster-side coupling: skip auto-provisioning on nodes with an active replacement; surface an `OSDReplacementInProgress` condition on `CephCluster.status`. | new package `pkg/operator/ceph/cluster/osd/replace/` | ## Out of scope ### Multiple metadata devices on one node — works conditionally -Rook supports per-device metadata-device pairing (`Documentation/CRDs/Cluster/ceph-cluster-crd.md:393-394`): +Rook supports per-device metadata-device pairing: ```yaml nodes: @@ -461,12 +394,12 @@ nodes: config: { metadataDevice: "nvme1n1" } # different metadata device on the same node ``` -This layout requires exact `name:` references — the per-device `config:` block can only be attached to a specific device entry, not to a regex match. Replacement of a single OSD on this layout works structurally (each OSD's `metadata-source-device` is captured in its record at destroy time), with two caveats: +This setup requires exact `name:` (or `fullpath:`) references — the per-device `config:` block can only be attached to a specific device entry, not to a regex match. 
Replacement of a single OSD on this setup works structurally (each OSD's `metadata-source-device` is captured in its record at destroy time), with two caveats: -- **Validation policy must permit exact `name:`** entries — pre-check #5's `accept-by-path` or `lenient` mode (U-10). The default `strict` mode rejects this layout from the flow. -- **Same-slot replacement is required** — `by-path` resolves only when the new disk is in the original slot. Different-slot replacement stalls at Step 6. +- **Validation policy must permit exact entries** — see U-10. The default `strict` mode rejects this setup from the flow. +- **Same-slot replacement is required** — `by-path` resolves only when the new disk is in the original slot. Different-slot replacement stalls in the Wait step. -The broader multi-metadata-device feature work (improvements to per-device `metadataDevice` UX, multi-NVMe-per-node layouts) was scoped separately by maintainers in [#13240](https://github.com/rook/rook/issues/13240) (tracked by `zhucan`). This design does not add new logic for that layout — it just doesn't actively forbid replacement on it under permissive validation. +The broader multi-metadata-device feature work (improvements to per-device `metadataDevice` UX, multi-NVMe-per-node setups) was scoped separately by maintainers in [#13240](https://github.com/rook/rook/issues/13240) (tracked by `zhucan`). This design does not add new logic for that setup — it just doesn't actively forbid replacement on it under permissive validation. ### PVC-based OSD replacement — separate design @@ -474,62 +407,32 @@ PVC-backed OSDs use a different code path (raw mode via `GetCephVolumeRawOSDs`, ### Permanently-down host — different workflow -If the OSD's host is gone, this flow cannot proceed (Step 4 / Step 5 require the host). Existing Rook node-decommission + OSD-purge flow handles it. +If the OSD's host is gone, this flow cannot proceed (the Destroy step requires the host). 
Existing Rook node-decommission + OSD-purge flow handles it. ## Open questions -- **U-1 — Trigger surface.** *Decision: CR field `spec.storage.replaceOSD: {id, confirmation}`.* Matches the existing `spec.storage.migration` precedent (same UX — user clears the field manually after success), keeps trigger state in the CR, no separate object lifecycle. Alternatives below are listed for review discussion. - - Annotation on the OSD deployment (`rook-ceph.io/replace-osd: ""`). Operator removes the annotation on success — no spec mutation. Rejected: the OSD's deployment is *deleted* in Step 4 before destroy, so an annotation on it disappears mid-flight (state would have to migrate to the state-record CM anyway). - - New `CephOSDReplacement` CRD, one short-lived resource per intent, deleted on success. Rejected: new CRD is a heavier API surface for a feature that already fits the existing `spec.storage` shape; consistency with `spec.storage.migration` is more valuable than per-intent isolation. - -- **U-2 — Parallelism.** Issue #13240 names multi-disk-failure on a chassis as a real operational pain — replacing 4 disks means 4 sequential edits, each blocking on disk-swap wait + reconcile cadence (potentially hours per OSD). This design stays serial because `safe-to-destroy` and `lvm prepare --osd-id` are both naturally one-at-a-time. Two follow-up paths that don't break the serial-execution invariant: - - (a) **Widen the trigger surface to a list** (`replaceOSDs: [{id, confirmation}, ...]`) so the user records all intents upfront and the operator processes them in sequence without per-OSD CR edits. Cheap; removes most of the user-visible pain. - - (b) **N-per-reconcile execution** via N parallel Destroy/Prepare Jobs each running `lvm prepare --osd-id `, gated on cluster health and per-OSD `safe-to-destroy`. Bigger; needs careful PG-safety rules. 
The obvious-looking `lvm batch --osd-ids X Y Z --prepare` primitive does **not** work for shared-metadata setups (rejects the metadata VG outright — see U-7 / [`osd-id-reuse-analysis.md`](osd-id-reuse-analysis.md) Path E); (b) must use N parallel `lvm prepare --osd-id` invocations, not a single `lvm batch` call. - -- **U-3 — Auto-replace mode.** This design requires explicit user input. A follow-up could add an opt-in "auto-replace on disk swap" mode (e.g. `spec.storage.autoReplaceOSDs: true`): when an OSD is `down_in` and a new empty disk appears in its CR-managed slot, the operator runs the same flow without an explicit trigger. Extra checks (cluster health, PG state) would gate it. Deferred. - -- **U-4 — Configurability of the two timeouts.** Two distinct timers, different physical phenomena, different SLAs: - - `replacement-disk` wait (default 24h): time to swap a failed disk; covers walk-away-and-handle-tomorrow workflows. - - `safe-to-destroy` retry timeout (default 1h): time backfill is allowed to take after the user triggers replacement on a still-recovering OSD. 24h here would mask a stuck backfill and warrants paging. - Per-replacement override (`spec.storage.replaceOSD.timeoutSeconds`) handles different chassis-swap SLAs but adds API surface; operator-global only is simpler. - -- **U-5 — Faster wake on disk-swap.** With `rook-discover` enabled, latency floor is its udev-event delivery (seconds) up to `ROOK_DISCOVER_DEVICES_INTERVAL` (60 min). Without it, the wait re-checks every U-9 interval. Optional follow-up: treat udev "new disk" events on the node as reconcile triggers while a replacement is in progress — push from `rook-discover`, or a small sidecar deployed only while waiting. Optimization, not a correctness gap. 
- -- **U-6 — State-store choice for the replacement record.** *Decision: ConfigMap `osd-replacement-state`.* Matches Rook's existing OSD-orchestration pattern (per-node status CMs, `osd-migration-config`); clean object lifecycle (create/delete); doesn't pollute CR `.status` with transient state-machine state (no precedent for that in Rook). Alternatives below are listed for review discussion. - - CR `.status.replaceOSD`. Pros: visible via `kubectl get cephcluster -o yaml`; integrates natively with Conditions; one fewer object lifecycle. Cons: mixing transient operational state with the CR's status block is unidiomatic in Rook; status updates may race with other status writers; no precedent in the codebase for state-machine state in `.status`. Conditions can still be used independently of where the record lives. - - Annotation on the CephCluster CR. Pros: simple, visible. Cons: limited to ~256KB total annotation size on the resource (shared with other consumers); awkward to update structurally; no precedent for state machines in annotations. - -- **U-7 — Approach for OSD-ID preservation: `destroy + prepare --osd-id` vs alternatives.** Full comparison in [`osd-id-reuse-analysis.md`](osd-id-reuse-analysis.md). Summary: - - - **Chosen:** `ceph osd destroy` + `lvm prepare --osd-id` + operator pre-allocates DB LV. This is [Ceph's documented replacement procedure](https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/#replacing-an-osd) — the design just orchestrates it across pods. Verified end-to-end on Ceph v19.2.2 (`osd-rep-log.md`). Race-safe: the destroyed slot can't be claimed by another OSD between destroy and prepare. Matches Rook's existing same-device replacement flow ([volume.go#L548-L555](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/volume.go#L548-L555)). 
- - **Alternative — `purge` + `lvm prepare` without `--osd-id`** (used by SAP runbook + GH issue #13240 comment 3193842038): the slot is freed by `purge`, mon allocates lowest free which happens to be the just-freed ID. Implicit reuse; depends on mon allocation policy and a non-racy purge-prepare window. - - **Alternative — `purge` + `lvm batch --prepare`** (suggested by maintainers in #13240): same implicit-ID-reuse mechanism as above, with ceph-volume handling DB-LV allocation. Verified on Ceph v19.2.2 — works (`osd-id-reuse-analysis.md` Path C); shares the implicit-reuse race window with B. - - **Alternative — `destroy` + `lvm batch --osd-ids X --prepare`** (`lvm batch` does have `--osd-ids` plural): verified on Ceph v19.2.2 — does **not** work with shared metadata devices. ceph-volume's `--osd-ids` path rejects metadata VGs with existing free PE space ("1 fast devices were passed, but none are available"). Eliminated. - -- **U-9 — Wait-for-disk re-check pattern (interval and trigger).** Step 6's wait without `rook-discover` runs an inventory Job on each reconcile and re-queues. Two related questions: - - **Interval.** Default 5 min is a working starting point. Lower (1 min) cuts user-visible latency at the cost of more inventory-Job pods. Higher (15 min) is gentler. Likely cluster-config-tunable rather than hardcoded. - - **Self-requeue vs. user re-trigger.** Alternative: instead of `Result{RequeueAfter: ...}`, the operator could emit a `WaitingForReplacementDisk` event and require the user to bump `confirmation` once they've swapped the disk to nudge the next reconcile. Pro: no operator-side polling, no inventory-Job spam. Con: breaks the "set replaceOSD and walk away" UX from the user story; user has to come back. - - Proposed: `RequeueAfter` with a configurable interval (default 5 min). Decide during PR review. 
- -- **U-10 — Device-matching validation policy for replacement.** Pre-check #5's policy is pluggable; the three policies (`strict`, `accept-by-path`, `lenient`) are defined in Step 1. Two questions for PR review: +### Decided (rationale captured for review) - 1. **What should the default be?** - - `strict` is safest (pre-destroy rejection, clean UX), but adds adoption friction: existing kernel-name CRs and per-device `metadataDevice` layouts are blocked from this flow until migrated. [Ceph upstream](https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/#replacing-an-osd) is laxer (uses raw `/dev/sdX` in examples); Rook generally permits exact `name:` entries, so strict here is more conservative than the rest of the ecosystem. - - `accept-by-path` is the moderate middle. Kernel names rejected; by-path users can use the flow with same-slot discipline. - - `lenient` maximizes compatibility but defers diagnosis 24h into the flow (post-destroy stall). Recoverable but bad UX. +- **U-1 — Trigger surface.** Decision: dedicated `CephOSDReplace` CRD with `spec` carrying trigger fields (see [State](#state)). Cluster-CR fallback shape: `spec.storage.replaceOSD: {id, confirmation, ...}` mirroring `spec.storage.migration`. +- **U-6 — State-record substrate.** Resolved by [Open question: controller placement](#open-question-controller-placement) — `CephOSDReplace.status`. +- **U-7 — OSD-ID preservation primitive.** Decision: `ceph osd destroy` + `lvm prepare --osd-id` + operator pre-allocates DB LV. This is [Ceph's documented replacement procedure](https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/#replacing-an-osd) — the design just orchestrates it across pods. Verified end-to-end on Ceph v19.2.2 (`osd-rep-log.md`). Race-safe: the destroyed slot can't be claimed by another OSD between destroy and prepare. 
Matches Rook's existing same-device replacement flow ([volume.go#L548-L555](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/volume.go#L548-L555)). Alternatives compared in `osd-id-reuse-analysis.md`. - 2. **Should the policy be configurable, and at what scope?** - - **Operator-global** (env var like `ROOK_OSD_REPLACEMENT_DEVICE_VALIDATION`) — one knob per cluster, easy to set; doesn't accommodate mixed CR shapes within a cluster. - - **Per-replacement** (`spec.storage.replaceOSD.deviceMatchingMode: `) — most flexible, additional CR API surface. - - **Hard-coded** with no override — simplest, no API surface, but no escape valve for users hitting the per-device-config tension or slot-stable hardware. +### Open for PR-review decision - - **Helper (orthogonal):** a one-shot tool that rewrites a user's `storage` spec to `useAllDevices` or `deviceFilter` would reduce strict-policy friction regardless of the default. +- **U-O — Controller placement.** [Open question: controller placement](#open-question-controller-placement) lays out the trade-offs. The doc leans toward the CRD option but does not prescribe. Maintainers' call. +- **U-2 — Parallelism.** Issue #13240 names multi-disk-failure on a chassis as a real operational pain — replacing N disks means N sequential CR edits, each blocking on disk-swap wait + reconcile cadence. This design stays serial for operational simplicity, not correctness — per-OSD `safe-to-destroy`'s drained semantic makes concurrent destroys safe. Two follow-up paths: (a) widen the trigger surface to a list (`replaceOSDs: [{id, confirmation}, …]`); (b) N-per-reconcile execution via N parallel Destroy/Prepare Jobs each running `lvm prepare --osd-id `. An `lvm batch --osd-ids X Y Z --prepare` invocation does **not** work for shared-metadata setups (rejects the metadata VG outright); (b) must use N parallel `lvm prepare --osd-id` invocations, not a single `lvm batch` call. 
+- **U-3 — Auto-replace mode.** Follow-up: opt-in `spec.storage.autoReplaceOSDs: true` to run the same flow without an explicit trigger when an OSD is `down_in` and a new empty disk appears. Extra checks (cluster health, PG state) would gate it. Deferred.
+- **U-4 — Default timeout values.** Per-CR configurability is decided (`spec.safeToDestroyTimeout`, `spec.diskWaitTimeout`; see [State](#state)). The defaults are still open: 1h / 24h is a starting point. The `safeToDestroyTimeout` default fits the failed-disk path (Ceph auto-`out`s and backfills before the user even creates the CR); it needs a longer override when `autoOut: true` is used on a healthy OSD where backfill begins inside the flow.
+- **U-5 — Faster wake on disk-swap.** With `rook-discover` enabled, wake-up latency ranges from its udev-event delivery (seconds) up to `ROOK_DISCOVER_DEVICES_INTERVAL` (60 min). Without it, the wait re-checks every U-9 interval. Optional follow-up: treat udev "new disk" events on the node as reconcile triggers while a replacement is in progress.
+- **U-8 — `getOSDInfo` fallback for old deployments.** Rook regenerates each OSD's Deployment spec on every reconcile, but `getOSDInfo` only recovers what's already in the env. For OSDs deployed before changes #2 and #5, the new env vars are missing → empty values on regenerated deployments. The Capture fallback handles this per replacement. Alternative: push the `lvm list` fallback into `getOSDInfo` itself, so existing deployments get backfilled on operator upgrade. The per-replacement fallback then becomes redundant. Worth doing if change #2's footprint is small.
+- **U-9 — Wait-for-disk re-check pattern.** The default 5 min interval is a working starting point. A lower value (1 min) cuts latency at the cost of more inventory-Job pods. Likely cluster-config-tunable rather than hardcoded.
+- **U-10 — Device-matching validation policy.** Three policies: `strict` (reject any exact entry on the OSD's data device), `accept-by-path` (reject only kernel-name entries), `lenient` (accept anything; mismatch surfaces as Wait-step stall). Defaults and configurability scope (operator-global / per-replacement / hard-coded) are open. - No load-bearing recommendation; flagging for PR-review decision. +## Validation plan Coverage areas this design must validate (detailed scenarios in [`osd-test-scenarios.md`](osd-test-scenarios.md)): -- **Happy path** on shared-metadata layouts: single OSD replaced while siblings stay up, both with and without `encryptedDevice: true`; multiple metadata devices on the same node (per-device config); same-device (raw-mode) regression. +- **Happy path** on shared-metadata setups: single OSD replaced while siblings stay up, with and without `encryptedDevice: true`; multiple metadata devices on the same node (per-device config); same-device (raw-mode) regression. - **Required-change validation**: new OSD deployment carries non-empty `ROOK_METADATA_DEVICE` / `ROOK_METADATA_SOURCE_DEVICE`; metadata VG with healthy siblings is now visible to inventory. - **Crash recovery**: Destroy Job and Prepare Job killed mid-run; state-record-driven retry produces no orphan DB LVs across N retries. - **Validation gates**: trigger after auto-provisioning is rejected; raw kernel-name device addressing is rejected before any destructive action. 
From 93b62ff0571119863b729fa5613b36174649bd54 Mon Sep 17 00:00:00 2001 From: Artem Torubarov Date: Tue, 5 May 2026 21:24:10 +0200 Subject: [PATCH 03/12] docs: cleanup Signed-off-by: Artem Torubarov --- osd-design.md | 202 +++++++++++++++++--------------------------------- 1 file changed, 67 insertions(+), 135 deletions(-) diff --git a/osd-design.md b/osd-design.md index b9c7452f9bca..4ce423d4c129 100644 --- a/osd-design.md +++ b/osd-design.md @@ -124,11 +124,11 @@ Concrete shape of each candidate: - **Extend the cluster controller** — state in either a side ConfigMap (`osd-replacement-state`, mirroring `osd-migration-config`) or `CephCluster.status`. Same UX as `spec.storage.migration`. - **New `CephOSDReplace` CRD + dedicated controller** — state on `.status`. Independent reconcile goroutine; never touches the existing OSD path. Light coupling on the cluster side: skip auto-provisioning on affected nodes; surface an `OSDReplacementInProgress` condition. -The rest of this design proposes the CRD shape concretely. Using `spec.storage.replaceOSD` on CephCluster instead is a fallback that reuses the same state machine and step semantics — implications flagged inline where they differ. +The rest of this design is based on a separate `CephOSDReplace` CRD, with implications for the cluster-CR fallback flagged inline. ### State -State lives on `CephOSDReplace.spec` (immutable post-create) and `.status` (operator-updated). Status surfacing follows the standard Kubernetes conditions pattern. Field details and sources inline as YAML comments below. For raw-mode OSDs, `dbLV`, `metadataSourceDevice`, `metadataVG`, and `databaseSizeMB` are absent and `dataLV` is the raw device path; the Destroy step skips DB cleanup and the Prepare step omits `--block.db`. +State lives on `CephOSDReplace.spec` and `.status`. `spec.cephCluster` and `spec.osdId` are immutable post-create. `.status` carries phase and conditions following the K8s operator pattern. 
```yaml apiVersion: ceph.rook.io/v1 @@ -136,9 +136,9 @@ kind: CephOSDReplace metadata: name: replace-osd-5 namespace: rook-ceph -spec: # immutable post-create; re-replace = create a new CR - cephCluster: my-cluster # target cluster in this namespace - osdId: 5 +spec: + cephCluster: my-cluster # immutable; target cluster in this namespace + osdId: 5 # immutable confirmation: yes-really-replace-osd-5 # must equal "yes-really-replace-osd-{osdId}"; typo guard against destroying the wrong OSD autoOut: false # optional; if true, operator marks healthy OSD `out` automatically. Default: false (fail-fast on up+in) safeToDestroyTimeout: 1h # optional; how long Validating tolerates EBUSY before Failed. Default: 1h @@ -159,7 +159,7 @@ status: node: node-1 # OSD deployment NodeSelector; survives the deployment delete dataLV: /dev/ceph-data-vg-5/osd-block-aaa... # OSD deployment env ROOK_BLOCK_PATH dbLV: /dev/ceph-metadata-vg-1/osd-db-bbb... # OSD deployment env ROOK_METADATA_DEVICE; absent for raw-mode OSDs - metadataSourceDevice: nvme0n1 # OSD deployment env ROOK_METADATA_SOURCE_DEVICE; absent for raw-mode OSDs; from `ceph-volume lvm list` if env missing (see Capture step) + metadataSourceDevice: nvme0n1 # OSD deployment env ROOK_METADATA_SOURCE_DEVICE; absent for raw-mode OSDs metadataVG: ceph-metadata-vg-1 # from `pvs --noheadings -o vg_name ` crushDeviceClass: hdd # OSD deployment env ROOK_OSD_CRUSH_DEVICE_CLASS databaseSizeMB: 4096 # from `lvs --noheadings -o lv_size ` ÷ 1MiB @@ -171,17 +171,17 @@ status: completedAt: null ``` -`metadataSourceDevice` and the env it's read from (`ROOK_METADATA_SOURCE_DEVICE`) are only present on OSDs deployed after required changes #2 and #5 land; older OSDs use the fallback in [Capture](#3-capture). U-8 explores eliminating the fallback at the source. 
- **Cancel and re-replace.** Cancel = delete the CR; a finalizer runs the operator's cleanup (delete partially-allocated DB LV if any; leave the OSD `destroyed` for the user to `ceph osd purge` manually). Re-replacement of the same OSD = create a new CR with a different name. Terminal CRs (`Completed`, `Cancelled`, `Failed`) are inert — keep them as audit trail or delete them; the operator requires neither. #### Coordination Replacements run serially cluster-wide as a simplifying choice, matching cephadm's `osd rm` queue and Rook's existing OSD migration. Per-OSD `safe-to-destroy` only returns OK once the OSD is fully drained from every PG's acting set, so concurrent destroys of independently-safe OSDs are technically safe — but serial keeps the operational model simple. -With a separate `CephOSDReplace` CRD, coordination is explicit: new CRs enter `Pending` and wait for any peer `CephOSDReplace` in the same namespace (targeting the same cluster) to reach a terminal phase. FIFO ordering by `(creationTimestamp, UID)` — the UID tiebreaker handles same-second creations. Race-safe under optimistic concurrency: two CRs that see each other both stay `Pending`; the older wakes first. +The queue is implemented via a `Pending` phase. Each reconcile, the controller lists peer `CephOSDReplace` CRs in the same namespace targeting the same cluster. If no earlier-`creationTimestamp` peer is in a non-terminal phase, this CR advances to `Validating`; otherwise it stays in `Pending` and re-checks next reconcile. UID breaks same-second ties. + +> Extending CephCluster with a `spec.storage.replaceOSD` field needs no coordination logic — a single field admits only one in-flight replacement. -Extending CephCluster with a `spec.storage.replaceOSD` field needs no coordination logic — a single field admits only one in-flight replacement. 
+In both shapes, the cluster controller's auto-provisioning must skip nodes with an active replacement — otherwise the empty replacement disk gets claimed with a fresh ID before the Prepare step can use it. Without an explicit trigger, Rook has no way to tell a replacement disk from a new disk (see [Constraints](#rook-cannot-tell-a-replacement-disk-from-a-new-disk)). #### Phase state machine @@ -192,38 +192,24 @@ Extending CephCluster with a `spec.storage.replaceOSD` field needs no coordinati Failed/Cancelled Failed/Cancelled ``` -(With the cluster-CR fallback, `Pending` is omitted — a single `spec.storage.replaceOSD` field admits only one in-flight replacement.) +> With the cluster-CR fallback, `Pending` is omitted (single field admits one in-flight); state offloads to a side ConfigMap similar to `osd-migration-config`. Per-phase behavior: | Phase | Normal exit | Transient failure (retried) | Terminal exit | |---|---|---|---| -| (no record) | → `Pending` once trigger valid | — | — | -| `Pending` | → `Validating` once no earlier peer is in non-terminal phase | re-checks each reconcile while an earlier peer is in-flight | → `Cancelled` if user deletes the CR | -| `Validating` | → `Destroying` once all checks pass | `safe-to-destroy` returns EBUSY (peers backfilling) — re-checked each reconcile | → `Cancelled` on CR delete; → `Failed` with `reason=InvalidSpec` (target OSD missing, unstable device name, confirmation mismatch) or `reason=OSDStillIn` (up+in target without `autoOut: true`) or `reason=NotSafeToDestroy` (escalation timeout) | +| (no record) | → `Pending` on CR create | — | — | +| `Pending` | → `Validating` once no earlier peer is in flight (one replacement per cluster at a time) | re-checks each reconcile while an earlier peer is in-flight | → `Cancelled` if user deletes the CR | +| `Validating` | → `Destroying` once all checks pass | `safe-to-destroy` returns EBUSY (peers backfilling) — re-checked each reconcile | → `Cancelled` on CR delete; → `Failed` on
validation failure (target OSD invalid, swap-intolerant CR, up+in without `autoOut`, or `safe-to-destroy` timeout) | | `Destroying` | → `Waiting` on Destroy Job success | Destroy Job retries on transient errors (Ceph unreachable, pod scheduling) | — | | `Waiting` | → `Preparing` once replacement disk visible | inventory poll until disk visible | → `Cancelled` on CR delete; → `Failed` with `reason=ReplacementDiskMissing` after disk-swap wait expires | | `Preparing` | → `Completed` when new daemon is Ready in Ceph | Prepare Job pod retries on transient errors; `lvcreate` precheck handles partial LV from a prior pod; Deployment creation retries | — | -| `Completed` | record persists for audit | — | terminal | -| `Cancelled`, `Failed` | record persists for audit | — | terminal until user-side action | - -Phase is the operator's internal cursor. The user-visible signal is a single `Ready` condition: `True` only on `Completed`; `False` otherwise, with `reason` set to the current phase name (or a typed terminal reason for `Failed`: `InvalidSpec`, `OSDStillIn`, `NotSafeToDestroy`, `ReplacementDiskMissing`). This gives `kubectl wait --for=condition=Ready` semantics. Use `metav1.Condition` (with `observedGeneration`), not Rook's legacy `cephv1.Condition`. - -#### Reconcile resume behavior +| `Completed` | terminal — success | — | — | +| `Cancelled`, `Failed` | terminal | — | — | -| Phase on disk | Recovery on next reconcile | -|---|---| -| no record | Trigger re-evaluated. If valid, record created with `phase=Pending`. No destructive action taken yet. | -| `Pending` | Peer list re-evaluated; if no earlier-timestamp peer is in non-terminal phase, advance to `Validating`. No destructive action taken yet. | -| `Validating` | Checks re-run (idempotent); `safe-to-destroy` escalation clock continues from the record's creation timestamp. No destructive action taken yet. 
| -| `Destroying`, no Destroy Job | Operator re-issues the deployment delete (idempotent), waits for pod termination, creates the Destroy Job. | -| `Destroying`, Destroy Job in flight | Operator awaits Job; on retry, recreates it. All commands in Destroy step are idempotent via precheck patterns. | -| `Waiting` | Operator polls for replacement disk; once visible, advances to `Preparing` and creates the Prepare Job (UUID generated inline, baked into Job env so pod retries reuse it). | -| `Preparing`, Prepare Job in flight | Operator awaits Job; on retry, recreates it. `lvcreate` skipped if LV exists; `lvm prepare --osd-id` reuses the destroyed slot. | -| `Preparing`, Prepare Job done | Existing per-node OSD-status reconcile creates the Deployment; operator polls `ceph osd metadata` until Ready, then transitions to `Completed`. | -| `Completed` | Validation short-circuits; record deleted on next spec change. | +User-visible: `Ready=True` on `Completed`, `Ready=False` otherwise; `reason` carries the current phase or a typed terminal reason. -> **⚠️ Destroy is irreversible.** Once `Validating` passes (phase advances to `Destroying`), `osd.5` will be destroyed on this reconcile cycle. There is no preview surfacing the captured OSD info. If the user typed the wrong OSD ID, the wrong OSD is gone — recovery is via [Cancellation](#cancellation), not by retracting the trigger. +> **⚠️ Destroy is irreversible.** Once `Validating` passes, `osd.5` will be destroyed on the next reconcile. If the user typed the wrong OSD ID, the wrong OSD is gone. ### Step-by-step @@ -231,66 +217,48 @@ The walk-through uses the running example. #### 1. Trigger — user creates a `CephOSDReplace` CR -Typical case is a failed disk: Ceph auto-marks the OSD `down` and `out` after `mon_osd_down_out_interval` (default 600s) and backfills; once the OSD is drained from every PG's acting set, `safe-to-destroy` clears and the flow proceeds. 
The user creates a `CephOSDReplace` CR and replaces failed device in datacenter. +Typical case is a failed disk: Ceph auto-marks the OSD `down` and `out` after `mon_osd_down_out_interval` (default 600s) and backfills; once the OSD is drained from every PG's acting set, `safe-to-destroy` clears and the flow proceeds. The user creates a `CephOSDReplace` CR and replaces the failed device in the datacenter. -For proactive replacement of a healthy (up+in) OSD, two options: +Healthy (up+in) OSDs require either `ceph osd out` first or `spec.autoOut: true` — see [Validate](#2-validate). -1. **Default — user marks `out` first.** With `autoOut: false` (default), Validating fails fast on a still-`in` OSD with `reason=OSDStillIn`. The user runs `ceph osd out ` themselves, waits for backfill, then re-applies the CR. -2. **Opt-in — `autoOut: true`.** Operator runs `ceph osd out ` at entry to Validating and loops `safe-to-destroy` through the full backfill. The `safeToDestroyTimeout` should be extended to fit the cluster's expected backfill time when using this. - -The disk can be swapped any time after the CR is applied — the Capture step tolerates a missing data PV. If multiple OSDs need replacement — open question U-2. +On creation, the CR enters `Pending` and waits for any in-flight replacement to terminate. Once cleared, it advances to `Validating`. The disk can be swapped any time after the CR is applied — the Capture step tolerates a missing data device. #### 2. Validate -Phase `Validating`. Entered when a fresh trigger advances out of `Pending` and no matching record exists. The operator persists the record on entry (so the `safe-to-destroy` escalation clock — origin = record creation timestamp — survives reconciles), then runs the checks below each reconcile cycle until exit. Failures land on `phase=Failed` with a typed `reason` exposed via `.status.conditions`. +Run each reconcile cycle until all checks pass or one fails terminally: 1. 
**Confirmation matches.** `spec.confirmation` must equal `"yes-really-replace-osd-{spec.osdId}"`. → `Failed` with `reason=InvalidSpec` on mismatch (typo guard). 2. **Target OSD exists** in the OSD map. → `Failed` with `reason=InvalidSpec` if absent. 3. **Target OSD is destroyable.** If the OSD is `up && in`: with `spec.autoOut: false` (default), → `Failed` with `reason=OSDStillIn`. With `spec.autoOut: true`, the operator runs `ceph osd out ` once at entry and falls through to check 5. -4. **CR-level device matching is swap-tolerant.** → `Failed` with `reason=InvalidSpec` if the OSD's data device is referenced by an unstable name in the CR. Validation policy is configurable per U-10. +4. **CR-level device matching is swap-tolerant.** → `Failed` with `reason=InvalidSpec` if the OSD's data device is referenced by an unstable name in the CR (rules per [U-6](#open-questions)). 5. **`safe-to-destroy ` returns OK.** Returns EBUSY while any PG still has the OSD in its acting set — the only safety gate (`down`/`out` alone is not sufficient because data may not have replicated). EBUSY → stay in `Validating`, re-check next reconcile. `spec.safeToDestroyTimeout` exceeded → `Failed` with `reason=NotSafeToDestroy`. -When all checks pass, the operator advances to `Destroying` and proceeds to [Capture](#3-capture). - -There is no auto-provisioning-race detection. If the user mis-orders — inserts the disk before creating the `CephOSDReplace` CR — Rook auto-provisions a new OSD on it (with a fresh ID), and the subsequent replacement then stalls in the Wait step (no empty disk left to claim). The user resolves by purging the auto-provisioned OSD; the design relies on this observable failure mode rather than embedding pre-trigger detection. - -#### 3. Capture - -Capture the failed OSD's info from sources that do not require the failed data device. 
Runs at the `Validating` → `Destroying` transition: the operator captures the fields below and atomically updates the existing record (sets `phase=Destroying` plus the captured fields). From this point on, the record carries OSD info; a crashed operator restarts and resumes from the persisted phase. -The OSD's own DB LV is the source for `database-size-mb` and `encrypted` (the metadata VG is by construction intact at this step: failure is on the data device, and the destroy LV-removal hasn't run yet). Live spec is *not* the source: a user-edited `spec.storage.config.databaseSizeMB` between original provisioning and replacement would size the new DB LV inconsistently with siblings, and `encrypted` is immutable per-OSD so a CR-level toggle cannot retroactively change it. If the OSD's own LV is missing, fall back to a surviving sibling LV in the same VG. +#### 3. Destroy -For OSDs whose deployment predates required changes #2 and #5, `ROOK_METADATA_DEVICE` and `ROOK_METADATA_SOURCE_DEVICE` are absent. A one-shot `ceph-volume lvm list --format json` Job on the OSD's node, via Rook's existing `cmdreporter`, fills both. Output's `[db]` entry: `devices` → metadata source device, `tags.ceph.db_device` → DB LV path. Correct even when the data device has physically failed (reads from VG metadata replicated on the metadata-VG's surviving PV). Verified on the Lima cluster. +Before deleting the deployment, the operator captures `.status.osdInfo` (sources per the YAML comments in [State](#state)). Most fields come from the OSD deployment's env. Two come off the host: -#### 4. Destroy +- `databaseSizeMB` and `encrypted` — read from the OSD's DB LV (or a surviving sibling LV in the same VG if the OSD's own LV is missing). The live spec is not a source: a user-edited `spec.storage.config.databaseSizeMB` would size the new DB LV inconsistently with siblings. 
+- `metadataSourceDevice` — for OSDs created by older operator versions (env not yet plumbed), a one-shot `ceph-volume lvm list --format json` Job on the OSD's node fills it. The Job reads VG metadata from the surviving PV on the metadata device, so it works even after the data device has physically failed. -Operator calls `k8sutil.DeleteDeployment` ([`deployment.go#L388`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/k8sutil/deployment.go#L388)) on `rook-ceph-osd-5` — this deletes the deployment and polls until the pod is gone. The pod-gone wait is required: while the daemon runs, it holds the DB-side LUKS mapping open and the next step's `cryptsetup close` would fail. +Then the operator calls `k8sutil.DeleteDeployment` ([`deployment.go#L388`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/k8sutil/deployment.go#L388)) on `rook-ceph-osd-5` and polls until the pod is gone. The pod-gone wait is required: while the daemon runs, it holds the DB-side LUKS mapping open and the next step's `cryptsetup close` would fail. If the wait times out (transient NotReady node), the operator re-checks on the next reconcile. No force-delete — a stuck pod on a NotReady node may still hold the LUKS mapping when kubelet recovers. -If the wait times out (transiently NotReady node), the operator sets `WaitingForOSDPodTermination` and re-checks on the next reconcile. The operator does NOT force-delete: a stuck pod on a NotReady node may still be holding the LUKS mapping when kubelet recovers; force-delete would diverge K8s and host state. - -**Host permanently down — out of scope.** If the host is gone (powered off, hardware failure), this flow cannot proceed: the Destroy Job's NodeSelector pins it to that node, and force-deleting the OSD pod doesn't bring the kubelet back. The Destroy Job stays Pending. 
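The `ceph-volume lvm list --format json` fallback described above could be sketched roughly as follows. This is a minimal illustration, not Rook's parser: `lvEntry`, `dbInfoForOSD`, and the hand-written `sampleReport` (including its device and LV names) are hypothetical. The sketch assumes the report is keyed by OSD ID and that the `db` entry carries `devices` and `tags["ceph.db_device"]`, as the design text states.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// lvEntry models only the fields of one report entry that the capture step
// needs; every other field in the JSON is ignored on decode.
// (Hypothetical type, not part of Rook.)
type lvEntry struct {
	Type    string            `json:"type"`
	Devices []string          `json:"devices"`
	Tags    map[string]string `json:"tags"`
}

// dbInfoForOSD returns the metadata source device and the DB LV path recorded
// in the "db" entry of the given OSD. It fails if the OSD has no such entry.
func dbInfoForOSD(raw []byte, osdID string) (string, string, error) {
	var report map[string][]lvEntry
	if err := json.Unmarshal(raw, &report); err != nil {
		return "", "", fmt.Errorf("decoding lvm list output: %w", err)
	}
	for _, e := range report[osdID] {
		if e.Type != "db" {
			continue // skip the "block" entry and anything else
		}
		if len(e.Devices) == 0 {
			return "", "", fmt.Errorf("osd.%s: db entry lists no source device", osdID)
		}
		return e.Devices[0], e.Tags["ceph.db_device"], nil
	}
	return "", "", fmt.Errorf("osd.%s: no db entry found", osdID)
}

// sampleReport is an illustrative, hand-written report; names are made up.
const sampleReport = `{
  "5": [
    {"type": "block", "devices": ["/dev/sdf"],
     "tags": {"ceph.db_device": "/dev/ceph-metadata-vg-1/osd-db-example"}},
    {"type": "db", "devices": ["/dev/nvme0n1"],
     "tags": {"ceph.db_device": "/dev/ceph-metadata-vg-1/osd-db-example"}}
  ]
}`

func main() {
	src, lv, err := dbInfoForOSD([]byte(sampleReport), "5")
	if err != nil {
		fmt.Println("capture failed:", err)
		return
	}
	fmt.Println("metadata source device:", src)
	fmt.Println("DB LV path:", lv)
}
```

Decoding into a narrow struct keeps the sketch tolerant of extra fields in the report, which matters if the output shape differs across Ceph versions.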
Replacement of an OSD on a permanently-dead host is a different workflow (node decommission, then OSD-out-and-purge, then re-add the host with fresh OSDs) — handled by existing Rook flows. - -The Destroy Job's container invokes `DestroyOSD` ([`remove.go#L244-L292`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/remove.go#L244-L292)) — the same Go function the existing migration flow already calls from [`cmd/rook/ceph/osd.go#L272`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/cmd/rook/ceph/osd.go). The bash below specifies the behavior `DestroyOSD` must implement after required change #3 lands (today it only does the first step and a partial last step). Each operation is idempotent on retry; no standalone shell script ships in the operator. +The Destroy Job's container invokes `DestroyOSD` ([`remove.go#L244-L292`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/remove.go#L244-L292)) — the same Go function the existing migration flow already calls from [`cmd/rook/ceph/osd.go#L272`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/cmd/rook/ceph/osd.go). The bash below specifies what `DestroyOSD` must do (today it only handles the first step and a partial last step). Each operation is idempotent on retry. ```bash # Destroy in Ceph (preserves OSD ID 5 for reuse). -ceph osd destroy osd.5 --yes-i-really-mean-it # idempotent: already-destroyed → succeeds +ceph osd destroy osd.5 --yes-i-really-mean-it -# Remove dm-crypt key. On Ceph v19.2.2 (verified) `ceph osd destroy` already -# cleans the key and `config-key rm` on a missing key is itself idempotent -# (returns 0), so this whole step is typically a no-op. The explicit `exists` -# precheck is defensive: keeps the chain safe on older Ceph versions where -# rm's exit-code behavior on missing key has not been measured. 
+# Remove dm-crypt key (no-op on Ceph v19+; defensive for older versions). ceph config-key exists dm-crypt/osd/8b7e6c19-.../luks \ && ceph config-key rm dm-crypt/osd/8b7e6c19-.../luks -# Close DB-side LUKS mapping. Enumerate children with TYPE explicit and pick -# the crypt child specifically — robust against future LV-stack shapes. +# Close DB-side LUKS mapping. DB_MAPPING=$(lsblk -nlo NAME,TYPE /dev/ceph-metadata-vg-1/osd-db-bbb... | awk '$2=="crypt"{print $1; exit}') [ -n "$DB_MAPPING" ] && cryptsetup status "$DB_MAPPING" >/dev/null 2>&1 \ && cryptsetup close "$DB_MAPPING" -# Free the DB slot. Real lvremove failures bubble up; state machine retries. +# Free the DB slot. lvs /dev/ceph-metadata-vg-1/osd-db-bbb... >/dev/null 2>&1 \ && lvremove -f /dev/ceph-metadata-vg-1/osd-db-bbb... @@ -301,33 +269,23 @@ lvs /dev/ceph-data-vg-5/osd-block-aaa... >/dev/null 2>&1 \ After the Job completes, operator advances the record to `phase: Waiting`. -#### 5. Wait for replacement disk - -The wait is non-blocking (see [Open question: controller placement](#open-question-controller-placement) above). Each reconcile cycle either checks for the disk and yields, or finds it and creates the Prepare Job. Inventory needs a path that doesn't auto-provision — the standard prepare-job spawn for this node would otherwise claim the empty disk with a fresh ID. Two cases: - -- **`rook-discover` enabled** — operator watches the per-node `local-device-` CM. Reconcile triggers on CM update via the hotplug-CM watch. Latency: seconds (rook-discover's udev monitor) up to its `ROOK_DISCOVER_DEVICES_INTERVAL` (default 60 min) for the polling fallback. -- **`rook-discover` disabled** (the operator's default) — the operator yields with a periodic re-check (default 5 min; see U-9), and on each cycle spawns a one-shot `ceph-volume inventory --format json` Job via `cmdreporter`. The Job is read-only — does not auto-provision — so it doesn't conflict with the replacement. +#### 4. 
Wait for replacement disk -While waiting, the operator surfaces a `WaitingForReplacementDisk` condition. Default timeout `spec.diskWaitTimeout` (24h). On timeout the condition flips to `ReplacementDiskMissing` and polling stops. +Each reconcile of the CR, the controller checks if the replacement disk is visible on the node; if not, it requeues. Discovery uses Rook's existing paths — `rook-discover` (when enabled) for udev-event-driven updates, otherwise the operator's per-reconcile prepare-job inventory. The cluster controller skips auto-provisioning on the node while this CR is active (see [Coordination](#coordination)). -**Recovery from timeout — two paths:** +Timeout per `spec.diskWaitTimeout` (default 24h) → `Failed` with `reason=ReplacementDiskMissing`. After timeout, the user can either insert the disk and create a new CR for the same OSD ID, or delete the CR (osd stays destroyed; `ceph osd purge` to free the slot). -1. **Insert the disk and create a new CR** for the same OSD ID (re-replacement). The new CR enters `Validating` fresh; the failed CR is inert. -2. **Abandon** by deleting the CR. Per [Cancellation](#cancellation), this substate honors cancel: the finalizer cleans up; `osd.5` stays `destroyed`; user runs `ceph osd purge 5` manually if they want to remove the slot. +#### 5. Prepare -#### 6. Prepare - -Phase `Preparing` (entered when the replacement disk is visible). The operator generates a fresh UUID for the new DB LV and bakes it into the Job spec as an env var (same pattern as `ROOK_REPLACE_OSD` in `provision_spec.go:317,322`); pod retries within the Job's backoff reuse the same env. The Job performs: +Phase `Preparing` (entered when the replacement disk is visible). The operator generates a fresh UUID for the new DB LV and passes it as an env var on the Job (same pattern as `ROOK_REPLACE_OSD` in `provision_spec.go:317,322`); pod retries reuse the same env. The Job runs: ```bash -# Pre-allocate the DB LV using the persisted name. 
Idempotent on retry — -# if the LV already exists from a previous attempt, lvcreate is skipped. +# Pre-allocate the DB LV using the persisted name; skip if it already exists. lvs /dev/ceph-metadata-vg-1/osd-db-12cf3a91-... >/dev/null 2>&1 \ || lvcreate -L 4096M -n osd-db-12cf3a91-... ceph-metadata-vg-1 --wipesignatures y -# Provision the new OSD with the preserved ID. -# --dmcrypt is conditional on the record's `encrypted` field; -# omitted for unencrypted OSDs. +# Provision the new OSD with the preserved ID. --dmcrypt only when the record's +# `encrypted` field is true. ceph-volume lvm prepare \ --bluestore [--dmcrypt] \ --osd-id 5 \ @@ -336,45 +294,25 @@ ceph-volume lvm prepare \ --crush-device-class hdd ``` -The UUID in `osd-db-12cf3a91-...` is the operator-generated UUID from the Wait step, not the OSD's fsid. ceph-volume assigns its own fsid during prepare and writes `ceph.osd_fsid` / `ceph.db_uuid` LV tags. - -Prepare writes the new OSD's info to the per-node prepare-job status CM that Rook already uses to drive daemon creation. The phase stays `Preparing` while the operator creates the Deployment and waits for the new daemon to become Ready in Ceph. +The Job writes the new OSD's info to `rook-ceph-osd--status` (the per-node prepare-job CM that Rook already uses to drive daemon creation). The phase stays `Preparing` while the operator creates the Deployment and waits for the new daemon to become Ready in Ceph. -#### 7. Activate +#### 6. Activate -Reuses the existing reconcile path: `createOSDsForStatusMap` ([`status.go#L324`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/cluster/osd/status.go#L324)) sees the per-node status CM the Prepare Job wrote and creates the daemon Deployment from it. The new deployment carries `ROOK_METADATA_DEVICE` and `ROOK_METADATA_SOURCE_DEVICE` directly — no fallback Job needed for any future replacement of this OSD. 
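The conditional `--dmcrypt` handling in the Prepare step could be expressed as a small argument builder. The sketch below is an assumption about how the Job's command line might be assembled from the persisted record, not the operator's actual code: `prepareRecord` and `buildPrepareArgs` are hypothetical names, and the use of `--data` and `--block.db` is assumed because the excerpt elides the lines between `--osd-id` and `--crush-device-class`.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// prepareRecord holds the captured fields the Prepare Job needs.
// (Hypothetical type; field names are illustrative.)
type prepareRecord struct {
	OSDID       int    // preserved OSD ID, e.g. 5
	Encrypted   bool   // from the record's `encrypted` field
	DataDevice  string // replacement data device
	DBLV        string // pre-allocated DB LV, in vg/lv form
	DeviceClass string // CRUSH device class; omitted when empty
}

// buildPrepareArgs returns the argument list for `ceph-volume`.
// --dmcrypt is added only for encrypted OSDs.
func buildPrepareArgs(r prepareRecord) []string {
	args := []string{"lvm", "prepare", "--bluestore"}
	if r.Encrypted {
		args = append(args, "--dmcrypt")
	}
	args = append(args,
		"--osd-id", strconv.Itoa(r.OSDID),
		"--data", r.DataDevice,
		"--block.db", r.DBLV,
	)
	if r.DeviceClass != "" {
		args = append(args, "--crush-device-class", r.DeviceClass)
	}
	return args
}

func main() {
	rec := prepareRecord{
		OSDID:       5,
		Encrypted:   true,
		DataDevice:  "/dev/sdf", // illustrative device name
		DBLV:        "ceph-metadata-vg-1/osd-db-example",
		DeviceClass: "hdd",
	}
	fmt.Println("ceph-volume " + strings.Join(buildPrepareArgs(rec), " "))
}
```

Building the arguments as a slice rather than a shell string avoids quoting problems when device paths are passed to the Job's container command.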
+Reuses the existing path: `createOSDsForStatusMap` ([`status.go#L324`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/cluster/osd/status.go#L324)) sees the per-node status CM the Prepare Job wrote and creates the daemon Deployment from it. The new deployment carries `ROOK_METADATA_DEVICE` and `ROOK_METADATA_SOURCE_DEVICE` directly — future replacement of this OSD won't need the Capture fallback. -#### 8. Complete +#### 7. Complete -Like the disk wait, the wait for the new daemon to join Ceph is non-blocking. Each reconcile cycle calls `ceph osd metadata `. Ready = a record returned with a non-empty fsid, `id` matching, and `hostname` matching the record's `node`. This single check covers both the up-in-Ceph signal and the new-fsid capture. - -On Ready, the operator transitions to `phase: Completed` and records `newFsid` and `completedAt`. The record persists for audit; the user keeps or deletes the CR at their leisure (the next reconcile short-circuits on the terminal phase). +While in `Preparing`, the controller calls `ceph osd metadata ` each reconcile. Ready = a record returned with a non-empty fsid, matching `id`, and matching `hostname`. On Ready, transition to `phase: Completed` and record `newFsid` and `completedAt`. ### Cancellation -Cancel = delete the `CephOSDReplace` CR. A finalizer runs the operator's per-phase response: - -| Phase | Cancel honored? | Effect | -|---|---|---| -| `Pending` | Yes | Finalizer removes the CR. Nothing has happened. | -| `Validating` | Yes | Finalizer removes the CR. Nothing destructive has happened. (If `autoOut: true` was set and the operator marked the OSD `out`, the OSD stays `out` — user marks `in` manually if they want to recover the cluster's original layout.) | -| `Destroying` | No | State record drives the flow forward. Destroy is short-lived; cancel is a no-op. | -| `Waiting` (destroy complete; no Prepare Job yet) | Yes | Finalizer removes the CR. 
`osd.5` stays `destroyed`; user runs `ceph osd purge 5` to remove the slot. **No orphan LV**. **ID-preserving retry of osd.5 is unavailable after this cancel** — the original Deployment is gone (Destroy step) and data + DB LVs are wiped, so a future `CephOSDReplace` for the same ID has no OSD info to capture and aborts with `OSDInfoCaptureFailed`. To re-add an OSD here, the user accepts a fresh ID. | -| `Preparing`, Prepare Job running (`lvcreate` may have run; `ceph-volume lvm prepare` may have started LUKS-formatting) | Only on Job failure | `ceph-volume lvm prepare` cannot be safely interrupted mid-call (partial dm-crypt + half-LUKS LV). Operator records the cancel intent and acts at Job exit: **on Job failure**, finalizer removes the CR; the partially-allocated DB LV is left as a named orphan (UUID is in the failed Job's env); osd.5 stays `destroyed`. **On Job success**, cancel is **not** honored — the new OSD is provisioned and joins the cluster. Removing the just-provisioned OSD is an `out`+`purge` workflow, not a rollback of this flow. | -| `Preparing` (post-Job, awaiting daemon Ready) or `Completed` | No | New OSD is already provisioned. Cancel makes no sense. | +Cancel = delete the `CephOSDReplace` CR; a finalizer runs cleanup. Cancel is honored cleanly in `Pending`, `Validating`, and `Waiting` (after Destroy completes — `osd.5` stays `destroyed`; user runs `ceph osd purge 5` to free the slot). `Destroying` is short-lived and ignores cancel. Once the new OSD is provisioned (post-Prepare-Job-success or `Completed`), cancel makes no sense — removing the new OSD is an `out`+`purge` workflow, not a rollback. -## Required code changes +**Cancel during Validating with `autoOut: true`.** If the operator already marked the OSD `out`, the OSD stays `out` after cancel. User marks `in` manually to recover the original cluster layout. -Six changes. 
Items 1–3 and 5 are independent bug fixes that map 1:1 to the five gaps in [Current gaps](#current-gaps) and are worth landing regardless of this design. Item 4 wires LVM-mode replacement; item 6 is the new orchestration. Implementation-level details (algorithms, edge cases, pattern reuse) are tracked separately from the design. +**Cancel during Waiting — ID-preserving retry unavailable.** Data and DB LVs were wiped at Destroy; a future `CephOSDReplace` for the same ID has no OSD info to capture and aborts. To re-add an OSD here, accept a fresh ID. -| # | Fix | File / lines | -|---|---|---| -| 1 | Inventory must include shared metadata disks. The current `len(children) > 1` filter excludes any disk hosting a Ceph LV. Algorithm: when the filter would trigger, check children for the `ceph.cluster_fsid=` LV tag — same authoritative signal Rook already uses elsewhere — and bypass the filter if any child carries this cluster's FSID. | [disk.go#L97-L111](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/clusterd/disk.go#L97-L111) | -| 2 | Populate `OSDInfo.MetadataPath` and a new `OSDInfo.MetadataDevice` field for LVM-mode OSDs. The data is in `ceph-volume lvm list --format json`'s `[db]` section (LV `path` and source `devices`); the parser today walks only `[block]` entries. Forward-compat across rolling upgrade: standard `encoding/json` decode with no `DisallowUnknownFields` policy — old operator silently drops the new field; new operator decoding an old CM gets the zero value (which the Capture fallback handles). | [volume.go#L1104-L1177](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/volume.go#L1104-L1177) | -| 3 | `DestroyOSD` cleans up the DB LV and the dm-crypt config-key. Add `ceph config-key exists+rm`, `cryptsetup close `, and `lvremove -f ` (gated on `osdInfo.MetadataPath != ""`). 
Use the precheck patterns from the Destroy step so genuine failures bubble up while already-clean state is tolerated. | [remove.go#L244-L292](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/remove.go#L244-L292) | -| 4 | Wire LVM-mode replacement through `lvm prepare --osd-id`. Today only raw mode at [volume.go#L548-L555](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/volume.go#L548-L555) adds `--osd-id`. When `a.replaceOSD != nil` and a metadata device is set, pre-allocate the DB LV with `lvcreate` (using the operator-persisted name) and call `lvm prepare --osd-id` instead of `lvm batch`. `lvm prepare --osd-id` claims a destroyed slot atomically (race-safe; no implicit reuse via mon's lowest-free allocation policy) and matches the existing same-device replacement flow. | [volume.go](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/volume.go) | -| 5 | Pass `OSDInfo.MetadataDevice` to the OSD daemon deployment as a new `ROOK_METADATA_SOURCE_DEVICE` env var. Future destroys read the metadata info from the deployment without a node-side rescan. | [spec.go#L950-L1010](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/cluster/osd/spec.go#L950-L1010) | -| 6 | Orchestration — `CephOSDReplace` CRD definition + dedicated controller implementing the state machine (Pending queue, Validating with confirmation/up+in/safe-to-destroy checks, Destroy/Prepare Job split, non-blocking disk wait). Cluster-side coupling: skip auto-provisioning on nodes with an active replacement; surface an `OSDReplacementInProgress` condition on `CephCluster.status`. | new package `pkg/operator/ceph/cluster/osd/replace/` | +**Cancel during Preparing, Job in flight.** `ceph-volume lvm prepare` cannot be safely interrupted mid-call (partial dm-crypt + half-LUKS LV). 
The operator records the cancel intent and acts at Job exit. On Job failure, the finalizer removes the CR; the partially-allocated DB LV is left as a named orphan. On Job success, cancel is not honored — the new OSD joins the cluster. ## Out of scope @@ -394,9 +332,9 @@ nodes: config: { metadataDevice: "nvme1n1" } # different metadata device on the same node ``` -This setup requires exact `name:` (or `fullpath:`) references — the per-device `config:` block can only be attached to a specific device entry, not to a regex match. Replacement of a single OSD on this setup works structurally (each OSD's `metadata-source-device` is captured in its record at destroy time), with two caveats: +This setup requires exact `name:` (or `fullpath:`) references — the per-device `config:` block can only be attached to a specific device entry, not to a regex match. Replacement of a single OSD on this setup works structurally (each OSD's `metadataSourceDevice` is captured in its `osdInfo` at destroy time), with two caveats: -- **Validation policy must permit exact entries** — see U-10. The default `strict` mode rejects this setup from the flow. +- **Device-name validation must permit exact entries** — open question. - **Same-slot replacement is required** — `by-path` resolves only when the new disk is in the original slot. Different-slot replacement stalls in the Wait step. The broader multi-metadata-device feature work (improvements to per-device `metadataDevice` UX, multi-NVMe-per-node setups) was scoped separately by maintainers in [#13240](https://github.com/rook/rook/issues/13240) (tracked by `zhucan`). This design does not add new logic for that setup — it just doesn't actively forbid replacement on it under permissive validation. @@ -411,31 +349,25 @@ If the OSD's host is gone, this flow cannot proceed (the Destroy step requires t ## Open questions -### Decided (rationale captured for review) +1. 
**Controller placement.** Design leans toward a separate `CephOSDReplace` CRD; `spec.storage.replaceOSD` on CephCluster (mirroring `spec.storage.migration`) is a fallback — see [Open question: controller placement](#open-question-controller-placement). Maintainers' call. -- **U-1 — Trigger surface.** Decision: dedicated `CephOSDReplace` CRD with `spec` carrying trigger fields (see [State](#state)). Cluster-CR fallback shape: `spec.storage.replaceOSD: {id, confirmation, ...}` mirroring `spec.storage.migration`. -- **U-6 — State-record substrate.** Resolved by [Open question: controller placement](#open-question-controller-placement) — `CephOSDReplace.status`. -- **U-7 — OSD-ID preservation primitive.** Decision: `ceph osd destroy` + `lvm prepare --osd-id` + operator pre-allocates DB LV. This is [Ceph's documented replacement procedure](https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/#replacing-an-osd) — the design just orchestrates it across pods. Verified end-to-end on Ceph v19.2.2 (`osd-rep-log.md`). Race-safe: the destroyed slot can't be claimed by another OSD between destroy and prepare. Matches Rook's existing same-device replacement flow ([volume.go#L548-L555](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/volume.go#L548-L555)). Alternatives compared in `osd-id-reuse-analysis.md`. +2. **Parallelism.** The proposed OSD replacement process is serial. Are there use-cases for parallel replacement we should support — multiple OSDs safe-to-destroy on the same node, all safe-to-destroy in the cluster at once, configurable concurrency? -### Open for PR-review decision +3. **Auto-replace mode.** The proposed flow is always triggered explicitly by the user (CR creation). Should there be a follow-up option for automated replacement that triggers the same flow when a failed OSD and a fresh disk are detected on a node? 
-- **U-O — Controller placement.** [Open question: controller placement](#open-question-controller-placement) lays out the trade-offs. The doc leans toward the CRD option but does not prescribe. Maintainers' call. -- **U-2 — Parallelism.** Issue #13240 names multi-disk-failure on a chassis as a real operational pain — replacing N disks means N sequential CR edits, each blocking on disk-swap wait + reconcile cadence. This design stays serial for operational simplicity, not correctness — per-OSD `safe-to-destroy`'s drained semantic makes concurrent destroys safe. Two follow-up paths: (a) widen the trigger surface to a list (`replaceOSDs: [{id, confirmation}, …]`); (b) N-per-reconcile execution via N parallel Destroy/Prepare Jobs each running `lvm prepare --osd-id `. An `lvm batch --osd-ids X Y Z --prepare` invocation does **not** work for shared-metadata setups (rejects the metadata VG outright); (b) must use N parallel `lvm prepare --osd-id` invocations, not a single `lvm batch` call. -- **U-3 — Auto-replace mode.** Follow-up: opt-in `spec.storage.autoReplaceOSDs: true` to run the same flow without explicit trigger when an OSD is `down_in` and a new empty disk appears. Extra checks (cluster health, PG state) would gate it. Deferred. -- **U-4 — Default timeout values.** Per-CR configurability is decided (`spec.safeToDestroyTimeout`, `spec.diskWaitTimeout` — see [State](#state)). Defaults open: 1h / 24h is a starting point. `safeToDestroyTimeout` default fits the failed-disk path (Ceph auto-`out`s and backfills before the user even creates the CR); needs a longer override when `autoOut: true` is used on a healthy OSD where backfill begins inside the flow. -- **U-5 — Faster wake on disk-swap.** With `rook-discover` enabled, latency floor is its udev-event delivery (seconds) up to `ROOK_DISCOVER_DEVICES_INTERVAL` (60 min). Without it, the wait re-checks every U-9 interval. 
Optional follow-up: treat udev "new disk" events on the node as reconcile triggers while a replacement is in progress. -- **U-8 — `getOSDInfo` fallback for old deployments.** Rook regenerates each OSD's Deployment spec on every reconcile, but `getOSDInfo` only recovers what's already in the env. For OSDs deployed before changes #2 and #5, the new env vars are missing → empty values on regenerated deployments. The Capture fallback handles this per replacement. Alternative: push the `lvm list` fallback into `getOSDInfo` itself, so existing deployments get backfilled on operator upgrade. The per-replacement fallback then becomes redundant. Worth doing if change #2's footprint is small. -- **U-9 — Wait-for-disk re-check pattern.** Default 5 min interval is a working starting point. Lower (1 min) cuts latency at the cost of more inventory-Job pods. Likely cluster-config-tunable rather than hardcoded. -- **U-10 — Device-matching validation policy.** Three policies: `strict` (reject any exact entry on the OSD's data device), `accept-by-path` (reject only kernel-name entries), `lenient` (accept anything; mismatch surfaces as Wait-step stall). Defaults and configurability scope (operator-global / per-replacement / hard-coded) are open. +4. **Default values.** Are these defaults reasonable: `safeToDestroyTimeout: 1h`, `diskWaitTimeout: 24h`, disk-wait re-check interval `5 min` (cluster-config tunable)? -## Validation plan +5. **Disk-swap responsiveness.** Can the design rely on Rook's existing discovery (rook-discover when enabled, otherwise the per-reconcile prepare-job inventory) to detect the replacement disk? Expected latency? Caveats when the cluster uses `useAllDevices` vs. regex `deviceFilter` vs. exact `name:` entries? -Coverage areas this design must validate (detailed scenarios in [`osd-test-scenarios.md`](osd-test-scenarios.md)): +6. 
**Device-name validation.** Should the operator validate that the OSD's data-device reference in `spec.storage.nodes[*].devices[*]` is swap-tolerant before destroying, or is this the user's responsibility? Sample to consider: -- **Happy path** on shared-metadata setups: single OSD replaced while siblings stay up, with and without `encryptedDevice: true`; multiple metadata devices on the same node (per-device config); same-device (raw-mode) regression. -- **Required-change validation**: new OSD deployment carries non-empty `ROOK_METADATA_DEVICE` / `ROOK_METADATA_SOURCE_DEVICE`; metadata VG with healthy siblings is now visible to inventory. -- **Crash recovery**: Destroy Job and Prepare Job killed mid-run; state-record-driven retry produces no orphan DB LVs across N retries. -- **Validation gates**: trigger after auto-provisioning is rejected; raw kernel-name device addressing is rejected before any destructive action. -- **Edge cases**: smaller replacement disk; pre-existing leaked DB LVs in the VG; encrypted-OSD dm-crypt key cleanup across Ceph versions. + ```yaml + spec: + storage: + nodes: + - name: node-1 + devices: + - name: /dev/sda # kernel name — not swap-stable + - name: /dev/disk/by-path/... # by-path — same-slot swap only + ``` -Manual verification on a Lima VM (2 simulated HDDs + 1 simulated NVMe with `databaseSizeMB: 1500`, dmcrypt on) before handoff to CI. 
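For the device-name validation question above, one possible building block is a classifier that sorts a `devices[].name` value by naming scheme before any policy is applied. The sketch below is an assumption, not existing Rook code: `classifyDeviceName` and the `nameKind` values are hypothetical, and the sketch deliberately stops at classification because the policy that maps each kind to accept or reject is still open.

```go
package main

import (
	"fmt"
	"strings"
)

// nameKind is the naming scheme a device reference uses.
// (Hypothetical type introduced for illustration.)
type nameKind string

const (
	kindKernelName nameKind = "kernel-name" // e.g. vdb, /dev/sdc
	kindByPath     nameKind = "by-path"     // /dev/disk/by-path/...
	kindByID       nameKind = "by-id"       // /dev/disk/by-id/...
	kindOther      nameKind = "other"       // anything else
)

// classifyDeviceName inspects only the string form of the reference; it does
// not resolve symlinks or touch the host.
func classifyDeviceName(name string) nameKind {
	switch {
	case name == "":
		return kindOther
	case strings.HasPrefix(name, "/dev/disk/by-path/"):
		return kindByPath
	case strings.HasPrefix(name, "/dev/disk/by-id/"):
		return kindByID
	}
	// A bare name ("vdb") or a direct /dev entry ("/dev/sdc") is a kernel name.
	base := strings.TrimPrefix(name, "/dev/")
	if !strings.Contains(base, "/") {
		return kindKernelName
	}
	return kindOther
}

func main() {
	for _, n := range []string{
		"/dev/sda",
		"vdb",
		"/dev/disk/by-path/pci-0000:00:1f.2-ata-1", // illustrative path
		"/dev/mapper/example",
	} {
		fmt.Printf("%-45s %s\n", n, classifyDeviceName(n))
	}
}
```

Keeping classification separate from policy would let the same helper serve whichever answer maintainers choose, whether the operator rejects kernel names outright or only reports them.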
From 4b5437e2650728de9750793e49bd31e931e21c72 Mon Sep 17 00:00:00 2001 From: Artem Torubarov Date: Wed, 6 May 2026 10:57:13 +0200 Subject: [PATCH 04/12] docs: fix code links Signed-off-by: Artem Torubarov --- osd-design.md | 90 ++++++++++++++++++++++++++------------------------- 1 file changed, 46 insertions(+), 44 deletions(-) diff --git a/osd-design.md b/osd-design.md index 4ce423d4c129..c104fb27ec88 100644 --- a/osd-design.md +++ b/osd-design.md @@ -51,13 +51,14 @@ The replacement flow must validate the affected OSD's CR references beforehand s ## Current gaps -Rook has no automated flow for replacing a failed OSD today. The closest existing primitive is the migration flow (`spec.storage.migration.confirmation`), which recreates OSDs in place after encryption or store-type spec changes: it destroys the OSD and re-prepares with `ceph-volume raw prepare --osd-id` via the `ROOK_REPLACE_OSD` env var. Migration only covers raw-mode OSDs; the shared-metadata case needs five additional fixes: +Rook has no automated flow for replacing a failed OSD today. The closest existing primitive is the migration flow (`spec.storage.migration.confirmation`), which recreates OSDs in place after encryption or store-type spec changes: it destroys the OSD and re-prepares with `ceph-volume raw prepare --osd-id` via the `ROOK_REPLACE_OSD` env var. Migration only covers raw-mode OSDs; the shared-metadata case needs six additional fixes: -1. The replacement code path runs only in raw mode; LVM mode (required when a metadata device is configured) does not pass `--osd-id`, so the new OSD gets a new ID. (`initializeDevicesLVMMode`, [volume.go#L584-L844](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/volume.go#L584-L844)) -2. Destroy zaps only the data LV; the DB LV on the shared metadata disk stays as an orphan. 
(`DestroyOSD`, [remove.go#L244-L292](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/remove.go#L244-L292)) -3. The dm-crypt key in Ceph's config-key store is never removed, leading to LUKS collisions on retry of encrypted OSDs. (`DestroyOSD`, [remove.go#L244-L292](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/remove.go#L244-L292)) -4. **The prepare-pod can't find a shared metadata disk once it hosts a DB LV.** Rook's disk-discovery (`DiscoverDevicesWithFilter`, [disk.go#L97-L111](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/clusterd/disk.go#L97-L111)) skips any disk with `len(deviceChild) > 1` as a guard against claiming a user-partitioned disk. The first OSD's DB LV trips that filter, and the prepare-pod's `initializeDevicesLVMMode` then errors with `metadata device is not found`. Same root cause as upstream issues [#15868](https://github.com/rook/rook/issues/15868) and parts of [#17477](https://github.com/rook/rook/issues/17477). -5. `OSDInfo.MetadataPath` is never populated for LVM-mode OSDs (the parser walks only `[block]` entries from `ceph-volume lvm list`), so the operator has no record of which metadata disk a destroyed OSD used. (`GetCephVolumeLVMOSDs`, [volume.go#L1104-L1177](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/volume.go#L1104-L1177)) +1. The replacement code path runs only in raw mode; LVM mode (required when a metadata device is configured) does not pass `--osd-id`, so the new OSD gets a new ID. (`initializeDevicesLVMMode`, [volume.go#L587-L847](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/volume.go#L587-L847)) +2. Destroy zaps only the data LV; the DB LV on the shared metadata disk stays as an orphan. 
(`DestroyOSD`, [remove.go#L244-L290](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/remove.go#L244-L290)) +3. The dm-crypt key in Ceph's config-key store is never removed, leading to LUKS collisions on retry of encrypted OSDs. (`DestroyOSD`, [remove.go#L244-L290](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/remove.go#L244-L290)) +4. **The prepare-pod can't find a shared metadata disk once any OSD lives on it.** Rook's disk-discovery (`DiscoverDevicesWithFilter`, [disk.go#L97-L111](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/clusterd/disk.go#L97-L111)) skips any disk with `len(deviceChild) > 1` as a guard against claiming a user-partitioned disk. From the first OSD onward — encrypted or not — that count is ≥ 2 (parent + LV, plus crypt mapping if encrypted), so the filter triggers and the prepare-pod's `initializeDevicesLVMMode` errors with `metadata device is not found`. Same root cause as upstream issues [#15868](https://github.com/rook/rook/issues/15868) and parts of [#17477](https://github.com/rook/rook/issues/17477). +5. `OSDInfo.MetadataPath` is not populated by the prepare-job re-discovery path on LVM-mode OSDs (`GetCephVolumeLVMOSDs` walks only `[block]` entries from `ceph-volume lvm list`). The operator-side path (`getOSDInfo` at [osd.go#L748](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/cluster/osd/osd.go#L748)) does populate it from `ROOK_METADATA_DEVICE` env, but anything that goes through re-discovery (a redeploy after CM loss) loses the field. ([volume.go#L1082-L1177](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/volume.go#L1082-L1177)) +6. The OSD deployment has no env carrying the DB LV's *source physical device* — only the LV path itself (`ROOK_METADATA_DEVICE`). 
At replacement time, the operator needs the source device to resolve the metadata VG and to identify the disk for the auto-provisioning skip. This design introduces a new `ROOK_METADATA_SOURCE_DEVICE` env, plumbed through the daemon-deployment spec. ## Proposed flow @@ -78,8 +79,8 @@ sequenceDiagram participant PJ as Prepare Job participant Ceph as Ceph - User->>CR: set replaceOSD id=5 - Op->>CR: read trigger + User->>CR: create CR with osd id 5 + Op->>CR: read CR Op->>CR: write phase=Validating Op->>Ceph: ceph osd dump (validate exists, get fsid) Op->>Ceph: safe-to-destroy 5 @@ -100,7 +101,7 @@ sequenceDiagram Note over User,Op: User swaps the failed disk Op->>CR: phase=Preparing Op->>+PJ: create from recorded OSD info - PJ->>PJ: lvcreate using persisted name + PJ->>PJ: lvcreate using LV name from Job env PJ->>Ceph: ceph-volume lvm prepare --osd-id 5 PJ->>PJ: write new OSD info to existing per-node status CM PJ-->>-Op: Succeeded @@ -115,9 +116,8 @@ sequenceDiagram The diagram doesn't pick a concrete CR or controller for the replacement reconcile logic. Two candidates: extend the existing CephCluster controller (which already hosts `spec.storage.migration`), or introduce a separate `CephOSDReplace` CRD with its own controller. The design leans toward the separate CRD for the following reasons: -1. **CephCluster's `Reconcile()` runs mon, mgr, and osd reconcile sequentially in one call** ([`cluster.go#L116-L160`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/cluster/cluster.go#L116-L160)). New long-running logic on the OSD path can interfere with mon/mgr reconcile for the same cluster. -2. **Replacement is long-running and multi-step**, so its state has to survive between reconciles. The cluster controller has no existing place to store sub-operation state — adding one (a side ConfigMap, or extending `CephCluster.status`) is part of the cost. -3. 
**Replacement reconciles need two outcomes the current cluster reconcile can't express**: terminal failure (bad CR rejected) and `RequeueAfter` (waiting for external events — disk inserted, Job done). Today `osd.Cluster.Start()` returns plain `error`; the parent reconcile has no way to learn "OSD step is mid-replacement, retry in N minutes." It's also unclear how a requeue would interact with components reconciled after the OSD step in the same `Reconcile` call. +1. **CephCluster's `reconcileCephDaemons` is monolithic and synchronous** — mon, mgr, and osd reconcile run sequentially in one call ([`cluster.go#L116-L160`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/cluster/cluster.go#L116-L160)); `osd.Cluster.Start()` returns plain `error`, so there's no way to express terminal failure (bad CR rejected) vs. transient `RequeueAfter` (waiting for disk-swap or Job completion). Adding long-running multi-step logic to this path interferes with mon/mgr reconcile and lacks the return semantics the flow needs. +2. **Replacement state has to survive between reconciles**, and the cluster controller has no existing place to store sub-operation state — adding one (a side ConfigMap, or extending `CephCluster.status`) is part of the cost. Concrete shape of each candidate: @@ -159,15 +159,15 @@ status: node: node-1 # OSD deployment NodeSelector; survives the deployment delete dataLV: /dev/ceph-data-vg-5/osd-block-aaa... # OSD deployment env ROOK_BLOCK_PATH dbLV: /dev/ceph-metadata-vg-1/osd-db-bbb... 
# OSD deployment env ROOK_METADATA_DEVICE; absent for raw-mode OSDs - metadataSourceDevice: nvme0n1 # OSD deployment env ROOK_METADATA_SOURCE_DEVICE; absent for raw-mode OSDs - metadataVG: ceph-metadata-vg-1 # from `pvs --noheadings -o vg_name ` + metadataSourceDevice: /dev/vdd # new env ROOK_METADATA_SOURCE_DEVICE (added by this design); absent for raw-mode OSDs + metadataVG: ceph-metadata-vg-1 # from `pvs --noheadings -o vg_name ` on the OSD's host (metadata device must still be readable) crushDeviceClass: hdd # OSD deployment env ROOK_OSD_CRUSH_DEVICE_CLASS - databaseSizeMB: 4096 # from `lvs --noheadings -o lv_size ` ÷ 1MiB - encrypted: true # from LV tag `ceph.encrypted` on - osdFsid: 8b7e6c19-... # from `ceph osd dump --format json` + databaseSizeMB: 1500 # from `ceph-volume lvm list --format json` lv_size (bytes) / 1048576 + encrypted: true # from `ceph-volume lvm list --format json` tags.ceph.encrypted + osdFsid: 07bb0602-5e27-4fcc-86b1-c1faa0bc20ac # from `ceph osd dump --format json`: `.osds[id=].uuid` # populated on phase=Completed - newFsid: "" # for audit only; never used for re-arming + newFsid: "" # recorded on Completed completedAt: null ``` @@ -175,13 +175,13 @@ status: #### Coordination -Replacements run serially cluster-wide as a simplifying choice, matching cephadm's `osd rm` queue and Rook's existing OSD migration. Per-OSD `safe-to-destroy` only returns OK once the OSD is fully drained from every PG's acting set, so concurrent destroys of independently-safe OSDs are technically safe — but serial keeps the operational model simple. +Replacements run serially per CephCluster as a simplifying choice, matching cephadm's `osd rm` queue and Rook's existing OSD migration. Per-OSD `safe-to-destroy` only returns OK once the OSD is fully drained from every PG's acting set, so concurrent destroys of independently-safe OSDs are technically safe — but serial keeps the operational model simple. The queue is implemented via a `Pending` phase. 
Each reconcile, the controller lists peer `CephOSDReplace` CRs in the same namespace targeting the same cluster. If no earlier-`creationTimestamp` peer is in a non-terminal phase, this CR advances to `Validating`; otherwise it stays in `Pending` and re-checks next reconcile. UID breaks same-second ties. > Extending CephCluster with a `spec.storage.replaceOSD` field needs no coordination logic — a single field admits only one in-flight replacement. -In both shapes, the cluster controller's auto-provisioning must skip nodes with an active replacement — otherwise the empty replacement disk gets claimed with a fresh ID before the Prepare step can use it. Without explicit trigger, Rook has no way to tell a replacement disk from a new disk (see [Constraints](#rook-cannot-tell-a-replacement-disk-from-a-new-disk)). +In both shapes, the cluster controller's auto-provisioning must skip nodes with an active replacement — otherwise the empty replacement disk gets claimed with a fresh ID before the Prepare step can use it. Mechanism: the replacement controller publishes an `OSDReplacementInProgress` condition on `CephCluster.status` (with the affected node listed); the cluster controller's auto-provisioning reads this condition before spawning prepare-jobs. Without explicit trigger, Rook has no way to tell a replacement disk from a new disk (see [Constraints](#rook-cannot-tell-a-replacement-disk-from-a-new-disk)). #### Phase state machine @@ -209,8 +209,6 @@ Per-phase behavior: User-visible: `Ready=True` on `Completed`, `Ready=False` otherwise; `reason` carries the current phase or a typed terminal reason. -> **⚠️ Destroy is irreversible.** Once `Validating` passes, `osd.5` will be destroyed on the next reconcile. If the user typed the wrong OSD ID, the wrong OSD is gone. - ### Step-by-step The walk-through uses the running example. 
@@ -221,16 +219,18 @@ Typical case is a failed disk: Ceph auto-marks the OSD `down` and `out` after `m Healthy (up+in) OSDs require either `ceph osd out` first or `spec.autoOut: true` — see [Validate](#2-validate). -On creation, the CR enters `Pending` and waits for any in-flight replacement to terminate. Once cleared, it advances to `Validating`. The disk can be swapped any time after the CR is applied — the Capture step tolerates a missing data device. +On creation, the CR enters `Pending` and waits for any in-flight replacement to terminate. Once cleared, it advances to `Validating`. The disk can be swapped any time after the CR is applied — Destroy's capture step tolerates a missing data device. #### 2. Validate +> **⚠️ Destroy is irreversible.** Once Validate passes, `osd.5` is destroyed on the next reconcile. If the user typed the wrong OSD ID, the wrong OSD is gone. + Run each reconcile cycle until all checks pass or one fails terminally: 1. **Confirmation matches.** `spec.confirmation` must equal `"yes-really-replace-osd-{spec.osdId}"`. → `Failed` with `reason=InvalidSpec` on mismatch (typo guard). 2. **Target OSD exists** in the OSD map. → `Failed` with `reason=InvalidSpec` if absent. 3. **Target OSD is destroyable.** If the OSD is `up && in`: with `spec.autoOut: false` (default), → `Failed` with `reason=OSDStillIn`. With `spec.autoOut: true`, the operator runs `ceph osd out ` once at entry and falls through to check 5. -4. **CR-level device matching is swap-tolerant.** → `Failed` with `reason=InvalidSpec` if the OSD's data device is referenced by an unstable name in the CR (rules per [U-6](#open-questions)). +4. **CR-level device matching is swap-tolerant.** The OSD's data device must be referenced via `useAllDevices`, `deviceFilter`, or a `/dev/disk/by-path/...` path (same-slot replacement only). Kernel names (`vdb`, `/dev/sda`), `by-id`, and `by-uuid` references are rejected — they can't resolve to a fresh disk. 
→ `Failed` with `reason=InvalidSpec` on rejection. (Whether to make this configurable is [open question 6](#open-questions).) 5. **`safe-to-destroy ` returns OK.** Returns EBUSY while any PG still has the OSD in its acting set — the only safety gate (`down`/`out` alone is not sufficient because data may not have replicated). EBUSY → stay in `Validating`, re-check next reconcile. `spec.safeToDestroyTimeout` exceeded → `Failed` with `reason=NotSafeToDestroy`. @@ -243,31 +243,31 @@ Before deleting the deployment, the operator captures `.status.osdInfo` (sources Then the operator calls `k8sutil.DeleteDeployment` ([`deployment.go#L388`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/k8sutil/deployment.go#L388)) on `rook-ceph-osd-5` and polls until the pod is gone. The pod-gone wait is required: while the daemon runs, it holds the DB-side LUKS mapping open and the next step's `cryptsetup close` would fail. If the wait times out (transient NotReady node), the operator re-checks on the next reconcile. No force-delete — a stuck pod on a NotReady node may still hold the LUKS mapping when kubelet recovers. -The Destroy Job's container invokes `DestroyOSD` ([`remove.go#L244-L292`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/remove.go#L244-L292)) — the same Go function the existing migration flow already calls from [`cmd/rook/ceph/osd.go#L272`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/cmd/rook/ceph/osd.go). The bash below specifies what `DestroyOSD` must do (today it only handles the first step and a partial last step). Each operation is idempotent on retry. 
+The Destroy Job's container invokes `DestroyOSD` ([`remove.go#L244-L290`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/remove.go#L244-L290)) — the same Go function the existing migration flow already calls from [`cmd/rook/ceph/osd.go#L272`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/cmd/rook/ceph/osd.go#L272). The bash below specifies what `DestroyOSD` must do (today it only handles the first step and a partial last step). Each operation is idempotent on retry. Pod profile is cloned from the existing `c.provisionPodTemplateSpec` — inherits `DM_DISABLE_UDEV=1` (required to bypass udev sync inside privileged containers) and the `cephx-keyring-update` init container. ```bash # Destroy in Ceph (preserves OSD ID 5 for reuse). ceph osd destroy osd.5 --yes-i-really-mean-it # Remove dm-crypt key (no-op on Ceph v19+; defensive for older versions). -ceph config-key exists dm-crypt/osd/8b7e6c19-.../luks \ - && ceph config-key rm dm-crypt/osd/8b7e6c19-.../luks +ceph config-key exists dm-crypt/osd//luks \ + && ceph config-key rm dm-crypt/osd//luks # Close DB-side LUKS mapping. -DB_MAPPING=$(lsblk -nlo NAME,TYPE /dev/ceph-metadata-vg-1/osd-db-bbb... | awk '$2=="crypt"{print $1; exit}') +DB_MAPPING=$(lsblk -nlo NAME,TYPE /dev/ceph-metadata-vg-1/osd-db- | awk '$2=="crypt"{print $1; exit}') [ -n "$DB_MAPPING" ] && cryptsetup status "$DB_MAPPING" >/dev/null 2>&1 \ && cryptsetup close "$DB_MAPPING" # Free the DB slot. -lvs /dev/ceph-metadata-vg-1/osd-db-bbb... >/dev/null 2>&1 \ - && lvremove -f /dev/ceph-metadata-vg-1/osd-db-bbb... +lvs /dev/ceph-metadata-vg-1/osd-db- >/dev/null 2>&1 \ + && lvremove -f /dev/ceph-metadata-vg-1/osd-db- # Zap the data LV (also handles the data-side dm-crypt mapping). -lvs /dev/ceph-data-vg-5/osd-block-aaa... >/dev/null 2>&1 \ - && ceph-volume lvm zap /dev/ceph-data-vg-5/osd-block-aaa... 
--destroy +lvs /dev/ceph-data-vg-5/osd-block- >/dev/null 2>&1 \ + && ceph-volume lvm zap /dev/ceph-data-vg-5/osd-block- --destroy ``` -After the Job completes, operator advances the record to `phase: Waiting`. +After the Job completes, operator advances the record to `phase: Waiting`. The captured `osdInfo.dbLV` now refers to a removed LV — only its parent VG (`metadataVG`) is live; Prepare uses that VG to create a fresh DB LV. #### 4. Wait for replacement disk @@ -277,19 +277,19 @@ Timeout per `spec.diskWaitTimeout` (default 24h) → `Failed` with `reason=Repla #### 5. Prepare -Phase `Preparing` (entered when the replacement disk is visible). The operator generates a fresh UUID for the new DB LV and passes it as an env var on the Job (same pattern as `ROOK_REPLACE_OSD` in `provision_spec.go:317,322`); pod retries reuse the same env. The Job runs: +Phase `Preparing` (entered when the replacement disk is visible). The operator generates a fresh UUID for the new DB LV and passes it as an env var on the Job (same pattern as `ROOK_REPLACE_OSD` in `provision_spec.go:317,322`); pod retries reuse the same env. Pod profile is cloned from `c.provisionPodTemplateSpec` (inherits `DM_DISABLE_UDEV=1` and the `cephx-keyring-update` init container — without these, `lvcreate`/`cryptsetup` hangs in `udev_wait` and `ceph-volume lvm prepare` fails with `RADOS permission denied`). The Job runs: ```bash -# Pre-allocate the DB LV using the persisted name; skip if it already exists. +# Pre-allocate the DB LV using the name from the Job env; skip if it already exists. lvs /dev/ceph-metadata-vg-1/osd-db-12cf3a91-... >/dev/null 2>&1 \ - || lvcreate -L 4096M -n osd-db-12cf3a91-... ceph-metadata-vg-1 --wipesignatures y + || lvcreate -L 1500M -n osd-db-12cf3a91-... ceph-metadata-vg-1 --wipesignatures y # Provision the new OSD with the preserved ID. --dmcrypt only when the record's # `encrypted` field is true. 
ceph-volume lvm prepare \ --bluestore [--dmcrypt] \ --osd-id 5 \ - --data /dev/sdh \ + --data /dev/vdh \ --block.db /dev/ceph-metadata-vg-1/osd-db-12cf3a91-... \ --crush-device-class hdd ``` @@ -298,11 +298,11 @@ The Job writes the new OSD's info to `rook-ceph-osd--status` (the per-node #### 6. Activate -Reuses the existing path: `createOSDsForStatusMap` ([`status.go#L324`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/cluster/osd/status.go#L324)) sees the per-node status CM the Prepare Job wrote and creates the daemon Deployment from it. The new deployment carries `ROOK_METADATA_DEVICE` and `ROOK_METADATA_SOURCE_DEVICE` directly — future replacement of this OSD won't need the Capture fallback. +Reuses the existing path: `createOSDsForStatusMap` ([`status.go#L324`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/cluster/osd/status.go#L324)) sees the per-node status CM the Prepare Job wrote and creates the daemon Deployment from it. The new deployment carries `ROOK_METADATA_DEVICE` and `ROOK_METADATA_SOURCE_DEVICE` directly — future replacement of this OSD won't need the older-version fallback in Destroy's capture step. #### 7. Complete -While in `Preparing`, the controller calls `ceph osd metadata ` each reconcile. Ready = a record returned with a non-empty fsid, matching `id`, and matching `hostname`. On Ready, transition to `phase: Completed` and record `newFsid` and `completedAt`. +While in `Preparing`, the controller checks readiness each reconcile via `ceph osd tree`: the OSD must be `up` AND `in`. Then captures `osd_uuid` from `ceph osd metadata ` and transitions to `phase: Completed` with `newFsid` and `completedAt` recorded. (`ceph osd metadata` populates as soon as the daemon registers — gating on `tree`'s `up`+`in` waits until the daemon has actually joined the cluster.) 
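The two spec-level gates from Validate (check 1, the confirmation string, and check 4, the swap-tolerance rule for `nodes[].devices[].name`) are pure string checks, so they can be sketched without touching Ceph. This is a minimal sketch, not Rook code: the function names `confirmation_ok` and `device_ref_verdict` and the `verdict:reason` output format are hypothetical, and the fallback branch assumes that anything outside `/dev/disk/by-*` is a kernel-style name.

```bash
# Hypothetical helpers illustrating Validate checks 1 and 4.
# Names and output format are illustrative assumptions, not Rook APIs.

# Check 1: spec.confirmation must equal "yes-really-replace-osd-<osdId>".
# Returns 0 on match, 1 on mismatch (typo guard).
confirmation_ok() {
  osd_id="$1"
  confirmation="$2"
  [ "$confirmation" = "yes-really-replace-osd-${osd_id}" ]
}

# Check 4: classify a nodes[].devices[].name value.
# by-path is accepted (same-slot replacement only); by-id, by-uuid and
# kernel names are rejected because they cannot resolve to a fresh disk.
device_ref_verdict() {
  case "$1" in
    /dev/disk/by-path/*) echo "accepted:by-path" ;;
    /dev/disk/by-id/*)   echo "rejected:by-id" ;;
    /dev/disk/by-uuid/*) echo "rejected:by-uuid" ;;
    # Assumption: everything else (vdb, /dev/sda, ...) is a kernel name.
    *)                   echo "rejected:kernel-name" ;;
  esac
}
```

Either rejection would map to `Failed` with `reason=InvalidSpec` in the flow above; `useAllDevices` and `deviceFilter` never reach `device_ref_verdict` because they carry no per-device name.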
### Cancellation @@ -334,7 +334,7 @@ nodes: This setup requires exact `name:` (or `fullpath:`) references — the per-device `config:` block can only be attached to a specific device entry, not to a regex match. Replacement of a single OSD on this setup works structurally (each OSD's `metadataSourceDevice` is captured in its `osdInfo` at destroy time), with two caveats: -- **Device-name validation must permit exact entries** — open question. +- **Device-name validation must permit exact entries** — see [open question 6](#open-questions). - **Same-slot replacement is required** — `by-path` resolves only when the new disk is in the original slot. Different-slot replacement stalls in the Wait step. The broader multi-metadata-device feature work (improvements to per-device `metadataDevice` UX, multi-NVMe-per-node setups) was scoped separately by maintainers in [#13240](https://github.com/rook/rook/issues/13240) (tracked by `zhucan`). This design does not add new logic for that setup — it just doesn't actively forbid replacement on it under permissive validation. @@ -355,11 +355,11 @@ If the OSD's host is gone, this flow cannot proceed (the Destroy step requires t 3. **Auto-replace mode.** The proposed flow is always triggered explicitly by the user (CR creation). Should there be a follow-up option for automated replacement that triggers the same flow when a failed OSD and a fresh disk are detected on a node? -4. **Default values.** Are these defaults reasonable: `safeToDestroyTimeout: 1h`, `diskWaitTimeout: 24h`, disk-wait re-check interval `5 min` (cluster-config tunable)? +4. **Default values.** Proposed: `safeToDestroyTimeout: 1h`, `diskWaitTimeout: 24h`, disk-wait re-check interval `5 min` (cluster-config tunable). Reasonable, or do reviewers see a reason to change them? 5. **Disk-swap responsiveness.** Can the design rely on Rook's existing discovery (rook-discover when enabled, otherwise the per-reconcile prepare-job inventory) to detect the replacement disk? 
Expected latency? Caveats when the cluster uses `useAllDevices` vs. regex `deviceFilter` vs. exact `name:` entries? -6. **Device-name validation.** Should the operator validate that the OSD's data-device reference in `spec.storage.nodes[*].devices[*]` is swap-tolerant before destroying, or is this the user's responsibility? Sample to consider: +6. **Device-name validation.** Proposed: reject kernel names (`/dev/sda`, `vda`), `by-id`, and `by-uuid` references; accept `useAllDevices`, `deviceFilter`, and `by-path` (with implicit same-slot expectation). Sample: ```yaml spec: @@ -367,7 +367,9 @@ If the OSD's host is gone, this flow cannot proceed (the Destroy step requires t nodes: - name: node-1 devices: - - name: /dev/sda # kernel name — not swap-stable - - name: /dev/disk/by-path/... # by-path — same-slot swap only + - name: /dev/sda # kernel name — rejected (not swap-stable) + - name: /dev/disk/by-path/... # by-path — accepted (same-slot only) ``` + Should this be configurable, more permissive (user takes responsibility for any name), or stricter (reject `by-path` too)? + From 30a329de8178596f5a7f276d3ef8facce38b759d Mon Sep 17 00:00:00 2001 From: Artem Torubarov Date: Wed, 6 May 2026 13:23:12 +0200 Subject: [PATCH 05/12] docs: prepare job discover only mode Signed-off-by: Artem Torubarov --- osd-design.md | 27 ++++++++++++++++++++++----- 1 file changed, 22 insertions(+), 5 deletions(-) diff --git a/osd-design.md b/osd-design.md index c104fb27ec88..882654642575 100644 --- a/osd-design.md +++ b/osd-design.md @@ -21,7 +21,11 @@ A disk corresponding to `osd.5` fails on a node where five HDD OSDs share one NV ## Constraints -Two facts about the environment shape every later choice in this design. +Three facts about the environment shape every later choice in this design. + +### Replacement is same-host + +The new disk must go to the same host as the destroyed OSD. 
The captured `metadataVG` is host-local, and the Prepare Job runs `ceph-volume lvm prepare --block.db /dev//...` against it. Cross-host replacement is permitted by Ceph but out of scope here. ### Rook cannot tell a replacement disk from a new disk @@ -51,7 +55,7 @@ The replacement flow must validate the affected OSD's CR references beforehand s ## Current gaps -Rook has no automated flow for replacing a failed OSD today. The closest existing primitive is the migration flow (`spec.storage.migration.confirmation`), which recreates OSDs in place after encryption or store-type spec changes: it destroys the OSD and re-prepares with `ceph-volume raw prepare --osd-id` via the `ROOK_REPLACE_OSD` env var. Migration only covers raw-mode OSDs; the shared-metadata case needs six additional fixes: +Rook has no automated flow for replacing a failed OSD today. The closest existing primitive is the migration flow (`spec.storage.migration.confirmation`), which recreates OSDs in place after encryption or store-type spec changes: it destroys the OSD and re-prepares with `ceph-volume raw prepare --osd-id` via the `ROOK_REPLACE_OSD` env var. Migration only covers raw-mode OSDs; the shared-metadata case needs seven additional fixes: 1. The replacement code path runs only in raw mode; LVM mode (required when a metadata device is configured) does not pass `--osd-id`, so the new OSD gets a new ID. (`initializeDevicesLVMMode`, [volume.go#L587-L847](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/volume.go#L587-L847)) 2. Destroy zaps only the data LV; the DB LV on the shared metadata disk stays as an orphan. (`DestroyOSD`, [remove.go#L244-L290](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/remove.go#L244-L290)) @@ -59,6 +63,7 @@ Rook has no automated flow for replacing a failed OSD today. The closest existin 4. 
**The prepare-pod can't find a shared metadata disk once any OSD lives on it.** Rook's disk-discovery (`DiscoverDevicesWithFilter`, [disk.go#L97-L111](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/clusterd/disk.go#L97-L111)) skips any disk with `len(deviceChild) > 1` as a guard against claiming a user-partitioned disk. From the first OSD onward — encrypted or not — that count is ≥ 2 (parent + LV, plus crypt mapping if encrypted), so the filter triggers and the prepare-pod's `initializeDevicesLVMMode` errors with `metadata device is not found`. Same root cause as upstream issues [#15868](https://github.com/rook/rook/issues/15868) and parts of [#17477](https://github.com/rook/rook/issues/17477). 5. `OSDInfo.MetadataPath` is not populated by the prepare-job re-discovery path on LVM-mode OSDs (`GetCephVolumeLVMOSDs` walks only `[block]` entries from `ceph-volume lvm list`). The operator-side path (`getOSDInfo` at [osd.go#L748](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/cluster/osd/osd.go#L748)) does populate it from `ROOK_METADATA_DEVICE` env, but anything that goes through re-discovery (a redeploy after CM loss) loses the field. ([volume.go#L1082-L1177](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/volume.go#L1082-L1177)) 6. The OSD deployment has no env carrying the DB LV's *source physical device* — only the LV path itself (`ROOK_METADATA_DEVICE`). At replacement time, the operator needs the source device to resolve the metadata VG and to identify the disk for the auto-provisioning skip. This design introduces a new `ROOK_METADATA_SOURCE_DEVICE` env, plumbed through the daemon-deployment spec. +7. 
**No discover-only mode in the prepare-job.** The prepare-job conflates discovery and provisioning in a single sequential pass (`Provision`, [daemon.go#L159-L283](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/daemon.go#L159-L283)) — no way to inventory a node without auto-claiming any empty disks it finds. The replacement flow needs "scan but don't claim" so the empty replacement disk doesn't get auto-provisioned with a fresh ID before the operator can drive `ceph-volume lvm prepare --osd-id`. This design adds a `ROOK_DISCOVER_ONLY` mode to the prepare-job; the cluster controller passes it for nodes with an active `CephOSDReplace`. ## Proposed flow @@ -181,7 +186,17 @@ The queue is implemented via a `Pending` phase. Each reconcile, the controller l > Extending CephCluster with a `spec.storage.replaceOSD` field needs no coordination logic — a single field admits only one in-flight replacement. -In both shapes, the cluster controller's auto-provisioning must skip nodes with an active replacement — otherwise the empty replacement disk gets claimed with a fresh ID before the Prepare step can use it. Mechanism: the replacement controller publishes an `OSDReplacementInProgress` condition on `CephCluster.status` (with the affected node listed); the cluster controller's auto-provisioning reads this condition before spawning prepare-jobs. Without explicit trigger, Rook has no way to tell a replacement disk from a new disk (see [Constraints](#rook-cannot-tell-a-replacement-disk-from-a-new-disk)). +#### Auto-provisioning skip + +The Rook cluster controller spawns the prepare-job, which by default auto-discovers devices and provisions new OSDs. To make the replacement flow work, the cluster controller must run the prepare-job in "discover only" mode on a node where a replacement is running — discovery happens, provisioning doesn't. 
+ +In the existing cluster controller, add a gate before each `runPrepareJob` call in `startProvisioningOverNodes` ([create.go#L345](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/cluster/osd/create.go#L345)): list `CephOSDReplace` CRs whose `rook.io/osd-replacement-node` label equals the current node; if any is in a non-terminal phase, launch the Job with `ROOK_DISCOVER_ONLY=true` in its env. The replacement controller stamps this label on its CR at creation, reading the node from the target OSD's deployment. It clears the label on transition to a terminal phase; the deletion finalizer is a backup. + +In discover-only mode, the prepare-job runs the same discovery code as a normal run (`DiscoverDevicesWithFilter` + `getAvailableDevices` at [daemon.go#L341](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/daemon.go#L341)) and writes the eligible-device list to the existing per-node status CM (`rook-ceph-osd--status`) — without running `lvm batch`. + +When the discovery DaemonSet is enabled (`ROOK_ENABLE_DISCOVERY_DAEMON=true`), `local-device-` is updated on udev events with seconds latency. The replacement controller may watch that CM as a fast wake-up signal, but treats the discover-only status CM as authoritative — `local-device-` is unfiltered (does not apply the cluster's `deviceFilter`/`useAllDevices`). + +> With the cluster-CR fallback (`spec.storage.replaceOSD` on CephCluster), the cluster controller reads its own spec field instead of listing CRs — same flag plumbing. #### Phase state machine @@ -223,7 +238,7 @@ On creation, the CR enters `Pending` and waits for any in-flight replacement to #### 2. Validate -> **⚠️ Destroy is irreversible.** Once Validate passes, `osd.5` is destroyed on the next reconcile. If the user typed the wrong OSD ID, the wrong OSD is gone. +> ** Destroy is irreversible.** Once Validate passes, `osd.5` is destroyed on the next reconcile. 
If the user typed the wrong OSD ID, the wrong OSD is gone. Run each reconcile cycle until all checks pass or one fails terminally: @@ -271,7 +286,7 @@ After the Job completes, operator advances the record to `phase: Waiting`. The c #### 4. Wait for replacement disk -Each reconcile of the CR, the controller checks if the replacement disk is visible on the node; if not, it requeues. Discovery uses Rook's existing paths — `rook-discover` (when enabled) for udev-event-driven updates, otherwise the operator's per-reconcile prepare-job inventory. The cluster controller skips auto-provisioning on the node while this CR is active (see [Coordination](#coordination)). +Each reconcile of the CR, the controller checks if the replacement disk is visible by reading the per-node status CM (`rook-ceph-osd--status`, populated by the cluster controller's prepare-job in discover-only mode — see [Auto-provisioning skip](#auto-provisioning-skip)). If the empty replacement disk isn't there yet, requeue. When it appears, advance to `Preparing`. Timeout per `spec.diskWaitTimeout` (default 24h) → `Failed` with `reason=ReplacementDiskMissing`. After timeout, the user can either insert the disk and create a new CR for the same OSD ID, or delete the CR (osd stays destroyed; `ceph osd purge` to free the slot). @@ -373,3 +388,5 @@ If the OSD's host is gone, this flow cannot proceed (the Destroy step requires t Should this be configurable, more permissive (user takes responsibility for any name), or stricter (reject `by-path` too)? +7. **Cross-host replacement for non-shared-metadata OSDs.** Same-host is required by this design because the captured `metadataVG` lives on the original host. For OSDs without a metadata device this argument doesn't apply. Ceph itself permits cross-host replacement: `ceph osd destroy` retains no host info; CRUSH auto-relocates the OSD on daemon start at the cost of full PG remapping. Should this flow be supported by Rook osd replacement? 
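The auto-provisioning skip gate introduced in this revision reduces to one decision per node: run the prepare-job normally, or run it with `ROOK_DISCOVER_ONLY=true`. The sketch below shows only that decision, under stated assumptions: the input is one `<node-label> <phase>` pair per line (standing in for the listed `CephOSDReplace` CRs and their `rook.io/osd-replacement-node` label), the function names are hypothetical, and `Completed` and `Failed` are assumed to be the only terminal phases.

```bash
# Hypothetical sketch of the per-node gate; not Rook code.

# Assumption: Completed and Failed are the only terminal phases.
is_terminal_phase() {
  case "$1" in
    Completed|Failed) return 0 ;;
    *)                return 1 ;;
  esac
}

# Reads "<node> <phase>" lines on stdin, one per CephOSDReplace CR.
# Prints "discover-only" if any CR for the given node is non-terminal,
# otherwise "provision".
prepare_job_mode() {
  node="$1"
  while read -r cr_node phase; do
    if [ "$cr_node" = "$node" ] && ! is_terminal_phase "$phase"; then
      echo "discover-only"
      return 0
    fi
  done
  echo "provision"
}
```

For example, `printf 'node-1 Waiting\n' | prepare_job_mode node-1` prints `discover-only`, while the same input queried for `node-2` prints `provision`, so sibling nodes keep provisioning normally during a replacement.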
+ From 0ea2f3f7226a1d0a4b54f75f3cf94d5835aba0b2 Mon Sep 17 00:00:00 2001 From: Artem Torubarov Date: Wed, 6 May 2026 20:54:35 +0200 Subject: [PATCH 06/12] docs: move to a single replace job Signed-off-by: Artem Torubarov --- design/ceph/osd-replacement.md | 374 +++++++++++++++++++++++++++++++ osd-design.md | 392 --------------------------------- 2 files changed, 374 insertions(+), 392 deletions(-) create mode 100644 design/ceph/osd-replacement.md delete mode 100644 osd-design.md diff --git a/design/ceph/osd-replacement.md b/design/ceph/osd-replacement.md new file mode 100644 index 000000000000..7ba7551c022f --- /dev/null +++ b/design/ceph/osd-replacement.md @@ -0,0 +1,374 @@ +# Design: Single OSD replacement with a shared metadata device + +Issue: [rook/rook#13240](https://github.com/rook/rook/issues/13240) + +## Problem + +When an OSD's data and metadata live on different devices (per `spec.storage` `metadataDevice` config in the CephCluster CR), Rook today cannot replace a single failed OSD on its own. The user must either re-provision all OSDs sharing the same metadata device or run a multi-step manual workflow including scaling down the operator to zero. Raw-mode OSDs (data and metadata on a single disk) follow a similar manual procedure today, with fewer steps. + +This design proposes a workflow to replace a single failed OSD in place — preserving its OSD ID — without affecting other OSDs sharing the same metadata device. + +## Notation + +- **User** - the human cluster admin who edits the CR. +- **Operator** - the Rook controller process. +- **Data LV / data device** - the LV (or block device) holding an OSD's bulk data. One per OSD. +- **DB LV / metadata device** - the LV holding the OSD's rocksdb (`block.db`). One per OSD; multiple OSDs can share the same metadata device. + +## User story + +A disk corresponding to `osd.5` fails on a node where five HDD OSDs share one NVMe metadata device. 
The user marks `osd.5` for replacement in the Rook CR, swaps the physical disk in the chassis, and walks away. Rook destroys `osd.5`, frees its DB LV slot on the NVMe, provisions a new OSD on the replacement disk *with the same OSD ID 5*, and the other four OSDs on the same NVMe stay up the whole time. + +## Constraints + +### Replacement is same-host + +The new disk must go to the same host as the destroyed OSD: the DB slot freed by destroying the old OSD lives on a metadata device attached to that host, and the replacement OSD's DB must reuse it. Cross-host replacement is permitted by Ceph but out of scope here. + +### Rook cannot tell a replacement disk from a new disk + +When a fresh empty disk appears on a node, Rook has no way to tell it's the replacement for a failed OSD. With `useAllDevices` or a matching `deviceFilter`, the next reconcile auto-provisions the new disk with a fresh ID and leaks the failed OSD's resources. The user must mark the OSD for replacement in the CR *before* swapping the disk. + +### Storage device config must tolerate device swap + +Rook lets users identify OSD data devices via `spec.storage`: + +- `useAllDevices: true` — match any empty disk on the node. +- `deviceFilter: ""` — match disks whose `lsblk` properties match a regex. +- `nodes[].devices[].name: ""` — match a specific path or name. Accepts a kernel name (`vdb`), a raw path (`/dev/sdc`), or a udev symlink (`/dev/disk/by-path/...`, `/dev/disk/by-id/...`). +- `nodes[].devices[].fullpath: ""` — explicit DevLinks match (`/dev/disk/by-id/...`, `/dev/disk/by-path/...`). Compared against discovered symlinks, not regex. + +Each shape interacts differently with the Linux device-naming interfaces: + +- **Kernel names** (`vdb`, `sdc`, `/dev/sdc`) are assigned by the kernel at boot and [not guaranteed to be persistent](https://wiki.archlinux.org/title/Persistent_block_device_naming). 
+- **`/dev/disk/by-path/...`** is a udev symlink built from the sysfs port path: same physical port, same symlink. +- **`/dev/disk/by-id/...`** is a udev symlink built from the disk's hardware serial / WWN, unique per physical disk. +- **`/dev/disk/by-uuid/...`** is a udev symlink built from the filesystem or LV UUID, assigned at provisioning time. + +The shapes that tolerate any swap (same-slot or different-slot, any new disk) are `useAllDevices` and `deviceFilter`. `by-path` tolerates only same-slot replacement. Kernel names tolerate only the lucky case where the kernel happens to assign the same name. `by-id`/`by-uuid` references in `name`/`fullpath` cannot work for a disk that hasn't been seen yet. + +The replacement flow must validate the affected OSD's CR references beforehand so the new disk is still resolvable under those references after the swap. + +## Current gaps + +Rook has no automated flow for replacing a failed OSD today. The closest existing primitive is the migration flow (`spec.storage.migration`), which recreates OSDs in place after encryption or store-type spec changes: it destroys the OSD and re-prepares with `ceph-volume raw prepare --osd-id` via the `ROOK_REPLACE_OSD` env var. Migration only covers raw-mode OSDs; the shared-metadata case needs the following fixes: + +1. `DestroyOSD` cleans up only the data LV. The DB LV on the shared metadata disk stays as an orphan, and the dm-crypt key in Ceph's config-key store is never removed (causing LUKS collisions on retry of encrypted OSDs). (`DestroyOSD`, [remove.go#L244-L290](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/remove.go#L244-L290)) +2. 
**The prepare-pod can't find a shared metadata disk once any OSD lives on it.** Rook's disk-discovery (`DiscoverDevicesWithFilter`, [disk.go#L97-L111](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/clusterd/disk.go#L97-L111)) skips any disk with `len(deviceChild) > 1` as a guard against claiming a user-partitioned disk. From the first OSD onward — encrypted or not — that count is ≥ 2 (parent + LV, plus crypt mapping if encrypted), so the filter triggers and the prepare-pod's `initializeDevicesLVMMode` errors with `metadata device is not found`. Same root cause as upstream issues [#15868](https://github.com/rook/rook/issues/15868) and parts of [#17477](https://github.com/rook/rook/issues/17477). +3. The OSD deployment has no env carrying the DB LV's *source physical device* — only the LV path itself (`ROOK_METADATA_DEVICE`). At replacement time, the operator needs the source device to resolve the metadata VG and to identify the disk for the auto-provisioning skip. This design introduces a new `ROOK_METADATA_SOURCE_DEVICE` env, plumbed through the daemon-deployment spec. +4. **No discover-only mode in the prepare-job.** The prepare-job conflates discovery and provisioning in a single sequential pass (`Provision`, [daemon.go#L159-L283](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/daemon.go#L159-L283)) — no way to inventory a node without auto-claiming any empty disks it finds. The replacement flow needs "scan but don't claim" so the empty replacement disk doesn't get auto-provisioned with a fresh ID before the operator can drive `ceph-volume lvm prepare --osd-id`. This design adds a `ROOK_DISCOVER_ONLY` mode to the prepare-job; the cluster controller passes it for nodes with an active `CephOSDReplace`. 
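The discover-only split described in gap 4 can be sketched in shell. This is only an illustration under stated assumptions: the real prepare-job is Go code inside Rook, `ROOK_DISCOVER_ONLY` is the env proposed by this design rather than an existing flag, and `discover_devices`, `provision_device`, and `run_prepare` are hypothetical names, not Rook APIs.

```shell
# Illustrative sketch only; every function name here is hypothetical.

# Stand-in for device discovery: prints one eligible (empty) device per line.
discover_devices() {
    printf '%s\n' "$@"
}

# Stand-in for provisioning; a real run would call ceph-volume here.
provision_device() {
    echo "provision $1"
}

# Discovery always runs and reports its inventory (standing in for the
# per-node status CM write). Provisioning is skipped when the proposed
# ROOK_DISCOVER_ONLY env is set to "true".
run_prepare() {
    devices=$(discover_devices "$@")
    for dev in $devices; do
        echo "eligible $dev"
    done
    if [ "${ROOK_DISCOVER_ONLY:-false}" = "true" ]; then
        return 0
    fi
    for dev in $devices; do
        provision_device "$dev"
    done
}

# Usage (not executed here):
#   ROOK_DISCOVER_ONLY=true run_prepare /dev/vdh
```

The ordering is the point of the sketch: the inventory is emitted before the flag is checked, so a node with an active replacement still reports the new disk but never claims it.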
+
+## Proposed flow
+
+This flow orchestrates [Ceph's documented OSD-replacement procedure](https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/#replacing-an-osd) (`safe-to-destroy` → `osd destroy` → `lvm zap` → `lvm prepare --osd-id` → `lvm activate`) inside a single short-lived Kubernetes Job, with state machine maintained in Rook CR status. `cephadm` — Ceph's container-orchestrator analogue — preserves OSD IDs by default ([cephadm OSD service docs](https://docs.ceph.com/en/latest/cephadm/services/osd/#replacing-an-osd)); this design follows the same convention.
+
+### Sequence
+
+```mermaid
+sequenceDiagram
+    autonumber
+    actor User
+    participant CR as Rook CR
+    participant Op as Operator
+    participant OldPod as Old OSD pod
+    participant RJ as Replace Job
+    participant Ceph as Ceph
+
+    User->>CR: create CR with osd id 5
+    Op->>CR: read CR
+    Op->>CR: write phase=Validating
+    Op->>Ceph: ceph osd dump (validate exists, get fsid)
+    Op->>CR: write phase=Waiting
+    Note over Op: each reconcile, poll inventory CM and requeue until disk visible
+    Note over User,Op: User swaps the failed disk
+    Op->>CR: write phase=Replacing
+    Note over User,Ceph: from here, cancellation has side effects (see Cancellation)
+    Op->>Ceph: ceph osd out 5 (if autoOut and up+in)
+    Op->>Ceph: safe-to-destroy 5 (poll until OK)
+    Op->>OldPod: read deployment env
+    Op->>CR: capture OSD info
+    Op->>OldPod: delete deployment
+    destroy OldPod
+    Op->>OldPod: wait for pod termination
+    Op->>+RJ: create (env from osdInfo)
+    RJ->>Ceph: safe-to-destroy 5 (re-check)
+    RJ->>Ceph: ceph osd destroy osd.5
+    RJ->>Ceph: config-key rm dm-crypt key (if encrypted)
+    RJ->>RJ: cryptsetup close db mapping (if encrypted)
+    RJ->>RJ: lvremove db lv
+    RJ->>RJ: ceph-volume lvm zap data lv
+    RJ->>RJ: lvcreate new db lv
+    RJ->>Ceph: ceph-volume lvm prepare --osd-id 5
+    RJ->>RJ: write new OSD info to per-node status CM
+    RJ-->>-Op: Succeeded
+    create participant NewPod as New OSD pod
+    Op->>NewPod: create deployment (id=5, dataLV, new dbLV, metadataSourceDevice, encrypted)
+    NewPod->>Ceph: join cluster
+    Op->>Ceph: ceph osd metadata 5 until Ready
+    Op->>CR: phase=Completed
+```
+
+### Open question: controller placement
+
+The diagram doesn't pick a concrete CR or controller for the replacement reconcile logic. Two candidates: extend the existing CephCluster controller (which already hosts `spec.storage.migration`), or introduce a separate `CephOSDReplace` CRD with its own controller. The design leans toward the separate CRD for the following reasons:
+
+1. **CephCluster's `reconcileCephDaemons` is monolithic and synchronous** — mon, mgr, and osd reconcile run sequentially in one call ([`cluster.go#L116-L160`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/cluster/cluster.go#L116-L160)); `osd.Cluster.Start()` returns plain `error`, so there's no way to express terminal failure (bad CR rejected) vs. transient `RequeueAfter` (waiting for disk-swap or Job completion). Adding long-running multi-step logic to this path interferes with mon/mgr reconcile and lacks the return semantics the flow needs.
+2. **Replacement state has to survive between reconciles**, and the cluster controller has no existing place to store sub-operation state — adding one (a side ConfigMap, or extending `CephCluster.status`) is part of the cost.
+
+Concrete shape of each candidate:
+
+- **Extend the cluster controller** — state in either a side ConfigMap (`osd-replacement-state`, similar to `osd-migration-config`) or `CephCluster.status`.
+- **New `CephOSDReplace` CRD + dedicated controller** — state on `.status`. Independent reconcile loop; never touches the existing OSD path. Light coupling on the cluster side: skip auto-provisioning on affected nodes.
+
+The rest of this design is based on a separate `CephOSDReplace` CRD, with implications for the cluster-CR fallback flagged inline.
+
+### CRD proposal
+
+Config lives on `CephOSDReplace.spec` and state in `.status`. `spec.cephCluster` and `spec.osdId` are immutable post-create. `.status` carries phase and conditions following the K8s operator pattern.
+
+```yaml
+apiVersion: ceph.rook.io/v1
+kind: CephOSDReplace
+metadata:
+  name: replace-osd-5
+  namespace: rook-ceph
+spec:
+  cephCluster: my-cluster  # immutable; target cluster in this namespace
+  osdId: 5  # immutable
+  confirmation: yes-really-replace-osd-5  # must equal "yes-really-replace-osd-{osdId}"; typo guard against destroying the wrong OSD
+  autoOut: false  # optional; if true, operator marks healthy OSD `out` automatically (during Replacing). Default: false (fail-fast on up+in at Validating)
+  safeToDestroyTimeout: 1h  # optional; how long Replacing tolerates EBUSY on safe-to-destroy before Failed. Default: 1h
+  diskWaitTimeout: 24h  # optional; how long Waiting tolerates a missing disk before Failed. Default: 24h
+
+status:
+  phase: Replacing  # Pending | Validating | Waiting | Replacing | Completed | Failed | Cancelled
+  conditions:
+    - type: Ready
+      status: "False"
+      reason: Replacing
+      message: Replace Job in flight
+      observedGeneration: 1
+      lastTransitionTime: "2026-05-05T12:00:00Z"
+
+  # captured at start of Replacing, before deployment delete
+  osdInfo:
+    node: node-1  # OSD deployment NodeSelector; survives the deployment delete
+    dataLV: /dev/ceph-data-vg-5/osd-block-aaa...  # OSD deployment env ROOK_BLOCK_PATH
+    dbLV: /dev/ceph-metadata-vg-1/osd-db-bbb...  # OSD deployment env ROOK_METADATA_DEVICE; absent for raw-mode OSDs
+    metadataSourceDevice: /dev/vdd  # new env ROOK_METADATA_SOURCE_DEVICE (added by this design); absent for raw-mode OSDs
+    metadataVG: ceph-metadata-vg-1  # from `pvs --noheadings -o vg_name ` on the OSD's host (metadata device must still be readable)
+    crushDeviceClass: hdd  # OSD deployment env ROOK_OSD_CRUSH_DEVICE_CLASS
+    databaseSizeMB: 1500  # from `ceph-volume lvm list --format json` lv_size (bytes) / 1048576
+    encrypted: true  # from `ceph-volume lvm list --format json` tags.ceph.encrypted
+    osdFsid: 07bb0602-5e27-4fcc-86b1-c1faa0bc20ac  # from `ceph osd dump --format json`: `.osds[id=].uuid`
+
+  # populated on phase=Completed
+  newFsid: ""  # recorded on Completed
+  completedAt: null
+```
+
+Users cancel a replacement by deleting the `CephOSDReplace` CR. A finalizer gives the operator a chance to clean up before the CR is removed. If the user cancels before the operator picks up the swapped disk, no Ceph or host state has changed and the CR is removed cleanly. If the replacement has already started, the operator runs it to a terminal state (success or failure) before removing the CR — see [Cancellation](#cancellation).
+
+CR names are arbitrary. To re-replace the same OSD, the user creates a new CR with a different name. Terminal CRs (`Completed`, `Cancelled`, `Failed`) for the same `osdId` are ignored by the operator and can be deleted when no longer useful.
+
+#### Coordination
+
+Replacements run serially per CephCluster as a simplifying choice, matching cephadm's `osd rm` queue and Rook's existing OSD migration. Per-OSD `safe-to-destroy` only returns OK once the OSD is fully drained from every PG's acting set, so concurrent destroys of independently-safe OSDs are technically safe — but serial keeps the operational model simple.
+
+The queue is implemented via a `Pending` phase. Each reconcile, the controller lists peer `CephOSDReplace` CRs in the same namespace targeting the same cluster.
If no earlier-`creationTimestamp` peer is in a non-terminal phase, this CR advances to `Validating`; otherwise it stays in `Pending` and re-checks next reconcile. UID breaks same-second ties. + +> Extending CephCluster with a `spec.storage.replaceOSD` field needs no coordination logic — a single field admits only one in-flight replacement. + +#### Auto-provisioning skip + +The Rook cluster controller spawns the prepare-job, which by default auto-discovers devices and provisions new OSDs. To make the replacement flow work, the cluster controller must run the prepare-job in "discover only" mode on a node where a replacement is running — discovery happens, provisioning doesn't. + +In the existing cluster controller, add a gate before each `runPrepareJob` call in `startProvisioningOverNodes` ([create.go#L345](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/cluster/osd/create.go#L345)): list `CephOSDReplace` CRs whose `rook.io/osd-replacement-node` label equals the current node; if any is in a non-terminal phase, launch the Job with `ROOK_DISCOVER_ONLY=true` in its env. The replacement controller stamps this label on its CR at creation, reading the node from the target OSD's deployment. It clears the label on transition to a terminal phase; the deletion finalizer is a backup. + +In discover-only mode, the prepare-job runs the same discovery code as a normal run (`DiscoverDevicesWithFilter` + `getAvailableDevices` at [daemon.go#L341](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/daemon.go#L341)) and writes the eligible-device list to the existing per-node status CM (`rook-ceph-osd--status`) and stops without provisioning. + +The optional discovery DaemonSet (`ROOK_ENABLE_DISCOVERY_DAEMON=true`) only inventories devices; it doesn't provision. When enabled, it updates `local-device-` on udev events with seconds latency. 
The replacement controller may watch that CM as a fast wake-up signal, but treats the discover-only status CM as authoritative — `local-device-` is unfiltered (does not apply the cluster's `deviceFilter`/`useAllDevices`). + +> With the cluster-CR fallback (`spec.storage.replaceOSD` on CephCluster), the cluster controller reads its own spec field instead of listing CRs — same flag plumbing. + +#### Phase state machine + +``` + Pending ─→ Validating ─→ Waiting ─→ Replacing ─→ Completed + │ │ │ │ + ▼ ▼ ▼ ▼ + Cancelled / Failed +``` + +On operator restart, reconcile resumes from `.status.phase` plus observable state — Jobs by name, deployment presence, `osdInfo` populated, OSD `up+in` in `ceph osd tree`. Sub-step progress within `Replacing` is not persisted on the CR. + +### Step-by-step + +The walk-through uses the running example. + +#### 1. Trigger — user creates a `CephOSDReplace` CR + +Typical case is a failed disk: Ceph auto-marks the OSD `down` and `out` and rebalances the data. User creates a `CephOSDReplace` CR and replaces the failed disk in the datacenter. + +Healthy (`up+in`) OSDs are rejected by [Validate](#2-validate) unless the user marks the OSD out manually first (`ceph osd out `) or sets `spec.autoOut: true`. + +On creation, the CR enters `Pending` and waits for any earlier in-flight replacement to terminate. Once cleared, it advances to `Validating`. + +#### 2. Validate + +Cheap upfront checks. Each reconcile cycle runs the checks in order; the first failure ends the phase. + +1. **Confirmation matches.** `spec.confirmation` must equal `"yes-really-replace-osd-{spec.osdId}"`. On mismatch: `Failed` with `reason=InvalidSpec` (typo guard). +2. **Target OSD exists.** If absent from the OSD map: `Failed` with `reason=InvalidSpec`. +3. **Target OSD is destroyable.** If `up && in` and `spec.autoOut: false`: `Failed` with `reason=OSDStillIn`. If `up && in` and `spec.autoOut: true`: accepted; the actual `ceph osd out` runs in [Replace](#4-replace), not here. 
+4. **CR-level device matching is swap-tolerant.** The OSD's data device must be referenced via `useAllDevices`, `deviceFilter`, or a `/dev/disk/by-path/...` path (same-slot replacement only). Kernel names (`vdb`, `/dev/sda`), `by-id`, and `by-uuid` references are rejected — they can't resolve to a fresh disk. On rejection: `Failed` with `reason=InvalidSpec`. (Whether to make this configurable is [open question 6](#open-questions).) + +On all checks passing, advances to `Waiting`. + +#### 3. Wait for replacement disk + +Each reconcile, the controller checks whether the replacement disk is visible by reading the per-node status CM (`rook-ceph-osd--status`, populated by the prepare-job). If the empty replacement disk isn't there yet, requeue. When it appears, advance to `Replacing`. + +Cancel during `Waiting` is clean: no Ceph or host state has been changed, no LVs touched. Deletion of the CR ends the flow with no recovery needed. + +If `spec.diskWaitTimeout` (default 24h) is exceeded, transitions to `Failed` with `reason=ReplacementDiskMissing`. After timeout, the user can insert the disk and create a new CR for the same OSD ID (the OSD is still alive in Ceph at this point), or delete the CR. + +#### 4. Replace + +`Replacing` runs the full set of state changes in sequence. On each reconcile, the operator inspects observable state (deployment presence, `status.osdInfo`, Replace Job status, `ceph osd tree`) and runs the next unfinished sub-step. + +1. **autoOut (conditional).** If the OSD is `up && in` and `spec.autoOut: true`, run `ceph osd out `. For the typical failed-disk case the OSD is already `out` (Ceph auto-marked it after `mon_osd_down_out_interval`) and this step is a no-op. + +2. **Wait for `safe-to-destroy` OK.** `ceph osd safe-to-destroy ` returns OK only after the OSD is fully drained from every PG's acting set. Requeued until OK. If `spec.safeToDestroyTimeout` (default 1h) is exceeded, transitions to `Failed` with `reason=NotSafeToDestroy`. + +3. 
**Capture OSDInfo.** Full details are in [CRD proposal](#crd-proposal). Most fields come from the OSD deployment's env. Two come off the host:
+
+   ```bash
+   # databaseSizeMB and encrypted — lv_size and tags.ceph.encrypted from the OSD's DB LV:
+   ceph-volume lvm list /dev//osd-db- --format json
+
+   # metadataSourceDevice — devices[0] of the db entry - fallback for older OSDs missing ROOK_METADATA_SOURCE_DEVICE:
+   ceph-volume lvm list --format json # runs on the OSD's node
+   ```
+
+4. **Delete OSD deployment.** The operator calls `k8sutil.DeleteDeployment` ([`deployment.go#L388`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/k8sutil/deployment.go#L388)) on `rook-ceph-osd-5` and polls until the pod is gone.
+
+5. **Replace Job.** A single short-lived Job runs the destroy + prepare bash sequence in one container. The pod scaffold (volumes, security context, init containers, `DM_DISABLE_UDEV=1`) is reused from `c.provisionPodTemplateSpec`; without `DM_DISABLE_UDEV=1` `lvcreate`/`cryptsetup` hangs in `udev_wait` and `ceph-volume lvm prepare` fails with `RADOS permission denied`. The new DB LV's UUID is generated by the operator at Job creation and passed via env.
+
+   The container's command:
+
+   ```bash
+   # Re-check safe-to-destroy (insurance against races between phase transition and Job start).
+   ceph osd safe-to-destroy 5
+
+   # Destroy in Ceph (preserves OSD ID 5 for reuse).
+   ceph osd destroy osd.5 --yes-i-really-mean-it
+
+   # Remove dm-crypt key (no-op on Ceph v19+; defensive for older versions).
+   ceph config-key exists dm-crypt/osd//luks \
+       && ceph config-key rm dm-crypt/osd//luks
+
+   # Close DB-side LUKS mapping.
+   DB_MAPPING=$(lsblk -nlo NAME,TYPE /dev/ceph-metadata-vg-1/osd-db- | awk '$2=="crypt"{print $1; exit}')
+   [ -n "$DB_MAPPING" ] && cryptsetup status "$DB_MAPPING" >/dev/null 2>&1 \
+       && cryptsetup close "$DB_MAPPING"
+
+   # Free the DB slot.
+   lvs /dev/ceph-metadata-vg-1/osd-db- >/dev/null 2>&1 \
+       && lvremove -f /dev/ceph-metadata-vg-1/osd-db-
+
+   # Zap the data LV (also handles the data-side dm-crypt mapping).
+   lvs /dev/ceph-data-vg-5/osd-block- >/dev/null 2>&1 \
+       && ceph-volume lvm zap /dev/ceph-data-vg-5/osd-block- --destroy
+
+   # Pre-allocate the new DB LV; skip if it already exists (retry-safe).
+   lvs /dev/ceph-metadata-vg-1/osd-db-12cf3a91-... >/dev/null 2>&1 \
+       || lvcreate -L 1500M -n osd-db-12cf3a91-... ceph-metadata-vg-1 --wipesignatures y
+
+   # Provision the new OSD with the preserved ID. --dmcrypt only when the record's
+   # `encrypted` field is true.
+   ceph-volume lvm prepare \
+       --bluestore [--dmcrypt] \
+       --osd-id 5 \
+       --data /dev/vdh \
+       --block.db /dev/ceph-metadata-vg-1/osd-db-12cf3a91-... \
+       --crush-device-class hdd
+   ```
+
+   The Job writes the new OSD's info to `rook-ceph-osd--status` (the per-node CM Rook already uses to drive daemon creation).
+
+6. **Create new Deployment.** The cluster controller's existing path takes over: `createOSDsForStatusMap` ([`status.go#L324`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/cluster/osd/status.go#L324)) reads the per-node status CM the Replace Job wrote and creates the daemon Deployment. The new deployment carries `ROOK_METADATA_DEVICE` and `ROOK_METADATA_SOURCE_DEVICE` directly — future replacement of this OSD won't need the older-version fallback.
+
+7. **Wait for `up+in`.** The controller polls `ceph osd tree` each reconcile until the new daemon is `up` AND `in`. Once visible, capture `osd_uuid` from `ceph osd metadata ` and transition to [Complete](#5-complete).
+
+#### 5. Complete
+
+Terminal phase. `.status.newFsid` and `.status.completedAt` are recorded; the `Ready` condition transitions to `True`.
+
+### Cancellation
+
+Cancel = delete the `CephOSDReplace` CR; a finalizer runs any cleanup needed.
+
+**Pending, Validating, Waiting.** Clean cancel — no Ceph or host state has been changed.
The finalizer is a no-op aside from clearing the auto-provisioning gate label on the affected node. + +**Replacing — best-effort, deferred.** Once `Replacing` begins, the operator commits to running through to a terminal phase. Cancel intent is recorded but not acted on mid-flow: + +- If the Replace Job is in flight, the operator lets it complete. `ceph-volume lvm prepare` cannot be safely interrupted mid-call (partial dm-crypt + half-LUKS LV). +- On Job failure, the finalizer cleans up any partially-allocated DB LV and removes the CR. +- On Job success, cancel is not honored — the new OSD joins the cluster. + +## Notes on Scope + +### Multiple metadata devices on one node — works conditionally + +Rook supports per-device metadata-device pairing: + +```yaml +nodes: +- name: "node-1" + devices: + - name: "/dev/disk/by-path/...sda" + config: { metadataDevice: "nvme0n1" } + - name: "/dev/disk/by-path/...sdb" + config: { metadataDevice: "nvme0n1" } + - name: "/dev/disk/by-path/...sdc" + config: { metadataDevice: "nvme1n1" } # different metadata device on the same node +``` + +This setup requires exact `name:` (or `fullpath:`) references — the per-device `config:` block can only be attached to a specific device entry, not to a regex match. Replacement of a single OSD on this setup works structurally (each OSD's `metadataSourceDevice` is captured in its `osdInfo` at destroy time), with two caveats: + +- **Device-name validation must permit exact entries** — see [open question 6](#open-questions). +- **Same-slot replacement is required** — `by-path` resolves only when the new disk is in the original slot. Different-slot replacement stalls in the Wait step. + +### PVC-based OSD replacement — separate design + +PVC-backed OSDs use a different code path (raw mode via `GetCephVolumeRawOSDs`, separate destroy plumbing). Issue #13240 is host-based storage; PVC replacement is a separate design. + +## Open questions + +1. 
**Controller placement.** Design leans toward a separate `CephOSDReplace` CRD; `spec.storage.replaceOSD` on CephCluster (mirroring `spec.storage.migration`) is a fallback — see [Open question: controller placement](#open-question-controller-placement). Maintainers' call. + +2. **Parallelism.** The proposed OSD replacement process is serial. Are there use-cases for parallel replacement we should support — multiple OSDs safe-to-destroy on the same node, all safe-to-destroy in the cluster at once, configurable concurrency? + +3. **Auto-replace mode.** The proposed flow is always triggered explicitly by the user (CR creation). Should there be a follow-up option for automated replacement that triggers the same flow when a failed OSD and a fresh disk are detected on a node? + +4. **Default values.** Proposed: `safeToDestroyTimeout: 1h`, `diskWaitTimeout: 24h`, disk-wait re-check interval `5 min` (cluster-config tunable). Reasonable, or do reviewers see a reason to change them? + +5. **Disk-swap responsiveness.** Can the design rely on Rook's existing discovery (rook-discover when enabled, otherwise the per-reconcile prepare-job inventory) to detect the replacement disk? + +6. **Device-name validation.** Proposed: reject kernel names (`/dev/sda`, `vda`), `by-id`, and `by-uuid` references; accept `useAllDevices`, `deviceFilter`, and `by-path` (with implicit same-slot expectation). Sample: + + ```yaml + spec: + storage: + nodes: + - name: node-1 + devices: + - name: /dev/sda # kernel name — rejected (not swap-stable) + - name: /dev/disk/by-path/... # by-path — accepted (same-slot only) + ``` + + Should this be configurable, more permissive (user takes responsibility for any name), or stricter (reject `by-path` too)? + +7. **Cross-host replacement for non-shared-metadata OSDs.** Same-host is required by this design because the captured `metadataVG` lives on the original host. For OSDs without a metadata device this argument doesn't apply. 
Ceph itself permits cross-host replacement: `ceph osd destroy` retains no host info; CRUSH auto-relocates the OSD on daemon start at the cost of full PG remapping. Should this flow be supported by Rook osd replacement? + diff --git a/osd-design.md b/osd-design.md deleted file mode 100644 index 882654642575..000000000000 --- a/osd-design.md +++ /dev/null @@ -1,392 +0,0 @@ -# Design: Single OSD replacement with a shared metadata device - -Issue: [rook/rook#13240](https://github.com/rook/rook/issues/13240) - -## Problem - -When an OSD's data and metadata live on different devices (per `spec.storage` `metadataDevice` config in the CephCluster CR), Rook today cannot replace a single failed OSD on its own. The user must either re-provision all OSDs sharing the same metadata device or run a multi-step manual workflow including scaling down the operator to zero. Both are slow and error-prone. - -This design proposes a workflow to replace a single failed OSD in place — preserving its OSD ID — without affecting other OSDs sharing the same metadata device. - -## Notation - -- **User** - the human cluster admin who edits the CR. -- **Operator** - the Rook controller process. -- **Data LV / data device** - the LV (or block device) holding an OSD's bulk data. One per OSD. -- **DB LV / metadata device** - the LV holding the OSD's rocksdb (`block.db`). One per OSD; multiple OSDs can share the same metadata device. - -## User story - -A disk corresponding to `osd.5` fails on a node where five HDD OSDs share one NVMe metadata device. The user marks `osd.5` for replacement on the CephCluster CR, swaps the physical disk in the chassis, and walks away. Rook destroys `osd.5`, frees its DB LV slot on the NVMe, provisions a new OSD on the replacement disk *with the same OSD ID 5*, and the other four OSDs on the same NVMe stay up the whole time. - -## Constraints - -Three facts about the environment shape every later choice in this design. 
- -### Replacement is same-host - -The new disk must go to the same host as the destroyed OSD. The captured `metadataVG` is host-local, and the Prepare Job runs `ceph-volume lvm prepare --block.db /dev//...` against it. Cross-host replacement is permitted by Ceph but out of scope here. - -### Rook cannot tell a replacement disk from a new disk - -When a fresh empty disk appears on a node, Rook has no way to tell it's the replacement for a failed OSD. The next CephCluster reconcile calls `startProvisioningOverNodes`, which spawns a prepare-job on each node. With `useAllDevices: true` (or a matching `deviceFilter`) the prepare-job auto-provisions a new OSD on the empty disk with a fresh ID; orphan resources for the failed OSD stay leaked. - -This is why the user must mark the OSD for replacement in the CR *before* swapping the disk. Otherwise, a reconcile triggered between the swap and the CR edit auto-provisions the new disk with a fresh ID instead of replacing osd.5. - -### Storage device config must tolerate device swap - -Rook lets users identify OSD data devices via `spec.storage`: - -- `useAllDevices: true` — match any empty disk on the node. -- `deviceFilter: ""` — match disks whose `lsblk` properties match a regex. -- `nodes[].devices[].name: ""` — match a specific path or name. Accepts a kernel name (`vdb`), a raw path (`/dev/sdc`), or a udev symlink (`/dev/disk/by-path/...`, `/dev/disk/by-id/...`). -- `nodes[].devices[].fullpath: ""` — explicit DevLinks match (`/dev/disk/by-id/...`, `/dev/disk/by-path/...`). Compared against discovered symlinks, not regex. - -Each shape interacts differently with the Linux device-naming interfaces: - -- **Kernel names** (`vdb`, `sdc`, `/dev/sdc`) are not guaranteed to be persistent (see [Arch Wiki: Persistent block device naming](https://wiki.archlinux.org/title/Persistent_block_device_naming)). 
[Ceph's own admin docs](https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/#replacing-an-osd) use raw paths like `/dev/sdX` in their replacement examples, but the manual procedure can be re-validated at each step; an automated flow has fewer recovery options if the name has shifted. -- **`/dev/disk/by-path/...`** is built by udev rules from the sysfs port path. Same physical port → same `by-path` symlink. So `by-path` survives a *same-slot* swap and breaks on a different-slot swap. Same-slot replacement is **not** a Rook or Ceph requirement: [Ceph upstream is silent on slot semantics](https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/#replacing-an-osd); cephadm's `ceph orch device replace` is slot-agnostic. -- **`/dev/disk/by-id/...`** identifies the disk by hardware serial / WWN. Different disk → different `by-id`. Useless for replacement (the new disk *is* a different disk). -- **`/dev/disk/by-uuid/...`** identifies the filesystem/LV UUID. The replacement disk has a fresh UUID after provisioning. Same as `by-id`: useless here. - -The shapes that tolerate any swap (same-slot or different-slot, any new disk) are `useAllDevices` and `deviceFilter`. `by-path` tolerates only same-slot replacement. Kernel names tolerate only the lucky case where the kernel happens to assign the same name. `by-id`/`by-uuid` references in `name`/`fullpath` cannot work for a disk that hasn't been seen yet. - -The replacement flow must validate the affected OSD's CR references beforehand so the new disk is still resolvable under those references after the swap. - -## Current gaps - -Rook has no automated flow for replacing a failed OSD today. The closest existing primitive is the migration flow (`spec.storage.migration.confirmation`), which recreates OSDs in place after encryption or store-type spec changes: it destroys the OSD and re-prepares with `ceph-volume raw prepare --osd-id` via the `ROOK_REPLACE_OSD` env var. 
Migration only covers raw-mode OSDs; the shared-metadata case needs seven additional fixes: - -1. The replacement code path runs only in raw mode; LVM mode (required when a metadata device is configured) does not pass `--osd-id`, so the new OSD gets a new ID. (`initializeDevicesLVMMode`, [volume.go#L587-L847](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/volume.go#L587-L847)) -2. Destroy zaps only the data LV; the DB LV on the shared metadata disk stays as an orphan. (`DestroyOSD`, [remove.go#L244-L290](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/remove.go#L244-L290)) -3. The dm-crypt key in Ceph's config-key store is never removed, leading to LUKS collisions on retry of encrypted OSDs. (`DestroyOSD`, [remove.go#L244-L290](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/remove.go#L244-L290)) -4. **The prepare-pod can't find a shared metadata disk once any OSD lives on it.** Rook's disk-discovery (`DiscoverDevicesWithFilter`, [disk.go#L97-L111](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/clusterd/disk.go#L97-L111)) skips any disk with `len(deviceChild) > 1` as a guard against claiming a user-partitioned disk. From the first OSD onward — encrypted or not — that count is ≥ 2 (parent + LV, plus crypt mapping if encrypted), so the filter triggers and the prepare-pod's `initializeDevicesLVMMode` errors with `metadata device is not found`. Same root cause as upstream issues [#15868](https://github.com/rook/rook/issues/15868) and parts of [#17477](https://github.com/rook/rook/issues/17477). -5. `OSDInfo.MetadataPath` is not populated by the prepare-job re-discovery path on LVM-mode OSDs (`GetCephVolumeLVMOSDs` walks only `[block]` entries from `ceph-volume lvm list`). 
The operator-side path (`getOSDInfo` at [osd.go#L748](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/cluster/osd/osd.go#L748)) does populate it from `ROOK_METADATA_DEVICE` env, but anything that goes through re-discovery (a redeploy after CM loss) loses the field. ([volume.go#L1082-L1177](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/volume.go#L1082-L1177)) -6. The OSD deployment has no env carrying the DB LV's *source physical device* — only the LV path itself (`ROOK_METADATA_DEVICE`). At replacement time, the operator needs the source device to resolve the metadata VG and to identify the disk for the auto-provisioning skip. This design introduces a new `ROOK_METADATA_SOURCE_DEVICE` env, plumbed through the daemon-deployment spec. -7. **No discover-only mode in the prepare-job.** The prepare-job conflates discovery and provisioning in a single sequential pass (`Provision`, [daemon.go#L159-L283](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/daemon.go#L159-L283)) — no way to inventory a node without auto-claiming any empty disks it finds. The replacement flow needs "scan but don't claim" so the empty replacement disk doesn't get auto-provisioned with a fresh ID before the operator can drive `ceph-volume lvm prepare --osd-id`. This design adds a `ROOK_DISCOVER_ONLY` mode to the prepare-job; the cluster controller passes it for nodes with an active `CephOSDReplace`. - -## Proposed flow - -This flow orchestrates [Ceph's documented OSD-replacement procedure](https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/#replacing-an-osd) (`safe-to-destroy` → `osd destroy` → `lvm zap` → `lvm prepare --osd-id` → `lvm activate`) across short-lived Kubernetes Jobs, with operator-side state for crash recovery and Rook-specific gates around auto-provisioning. 
cephadm — Ceph's container-orchestrator analogue — preserves OSD IDs by default ([cephadm OSD service docs](https://docs.ceph.com/en/latest/cephadm/services/osd/#replacing-an-osd)); this design follows the same convention. - -Two short-lived jobs — Destroy Job and Prepare Job — separated by the wait for the replacement disk. The operator owns all phase transitions and the wait; jobs are workers observed via `Job.status.succeeded`. Replacements run serially cluster-wide. - -### Sequence - -```mermaid -sequenceDiagram - autonumber - actor User - participant CR as Rook CR - participant Op as Operator - participant OldPod as Old OSD pod - participant DJ as Destroy Job - participant PJ as Prepare Job - participant Ceph as Ceph - - User->>CR: create CR with osd id 5 - Op->>CR: read CR - Op->>CR: write phase=Validating - Op->>Ceph: ceph osd dump (validate exists, get fsid) - Op->>Ceph: safe-to-destroy 5 - Op->>OldPod: read deployment env - Op->>CR: update phase=Destroying + OSD info - Op->>OldPod: delete deployment - destroy OldPod - Op->>OldPod: wait for pod termination - Op->>+DJ: create - DJ->>Ceph: ceph osd destroy osd.5 - DJ->>Ceph: config-key rm dm-crypt key (if encrypted) - DJ->>DJ: cryptsetup close db mapping (if encrypted) - DJ->>DJ: lvremove db lv - DJ->>DJ: ceph-volume lvm zap data lv - DJ-->>-Op: Succeeded - Op->>CR: phase=Waiting - Note over Op: wait for replacement disk (non-blocking) - Note over User,Op: User swaps the failed disk - Op->>CR: phase=Preparing - Op->>+PJ: create from recorded OSD info - PJ->>PJ: lvcreate using LV name from Job env - PJ->>Ceph: ceph-volume lvm prepare --osd-id 5 - PJ->>PJ: write new OSD info to existing per-node status CM - PJ-->>-Op: Succeeded - create participant NewPod as New OSD pod - Op->>NewPod: create deployment with id 5 - NewPod->>Ceph: lvm activate, join cluster - Op->>Ceph: ceph osd metadata 5 until Ready - Op->>CR: phase=Completed -``` - -### Open question: controller placement - -The diagram doesn't pick a concrete 
CR or controller for the replacement reconcile logic. Two candidates: extend the existing CephCluster controller (which already hosts `spec.storage.migration`), or introduce a separate `CephOSDReplace` CRD with its own controller. The design leans toward the separate CRD for the following reasons: - -1. **CephCluster's `reconcileCephDaemons` is monolithic and synchronous** — mon, mgr, and osd reconcile run sequentially in one call ([`cluster.go#L116-L160`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/cluster/cluster.go#L116-L160)); `osd.Cluster.Start()` returns plain `error`, so there's no way to express terminal failure (bad CR rejected) vs. transient `RequeueAfter` (waiting for disk-swap or Job completion). Adding long-running multi-step logic to this path interferes with mon/mgr reconcile and lacks the return semantics the flow needs. -2. **Replacement state has to survive between reconciles**, and the cluster controller has no existing place to store sub-operation state — adding one (a side ConfigMap, or extending `CephCluster.status`) is part of the cost. - -Concrete shape of each candidate: - -- **Extend the cluster controller** — state in either a side ConfigMap (`osd-replacement-state`, mirroring `osd-migration-config`) or `CephCluster.status`. Same UX as `spec.storage.migration`. -- **New `CephOSDReplace` CRD + dedicated controller** — state on `.status`. Independent reconcile goroutine; never touches the existing OSD path. Light coupling on the cluster side: skip auto-provisioning on affected nodes; surface an `OSDReplacementInProgress` condition. - -The rest of this design is based on a separate `CephOSDReplace` CRD, with implications for the cluster-CR fallback flagged inline. - -### State - -State lives on `CephOSDReplace.spec` and `.status`. `spec.cephCluster` and `spec.osdId` are immutable post-create. `.status` carries phase and conditions following the K8s operator pattern. 
- -```yaml -apiVersion: ceph.rook.io/v1 -kind: CephOSDReplace -metadata: - name: replace-osd-5 - namespace: rook-ceph -spec: - cephCluster: my-cluster # immutable; target cluster in this namespace - osdId: 5 # immutable - confirmation: yes-really-replace-osd-5 # must equal "yes-really-replace-osd-{osdId}"; typo guard against destroying the wrong OSD - autoOut: false # optional; if true, operator marks healthy OSD `out` automatically. Default: false (fail-fast on up+in) - safeToDestroyTimeout: 1h # optional; how long Validating tolerates EBUSY before Failed. Default: 1h - diskWaitTimeout: 24h # optional; how long Waiting tolerates a missing disk before Failed. Default: 24h - -status: - phase: Destroying # Pending | Validating | Destroying | Waiting | Preparing | Completed | Failed | Cancelled - conditions: - - type: Ready - status: "False" - reason: Destroying - message: Destroy Job in flight - observedGeneration: 1 - lastTransitionTime: "2026-05-05T12:00:00Z" - - # captured at the Validating → Destroying transition - osdInfo: - node: node-1 # OSD deployment NodeSelector; survives the deployment delete - dataLV: /dev/ceph-data-vg-5/osd-block-aaa... # OSD deployment env ROOK_BLOCK_PATH - dbLV: /dev/ceph-metadata-vg-1/osd-db-bbb... 
# OSD deployment env ROOK_METADATA_DEVICE; absent for raw-mode OSDs - metadataSourceDevice: /dev/vdd # new env ROOK_METADATA_SOURCE_DEVICE (added by this design); absent for raw-mode OSDs - metadataVG: ceph-metadata-vg-1 # from `pvs --noheadings -o vg_name ` on the OSD's host (metadata device must still be readable) - crushDeviceClass: hdd # OSD deployment env ROOK_OSD_CRUSH_DEVICE_CLASS - databaseSizeMB: 1500 # from `ceph-volume lvm list --format json` lv_size (bytes) / 1048576 - encrypted: true # from `ceph-volume lvm list --format json` tags.ceph.encrypted - osdFsid: 07bb0602-5e27-4fcc-86b1-c1faa0bc20ac # from `ceph osd dump --format json`: `.osds[id=].uuid` - - # populated on phase=Completed - newFsid: "" # recorded on Completed - completedAt: null -``` - -**Cancel and re-replace.** Cancel = delete the CR; a finalizer runs the operator's cleanup (delete partially-allocated DB LV if any; leave the OSD `destroyed` for the user to `ceph osd purge` manually). Re-replacement of the same OSD = create a new CR with a different name. Terminal CRs (`Completed`, `Cancelled`, `Failed`) are inert — keep them as audit trail or delete them; the operator requires neither. - -#### Coordination - -Replacements run serially per CephCluster as a simplifying choice, matching cephadm's `osd rm` queue and Rook's existing OSD migration. Per-OSD `safe-to-destroy` only returns OK once the OSD is fully drained from every PG's acting set, so concurrent destroys of independently-safe OSDs are technically safe — but serial keeps the operational model simple. - -The queue is implemented via a `Pending` phase. Each reconcile, the controller lists peer `CephOSDReplace` CRs in the same namespace targeting the same cluster. If no earlier-`creationTimestamp` peer is in a non-terminal phase, this CR advances to `Validating`; otherwise it stays in `Pending` and re-checks next reconcile. UID breaks same-second ties. 
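
The queue rule described above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not Rook's Go implementation: the `Replace` record and the `may_advance` helper are hypothetical names, timestamps are assumed to be second-resolution RFC 3339 strings in UTC (so string order equals time order), and the terminal-phase set is taken from the phase list in the State section.

```python
from dataclasses import dataclass

# Terminal phases, as listed in the State section of this design.
TERMINAL = {"Completed", "Failed", "Cancelled"}


@dataclass(frozen=True)
class Replace:
    """Hypothetical stand-in for a CephOSDReplace CR (only the fields the queue needs)."""
    uid: str
    cluster: str
    created: str  # creationTimestamp, e.g. "2026-05-05T12:00:00Z" (assumed UTC)
    phase: str


def may_advance(me: Replace, peers: list[Replace]) -> bool:
    """Return True if no earlier, non-terminal peer targets the same cluster."""
    for peer in peers:
        if peer.uid == me.uid or peer.cluster != me.cluster:
            continue
        if peer.phase in TERMINAL:
            continue
        # Earlier creationTimestamp wins; UID breaks same-second ties.
        if (peer.created, peer.uid) < (me.created, me.uid):
            return False
    return True
```

Comparing `(created, uid)` tuples gives every CR a total order, so two CRs created in the same second can never both conclude that they are first.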
- -> Extending CephCluster with a `spec.storage.replaceOSD` field needs no coordination logic — a single field admits only one in-flight replacement. - -#### Auto-provisioning skip - -The Rook cluster controller spawns the prepare-job, which by default auto-discovers devices and provisions new OSDs. To make the replacement flow work, the cluster controller must run the prepare-job in "discover only" mode on a node where a replacement is running — discovery happens, provisioning doesn't. - -In the existing cluster controller, add a gate before each `runPrepareJob` call in `startProvisioningOverNodes` ([create.go#L345](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/cluster/osd/create.go#L345)): list `CephOSDReplace` CRs whose `rook.io/osd-replacement-node` label equals the current node; if any is in a non-terminal phase, launch the Job with `ROOK_DISCOVER_ONLY=true` in its env. The replacement controller stamps this label on its CR at creation, reading the node from the target OSD's deployment. It clears the label on transition to a terminal phase; the deletion finalizer is a backup. - -In discover-only mode, the prepare-job runs the same discovery code as a normal run (`DiscoverDevicesWithFilter` + `getAvailableDevices` at [daemon.go#L341](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/daemon.go#L341)) and writes the eligible-device list to the existing per-node status CM (`rook-ceph-osd--status`) — without running `lvm batch`. - -When the discovery DaemonSet is enabled (`ROOK_ENABLE_DISCOVERY_DAEMON=true`), `local-device-` is updated on udev events with seconds latency. The replacement controller may watch that CM as a fast wake-up signal, but treats the discover-only status CM as authoritative — `local-device-` is unfiltered (does not apply the cluster's `deviceFilter`/`useAllDevices`). 
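
The gate in front of `runPrepareJob` described in this section can be sketched as follows. This is a hypothetical Python illustration, not the Go code in `create.go`: the function name `prepare_job_env` and the dict shape used for a CR are assumptions; only the label key, the `ROOK_DISCOVER_ONLY` variable, and the terminal phases come from this design.

```python
# Label key and terminal phases are taken from this design document.
NODE_LABEL = "rook.io/osd-replacement-node"
TERMINAL_PHASES = frozenset({"Completed", "Failed", "Cancelled"})


def prepare_job_env(node, replacements):
    """Return extra env for the prepare-job on `node`.

    `replacements` is an iterable of dicts with optional "labels" and "phase"
    keys (a hypothetical flattened view of CephOSDReplace CRs).
    """
    for cr in replacements:
        on_this_node = cr.get("labels", {}).get(NODE_LABEL) == node
        # A labelled CR without a recorded phase is treated as in flight.
        in_flight = cr.get("phase") not in TERMINAL_PHASES
        if on_this_node and in_flight:
            return {"ROOK_DISCOVER_ONLY": "true"}
    return {}
```

Treating a labelled CR with no recorded phase as in flight errs on the side of skipping provisioning, which is the safer failure mode for this flow.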
- -> With the cluster-CR fallback (`spec.storage.replaceOSD` on CephCluster), the cluster controller reads its own spec field instead of listing CRs — same flag plumbing. - -#### Phase state machine - -``` - Pending ─→ Validating ─→ Destroying ─→ Waiting ─→ Preparing ─→ Completed - │ │ - ▼ ▼ - Failed/Cancelled Failed/Cancelled -``` - -> With the cluster-CR fallback, `Pending` is omitted (single field admits one in-flight); state offloads to a side ConfigMap similar to `osd-migration-config`. - -Per-phase behavior: - -| Phase | Normal exit | Transient failure (retried) | Terminal exit | -|---|---|---|---| -| (no record) | → `Pending` on CR create | — | — | -| `Pending` | → `Validating` once no earlier peer is in flight (one replacement per cluster at a time) | re-checks each reconcile while an earlier peer is in-flight | → `Cancelled` if user deletes the CR | -| `Validating` | → `Destroying` once all checks pass | `safe-to-destroy` returns EBUSY (peers backfilling) — re-checked each reconcile | → `Cancelled` on CR delete; → `Failed` on validation failure (target OSD invalid, swap-intolerant CR, up+in without `autoOut`, or `safe-to-destroy` timeout) | -| `Destroying` | → `Waiting` on Destroy Job success | Destroy Job retries on transient errors (Ceph unreachable, pod scheduling) | — | -| `Waiting` | → `Preparing` once replacement disk visible | inventory poll until disk visible | → `Cancelled` on CR delete; → `Failed` with `reason=ReplacementDiskMissing` after disk-swap wait expires | -| `Preparing` | → `Completed` when new daemon is Ready in Ceph | Prepare Job pod retries on transient errors; `lvcreate` precheck handles partial LV from a prior pod; Deployment creation retries | — | -| `Completed` | terminal — success | — | — | -| `Cancelled`, `Failed` | terminal | — | — | - -User-visible: `Ready=True` on `Completed`, `Ready=False` otherwise; `reason` carries the current phase or a typed terminal reason. 
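
The per-phase table above can be transcribed into a small transition map. This is a minimal Python sketch for illustration only; `TRANSITIONS`, `can_transition`, and `ready_condition` are hypothetical names, and the map follows the table's "Normal exit" and "Terminal exit" columns as written (so `Destroying` and `Preparing` have no terminal exit).

```python
# Allowed phase transitions, transcribed from the per-phase table.
TRANSITIONS = {
    "Pending": {"Validating", "Cancelled"},
    "Validating": {"Destroying", "Cancelled", "Failed"},
    "Destroying": {"Waiting"},
    "Waiting": {"Preparing", "Cancelled", "Failed"},
    "Preparing": {"Completed"},
    "Completed": set(),  # terminal
    "Cancelled": set(),  # terminal
    "Failed": set(),     # terminal
}


def can_transition(current: str, target: str) -> bool:
    """Return True if the table allows moving from `current` to `target`."""
    return target in TRANSITIONS.get(current, set())


def ready_condition(phase: str) -> str:
    """Ready is "True" only on Completed, "False" in every other phase."""
    return "True" if phase == "Completed" else "False"
```

Keeping the transitions in one table makes it easy for a reconciler to reject an unexpected phase jump before writing `.status`.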
- -### Step-by-step - -The walk-through uses the running example. - -#### 1. Trigger — user creates a `CephOSDReplace` CR - -Typical case is a failed disk: Ceph auto-marks the OSD `down` and `out` after `mon_osd_down_out_interval` (default 600s) and backfills; once the OSD is drained from every PG's acting set, `safe-to-destroy` clears and the flow proceeds. The user creates a `CephOSDReplace` CR and replaces the failed device in the datacenter. - -Healthy (up+in) OSDs require either `ceph osd out` first or `spec.autoOut: true` — see [Validate](#2-validate). - -On creation, the CR enters `Pending` and waits for any in-flight replacement to terminate. Once cleared, it advances to `Validating`. The disk can be swapped any time after the CR is applied — Destroy's capture step tolerates a missing data device. - -#### 2. Validate - -> ** Destroy is irreversible.** Once Validate passes, `osd.5` is destroyed on the next reconcile. If the user typed the wrong OSD ID, the wrong OSD is gone. - -Run each reconcile cycle until all checks pass or one fails terminally: - -1. **Confirmation matches.** `spec.confirmation` must equal `"yes-really-replace-osd-{spec.osdId}"`. → `Failed` with `reason=InvalidSpec` on mismatch (typo guard). -2. **Target OSD exists** in the OSD map. → `Failed` with `reason=InvalidSpec` if absent. -3. **Target OSD is destroyable.** If the OSD is `up && in`: with `spec.autoOut: false` (default), → `Failed` with `reason=OSDStillIn`. With `spec.autoOut: true`, the operator runs `ceph osd out ` once at entry and falls through to check 5. -4. **CR-level device matching is swap-tolerant.** The OSD's data device must be referenced via `useAllDevices`, `deviceFilter`, or a `/dev/disk/by-path/...` path (same-slot replacement only). Kernel names (`vdb`, `/dev/sda`), `by-id`, and `by-uuid` references are rejected — they can't resolve to a fresh disk. → `Failed` with `reason=InvalidSpec` on rejection. 
(Whether to make this configurable is [open question 6](#open-questions).) -5. **`safe-to-destroy ` returns OK.** Returns EBUSY while any PG still has the OSD in its acting set — the only safety gate (`down`/`out` alone is not sufficient because data may not have replicated). EBUSY → stay in `Validating`, re-check next reconcile. `spec.safeToDestroyTimeout` exceeded → `Failed` with `reason=NotSafeToDestroy`. - - -#### 3. Destroy - -Before deleting the deployment, the operator captures `.status.osdInfo` (sources per the YAML comments in [State](#state)). Most fields come from the OSD deployment's env. Two come off the host: - -- `databaseSizeMB` and `encrypted` — read from the OSD's DB LV (or a surviving sibling LV in the same VG if the OSD's own LV is missing). The live spec is not a source: a user-edited `spec.storage.config.databaseSizeMB` would size the new DB LV inconsistently with siblings. -- `metadataSourceDevice` — for OSDs created by older operator versions (env not yet plumbed), a one-shot `ceph-volume lvm list --format json` Job on the OSD's node fills it. The Job reads VG metadata from the surviving PV on the metadata device, so it works even after the data device has physically failed. - -Then the operator calls `k8sutil.DeleteDeployment` ([`deployment.go#L388`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/k8sutil/deployment.go#L388)) on `rook-ceph-osd-5` and polls until the pod is gone. The pod-gone wait is required: while the daemon runs, it holds the DB-side LUKS mapping open and the next step's `cryptsetup close` would fail. If the wait times out (transient NotReady node), the operator re-checks on the next reconcile. No force-delete — a stuck pod on a NotReady node may still hold the LUKS mapping when kubelet recovers. 
- -The Destroy Job's container invokes `DestroyOSD` ([`remove.go#L244-L290`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/remove.go#L244-L290)) — the same Go function the existing migration flow already calls from [`cmd/rook/ceph/osd.go#L272`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/cmd/rook/ceph/osd.go#L272). The bash below specifies what `DestroyOSD` must do (today it only handles the first step and a partial last step). Each operation is idempotent on retry. Pod profile is cloned from the existing `c.provisionPodTemplateSpec` — inherits `DM_DISABLE_UDEV=1` (required to bypass udev sync inside privileged containers) and the `cephx-keyring-update` init container. - -```bash -# Destroy in Ceph (preserves OSD ID 5 for reuse). -ceph osd destroy osd.5 --yes-i-really-mean-it - -# Remove dm-crypt key (no-op on Ceph v19+; defensive for older versions). -ceph config-key exists dm-crypt/osd//luks \ - && ceph config-key rm dm-crypt/osd//luks - -# Close DB-side LUKS mapping. -DB_MAPPING=$(lsblk -nlo NAME,TYPE /dev/ceph-metadata-vg-1/osd-db- | awk '$2=="crypt"{print $1; exit}') -[ -n "$DB_MAPPING" ] && cryptsetup status "$DB_MAPPING" >/dev/null 2>&1 \ - && cryptsetup close "$DB_MAPPING" - -# Free the DB slot. -lvs /dev/ceph-metadata-vg-1/osd-db- >/dev/null 2>&1 \ - && lvremove -f /dev/ceph-metadata-vg-1/osd-db- - -# Zap the data LV (also handles the data-side dm-crypt mapping). -lvs /dev/ceph-data-vg-5/osd-block- >/dev/null 2>&1 \ - && ceph-volume lvm zap /dev/ceph-data-vg-5/osd-block- --destroy -``` - -After the Job completes, operator advances the record to `phase: Waiting`. The captured `osdInfo.dbLV` now refers to a removed LV — only its parent VG (`metadataVG`) is live; Prepare uses that VG to create a fresh DB LV. - -#### 4. 
Wait for replacement disk - -Each reconcile of the CR, the controller checks if the replacement disk is visible by reading the per-node status CM (`rook-ceph-osd--status`, populated by the cluster controller's prepare-job in discover-only mode — see [Auto-provisioning skip](#auto-provisioning-skip)). If the empty replacement disk isn't there yet, requeue. When it appears, advance to `Preparing`. - -Timeout per `spec.diskWaitTimeout` (default 24h) → `Failed` with `reason=ReplacementDiskMissing`. After timeout, the user can either insert the disk and create a new CR for the same OSD ID, or delete the CR (osd stays destroyed; `ceph osd purge` to free the slot). - -#### 5. Prepare - -Phase `Preparing` (entered when the replacement disk is visible). The operator generates a fresh UUID for the new DB LV and passes it as an env var on the Job (same pattern as `ROOK_REPLACE_OSD` in `provision_spec.go:317,322`); pod retries reuse the same env. Pod profile is cloned from `c.provisionPodTemplateSpec` (inherits `DM_DISABLE_UDEV=1` and the `cephx-keyring-update` init container — without these, `lvcreate`/`cryptsetup` hangs in `udev_wait` and `ceph-volume lvm prepare` fails with `RADOS permission denied`). The Job runs: - -```bash -# Pre-allocate the DB LV using the name from the Job env; skip if it already exists. -lvs /dev/ceph-metadata-vg-1/osd-db-12cf3a91-... >/dev/null 2>&1 \ - || lvcreate -L 1500M -n osd-db-12cf3a91-... ceph-metadata-vg-1 --wipesignatures y - -# Provision the new OSD with the preserved ID. --dmcrypt only when the record's -# `encrypted` field is true. -ceph-volume lvm prepare \ - --bluestore [--dmcrypt] \ - --osd-id 5 \ - --data /dev/vdh \ - --block.db /dev/ceph-metadata-vg-1/osd-db-12cf3a91-... \ - --crush-device-class hdd -``` - -The Job writes the new OSD's info to `rook-ceph-osd--status` (the per-node prepare-job CM that Rook already uses to drive daemon creation). 
The phase stays `Preparing` while the operator creates the Deployment and waits for the new daemon to become Ready in Ceph. - -#### 6. Activate - -Reuses the existing path: `createOSDsForStatusMap` ([`status.go#L324`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/cluster/osd/status.go#L324)) sees the per-node status CM the Prepare Job wrote and creates the daemon Deployment from it. The new deployment carries `ROOK_METADATA_DEVICE` and `ROOK_METADATA_SOURCE_DEVICE` directly — future replacement of this OSD won't need the older-version fallback in Destroy's capture step. - -#### 7. Complete - -While in `Preparing`, the controller checks readiness each reconcile via `ceph osd tree`: the OSD must be `up` AND `in`. Then captures `osd_uuid` from `ceph osd metadata ` and transitions to `phase: Completed` with `newFsid` and `completedAt` recorded. (`ceph osd metadata` populates as soon as the daemon registers — gating on `tree`'s `up`+`in` waits until the daemon has actually joined the cluster.) - -### Cancellation - -Cancel = delete the `CephOSDReplace` CR; a finalizer runs cleanup. Cancel is honored cleanly in `Pending`, `Validating`, and `Waiting` (after Destroy completes — `osd.5` stays `destroyed`; user runs `ceph osd purge 5` to free the slot). `Destroying` is short-lived and ignores cancel. Once the new OSD is provisioned (post-Prepare-Job-success or `Completed`), cancel makes no sense — removing the new OSD is an `out`+`purge` workflow, not a rollback. - -**Cancel during Validating with `autoOut: true`.** If the operator already marked the OSD `out`, the OSD stays `out` after cancel. User marks `in` manually to recover the original cluster layout. - -**Cancel during Waiting — ID-preserving retry unavailable.** Data and DB LVs were wiped at Destroy; a future `CephOSDReplace` for the same ID has no OSD info to capture and aborts. To re-add an OSD here, accept a fresh ID. 
- -**Cancel during Preparing, Job in flight.** `ceph-volume lvm prepare` cannot be safely interrupted mid-call (partial dm-crypt + half-LUKS LV). The operator records the cancel intent and acts at Job exit. On Job failure, the finalizer removes the CR; the partially-allocated DB LV is left as a named orphan. On Job success, cancel is not honored — the new OSD joins the cluster. - -## Out of scope - -### Multiple metadata devices on one node — works conditionally - -Rook supports per-device metadata-device pairing: - -```yaml -nodes: -- name: "node-1" - devices: - - name: "/dev/disk/by-path/...sda" - config: { metadataDevice: "nvme0n1" } - - name: "/dev/disk/by-path/...sdb" - config: { metadataDevice: "nvme0n1" } - - name: "/dev/disk/by-path/...sdc" - config: { metadataDevice: "nvme1n1" } # different metadata device on the same node -``` - -This setup requires exact `name:` (or `fullpath:`) references — the per-device `config:` block can only be attached to a specific device entry, not to a regex match. Replacement of a single OSD on this setup works structurally (each OSD's `metadataSourceDevice` is captured in its `osdInfo` at destroy time), with two caveats: - -- **Device-name validation must permit exact entries** — see [open question 6](#open-questions). -- **Same-slot replacement is required** — `by-path` resolves only when the new disk is in the original slot. Different-slot replacement stalls in the Wait step. - -The broader multi-metadata-device feature work (improvements to per-device `metadataDevice` UX, multi-NVMe-per-node setups) was scoped separately by maintainers in [#13240](https://github.com/rook/rook/issues/13240) (tracked by `zhucan`). This design does not add new logic for that setup — it just doesn't actively forbid replacement on it under permissive validation. - -### PVC-based OSD replacement — separate design - -PVC-backed OSDs use a different code path (raw mode via `GetCephVolumeRawOSDs`, separate destroy plumbing). 
Issue #13240 is host-based storage; PVC replacement is a separate design. - -### Permanently-down host — different workflow - -If the OSD's host is gone, this flow cannot proceed (the Destroy step requires the host). Existing Rook node-decommission + OSD-purge flow handles it. - -## Open questions - -1. **Controller placement.** Design leans toward a separate `CephOSDReplace` CRD; `spec.storage.replaceOSD` on CephCluster (mirroring `spec.storage.migration`) is a fallback — see [Open question: controller placement](#open-question-controller-placement). Maintainers' call. - -2. **Parallelism.** The proposed OSD replacement process is serial. Are there use-cases for parallel replacement we should support — multiple OSDs safe-to-destroy on the same node, all safe-to-destroy in the cluster at once, configurable concurrency? - -3. **Auto-replace mode.** The proposed flow is always triggered explicitly by the user (CR creation). Should there be a follow-up option for automated replacement that triggers the same flow when a failed OSD and a fresh disk are detected on a node? - -4. **Default values.** Proposed: `safeToDestroyTimeout: 1h`, `diskWaitTimeout: 24h`, disk-wait re-check interval `5 min` (cluster-config tunable). Reasonable, or do reviewers see a reason to change them? - -5. **Disk-swap responsiveness.** Can the design rely on Rook's existing discovery (rook-discover when enabled, otherwise the per-reconcile prepare-job inventory) to detect the replacement disk? Expected latency? Caveats when the cluster uses `useAllDevices` vs. regex `deviceFilter` vs. exact `name:` entries? - -6. **Device-name validation.** Proposed: reject kernel names (`/dev/sda`, `vda`), `by-id`, and `by-uuid` references; accept `useAllDevices`, `deviceFilter`, and `by-path` (with implicit same-slot expectation). Sample: - - ```yaml - spec: - storage: - nodes: - - name: node-1 - devices: - - name: /dev/sda # kernel name — rejected (not swap-stable) - - name: /dev/disk/by-path/... 
# by-path — accepted (same-slot only) - ``` - - Should this be configurable, more permissive (user takes responsibility for any name), or stricter (reject `by-path` too)? - -7. **Cross-host replacement for non-shared-metadata OSDs.** Same-host is required by this design because the captured `metadataVG` lives on the original host. For OSDs without a metadata device this argument doesn't apply. Ceph itself permits cross-host replacement: `ceph osd destroy` retains no host info; CRUSH auto-relocates the OSD on daemon start at the cost of full PG remapping. Should this flow be supported by Rook osd replacement? - From 7646e7c5e177d3ec70c5f520c1ad6b13f3647053 Mon Sep 17 00:00:00 2001 From: Artem Torubarov Date: Thu, 7 May 2026 11:56:20 +0200 Subject: [PATCH 07/12] docs: cleanup Signed-off-by: Artem Torubarov --- design/ceph/osd-replacement.md | 33 ++++++++++++++++++++------------- 1 file changed, 20 insertions(+), 13 deletions(-) diff --git a/design/ceph/osd-replacement.md b/design/ceph/osd-replacement.md index 7ba7551c022f..c6383433e917 100644 --- a/design/ceph/osd-replacement.md +++ b/design/ceph/osd-replacement.md @@ -55,8 +55,9 @@ Rook has no automated flow for replacing a failed OSD today. The closest existin 1. `DestroyOSD` cleans up only the data LV. The DB LV on the shared metadata disk stays as an orphan, and the dm-crypt key in Ceph's config-key store is never removed (causing LUKS collisions on retry of encrypted OSDs). (`DestroyOSD`, [remove.go#L244-L290](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/remove.go#L244-L290)) 2. **The prepare-pod can't find a shared metadata disk once any OSD lives on it.** Rook's disk-discovery (`DiscoverDevicesWithFilter`, [disk.go#L97-L111](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/clusterd/disk.go#L97-L111)) skips any disk with `len(deviceChild) > 1` as a guard against claiming a user-partitioned disk. 
From the first OSD onward — encrypted or not — that count is ≥ 2 (parent + LV, plus crypt mapping if encrypted), so the filter triggers and the prepare-pod's `initializeDevicesLVMMode` errors with `metadata device is not found`. Same root cause as upstream issues [#15868](https://github.com/rook/rook/issues/15868) and parts of [#17477](https://github.com/rook/rook/issues/17477). -3. The OSD deployment has no env carrying the DB LV's *source physical device* — only the LV path itself (`ROOK_METADATA_DEVICE`). At replacement time, the operator needs the source device to resolve the metadata VG and to identify the disk for the auto-provisioning skip. This design introduces a new `ROOK_METADATA_SOURCE_DEVICE` env, plumbed through the daemon-deployment spec. -4. **No discover-only mode in the prepare-job.** The prepare-job conflates discovery and provisioning in a single sequential pass (`Provision`, [daemon.go#L159-L283](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/daemon.go#L159-L283)) — no way to inventory a node without auto-claiming any empty disks it finds. The replacement flow needs "scan but don't claim" so the empty replacement disk doesn't get auto-provisioned with a fresh ID before the operator can drive `ceph-volume lvm prepare --osd-id`. This design adds a `ROOK_DISCOVER_ONLY` mode to the prepare-job; the cluster controller passes it for nodes with an active `CephOSDReplace`. +3. **`ROOK_METADATA_DEVICE` is empty in OSD deployments.** `GetCephVolumeLVMOSDs` ([volume.go#L1082-L1182](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/volume.go#L1082-L1182)) constructs `OSDInfo` without setting `MetadataPath`, so the deployment env is empty. The replacement flow's env-first capture path needs `MetadataPath` populated. +4. The OSD deployment has no env carrying the DB LV's *source physical device* — only the LV path itself (`ROOK_METADATA_DEVICE`). 
At replacement time, the operator needs the source device to resolve the metadata VG and to identify the disk for the auto-provisioning skip. This design introduces a new `ROOK_METADATA_SOURCE_DEVICE` env, plumbed through the daemon-deployment spec. +5. **No discover-only mode in the prepare-job.** The prepare-job conflates discovery and provisioning in a single sequential pass (`Provision`, [daemon.go#L159-L283](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/daemon.go#L159-L283)) — no way to inventory a node without auto-claiming any empty disks it finds. The replacement flow needs "scan but don't claim" so the empty replacement disk doesn't get auto-provisioned with a fresh ID before the operator can drive `ceph-volume lvm prepare --osd-id`. This design adds a `ROOK_DISCOVER_ONLY` mode to the prepare-job; the cluster controller passes it for nodes with an active `CephOSDReplace`. ## Proposed flow @@ -76,6 +77,7 @@ sequenceDiagram User->>CR: create CR with osd id 5 Op->>CR: read CR + Note over Op: Pending phase elided — see Coordination Op->>CR: write phase=Validating Op->>Ceph: ceph osd dump (validate exists, get fsid) Op->>CR: write phase=Waiting @@ -132,10 +134,12 @@ kind: CephOSDReplace metadata: name: replace-osd-5 namespace: rook-ceph + labels: + rook.io/osd-replacement-node: node-1 # operator-managed; equals the target OSD's host node spec: cephCluster: my-cluster # immutable; target cluster in this namespace osdId: 5 # immutable - confirmation: yes-really-replace-osd-5 # must equal "yes-really-replace-osd-{osdId}"; typo guard against destroying the wrong OSD + confirmation: yes-really-replace-osd-5 # must equal "yes-really-replace-osd-{osdId}"; copy-paste guard against operating on the wrong OSD autoOut: false # optional; if true, operator marks healthy OSD `out` automatically (during Replacing). 
Default: false (fail-fast on up+in at Validating) safeToDestroyTimeout: 1h # optional; how long Replacing tolerates EBUSY on safe-to-destroy before Failed. Default: 1h diskWaitTimeout: 24h # optional; how long Waiting tolerates a missing disk before Failed. Default: 24h @@ -154,8 +158,8 @@ status: osdInfo: node: node-1 # OSD deployment NodeSelector; survives the deployment delete dataLV: /dev/ceph-data-vg-5/osd-block-aaa... # OSD deployment env ROOK_BLOCK_PATH - dbLV: /dev/ceph-metadata-vg-1/osd-db-bbb... # OSD deployment env ROOK_METADATA_DEVICE; absent for raw-mode OSDs - metadataSourceDevice: /dev/vdd # new env ROOK_METADATA_SOURCE_DEVICE (added by this design); absent for raw-mode OSDs + dbLV: /dev/ceph-metadata-vg-1/osd-db-bbb... # OSD deployment env ROOK_METADATA_DEVICE (populated by gap #3 fix) + metadataSourceDevice: /dev/vdd # new env ROOK_METADATA_SOURCE_DEVICE (added by this design) metadataVG: ceph-metadata-vg-1 # from `pvs --noheadings -o vg_name ` on the OSD's host (metadata device must still be readable) crushDeviceClass: hdd # OSD deployment env ROOK_OSD_CRUSH_DEVICE_CLASS databaseSizeMB: 1500 # from `ceph-volume lvm list --format json` lv_size (bytes) / 1048576 @@ -185,7 +189,7 @@ The Rook cluster controller spawns the prepare-job, which by default auto-discov In the existing cluster controller, add a gate before each `runPrepareJob` call in `startProvisioningOverNodes` ([create.go#L345](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/cluster/osd/create.go#L345)): list `CephOSDReplace` CRs whose `rook.io/osd-replacement-node` label equals the current node; if any is in a non-terminal phase, launch the Job with `ROOK_DISCOVER_ONLY=true` in its env. The replacement controller stamps this label on its CR at creation, reading the node from the target OSD's deployment. It clears the label on transition to a terminal phase; the deletion finalizer is a backup. 
-In discover-only mode, the prepare-job runs the same discovery code as a normal run (`DiscoverDevicesWithFilter` + `getAvailableDevices` at [daemon.go#L341](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/daemon.go#L341)) and writes the eligible-device list to the existing per-node status CM (`rook-ceph-osd--status`) and stops without provisioning. +In discover-only mode, the prepare-job runs the same discovery code as a normal run (`DiscoverDevicesWithFilter` + `getAvailableDevices`) and writes the eligible-device list to the existing per-node status CM (`rook-ceph-osd--status`) and stops without provisioning. The optional discovery DaemonSet (`ROOK_ENABLE_DISCOVERY_DAEMON=true`) only inventories devices; it doesn't provision. When enabled, it updates `local-device-` on udev events with seconds latency. The replacement controller may watch that CM as a fast wake-up signal, but treats the discover-only status CM as authoritative — `local-device-` is unfiltered (does not apply the cluster's `deviceFilter`/`useAllDevices`). @@ -241,28 +245,31 @@ If `spec.diskWaitTimeout` (default 24h) is exceeded, transitions to `Failed` wit 2. **Wait for `safe-to-destroy` OK.** `ceph osd safe-to-destroy ` returns OK only after the OSD is fully drained from every PG's acting set. Requeued until OK. If `spec.safeToDestroyTimeout` (default 1h) is exceeded, transitions to `Failed` with `reason=NotSafeToDestroy`. -3. **Capture OSDInfo.** Full details are in [CRD proposal](#crd-proposal). Most fields come from the OSD deployment's env. Two come off the host: +3. **Capture OSDInfo.** Full details are in [CRD proposal](#crd-proposal). Most fields come from the OSD deployment's env. 
Two come off the host, via a one-shot privileged Job on the target node (same pod scaffold as the Replace Job): ```bash # databaseSizeMB and encrypted — lv_size and tags.ceph.encrypted from the OSD's DB LV: ceph-volume lvm list /dev//osd-db- --format json - # metadataSourceDevice — devices[0] of the db entry - fallback for older OSDs missing ROOK_METADATA_SOURCE_DEVICE: - ceph-volume lvm list --format json # runs on the OSD's node + # metadataSourceDevice — devices[0] of the db entry; fallback for older OSDs missing ROOK_METADATA_SOURCE_DEVICE: + ceph-volume lvm list --format json ``` 4. **Delete OSD deployment.** The operator calls `k8sutil.DeleteDeployment` ([`deployment.go#L388`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/k8sutil/deployment.go#L388)) on `rook-ceph-osd-5` and polls until the pod is gone. -5. **Replace Job.** A single short-lived Job runs the destroy + prepare bash sequence in one container. The pod scaffold (volumes, security context, init containers, `DM_DISABLE_UDEV=1`) is reused from `c.provisionPodTemplateSpec`; without `DM_DISABLE_UDEV=1` `lvcreate`/`cryptsetup` hangs in `udev_wait` and `ceph-volume lvm prepare` fails with `RADOS permission denied`. The new DB LV's UUID is generated by the operator at Job creation and passed via env. +5. **Replace Job.** A single short-lived Job runs the destroy + prepare bash sequence in one container. The pod scaffold is reused from `provisionPodTemplateSpec` (the cluster controller's existing prepare-job pod scaffold), which carries two settings the Job needs: `DM_DISABLE_UDEV=1` prevents `lvcreate`/`cryptsetup` from hanging in `udev_wait`, and the `cephx-keyring-update` init container provides a fresh `bootstrap-osd` keyring so `ceph-volume lvm prepare` doesn't fail with `RADOS permission denied`. The new DB LV's UUID is generated by the operator at Job creation and passed via env. 
The container's command: ```bash +set -euo pipefail + # Re-check safe-to-destroy (insurance against races between phase transition and Job start). ceph osd safe-to-destroy 5 -# Destroy in Ceph (preserves OSD ID 5 for reuse). -ceph osd destroy osd.5 --yes-i-really-mean-it +# Destroy in Ceph (preserves OSD ID 5 for reuse). Idempotent on retry. +ceph osd dump --format json | jq -e '.osds[] | select(.osd==5) | .state | contains(["destroyed"])' >/dev/null \ + || ceph osd destroy osd.5 --yes-i-really-mean-it # Remove dm-crypt key (no-op on Ceph v19+; defensive for older versions). ceph config-key exists dm-crypt/osd//luks \ @@ -314,7 +321,7 @@ Cancel = delete the `CephOSDReplace` CR; a finalizer runs any cleanup needed. **Replacing — best-effort, deferred.** Once `Replacing` begins, the operator commits to running through to a terminal phase. Cancel intent is recorded but not acted on mid-flow: - If the Replace Job is in flight, the operator lets it complete. `ceph-volume lvm prepare` cannot be safely interrupted mid-call (partial dm-crypt + half-LUKS LV). -- On Job failure, the finalizer cleans up any partially-allocated DB LV and removes the CR. +- On Job failure, the finalizer spawns a one-shot cleanup Job on the target node (same pod scaffold) to remove any partially-allocated DB LV before removing the CR. - On Job success, cancel is not honored — the new OSD joins the cluster. 
## Notes on Scope From 3701a8e7d04e91b2a6776455f149ea96733aec01 Mon Sep 17 00:00:00 2001 From: Artem Torubarov Date: Thu, 7 May 2026 12:38:49 +0200 Subject: [PATCH 08/12] docs: add capture job to diagram Signed-off-by: Artem Torubarov --- design/ceph/osd-replacement.md | 91 +++++++++++++++++++--------------- 1 file changed, 51 insertions(+), 40 deletions(-) diff --git a/design/ceph/osd-replacement.md b/design/ceph/osd-replacement.md index c6383433e917..0bc06798b628 100644 --- a/design/ceph/osd-replacement.md +++ b/design/ceph/osd-replacement.md @@ -55,9 +55,7 @@ Rook has no automated flow for replacing a failed OSD today. The closest existin 1. `DestroyOSD` cleans up only the data LV. The DB LV on the shared metadata disk stays as an orphan, and the dm-crypt key in Ceph's config-key store is never removed (causing LUKS collisions on retry of encrypted OSDs). (`DestroyOSD`, [remove.go#L244-L290](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/remove.go#L244-L290)) 2. **The prepare-pod can't find a shared metadata disk once any OSD lives on it.** Rook's disk-discovery (`DiscoverDevicesWithFilter`, [disk.go#L97-L111](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/clusterd/disk.go#L97-L111)) skips any disk with `len(deviceChild) > 1` as a guard against claiming a user-partitioned disk. From the first OSD onward — encrypted or not — that count is ≥ 2 (parent + LV, plus crypt mapping if encrypted), so the filter triggers and the prepare-pod's `initializeDevicesLVMMode` errors with `metadata device is not found`. Same root cause as upstream issues [#15868](https://github.com/rook/rook/issues/15868) and parts of [#17477](https://github.com/rook/rook/issues/17477). -3. 
**`ROOK_METADATA_DEVICE` is empty in OSD deployments.** `GetCephVolumeLVMOSDs` ([volume.go#L1082-L1182](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/volume.go#L1082-L1182)) constructs `OSDInfo` without setting `MetadataPath`, so the deployment env is empty. The replacement flow's env-first capture path needs `MetadataPath` populated. -4. The OSD deployment has no env carrying the DB LV's *source physical device* — only the LV path itself (`ROOK_METADATA_DEVICE`). At replacement time, the operator needs the source device to resolve the metadata VG and to identify the disk for the auto-provisioning skip. This design introduces a new `ROOK_METADATA_SOURCE_DEVICE` env, plumbed through the daemon-deployment spec. -5. **No discover-only mode in the prepare-job.** The prepare-job conflates discovery and provisioning in a single sequential pass (`Provision`, [daemon.go#L159-L283](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/daemon.go#L159-L283)) — no way to inventory a node without auto-claiming any empty disks it finds. The replacement flow needs "scan but don't claim" so the empty replacement disk doesn't get auto-provisioned with a fresh ID before the operator can drive `ceph-volume lvm prepare --osd-id`. This design adds a `ROOK_DISCOVER_ONLY` mode to the prepare-job; the cluster controller passes it for nodes with an active `CephOSDReplace`. +3. **No discover-only mode in the prepare-job.** The prepare-job conflates discovery and provisioning in a single sequential pass (`Provision`, [daemon.go#L159-L283](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/daemon.go#L159-L283)) — no way to inventory a node without auto-claiming any empty disks it finds. 
The replacement flow needs "scan but don't claim" so the empty replacement disk doesn't get auto-provisioned with a fresh ID before the operator can drive `ceph-volume lvm prepare --osd-id`. This design adds a `ROOK_DISCOVER_ONLY` mode to the prepare-job; the cluster controller passes it for nodes with an active `CephOSDReplace`. ## Proposed flow @@ -72,12 +70,11 @@ sequenceDiagram participant CR as Rook CR participant Op as Operator participant OldPod as Old OSD pod - participant RJ as Replace Job + participant J as Job participant Ceph as Ceph User->>CR: create CR with osd id 5 Op->>CR: read CR - Note over Op: Pending phase elided — see Coordination Op->>CR: write phase=Validating Op->>Ceph: ceph osd dump (validate exists, get fsid) Op->>CR: write phase=Waiting @@ -88,23 +85,28 @@ sequenceDiagram Op->>Ceph: ceph osd out 5 (if autoOut and up+in) Op->>Ceph: safe-to-destroy 5 (poll until OK) Op->>OldPod: read deployment env + Op->>+J: create Capture Job (env: osdId) + J->>J: ceph-volume lvm list --format json + J->>J: write captured fields to per-replacement CM + J-->>-Op: Succeeded + Op->>Ceph: ceph osd dump (osdFsid) Op->>CR: capture OSD info Op->>OldPod: delete deployment destroy OldPod Op->>OldPod: wait for pod termination - Op->>+RJ: create (env from osdInfo) - RJ->>Ceph: safe-to-destroy 5 (re-check) - RJ->>Ceph: ceph osd destroy osd.5 - RJ->>Ceph: config-key rm dm-crypt key (if encrypted) - RJ->>RJ: cryptsetup close db mapping (if encrypted) - RJ->>RJ: lvremove db lv - RJ->>RJ: ceph-volume lvm zap data lv - RJ->>RJ: lvcreate new db lv - RJ->>Ceph: ceph-volume lvm prepare --osd-id 5 - RJ->>RJ: write new OSD info to per-node status CM - RJ-->>-Op: Succeeded + Op->>+J: create Replace Job (env from osdInfo) + J->>Ceph: safe-to-destroy 5 (re-check) + J->>Ceph: ceph osd destroy osd.5 + J->>Ceph: config-key rm dm-crypt key (if encrypted) + J->>J: cryptsetup close db mapping (if encrypted) + J->>J: lvremove db lv + J->>J: ceph-volume lvm zap data lv + J->>J: lvcreate 
new db lv + J->>Ceph: ceph-volume lvm prepare --osd-id 5 + J->>J: write new OSD info to per-node status CM + J-->>-Op: Succeeded create participant NewPod as New OSD pod - Op->>NewPod: create deployment (id=5, dataLV, new dbLV, metadataSourceDevice, encrypted) + Op->>NewPod: create deployment (id=5, dataLV, new dbLV, encrypted) NewPod->>Ceph: join cluster Op->>Ceph: ceph osd metadata 5 until Ready Op->>CR: phase=Completed @@ -120,7 +122,7 @@ The diagram doesn't pick a concrete CR or controller for the replacement reconci Concrete shape of each candidate: - **Extend the cluster controller** — state in either a side ConfigMap (`osd-replacement-state`, similar to `osd-migration-config`) or `CephCluster.status`. -- **New `CephOSDReplace` CRD + dedicated controller** — state on `.status`. Independent reconcile loop; never touches the existing OSD path. Light coupling on the cluster side: skip auto-provisioning on affected nodes. +- **New `CephOSDReplace` CRD + dedicated controller** — state on `.status`. Independent reconcile loop; doesn't modify the existing OSD path (shares only the pod-scaffold builder, `provisionPodTemplateSpec`). Light coupling on the cluster side: skip auto-provisioning on affected nodes. The rest of this design is based on a separate `CephOSDReplace` CRD, with implications for the cluster-CR fallback flagged inline. @@ -156,15 +158,15 @@ status: # captured at start of Replacing, before deployment delete osdInfo: - node: node-1 # OSD deployment NodeSelector; survives the deployment delete + node: node-1 # OSD deployment NodeSelector (operator reads via K8s API) dataLV: /dev/ceph-data-vg-5/osd-block-aaa... # OSD deployment env ROOK_BLOCK_PATH - dbLV: /dev/ceph-metadata-vg-1/osd-db-bbb... 
# OSD deployment env ROOK_METADATA_DEVICE (populated by gap #3 fix) - metadataSourceDevice: /dev/vdd # new env ROOK_METADATA_SOURCE_DEVICE (added by this design) - metadataVG: ceph-metadata-vg-1 # from `pvs --noheadings -o vg_name ` on the OSD's host (metadata device must still be readable) + dbLV: /dev/ceph-metadata-vg-1/osd-db-bbb... # `[db].lv_path` from `ceph-volume lvm list` (Capture Job) + metadataSourceDevice: /dev/vdd # `[db].devices[0]` from `ceph-volume lvm list` (Capture Job) + metadataVG: ceph-metadata-vg-1 # `[db].vg_name` from `ceph-volume lvm list` (Capture Job) crushDeviceClass: hdd # OSD deployment env ROOK_OSD_CRUSH_DEVICE_CLASS - databaseSizeMB: 1500 # from `ceph-volume lvm list --format json` lv_size (bytes) / 1048576 - encrypted: true # from `ceph-volume lvm list --format json` tags.ceph.encrypted - osdFsid: 07bb0602-5e27-4fcc-86b1-c1faa0bc20ac # from `ceph osd dump --format json`: `.osds[id=].uuid` + databaseSizeMB: 1500 # `[db].lv_size` (bytes) / 1048576 from `ceph-volume lvm list` (Capture Job) + encrypted: true # `[db].tags.ceph.encrypted` from `ceph-volume lvm list` (Capture Job) + osdFsid: 07bb0602-5e27-4fcc-86b1-c1faa0bc20ac # `ceph osd dump`: `.osds[id=].uuid` (operator via mon) # populated on phase=Completed newFsid: "" # recorded on Completed @@ -245,31 +247,39 @@ If `spec.diskWaitTimeout` (default 24h) is exceeded, transitions to `Failed` wit 2. **Wait for `safe-to-destroy` OK.** `ceph osd safe-to-destroy ` returns OK only after the OSD is fully drained from every PG's acting set. Requeued until OK. If `spec.safeToDestroyTimeout` (default 1h) is exceeded, transitions to `Failed` with `reason=NotSafeToDestroy`. -3. **Capture OSDInfo.** Full details are in [CRD proposal](#crd-proposal). Most fields come from the OSD deployment's env. Two come off the host, via a one-shot privileged Job on the target node (same pod scaffold as the Replace Job): +3. 
**Capture OSDInfo.** Field sources (full schema in [CRD proposal](#crd-proposal)): - ```bash - # databaseSizeMB and encrypted — lv_size and tags.ceph.encrypted from the OSD's DB LV: - ceph-volume lvm list /dev//osd-db- --format json + - `node`, `dataLV`, `crushDeviceClass`: from the OSD deployment env, read by the operator via the K8s API. + - `dbLV`, `metadataVG`, `metadataSourceDevice`, `databaseSizeMB`, `encrypted`: from `ceph-volume lvm list --format json`'s `[db]` entry. Host-only call, run by a **Capture Job** the operator spawns on the target node (same pod scaffold as the Replace Job — same cephx auth caveat as in step 5 below). The Job assumes the metadata device is still readable from the target node; if it has failed too, replacement cannot proceed. The Job writes captured fields to a per-replacement ConfigMap named `rook-ceph-osd-replace-`, owned by the `CephOSDReplace` CR (cascade-deleted on CR delete) and kept for the CR's lifetime. The operator waits for Job success, reads the CM, and copies fields to `.status.osdInfo` before proceeding to step 4 below. + - `osdFsid`: from `ceph osd dump --format json` run by the operator via its mon connection. - # metadataSourceDevice — devices[0] of the db entry; fallback for older OSDs missing ROOK_METADATA_SOURCE_DEVICE: - ceph-volume lvm list --format json - ``` + Jobs spawned at this and later steps use a deterministic name including an attempt counter (e.g., `rook-ceph-osd-replace--capture-`, `-replace-`) to avoid name collisions on retry; the operator increments `` on each spawn after a prior attempt's terminal status. 4. **Delete OSD deployment.** The operator calls `k8sutil.DeleteDeployment` ([`deployment.go#L388`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/k8sutil/deployment.go#L388)) on `rook-ceph-osd-5` and polls until the pod is gone. -5. **Replace Job.** A single short-lived Job runs the destroy + prepare bash sequence in one container. 
The pod scaffold is reused from `provisionPodTemplateSpec` (the cluster controller's existing prepare-job pod scaffold), which carries two settings the Job needs: `DM_DISABLE_UDEV=1` prevents `lvcreate`/`cryptsetup` from hanging in `udev_wait`, and the `cephx-keyring-update` init container provides a fresh `bootstrap-osd` keyring so `ceph-volume lvm prepare` doesn't fail with `RADOS permission denied`. The new DB LV's UUID is generated by the operator at Job creation and passed via env. +5. **Replace Job.** Built on the existing prepare-job pattern (same one the OSD migration flow uses): a new rook subcommand on the prepare-job's pod scaffold (`provisionPodTemplateSpec`) runs the destroy + prepare bash sequence. Volume mounts, init containers, `DM_DISABLE_UDEV=1`, and cephx auth bootstrap are all inherited from that pattern. The new DB LV's UUID is generated by the operator at Job creation and passed via env. The container's command: ```bash set -euo pipefail -# Re-check safe-to-destroy (insurance against races between phase transition and Job start). -ceph osd safe-to-destroy 5 - -# Destroy in Ceph (preserves OSD ID 5 for reuse). Idempotent on retry. -ceph osd dump --format json | jq -e '.osds[] | select(.osd==5) | .state | contains(["destroyed"])' >/dev/null \ - || ceph osd destroy osd.5 --yes-i-really-mean-it +# Skip safe-to-destroy + destroy if osd.5 is already destroyed (idempotent on Job retry). +already_destroyed() { + ceph osd dump --format json | python3 -c " +import sys, json +osds = json.load(sys.stdin).get('osds', []) +o = next((o for o in osds if o.get('osd') == 5), None) +sys.exit(0 if o and 'destroyed' in (o.get('state') or []) else 1) +" +} + +if ! already_destroyed; then + # Re-check safe-to-destroy (insurance against races between phase transition and Job start). + ceph osd safe-to-destroy 5 + # Destroy in Ceph (preserves OSD ID 5 for reuse). 
+ ceph osd destroy osd.5 --yes-i-really-mean-it +fi # Remove dm-crypt key (no-op on Ceph v19+; defensive for older versions). ceph config-key exists dm-crypt/osd//luks \ @@ -304,7 +314,7 @@ ceph-volume lvm prepare \ The Job writes the new OSD's info to `rook-ceph-osd--status` (the per-node CM Rook already uses to drive daemon creation). -6. **Create new Deployment.** The cluster controller's existing path takes over: `createOSDsForStatusMap` ([`status.go#L324`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/cluster/osd/status.go#L324)) reads the per-node status CM the Replace Job wrote and creates the daemon Deployment. The new deployment carries `ROOK_METADATA_DEVICE` and `ROOK_METADATA_SOURCE_DEVICE` directly — future replacement of this OSD won't need the older-version fallback. +6. **Create new Deployment.** The cluster controller's existing path takes over: `createOSDsForStatusMap` ([`status.go#L324`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/cluster/osd/status.go#L324)) reads the per-node status CM the Replace Job wrote and creates the daemon Deployment. 7. **Wait for `up+in`.** The controller polls `ceph osd tree` each reconcile until the new daemon is `up` AND `in`. Once visible, capture `osd_uuid` from `ceph osd metadata ` and transition to [Complete](#5-complete). @@ -379,3 +389,4 @@ PVC-backed OSDs use a different code path (raw mode via `GetCephVolumeRawOSDs`, 7. **Cross-host replacement for non-shared-metadata OSDs.** Same-host is required by this design because the captured `metadataVG` lives on the original host. For OSDs without a metadata device this argument doesn't apply. Ceph itself permits cross-host replacement: `ceph osd destroy` retains no host info; CRUSH auto-relocates the OSD on daemon start at the cost of full PG remapping. Should this flow be supported by Rook osd replacement? +8. 
**Combine Capture and Replace into one Job.** The Capture step currently runs as a separate Job before the Replace Job. It could be merged into the Replace Job's first commands, with intermediate state persisted to a per-replacement ConfigMap (mirroring the prepare-job's CM hand-off pattern) so the Job remains retry-idempotent: on retry the Job reads the CM if present and skips re-capture, then proceeds with destroy + prepare. Saves one Job spawn per replacement; adds a CM-check branch at the top of the Job's bash and a CM-write helper invocation. From bab863f76f77eb525fc6774f025b929a59a8a744 Mon Sep 17 00:00:00 2001 From: Artem Torubarov Date: Thu, 7 May 2026 12:54:32 +0200 Subject: [PATCH 09/12] docs: cleanup Signed-off-by: Artem Torubarov --- design/ceph/osd-replacement.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/design/ceph/osd-replacement.md b/design/ceph/osd-replacement.md index 0bc06798b628..a681b1ce96b7 100644 --- a/design/ceph/osd-replacement.md +++ b/design/ceph/osd-replacement.md @@ -55,7 +55,7 @@ Rook has no automated flow for replacing a failed OSD today. The closest existin 1. `DestroyOSD` cleans up only the data LV. The DB LV on the shared metadata disk stays as an orphan, and the dm-crypt key in Ceph's config-key store is never removed (causing LUKS collisions on retry of encrypted OSDs). (`DestroyOSD`, [remove.go#L244-L290](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/remove.go#L244-L290)) 2. **The prepare-pod can't find a shared metadata disk once any OSD lives on it.** Rook's disk-discovery (`DiscoverDevicesWithFilter`, [disk.go#L97-L111](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/clusterd/disk.go#L97-L111)) skips any disk with `len(deviceChild) > 1` as a guard against claiming a user-partitioned disk. 
From the first OSD onward — encrypted or not — that count is ≥ 2 (parent + LV, plus crypt mapping if encrypted), so the filter triggers and the prepare-pod's `initializeDevicesLVMMode` errors with `metadata device is not found`. Same root cause as upstream issues [#15868](https://github.com/rook/rook/issues/15868) and parts of [#17477](https://github.com/rook/rook/issues/17477). -3. **No discover-only mode in the prepare-job.** The prepare-job conflates discovery and provisioning in a single sequential pass (`Provision`, [daemon.go#L159-L283](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/daemon.go#L159-L283)) — no way to inventory a node without auto-claiming any empty disks it finds. The replacement flow needs "scan but don't claim" so the empty replacement disk doesn't get auto-provisioned with a fresh ID before the operator can drive `ceph-volume lvm prepare --osd-id`. This design adds a `ROOK_DISCOVER_ONLY` mode to the prepare-job; the cluster controller passes it for nodes with an active `CephOSDReplace`. +3. **No discover-only mode in the prepare-job.** The prepare-job combines discovery and provisioning in a single sequential pass (`Provision`, [daemon.go#L159-L283](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/daemon.go#L159-L283)) — no way to inventory a node without auto-claiming any empty disks it finds. The replacement flow needs "scan but don't claim" so the empty replacement disk doesn't get auto-provisioned with a fresh ID before the operator can drive `ceph-volume lvm prepare --osd-id`. This design adds a `ROOK_DISCOVER_ONLY` mode to the prepare-job; the cluster controller passes it for nodes with an active `CephOSDReplace`. ## Proposed flow @@ -250,14 +250,14 @@ If `spec.diskWaitTimeout` (default 24h) is exceeded, transitions to `Failed` wit 3. 
**Capture OSDInfo.** Field sources (full schema in [CRD proposal](#crd-proposal)): - `node`, `dataLV`, `crushDeviceClass`: from the OSD deployment env, read by the operator via the K8s API. - - `dbLV`, `metadataVG`, `metadataSourceDevice`, `databaseSizeMB`, `encrypted`: from `ceph-volume lvm list --format json`'s `[db]` entry. Host-only call, run by a **Capture Job** the operator spawns on the target node (same pod scaffold as the Replace Job — same cephx auth caveat as in step 5 below). The Job assumes the metadata device is still readable from the target node; if it has failed too, replacement cannot proceed. The Job writes captured fields to a per-replacement ConfigMap named `rook-ceph-osd-replace-`, owned by the `CephOSDReplace` CR (cascade-deleted on CR delete) and kept for the CR's lifetime. The operator waits for Job success, reads the CM, and copies fields to `.status.osdInfo` before proceeding to step 4 below. + - `dbLV`, `metadataVG`, `metadataSourceDevice`, `databaseSizeMB`, `encrypted`: from `ceph-volume lvm list --format json`'s `[db]` entry. Host-only call, run by a **Capture Job** the operator spawns on the target node. The Job assumes the metadata device is still readable from the target node; if it has failed too, replacement cannot proceed. The Job writes captured fields to a per-replacement ConfigMap named `rook-ceph-osd-replace-`, owned by the `CephOSDReplace` CR (cascade-deleted on CR delete) and kept for the CR's lifetime. The operator waits for Job success, reads the CM, and copies fields to `.status.osdInfo` before proceeding to step 4 below. - `osdFsid`: from `ceph osd dump --format json` run by the operator via its mon connection. Jobs spawned at this and later steps use a deterministic name including an attempt counter (e.g., `rook-ceph-osd-replace--capture-`, `-replace-`) to avoid name collisions on retry; the operator increments `` on each spawn after a prior attempt's terminal status. 4. 
**Delete OSD deployment.** The operator calls `k8sutil.DeleteDeployment` ([`deployment.go#L388`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/k8sutil/deployment.go#L388)) on `rook-ceph-osd-5` and polls until the pod is gone. -5. **Replace Job.** Built on the existing prepare-job pattern (same one the OSD migration flow uses): a new rook subcommand on the prepare-job's pod scaffold (`provisionPodTemplateSpec`) runs the destroy + prepare bash sequence. Volume mounts, init containers, `DM_DISABLE_UDEV=1`, and cephx auth bootstrap are all inherited from that pattern. The new DB LV's UUID is generated by the operator at Job creation and passed via env. +5. **Replace Job.** Built on the existing prepare-job pattern (same one the OSD migration flow uses): a new rook subcommand on the prepare-job's pod scaffold (`provisionPodTemplateSpec`) runs the destroy + prepare bash sequence. The new DB LV's UUID is generated by the operator at Job creation and passed via env. The container's command: From 0fe63224ed2d2f7757191374e5a99db2febdeda6 Mon Sep 17 00:00:00 2001 From: Artem Torubarov Date: Mon, 11 May 2026 15:09:36 +0200 Subject: [PATCH 10/12] docs: use lvm zap for metadata device in replace job Signed-off-by: Artem Torubarov --- design/ceph/osd-replacement.md | 74 +++++++++++++++------------------- 1 file changed, 33 insertions(+), 41 deletions(-) diff --git a/design/ceph/osd-replacement.md b/design/ceph/osd-replacement.md index a681b1ce96b7..2f2e7465ac90 100644 --- a/design/ceph/osd-replacement.md +++ b/design/ceph/osd-replacement.md @@ -23,7 +23,7 @@ A disk corresponding to `osd.5` fails on a node where five HDD OSDs share one NV ### Replacement is same-host -The new disk must go to the same host as the destroyed OSD: the DB slot freed by destroying the old OSD lives on a metadata device attached to that host, and the replacement OSD's DB must reuse it. Cross-host replacement is permitted by Ceph but out of scope here. 
+The new disk must go to the same host as the destroyed OSD: the DB LV reused by the replacement lives on a metadata device attached to that host. Cross-host replacement is permitted by Ceph but out of scope here. ### Rook cannot tell a replacement disk from a new disk @@ -98,15 +98,13 @@ sequenceDiagram J->>Ceph: safe-to-destroy 5 (re-check) J->>Ceph: ceph osd destroy osd.5 J->>Ceph: config-key rm dm-crypt key (if encrypted) - J->>J: cryptsetup close db mapping (if encrypted) - J->>J: lvremove db lv - J->>J: ceph-volume lvm zap data lv - J->>J: lvcreate new db lv - J->>Ceph: ceph-volume lvm prepare --osd-id 5 + J->>J: ceph-volume lvm zap db lv (in place; closes dm-crypt + wipes LUKS + clears tags) + J->>J: ceph-volume lvm zap data lv --destroy + J->>Ceph: ceph-volume lvm prepare --osd-id 5 --block.db J->>J: write new OSD info to per-node status CM J-->>-Op: Succeeded create participant NewPod as New OSD pod - Op->>NewPod: create deployment (id=5, dataLV, new dbLV, encrypted) + Op->>NewPod: create deployment (id=5, dataLV, dbLV, encrypted) NewPod->>Ceph: join cluster Op->>Ceph: ceph osd metadata 5 until Ready Op->>CR: phase=Completed @@ -161,10 +159,7 @@ status: node: node-1 # OSD deployment NodeSelector (operator reads via K8s API) dataLV: /dev/ceph-data-vg-5/osd-block-aaa... # OSD deployment env ROOK_BLOCK_PATH dbLV: /dev/ceph-metadata-vg-1/osd-db-bbb... 
# `[db].lv_path` from `ceph-volume lvm list` (Capture Job) - metadataSourceDevice: /dev/vdd # `[db].devices[0]` from `ceph-volume lvm list` (Capture Job) - metadataVG: ceph-metadata-vg-1 # `[db].vg_name` from `ceph-volume lvm list` (Capture Job) crushDeviceClass: hdd # OSD deployment env ROOK_OSD_CRUSH_DEVICE_CLASS - databaseSizeMB: 1500 # `[db].lv_size` (bytes) / 1048576 from `ceph-volume lvm list` (Capture Job) encrypted: true # `[db].tags.ceph.encrypted` from `ceph-volume lvm list` (Capture Job) osdFsid: 07bb0602-5e27-4fcc-86b1-c1faa0bc20ac # `ceph osd dump`: `.osds[id=].uuid` (operator via mon) @@ -250,66 +245,59 @@ If `spec.diskWaitTimeout` (default 24h) is exceeded, transitions to `Failed` wit 3. **Capture OSDInfo.** Field sources (full schema in [CRD proposal](#crd-proposal)): - `node`, `dataLV`, `crushDeviceClass`: from the OSD deployment env, read by the operator via the K8s API. - - `dbLV`, `metadataVG`, `metadataSourceDevice`, `databaseSizeMB`, `encrypted`: from `ceph-volume lvm list --format json`'s `[db]` entry. Host-only call, run by a **Capture Job** the operator spawns on the target node. The Job assumes the metadata device is still readable from the target node; if it has failed too, replacement cannot proceed. The Job writes captured fields to a per-replacement ConfigMap named `rook-ceph-osd-replace-`, owned by the `CephOSDReplace` CR (cascade-deleted on CR delete) and kept for the CR's lifetime. The operator waits for Job success, reads the CM, and copies fields to `.status.osdInfo` before proceeding to step 4 below. + - `dbLV`, `encrypted`: from `ceph-volume lvm list --format json`'s `[db]` entry. Host-only call, run by a **Capture Job** the operator spawns on the target node. The Job assumes the metadata device is still readable from the target node; if it has failed too, replacement cannot proceed. 
The Job writes captured fields to a per-replacement ConfigMap named `rook-ceph-osd-replace-`, owned by the `CephOSDReplace` CR (cascade-deleted on CR delete) and kept for the CR's lifetime. The operator waits for Job success, reads the CM, and copies fields to `.status.osdInfo` before proceeding to step 4 below. - `osdFsid`: from `ceph osd dump --format json` run by the operator via its mon connection. Jobs spawned at this and later steps use a deterministic name including an attempt counter (e.g., `rook-ceph-osd-replace--capture-`, `-replace-`) to avoid name collisions on retry; the operator increments `` on each spawn after a prior attempt's terminal status. 4. **Delete OSD deployment.** The operator calls `k8sutil.DeleteDeployment` ([`deployment.go#L388`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/k8sutil/deployment.go#L388)) on `rook-ceph-osd-5` and polls until the pod is gone. -5. **Replace Job.** Built on the existing prepare-job pattern (same one the OSD migration flow uses): a new rook subcommand on the prepare-job's pod scaffold (`provisionPodTemplateSpec`) runs the destroy + prepare bash sequence. The new DB LV's UUID is generated by the operator at Job creation and passed via env. +5. **Replace Job.** Built on the existing prepare-job pattern (same one the OSD migration flow uses): a new rook subcommand on the prepare-job's pod scaffold (`provisionPodTemplateSpec`) runs the destroy + prepare bash sequence. The Job's env carries the captured `osdInfo` fields plus the new data-device path from the per-node inventory CM. The container's command: ```bash set -euo pipefail -# Skip safe-to-destroy + destroy if osd.5 is already destroyed (idempotent on Job retry). +# Placeholders below match osdInfo field names: +# , , , , , plus +# (the path the operator picked up from the per-node inventory CM during Wait). + +# Skip safe-to-destroy + destroy if the OSD is already destroyed (idempotent on Job retry). 
already_destroyed() { ceph osd dump --format json | python3 -c " import sys, json osds = json.load(sys.stdin).get('osds', []) -o = next((o for o in osds if o.get('osd') == 5), None) +o = next((o for o in osds if o.get('osd') == ), None) sys.exit(0 if o and 'destroyed' in (o.get('state') or []) else 1) " } if ! already_destroyed; then # Re-check safe-to-destroy (insurance against races between phase transition and Job start). - ceph osd safe-to-destroy 5 - # Destroy in Ceph (preserves OSD ID 5 for reuse). - ceph osd destroy osd.5 --yes-i-really-mean-it + ceph osd safe-to-destroy + # Destroy in Ceph (preserves the OSD ID for reuse). + ceph osd destroy osd. --yes-i-really-mean-it fi # Remove dm-crypt key (no-op on Ceph v19+; defensive for older versions). ceph config-key exists dm-crypt/osd//luks \ && ceph config-key rm dm-crypt/osd//luks -# Close DB-side LUKS mapping. -DB_MAPPING=$(lsblk -nlo NAME,TYPE /dev/ceph-metadata-vg-1/osd-db- | awk '$2=="crypt"{print $1; exit}') -[ -n "$DB_MAPPING" ] && cryptsetup status "$DB_MAPPING" >/dev/null 2>&1 \ - && cryptsetup close "$DB_MAPPING" - -# Free the DB slot. -lvs /dev/ceph-metadata-vg-1/osd-db- >/dev/null 2>&1 \ - && lvremove -f /dev/ceph-metadata-vg-1/osd-db- - -# Zap the data LV (also handles the data-side dm-crypt mapping). -lvs /dev/ceph-data-vg-5/osd-block- >/dev/null 2>&1 \ - && ceph-volume lvm zap /dev/ceph-data-vg-5/osd-block- --destroy +# Wipe the existing DB LV in place; idempotent on a wiped LV. +ceph-volume lvm zap -# Pre-allocate the new DB LV; skip if it already exists (retry-safe). -lvs /dev/ceph-metadata-vg-1/osd-db-12cf3a91-... >/dev/null 2>&1 \ - || lvcreate -L 1500M -n osd-db-12cf3a91-... ceph-metadata-vg-1 --wipesignatures y +# `zap --destroy` errors on a missing LV, so guard with `lvs` for retry-safety. +lvs >/dev/null 2>&1 \ + && ceph-volume lvm zap --destroy -# Provision the new OSD with the preserved ID. --dmcrypt only when the record's -# `encrypted` field is true. 
+# Provision the new OSD with the preserved ID, reusing the zapped DB LV. ceph-volume lvm prepare \ --bluestore [--dmcrypt] \ - --osd-id 5 \ - --data /dev/vdh \ - --block.db /dev/ceph-metadata-vg-1/osd-db-12cf3a91-... \ - --crush-device-class hdd + --osd-id \ + --data \ + --block.db \ + --crush-device-class ``` The Job writes the new OSD's info to `rook-ceph-osd--status` (the per-node CM Rook already uses to drive daemon creation). @@ -331,7 +319,7 @@ Cancel = delete the `CephOSDReplace` CR; a finalizer runs any cleanup needed. **Replacing — best-effort, deferred.** Once `Replacing` begins, the operator commits to running through to a terminal phase. Cancel intent is recorded but not acted on mid-flow: - If the Replace Job is in flight, the operator lets it complete. `ceph-volume lvm prepare` cannot be safely interrupted mid-call (partial dm-crypt + half-LUKS LV). -- On Job failure, the finalizer spawns a one-shot cleanup Job on the target node (same pod scaffold) to remove any partially-allocated DB LV before removing the CR. +- On Job failure, the operator retries the Replace Job with an incremented attempt counter (`…-replace-`). The destroy and zap sub-steps are idempotent; recovery from a crash mid-`lvm prepare` (partial VG on the new disk) is an implementation detail. The finalizer removes the CR only after the Job reaches a terminal phase. - On Job success, cancel is not honored — the new OSD joins the cluster. ## Notes on Scope @@ -352,11 +340,15 @@ nodes: config: { metadataDevice: "nvme1n1" } # different metadata device on the same node ``` -This setup requires exact `name:` (or `fullpath:`) references — the per-device `config:` block can only be attached to a specific device entry, not to a regex match. 
Replacement of a single OSD on this setup works structurally (each OSD's `metadataSourceDevice` is captured in its `osdInfo` at destroy time), with two caveats: +This setup requires exact `name:` (or `fullpath:`) references — the per-device `config:` block can only be attached to a specific device entry, not to a regex match. Replacement of a single OSD on this setup works structurally (each OSD's `dbLV` path encodes the metadata VG, which sits on exactly one physical device — so the replacement targets the correct metadata device by construction), with two caveats: - **Device-name validation must permit exact entries** — see [open question 6](#open-questions). - **Same-slot replacement is required** — `by-path` resolves only when the new disk is in the original slot. Different-slot replacement stalls in the Wait step. +### Non-shared-metadata OSDs — same CRD, simpler bash + +The flow naturally covers OSDs without a separate metadata device — single-disk LVM-mode OSDs and `ceph-volume raw` OSDs. `osdInfo.dbLV` is empty for these cases, the Replace Job skips the DB-LV zap step and the `--block.db` arg, and the prepare call simplifies to `ceph-volume {lvm,raw} prepare --osd-id --data [--dmcrypt] --crush-device-class `. The CV mode (lvm vs raw) is already tracked in Rook's existing `OSDInfo`. Implementation may branch the Replace Job bash on `osdInfo.dbLV == ""` and on CV mode; the rest of the flow (CRD, state machine, Capture Job, auto-provisioning skip, Replace Job scaffold) is shared. + ### PVC-based OSD replacement — separate design PVC-backed OSDs use a different code path (raw mode via `GetCephVolumeRawOSDs`, separate destroy plumbing). Issue #13240 is host-based storage; PVC replacement is a separate design. @@ -387,6 +379,6 @@ PVC-backed OSDs use a different code path (raw mode via `GetCephVolumeRawOSDs`, Should this be configurable, more permissive (user takes responsibility for any name), or stricter (reject `by-path` too)? -7. 
**Cross-host replacement for non-shared-metadata OSDs.** Same-host is required by this design because the captured `metadataVG` lives on the original host. For OSDs without a metadata device this argument doesn't apply. Ceph itself permits cross-host replacement: `ceph osd destroy` retains no host info; CRUSH auto-relocates the OSD on daemon start at the cost of full PG remapping. Should this flow be supported by Rook osd replacement? +7. **Cross-host replacement for non-shared-metadata OSDs.** Same-host is required by this design because the captured `dbLV` lives on the original host's metadata VG. For OSDs without a metadata device this argument doesn't apply. Ceph itself permits cross-host replacement: `ceph osd destroy` retains no host info; CRUSH auto-relocates the OSD on daemon start at the cost of full PG remapping. Should this flow be supported by Rook osd replacement? 8. **Combine Capture and Replace into one Job.** The Capture step currently runs as a separate Job before the Replace Job. It could be merged into the Replace Job's first commands, with intermediate state persisted to a per-replacement ConfigMap (mirroring the prepare-job's CM hand-off pattern) so the Job remains retry-idempotent: on retry the Job reads the CM if present and skips re-capture, then proceeds with destroy + prepare. Saves one Job spawn per replacement; adds a CM-check branch at the top of the Job's bash and a CM-write helper invocation. 
From 02542a6498fc643b1c0a16f0237862307c720650 Mon Sep 17 00:00:00 2001 From: Artem Torubarov Date: Tue, 12 May 2026 14:59:05 +0200 Subject: [PATCH 11/12] docs: fix mermaid diagram syntax err Signed-off-by: Artem Torubarov --- design/ceph/osd-replacement.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/design/ceph/osd-replacement.md b/design/ceph/osd-replacement.md index 2f2e7465ac90..1dca5b8cf157 100644 --- a/design/ceph/osd-replacement.md +++ b/design/ceph/osd-replacement.md @@ -98,7 +98,7 @@ sequenceDiagram J->>Ceph: safe-to-destroy 5 (re-check) J->>Ceph: ceph osd destroy osd.5 J->>Ceph: config-key rm dm-crypt key (if encrypted) - J->>J: ceph-volume lvm zap db lv (in place; closes dm-crypt + wipes LUKS + clears tags) + J->>J: ceph-volume lvm zap db lv (in place, closes dm-crypt + wipes LUKS + clears tags) J->>J: ceph-volume lvm zap data lv --destroy J->>Ceph: ceph-volume lvm prepare --osd-id 5 --block.db J->>J: write new OSD info to per-node status CM From ac790e39b50205f06533d3cf2bd8ad060171175d Mon Sep 17 00:00:00 2001 From: Artem Torubarov Date: Wed, 27 May 2026 13:51:16 +0200 Subject: [PATCH 12/12] docs: osd replacement annotation-based design Signed-off-by: Artem Torubarov --- design/ceph/osd-replacement.md | 444 ++++++++++++++------------------- 1 file changed, 191 insertions(+), 253 deletions(-) diff --git a/design/ceph/osd-replacement.md b/design/ceph/osd-replacement.md index 1dca5b8cf157..d3cc1aa51597 100644 --- a/design/ceph/osd-replacement.md +++ b/design/ceph/osd-replacement.md @@ -10,56 +10,47 @@ This design proposes a workflow to replace a single failed OSD in place — pres ## Notation -- **User** - the human cluster admin who edits the CR. -- **Operator** - the Rook controller process. +- **User** - the human cluster admin who edits Kubernetes objects. +- **Operator** - the Rook controller process (CephCluster controller). - **Data LV / data device** - the LV (or block device) holding an OSD's bulk data. One per OSD. 
- **DB LV / metadata device** - the LV holding the OSD's rocksdb (`block.db`). One per OSD; multiple OSDs can share the same metadata device. ## User story -A disk corresponding to `osd.5` fails on a node where five HDD OSDs share one NVMe metadata device. The user marks `osd.5` for replacement in the Rook CR, swaps the physical disk in the chassis, and walks away. Rook destroys `osd.5`, frees its DB LV slot on the NVMe, provisions a new OSD on the replacement disk *with the same OSD ID 5*, and the other four OSDs on the same NVMe stay up the whole time. +A disk corresponding to `osd.5` fails on a node where five HDD OSDs share one NVMe metadata device. The user annotates the failed OSD's deployment: + +```sh +kubectl -n rook-ceph annotate deployment rook-ceph-osd-5 \ + rook.io/osd-replace=yes-really-replace-osd-5 +``` + +Rook drains and destroys `osd.5` in Ceph (preserving its CRUSH position and OSD ID — the slot is marked `destroyed` in the OSDMap), removes the data and DB LVs, and deletes the OSD deployment. The user swaps the physical disk in the chassis at any later time — minutes or days — and the next prepare-job reconcile on that node pairs the destroyed slot with the new disk, provisioning a new OSD with the preserved OSD ID 5 and a fresh DB LV in the existing metadata VG. The other four OSDs on the same NVMe stay up the whole time. + +The flow works for both already-failed disks (Ceph has already auto-marked the OSD `out` after `mon_osd_down_out_interval`) and misbehaving-but-still-running disks (slow IO, bad blocks, not yet auto-marked out). Rook unconditionally runs `ceph osd out` during drain — idempotent for already-out OSDs, and required for `safe-to-destroy` to ever pass on healthy OSDs. ## Constraints ### Replacement is same-host -The new disk must go to the same host as the destroyed OSD: the DB LV reused by the replacement lives on a metadata device attached to that host. Cross-host replacement is permitted by Ceph but out of scope here. 
+The new disk must go to the same host as the destroyed OSD. For shared-metadata layouts this is a hard requirement: the metadata VG holding the new DB LV lives on a device attached to that host. For other layouts there is no hard constraint, but preserving the OSD ID across hosts does not buy anything — CRUSH remaps PGs on daemon start anyway. Cephadm takes the same defensive approach and does not allow [cross-host replacement](https://docs.ceph.com/en/latest/cephadm/services/osd/#replacing-an-osd). ### Rook cannot tell a replacement disk from a new disk -When a fresh empty disk appears on a node, Rook has no way to tell it's the replacement for a failed OSD. With `useAllDevices` or a matching `deviceFilter`, the next reconcile auto-provisions the new disk with a fresh ID and leaks the failed OSD's resources. The user must mark the OSD for replacement in the CR *before* swapping the disk. +Rook cannot reliably distinguish a freshly added disk from a replacement disk. The user must therefore mark the failed OSD for replacement *before* swapping the disk, so Rook can drive cleanup and reserve the OSD slot for the incoming disk. ### Storage device config must tolerate device swap -Rook lets users identify OSD data devices via `spec.storage`: - -- `useAllDevices: true` — match any empty disk on the node. -- `deviceFilter: ""` — match disks whose `lsblk` properties match a regex. -- `nodes[].devices[].name: ""` — match a specific path or name. Accepts a kernel name (`vdb`), a raw path (`/dev/sdc`), or a udev symlink (`/dev/disk/by-path/...`, `/dev/disk/by-id/...`). -- `nodes[].devices[].fullpath: ""` — explicit DevLinks match (`/dev/disk/by-id/...`, `/dev/disk/by-path/...`). Compared against discovered symlinks, not regex. 
- -Each shape interacts differently with the Linux device-naming interfaces: - -- **Kernel names** (`vdb`, `sdc`, `/dev/sdc`) are assigned by the kernel at boot and [not guaranteed to be persistent](https://wiki.archlinux.org/title/Persistent_block_device_naming). -- **`/dev/disk/by-path/...`** is a udev symlink built from the sysfs port path: same physical port, same symlink. -- **`/dev/disk/by-id/...`** is a udev symlink built from the disk's hardware serial / WWN, unique per physical disk. -- **`/dev/disk/by-uuid/...`** is a udev symlink built from the filesystem or LV UUID, assigned at provisioning time. +Rook's `spec.storage` accepts several device-reference shapes (`useAllDevices`, `deviceFilter`, kernel names, and udev symlinks under `/dev/disk/by-path`, `by-id`, `by-uuid`). Only some of them resolve to the new disk after a swap: `useAllDevices` and `deviceFilter` tolerate any swap, `by-path` tolerates same-slot replacement, and the rest cannot. The replacement flow validates the affected OSD's references before drain and rejects the non-tolerant shapes; see [open question 5](#open-questions) for the full table and trade-offs. -The shapes that tolerate any swap (same-slot or different-slot, any new disk) are `useAllDevices` and `deviceFilter`. `by-path` tolerates only same-slot replacement. Kernel names tolerate only the lucky case where the kernel happens to assign the same name. `by-id`/`by-uuid` references in `name`/`fullpath` cannot work for a disk that hasn't been seen yet. - -The replacement flow must validate the affected OSD's CR references beforehand so the new disk is still resolvable under those references after the swap. - -## Current gaps +## Proposed flow -Rook has no automated flow for replacing a failed OSD today. 
The closest existing primitive is the migration flow (`spec.storage.migration`), which recreates OSDs in place after encryption or store-type spec changes: it destroys the OSD and re-prepares with `ceph-volume raw prepare --osd-id` via the `ROOK_REPLACE_OSD` env var. Migration only covers raw-mode OSDs; the shared-metadata case needs the following fixes: +This flow drives [Ceph's documented OSD-replacement procedure](https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/#replacing-an-osd) (`ok-to-stop` → `osd out` → `safe-to-destroy` → `osd destroy` → `lvm zap` → `lvm batch --osd-ids`) from inside the existing CephCluster controller. The design adds one new Kubernetes Job — the Replace Job — that destroys the OSD in Ceph and zaps its data on the host. -1. `DestroyOSD` cleans up only the data LV. The DB LV on the shared metadata disk stays as an orphan, and the dm-crypt key in Ceph's config-key store is never removed (causing LUKS collisions on retry of encrypted OSDs). (`DestroyOSD`, [remove.go#L244-L290](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/remove.go#L244-L290)) -2. **The prepare-pod can't find a shared metadata disk once any OSD lives on it.** Rook's disk-discovery (`DiscoverDevicesWithFilter`, [disk.go#L97-L111](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/clusterd/disk.go#L97-L111)) skips any disk with `len(deviceChild) > 1` as a guard against claiming a user-partitioned disk. From the first OSD onward — encrypted or not — that count is ≥ 2 (parent + LV, plus crypt mapping if encrypted), so the filter triggers and the prepare-pod's `initializeDevicesLVMMode` errors with `metadata device is not found`. Same root cause as upstream issues [#15868](https://github.com/rook/rook/issues/15868) and parts of [#17477](https://github.com/rook/rook/issues/17477). -3. 
**No discover-only mode in the prepare-job.** The prepare-job combines discovery and provisioning in a single sequential pass (`Provision`, [daemon.go#L159-L283](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/daemon.go#L159-L283)) — no way to inventory a node without auto-claiming any empty disks it finds. The replacement flow needs "scan but don't claim" so the empty replacement disk doesn't get auto-provisioned with a fresh ID before the operator can drive `ceph-volume lvm prepare --osd-id`. This design adds a `ROOK_DISCOVER_ONLY` mode to the prepare-job; the cluster controller passes it for nodes with an active `CephOSDReplace`. +State is read from Ceph and Kubernetes, not stored in Rook. Three markers carry it through the procedure: -## Proposed flow - -This flow orchestrates [Ceph's documented OSD-replacement procedure](https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/#replacing-an-osd) (`safe-to-destroy` → `osd destroy` → `lvm zap` → `lvm prepare --osd-id` → `lvm activate`) inside a single short-lived Kubernetes Job, with state machine maintained in Rook CR status. `cephadm` — Ceph's container-orchestrator analogue — preserves OSD IDs by default ([cephadm OSD service docs](https://docs.ceph.com/en/latest/cephadm/services/osd/#replacing-an-osd)); this design follows the same convention. +- **Annotation on the OSD Deployment.** Marks intent. Stays in place while the operator validates the request and runs `ok-to-stop`, `osd out`, and `safe-to-destroy`. Cleaned up automatically when the Deployment is deleted. +- **[Suspended](https://kubernetes.io/blog/2021/04/12/introducing-suspended-jobs/#api-changes) Replace Job.** The operator captures the OSD's fsid, CV mode, data-device path, and encryption flag into the Job's env, creates the Job suspended, and deletes the OSD Deployment. The Job is resumed once the OSD pod terminates. 
+- **Destroyed slot in the OSDMap.** The Replace Job runs `ceph osd destroy` to mark the OSD's slot as destroyed in the OSDMap and exits after cleanup is done. The user can physically swap the failed disk at any time after that — destroy and provision are decoupled. The existing prepare-job is extended to check the OSDMap for destroyed OSDs on the same node and reuse them when provisioning a new disk, the same primitive [cephadm uses for `orch osd rm --replace`](https://docs.ceph.com/en/latest/cephadm/services/osd/#replacing-an-osd). ### Sequence @@ -67,260 +58,183 @@ This flow orchestrates [Ceph's documented OSD-replacement procedure](https://doc sequenceDiagram autonumber actor User - participant CR as Rook CR + participant Dep as OSD Deployment participant Op as Operator - participant OldPod as Old OSD pod participant J as Job participant Ceph as Ceph - User->>CR: create CR with osd id 5 - Op->>CR: read CR - Op->>CR: write phase=Validating - Op->>Ceph: ceph osd dump (validate exists, get fsid) - Op->>CR: write phase=Waiting - Note over Op: each reconcile, poll inventory CM and requeue until disk visible - Note over User,Op: User swaps the failed disk - Op->>CR: write phase=Replacing - Note over User,Ceph: from here, cancellation has side effects (see Cancellation) - Op->>Ceph: ceph osd out 5 (if autoOut and up+in) - Op->>Ceph: safe-to-destroy 5 (poll until OK) - Op->>OldPod: read deployment env - Op->>+J: create Capture Job (env: osdId) - J->>J: ceph-volume lvm list --format json - J->>J: write captured fields to per-replacement CM - J-->>-Op: Succeeded - Op->>Ceph: ceph osd dump (osdFsid) - Op->>CR: capture OSD info - Op->>OldPod: delete deployment - destroy OldPod - Op->>OldPod: wait for pod termination - Op->>+J: create Replace Job (env from osdInfo) - J->>Ceph: safe-to-destroy 5 (re-check) - J->>Ceph: ceph osd destroy osd.5 - J->>Ceph: config-key rm dm-crypt key (if encrypted) - J->>J: ceph-volume lvm zap db lv (in place, closes dm-crypt + wipes LUKS + 
clears tags) - J->>J: ceph-volume lvm zap data lv --destroy - J->>Ceph: ceph-volume lvm prepare --osd-id 5 --block.db - J->>J: write new OSD info to per-node status CM - J-->>-Op: Succeeded - create participant NewPod as New OSD pod - Op->>NewPod: create deployment (id=5, dataLV, dbLV, encrypted) - NewPod->>Ceph: join cluster - Op->>Ceph: ceph osd metadata 5 until Ready - Op->>CR: phase=Completed -``` - -### Open question: controller placement - -The diagram doesn't pick a concrete CR or controller for the replacement reconcile logic. Two candidates: extend the existing CephCluster controller (which already hosts `spec.storage.migration`), or introduce a separate `CephOSDReplace` CRD with its own controller. The design leans toward the separate CRD for the following reasons: - -1. **CephCluster's `reconcileCephDaemons` is monolithic and synchronous** — mon, mgr, and osd reconcile run sequentially in one call ([`cluster.go#L116-L160`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/cluster/cluster.go#L116-L160)); `osd.Cluster.Start()` returns plain `error`, so there's no way to express terminal failure (bad CR rejected) vs. transient `RequeueAfter` (waiting for disk-swap or Job completion). Adding long-running multi-step logic to this path interferes with mon/mgr reconcile and lacks the return semantics the flow needs. -2. **Replacement state has to survive between reconciles**, and the cluster controller has no existing place to store sub-operation state — adding one (a side ConfigMap, or extending `CephCluster.status`) is part of the cost. - -Concrete shape of each candidate: - -- **Extend the cluster controller** — state in either a side ConfigMap (`osd-replacement-state`, similar to `osd-migration-config`) or `CephCluster.status`. -- **New `CephOSDReplace` CRD + dedicated controller** — state on `.status`. 
Independent reconcile loop; doesn't modify the existing OSD path (shares only the pod-scaffold builder, `provisionPodTemplateSpec`). Light coupling on the cluster side: skip auto-provisioning on affected nodes. - -The rest of this design is based on a separate `CephOSDReplace` CRD, with implications for the cluster-CR fallback flagged inline. - -### CRD proposal - -Config lives on `CephOSDReplace.spec` and state in `.status`. `spec.cephCluster` and `spec.osdId` are immutable post-create. `.status` carries phase and conditions following the K8s operator pattern. - -```yaml -apiVersion: ceph.rook.io/v1 -kind: CephOSDReplace -metadata: - name: replace-osd-5 - namespace: rook-ceph - labels: - rook.io/osd-replacement-node: node-1 # operator-managed; equals the target OSD's host node -spec: - cephCluster: my-cluster # immutable; target cluster in this namespace - osdId: 5 # immutable - confirmation: yes-really-replace-osd-5 # must equal "yes-really-replace-osd-{osdId}"; copy-paste guard against operating on the wrong OSD - autoOut: false # optional; if true, operator marks healthy OSD `out` automatically (during Replacing). Default: false (fail-fast on up+in at Validating) - safeToDestroyTimeout: 1h # optional; how long Replacing tolerates EBUSY on safe-to-destroy before Failed. Default: 1h - diskWaitTimeout: 24h # optional; how long Waiting tolerates a missing disk before Failed. Default: 24h - -status: - phase: Replacing # Pending | Validating | Waiting | Replacing | Completed | Failed | Cancelled - conditions: - - type: Ready - status: "False" - reason: Replacing - message: Replace Job in flight - observedGeneration: 1 - lastTransitionTime: "2026-05-05T12:00:00Z" - - # captured at start of Replacing, before deployment delete - osdInfo: - node: node-1 # OSD deployment NodeSelector (operator reads via K8s API) - dataLV: /dev/ceph-data-vg-5/osd-block-aaa... # OSD deployment env ROOK_BLOCK_PATH - dbLV: /dev/ceph-metadata-vg-1/osd-db-bbb... 
# `[db].lv_path` from `ceph-volume lvm list` (Capture Job) - crushDeviceClass: hdd # OSD deployment env ROOK_OSD_CRUSH_DEVICE_CLASS - encrypted: true # `[db].tags.ceph.encrypted` from `ceph-volume lvm list` (Capture Job) - osdFsid: 07bb0602-5e27-4fcc-86b1-c1faa0bc20ac # `ceph osd dump`: `.osds[id=].uuid` (operator via mon) - - # populated on phase=Completed - newFsid: "" # recorded on Completed - completedAt: null + User->>Dep: kubectl annotate + Op->>Dep: watch + critical Validation - emit error and skip OSD if failed + Op->>Op: validate Cluster CR device names
and annotation OSD id + Op->>Ceph: ceph osd dump (osd.5 exists, state != destroyed) + Op->>Ceph: ceph osd ok-to-stop 5 + end + Op->>Ceph: ceph osd out 5 (idempotent) + Note left of Ceph: OSD up+out.<br/>PGs migrate to peers. + Op->>Ceph: ceph osd safe-to-destroy 5 + Note left of Op: RequeueAfter 30s until pass.<br/>1h timeout. + + Op->>Dep: read ROOK_OSD_UUID, ROOK_CV_MODE,<br/>ROOK_BLOCK_PATH, encrypted label + Op->>+J: create Replace Job<br/>(spec.suspend=true)<br/>OSD_ID, OSD_FSID, OSD_CV_MODE<br/>OSD_DATA_DEVICE, OSD_ENCRYPTED + + Op->>Dep: delete<br/>(propagationPolicy=Foreground) + destroy Dep + Dep->>Op: deployment removed, reconcile triggered + Op->>J: patch spec.suspend=false + Note over J: Pod scheduled + + J->>Ceph: ceph osd dump (if osd.5 == destroyed, jump to zap) + J->>Ceph: ceph osd safe-to-destroy 5 (re-check) + J->>Ceph: ceph osd down osd.5 + J->>Ceph: ceph osd destroy osd.5 --yes-i-really-mean-it + Note over Ceph: OSD slot stays in CRUSH as destroyed. + + J->>Ceph: (lvm only):<br/>ceph-volume lvm zap --osd-id 5 --destroy + Note over Ceph: walks ceph.osd_id LV tags.<br/>closes dm-crypt mappings.<br/>removes data and DB LVs + J->>Ceph: (raw mode + encrypted):<br/>cryptsetup close ceph-OSD_FSID-vdc-block-dmcrypt + J->>Ceph: (raw mode):<br/>ceph-volume lvm zap OSD_DATA_DEVICE --destroy + J-->>-Op: succeeded + + Note over User,Ceph: user swaps the failed disk on the host.<br/>Any time after the Replace Job succeeds. No timeout.<br/>End of replace flow. The rest is done in normal OSD provision flow. + + Op->>+J: next prepare-job reconcile on the node + J->>Ceph: ceph osd tree --states destroyed<br/>(checks free slots on the node) + J->>Ceph: ceph-volume lvm batch --no-auto /dev/newDataDisk \<br/>--db-devices /dev/metaDev --osd-ids 5 \
[--dmcrypt] --yes --no-systemd + Note over J: handoff to existing flow
prepare-job writes OSDInfo to CM + J-->>-Op: succeeded + + create participant NewDep as new OSD Deployment + Op->>NewDep: create rook-ceph-osd-5 + Op->>Ceph: poll ceph osd tree until osd.5 is up+in ``` -Users cancel a replacement by deleting the `CephOSDReplace` CR. A finalizer gives the operator a chance to clean up before the CR is removed. If the user cancels before the operator picks up the swapped disk, no Ceph or host state has changed and the CR is removed cleanly. If the replacement has already started, the operator runs it to a terminal state (success or failure) before removing the CR — see [Cancellation](#cancellation). - -CR names are arbitrary. To re-replace the same OSD, the user creates a new CR with a different name. Terminal CRs (`Completed`, `Cancelled`, `Failed`) for the same `osdId` are ignored by the operator and can be deleted when no longer useful. - -#### Coordination - -Replacements run serially per CephCluster as a simplifying choice, matching cephadm's `osd rm` queue and Rook's existing OSD migration. Per-OSD `safe-to-destroy` only returns OK once the OSD is fully drained from every PG's acting set, so concurrent destroys of independently-safe OSDs are technically safe — but serial keeps the operational model simple. - -The queue is implemented via a `Pending` phase. Each reconcile, the controller lists peer `CephOSDReplace` CRs in the same namespace targeting the same cluster. If no earlier-`creationTimestamp` peer is in a non-terminal phase, this CR advances to `Validating`; otherwise it stays in `Pending` and re-checks next reconcile. UID breaks same-second ties. - -> Extending CephCluster with a `spec.storage.replaceOSD` field needs no coordination logic — a single field admits only one in-flight replacement. +### Step-by-step -#### Auto-provisioning skip +The walk-through uses the running example. -The Rook cluster controller spawns the prepare-job, which by default auto-discovers devices and provisions new OSDs. 
To make the replacement flow work, the cluster controller must run the prepare-job in "discover only" mode on a node where a replacement is running — discovery happens, provisioning doesn't. +#### 1. Trigger -In the existing cluster controller, add a gate before each `runPrepareJob` call in `startProvisioningOverNodes` ([create.go#L345](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/cluster/osd/create.go#L345)): list `CephOSDReplace` CRs whose `rook.io/osd-replacement-node` label equals the current node; if any is in a non-terminal phase, launch the Job with `ROOK_DISCOVER_ONLY=true` in its env. The replacement controller stamps this label on its CR at creation, reading the node from the target OSD's deployment. It clears the label on transition to a terminal phase; the deletion finalizer is a backup. +User sets the annotation `rook.io/osd-replace=yes-really-replace-osd-` on the OSD deployment. The numeric `` in the value must equal the deployment's `ceph-osd-id` label (copy-paste guard against operating on the wrong OSD). -In discover-only mode, the prepare-job runs the same discovery code as a normal run (`DiscoverDevicesWithFilter` + `getAvailableDevices`) and writes the eligible-device list to the existing per-node status CM (`rook-ceph-osd--status`) and stops without provisioning. +A predicate carve-out on the CephCluster controller's owned-Deployment watch fires the reconcile when this annotation transitions from absent to present, value changes, or present to absent. Other Deployment updates remain suppressed. -The optional discovery DaemonSet (`ROOK_ENABLE_DISCOVERY_DAEMON=true`) only inventories devices; it doesn't provision. When enabled, it updates `local-device-` on udev events with seconds latency. 
The replacement controller may watch that CM as a fast wake-up signal, but treats the discover-only status CM as authoritative — `local-device-` is unfiltered (does not apply the cluster's `deviceFilter`/`useAllDevices`). +#### 2. Validate -> With the cluster-CR fallback (`spec.storage.replaceOSD` on CephCluster), the cluster controller reads its own spec field instead of listing CRs — same flag plumbing. +Cheap upfront checks. On the first failed check the operator returns from the reconcile without requeuing; `ReportReconcileResult` emits a Warning event on the CephCluster CR. See [open question 6](#open-questions) on where to report the validation result. -#### Phase state machine +1. **Confirmation matches.** Annotation value equals `yes-really-replace-osd-` where `` equals the deployment's `ceph-osd-id` label. +2. **Target OSD exists and is not already destroyed.** `ceph osd dump`: the OSD must exist and its `state` must not contain `destroyed`. +3. **Deployment is host-based.** Label `ceph.rook.io/pvc` must be absent. See [Notes on Scope](#pvc-based-osd-replacement--separate-design). +4. **Data-device reference is swap-tolerant.** The OSD's `spec.storage` reference must be `useAllDevices`, `deviceFilter`, or a `/dev/disk/by-path/...` path (same-slot replacement only). See [open question 5](#open-questions) on whether to make this configurable. +5. **Drain admission.** `ceph osd ok-to-stop ` must return OK. Failure means removing this OSD would push some PG below `min_size`; the user resolves the underlying cluster state before re-annotating. -``` - Pending ─→ Validating ─→ Waiting ─→ Replacing ─→ Completed - │ │ │ │ - ▼ ▼ ▼ ▼ - Cancelled / Failed -``` - -On operator restart, reconcile resumes from `.status.phase` plus observable state — Jobs by name, deployment presence, `osdInfo` populated, OSD `up+in` in `ceph osd tree`. Sub-step progress within `Replacing` is not persisted on the CR. +On all checks passing, the controller proceeds to drain. 
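The confirmation check (validate check 1) can be sketched in Go as below. This is a minimal illustration, not Rook code: the function name `checkConfirmation` and the constant `confirmationPrefix` are hypothetical, and the sketch assumes the caller has already read the annotation value and the `ceph-osd-id` label off the Deployment.

```go
package main

import (
	"fmt"
	"strconv"
)

// confirmationPrefix is the fixed part of the rook.io/osd-replace value.
// The constant name is hypothetical.
const confirmationPrefix = "yes-really-replace-osd-"

// checkConfirmation returns nil only when the annotation value is the prefix
// followed by exactly the deployment's ceph-osd-id label.
func checkConfirmation(annotationValue, osdIDLabel string) error {
	// The label must be a plain non-negative decimal number (no sign, no spaces).
	if _, err := strconv.ParseUint(osdIDLabel, 10, 32); err != nil {
		return fmt.Errorf("ceph-osd-id label %q is not a valid OSD ID: %w", osdIDLabel, err)
	}
	want := confirmationPrefix + osdIDLabel
	if annotationValue != want {
		return fmt.Errorf("annotation value %q does not match %q", annotationValue, want)
	}
	return nil
}

func main() {
	// Value and label agree: accepted.
	fmt.Println(checkConfirmation("yes-really-replace-osd-5", "5"))
	// Value copied from another OSD's deployment: rejected.
	fmt.Println(checkConfirmation("yes-really-replace-osd-6", "5"))
}
```

Comparing the whole string against the label, rather than parsing the suffix, keeps the copy-paste guard strict: a value such as `yes-really-replace-osd-05` does not match label `5`.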
-### Step-by-step +#### 3. Drain -The walk-through uses the running example. +Drain leaves the OSD `up` but with no PGs, then waits for `safe-to-destroy`. -#### 1. Trigger — user creates a `CephOSDReplace` CR +1. **`ceph osd out `.** Unconditional. Idempotent for already-out OSDs and required for `safe-to-destroy` to ever pass on a healthy OSD. +2. **Poll `ceph osd safe-to-destroy ` until OK.** -Typical case is a failed disk: Ceph auto-marks the OSD `down` and `out` and rebalances the data. User creates a `CephOSDReplace` CR and replaces the failed disk in the datacenter. +Polling uses `RequeueAfter: 30s`, with a 1h cumulative timeout since drain start enforced via a wall-clock annotation `rook.io/osd-replace-started-at=` stamped on the deployment on the first drain reconcile. -Healthy (`up+in`) OSDs are rejected by [Validate](#2-validate) unless the user marks the OSD out manually first (`ceph osd out `) or sets `spec.autoOut: true`. +Retry on failure or expiry: the user removes the `rook.io/osd-replace` annotation and re-adds it. The annotation-flip predicate fires a fresh reconcile; the operator clears the stale `started-at` annotation on a fresh start. -On creation, the CR enters `Pending` and waits for any earlier in-flight replacement to terminate. Once cleared, it advances to `Validating`. +#### 4. Spawn the Replace Job -#### 2. Validate +Before deleting the deployment, the operator reads a small set of immutable fields off it and writes them to the Job's env. Kubernetes Job env is immutable across pod restarts within the same Job, so the Job never needs to re-derive these from Ceph state that `osd destroy` will zero out. -Cheap upfront checks. Each reconcile cycle runs the checks in order; the first failure ends the phase. 
+| Job env var | Source on the Deployment | Used by | +|-------------------|-----------------------------------------------|----------------------------------------------------| +| `OSD_ID` | annotation value (`yes-really-replace-osd-`) | all teardown commands | +| `OSD_FSID` | container env `ROOK_OSD_UUID` | raw-encrypted dm-crypt mapper name | +| `OSD_CV_MODE` | container env `ROOK_CV_MODE` | Replace Job step 5 branch (lvm zap vs raw zap) | +| `OSD_DATA_DEVICE` | container env `ROOK_BLOCK_PATH` | raw zap target | +| `OSD_ENCRYPTED` | label `encrypted` on the OSD deployment | raw-encrypted gate for `cryptsetup close` | -1. **Confirmation matches.** `spec.confirmation` must equal `"yes-really-replace-osd-{spec.osdId}"`. On mismatch: `Failed` with `reason=InvalidSpec` (typo guard). -2. **Target OSD exists.** If absent from the OSD map: `Failed` with `reason=InvalidSpec`. -3. **Target OSD is destroyable.** If `up && in` and `spec.autoOut: false`: `Failed` with `reason=OSDStillIn`. If `up && in` and `spec.autoOut: true`: accepted; the actual `ceph osd out` runs in [Replace](#4-replace), not here. -4. **CR-level device matching is swap-tolerant.** The OSD's data device must be referenced via `useAllDevices`, `deviceFilter`, or a `/dev/disk/by-path/...` path (same-slot replacement only). Kernel names (`vdb`, `/dev/sda`), `by-id`, and `by-uuid` references are rejected — they can't resolve to a fresh disk. On rejection: `Failed` with `reason=InvalidSpec`. (Whether to make this configurable is [open question 6](#open-questions).) +The Job is created with `spec.suspend=true` and a deterministic name (`rook-ceph-osd-replace-`). The suspended Job *is* the durable replacement marker until it completes. On retry, the operator deletes any prior completed/failed Job with the same name before creating a fresh one. The Job reuses the existing prepare-job pod scaffold. -On all checks passing, advances to `Waiting`. +#### 5. Delete the OSD deployment -#### 3. 
Wait for replacement disk +The operator deletes the OSD deployment with `propagationPolicy=Foreground` and returns from the reconcile without polling. Kubernetes holds the deployment in `Terminating` until the OSD pod is fully terminated and the daemon releases its hold on the data/DB LVs. (dm-crypt mappings stay open at this point — the Replace Job's step 5 closes them.) -Each reconcile, the controller checks whether the replacement disk is visible by reading the per-node status CM (`rook-ceph-osd--status`, populated by the prepare-job). If the empty replacement disk isn't there yet, requeue. When it appears, advance to `Replacing`. +The owned-Deployment watch fires the next reconcile when the deployment is finally removed from the API. -Cancel during `Waiting` is clean: no Ceph or host state has been changed, no LVs touched. Deletion of the CR ends the flow with no recovery needed. +#### 6. Unsuspend the Replace Job -If `spec.diskWaitTimeout` (default 24h) is exceeded, transitions to `Failed` with `reason=ReplacementDiskMissing`. After timeout, the user can insert the disk and create a new CR for the same OSD ID (the OSD is still alive in Ceph at this point), or delete the CR. +On the deletion-triggered reconcile, the operator observes a suspended Job and the now-absent deployment. It patches `Job.spec.suspend=false`; Kubernetes schedules the pod and the Job container begins execution. Job env is fixed at create time, so the values read off the deployment in step 4 survive the unsuspend. -#### 4. Replace +#### 7. Replace Job execution -`Replacing` runs the full set of state changes in sequence. On each reconcile, the operator inspects observable state (deployment presence, `status.osdInfo`, Replace Job status, `ceph osd tree`) and runs the next unfinished sub-step. +Steps 1-4 are mode-agnostic and idempotent on pod restart: -1. **autoOut (conditional).** If the OSD is `up && in` and `spec.autoOut: true`, run `ceph osd out `. 
For the typical failed-disk case the OSD is already `out` (Ceph auto-marked it after `mon_osd_down_out_interval`) and this step is a no-op. +1. **State check.** `ceph osd dump`: if `osd.` state already contains `destroyed`, skip steps 2-4 and jump to step 5. +2. **Defensive `safe-to-destroy ` re-check.** +3. **`ceph osd down osd.`.** Forces the mon view to `down`. Idempotent. Dodges heartbeat-lag `EBUSY` on the next step. +4. **`ceph osd destroy osd. --yes-i-really-mean-it`.** The mon's `KVMonitor::do_osd_destroy` clears `dm-crypt/osd//*` and `daemon-private/osd./*` keys ([KVMonitor.cc#L369-L387](https://github.com/ceph/ceph/blob/v19.2.2/src/mon/KVMonitor.cc#L369-L387)). CRUSH bucket and weight preserved; no explicit `config-key rm` needed. -2. **Wait for `safe-to-destroy` OK.** `ceph osd safe-to-destroy ` returns OK only after the OSD is fully drained from every PG's acting set. Requeued until OK. If `spec.safeToDestroyTimeout` (default 1h) is exceeded, transitions to `Failed` with `reason=NotSafeToDestroy`. +Step 5 is mode-specific: -3. **Capture OSDInfo.** Field sources (full schema in [CRD proposal](#crd-proposal)): +**LVM mode (`OSD_CV_MODE == lvm`).** `ceph-volume lvm zap --osd-id --destroy` walks `ceph.osd_id` LV tags, closes dm-crypt mappings, and removes data and DB LVs. Sibling DB LVs on a shared metadata VG are untouched — `zap --destroy` removes the parent VG only when at most one LV remains. `lvm zap --osd-id` is not idempotent, so it is guarded with `ceph-volume lvm list --format json` in case of Job retry. - - `node`, `dataLV`, `crushDeviceClass`: from the OSD deployment env, read by the operator via the K8s API. - - `dbLV`, `encrypted`: from `ceph-volume lvm list --format json`'s `[db]` entry. Host-only call, run by a **Capture Job** the operator spawns on the target node. The Job assumes the metadata device is still readable from the target node; if it has failed too, replacement cannot proceed. 
The Job writes captured fields to a per-replacement ConfigMap named `rook-ceph-osd-replace-`, owned by the `CephOSDReplace` CR (cascade-deleted on CR delete) and kept for the CR's lifetime. The operator waits for Job success, reads the CM, and copies fields to `.status.osdInfo` before proceeding to step 4 below. - - `osdFsid`: from `ceph osd dump --format json` run by the operator via its mon connection. +**Raw mode (`OSD_CV_MODE == raw`).** All required inputs are in the Job env. - Jobs spawned at this and later steps use a deterministic name including an attempt counter (e.g., `rook-ceph-osd-replace--capture-`, `-replace-`) to avoid name collisions on retry; the operator increments `` on each spawn after a prior attempt's terminal status. +- If `OSD_ENCRYPTED == true`: `cryptsetup close ceph---block-dmcrypt` closes the mapping; the raw zap path does not handle dm-crypt teardown. Non-idempotent — guarded by `cryptsetup status` (exits 4 for inactive mappings, 0 when active). +- `ceph-volume lvm zap --destroy` runs `ceph-bluestore-tool zap-device` + `wipefs` + `dd`. This is naturally idempotent on a raw device path: a second run re-zeroes the first 10 MB and exits 0. No retry gate is needed for raw mode. -4. **Delete OSD deployment.** The operator calls `k8sutil.DeleteDeployment` ([`deployment.go#L388`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/k8sutil/deployment.go#L388)) on `rook-ceph-osd-5` and polls until the pod is gone. +Implementation note for raw-encrypted: `ROOK_BLOCK_PATH` may point at the dm-crypt mapper path (e.g., `/dev/mapper/ceph---block-dmcrypt`), not the underlying disk. The destroy logic must resolve the underlying device before close + zap (mirroring how the PVC branch of `DestroyOSD` resolves the real device at [remove.go#L272-L277](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/remove.go#L272-L277)). -5. 
**Replace Job.** Built on the existing prepare-job pattern (same one the OSD migration flow uses): a new rook subcommand on the prepare-job's pod scaffold (`provisionPodTemplateSpec`) runs the destroy + prepare bash sequence. The Job's env carries the captured `osdInfo` fields plus the new data-device path from the per-node inventory CM. +The destroy and zap logic lives in Go as an extension of the existing [`DestroyOSD`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/remove.go#L245-L290) function (today called by the OSD migration path). The Replace Job's container invokes a destroy-only entry point — a `rook ceph osd destroy --id ` subcommand or env switch in `prepareOSD` that skips `NewAgent` and `Provision`, runs the extended `DestroyOSD`, and exits. See [Why a separate Replace Job](#why-a-separate-replace-job). - The container's command: +After the Job reaches `status.succeeded=1`, its marker role is done — the destroyed slot in `ceph osd tree` is now the steering signal. The operator may garbage-collect the Replace Job once `osd.` is observed `up+in` in the final reconcile. -```bash -set -euo pipefail +#### 8. Wait for disk swap -# Placeholders below match osdInfo field names: -# , , , , , plus -# (the path the operator picked up from the per-node inventory CM during Wait). +The destroyed slot lives in Ceph's OSDMap. No Rook-side state is kept between destroy and provision. -# Skip safe-to-destroy + destroy if the OSD is already destroyed (idempotent on Job retry). -already_destroyed() { - ceph osd dump --format json | python3 -c " -import sys, json -osds = json.load(sys.stdin).get('osds', []) -o = next((o for o in osds if o.get('osd') == ), None) -sys.exit(0 if o and 'destroyed' in (o.get('state') or []) else 1) -" -} +User contract: swap the failed disk only after the Replace Job completes (`Job.status.succeeded == 1`). No timeout — the slot can sit in destroyed state for days. 
There is no programmatic gate on early-swap by default; see [open question 1](#open-questions). -if ! already_destroyed; then - # Re-check safe-to-destroy (insurance against races between phase transition and Job start). - ceph osd safe-to-destroy - # Destroy in Ceph (preserves the OSD ID for reuse). - ceph osd destroy osd. --yes-i-really-mean-it -fi +#### 9. Provision and complete -# Remove dm-crypt key (no-op on Ceph v19+; defensive for older versions). -ceph config-key exists dm-crypt/osd//luks \ - && ceph config-key rm dm-crypt/osd//luks +On the next prepare-job reconcile after the user swaps the disk, the prepare-job runs the existing discovery pass plus a new per-node pre-step: -# Wipe the existing DB LV in place; idempotent on a wiped LV. -ceph-volume lvm zap +1. **Enumerate destroyed slots for this node.** `ceph osd tree --states destroyed --format json` filtered by the prepare-job's node name (read from `ROOK_NODE_NAME` env). +2. **Discovery finds the new empty data device.** +3. **Invoke `ceph-volume` with the destroyed slot's ID.** Per layout: + - **LVM with shared metadata device** (encrypted or not): `ceph-volume lvm batch --no-auto /dev/ --db-devices /dev/ --osd-ids [--dmcrypt] --yes --no-systemd`. + - **LVM single-disk** (no separate metadata device): same command with `--db-devices` omitted. + - **Raw single-disk**: `ceph-volume raw prepare --bluestore --osd-id --data /dev/ [--dmcrypt]`. -# `zap --destroy` errors on a missing LV, so guard with `lvs` for retry-safety. -lvs >/dev/null 2>&1 \ - && ceph-volume lvm zap --destroy +Invocation notes for `lvm batch`: -# Provision the new OSD with the preserved ID, reusing the zapped DB LV. -ceph-volume lvm prepare \ - --bluestore [--dmcrypt] \ - --osd-id \ - --data \ - --block.db \ - --crush-device-class -``` +- **`--osd-ids ` is what claims the destroyed slot.** Internally `ceph-volume` calls `ceph osd new `, binding a fresh OSD UUID to the existing slot. 
The OSD ID, CRUSH bucket, and weight are preserved; only the UUID changes. +- **The metadata device is passed as a raw block-device path** (e.g., `/dev/nvme0n1`), not as a specific LV. `ceph-volume` detects the existing metadata VG already on the device and creates a new DB LV inside it — sibling DB LVs (other OSDs sharing the metadata device) are untouched. This relies on the spec's `metadataDevice` still pointing at the same physical device the sibling OSDs use; see [open question 4](#open-questions) for the spec-drift risk. +- **`--no-systemd` is required.** Without it, if `ceph-volume` hits an activate failure its rollback runs `osd purge-new`, which would destroy the slot we just claimed. +- **Use `lvm batch /dev/ ...`, not `lvm batch --data /dev/ ...`.** `lvm batch` reads data devices as bare paths; the `--data` flag exists but collides with `--data-slots` / `--data-allocate-fraction`. - The Job writes the new OSD's info to `rook-ceph-osd--status` (the per-node CM Rook already uses to drive daemon creation). +The prepare-job writes OSDInfo to the existing per-node status CM `rook-ceph-osd--status`. The cluster controller's existing path takes over: [createOSDsForStatusMap](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/cluster/osd/status.go#L324) reads the CM and creates the daemon Deployment. The controller polls `ceph osd tree` each reconcile until `osd.` is `up` AND `in`; once observed, it emits an `OSDReplaceCompleted` event on the new deployment. -6. **Create new Deployment.** The cluster controller's existing path takes over: `createOSDsForStatusMap` ([`status.go#L324`](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/cluster/osd/status.go#L324)) reads the per-node status CM the Replace Job wrote and creates the daemon Deployment. +### Coordination -7. **Wait for `up+in`.** The controller polls `ceph osd tree` each reconcile until the new daemon is `up` AND `in`. 
Once visible, capture `osd_uuid` from `ceph osd metadata ` and transition to [Complete](#5-complete). +Replacements run serially per CephCluster as a simplifying choice, matching cephadm's `osd rm` queue and Rook's existing OSD migration. The serialization signal is observable in-band: "an annotated OSD deployment exists OR a Replace Job exists in the namespace". Additional annotated OSDs are deferred to subsequent reconciles. -#### 5. Complete +Per-OSD `safe-to-destroy` returns OK only when the OSD has zero PGs in any acting/up set AND either the cluster is `active+clean` or the OSD reports zero stored PGs. So drain progress can be affected by unrelated cluster state — a drain that times out may be stuck on cluster-wide noise (another OSD lagging, backfill from a different failure) rather than this OSD's status. Concurrent destroys of independently-safe OSDs are technically safe, but serial keeps the operational model simple. -Terminal phase. `.status.newFsid` and `.status.completedAt` are recorded; the `Ready` condition transitions to `True`. +### Reconcile placement -### Cancellation +The replacement reconciler runs at the end of `osds.Start()` in the OSD subpackage (alongside the existing OSD migration code), after normal OSD provisioning. Since `osds.Start()` is itself the last step of `reconcileCephDaemons`, a `RequeueAfter` return from replacement does not skip earlier reconcile work (mon, mgr, OSD daemon reconcile). There is no synchronous polling inside the reconcile worker — each waiting step returns `RequeueAfter` and the controller-runtime queue handles re-entry. -Cancel = delete the `CephOSDReplace` CR; a finalizer runs any cleanup needed. +### Cancellation and retry -**Pending, Validating, Waiting.** Clean cancel — no Ceph or host state has been changed. The finalizer is a no-op aside from clearing the auto-provisioning gate label on the affected node. +Cancellation is done by removing the `rook.io/osd-replace` annotation. 
-**Replacing — best-effort, deferred.** Once `Replacing` begins, the operator commits to running through to a terminal phase. Cancel intent is recorded but not acted on mid-flow: +- **During validate or drain (before destroy).** Clean cancel: no Ceph or host state has been changed that the controller cannot leave as-is. The OSD has been marked `out`; the user re-`in`s it if they want to put the OSD back into service. The operator clears the `started-at` annotation. The Replace Job has not been created yet. +- **During Replace Job execution.** Not honored. `ceph-volume lvm zap` and adjacent destroy steps cannot be safely interrupted mid-call (partial dm-crypt + half-zapped LV). The Replace Job runs to a terminal state. On Job failure, the operator may retry the Job (deterministic name; prior completed/failed Job deleted first) — destroy and zap sub-steps are idempotent. +- **After destroy succeeds.** Not honored. The slot is already destroyed in the OSDMap; the only recovery is to either let the flow complete (provision a new OSD with the same ID) or run `ceph osd purge ` manually outside Rook to retire the slot. -- If the Replace Job is in flight, the operator lets it complete. `ceph-volume lvm prepare` cannot be safely interrupted mid-call (partial dm-crypt + half-LUKS LV). -- On Job failure, the operator retries the Replace Job with an incremented attempt counter (`…-replace-`). The destroy and zap sub-steps are idempotent; recovery from a crash mid-`lvm prepare` (partial VG on the new disk) is an implementation detail. The finalizer removes the CR only after the Job reaches a terminal phase. -- On Job success, cancel is not honored — the new OSD joins the cluster. +Retry after a terminal failure (validation rejection, drain timeout, Job failure that the operator does not auto-retry): the user removes the `rook.io/osd-replace` annotation and re-adds it. 
The annotation-flip predicate fires a fresh reconcile; the operator clears the stale `started-at` annotation on a fresh start. ## Notes on Scope @@ -340,45 +254,69 @@ nodes: config: { metadataDevice: "nvme1n1" } # different metadata device on the same node ``` -This setup requires exact `name:` (or `fullpath:`) references — the per-device `config:` block can only be attached to a specific device entry, not to a regex match. Replacement of a single OSD on this setup works structurally (each OSD's `dbLV` path encodes the metadata VG, which sits on exactly one physical device — so the replacement targets the correct metadata device by construction), with two caveats: +This setup requires exact `name` (or `fullpath`) references — the per-device `config` block can only be attached to a specific device entry, not to a regex match. Replacement of a single OSD on this setup works structurally: the destroyed slot keeps the OSD ID, `lvm batch --osd-ids` reuses it, and the new DB LV lands in the existing metadata VG, so the per-device `config` block targets the correct metadata device — subject to the same spec-drift caveats discussed in [open question 4](#open-questions). Two caveats: + +- **Device-name validation must permit exact entries** — see [open question 5](#open-questions). +- **Same-slot replacement is required** — `by-path` resolves only when the new disk is in the original slot. Different-slot replacement stalls at provisioning. -- **Device-name validation must permit exact entries** — see [open question 6](#open-questions). -- **Same-slot replacement is required** — `by-path` resolves only when the new disk is in the original slot. Different-slot replacement stalls in the Wait step. 
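The device-name caveat above corresponds to the swap-tolerance rule proposed for validation. A minimal Go sketch of that rule for an exact `name` entry follows; the function name `checkSwapTolerant` is hypothetical, the sample paths are illustrative, and `useAllDevices` and `deviceFilter` are assumed to be accepted elsewhere because they are not device names.

```go
package main

import (
	"fmt"
	"strings"
)

const byPathPrefix = "/dev/disk/by-path/"

// checkSwapTolerant applies the proposed rule to one devices[].name value:
// by-path is accepted (same-slot replacement only); by-id, by-uuid, and
// anything else (treated as a kernel name) are rejected.
func checkSwapTolerant(name string) error {
	switch {
	case strings.HasPrefix(name, byPathPrefix) && len(name) > len(byPathPrefix):
		return nil
	case strings.HasPrefix(name, "/dev/disk/by-id/"),
		strings.HasPrefix(name, "/dev/disk/by-uuid/"):
		return fmt.Errorf("%q cannot resolve to a replacement disk", name)
	default:
		return fmt.Errorf("%q is treated as a kernel name, which is not swap-stable", name)
	}
}

func main() {
	// Sample names are illustrative only.
	for _, name := range []string{
		"/dev/disk/by-path/pci-0000:00:1f.2-ata-1",
		"/dev/disk/by-id/wwn-0x5000c500a1b2c3d4",
		"/dev/sda",
		"vdb",
	} {
		fmt.Printf("%s -> %v\n", name, checkSwapTolerant(name))
	}
}
```

Whether this rule should be configurable, looser, or stricter is still an open question, so the sketch only encodes the currently proposed default.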
+### Non-shared-metadata OSDs — same flow, simpler teardown -### Non-shared-metadata OSDs — same CRD, simpler bash +The flow naturally covers OSDs without a separate metadata device: -The flow naturally covers OSDs without a separate metadata device — single-disk LVM-mode OSDs and `ceph-volume raw` OSDs. `osdInfo.dbLV` is empty for these cases, the Replace Job skips the DB-LV zap step and the `--block.db` arg, and the prepare call simplifies to `ceph-volume {lvm,raw} prepare --osd-id --data [--dmcrypt] --crush-device-class `. The CV mode (lvm vs raw) is already tracked in Rook's existing `OSDInfo`. Implementation may branch the Replace Job bash on `osdInfo.dbLV == ""` and on CV mode; the rest of the flow (CRD, state machine, Capture Job, auto-provisioning skip, Replace Job scaffold) is shared. +- **LVM single-disk:** the Replace Job's `lvm zap --osd-id` removes the single OSD LV. Provisioning uses `lvm batch` without `--db-devices`. +- **`ceph-volume raw` OSDs:** the Replace Job's raw branch closes the dm-crypt mapping (if encrypted) and runs `lvm zap` on the raw device path. Provisioning uses `raw prepare --osd-id `. + +The CV mode (lvm vs raw) is already carried on the OSD deployment via `ROOK_CV_MODE`; the Replace Job branches on `OSD_CV_MODE` in step 5, and the prepare-job picks the matching provisioning command. ### PVC-based OSD replacement — separate design -PVC-backed OSDs use a different code path (raw mode via `GetCephVolumeRawOSDs`, separate destroy plumbing). Issue #13240 is host-based storage; PVC replacement is a separate design. +PVC-backed OSDs use a different code path (raw mode via `GetCephVolumeRawOSDs`, separate destroy plumbing). Issue #13240 is host-based storage; PVC replacement is a separate design. The validate step rejects PVC-backed deployments by the presence of the `ceph.rook.io/pvc` label. -## Open questions +## Implementation scope + +Rook has no automated flow for replacing a failed OSD today. 
The closest existing primitive is the OSD migration flow (`spec.storage.migration`), which recreates OSDs in place after encryption or store-type spec changes via a `ROOK_REPLACE_OSD=` env switch on the prepare-job that triggers `DestroyOSD` before provisioning. Migration is decoupled from disk replacement (no swap step) and `DestroyOSD` does not cover the lvm-mode shared-metadata case. Implementing the flow above requires two kinds of work: filling in capabilities missing from existing Rook code, and adding new logic. -1. **Controller placement.** Design leans toward a separate `CephOSDReplace` CRD; `spec.storage.replaceOSD` on CephCluster (mirroring `spec.storage.migration`) is a fallback — see [Open question: controller placement](#open-question-controller-placement). Maintainers' call. +**Missing in existing Rook code:** -2. **Parallelism.** The proposed OSD replacement process is serial. Are there use-cases for parallel replacement we should support — multiple OSDs safe-to-destroy on the same node, all safe-to-destroy in the cluster at once, configurable concurrency? +1. **Annotation flips on an OSD Deployment do not enqueue a reconcile.** `WatchPredicateForNonCRDObject` filters out all Deployment update events at the `appsv1.Deployment` switch case ([predicate.go#L252-L256](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/operator/ceph/controller/predicate.go#L252-L256)), so a `rook.io/osd-replace` annotation set by the user is invisible to the CephCluster controller. The replacement flow needs a carve-out: annotation transitions (absent to present, value change, present to absent) must trigger a reconcile, while other Deployment updates remain suppressed. -3. **Auto-replace mode.** The proposed flow is always triggered explicitly by the user (CR creation). Should there be a follow-up option for automated replacement that triggers the same flow when a failed OSD and a fresh disk are detected on a node? +2. 
**The prepare-job is not aware of destroyed OSD slots in the OSDMap.** Destroyed slots are exposed by `ceph osd tree --states destroyed`; the prepare-job needs to enumerate them by node and pass the IDs to `lvm batch` via a new `--osd-ids` argument ([volume.go#L597-L720](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/volume.go#L597-L720)). +3. **`DestroyOSD` ([remove.go#L245-L290](https://github.com/rook/rook/blob/59ce48ae88e5ea59df44249b41a887af96a2806c/pkg/daemon/ceph/osd/remove.go#L245-L290)) doesn't handle shared-metadata layouts.** It needs extending to cover the shared-metadata DB LV; after the extension, both migration and replacement reuse it. + +**New code:** + +All operator-side steps (validation, drain, Replace Job lifecycle) are added as a new reconcile step at the end of `reconcileCephDaemons`. See [Reconcile placement](#reconcile-placement). + +The Replace Job fully reuses the prepare-job's binary and pod scaffold. The new code adds a **destroy-only entry point**: a `rook ceph osd destroy --id ` subcommand, or an env switch, that runs only the destroy logic and exits. The Job instance is per-OSD (`rook-ceph-osd-replace-`). + +Provisioning of the new disk hands back through the existing per-node status CM (`rook-ceph-osd--status`) consumed by `createOSDsForStatusMap`, so no new return path is needed. See [Why a separate Replace Job](#why-a-separate-replace-job) below for why the Replace Job is its own Job rather than folded into the prepare-job's `ROOK_REPLACE_OSD` path. + +### Why a separate Replace Job + +The existing OSD migration flow and this replacement flow both destroy an OSD and reuse its slot. They share the destroy primitive (`DestroyOSD`). Could they share the same Job?
+ +The two flows differ structurally in one place: when provision runs relative to destroy. + +- **Migration** is in-place: destroy and re-provision happen back-to-back in one prepare-job pod on the same disk. The disk is always present. +- **Replacement** is deferred: destroy runs now, but re-provision waits minutes to days for the user to physically swap the failed disk for a new one. Cephadm's [`orch osd rm --replace`](https://docs.ceph.com/en/latest/cephadm/services/osd/#replacing-an-osd) uses the same deferred shape. + +If replacement reused migration's "destroy then provision" pod sequence, `Provision` would run immediately after `DestroyOSD` and claim the destroyed slot back onto the freshly-zapped failed disk — defeating the replacement (post-zap, the failed disk presents as an empty kernel device, indistinguishable from a fresh one). + +The hole could be closed by extending migration with a destroy-only branch plus conditional gating in `Provision`. That works, but it entangles two distinct lifecycles in one code path with branches at each divergence — at the cost of migration's current focused scope. + +Keeping replacement as its own short-lived Job (`rook-ceph-osd-replace-`) invoking a destroy-only entry point keeps migration's code path unmodified and gives per-OSD observability (`kubectl get jobs` shows which OSD is mid-replacement). The genuinely shared work — the `DestroyOSD` extension — lives below the Job orchestration layer and is reused by both flows. The `--osd-ids` wiring and destroyed-slot enumeration are added for replacement and not used by migration. + +## Open questions -5. **Disk-swap responsiveness.** Can the design rely on Rook's existing discovery (rook-discover when enabled, otherwise the per-reconcile prepare-job inventory) to detect the replacement disk? +1. **Enforce the swap-after-Job contract programmatically or document only?** The flow defines a user contract: swap only after the Replace Job completes. 
Violation case: the prepare-job claims the new disk as a fresh OSD with a new ID, orphaning the in-flight replacement; recovery is manual `ceph osd purge`. Cephadm matches the contract approach (no programmatic gate). Alternative: suppress the prepare-job on the affected node while the `rook.io/osd-replace` annotation is set or the Replace Job exists, modeled on the existing `skipPreparePod` check in `startProvisioningOverPVCs` (no equivalent exists in `startProvisioningOverNodes` today). Cost: a few lines in `startProvisioningOverNodes`. Benefit: violation becomes impossible to trigger. -6. **Device-name validation.** Proposed: reject kernel names (`/dev/sda`, `vda`), `by-id`, and `by-uuid` references; accept `useAllDevices`, `deviceFilter`, and `by-path` (with implicit same-slot expectation). Sample: +2. **Auto-replace mode.** The proposed flow is always triggered explicitly by the user (annotation). Should there be a follow-up option for the operator to auto-annotate when a failed OSD and a fresh disk are detected on a node? - ```yaml - spec: - storage: - nodes: - - name: node-1 - devices: - - name: /dev/sda # kernel name — rejected (not swap-stable) - - name: /dev/disk/by-path/... # by-path — accepted (same-slot only) - ``` +3. **Default values.** Proposed: `safeToDestroyTimeout: 1h`; drain re-check interval `30s`; disk-swap wait has no timeout (the destroyed slot persists in the OSDMap indefinitely). Reasonable, or change them? - Should this be configurable, more permissive (user takes responsibility for any name), or stricter (reject `by-path` too)? +4. **Metadata device on a shared-metadata node: spec vs sibling.** The current proposal re-reads `spec.storage[].config.metadataDevice` at provision time, which can drift from what the destroyed OSD actually used (user spec edits, or kernel-name renumbering on host reboot — kernel names are not persistent). Drift silently splits the shared-metadata layout. 
Alternative: derive the metadata device from a surviving sibling's DB LV (e.g., `ceph-volume lvm list --format json` filtered on `type=db`). Downside: when no sibling survives on the node, this must fall back to spec anyway. -7. **Cross-host replacement for non-shared-metadata OSDs.** Same-host is required by this design because the captured `dbLV` lives on the original host's metadata VG. For OSDs without a metadata device this argument doesn't apply. Ceph itself permits cross-host replacement: `ceph osd destroy` retains no host info; CRUSH auto-relocates the OSD on daemon start at the cost of full PG remapping. Should this flow be supported by Rook osd replacement? +5. **Device-name validation.** Proposed: accept `useAllDevices`, `deviceFilter`, and `by-path` (same-slot only); reject kernel names, `by-id`, and `by-uuid` (see [persistent block device naming](https://wiki.archlinux.org/title/Persistent_block_device_naming) for background on which references survive a swap). Should this be configurable, more permissive (user takes responsibility for any name), or stricter (reject `by-path` too)? -8. **Combine Capture and Replace into one Job.** The Capture step currently runs as a separate Job before the Replace Job. It could be merged into the Replace Job's first commands, with intermediate state persisted to a per-replacement ConfigMap (mirroring the prepare-job's CM hand-off pattern) so the Job remains retry-idempotent: on retry the Job reads the CM if present and skips re-capture, then proceeds with destroy + prepare. Saves one Job spawn per replacement; adds a CM-check branch at the top of the Job's bash and a CM-write helper invocation. +6. **Where to report the validation result.** The current proposal emits a Warning event on the CephCluster CR (matching Rook convention). Should we also emit on the OSD Deployment for discoverability (it's the object the user just annotated), or emit only on the Deployment?