@@ -89,11 +89,60 @@ And then remove the host from inventory (usually in
8989Additional options/commands may be found in
9090`Host management <https://docs.ceph.com/en/latest/cephadm/host-management/>`_
9191
92- Replacing a Failed Ceph Drive
93- -----------------------------
92+ Replacing a failing drive
93+ -------------------------
9494
95- Once an OSD has been identified as having a hardware failure,
96- the affected drive will need to be replaced.
95+ A failing drive in a Ceph cluster will cause the OSD daemon to crash.
96+ In this case Ceph will go into the ``HEALTH_WARN`` state.
97+ To see details about failed OSDs, run:
98+
99+ .. code-block:: console
+
100+    # From storage host
101+    sudo cephadm shell
102+    ceph health detail
103+
104+ .. note::
105+
106+    Remember to run ceph/rbd commands from within ``cephadm shell``
107+    (preferred method) or after installing the Ceph client. Details are in the
108+    official `documentation <https://docs.ceph.com/en/latest/cephadm/install/#enable-ceph-cli>`__.
109+    The host where commands are executed must also have the admin
110+    Ceph keyring present. The easiest way to achieve this is to apply the
111+    `_admin <https://docs.ceph.com/en/latest/cephadm/host-management/#special-host-labels>`__
112+    label (Ceph MON servers have it by default when using the
113+    `StackHPC Cephadm collection <https://github.com/stackhpc/ansible-collection-cephadm>`__).
114+
115+ A failed OSD will also be reported as down by running:
116+
117+ .. code-block:: console
118+
119+    ceph osd tree
120+
121+ Note the ID of the failed OSD.
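+
+ As a minimal sketch, assuming a Ceph release that supports state filters on
+ ``ceph osd tree``, the output can be narrowed to the OSDs that are currently
+ marked down, which makes the failed ID easier to spot on large clusters:
+
+ .. code-block:: console
+
+    # From within cephadm shell; "down" is a state filter (assumed to be
+    # supported by the Ceph release in use)
+    ceph osd tree down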
122+
123+ The failed disk is usually logged by the Linux kernel too:
124+
125+ .. code-block:: console
126+
127+    # From storage host
128+    dmesg -T
129+
130+ Cross-reference the hardware device and OSD ID to ensure they match.
131+ (Using ``pvs`` and ``lvs`` may help make this connection.)
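+
+ As a sketch of that cross-check, assuming the failed OSD has ID ``12`` (a
+ placeholder; substitute the ID noted earlier), the following commands show
+ which devices Ceph and ``ceph-volume`` associate with the OSD. The device
+ names they report can then be compared with the kernel log:
+
+ .. code-block:: console
+
+    # From within cephadm shell: host and device names recorded for the OSD
+    # (12 is a placeholder OSD ID)
+    ceph osd metadata 12
+
+    # From storage host: logical volumes and backing devices for each OSD
+    sudo cephadm ceph-volume lvm list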
132+
133+ See upstream documentation:
134+ https://docs.ceph.com/en/latest/cephadm/services/osd/#replacing-an-osd
135+
136+ If the disk holding the DB and/or WAL fails, it is necessary to recreate
137+ all OSDs that are associated with this disk (usually an NVMe drive). The
138+ following command is sufficient to identify which OSDs are tied to
139+ which physical disks:
140+
141+ .. code-block:: console
142+
143+    ceph device ls
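+
+ The listing can also be narrowed to one host or one daemon. As a sketch,
+ assuming a storage host named ``storage-01`` and an OSD with ID ``12`` (both
+ are placeholders, not names from this deployment):
+
+ .. code-block:: console
+
+    # Devices on a single host, with the daemons that use them
+    # (storage-01 is a placeholder hostname)
+    ceph device ls-by-host storage-01
+
+    # Devices used by a single OSD (12 is a placeholder OSD ID)
+    ceph device ls-by-daemon osd.12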
144+
145+ Once the OSDs on the failed disks are identified, follow the procedure below.
97146
98147If rebooting a Ceph node, first set ``noout`` to prevent excess data
99148movement:
@@ -130,25 +179,6 @@ spec before (``cephadm_osd_spec`` variable in ``etc/kayobe/cephadm.yml``).
130179Either set ``unmanaged: true`` to stop cephadm from picking up new disks or
131180modify it in some way that it no longer matches the drives you want to remove.
132181
133-
134- Operations
135- ==========
136-
137- Replacing drive
138- ---------------
139-
140- See upstream documentation:
141- https://docs.ceph.com/en/latest/cephadm/services/osd/#replacing-an-osd
142-
143- In case where disk holding DB and/or WAL fails, it is necessary to recreate
144- (using replacement procedure above) all OSDs that are associated with this
145- disk - usually NVMe drive. The following single command is sufficient to
146- identify which OSDs are tied to which physical disks:
147-
148- .. code-block:: console
149-
150-    ceph device ls
151-
152182Host maintenance
153183----------------
154184
@@ -163,46 +193,6 @@ https://docs.ceph.com/en/latest/cephadm/upgrade/
163193Troubleshooting
164194===============
165195
166- Investigating a Failed Ceph Drive
167- ---------------------------------
168-
169- A failing drive in a Ceph cluster will cause OSD daemon to crash.
170- In this case Ceph will go into `HEALTH_WARN` state.
171- Ceph can report details about failed OSDs by running:
172-
173- .. code-block:: console
174-
175-    ceph health detail
176-
177- .. note::
178-
179-    Remember to run ceph/rbd commands from within ``cephadm shell``
180-    (preferred method) or after installing Ceph client. Details in the
181-    official `documentation <https://docs.ceph.com/en/latest/cephadm/install/#enable-ceph-cli>`__.
182-    It is also required that the host where commands are executed has admin
183-    Ceph keyring present - easiest to achieve by applying
184-    `_admin <https://docs.ceph.com/en/latest/cephadm/host-management/#special-host-labels>`__
185-    label (Ceph MON servers have it by default when using
186-    `StackHPC Cephadm collection <https://github.com/stackhpc/ansible-collection-cephadm>`__).
187-
188- A failed OSD will also be reported as down by running:
189-
190- .. code-block:: console
191-
192-    ceph osd tree
193-
194- Note the ID of the failed OSD.
195-
196- The failed disk is usually logged by the Linux kernel too:
197-
198- .. code-block:: console
199-
200-    # From storage host
201-    dmesg -T
202-
203- Cross-reference the hardware device and OSD ID to ensure they match.
204- (Using `pvs` and `lvs` may help make this connection).
205-
206196Inspecting a Ceph Block Device for a VM
207197---------------------------------------
208198