present, the workaround is to go into each node running Grafana and manually
restart the process with ``systemctl restart kolla-grafana-container.service``
and then try the reconfigure command again.)

.. note::
   If the environment defines additional Prometheus Node Exporter startup
   parameters via ``prometheus_node_exporter_cmdline_extras``, the parameters
   should be updated to include the textfile collector used by SMART monitoring:
   ``--collector.textfile.directory=/var/lib/node_exporter/textfile_collector``

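As a sketch of such an update, assuming the Kolla globals are overridden in ``etc/kayobe/kolla/globals.yml`` and using ``--collector.systemd`` purely as a stand-in for whatever extra parameter a site already passes, the combined setting might look like this:

.. code-block:: yaml

   # etc/kayobe/kolla/globals.yml (assumed location of Kolla global overrides)
   # --collector.systemd is a placeholder for a pre-existing extra parameter;
   # the textfile collector flag is appended so SMART metrics are exposed.
   prometheus_node_exporter_cmdline_extras: "--collector.systemd --collector.textfile.directory=/var/lib/node_exporter/textfile_collector"

Keeping both flags in one string preserves the existing behaviour while adding the textfile collector.
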
Once the reconfigure has completed, run the custom playbook, which
copies over the scripts and sets up the cron jobs to start SMART monitoring
on the overcloud hosts:

SMART reporting should now be enabled along with a Prometheus alert for
unhealthy disks and a Grafana dashboard called ``Hardware Overview``.

Monitoring Drive Writes Per Day
-------------------------------

Drives can be monitored for the write intensity of the workload, and
alerts can be defined for drives that persistently exceed their rated
write endurance. To enable this feature, run the SMART monitoring
playbook with the ``create_dwpd_ratings`` flag set:

.. code-block:: console

   (kayobe) [stack@node ~]$ cd etc/kayobe
   (kayobe) [stack@node kayobe]$ kayobe playbook run ansible/deployment/smartmon-tools.yml -e create_dwpd_ratings=true

With this flag set, the playbook scans the system for NVMe/SSD devices and
creates a new file, ``dwpd-ratings.yml``, in the directory of the current
environment.

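Drive writes per day (DWPD) expresses how many times a drive's full capacity is written each day. As an illustration of the arithmetic only, and not necessarily how the playbook or the alert computes it, the observed write intensity over a period of :math:`d` days can be estimated from the bytes written :math:`W` in that period and the drive capacity :math:`C` in bytes. For NVMe drives, the SMART ``Data Units Written`` counter is reported in thousands of 512-byte units, so a change :math:`\Delta U` in that counter corresponds to :math:`512000 \times \Delta U` bytes:

.. math::

   \mathrm{DWPD}_{\mathrm{observed}} = \frac{W}{C \times d},
   \qquad W = 512000 \times \Delta U

A drive whose observed value stays above its rated DWPD is the case the alerting described above is meant to catch.
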
.. note::
   The playbook assigns placeholder values for write endurance for each
   drive model. These values should be updated with specifications from
   vendor datasheets.

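Vendor datasheets often quote endurance as total terabytes written (TBW) over the warranty period rather than as DWPD. Assuming the usual conversion, with capacity :math:`C_{\mathrm{TB}}` in terabytes and a warranty period of :math:`Y` years, a rating can be derived as follows. The 3.84 TB drive with 7008 TBW and a 5-year warranty is a hypothetical example:

.. math::

   \mathrm{DWPD} = \frac{\mathrm{TBW}}{C_{\mathrm{TB}} \times 365 \times Y}
   = \frac{7008}{3.84 \times 365 \times 5} = 1.0
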
Alertmanager, Slack and Microsoft Teams
=======================================
