diff --git a/doc/source/configuration/monitoring.rst b/doc/source/configuration/monitoring.rst
index d875a2b5ed..889846b706 100644
--- a/doc/source/configuration/monitoring.rst
+++ b/doc/source/configuration/monitoring.rst
@@ -69,6 +69,12 @@ present, the workaround is to go into each node running Grafana and manually
 restart the process with ``systemctl restart kolla-grafana-container.service``
 and then try the reconfigure command again.)
 
+.. note::
+    If the environment defines additional Prometheus Node Exporter startup parameters
+    via ``prometheus_node_exporter_cmdline_extras``, the parameters should be updated
+    to include the textfile collector used by SMART monitoring:
+    ``--collector.textfile.directory=/var/lib/node_exporter/textfile_collector``
+
 Once the reconfigure has completed you can now run the custom playbook which
 copies over the scripts and sets up the cron jobs to start SMART monitoring
 on the overcloud hosts:
@@ -81,6 +87,27 @@ on the overcloud hosts:
 SMART reporting should now be enabled along with a Prometheus alert for
 unhealthy disks and a Grafana dashboard called ``Hardware Overview``.
 
+Monitoring Drive Writes Per Day
+-------------------------------
+
+Drives can be monitored for the level of write intensity of the
+workload, and alerts defined for drives that are persistently
+exceeding their stated level of write endurance. To enable this
+feature, set the flag ``create_dwpd_ratings``:
+
+.. code-block:: console
+
+    (kayobe) [stack@node ~]$ cd etc/kayobe
+    (kayobe) [stack@node kayobe]$ kayobe playbook run ansible/deployment/smartmon-tools.yml -e create_dwpd_ratings=true
+
+This flag scans for NVME/SSD devices in the system and creates a new
+file, ``dwpd-ratings.yml``, in the directory of the current environment.
+
+.. note::
+    The playbook assigns placeholder values for write endurance for each
+    drive model. These values should be updated with specifications from
+    vendor datasheets.
+
 Alertmanager, Slack and Microsoft Teams
 =======================================