OpenStack control plane logs are aggregated from all servers by Monasca and stored in Elasticsearch. The control plane logs can be accessed from Elasticsearch using Kibana, which is available at the following URL: |kibana_url|
To log in, use the kibana user. The password is auto-generated by
Kolla-Ansible and can be extracted from the encrypted passwords file
(|kolla_passwords|):
kayobe# ansible-vault view ${KAYOBE_CONFIG_PATH}/kolla/passwords.yml --vault-password-file |vault_password_file_path| | grep ^kibana
Monasca metrics can be visualised in Grafana dashboards. Monasca Grafana can be found at the following address: |grafana_url|
Grafana uses Keystone authentication. To log in, use valid OpenStack user credentials.
To visualise control plane metrics, you will need one of the following roles in
the monasca_control_plane project:
- admin
- monasca-user
- monasca-read-only-user
- monasca-editor
To see where all virtual machines are running on the hypervisors:
admin# openstack server list --all-projects --long
To move a virtual machine with shared storage or booted from volume from one hypervisor to another, for example to |hypervisor_hostname|:
admin# openstack --os-compute-api-version 2.30 server migrate --live-migration --host |hypervisor_hostname| 6a35592c-5a7e-4da3-9ab9-6765345641cb
To move a virtual machine with local disks:
admin# openstack --os-compute-api-version 2.30 server migrate --live-migration --block-migration --host |hypervisor_hostname| 6a35592c-5a7e-4da3-9ab9-6765345641cb
Ansible is oriented towards adding or reconfiguring services, but removing a service is handled less well, because of Ansible's imperative style.
To remove a service, it is disabled in Kayobe's Kolla config, which prevents
other services from communicating with it. For example, to disable
cinder-backup, edit ${KAYOBE_CONFIG_PATH}/kolla.yml:
-enable_cinder_backup: true
+enable_cinder_backup: false
Then, reconfigure Cinder services with Kayobe:
kayobe# kayobe overcloud service reconfigure --kolla-tags cinder
However, the service itself, no longer in Ansible's manifest of managed state, must be manually stopped and prevented from restarting.
On each controller:
kayobe# docker rm -f cinder_backup
Some services may store data in a dedicated Docker volume, which can be removed
with docker volume rm.
To configure TLS for the first time, we write a PEM file to the secrets.yml
file as secrets_kolla_external_tls_cert. Use a command of this form:
kayobe# ansible-vault edit ${KAYOBE_CONFIG_PATH}/secrets.yml --vault-password-file=|vault_password_file_path|
Concatenate the contents of the certificate and key files to create
secrets_kolla_external_tls_cert.
In ${KAYOBE_CONFIG_PATH}/kolla.yml, set the following:
kolla_enable_tls_external: True
kolla_external_tls_cert: "{{ secrets_kolla_external_tls_cert }}"
To configure TLS, we need to reconfigure all services, as endpoint URLs need to be updated in Keystone:
kayobe# kayobe overcloud service reconfigure
To update an existing certificate, for example when it has reached expiration,
change the value of secrets_kolla_external_tls_cert and run the following
command:
kayobe# kayobe overcloud service reconfigure --kolla-tags haproxy
To take a hypervisor out of Nova scheduling, for example |hypervisor_hostname|:
admin# openstack compute service set --disable \
    |hypervisor_hostname| nova-compute
Running instances on the hypervisor will not be affected, but new instances will not be deployed on it.
A reason for disabling a hypervisor can be documented with the
--disable-reason flag:
admin# openstack compute service set --disable \
    --disable-reason "Broken drive" |hypervisor_hostname| nova-compute
Details about all hypervisors and the reasons they are disabled can be displayed with:
admin# openstack compute service list --long
And then to enable a hypervisor again:
admin# openstack compute service set --enable \
    |hypervisor_hostname| nova-compute
If the Docker registry becomes full, this can prevent container updates and (depending on the storage configuration of the seed host) could lead to other problems with services provided by the seed host.
To remove container images from the Docker Registry, follow this process:
- Reconfigure the registry container to allow deleting containers. This can be
  done in docker-registry.yml with Kayobe:
  docker_registry_env:
    REGISTRY_STORAGE_DELETE_ENABLED: "true"
- For the change to take effect, run:
  kayobe seed host configure
- A helper script is useful, such as https://github.com/byrnedo/docker-reg-tool
  (this requires jq). To delete all images with a specific tag, use:
  for repo in `./docker_reg_tool http://registry-ip:4000 list`; do
      ./docker_reg_tool http://registry-ip:4000 delete $repo $tag
  done
- Deleting the tag does not actually release the space. To actually free up space, run garbage collection:
  seed# docker exec docker_registry bin/registry garbage-collect /etc/docker/registry/config.yml
The seed host can also accrue a lot of data from building container images.
The images stored locally in the seed host can be seen using docker image ls.
Old and redundant images can be identified from their names and tags, and
removed using docker image rm.
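As a rough aid for the clean-up described above, the following sketch filters an image listing by tag. The function name select_images_by_tag is hypothetical, and the input format (one "repository tag" pair per line) is an assumption chosen to match the output of docker image ls --format '{{.Repository}} {{.Tag}}'. The function only prints candidates and removes nothing.

```shell
# Hypothetical helper: print "repository:tag" for every image whose tag
# equals the first argument.
# Input (stdin): one "repository tag" pair per line.
select_images_by_tag() {
    awk -v tag="$1" '$2 == tag { print $1 ":" $2 }'
}

# Possible usage on the seed host (review the list before removing anything):
#   docker image ls --format '{{.Repository}} {{.Tag}}' | select_images_by_tag <old-tag>
```

After reviewing the printed list, each entry can be passed to docker image rm.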
As the backup procedure is constantly changing, it is normally best to check the upstream documentation for an up-to-date procedure. Here is a high-level overview of the key things you need to back up:
The compute nodes can largely be thought of as ephemeral, but you do need to make sure you have migrated any instances and disabled the hypervisor before decommissioning or making any disruptive configuration change.
- Back up service VMs such as the seed VM
Monasca has been configured to collect logs and metrics across the control plane. It provides a single point where control plane monitoring and telemetry data can be analysed and correlated.
Metrics are collected per server via the Monasca Agent. The Monasca Agent is deployed and configured by Kolla Ansible.
Logging to Monasca is done via a Fluentd output plugin.
If you wish to generate alerts for specific log messages, you must first generate metrics from those log messages. Metrics are generated from the transformed logs queue in Kafka. The Monasca log metrics service reads log messages from this queue, transforms them into metrics and then writes them to the metrics queue.
The rules which govern this transformation are defined in the logstash config
file. This file can be configured via kayobe. To do this, edit
etc/kayobe/kolla/config/monasca/log-metrics.conf, for example:
# Create events from specific log signatures
filter {
    if "Another thread already created a resource provider" in [log][message] {
        mutate {
            add_field => { "[log][dimensions][event]" => "hat" }
        }
    } else if "My string here" in [log][message] {
        mutate {
            add_field => { "[log][dimensions][event]" => "my_new_alert" }
        }
    }
}
Reconfigure Monasca:
kayobe# kayobe overcloud service reconfigure --kolla-tags monasca
Verify that logstash doesn't complain about your modification. On each node
running the monasca-log-metrics service, the logs can be inspected in the
Kolla logs directory, under the logstash folder:
/var/log/kolla/logstash.
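Before reconfiguring, a quick local check can catch one common editing mistake in this file: unbalanced braces. The sketch below is only a rough pre-check, not a Logstash syntax validator. The function name braces_balanced is hypothetical, and it counts every brace character, including any that appear inside quoted strings or comments.

```shell
# Hypothetical pre-check: succeed if the file given as $1 contains the same
# number of "{" and "}" characters. Braces inside strings or comments are
# counted too, so treat the result as a hint only.
braces_balanced() {
    open=$(( $(tr -cd '{' < "$1" | wc -c) ))
    close=$(( $(tr -cd '}' < "$1" | wc -c) ))
    [ "$open" -eq "$close" ]
}

# Possible usage:
#   braces_balanced etc/kayobe/kolla/config/monasca/log-metrics.conf || echo "unbalanced braces"
```

The Logstash logs remain the authoritative place to confirm that the configuration was accepted.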
Metrics will now be generated from the configured log messages. To generate alerts/notifications from your new metric, follow the next section.
Firstly, we will configure alarms and notifications. This should be done via the Monasca client. More detailed documentation is available in the Monasca API specification. This document provides an overview of common use-cases.
To create a Slack notification, first obtain the URL for the notification hook from Slack, and configure the notification as follows:
monasca# monasca notification-create stackhpc_slack SLACK https://hooks.slack.com/services/UUID
You can view notifications at any time by invoking:
monasca# monasca notification-list
To create an alarm with an associated notification:
monasca# monasca alarm-definition-create multiple_nova_compute \
'(count(log.event.multiple_nova_compute{}, deterministic)>0)' \
--description "Multiple nova compute instances detected" \
--severity HIGH --alarm-actions $NOTIFICATION_ID
By default one alarm will be created for all hosts. This is typically useful
when you are looking at the overall state of some hosts. For example in the
screenshot below the db_mon_log_high_mem_usage alarm has previously
triggered on a number of hosts, but is currently below threshold.
If you wish to have an alarm created per host you can use the --match-by
option and specify the hostname dimension. For example:
monasca# monasca alarm-definition-create multiple_nova_compute \
'(count(log.event.multiple_nova_compute{}, deterministic)>0)' \
--description "Multiple nova compute instances detected" \
--severity HIGH --alarm-actions $NOTIFICATION_ID \
--match-by hostname
Creating an alarm per host can be useful when alerting on one-off events such as log messages which need to be actioned individually. Once the issue has been investigated and fixed, the alarm can be deleted on a per host basis.
For example, in the case of monitoring for file system corruption one might define a metric from the system logs alerting on XFS file system corruption, or ECC memory errors. These metrics may only be generated once, but it is important that they are not ignored. Therefore, in the example below, the last operator is used so that the alarm is evaluated against the last metric associated with the log message. Since for log metrics the value of this metric is always greater than 0, this alarm can only be reset by deleting it (which can be accomplished by clicking on the dustbin icon in Monasca Grafana). By ensuring that the alarm has to be manually deleted and will not reset to the OK status, important errors can be tracked.
monasca# monasca alarm-definition-create xfs_errors \
'(last(log.event.xfs_errors_detected{}, deterministic)>0)' \
--description "XFS errors detected on host" \
--severity HIGH --alarm-actions $NOTIFICATION_ID \
--match-by hostname
It is also possible to update existing alarms. For example, to update, or add multiple notifications to an alarm:
monasca# monasca alarm-definition-patch $ALARM_ID --alarm-actions $NOTIFICATION_ID --alarm-actions $NOTIFICATION_ID_2
- Verify integrity of clustered components (RabbitMQ, Galera, Keepalived). They should all report a healthy status.
- Put node into maintenance mode in bifrost to prevent it from automatically powering back on
- Shut down nodes one at a time gracefully using systemctl poweroff
If you are restarting the controllers, it is best to do this one controller at a time to avoid the clustered components losing quorum.
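The reasoning behind restarting one controller at a time can be sketched as simple arithmetic. The example assumes a plain majority rule, where more than half of the cluster members must stay up. The function names quorum_size and can_take_down are hypothetical and do not query any real cluster.

```shell
# Smallest number of members that still forms a majority in a cluster of $1.
# Assumption: simple majority rule (more than half of the members).
quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

# Succeed if removing $2 members from a cluster of $1 still leaves a majority.
can_take_down() {
    [ $(( $1 - $2 )) -ge "$(quorum_size "$1")" ]
}

# Example: with three controllers, one can be restarted safely, two cannot.
#   can_take_down 3 1 && echo "safe"
#   can_take_down 3 2 || echo "would lose quorum"
```

Under this assumption a three-controller deployment tolerates exactly one controller being down at any time.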
On each controller perform the following:
[stack@|controller0_hostname| ~]$ docker exec -i mariadb mysql -u root -p -e "SHOW STATUS LIKE 'wsrep_local_state_comment'"
Variable_name Value
wsrep_local_state_comment Synced
The password can be found using:
kayobe# ansible-vault view ${KAYOBE_CONFIG_PATH}/kolla/passwords.yml \
--vault-password-file |vault_password_file_path| | grep ^database
RabbitMQ health is determined using the command rabbitmqctl cluster_status:
[stack@|controller0_hostname| ~]$ docker exec rabbitmq rabbitmqctl cluster_status
Cluster status of node rabbit@|controller0_hostname| ...
[{nodes,[{disc,['rabbit@|controller0_hostname|','rabbit@|controller1_hostname|',
'rabbit@|controller2_hostname|']}]},
{running_nodes,['rabbit@|controller1_hostname|','rabbit@|controller2_hostname|',
'rabbit@|controller0_hostname|']},
{cluster_name,<<"rabbit@|controller2_hostname|">>},
{partitions,[]},
{alarms,[{'rabbit@|controller1_hostname|',[]},
{'rabbit@|controller2_hostname|',[]},
{'rabbit@|controller0_hostname|',[]}]}]
On (for example) three controllers:
[stack@|controller0_hostname| ~]$ docker logs keepalived
Two instances should show:
VRRP_Instance(kolla_internal_vip_51) Entering BACKUP STATE
and the other:
VRRP_Instance(kolla_internal_vip_51) Entering MASTER STATE
The Ansible control host is not enrolled in bifrost. This node may run services such as the seed virtual machine which will need to be gracefully powered down.
If you are shutting down a single hypervisor, to avoid downtime for tenants it is advisable to migrate all of the instances to another machine. See :ref:`evacuating-all-instances`.
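As a minimal dry-run sketch of migrating several instances, the function below turns a list of instance IDs into the live-migration command used earlier in this document. It prints the commands and executes nothing. The function name print_migrations is hypothetical, and the sketch assumes shared storage or boot-from-volume instances; instances with local disks would also need --block-migration.

```shell
# Hypothetical dry-run helper: read instance IDs on stdin and print one
# live-migration command per instance, targeting the host given as $1.
# Nothing is executed; review the output before running it.
print_migrations() {
    target=$1
    while read -r id; do
        [ -n "$id" ] || continue
        echo "openstack --os-compute-api-version 2.30 server migrate --live-migration --host $target $id"
    done
}

# Possible usage as an admin user (host names are placeholders):
#   openstack server list --all-projects --host <source-host> -f value -c ID \
#       | print_migrations <target-host>
```

Printing the commands first makes it easy to check the target host and the instance list before any migration starts.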
.. ifconfig:: deployment['ceph_managed']

   Ceph
   ----

   The following guide provides a good overview: https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/8/html/director_installation_and_usage/sect-rebooting-ceph

kayobe# ssh stack@|seed_name| sudo systemctl poweroff
kayobe# virsh shutdown |seed_name|
Follow separate :doc:`document <full_shutdown>`.
Example: Reboot all of the compute nodes apart from |hypervisor_hostname|:
kayobe# kayobe overcloud host command run --limit 'compute:!|hypervisor_hostname|' -b --command "shutdown -r"
- Remove the node from maintenance mode in bifrost
- Bifrost should automatically power on the node via IPMI
- Check that all docker containers are running
- Check Kibana for any messages with log level ERROR or equivalent
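The container check in the list above can be sketched as a small filter. The function name not_running is hypothetical. It assumes input in the form produced by docker ps -a --format '{{.Names}} {{.Status}}', where the status text of a running container starts with "Up".

```shell
# Hypothetical helper: print the names of containers whose status does not
# start with "Up".
# Input (stdin): one "name status..." line per container.
not_running() {
    awk '$2 != "Up" { print $1 }'
}

# Possible usage on each host:
#   docker ps -a --format '{{.Names}} {{.Status}}' | not_running
```

An empty result suggests every container on the host is up; any name printed is worth inspecting with docker logs.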
If all of the servers were shut down at the same time, it is necessary to run a script to recover the database once they have all started up. This can be done with the following command:
kayobe# kayobe overcloud database recover
The Ansible control host is not enrolled in Bifrost and will have to be powered on manually.
The seed VM (and any other service VM) should start automatically when the seed hypervisor is powered on. If it does not, it can be started with:
kayobe# virsh start seed-0
Follow separate :ref:`document <full-power-on>`.
Log into the monitoring host(s):
kayobe# ssh stack@|monitoring_host|
Stop all Docker containers:
|monitoring_host|# for i in `docker ps -q`; do docker stop $i; done
Shut down the node:
|monitoring_host|# sudo shutdown -h
The monitoring services containers will automatically start when the monitoring node is powered back on.
OS packages can be updated with:
kayobe# kayobe overcloud host package update --limit |hypervisor_hostname| --packages '*'
kayobe# kayobe seed host package update --packages '*'
See https://docs.openstack.org/kayobe/latest/administration/overcloud.html#updating-packages
- Pull latest changes from upstream stable branch to your own kolla fork (if applicable)
- Update kolla_openstack_release in etc/kayobe/kolla.yml (unless using default)
- Update tags for the images in etc/kayobe/kolla/globals.yml to use the new value of kolla_openstack_release
- Rebuild container images
- Pull container images to overcloud hosts
- Run kayobe overcloud service upgrade
For more information, see: https://docs.openstack.org/kayobe/latest/upgrading.html
To test creating an instance on a specific hypervisor, as an admin-level user you can specify the hypervisor name as part of an extended availability zone description.
To see the list of hypervisor names:
admin# openstack hypervisor list
To boot an instance on a specific hypervisor, for example on |hypervisor_hostname|:
admin# openstack server create --flavor |flavor_name| --network |network_name| --key-name <key> --image CentOS8.2 --availability-zone nova::|hypervisor_hostname| vm-name
OpenStack services can sometimes fail to remove all resources correctly. This is the case with Magnum, which fails to clean up users in its domain after clusters are deleted. A patch has been submitted to stable branches. Until this fix becomes available, if Magnum is in use, administrators can perform the following cleanup procedure regularly:
admin# for user in $(openstack user list --domain magnum -f value -c Name | grep -v magnum_trustee_domain_admin); do
    if openstack coe cluster list -c uuid -f value | grep -q $(echo $user | sed 's/_[0-9a-f]*$//'); then
        echo "$user still in use, not deleting"
    else
        openstack user delete --domain magnum $user
    fi
done
To enable and alter default rotation values for Elasticsearch Curator, edit ${KAYOBE_CONFIG_PATH}/kolla/globals.yml. This applies to both Monasca and Central Logging configurations.
# Allow Elasticsearch Curator to apply a retention policy to logs
enable_elasticsearch_curator: true
# Duration after which index is closed
elasticsearch_curator_soft_retention_period_days: 90
# Duration after which index is deleted
elasticsearch_curator_hard_retention_period_days: 180
Reconfigure Elasticsearch with the new values:
kayobe overcloud service reconfigure --kolla-tags elasticsearch --kolla-skip-tags common --skip-prechecks
For more information see upstream documentation
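When choosing retention values, a small sanity check can help. The sketch below assumes that closing an index is meant to happen before deleting it, so the soft period should be shorter than the hard period. That ordering is an assumption drawn from the comments in the configuration above, and the function name check_retention is hypothetical.

```shell
# Hypothetical sanity check for the two Curator retention periods.
# $1: soft retention in days (index closed), $2: hard retention in days
# (index deleted). Assumption: closing should precede deletion.
check_retention() {
    if [ "$1" -lt "$2" ]; then
        echo "ok"
    else
        echo "soft retention ($1) should be shorter than hard retention ($2)"
        return 1
    fi
}

# Example with the values shown above:
#   check_retention 90 180
```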