diff --git a/monitoring/prometheus-rules/cvi.state.yaml b/monitoring/prometheus-rules/cvi.state.yaml new file mode 100644 index 0000000000..781d54321b --- /dev/null +++ b/monitoring/prometheus-rules/cvi.state.yaml @@ -0,0 +1,126 @@ +- name: virtualization.vi.state + rules: + - alert: D8VirtualizationClusterVirtualImageStuckInPendingPhase + expr: d8_virtualization_clustervirtualimage_status_phase{phase="Pending"} == 1 + labels: + severity_level: "9" + tier: cluster + for: 60m + annotations: + plk_protocol_version: "1" + plk_markup_format: "markdown" + plk_create_group_if_not_exists__d8_virtualization_clustervirtualimage_state: "D8VirtualizationClusterVirtualImageState,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes" + plk_grouped_by__d8_virtualization_clustervirtualimage_state: "D8VirtualizationClusterVirtualImageState,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes" + summary: ClusterVirtualImage is stuck in the `Pending` phase for a long time. + description: | + The cluster virtual image `{{ $labels.name }}` has been stuck in the `Pending` phase for more than 60 minutes. + + ### Common Causes + + - Missing or not ready VirtualImage, ClusterVirtualImage, VirtualDisk or VirtualImageSnapshot + - Scheduling issues on the node + - Cluster resource shortage (CPU, memory) + - Exhausted quotas (e.g., CPU, memory limits) + + ### Recommended Actions + + 1. Check the cluster virtual image status: + ```bash + d8 k get cvi {{ $labels.name }} -o jsonpath="{.status}" | jq + ``` + + 2. Inspect conditions for details: + ```bash + d8 k get cvi {{ $labels.name }} -o jsonpath="{.status.conditions}" | jq + ``` + + 3. Check related events: + ```bash + d8 k get events --field-selector involvedObject.name={{ $labels.name }} + ``` + + 4. 
Check if the source VirtualImage, ClusterVirtualImage or VirtualImageSnapshot exists and is Ready: + ```bash + d8 k -A get vd,vi,cvi,vis + ``` + + + - alert: D8VirtualizationClusterVirtualImageStuckInWaitForUserUploadPhase + expr: d8_virtualization_clustervirtualimage_status_phase{phase="WaitForUserUpload"} == 1 + labels: + severity_level: "9" + tier: cluster + for: 60m + annotations: + plk_protocol_version: "1" + plk_markup_format: "markdown" + plk_create_group_if_not_exists__d8_virtualization_clustervirtualimage_state: "D8VirtualizationClusterVirtualImageState,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes" + plk_grouped_by__d8_virtualization_clustervirtualimage_state: "D8VirtualizationClusterVirtualImageState,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes" + summary: ClusterVirtualImage is stuck in the `WaitForUserUpload` phase for a long time. + description: | + The cluster virtual image `{{ $labels.name }}` has been waiting for a user image upload for more than 60 minutes. + + This means that no image was uploaded to provision the cluster virtual image. + + ### What You Need to Do + + Upload the required image using one of the provided URLs: + + - From outside the cluster: + ```bash + d8 k get cvi {{ $labels.name }} -o jsonpath="{.status.imageUploadURLs.external}" + ``` + + - From inside the cluster (node): + ```bash + d8 k get cvi {{ $labels.name }} -o jsonpath="{.status.imageUploadURLs.inCluster}" + ``` + + - Use `curl`, `wget`, or any HTTP client with the `PUT` method and the appropriate content type (`application/octet-stream`) to upload the image. 
+ + Example: + ```bash + curl -X PUT --data-binary @image.qcow2 \ + -H "Content-Type: application/octet-stream" \ + $(d8 k get cvi {{ $labels.name }} -o jsonpath="{.status.imageUploadURLs.external}") + ``` + + + - alert: D8VirtualizationClusterVirtualImageFailed + expr: d8_virtualization_clustervirtualimage_status_phase{phase="Failed"} == 1 + labels: + severity_level: "6" + tier: cluster + for: 0m + annotations: + plk_protocol_version: "1" + plk_markup_format: "markdown" + plk_create_group_if_not_exists__d8_virtualization_clustervirtualimage_state: "D8VirtualizationClusterVirtualImageState,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes" + plk_grouped_by__d8_virtualization_clustervirtualimage_state: "D8VirtualizationClusterVirtualImageState,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes" + summary: ClusterVirtualImage is in the `Failed` phase. + description: | + The cluster virtual image `{{ $labels.name }}` is in the `Failed` phase. + + This may indicate one or more of the following issues: + + - Wrong image URL + - Wrong container image + - Network issues + - Storage issues + + ### Recommended Actions + + 1. Check the full status of the cluster virtual image: + ```bash + d8 k get cvi {{ $labels.name }} -o jsonpath="{.status}" | jq + ``` + + 2. Inspect the conditions for details: + ```bash + d8 k get cvi {{ $labels.name }} -o jsonpath="{.status.conditions}" | jq + ``` + + 3. Review events related to this cluster virtual image: + ```bash + d8 k get events --field-selector involvedObject.name={{ $labels.name }} + ``` diff --git a/monitoring/prometheus-rules/dvcr.yaml b/monitoring/prometheus-rules/dvcr.yaml index 19a0f15fe1..3a7189409e 100644 --- a/monitoring/prometheus-rules/dvcr.yaml +++ b/monitoring/prometheus-rules/dvcr.yaml @@ -14,9 +14,49 @@ plk_grouped_by__d8_virtualization_dvcr_health: "D8VirtualizationDVCRHealth,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes" summary: The dvcr Pod is NOT Ready. description: | - The recommended course of action: - 1. 
Retrieve details of the Deployment: `kubectl -n d8-virtualization describe deploy dvcr` - 2. View the status of the Pod and try to figure out why it is not running: `kubectl -n d8-virtualization describe pod -l app=dvcr` + One or more Pods of the dvcr Deployment in the d8-virtualization namespace are not in a Ready state. + + The dvcr component serves as a local registry for virtual machine images and disks. If its Pods are not ready, image uploads and VirtualDisk provisioning may be affected. + + Recommended diagnosis steps: + + 1. Check the status of dvcr Pods. + ```bash + d8 k -n d8-virtualization get pods -l app=dvcr + ``` + + 2. Describe the Deployment to check replicas and events. + ```bash + d8 k -n d8-virtualization describe deploy dvcr + ``` + + 3. Get detailed information about the Pod, including events and container statuses. + ```bash + d8 k -n d8-virtualization describe pod -l app=dvcr + ``` + + 4. View logs from the affected Pod. + ```bash + d8 k -n d8-virtualization logs deploy/dvcr + ``` + + 5. If the Pod has restarted, check logs from the previous instance. + ```bash + d8 k -n d8-virtualization logs deploy/dvcr --previous + ``` + + Recommended actions: + + - Investigate readiness probe failures, container crashes, or scheduling issues based on the output of the commands above. + + - If the issue persists, consider restarting the dvcr Deployment. + ```bash + d8 k -n d8-virtualization rollout restart deploy dvcr + ``` + + - Ensure that required storage volumes (such as PVCs) are available and healthy. + + - Verify that there are no node issues such as disk pressure or memory limits affecting scheduling. - alert: D8VirtualizationDVCRPodIsNotRunning expr: absent(kube_pod_status_phase{namespace="d8-virtualization",phase="Running",pod=~"dvcr-.*"}) @@ -31,6 +71,41 @@ plk_grouped_by__d8_virtualization_dvcr_health: "D8VirtualizationDVCRHealth,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes" summary: The dvcr Pod is NOT Running. 
description: | - The recommended course of action: - 1. Retrieve details of the Deployment: `kubectl -n d8-virtualization describe deploy dvcr` - 2. View the status of the Pod and try to figure out why it is not running: `kubectl -n d8-virtualization describe pod -l app=dvcr` + No running Pods were found for the dvcr Deployment in the d8-virtualization namespace for more than 2 minutes. + + The dvcr component serves as a local registry for virtual machine images and disks. Its unavailability may block image uploads and provisioning of new VirtualDisks or VirtualMachines. + + Recommended diagnosis steps: + + 1. Check if any dvcr Pods exist and what their status is. + ```bash + d8 k -n d8-virtualization get pods -l app=dvcr + ``` + + 2. Describe the Deployment to check replicas and events. + ```bash + d8 k -n d8-virtualization describe deploy dvcr + ``` + + 3. Get logs from previous Pods (if they have been restarted). + ```bash + d8 k -n d8-virtualization logs deploy/dvcr --previous + ``` + + 4. Get events related to the dvcr Deployment and Pods. + ```bash + d8 k -n d8-virtualization describe pod -l app=dvcr + ``` + + Recommended actions: + + - If the Deployment has zero replicas, scale it back up. + ```bash + d8 k -n d8-virtualization scale deploy dvcr --replicas=1 + ``` + + - If Pods are crashing, inspect their logs and events. + + - Ensure that required storage volumes (such as PVCs) are available and healthy. + + - Verify that there are no node issues such as disk pressure or memory limits affecting scheduling. diff --git a/monitoring/prometheus-rules/internal-virtualization-cdi-apiservier.yaml b/monitoring/prometheus-rules/internal-virtualization-cdi-apiservier.yaml index 38b079907b..e38cb4570a 100644 --- a/monitoring/prometheus-rules/internal-virtualization-cdi-apiservier.yaml +++ b/monitoring/prometheus-rules/internal-virtualization-cdi-apiservier.yaml @@ -15,8 +15,8 @@ summary: The cdi-apiserver Pod is NOT Ready. description: | The recommended course of action: - 1. 
Retrieve details of the Deployment: `kubectl -n d8-virtualization describe deploy cdi-apiserver` - 2. View the status of the Pod and try to figure out why it is not running: `kubectl -n d8-virtualization describe pod -l cdi.internal.virtualization.deckhouse.io=cdi-apiserver` + 1. Retrieve details of the Deployment: `d8 k -n d8-virtualization describe deploy cdi-apiserver` + 2. View the status of the Pod and try to figure out why it is not running: `d8 k -n d8-virtualization describe pod -l cdi.internal.virtualization.deckhouse.io=cdi-apiserver` - alert: D8InternalVirtualizationCDIAPIServerPodIsNotRunning expr: absent(kube_pod_status_phase{namespace="d8-virtualization",phase="Running",pod=~"cdi-apiserver-.*"}) @@ -32,5 +32,5 @@ summary: The cdi-apiserver Pod is NOT Running. description: | The recommended course of action: - 1. Retrieve details of the Deployment: `kubectl -n d8-virtualization describe deploy cdi-apiserver` - 2. View the status of the Pod and try to figure out why it is not running: `kubectl -n d8-virtualization describe pod -l cdi.internal.virtualization.deckhouse.io=cdi-apiserver` + 1. Retrieve details of the Deployment: `d8 k -n d8-virtualization describe deploy cdi-apiserver` + 2. View the status of the Pod and try to figure out why it is not running: `d8 k -n d8-virtualization describe pod -l cdi.internal.virtualization.deckhouse.io=cdi-apiserver` diff --git a/monitoring/prometheus-rules/internal-virtualization-cdi-deployment.yaml b/monitoring/prometheus-rules/internal-virtualization-cdi-deployment.yaml index fef308313e..7c5666b3dd 100644 --- a/monitoring/prometheus-rules/internal-virtualization-cdi-deployment.yaml +++ b/monitoring/prometheus-rules/internal-virtualization-cdi-deployment.yaml @@ -15,8 +15,8 @@ summary: The cdi-deployment Pod is NOT Ready. description: | The recommended course of action: - 1. Retrieve details of the Deployment: `kubectl -n d8-virtualization describe deploy cdi-deployment` - 2. 
View the status of the Pod and try to figure out why it is not running: `kubectl -n d8-virtualization describe pod -l app=containerized-data-importer` + 1. Retrieve details of the Deployment: `d8 k -n d8-virtualization describe deploy cdi-deployment` + 2. View the status of the Pod and try to figure out why it is not running: `d8 k -n d8-virtualization describe pod -l app=containerized-data-importer` - alert: D8InternalVirtualizationCDIDeploymentPodIsNotRunning expr: absent(kube_pod_status_phase{namespace="d8-virtualization",phase="Running",pod=~"cdi-deployment-.*"}) @@ -32,5 +32,5 @@ summary: The cdi-deployment Pod is NOT Running. description: | The recommended course of action: - 1. Retrieve details of the Deployment: `kubectl -n d8-virtualization describe deploy cdi-deployment` - 2. View the status of the Pod and try to figure out why it is not running: `kubectl -n d8-virtualization describe pod -l app=containerized-data-importer` + 1. Retrieve details of the Deployment: `d8 k -n d8-virtualization describe deploy cdi-deployment` + 2. View the status of the Pod and try to figure out why it is not running: `d8 k -n d8-virtualization describe pod -l app=containerized-data-importer` diff --git a/monitoring/prometheus-rules/internal-virtualization-cdi-operator.yaml b/monitoring/prometheus-rules/internal-virtualization-cdi-operator.yaml index 368553ed3d..5fa7b164e2 100644 --- a/monitoring/prometheus-rules/internal-virtualization-cdi-operator.yaml +++ b/monitoring/prometheus-rules/internal-virtualization-cdi-operator.yaml @@ -15,8 +15,8 @@ summary: The cdi-operator Pod is NOT Ready. description: | The recommended course of action: - 1. Retrieve details of the Deployment: `kubectl -n d8-virtualization describe deploy cdi-operator` - 2. View the status of the Pod and try to figure out why it is not running: `kubectl -n d8-virtualization describe pod -l app=cdi-operator` + 1. Retrieve details of the Deployment: `d8 k -n d8-virtualization describe deploy cdi-operator` + 2. 
View the status of the Pod and try to figure out why it is not running: `d8 k -n d8-virtualization describe pod -l app=cdi-operator` - alert: D8InternalVirtualizationCDIOperatorPodIsNotRunning expr: absent(kube_pod_status_phase{namespace="d8-virtualization",phase="Running",pod=~"cdi-operator-.*"}) @@ -32,5 +32,5 @@ summary: The cdi-operator Pod is NOT Running. description: | The recommended course of action: - 1. Retrieve details of the Deployment: `kubectl -n d8-virtualization describe deploy cdi-operator` - 2. View the status of the Pod and try to figure out why it is not running: `kubectl -n d8-virtualization describe pod -l app=cdi-operator` + 1. Retrieve details of the Deployment: `d8 k -n d8-virtualization describe deploy cdi-operator` + 2. View the status of the Pod and try to figure out why it is not running: `d8 k -n d8-virtualization describe pod -l app=cdi-operator` diff --git a/monitoring/prometheus-rules/internal-virtualization-virt-api.yaml b/monitoring/prometheus-rules/internal-virtualization-virt-api.yaml index 14610661e5..23790e7c6c 100644 --- a/monitoring/prometheus-rules/internal-virtualization-virt-api.yaml +++ b/monitoring/prometheus-rules/internal-virtualization-virt-api.yaml @@ -15,8 +15,8 @@ summary: The virt-api Pod is NOT Ready. description: | The recommended course of action: - 1. Retrieve details of the Deployment: `kubectl -n d8-virtualization describe deploy virt-api` - 2. View the status of the Pod and try to figure out why it is not running: `kubectl -n d8-virtualization describe pod -l kubevirt.internal.virtualization.deckhouse.io=virt-api` + 1. Retrieve details of the Deployment: `d8 k -n d8-virtualization describe deploy virt-api` + 2. 
View the status of the Pod and try to figure out why it is not running: `d8 k -n d8-virtualization describe pod -l kubevirt.internal.virtualization.deckhouse.io=virt-api` - alert: D8InternalVirtualizationVirtAPIPodIsNotRunning expr: absent(kube_pod_status_phase{namespace="d8-virtualization",phase="Running",pod=~"virt-api-.*"}) @@ -32,5 +32,5 @@ summary: The virt-api Pod is NOT Running. description: | The recommended course of action: - 1. Retrieve details of the Deployment: `kubectl -n d8-virtualization describe deploy virt-api` - 2. View the status of the Pod and try to figure out why it is not running: `kubectl -n d8-virtualization describe pod -l kubevirt.internal.virtualization.deckhouse.io=virt-api` + 1. Retrieve details of the Deployment: `d8 k -n d8-virtualization describe deploy virt-api` + 2. View the status of the Pod and try to figure out why it is not running: `d8 k -n d8-virtualization describe pod -l kubevirt.internal.virtualization.deckhouse.io=virt-api` diff --git a/monitoring/prometheus-rules/internal-virtualization-virt-controller.yaml b/monitoring/prometheus-rules/internal-virtualization-virt-controller.yaml index 17aa2160c3..04807edc7d 100644 --- a/monitoring/prometheus-rules/internal-virtualization-virt-controller.yaml +++ b/monitoring/prometheus-rules/internal-virtualization-virt-controller.yaml @@ -15,8 +15,8 @@ summary: The virt-controller Pod is NOT Ready. description: | The recommended course of action: - 1. Retrieve details of the Deployment: `kubectl -n d8-virtualization describe deploy virt-controller` - 2. View the status of the Pod and try to figure out why it is not running: `kubectl -n d8-virtualization describe pod -l kubevirt.internal.virtualization.deckhouse.io=virt-controller` + 1. Retrieve details of the Deployment: `d8 k -n d8-virtualization describe deploy virt-controller` + 2. 
View the status of the Pod and try to figure out why it is not running: `d8 k -n d8-virtualization describe pod -l kubevirt.internal.virtualization.deckhouse.io=virt-controller` - alert: D8InternalVirtualizationVirtControllerPodIsNotRunning expr: absent(kube_pod_status_phase{namespace="d8-virtualization",phase="Running",pod=~"virt-controller-.*"}) @@ -32,5 +32,5 @@ summary: The virt-controller Pod is NOT Running. description: | The recommended course of action: - 1. Retrieve details of the Deployment: `kubectl -n d8-virtualization describe deploy virt-controller` - 2. View the status of the Pod and try to figure out why it is not running: `kubectl -n d8-virtualization describe pod -l kubevirt.internal.virtualization.deckhouse.io=virt-controller` + 1. Retrieve details of the Deployment: `d8 k -n d8-virtualization describe deploy virt-controller` + 2. View the status of the Pod and try to figure out why it is not running: `d8 k -n d8-virtualization describe pod -l kubevirt.internal.virtualization.deckhouse.io=virt-controller` diff --git a/monitoring/prometheus-rules/internal-virtualization-virt-hander.yaml b/monitoring/prometheus-rules/internal-virtualization-virt-hander.yaml index 020330a650..c23149e511 100644 --- a/monitoring/prometheus-rules/internal-virtualization-virt-hander.yaml +++ b/monitoring/prometheus-rules/internal-virtualization-virt-hander.yaml @@ -15,8 +15,8 @@ summary: The virt-handler Pod is NOT Ready. description: | The recommended course of action: - 1. Retrieve details of the Deployment: `kubectl -n d8-virtualization describe daemonset virt-handler` - 2. View the status of the Pod and try to figure out why it is not running: `kubectl -n d8-virtualization describe pod --field-selector=spec.nodeName={{ $labels.node }} -l kubevirt.internal.virtualization.deckhouse.io=virt-handler` + 1. Retrieve details of the DaemonSet: `d8 k -n d8-virtualization describe daemonset virt-handler` + 2. 
View the status of the Pod and try to figure out why it is not running: `d8 k -n d8-virtualization describe pod --field-selector=spec.nodeName={{ $labels.node }} -l kubevirt.internal.virtualization.deckhouse.io=virt-handler` - alert: D8InternalVirtualizationVirtHandlerPodIsNotRunning expr: absent(avg by(node,pod,namespace)(kube_pod_info{}) * on(pod, namespace) group_right(node) kube_pod_status_phase{namespace="d8-virtualization",phase="Running",pod=~"virt-handler-.*"}) diff --git a/monitoring/prometheus-rules/internal-virtualization-virt-operator.yaml b/monitoring/prometheus-rules/internal-virtualization-virt-operator.yaml index 655d0bd7d1..3b5059ea72 100644 --- a/monitoring/prometheus-rules/internal-virtualization-virt-operator.yaml +++ b/monitoring/prometheus-rules/internal-virtualization-virt-operator.yaml @@ -15,8 +15,8 @@ summary: The virt-operator Pod is NOT Ready. description: | The recommended course of action: - 1. Retrieve details of the Deployment: `kubectl -n d8-virtualization describe deploy virt-operator` - 2. View the status of the Pod and try to figure out why it is not running: `kubectl -n d8-virtualization describe pod -l kubevirt.internal.virtualization.deckhouse.io=virt-operator` + 1. Retrieve details of the Deployment: `d8 k -n d8-virtualization describe deploy virt-operator` + 2. View the status of the Pod and try to figure out why it is not running: `d8 k -n d8-virtualization describe pod -l kubevirt.internal.virtualization.deckhouse.io=virt-operator` - alert: D8InternalVirtualizationVirtOperatorPodIsNotRunning expr: absent(kube_pod_status_phase{namespace="d8-virtualization",phase="Running",pod=~"virt-operator-.*"}) @@ -32,5 +32,5 @@ summary: The virt-operator Pod is NOT Running. description: | The recommended course of action: - 1. Retrieve details of the Deployment: `kubectl -n d8-virtualization describe deploy virt-operator` - 2. 
View the status of the Pod and try to figure out why it is not running: `kubectl -n d8-virtualization describe pod -l kubevirt.internal.virtualization.deckhouse.io=virt-operator` + 1. Retrieve details of the Deployment: `d8 k -n d8-virtualization describe deploy virt-operator` + 2. View the status of the Pod and try to figure out why it is not running: `d8 k -n d8-virtualization describe pod -l kubevirt.internal.virtualization.deckhouse.io=virt-operator` diff --git a/monitoring/prometheus-rules/vd.state.yaml b/monitoring/prometheus-rules/vd.state.yaml new file mode 100644 index 0000000000..9688ecae18 --- /dev/null +++ b/monitoring/prometheus-rules/vd.state.yaml @@ -0,0 +1,133 @@ +- name: virtualization.vd.state + rules: + - alert: D8VirtualizationVirtualDiskStuckInPendingPhase + expr: d8_virtualization_virtualdisk_status_phase{phase="Pending"} == 1 + labels: + severity_level: "9" + tier: application + for: 60m + annotations: + plk_protocol_version: "1" + plk_markup_format: "markdown" + plk_create_group_if_not_exists__d8_virtualization_virtualdisk_state: "D8VirtualizationVirtualDiskState,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes" + plk_grouped_by__d8_virtualization_virtualdisk_state: "D8VirtualizationVirtualDiskState,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes" + summary: VirtualDisk is stuck in the `Pending` phase for a long time. + description: | + The virtual disk `{{ $labels.name }}` in the namespace `{{ $labels.namespace }}` has been stuck in the `Pending` phase for more than 60 minutes. + + ### Common Causes + + - Missing or not ready VirtualImage, ClusterVirtualImage or VirtualDiskSnapshot + - Scheduling issues on the node + - Cluster resource shortage (CPU, memory) + - Exhausted quotas (e.g., CPU, memory limits) + + + ### Recommended Actions + + 1. Check virtual disk status: + ```bash + d8 k -n {{ $labels.namespace }} get vd {{ $labels.name }} -o jsonpath="{.status}" | jq + ``` + + 2. 
Inspect conditions for details: + ```bash + d8 k -n {{ $labels.namespace }} get vd {{ $labels.name }} -o jsonpath="{.status.conditions}" | jq + ``` + + 3. Check related events: + ```bash + d8 k -n {{ $labels.namespace }} get events --field-selector involvedObject.name={{ $labels.name }} + ``` + + 4. Check if the source VirtualImage, ClusterVirtualImage or VirtualDiskSnapshot exists and is Ready: + ```bash + d8 k -n {{ $labels.namespace }} get vi,cvi,vds + ``` + + 5. Check cluster resource usage and quotas: + ```bash + d8 k get -n {{ $labels.namespace }} resourcequotas + ``` + + - alert: D8VirtualizationVirtualDiskStuckInWaitForUserUploadPhase + expr: d8_virtualization_virtualdisk_status_phase{phase="WaitForUserUpload"} == 1 + labels: + severity_level: "9" + tier: application + for: 60m + annotations: + plk_protocol_version: "1" + plk_markup_format: "markdown" + plk_create_group_if_not_exists__d8_virtualization_virtualdisk_state: "D8VirtualizationVirtualDiskState,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes" + plk_grouped_by__d8_virtualization_virtualdisk_state: "D8VirtualizationVirtualDiskState,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes" + summary: VirtualDisk is stuck in the `WaitForUserUpload` phase for a long time. + description: | + The virtual disk `{{ $labels.name }}` in the namespace `{{ $labels.namespace }}` has been waiting for a user image upload for more than 60 minutes. + + This means that no image was uploaded to provision the disk, and associated operations (e.g., starting a VM) may be blocked until this step is completed. 
+ + ### What You Need to Do + + Upload the required disk image using one of the provided URLs: + + - From outside the cluster: + ```bash + d8 k -n {{ $labels.namespace }} get vd {{ $labels.name }} -o jsonpath="{.status.imageUploadURLs.external}" + ``` + + - From inside the cluster (node): + ```bash + d8 k -n {{ $labels.namespace }} get vd {{ $labels.name }} -o jsonpath="{.status.imageUploadURLs.inCluster}" + ``` + + - Use `curl`, `wget`, or any HTTP client with the `PUT` method and the appropriate content type (`application/octet-stream`) to upload the image. + + Example: + ```bash + curl -X PUT --data-binary @disk-image.qcow2 \ + -H "Content-Type: application/octet-stream" \ + $(d8 k -n {{ $labels.namespace }} get vd {{ $labels.name }} -o jsonpath="{.status.imageUploadURLs.external}") + ``` + + + - alert: D8VirtualizationVirtualDiskFailed + expr: d8_virtualization_virtualdisk_status_phase{phase="Failed"} == 1 + labels: + severity_level: "7" + tier: application + for: 0m + annotations: + plk_protocol_version: "1" + plk_markup_format: "markdown" + plk_create_group_if_not_exists__d8_virtualization_virtualdisk_state: "D8VirtualizationVirtualDiskState,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes" + plk_grouped_by__d8_virtualization_virtualdisk_state: "D8VirtualizationVirtualDiskState,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes" + summary: VirtualDisk is in the `Failed` phase. + description: | + The virtual disk `{{ $labels.name }}` in the namespace `{{ $labels.namespace }}` is in the `Failed` phase. + + This may indicate one or more of the following issues: + + - Wrong image URL + - Wrong container image + - Network issues + - Storage issues + + ### Recommended Actions + + 1. Check the full status of the virtual disk: + ```bash + d8 k -n {{ $labels.namespace }} get vd {{ $labels.name }} -o jsonpath="{.status}" | jq + ``` + + 2. 
Inspect the condition for details: + ```bash + d8 k -n {{ $labels.namespace }} get vd {{ $labels.name }} -o jsonpath="{.status.conditions}" | jq + ``` + + 3. Review events related to this virtual disk: + ```bash + d8 k -n {{ $labels.namespace }} get events --field-selector involvedObject.name={{ $labels.name }} + ``` + + 4. Recreate virtual disk from a working source. diff --git a/monitoring/prometheus-rules/vi.state.yaml b/monitoring/prometheus-rules/vi.state.yaml new file mode 100644 index 0000000000..445e932ae1 --- /dev/null +++ b/monitoring/prometheus-rules/vi.state.yaml @@ -0,0 +1,130 @@ +- name: virtualization.vi.state + rules: + - alert: D8VirtualizationVirtualImageStuckInPendingPhase + expr: d8_virtualization_virtualimage_status_phase{phase="Pending"} == 1 + labels: + severity_level: "9" + tier: application + for: 60m + annotations: + plk_protocol_version: "1" + plk_markup_format: "markdown" + plk_create_group_if_not_exists__d8_virtualization_virtualimage_state: "D8VirtualizationVirtualImageState,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes" + plk_grouped_by__d8_virtualization_virtualimage_state: "D8VirtualizationVirtualImageState,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes" + summary: VirtualImage is stuck in the `Pending` phase for a long time. + description: | + The virtual image `{{ $labels.name }}` in the namespace `{{ $labels.namespace }}` has been stuck in the `Pending` phase for more than 60 minutes. + + ### Common Causes + + - Missing or not ready VirtualImage, ClusterVirtualImage, VirtualDisk or VirtualImageSnapshot + - Scheduling issues on the node + - Cluster resource shortage (CPU, memory) + - Exhausted quotas (e.g., CPU, memory limits) + + ### Recommended Actions + + 1. Check virtual image status: + ```bash + d8 k -n {{ $labels.namespace }} get vi {{ $labels.name }} -o jsonpath="{.status}" | jq + ``` + + 2. 
Inspect conditions for details: + ```bash + d8 k -n {{ $labels.namespace }} get vi {{ $labels.name }} -o jsonpath="{.status.conditions}" | jq + ``` + + 3. Check related events: + ```bash + d8 k -n {{ $labels.namespace }} get events --field-selector involvedObject.name={{ $labels.name }} + ``` + + 4. Check if the source VirtualImage, ClusterVirtualImage or VirtualImageSnapshot exists and is Ready: + ```bash + d8 k -n {{ $labels.namespace }} get vd,vi,cvi,vis + ``` + + 5. Check cluster resource usage and quotas: + ```bash + d8 k get -n {{ $labels.namespace }} resourcequotas + ``` + + - alert: D8VirtualizationVirtualImageStuckInWaitForUserUploadPhase + expr: d8_virtualization_virtualimage_status_phase{phase="WaitForUserUpload"} == 1 + labels: + severity_level: "9" + tier: application + for: 60m + annotations: + plk_protocol_version: "1" + plk_markup_format: "markdown" + plk_create_group_if_not_exists__d8_virtualization_virtualimage_state: "D8VirtualizationVirtualImageState,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes" + plk_grouped_by__d8_virtualization_virtualimage_state: "D8VirtualizationVirtualImageState,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes" + summary: VirtualImage is stuck in the `WaitForUserUpload` phase for a long time. + description: | + The virtual image `{{ $labels.name }}` in the namespace `{{ $labels.namespace }}` has been waiting for a user image upload for more than 60 minutes. + + This means that no image was uploaded to provision the image. 
+ + ### What You Need to Do + + Upload the required image using one of the provided URLs: + + - From outside the cluster: + ```bash + d8 k -n {{ $labels.namespace }} get vi {{ $labels.name }} -o jsonpath="{.status.imageUploadURLs.external}" + ``` + + - From inside the cluster (node): + ```bash + d8 k -n {{ $labels.namespace }} get vi {{ $labels.name }} -o jsonpath="{.status.imageUploadURLs.inCluster}" + ``` + + - Use `curl`, `wget`, or any HTTP client with the `PUT` method and the appropriate content type (`application/octet-stream`) to upload the image. + + Example: + ```bash + curl -X PUT --data-binary @image.qcow2 \ + -H "Content-Type: application/octet-stream" \ + $(d8 k -n {{ $labels.namespace }} get vi {{ $labels.name }} -o jsonpath="{.status.imageUploadURLs.external}") + ``` + + + - alert: D8VirtualizationVirtualImageFailed + expr: d8_virtualization_virtualimage_status_phase{phase="Failed"} == 1 + labels: + severity_level: "7" + tier: application + for: 0m + annotations: + plk_protocol_version: "1" + plk_markup_format: "markdown" + plk_create_group_if_not_exists__d8_virtualization_virtualimage_state: "D8VirtualizationVirtualImageState,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes" + plk_grouped_by__d8_virtualization_virtualimage_state: "D8VirtualizationVirtualImageState,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes" + summary: VirtualImage is in the `Failed` phase. + description: | + The virtual image `{{ $labels.name }}` in the namespace `{{ $labels.namespace }}` is in the `Failed` phase. + + This may indicate one or more of the following issues: + + - Wrong image URL + - Wrong container image + - Network issues + - Storage issues + + ### Recommended Actions + + 1. Check the full status of the virtual image: + ```bash + d8 k -n {{ $labels.namespace }} get vi {{ $labels.name }} -o jsonpath="{.status}" | jq + ``` + + 2. 
Inspect the condition for details: + ```bash + d8 k -n {{ $labels.namespace }} get vi {{ $labels.name }} -o jsonpath="{.status.conditions}" | jq + ``` + + 3. Review events related to this virtual image: + ```bash + d8 k -n {{ $labels.namespace }} get events --field-selector involvedObject.name={{ $labels.name }} + ``` diff --git a/monitoring/prometheus-rules/virtualization-api.yaml b/monitoring/prometheus-rules/virtualization-api.yaml index 55b36219b6..5e576ba7f2 100644 --- a/monitoring/prometheus-rules/virtualization-api.yaml +++ b/monitoring/prometheus-rules/virtualization-api.yaml @@ -1,4 +1,4 @@ -- name: kubernetes.virtualization.api_state +- name: virtualization.api_state rules: - alert: D8VirtualizationAPIPodIsNotReady expr: min by (pod) (kube_pod_status_ready{condition="true", namespace="d8-virtualization", pod=~"virtualization-api-.*"}) != 1 @@ -15,8 +15,8 @@ summary: The virtualization-api Pod is NOT Ready. description: | The recommended course of action: - 1. Retrieve details of the Deployment: `kubectl -n d8-virtualization describe deploy virtualization-api` - 2. View the status of the Pod and try to figure out why it is not running: `kubectl -n d8-virtualization describe pod -l app=virtualization-api` + 1. Retrieve details of the Deployment: `d8 k -n d8-virtualization describe deploy virtualization-api` + 2. View the status of the Pod and try to figure out why it is not running: `d8 k -n d8-virtualization describe pod -l app=virtualization-api` - alert: D8VirtualizationAPIPodIsNotRunning expr: absent(kube_pod_status_phase{namespace="d8-virtualization",phase="Running",pod=~"virtualization-api-.*"}) @@ -32,5 +32,5 @@ summary: The virtualization-api Pod is NOT Running. description: | The recommended course of action: - 1. Retrieve details of the Deployment: `kubectl -n d8-virtualization describe deploy virtualization-api` - 2. 
View the status of the Pod and try to figure out why it is not running: `kubectl -n d8-virtualization describe pod -l app=virtualization-api` + 1. Retrieve details of the Deployment: `d8 k -n d8-virtualization describe deploy virtualization-api` + 2. View the status of the Pod and try to figure out why it is not running: `d8 k -n d8-virtualization describe pod -l app=virtualization-api` diff --git a/monitoring/prometheus-rules/virtualization-controller.yaml b/monitoring/prometheus-rules/virtualization-controller.yaml index 0395d3bbd0..e21175ee8f 100644 --- a/monitoring/prometheus-rules/virtualization-controller.yaml +++ b/monitoring/prometheus-rules/virtualization-controller.yaml @@ -1,4 +1,4 @@ -- name: kubernetes.virtualization.controller_state +- name: virtualization.controller_state rules: - alert: D8VirtualizationControllerTargetDown expr: max by (job) (up{job="scrapeconfig/d8-monitoring/virtualization-controller"}) == 0 @@ -15,8 +15,8 @@ summary: Prometheus cannot scrape the virtualization-controller metrics. description: | The recommended course of action: - 1. Check the Pod status: `kubectl -n d8-virtualization get pod -l app=virtualization-controller` - 2. Or check the Pod logs: `kubectl -n d8-virtualization logs deploy/virtualization-controller` + 1. Check the Pod status: `d8 k -n d8-virtualization get pod -l app=virtualization-controller` + 2. Or check the Pod logs: `d8 k -n d8-virtualization logs deploy/virtualization-controller` - alert: D8VirtualizationControllerTargetAbsent expr: absent(up{job="scrapeconfig/d8-monitoring/virtualization-controller"}) == 1 @@ -33,8 +33,8 @@ summary: There is no `virtualization-controller` target in Prometheus. description: | The recommended course of action: - 1. Check the Pod status: `kubectl -n d8-virtualization get pod -l app=virtualization-controller` - 2. Or check the Pod logs: `kubectl -n d8-virtualization logs deploy/virtualization-controller` + 1. 
Check the Pod status: `d8 k -n d8-virtualization get pod -l app=virtualization-controller` + 2. Or check the Pod logs: `d8 k -n d8-virtualization logs deploy/virtualization-controller` - alert: D8VirtualizationControllerPodIsNotReady expr: min by (pod) (kube_pod_status_ready{condition="true", namespace="d8-virtualization", pod=~"virtualization-controller-.*"}) != 1 @@ -51,8 +51,8 @@ summary: The virtualization-controller Pod is NOT Ready. description: | The recommended course of action: - 1. Retrieve details of the Deployment: `kubectl -n d8-virtualization describe deploy virtualization-controller` - 2. View the status of the Pod and try to figure out why it is not running: `kubectl -n d8-virtualization describe pod -l app=virtualization-controller` + 1. Retrieve details of the Deployment: `d8 k -n d8-virtualization describe deploy virtualization-controller` + 2. View the status of the Pod and try to figure out why it is not running: `d8 k -n d8-virtualization describe pod -l app=virtualization-controller` - alert: D8VirtualizationControllerPodIsNotRunning expr: absent(kube_pod_status_phase{namespace="d8-virtualization",phase="Running",pod=~"virtualization-controller-.*"}) @@ -68,5 +68,5 @@ summary: The virtualization-controller Pod is NOT Running. description: | The recommended course of action: - 1. Retrieve details of the Deployment: `kubectl -n d8-virtualization describe deploy virtualization-controller` - 2. View the status of the Pod and try to figure out why it is not running: `kubectl -n d8-virtualization describe pod -l app=virtualization-controller` + 1. Retrieve details of the Deployment: `d8 k -n d8-virtualization describe deploy virtualization-controller` + 2. 
View the status of the Pod and try to figure out why it is not running: `d8 k -n d8-virtualization describe pod -l app=virtualization-controller` diff --git a/monitoring/prometheus-rules/vm-route-forge.yaml b/monitoring/prometheus-rules/vm-route-forge.yaml index cfd2fdc1e4..31e59600b8 100644 --- a/monitoring/prometheus-rules/vm-route-forge.yaml +++ b/monitoring/prometheus-rules/vm-route-forge.yaml @@ -1,4 +1,4 @@ -- name: kubernetes.virtualization.vm_route_forge_state +- name: virtualization.vm_route_forge_state rules: - alert: D8InternalVirtualizationVirtHandlerPodIsNotReady expr: min by (pod) (avg by(node,pod,namespace)(kube_pod_info{}) * on(pod, namespace) group_right(node) kube_pod_status_ready{condition="true", namespace="d8-virtualization", pod=~"vm-route-forge-.*"}) != 1 @@ -15,8 +15,8 @@ summary: The vm-route-forge Pod is NOT Ready. description: | The recommended course of action: - 1. Retrieve details of the Deployment: `kubectl -n d8-virtualization describe daemonset vm-route-forge` - 2. View the status of the Pod and try to figure out why it is not running: `kubectl -n d8-virtualization describe pod --field-selector=spec.nodeName={{ $labels.node }} -l app=vm-route-forge` + 1. Retrieve details of the Deployment: `d8 k -n d8-virtualization describe daemonset vm-route-forge` + 2. View the status of the Pod and try to figure out why it is not running: `d8 k -n d8-virtualization describe pod --field-selector=spec.nodeName={{ $labels.node }} -l app=vm-route-forge` - alert: D8InternalVirtualizationVirtHandlerPodIsNotRunning expr: absent(avg by(node,pod,namespace)(kube_pod_info{}) * on(pod, namespace) group_right(node) kube_pod_status_phase{namespace="d8-virtualization",phase="Running",pod=~"vm-route-forge-.*"}) @@ -32,5 +32,5 @@ summary: The vm-route-forge Pod is NOT Running. description: | The recommended course of action: - 1. Retrieve details of the Deployment: `kubectl -n d8-virtualization describe daemonset vm-route-forge` - 2. 
View the status of the Pod and try to figure out why it is not running: `kubectl -n d8-virtualization describe pod --field-selector=spec.nodeName={{ $labels.node }} -l app=vm-route-forge` + 1. Retrieve details of the Deployment: `d8 k -n d8-virtualization describe daemonset vm-route-forge` + 2. View the status of the Pod and try to figure out why it is not running: `d8 k -n d8-virtualization describe pod --field-selector=spec.nodeName={{ $labels.node }} -l app=vm-route-forge` diff --git a/monitoring/prometheus-rules/vm.compute.yaml b/monitoring/prometheus-rules/vm.compute.yaml new file mode 100644 index 0000000000..8cfd6af165 --- /dev/null +++ b/monitoring/prometheus-rules/vm.compute.yaml @@ -0,0 +1,268 @@ +- name: virtualization.vm.compute + rules: + - alert: D8VirtualizationVirtualMachineHighCPULoad + expr: | + ( + ( + sum by (namespace,name) (rate(d8_virtualization_virtualmachine_cpu_usage_milliseconds_total[1m]) / 1000) + / + sum by (namespace,name) (d8_virtualization_virtualmachine_cpu_cores) + ) + * + ( + 1 / sum by (namespace,name) (d8_virtualization_virtualmachine_cpu_core_fraction / 100) + ) + ) * 100 + > 85 + for: 5m + labels: + severity_level: "6" + tier: application + annotations: + plk_protocol_version: "1" + plk_markup_format: "markdown" + plk_create_group_if_not_exists__d8_virtualization_virtualmachine_health: "D8VirtualizationVirtualmachineHealth,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes" + plk_grouped_by__d8_virtualization_virtualmachine_health: "D8VirtualizationVirtualmachineHealth,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes" + summary: VirtualMachine has high average CPU load across all cores. + description: | + The VirtualMachine `{{ $labels.name }}` in the namespace `{{ $labels.namespace }}` has an average CPU load of more than 85% across all its cores over the last minute, and this has been sustained for more than 5 minutes. + + This may indicate insufficient CPU resources allocated for VM workload. 
+ + Recommended actions: + + - Consider increasing CPU cores (`.spec.cpu.cores`) or coreFraction (`.spec.cpu.coreFraction`) for the VirtualMachine if workload is expected. + - Investigate applications for possible inefficiencies or leaks. + - Monitor memory and disk I/O to detect correlated bottlenecks. + + - alert: D8VirtualizationVirtualMachineHighMemoryUsage + expr: | + ( + avg by (namespace, name) + (d8_virtualization_virtualmachine_os_memory_total_bytes - d8_virtualization_virtualmachine_os_memory_potentially_free_bytes) + ) + / + ( + avg by (namespace, name) + (d8_virtualization_virtualmachine_memory_size_bytes) + ) + * 100 > 85 + for: 5m + labels: + severity_level: "6" + tier: application + annotations: + summary: VirtualMachine has high memory usage for a long time. + plk_protocol_version: "1" + plk_markup_format: "markdown" + plk_create_group_if_not_exists__d8_virtualization_virtualmachine_health: "D8VirtualizationVirtualmachineHealth,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes" + plk_grouped_by__d8_virtualization_virtualmachine_health: "D8VirtualizationVirtualmachineHealth,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes" + description: | + The VirtualMachine `{{ $labels.name }}` in the namespace `{{ $labels.namespace }}` has been using more than 85% of its allocated memory over the last 5 minutes. + + This may indicate insufficient memory resources allocated for VM workload. + + Recommended actions: + + - Consider increasing memory `.spec.memory.size` if workload is expected to grow. + - Investigate top memory-consuming processes inside the guest OS. + - If memory pressure is frequent, consider enabling swap space (if acceptable for performance). 
+ + - alert: D8VirtualizationVirtualMachineNetworkPacketsDropped + expr: | + ( + sum by (namespace, name, network) ( + rate(d8_virtualization_virtualmachine_network_receive_packets_dropped_total[1m]) + + rate(d8_virtualization_virtualmachine_network_transmit_packets_dropped_total[1m]) + ) + / + sum by (namespace, name, network) ( + rate(d8_virtualization_virtualmachine_network_receive_packets_total[1m]) + + rate(d8_virtualization_virtualmachine_network_transmit_packets_total[1m]) + ) + ) * 100 > 1 + for: 0m + labels: + severity_level: "6" + tier: application + annotations: + plk_protocol_version: "1" + plk_markup_format: "markdown" + plk_create_group_if_not_exists__d8_virtualization_virtualmachine_health: "D8VirtualizationVirtualmachineHealth,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes" + plk_grouped_by__d8_virtualization_virtualmachine_health: "D8VirtualizationVirtualmachineHealth,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes" + summary: VirtualMachine is dropping more than 1% of its network packets. + description: | + The VirtualMachine `{{ $labels.name }}` in the namespace `{{ $labels.namespace }}` has been dropping more than 1% of its packets (combined receive + transmit) on network interface `{{ $labels.network }}`. + + This may indicate: + - Network congestion or oversubscription. + - Insufficient CPU or resources to process packets in time.
+ + - alert: D8VirtualizationVirtualMachineMemoryUnderMemoryPressure + expr: | + ( + rate(d8_virtualization_virtualmachine_os_memory_pgmajfault_total[5m]) + ) > 1000 + for: 0m + labels: + severity_level: "6" + tier: application + annotations: + plk_protocol_version: "1" + plk_markup_format: "markdown" + plk_create_group_if_not_exists__d8_virtualization_virtualmachine_health: "D8VirtualizationVirtualmachineHealth,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes" + plk_grouped_by__d8_virtualization_virtualmachine_health: "D8VirtualizationVirtualmachineHealth,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes" + summary: VirtualMachine is under memory pressure (high major page faults). + description: | + The VirtualMachine `{{ $labels.name }}` in the namespace `{{ $labels.namespace }}` is experiencing more than 1000 major page faults per second. + + This may indicate: + - High memory pressure inside the VM. + - Application is frequently accessing new memory regions. + - Swap usage is increasing (`pgmajfault`). + - Memory fragmentation or inefficient memory access patterns. + + ### Recommended Actions + + - Investigate top memory-faulting processes inside the guest OS. + - Consider increasing memory `.spec.memory.size` if workload requires it. + - If `pgmajfault` is high, evaluate swap usage and performance impact.
+ + - alert: D8VirtualizationVirtualMachineFileSystemAlmostFull + expr: | + ( + sum by (name, namespace, mount_point) + (d8_virtualization_virtualmachine_filesystem_used_bytes{type!="cloudinit"}) + / + sum by (name, namespace, mount_point) + (d8_virtualization_virtualmachine_filesystem_capacity_bytes) + ) * 100 > 95 + for: 5m + labels: + severity_level: "6" + tier: application + annotations: + plk_protocol_version: "1" + plk_markup_format: "markdown" + plk_create_group_if_not_exists__d8_virtualization_virtualmachine_health: "D8VirtualizationVirtualmachineHealth,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes" + plk_grouped_by__d8_virtualization_virtualmachine_health: "D8VirtualizationVirtualmachineHealth,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes" + summary: Filesystem in VirtualMachine is almost full. + description: | + The filesystem mounted at `{{ $labels.mount_point }}` in the VirtualMachine `{{ $labels.name }}` (namespace `{{ $labels.namespace }}`) has reached more than 95% of its capacity. + + This may indicate: + - Disk space exhaustion inside the guest OS. + - Incorrect disk sizing during VM provisioning. + + Consider increasing disk size if workload requires it. 
+ + - alert: D8VirtualizationVirtualMachineHighSwapUsage + expr: | + ( + sum by (name, namespace)( + rate(d8_virtualization_virtualmachine_os_memory_swap_in_traffic_bytes[5m]) + ) + + + sum by (name, namespace)( + rate(d8_virtualization_virtualmachine_os_memory_swap_out_traffic_bytes[5m]) + ) + ) > 10 * 1024 * 1024 # 10 MB/s + for: 5m + labels: + severity_level: "6" + tier: application + annotations: + plk_protocol_version: "1" + plk_markup_format: "markdown" + plk_create_group_if_not_exists__d8_virtualization_virtualmachine_health: "D8VirtualizationVirtualmachineHealth,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes" + plk_grouped_by__d8_virtualization_virtualmachine_health: "D8VirtualizationVirtualmachineHealth,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes" + summary: VirtualMachine has high swap usage. + description: | + The VirtualMachine `{{ $labels.name }}` in the namespace `{{ $labels.namespace }}` is actively using swap memory, with more than 10 MB/s of data being swapped in or out over the last 5 minutes. + + This may indicate: + - Memory exhaustion inside the VM. + - Application misbehaviour or memory leak. + - Insufficient memory allocation. + - Performance degradation due to disk-based memory swapping. + + ### Recommended Actions + + - Consider increasing memory `.spec.memory.size` if workload requires it. + - Investigate and optimize memory usage in the guest OS. + - Disable swap (if acceptable) to prevent performance degradation. 
+ + - alert: D8VirtualizationVirtualMachineHighDiskLatency + expr: | + ( + sum by (namespace, name, block_device_name) + ( + rate(d8_virtualization_virtualmachine_block_device_read_times_seconds_total[1m]) + + + rate(d8_virtualization_virtualmachine_block_device_write_times_seconds_total[1m]) + ) + / + sum by (namespace, name, block_device_name) + ( + rate(d8_virtualization_virtualmachine_block_device_iops_read_total[1m]) + + + rate(d8_virtualization_virtualmachine_block_device_iops_write_total[1m]) + ) + ) * 1000 > 20 + for: 5m + labels: + severity_level: "6" + tier: application + annotations: + plk_protocol_version: "1" + plk_markup_format: "markdown" + plk_create_group_if_not_exists__d8_virtualization_virtualmachine_health: "D8VirtualizationVirtualmachineHealth,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes" + plk_grouped_by__d8_virtualization_virtualmachine_health: "D8VirtualizationVirtualmachineHealth,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes" + summary: VirtualMachine has high disk I/O latency (>20ms average) + description: | + The VirtualMachine `{{ $labels.name }}` in the namespace `{{ $labels.namespace }}` is experiencing **disk I/O latency higher than 20ms** on device `{{ $labels.block_device_name }}`. + + This may indicate: + - Storage backend performance degradation. + - Disk saturation or contention. + - Overloaded node or storage system. + + ### Recommended Actions + + - Investigate storage backend. + - Consider moving disk to a faster storage class. + - Monitor other VMs on the same node/storage for similar issues. 
+ + - alert: D8VirtualizationVirtualMachineMigrationTooLongDueToDirtyMemory + expr: d8_virtualization_virtualmachine_migration_dirty_memory_rate_bytes > 0 + for: 30m + labels: + severity_level: "6" + tier: application + annotations: + plk_protocol_version: "1" + plk_markup_format: "markdown" + plk_create_group_if_not_exists__d8_virtualization_virtualmachine_health: "D8VirtualizationVirtualmachineHealth,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes" + plk_grouped_by__d8_virtualization_virtualmachine_health: "D8VirtualizationVirtualmachineHealth,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes" + summary: VirtualMachine migration is taking too long due to continuous memory changes. + description: | + The VirtualMachine `{{ $labels.name }}` in the namespace `{{ $labels.namespace }}` has been in migration state for more than 30 minutes, and the dirty memory rate (`migration_dirty_memory_rate_bytes`) remains above zero. This means that memory inside the VM is still actively changing. + + This may indicate: + - Applications inside the VM are continuously modifying memory. + - Migration cannot complete the final sync stage. + + ### Recommended Actions + + - Temporarily suspend or reduce the workload on the VM to lower the rate at which its memory changes. + - Restart the migration manually with force if it gets stuck: + + 1. Remove the VMOP resource responsible for migration: + ```bash + d8 k -n {{ $labels.namespace }} delete vmop + ``` + + 2. Force the migration of the VM, if possible: + ```bash + d8 v -n {{ $labels.namespace }} evict {{ $labels.name }} --force + ``` diff --git a/monitoring/prometheus-rules/vm.state.yaml b/monitoring/prometheus-rules/vm.state.yaml index 4bbc6e9813..96e217f915 100644 --- a/monitoring/prometheus-rules/vm.state.yaml +++ b/monitoring/prometheus-rules/vm.state.yaml @@ -37,3 +37,178 @@ ``` > Simpler, but causes downtime unless guest OS supports ACPI shutdown/restart. 3.
After migration or reboot, the VM will use the updated firmware automatically. + + - alert: D8VirtualizationVirtualMachineStuckInPendingPhase + expr: d8_virtualization_virtualmachine_status_phase{phase="Pending"} == 1 + labels: + severity_level: "9" + tier: application + for: 60m + annotations: + plk_protocol_version: "1" + plk_markup_format: "markdown" + plk_create_group_if_not_exists__d8_virtualization_virtualmachine_state: "D8VirtualizationVirtualMachineState,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes" + plk_grouped_by__d8_virtualization_virtualmachine_state: "D8VirtualizationVirtualMachineState,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes" + summary: VirtualMachine is stuck in the `Pending` phase for a long time. + description: | + The virtual machine `{{ $labels.name }}` in the namespace `{{ $labels.namespace }}` has been stuck in the `Pending` phase for more than 60 minutes. + + ### Common Causes + + - Missing VirtualMachineClass + - Not ready or invalid VirtualDisks + - VirtualDisks already attached to another virtual machine + - Missing secret (for sysprep, cloud-init, etc.) + + ### Recommended Actions + + 1. Check virtual machine status: + ```bash + d8 k -n {{ $labels.namespace }} get vm {{ $labels.name }} -o jsonpath="{.status}" | jq + ``` + + 2. Inspect `*.Ready` conditions for details: + ```bash + d8 k -n {{ $labels.namespace }} get vm {{ $labels.name }} -o jsonpath="{.status.conditions}" | jq + ``` + + 3. Check related events: + ```bash + d8 k -n {{ $labels.namespace }} get events --field-selector involvedObject.name={{ $labels.name }} + ``` + + 4. Check if used disks are available or occupied by other VMs: + ```bash + d8 k -n {{ $labels.namespace }} get vd + ``` + + 5. Check secrets used by the VM: + ```bash + d8 k -n {{ $labels.namespace }} get secrets + d8 k -n {{ $labels.namespace }} describe secret + ``` + + 6.
Check cluster resource usage and quotas: + ```bash + d8 k -n {{ $labels.namespace }} get resourcequotas + ``` + + - alert: D8VirtualizationVirtualMachineStuckInStartingPhase + expr: d8_virtualization_virtualmachine_status_phase{phase="Starting"} == 1 + labels: + severity_level: "9" + tier: application + for: 60m + annotations: + plk_protocol_version: "1" + plk_markup_format: "markdown" + plk_create_group_if_not_exists__d8_virtualization_virtualmachine_state: "D8VirtualizationVirtualMachineState,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes" + plk_grouped_by__d8_virtualization_virtualmachine_state: "D8VirtualizationVirtualMachineState,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes" + summary: VirtualMachine is stuck in the `Starting` phase for a long time. + description: | + The virtual machine `{{ $labels.name }}` in the namespace `{{ $labels.namespace }}` has been stuck in the `Starting` phase for more than 60 minutes. + + This may indicate one or more of the following issues: + + - Scheduling issues on the node + - Cluster resource shortage (CPU, memory) + - Exhausted quotas (e.g., pods, CPU, memory limits) + - Node selector or taint/toleration mismatch + + ### Recommended Actions + + 1. Check the full status of the virtual machine: + ```bash + d8 k -n {{ $labels.namespace }} get vm {{ $labels.name }} -o jsonpath="{.status}" | jq + ``` + + 2. Inspect the `Running` condition for details: + ```bash + d8 k -n {{ $labels.namespace }} get vm {{ $labels.name }} -o jsonpath="{.status.conditions[?(@.type=='Running')]}" | jq + ``` + + 3. Review events related to this virtual machine: + ```bash + d8 k -n {{ $labels.namespace }} get events --field-selector involvedObject.name={{ $labels.name }} + ``` + + 4. 
Check logs of the associated pod (if running): + ```bash + POD_NAME=$(d8 k -n {{ $labels.namespace }} get vm {{ $labels.name }} -o jsonpath="{.status.virtualMachinePods[?(@.active==true)].name}") + d8 k -n {{ $labels.namespace }} logs $POD_NAME + ``` + + 5. Check cluster resource usage and quotas: + ```bash + d8 k get -n {{ $labels.namespace }} resourcequotas + ``` + + - alert: D8VirtualizationVirtualMachineStuckInStoppingPhase + expr: d8_virtualization_virtualmachine_status_phase{phase="Stopping"} == 1 + labels: + severity_level: "9" + tier: application + for: 60m + annotations: + plk_protocol_version: "1" + plk_markup_format: "markdown" + plk_create_group_if_not_exists__d8_virtualization_virtualmachine_state: "D8VirtualizationVirtualMachineState,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes" + plk_grouped_by__d8_virtualization_virtualmachine_state: "D8VirtualizationVirtualMachineState,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes" + summary: VirtualMachine is stuck in the `Stopping` phase for a long time. + description: | + The virtual machine `{{ $labels.name }}` in the namespace `{{ $labels.namespace }}` has been stuck in the `Stopping` phase for more than 60 minutes. + + This may indicate issues with graceful shutdown, hanging processes inside the VM, or problems with the underlying infrastructure or controller. + + ### Recommended Actions + + 1. **Check the status of the virtual machine**: + ```bash + d8 k -n {{ $labels.namespace }} get vm {{ $labels.name }} -o jsonpath="{.status}" | jq + ``` + + 2. **Review events related to the virtual machine**: + ```bash + d8 k -n {{ $labels.namespace }} get events --field-selector involvedObject.name={{ $labels.name }} + ``` + + 3. 
**Inspect logs of the virtual machine's pod (if it's still running)**: + ```bash + POD_NAME=$(d8 k -n {{ $labels.namespace }} get vm {{ $labels.name }} -o jsonpath="{.status.virtualMachinePods[?(@.active==true)].name}") + d8 k -n {{ $labels.namespace }} logs $POD_NAME + ``` + + # - alert: D8VirtualizationVirtualMachineIsTakingTooLongToMigrate + # expr: d8_virtualization_virtualmachine_status_phase{phase="Migrating"} == 1 + # labels: + # severity_level: "6" + # tier: application + # for: 30m + # annotations: + # plk_protocol_version: "1" + # plk_markup_format: "markdown" + # plk_create_group_if_not_exists__d8_virtualization_virtualmachine_state: "D8VirtualizationVirtualMachineState,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes" + # plk_grouped_by__d8_virtualization_virtualmachine_state: "D8VirtualizationVirtualMachineState,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes" + # summary: Virtual machine is taking too long to migrate. + # description: | + # The virtual machine `{{ $labels.name }}` in the namespace `{{ $labels.namespace }}` has been stuck in the _Migrating_ phase for more than 30 minutes. + + # This may indicate an issue with the migration process such as network problems, resource constraints, or errors in the virtual machine's configuration or state. + + # ### Recommended Actions + + # 1. **Check the logs of the virtual machine's active pod**: + # ```bash + # POD_NAME=$(d8 k -n {{ $labels.namespace }} get vm {{ $labels.name }} -o jsonpath="{.status.virtualMachinePods[?(@.active==true)].name}") + # d8 k -n {{ $labels.namespace }} logs $POD_NAME + # ``` + + # 2. **Inspect the virtual machine status**: + # ```bash + # d8 k -n {{ $labels.namespace }} get vm {{ $labels.name }} -o jsonpath="{.status}" | jq + # ``` + + # 3. 
**Check related events in the namespace**: + # ```bash + # d8 k -n {{ $labels.namespace }} get events --field-selector involvedObject.name={{ $labels.name }} + # ``` diff --git a/monitoring/prometheus-rules/vmop.state.yaml b/monitoring/prometheus-rules/vmop.state.yaml new file mode 100644 index 0000000000..250da4f04d --- /dev/null +++ b/monitoring/prometheus-rules/vmop.state.yaml @@ -0,0 +1,46 @@ +- name: virtualization.vmop.state + rules: + - alert: D8VirtualizationVirtualMachineOperationStuckInProgressPhase + expr: d8_virtualization_virtualmachineoperation_status_phase{phase="InProgress"} == 1 + labels: + severity_level: "9" + tier: application + for: 60m + annotations: + plk_protocol_version: "1" + plk_markup_format: "markdown" + plk_create_group_if_not_exists__d8_virtualization_vmop_state: "D8VirtualizationVirtualMachineOperationState,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes" + plk_grouped_by__d8_virtualization_vmop_state: "D8VirtualizationVirtualMachineOperationState,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes" + summary: VirtualMachineOperation is stuck in the `InProgress` phase for a long time. + description: | + The `VirtualMachineOperation` object `{{ $labels.name }}` in namespace `{{ $labels.namespace }}` has been stuck in the `InProgress` phase for more than 60 minutes. + + This may indicate that the operation (e.g., restart, evict, stop, start) was not completed successfully and is now stalled. + + ### Possible Causes + + - The underlying virtual machine is unreachable or in an inconsistent state. + - Node issues (e.g., network problems, node downtime). + + ### Diagnosis + + 1. Get details of the affected VirtualMachineOperation: + ```bash + d8 k -n {{ $labels.namespace }} get vmop {{ $labels.name }} -o wide + ``` + + 2.
Check related VM status: + ```bash + d8 k -n {{ $labels.namespace }} get vm -o jsonpath="{.status}" | jq + ``` + + ### Recommended Actions + + If the operation can be safely retried, delete the `VirtualMachineOperation` object: + ```bash + d8 k -n {{ $labels.namespace }} delete vmop {{ $labels.name }} + ``` + Then re-initiate the required action (e.g., restart or evict): + ```bash + d8 v + ```
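Review note on the `D8VirtualizationVirtualMachineHighCPULoad` expression in `vm.compute.yaml` above: the arithmetic can be checked by hand with a small sketch like the one below. It only mirrors the shape of the expression (CPU milliseconds per second, divided by cores, scaled by the inverse of the core fraction). The function name `cpu_load_percent` and the sample inputs are hypothetical, and it assumes the `cpu_core_fraction` metric reports the same percentage as `.spec.cpu.coreFraction`.

```bash
#!/bin/sh
# Worked arithmetic for the CPU load expression (illustrative values only).
# Arguments are hypothetical sample values, not real metrics:
#   $1  CPU time consumed per second, in milliseconds (the rate() result)
#   $2  number of cores
#   $3  core fraction, in percent (assumed to match .spec.cpu.coreFraction)
cpu_load_percent() {
  awk -v ms="$1" -v cores="$2" -v frac="$3" 'BEGIN {
    used_cores = ms / 1000                      # ms per second -> cores in use
    per_core   = used_cores / cores             # average load per core
    scaled     = per_core * (1 / (frac / 100))  # relative to the core fraction
    printf "%.2f\n", scaled * 100               # percentage compared against 85
  }'
}

cpu_load_percent 900 2 50   # prints 90.00, above the 85 threshold
cpu_load_percent 400 2 50   # prints 40.00, below the threshold
```

With a 50% core fraction, a two-core VM that consumes 900 ms of CPU time per second lands at 90%, so the alert would start its 5-minute `for` window.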