Commit 01c70a3

vuvkar and claude authored
[MOPU-312] Add related links to monitor templates (DataDog#23245)
* [MOPU-288] Add related links to kubernetes monitor templates

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* [MOPU-288] Add related links to nginx monitor templates

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* [MOPU-288] Add related links to postgres and redis monitor templates

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix broken Infrastructure links in monitor templates

  The /infrastructure?filters=... links pointed to a non-existent path with an unsupported query param, and used template variables not in each monitor's group-by.

  - nginx (4xx, 5xx, upstream_peer_fails): remove (upstream is not a host/pod/container resource)
  - k8s deployments_replicas, statefulset_replicas, pods_failed_state: remove (no host/pod template var in group-by)
  - k8s node_unavailable: replace with Hosts page scoped to kube_cluster_name
  - k8s pod_crashloopbackoff, pod_imagepullbackoff, pod_oomkilled, pods_restarting: replace with Pod Explorer scoped to pod_name

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
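The last fix in the message above hinges on one rule: a link may only use template variables that the monitor actually groups by. A minimal sketch of such a check follows, assuming the group-by tag keys are supplied by the caller. The function name `undeclared_template_vars` and the sample group-by set are hypothetical and not part of this commit.

```python
import re

# Matches template variables of the form {{tag_key.name}}, as used in the
# monitor messages in this commit. Block tags like {{#is_alert}} do not match.
TEMPLATE_VAR = re.compile(r"\{\{\s*([A-Za-z_]\w*)\.name\s*\}\}")


def undeclared_template_vars(message: str, group_by: set[str]) -> set[str]:
    """Return tag keys used as {{key.name}} in message but missing from group_by."""
    used = set(TEMPLATE_VAR.findall(message))
    return used - group_by


# Link lines taken from the pods_restarting template in this commit.
message = (
    "{{#is_alert}}\n"
    "- [Logs](/logs?query=kube_cluster_name:{{kube_cluster_name.name}}"
    "+pod_name:{{pod_name.name}})\n"
    "- [Pod Explorer](/orchestration/explorer/pod?query={{pod_name.name}})\n"
    "{{/is_alert}}"
)

# Hypothetical group-by, chosen only to show a variable being flagged.
group_by = {"kube_cluster_name"}

print(sorted(undeclared_template_vars(message, group_by)))
```

With the hypothetical group-by above, the check flags `pod_name`. The real group-by of each monitor lives in its query, which this diff does not show.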
1 parent b5362c7 commit 01c70a3

14 files changed

Lines changed: 28 additions & 28 deletions

kubernetes/assets/monitors/monitor_deployments_replicas.json

Lines changed: 2 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -1,14 +1,14 @@
11
{
22
"version": 2,
33
"created_at": "2020-07-28",
4-
"last_updated_at": "2025-06-12",
4+
"last_updated_at": "2026-04-09",
55
"title": "Kubernetes Deployment Replicas are failing",
66
"tags": [
77
"integration:kubernetes"
88
],
99
"description": "Kubernetes replicas are clones that facilitate self-healing for pods. Each pod has a desired number of replica Pods that should be running at any given time. This monitor tracks the number of replicas that are failing per deployment.",
1010
"definition": {
11-
"message": "{{#is_alert}}\n\n## What's happening?\nThere are at least 2 or more missing replicas for Deployment {{kube_namespace.name}}/{{kube_deployment.name}} over the last 15 minutes.\n\n{{/is_alert}}",
11+
"message": "{{#is_alert}}\n\n## What's happening?\nThere are at least 2 or more missing replicas for Deployment {{kube_namespace.name}}/{{kube_deployment.name}} over the last 15 minutes.\n\n## Related Links\n\n- [Logs](/logs?query=kube_cluster_name:{{kube_cluster_name.name}}+kube_deployment:{{kube_deployment.name}}+kube_namespace:{{kube_namespace.name}})\n- [Metrics Explorer (kubernetes_state.deployment.replicas_desired)](/metric/explorer?exp_metric=kubernetes_state.deployment.replicas_desired&exp_scope=kube_cluster_name:{{kube_cluster_name.name}},kube_deployment:{{kube_deployment.name}},kube_namespace:{{kube_namespace.name}}&exp_agg=avg&exp_type=line)\n- [Metrics Explorer (kubernetes_state.deployment.replicas_available)](/metric/explorer?exp_metric=kubernetes_state.deployment.replicas_available&exp_scope=kube_cluster_name:{{kube_cluster_name.name}},kube_deployment:{{kube_deployment.name}},kube_namespace:{{kube_namespace.name}}&exp_agg=avg&exp_type=line)\n\n{{/is_alert}}",
1212
"name": "[Kubernetes] Monitor Kubernetes Deployments Replica Pods",
1313
"options": {
1414
"escalation_message": "",

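The Related Links section added above follows a regular pattern: the Logs query joins `tag:{{tag.name}}` filters with `+`, and the Metrics Explorer `exp_scope` joins the same filters with `,`. A minimal sketch of that pattern, assuming a hand-written helper rather than whatever tooling produced the commit; all function names here are hypothetical.

```python
def tag_filter(tag: str) -> str:
    """Build a filter such as kube_namespace:{{kube_namespace.name}}."""
    return tag + ":{{" + tag + ".name}}"


def logs_link(tags: list[str]) -> str:
    """Logs link whose query joins the tag filters with '+'."""
    query = "+".join(tag_filter(t) for t in tags)
    return f"[Logs](/logs?query={query})"


def metrics_explorer_link(metric: str, tags: list[str]) -> str:
    """Metrics Explorer link scoped to the tag filters, joined with ','."""
    scope = ",".join(tag_filter(t) for t in tags)
    return (
        f"[Metrics Explorer ({metric})]"
        f"(/metric/explorer?exp_metric={metric}"
        f"&exp_scope={scope}&exp_agg=avg&exp_type=line)"
    )


def related_links(metrics: list[str], tags: list[str]) -> str:
    """Assemble the Markdown section in the layout used by the templates."""
    lines = [logs_link(tags)] + [metrics_explorer_link(m, tags) for m in metrics]
    return "## Related Links\n\n" + "\n".join(f"- {line}" for line in lines)


tags = ["kube_cluster_name", "kube_deployment", "kube_namespace"]
metrics = [
    "kubernetes_state.deployment.replicas_desired",
    "kubernetes_state.deployment.replicas_available",
]
print(related_links(metrics, tags))
```

For the tags and metrics shown, the output matches the section added to `monitor_deployments_replicas.json`, apart from the JSON escaping of newlines.
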
kubernetes/assets/monitors/monitor_node_unavailable.json

Lines changed: 2 additions & 2 deletions
@@ -1,14 +1,14 @@
11
{
22
"version": 2,
33
"created_at": "2020-07-28",
4-
"last_updated_at": "2025-06-12",
4+
"last_updated_at": "2026-04-09",
55
"title": "Nodes are unavailable",
66
"tags": [
77
"integration:kubernetes"
88
],
99
"description": "Kubernetes nodes can either be schedulable or unschedulable. When unschedulable, the node prevents the scheduler from placing new pods onto that node. This monitor tracks the percentage of schedulable nodes.",
1010
"definition": {
11-
"message": "{{#is_alert}}\n\n## What's happening?\nThe percentage of schedulable nodes is below 80% for status:schedulable on ({{kube_cluster_name.name}} cluster over the last 15 minutes.\n\n{{/is_alert}}\n\n Keep in mind that this might be expected based on your infrastructure.",
11+
"message": "{{#is_alert}}\n\n## What's happening?\nThe percentage of schedulable nodes is below 80% for status:schedulable on ({{kube_cluster_name.name}} cluster over the last 15 minutes.\n\n## Related Links\n\n- [Logs](/logs?query=kube_cluster_name:{{kube_cluster_name.name}}+status:schedulable)\n- [Hosts](/infrastructure/hosts?scope=kube_cluster_name:{{kube_cluster_name.name}})\n- [Metrics Explorer (kubernetes_state.node.status)](/metric/explorer?exp_metric=kubernetes_state.node.status&exp_scope=kube_cluster_name:{{kube_cluster_name.name}},status:schedulable&exp_agg=avg&exp_type=line)\n\n{{/is_alert}}\n\n Keep in mind that this might be expected based on your infrastructure.",
1212
"name": "[Kubernetes] Monitor Unschedulable Kubernetes Nodes",
1313
"options": {
1414
"escalation_message": "",

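To eyeball a link such as the Hosts link above before shipping a template, one option is to substitute sample values for the `{{tag.name}}` variables locally. The sketch below only approximates the substitution; the real rendering is done by Datadog and is not shown in this diff. The `preview` helper and the sample cluster name `prod-east` are hypothetical.

```python
import re

# Matches {{tag_key.name}} template variables.
TEMPLATE_VAR = re.compile(r"\{\{([A-Za-z_]\w*)\.name\}\}")


def preview(template: str, sample_tags: dict[str, str]) -> str:
    """Replace each {{key.name}} with a sample value; leave unknown keys as-is."""

    def substitute(match: re.Match) -> str:
        key = match.group(1)
        return sample_tags.get(key, match.group(0))

    return TEMPLATE_VAR.sub(substitute, template)


hosts_link = (
    "[Hosts](/infrastructure/hosts?scope="
    "kube_cluster_name:{{kube_cluster_name.name}})"
)
print(preview(hosts_link, {"kube_cluster_name": "prod-east"}))
```

A variable left unreplaced in the preview is a hint that the sample tags, and possibly the monitor's group-by, do not cover it.
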
kubernetes/assets/monitors/monitor_pod_crashloopbackoff.json

Lines changed: 2 additions & 2 deletions
@@ -1,14 +1,14 @@
11
{
22
"version": 2,
33
"created_at": "2020-07-28",
4-
"last_updated_at": "2025-06-12",
4+
"last_updated_at": "2026-04-09",
55
"title": "Pod is in a CrashloopBackOff state",
66
"tags": [
77
"integration:kubernetes"
88
],
99
"description": "The status CrashloopBackOff means that a container in the Pod is started, crashes, and is restarted, over and over again. This monitor tracks when a pod is in a CrashloopBackOff state for your Kubernetes integration.",
1010
"definition": {
11-
"message": "{{#is_alert}}\n\n## What's happening?\nAt least one container in pod {{pod_name.name}} on {{kube_namespace.name}} is in a waiting state due to reason crashloopbackoff in the last 10 minutes.\n\n{{/is_alert}}\n\n This alert could generate several alerts for a bad deployment. Adjust the thresholds of the query to suit your infrastructure.",
11+
"message": "{{#is_alert}}\n\n## What's happening?\nAt least one container in pod {{pod_name.name}} on {{kube_namespace.name}} is in a waiting state due to reason crashloopbackoff in the last 10 minutes.\n\n## Related Links\n\n- [Logs](/logs?query=kube_cluster_name:{{kube_cluster_name.name}}+kube_namespace:{{kube_namespace.name}}+pod_name:{{pod_name.name}}+reason:crashloopbackoff)\n- [Pod Explorer](/orchestration/explorer/pod?query={{pod_name.name}})\n- [Metrics Explorer (kubernetes_state.container.status_report.count.waiting)](/metric/explorer?exp_metric=kubernetes_state.container.status_report.count.waiting&exp_scope=kube_cluster_name:{{kube_cluster_name.name}},kube_namespace:{{kube_namespace.name}},pod_name:{{pod_name.name}},reason:crashloopbackoff&exp_agg=avg&exp_type=line)\n\n{{/is_alert}}\n\n This alert could generate several alerts for a bad deployment. Adjust the thresholds of the query to suit your infrastructure.",
1212
"name": "[Kubernetes] Pod {{pod_name.name}} is CrashloopBackOff on namespace {{kube_namespace.name}}",
1313
"options": {
1414
"escalation_message": "",

kubernetes/assets/monitors/monitor_pod_imagepullbackoff.json

Lines changed: 2 additions & 2 deletions
@@ -1,14 +1,14 @@
11
{
22
"version": 2,
33
"created_at": "2020-09-15",
4-
"last_updated_at": "2025-06-12",
4+
"last_updated_at": "2026-04-09",
55
"title": "Pod is in an ImagePullBackOff state",
66
"tags": [
77
"integration:kubernetes"
88
],
99
"description": "The status ImagePullBackOff means that a container could not start because Kubernetes could not pull a container image. This monitor tracks when a pod is in an ImagePullBackOff state for your Kubernetes integration.",
1010
"definition": {
11-
"message": "{{#is_alert}}\n\n## What's happening?\nAt least one container in pod {{pod_name.name}} on namespace {{kube_namespace.name}} is in a waiting state due to an ImagePullBackOff error in the last 10 minutes.\n\n{{/is_alert}}\n\n This could happen for several reasons, for example a bad image path or tag or if the credentials for pulling images are not configured properly.",
11+
"message": "{{#is_alert}}\n\n## What's happening?\nAt least one container in pod {{pod_name.name}} on namespace {{kube_namespace.name}} is in a waiting state due to an ImagePullBackOff error in the last 10 minutes.\n\n## Related Links\n\n- [Logs](/logs?query=kube_cluster_name:{{kube_cluster_name.name}}+kube_namespace:{{kube_namespace.name}}+pod_name:{{pod_name.name}}+reason:imagepullbackoff)\n- [Pod Explorer](/orchestration/explorer/pod?query={{pod_name.name}})\n- [Metrics Explorer (kubernetes_state.container.status_report.count.waiting)](/metric/explorer?exp_metric=kubernetes_state.container.status_report.count.waiting&exp_scope=kube_cluster_name:{{kube_cluster_name.name}},kube_namespace:{{kube_namespace.name}},pod_name:{{pod_name.name}},reason:imagepullbackoff&exp_agg=avg&exp_type=line)\n\n{{/is_alert}}\n\n This could happen for several reasons, for example a bad image path or tag or if the credentials for pulling images are not configured properly.",
1212
"name": "[Kubernetes] Pod {{pod_name.name}} is ImagePullBackOff on namespace {{kube_namespace.name}}",
1313
"options": {
1414
"escalation_message": "",

kubernetes/assets/monitors/monitor_pod_oomkilled.json

Lines changed: 2 additions & 2 deletions
@@ -1,14 +1,14 @@
11
{
22
"version": 2,
33
"created_at": "2025-09-15",
4-
"last_updated_at": "2025-09-15",
4+
"last_updated_at": "2026-04-09",
55
"title": "Pod is in an OOMKilled state",
66
"tags": [
77
"integration:kubernetes"
88
],
99
"description": "The status OOMKilled means that a container was killed because it exceeded memory limits or the node ran out of available memory. This monitor tracks when a pod is in an OOMKilled state for your Kubernetes integration.",
1010
"definition": {
11-
"message": "{{#is_alert}}\n\n## What's happening?\nThere has been at least one container terminated in pod {{pod_name.name}} on namespace {{kube_namespace.name}} with reason oomkilled in the last 10 minutes.\n\n{{/is_alert}}\n\n This could happen for several reasons, for example insufficient memory limits, memory leaks in the application, or the node running out of available memory.",
11+
"message": "{{#is_alert}}\n\n## What's happening?\nThere has been at least one container terminated in pod {{pod_name.name}} on namespace {{kube_namespace.name}} with reason oomkilled in the last 10 minutes.\n\n## Related Links\n\n- [Logs](/logs?query=kube_cluster_name:{{kube_cluster_name.name}}+kube_namespace:{{kube_namespace.name}}+pod_name:{{pod_name.name}}+reason:oomkilled)\n- [Pod Explorer](/orchestration/explorer/pod?query={{pod_name.name}})\n- [Metrics Explorer (kubernetes.containers.state.terminated)](/metric/explorer?exp_metric=kubernetes.containers.state.terminated&exp_scope=kube_cluster_name:{{kube_cluster_name.name}},kube_namespace:{{kube_namespace.name}},pod_name:{{pod_name.name}},reason:oomkilled&exp_agg=avg&exp_type=line)\n\n{{/is_alert}}\n\n This could happen for several reasons, for example insufficient memory limits, memory leaks in the application, or the node running out of available memory.",
1212
"name": "[Kubernetes] Pod {{pod_name.name}} is OOMKilled on namespace {{kube_namespace.name}}",
1313
"options": {
1414
"escalation_message": "",

kubernetes/assets/monitors/monitor_pods_failed_state.json

Lines changed: 2 additions & 2 deletions
@@ -1,14 +1,14 @@
11
{
22
"version": 2,
33
"created_at": "2020-07-28",
4-
"last_updated_at": "2025-06-12",
4+
"last_updated_at": "2026-04-09",
55
"title": "Pods are failing",
66
"tags": [
77
"integration:kubernetes"
88
],
99
"description": "When a pod is failing it means the container either exited with non-zero status or was terminated by the system. This monitor tracks when more than 10 pods are failing for a given Kubernetes cluster.",
1010
"definition": {
11-
"message": "{{#is_alert}}\n\n## What's happening?\nThe number of failed pods has increased by more than 10 in ({{kube_cluster_name.name}} cluster in the last 5 minutes.\n\n{{/is_alert}}\n\n The threshold of ten pods varies depending on your infrastructure. Change the threshold to suit your needs.",
11+
"message": "{{#is_alert}}\n\n## What's happening?\nThe number of failed pods has increased by more than 10 in ({{kube_cluster_name.name}} cluster in the last 5 minutes.\n\n## Related Links\n\n- [Logs](/logs?query=kube_cluster_name:{{kube_cluster_name.name}}+kube_namespace:{{kube_namespace.name}}+pod_phase:failed)\n- [Metrics Explorer (kubernetes_state.pod.status_phase)](/metric/explorer?exp_metric=kubernetes_state.pod.status_phase&exp_scope=kube_cluster_name:{{kube_cluster_name.name}},kube_namespace:{{kube_namespace.name}},pod_phase:failed&exp_agg=avg&exp_type=line)\n\n{{/is_alert}}\n\n The threshold of ten pods varies depending on your infrastructure. Change the threshold to suit your needs.",
1212
"name": "[Kubernetes] Monitor Kubernetes Failed Pods in Namespaces",
1313
"options": {
1414
"escalation_message": "",

kubernetes/assets/monitors/monitor_pods_restarting.json

Lines changed: 2 additions & 2 deletions
@@ -1,14 +1,14 @@
11
{
22
"version": 2,
33
"created_at": "2020-07-28",
4-
"last_updated_at": "2025-06-12",
4+
"last_updated_at": "2026-04-09",
55
"title": "Pods are restarting",
66
"tags": [
77
"integration:kubernetes"
88
],
99
"description": "Kubernetes pods restart according to the restart policy. A restarting container can indicate problems with memory, CPU usage, or an application exiting prematurely. This monitor tracks when pods are restarting multiple times.",
1010
"definition": {
11-
"message": "{{#is_alert}}\n\n## What's happening?\nThere has been an increase of more than 5 container restarts in the pod {{pod_name.name}} in the last 5 minutes.\n\n{{/is_alert}}",
11+
"message": "{{#is_alert}}\n\n## What's happening?\nThere has been an increase of more than 5 container restarts in the pod {{pod_name.name}} in the last 5 minutes.\n\n## Related Links\n\n- [Logs](/logs?query=kube_cluster_name:{{kube_cluster_name.name}}+pod_name:{{pod_name.name}})\n- [Pod Explorer](/orchestration/explorer/pod?query={{pod_name.name}})\n- [Metrics Explorer (kubernetes.containers.restarts)](/metric/explorer?exp_metric=kubernetes.containers.restarts&exp_scope=kube_cluster_name:{{kube_cluster_name.name}},pod_name:{{pod_name.name}}&exp_agg=avg&exp_type=line)\n\n{{/is_alert}}",
1212
"name": "[Kubernetes] Monitor Kubernetes Pods Restarting",
1313
"options": {
1414
"escalation_message": "",

kubernetes/assets/monitors/monitor_statefulset_replicas.json

Lines changed: 2 additions & 2 deletions
@@ -1,14 +1,14 @@
11
{
22
"version": 2,
33
"created_at": "2020-07-28",
4-
"last_updated_at": "2025-06-12",
4+
"last_updated_at": "2026-04-09",
55
"title": "Kubernetes Statefulset Replicas are failing",
66
"tags": [
77
"integration:kubernetes"
88
],
99
"description": "Kubernetes replicas are clones that facilitate self-healing for pods. Each pod has a desired number of replica Pods that should be running at any given time. This monitor tracks when the number of replicas per statefulset is falling.",
1010
"definition": {
11-
"message": "{{#is_alert}}\n\n## What's happening?\nThere are at least 2 desired replicas that are not ready for {{kube_namespace.name}}/{{kube_stateful_set.name}} StatefulSet over the last 15 minutes.\n\n{{/is_alert}}\n\n This might present an unsafe situation for any further manual operations, such as killing other pods.",
11+
"message": "{{#is_alert}}\n\n## What's happening?\nThere are at least 2 desired replicas that are not ready for {{kube_namespace.name}}/{{kube_stateful_set.name}} StatefulSet over the last 15 minutes.\n\n## Related Links\n\n- [Logs](/logs?query=kube_cluster_name:{{kube_cluster_name.name}}+kube_namespace:{{kube_namespace.name}}+kube_stateful_set:{{kube_stateful_set.name}})\n- [Metrics Explorer (kubernetes_state.statefulset.replicas_desired)](/metric/explorer?exp_metric=kubernetes_state.statefulset.replicas_desired&exp_scope=kube_cluster_name:{{kube_cluster_name.name}},kube_namespace:{{kube_namespace.name}},kube_stateful_set:{{kube_stateful_set.name}}&exp_agg=avg&exp_type=line)\n- [Metrics Explorer (kubernetes_state.statefulset.replicas_ready)](/metric/explorer?exp_metric=kubernetes_state.statefulset.replicas_ready&exp_scope=kube_cluster_name:{{kube_cluster_name.name}},kube_namespace:{{kube_namespace.name}},kube_stateful_set:{{kube_stateful_set.name}}&exp_agg=avg&exp_type=line)\n\n{{/is_alert}}\n\n This might present an unsafe situation for any further manual operations, such as killing other pods.",
1212
"name": "[Kubernetes] Monitor Kubernetes Statefulset Replicas",
1313
"options": {
1414
"escalation_message": "",

nginx/assets/monitors/4xx.json

Lines changed: 2 additions & 2 deletions
@@ -1,14 +1,14 @@
11
{
22
"version": 2,
33
"created_at": "2020-09-16",
4-
"last_updated_at": "2026-03-09",
4+
"last_updated_at": "2026-04-09",
55
"title": "Upstream 4xx errors are high",
66
"tags": [
77
"integration:nginx"
88
],
99
"description": "NGINX sends requests to upstream peers that can fail eventually. This monitor tracks the count of 4xx HTTP responses to identify issues in the communication between NGINX and the backend servers.",
1010
"definition": {
11-
"message": "{{#is_alert}}\n## 🚨 What's happening\n\nAn anomaly has been detected in the number of 4xx HTTP responses from NGINX upstream **{{upstream.name}}** (anomaly score: `{{value}}`, threshold: `{{threshold}}`). The 4xx response rate is significantly higher than normal, indicating that a notable portion of incoming requests are being rejected with client-side error codes.\n\nFirst triggered at **{{first_triggered_at}}**, active for **{{triggered_duration_sec}}** seconds.\n{{/is_alert}}{{#is_recovery}}\n## ✅ Recovered\n\nThe 4xx anomaly for upstream **{{upstream.name}}** has resolved. Current value: `{{value}}`.\n{{/is_recovery}}\n{{^is_recovery}}\n***\n\n## 📈 Impact\n\nElevated 4xx error rates can result in failed requests for end users and may expose misconfigurations or broken routes. Services and clients relying on this NGINX upstream may experience partial or complete degradation of functionality.\n\n***\n\n## Runbook\n\n### Initial Troubleshooting Steps\n\n1. **Identify the affected upstream** from the alert (`{{upstream.name}}`).\n2. Open [**Metrics Explorer**](/metric/explorer) and inspect `nginx.upstream.peers.responses.4xx` broken down by `upstream`.\n3. Review NGINX access logs for specific endpoints and status codes:\n ```bash\n tail -f /var/log/nginx/access.log | grep \" 4[0-9][0-9] \"\n ```\n4. Correlate the spike with recent configuration changes, upstream deployments, or traffic shifts.\n\n### Cause and Resolution\n\n| Cause | Resolution |\n| ----- | ---------- |\n| Invalid or removed request paths (404) | Verify routes in NGINX configuration; update upstream routing rules to reflect the current backend state. |\n| Authentication or authorization failures (401/403) | Review auth configuration; check if credentials or access tokens have expired or been revoked. |\n| Malformed client requests (400) | Inspect incoming request headers and payloads; check client-side request construction. 
|\n| Rate limiting triggered (429) | Review rate limit thresholds; consider scaling upstream services or relaxing limits. |\n| Upstream endpoints renamed or removed | Update NGINX upstream configuration to reflect the current backend service endpoints. |\n\n### Related links\n\n* [Documentation](https://docs.datadoghq.com/integrations/nginx/)\n* [Metrics Explorer](/metric/explorer)\n* [Log Explorer](/logs?query=source%3Anginx)\n\n### Who should be notified?\n\nAssign the appropriate notification handle for this alert (e.g., `@slack-infra`, `@pagerduty-nginx`):\n`@your-team-handle`\n{{/is_recovery}}",
11+
"message": "{{#is_alert}}\n## 🚨 What's happening\n\nAn anomaly has been detected in the number of 4xx HTTP responses from NGINX upstream **{{upstream.name}}** (anomaly score: `{{value}}`, threshold: `{{threshold}}`). The 4xx response rate is significantly higher than normal, indicating that a notable portion of incoming requests are being rejected with client-side error codes.\n\nFirst triggered at **{{first_triggered_at}}**, active for **{{triggered_duration_sec}}** seconds.\n{{/is_alert}}{{#is_recovery}}\n## ✅ Recovered\n\nThe 4xx anomaly for upstream **{{upstream.name}}** has resolved. Current value: `{{value}}`.\n{{/is_recovery}}\n{{^is_recovery}}\n***\n\n## 📈 Impact\n\nElevated 4xx error rates can result in failed requests for end users and may expose misconfigurations or broken routes. Services and clients relying on this NGINX upstream may experience partial or complete degradation of functionality.\n\n***\n\n## Runbook\n\n### Initial Troubleshooting Steps\n\n1. **Identify the affected upstream** from the alert (`{{upstream.name}}`).\n2. Open [**Metrics Explorer**](/metric/explorer) and inspect `nginx.upstream.peers.responses.4xx` broken down by `upstream`.\n3. Review NGINX access logs for specific endpoints and status codes:\n ```bash\n tail -f /var/log/nginx/access.log | grep \" 4[0-9][0-9] \"\n ```\n4. Correlate the spike with recent configuration changes, upstream deployments, or traffic shifts.\n\n### Cause and Resolution\n\n| Cause | Resolution |\n| ----- | ---------- |\n| Invalid or removed request paths (404) | Verify routes in NGINX configuration; update upstream routing rules to reflect the current backend state. |\n| Authentication or authorization failures (401/403) | Review auth configuration; check if credentials or access tokens have expired or been revoked. |\n| Malformed client requests (400) | Inspect incoming request headers and payloads; check client-side request construction. 
|\n| Rate limiting triggered (429) | Review rate limit thresholds; consider scaling upstream services or relaxing limits. |\n| Upstream endpoints renamed or removed | Update NGINX upstream configuration to reflect the current backend service endpoints. |\n\n### Related links\n\n* [Documentation](https://docs.datadoghq.com/integrations/nginx/)\n* [Logs](/logs?query=upstream:{{upstream.name}})\n* [Metrics Explorer (nginx.upstream.peers.responses.4xx)](/metric/explorer?exp_metric=nginx.upstream.peers.responses.4xx&exp_scope=upstream:{{upstream.name}}&exp_agg=avg&exp_type=line)\n\n### Who should be notified?\n\nAssign the appropriate notification handle for this alert (e.g., `@slack-infra`, `@pagerduty-nginx`):\n`@your-team-handle`\n{{/is_recovery}}",
1212
"name": "[NGINX] 4xx Errors higher than usual",
1313
"options": {
1414
"escalation_message": "",

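The runbook in the template above filters the access log with `grep " 4[0-9][0-9] "`. When a per-status breakdown is more useful than a raw stream, the same pattern can be tallied in Python. This is a rough sketch, assuming log lines in which the status code is surrounded by single spaces; the sample lines are invented for illustration.

```python
import re
from collections import Counter

# Same pattern as the grep in the runbook: a 4xx code delimited by spaces.
STATUS_4XX = re.compile(r" (4[0-9][0-9]) ")


def count_4xx(lines) -> Counter:
    """Count the first space-delimited 4xx code found on each line."""
    counts = Counter()
    for line in lines:
        match = STATUS_4XX.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical access-log lines, not taken from any real system.
sample = [
    '203.0.113.7 - - [09/Apr/2026:10:00:00 +0000] "GET /missing HTTP/1.1" 404 153 "-" "curl/8.0"',
    '203.0.113.8 - - [09/Apr/2026:10:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "curl/8.0"',
    '203.0.113.9 - - [09/Apr/2026:10:00:02 +0000] "GET /api HTTP/1.1" 429 97 "-" "curl/8.0"',
    '203.0.113.7 - - [09/Apr/2026:10:00:03 +0000] "GET /gone HTTP/1.1" 404 153 "-" "curl/8.0"',
]
print(count_4xx(sample))
```

Like the grep it mirrors, this matches any space-delimited three-digit number starting with 4, so a response size such as ` 404 ` on a 200 line would be counted too. Parsing the status field by position is more robust if that matters.
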