Commit 289746c

Implement observability alertgroups (#778)
* feat: implement observability alertgroups
* review changes
1 parent 44103a1 commit 289746c

11 files changed

+1987
-4
lines changed

Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
---
# generated by https://github.com/hashicorp/terraform-plugin-docs
page_title: "stackit_observability_alertgroup Data Source - stackit"
subcategory: ""
description: |-
  Observability alert group data source schema. Must have a region specified in the provider configuration.
---

# stackit_observability_alertgroup (Data Source)

Observability alert group data source schema. Must have a `region` specified in the provider configuration.

## Example Usage

```terraform
data "stackit_observability_alertgroup" "example" {
  project_id  = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
  instance_id = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
  name        = "example-alert-group"
}
```

<!-- schema generated by tfplugindocs -->
## Schema

### Required

- `instance_id` (String) Observability instance ID to which the alert group is associated.
- `name` (String) The name of the alert group. It is the identifier and must be unique.
- `project_id` (String) STACKIT project ID to which the alert group is associated.

### Read-Only

- `id` (String) Terraform's internal resource ID. It is structured as "`project_id`,`instance_id`,`name`".
- `interval` (String) Specifies the frequency at which rules within the group are evaluated. The interval must be at least 60 seconds and defaults to 60 seconds if not set. Supported formats include hours, minutes, and seconds, either singly or in combination. Examples of valid formats are: '5h30m40s', '5h', '5h30m', '60m', and '60s'.
- `rules` (Attributes List) (see [below for nested schema](#nestedatt--rules))

<a id="nestedatt--rules"></a>
### Nested Schema for `rules`

Read-Only:

- `alert` (String) The name of the alert rule. It is the identifier and must be unique in the group.
- `annotations` (Map of String) A map of key:value. Annotations to add or overwrite for each alert.
- `expression` (String) The PromQL expression to evaluate. Every evaluation cycle this is evaluated at the current time, and all resultant time series become pending/firing alerts.
- `for` (String) Alerts are considered firing once they have been returned for this long. Alerts which have not yet fired for long enough are considered pending. Defaults to `0s`.
- `labels` (Map of String) A map of key:value. Labels to add or overwrite for each alert.
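
As a minimal sketch of how the read-only attributes might be consumed, the outputs below expose the evaluation interval and the alert names of the looked-up group. The output names are illustrative assumptions; the attribute paths follow the schema listed above.

```terraform
# Hypothetical outputs; the output names are illustrative and not part of the provider.
output "alertgroup_interval" {
  # Evaluation interval reported for the group, e.g. "60s".
  value = data.stackit_observability_alertgroup.example.interval
}

output "alertgroup_alert_names" {
  # `rules` is a list of objects, so a for expression collects each rule's `alert` name.
  value = [for rule in data.stackit_observability_alertgroup.example.rules : rule.alert]
}
```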
Lines changed: 267 additions & 0 deletions
@@ -0,0 +1,267 @@
---
page_title: "Alerting with Kube-State-Metrics in STACKIT Observability"
---
# Alerting with Kube-State-Metrics in STACKIT Observability

## Overview

This guide explains how to configure the STACKIT Observability product to send alerts using metrics gathered from kube-state-metrics.

1. **Set Up Providers**

Begin by configuring the STACKIT, Kubernetes, and Helm providers. The Kubernetes and Helm providers read their connection details from the SKE kubeconfig resource created in the next step.

```hcl
provider "stackit" {
  region = "eu01"
}

provider "kubernetes" {
  host                   = yamldecode(stackit_ske_kubeconfig.example.kube_config).clusters.0.cluster.server
  client_certificate     = base64decode(yamldecode(stackit_ske_kubeconfig.example.kube_config).users.0.user.client-certificate-data)
  client_key             = base64decode(yamldecode(stackit_ske_kubeconfig.example.kube_config).users.0.user.client-key-data)
  cluster_ca_certificate = base64decode(yamldecode(stackit_ske_kubeconfig.example.kube_config).clusters.0.cluster.certificate-authority-data)
}

provider "helm" {
  kubernetes {
    host                   = yamldecode(stackit_ske_kubeconfig.example.kube_config).clusters.0.cluster.server
    client_certificate     = base64decode(yamldecode(stackit_ske_kubeconfig.example.kube_config).users.0.user.client-certificate-data)
    client_key             = base64decode(yamldecode(stackit_ske_kubeconfig.example.kube_config).users.0.user.client-key-data)
    cluster_ca_certificate = base64decode(yamldecode(stackit_ske_kubeconfig.example.kube_config).clusters.0.cluster.certificate-authority-data)
  }
}
```
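
The provider blocks above decode the same kubeconfig eight times. One possible way to reduce the repetition, shown here as a sketch rather than as part of the original guide, is to decode it once in a `locals` block. The local names (`kubeconfig`, `cluster`, `user`) are assumptions introduced for illustration, and the Helm version constraint assumes that the nested `kubernetes { ... }` block syntax belongs to the 2.x series of that provider. These blocks would replace the `kubernetes` and `helm` provider blocks above rather than sit alongside them.

```hcl
terraform {
  required_providers {
    stackit = {
      source = "stackitcloud/stackit"
    }
    kubernetes = {
      source = "hashicorp/kubernetes"
    }
    helm = {
      source = "hashicorp/helm"
      # Assumption: the `kubernetes { ... }` block form used in this guide is the 2.x syntax.
      version = "~> 2.0"
    }
  }
}

locals {
  # Decode the kubeconfig once and pick out the first cluster and user entries.
  kubeconfig = yamldecode(stackit_ske_kubeconfig.example.kube_config)
  cluster    = local.kubeconfig.clusters[0].cluster
  user       = local.kubeconfig.users[0].user
}

provider "kubernetes" {
  host                   = local.cluster.server
  client_certificate     = base64decode(local.user["client-certificate-data"])
  client_key             = base64decode(local.user["client-key-data"])
  cluster_ca_certificate = base64decode(local.cluster["certificate-authority-data"])
}

provider "helm" {
  kubernetes {
    host                   = local.cluster.server
    client_certificate     = base64decode(local.user["client-certificate-data"])
    client_key             = base64decode(local.user["client-key-data"])
    cluster_ca_certificate = base64decode(local.cluster["certificate-authority-data"])
  }
}
```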

2. **Create SKE Cluster and Kubeconfig Resource**

Set up a STACKIT SKE Cluster and generate the associated kubeconfig resource.

```hcl
resource "stackit_ske_cluster" "example" {
  project_id         = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
  name               = "example"
  kubernetes_version = "1.31"
  node_pools = [
    {
      name               = "standard"
      machine_type       = "c1.4"
      minimum            = "3"
      maximum            = "9"
      max_surge          = "3"
      availability_zones = ["eu01-1", "eu01-2", "eu01-3"]
      os_version_min     = "4081.2.1"
      os_name            = "flatcar"
      volume_size        = 32
      volume_type        = "storage_premium_perf6"
    }
  ]
  maintenance = {
    enable_kubernetes_version_updates    = true
    enable_machine_image_version_updates = true
    start                                = "01:00:00Z"
    end                                  = "02:00:00Z"
  }
}

resource "stackit_ske_kubeconfig" "example" {
  project_id   = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
  cluster_name = stackit_ske_cluster.example.name
  refresh      = true
}
```

3. **Create Observability Instance and Credentials**

Establish a STACKIT Observability instance and its credentials to handle alerts.

```hcl
locals {
  alert_config = {
    route = {
      receiver        = "EmailStackit",
      repeat_interval = "1m",
      continue        = true
    }
    receivers = [
      {
        name = "EmailStackit",
        email_configs = [
          {
            to = "<email>"
          }
        ]
      }
    ]
  }
}

resource "stackit_observability_instance" "example" {
  project_id   = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
  name         = "example"
  plan_name    = "Observability-Large-EU01"
  alert_config = local.alert_config
}

resource "stackit_observability_credential" "example" {
  project_id  = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
  instance_id = stackit_observability_instance.example.instance_id
}
```

4. **Install Prometheus Operator**

Use the kube-prometheus-stack Helm chart to install kube-state-metrics and forward its metrics to the STACKIT Observability instance via remote write. Customize the Helm values as needed for your deployment.

```yaml
# helm values
# save as prom-values.tftpl
prometheus:
  enabled: true
  agentMode: true
  prometheusSpec:
    enableRemoteWriteReceiver: true
    scrapeInterval: 60s
    evaluationInterval: 60s
    replicas: 1
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: premium-perf4-stackit
          accessModes: ['ReadWriteOnce']
          resources:
            requests:
              storage: 80Gi
    remoteWrite:
      - url: ${metrics_push_url}
        queueConfig:
          batchSendDeadline: '5s'
          # both values need to be configured according to your observability plan
          capacity: 30000
          maxSamplesPerSend: 3000
        writeRelabelConfigs:
          - sourceLabels: ['__name__']
            regex: 'apiserver_.*|etcd_.*|prober_.*|storage_.*|workqueue_(work|queue)_duration_seconds_bucket|kube_pod_tolerations|kubelet_.*|kubernetes_feature_enabled|instance_scrape_target_status'
            action: 'drop'
          - sourceLabels: ['namespace']
            regex: 'example'
            action: 'keep'
        basicAuth:
          username:
            key: username
            name: ${secret_name}
          password:
            key: password
            name: ${secret_name}

grafana:
  enabled: false

defaultRules:
  create: false

alertmanager:
  enabled: false

nodeExporter:
  enabled: true

kube-state-metrics:
  enabled: true
  customResourceState:
    enabled: true
  collectors:
    - deployments
    - pods
```

```hcl
resource "kubernetes_namespace" "monitoring" {
  metadata {
    name = "monitoring"
  }
}

resource "kubernetes_secret" "argus_prometheus_authorization" {
  metadata {
    name      = "argus-prometheus-credentials"
    namespace = kubernetes_namespace.monitoring.metadata[0].name
  }

  data = {
    username = stackit_observability_credential.example.username
    password = stackit_observability_credential.example.password
  }
}

resource "helm_release" "prometheus_operator" {
  name       = "prometheus-operator"
  repository = "https://prometheus-community.github.io/helm-charts"
  chart      = "kube-prometheus-stack"
  version    = "60.1.0"
  namespace  = kubernetes_namespace.monitoring.metadata[0].name

  values = [
    templatefile("prom-values.tftpl", {
      metrics_push_url = stackit_observability_instance.example.metrics_push_url
      secret_name      = kubernetes_secret.argus_prometheus_authorization.metadata[0].name
    })
  ]
}
```

5. **Create Alert Group**

Define an alert group with a rule that fires when at least one pod is running in the `example` namespace.

```hcl
resource "stackit_observability_alertgroup" "example" {
  project_id  = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
  instance_id = stackit_observability_instance.example.instance_id
  name        = "TestAlertGroup"
  interval    = "2h"
  rules = [
    {
      alert      = "SimplePodCheck"
      expression = "sum(kube_pod_status_phase{phase=\"Running\", namespace=\"example\"}) > 0"
      for        = "60s"
      labels = {
        severity = "critical"
      },
      annotations = {
        summary     = "Test Alert is working"
        description = "Test Alert"
      }
    },
  ]
}
```

6. **Deploy Test Pod**

Deploy a test pod; doing so should trigger an email notification, as the deployment satisfies the conditions defined in the alert group rule. In a real-world scenario, you would typically configure alerts to monitor pods for error states instead.

```hcl
resource "kubernetes_namespace" "example" {
  metadata {
    name = "example"
  }
}

resource "kubernetes_pod" "example" {
  metadata {
    name      = "nginx"
    namespace = kubernetes_namespace.example.metadata[0].name
    labels = {
      app = "nginx"
    }
  }

  spec {
    container {
      image = "nginx:latest"
      name  = "nginx"
    }
  }
}
```
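
For the real-world case mentioned in this step, an alert group that watches for error states might look roughly like the sketch below. It is not part of the original guide: the resource label `pod_errors`, the group and alert names, the `for` durations, and the annotation texts are all illustrative assumptions. The expressions rely on the kube-state-metrics series `kube_pod_status_phase` and `kube_pod_container_status_waiting_reason`, and they stay within the `example` namespace because the remote-write relabeling above keeps only that namespace.

```hcl
# Hypothetical alert group for pod error states; all names and durations are illustrative.
resource "stackit_observability_alertgroup" "pod_errors" {
  project_id  = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
  instance_id = stackit_observability_instance.example.instance_id
  name        = "PodErrorAlerts"
  interval    = "60s"
  rules = [
    {
      # Fires when at least one pod in the namespace has been in phase "Failed" for 5 minutes.
      alert      = "PodFailed"
      expression = "sum(kube_pod_status_phase{phase=\"Failed\", namespace=\"example\"}) > 0"
      for        = "5m"
      labels = {
        severity = "critical"
      }
      annotations = {
        summary     = "Pod failed"
        description = "At least one pod in the example namespace is in the Failed phase."
      }
    },
    {
      # Fires when at least one container has been waiting in CrashLoopBackOff for 5 minutes.
      alert      = "PodCrashLooping"
      expression = "sum(kube_pod_container_status_waiting_reason{reason=\"CrashLoopBackOff\", namespace=\"example\"}) > 0"
      for        = "5m"
      labels = {
        severity = "warning"
      }
      annotations = {
        summary     = "Container is crash looping"
        description = "At least one container in the example namespace is in CrashLoopBackOff."
      }
    },
  ]
}
```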
Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,80 @@
---
# generated by https://github.com/hashicorp/terraform-plugin-docs
page_title: "stackit_observability_alertgroup Resource - stackit"
subcategory: ""
description: |-
  Observability alert group resource schema. Must have a region specified in the provider configuration.
---

# stackit_observability_alertgroup (Resource)

Observability alert group resource schema. Must have a `region` specified in the provider configuration.

## Example Usage

```terraform
resource "stackit_observability_alertgroup" "example" {
  project_id  = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
  instance_id = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
  name        = "example-alert-group"
  interval    = "60s"
  rules = [
    {
      alert      = "example-alert-name"
      expression = "kube_node_status_condition{condition=\"Ready\", status=\"false\"} > 0"
      for        = "60s"
      labels = {
        severity = "critical"
      },
      annotations = {
        summary     = "example summary"
        description = "example description"
      }
    },
    {
      alert      = "example-alert-name-2"
      expression = "kube_node_status_condition{condition=\"Ready\", status=\"false\"} > 0"
      for        = "1m"
      labels = {
        severity = "critical"
      },
      annotations = {
        summary     = "example summary"
        description = "example description"
      }
    },
  ]
}
```

<!-- schema generated by tfplugindocs -->
## Schema

### Required

- `instance_id` (String) Observability instance ID to which the alert group is associated.
- `name` (String) The name of the alert group. It is the identifier and must be unique.
- `project_id` (String) STACKIT project ID to which the alert group is associated.
- `rules` (Attributes List) Rules for the alert group (see [below for nested schema](#nestedatt--rules))

### Optional

- `interval` (String) Specifies the frequency at which rules within the group are evaluated. The interval must be at least 60 seconds and defaults to 60 seconds if not set. Supported formats include hours, minutes, and seconds, either singly or in combination. Examples of valid formats are: '5h30m40s', '5h', '5h30m', '60m', and '60s'.

### Read-Only

- `id` (String) Terraform's internal resource ID. It is structured as "`project_id`,`instance_id`,`name`".

<a id="nestedatt--rules"></a>
### Nested Schema for `rules`

Required:

- `alert` (String) The name of the alert rule. It is the identifier and must be unique in the group.
- `expression` (String) The PromQL expression to evaluate. Every evaluation cycle this is evaluated at the current time, and all resultant time series become pending/firing alerts.

Optional:

- `annotations` (Map of String) A map of key:value. Annotations to add or overwrite for each alert.
- `for` (String) Alerts are considered firing once they have been returned for this long. Alerts which have not yet fired for long enough are considered pending. Defaults to `0s`.
- `labels` (Map of String) A map of key:value. Labels to add or overwrite for each alert.
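
When `interval` comes from a module input, the documented format can be checked early with a variable validation. The following is only a sketch: the variable name `alert_interval` and the regular expression are assumptions that approximate the formats listed for `interval`, and the check does not enforce the 60-second minimum, which is left to the provider.

```terraform
# Hypothetical input variable; the name and the regex are illustrative assumptions.
variable "alert_interval" {
  type        = string
  default     = "60s"
  description = "Evaluation interval for the alert group, e.g. 5h30m40s, 60m, or 60s."

  validation {
    # Accepts optional hours, minutes, and seconds parts in that order and rejects the empty string.
    condition     = length(var.alert_interval) > 0 && can(regex("^([0-9]+h)?([0-9]+m)?([0-9]+s)?$", var.alert_interval))
    error_message = "The interval must combine hours, minutes, and seconds, for example 5h30m40s."
  }
}
```

The variable could then be passed to the resource as `interval = var.alert_interval`.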
