File: docs/04-For Operators/05-monitoring.md
Lines changed: 75 additions & 47 deletions
@@ -10,22 +10,41 @@ sidebar_position: 5
10
10
11
11

12
12
13
+
The diagram above shows the full monitoring and logging stack: partition hosts ship logs to Loki and expose metrics for Prometheus scraping; control-plane and Gardener seed Alloy instances push both logs and self-metrics centrally; Grafana provides unified dashboards and alerting across all tiers.
14
+
13
15
## Logging
14
16
15
-
Logs are being collected by
16
-
[Grafana Alloy](https://grafana.com/docs/alloy/latest/) and pushed
17
-
to a [Loki](https://grafana.com/docs/loki/latest/) instance running in the
Loki is deployed in [monolithic mode](https://grafana.com/docs/loki/latest/setup/install/helm/install-monolithic/) and uses the `'filesystem'` storage type.
21
+
You can find all logging-related configuration parameters for the control plane in the [logging](https://github.com/metal-stack/metal-roles/blob/master/control-plane/roles/logging/README.md) role.
22
+
23
+
In the partitions, Alloy can be deployed inside a systemd-managed Docker container on management servers and switches.
24
+
Configuration parameters can be found in the partition's [alloy](https://github.com/metal-stack/metal-roles/blob/master/partition/roles/alloy/README.md) role.
25
+
26
+
### Control-Plane Log Sources
27
+
28
+
In the control plane, Alloy runs as a Kubernetes DaemonSet and collects logs from two sources:
| Source | Description | Labels |
|---|---|---|
| Pod logs | Collected from all pods via the Kubernetes API (`loki.source.kubernetes`) | `cluster`, `namespace`, `pod`, `container`, `pod_uid`, `node_name`, `app`, `instance`, `component`, `job` |
33
+
| Kubernetes events | Collected natively via `loki.source.kubernetes_events` — no separate event-exporter required |`cluster`, `job=monitoring/event-exporter`, `namespace`|
34
+
35
+
All control-plane log entries carry a `cluster` label (configured via `logging_alloy_cluster_label`) identifying the control-plane stage.
36
+
37
+
#### Gardener
38
+
39
+
The [gardener-logging](https://github.com/metal-stack/metal-roles/blob/master/control-plane/roles/gardener-logging/README.md) role deploys an additional Alloy instance into each Gardener shooted seed and optionally into the garden cluster itself. These instances collect pod logs from their respective clusters and forward them to the same Loki instance in the metal-stack control plane. Logs carry a `cluster` label set to the shooted seed name, enabling per-seed filtering in Grafana.
40
+
41
+
### Control-Plane: Querying Logs in Grafana
42
+
43
+
- `{cluster="<stage-name>"}`: all logs from a control-plane stage
44
+
- `{namespace="<namespace>"}`: all logs from a specific namespace
45
+
- `{job="<namespace>/<app>"}`: logs from a specific application
- `{cluster="<seed-name>"}`: all logs from a specific Gardener shooted seed
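
The label filters above are plain LogQL stream selectors, so they can also be assembled in a script. The following is a minimal Python sketch; the helper name `logql_selector` and the example values (`stage-a`, `seed-a`) are hypothetical and not part of metal-stack or Grafana.

```python
def logql_selector(**labels: str) -> str:
    """Build a LogQL stream selector with exact-match label filters."""

    def escape(value: str) -> str:
        # Backslashes and double quotes must be escaped inside a
        # double-quoted LogQL string literal.
        return value.replace("\\", "\\\\").replace('"', '\\"')

    # Sort the labels so the generated selector is deterministic.
    parts = [f'{name}="{escape(value)}"' for name, value in sorted(labels.items())]
    return "{" + ", ".join(parts) + "}"


# All logs from a control-plane stage (hypothetical stage name).
print(logql_selector(cluster="stage-a"))
# -> {cluster="stage-a"}

# Kubernetes events from a Gardener shooted seed (hypothetical seed name).
print(logql_selector(cluster="seed-a", job="monitoring/event-exporter"))
# -> {cluster="seed-a", job="monitoring/event-exporter"}
```

The resulting strings can be pasted into the Grafana **Explore** query field unchanged.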
29
48
30
49
### Partition Log Sources
31
50
@@ -38,25 +57,23 @@ Alloy is configured through snippets that define what logs are collected. The fo
38
57
| Hosts without journald |`syslog`| Tails `/var/log/syslog`|`job=syslog`|
39
58
| Hosts running Docker |`docker`| Collects logs from all Docker containers via the Docker socket |`job=docker`, `container`|
40
59
41
-
All log entries carry the `host` and `partition` labels regardless of snippet, which makes it easy to filter logs in Grafana Explore by host or partition.
42
-
43
-
### Querying Logs in Grafana
60
+
### Partition: Querying Logs in Grafana
44
61
45
-
Logs can be explored in Grafana using the **Explore** view with the Loki data source. Useful label filters:
62
+
All log entries carry the `host` and `partition` labels regardless of snippet, which makes it easy to filter logs in Grafana Explore by host or partition.
46
63
47
64
- `{partition="<partition-id>"}`: all logs from a partition
48
65
- `{host="<hostname>"}`: all logs from a specific host
49
66
- `{job="docker", container="<name>"}`: logs from a specific Docker container
50
67
- `{job="systemd-journal", unit="<unit>.service"}`: logs from a specific systemd unit
51
68
- `{job="systemd-journal", level="error"}`: error-level journal entries across all units
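
Outside of Grafana, the same selectors can be sent to Loki's HTTP API via the `/loki/api/v1/query_range` endpoint. The sketch below only builds the request URL and sends nothing. The base URL `http://loki.example:3100`, the host name `mgmt01`, and the helper name `loki_query_range_url` are assumptions for illustration; how Loki is exposed in a given installation depends on the logging role's configuration.

```python
import time
from urllib.parse import urlencode


def loki_query_range_url(base_url: str, query: str,
                         start_ns: int, end_ns: int, limit: int = 100) -> str:
    """Return a Loki query_range URL for a LogQL query and a time window.

    start_ns and end_ns are Unix timestamps in nanoseconds.
    """
    params = {"query": query, "start": start_ns, "end": end_ns, "limit": limit}
    # urlencode percent-encodes the braces and quotes of the selector.
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{urlencode(params)}"


# Logs of one partition host from the last 15 minutes.
now_ns = time.time_ns()
url = loki_query_range_url(
    "http://loki.example:3100",   # hypothetical Loki address
    '{host="mgmt01"}',            # hypothetical host name
    start_ns=now_ns - 15 * 60 * 1_000_000_000,
    end_ns=now_ns,
)
print(url)
```

The URL can then be fetched with any HTTP client, adding whatever authentication the deployment requires.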
52
69
53
-
:::note Migrating from promtail
70
+
:::note Migrating from Promtail
71
+
72
+
The `promtail` role is deprecated and replaced by the `alloy` role. Refer to the respective migration guides for step-by-step instructions:
54
73
55
-
The `promtail` role is deprecated and replaced by the `alloy` role. Refer to the
56
-
[Migration from promtail](https://github.com/metal-stack/metal-roles/blob/master/partition/roles/alloy/README.md#migration-from-promtail)
57
-
section of the partition alloy role's README and the
58
-
[Migration from promtail](https://github.com/metal-stack/metal-roles/blob/master/control-plane/roles/logging/README.md#migration-from-promtail)
59
-
section of the control-plane logging role's README for step-by-step instructions.
74
+
- [Partition](https://github.com/metal-stack/metal-roles/blob/master/partition/roles/alloy/README.md#migration-from-promtail): partition alloy role
75
+
- [Control-plane](https://github.com/metal-stack/metal-roles/blob/master/control-plane/roles/logging/README.md#migration-from-promtail): control-plane logging role
76
+
- [Gardener](https://github.com/metal-stack/metal-roles/blob/master/control-plane/roles/gardener-logging/README.md#migration-from-promtail): gardener-logging role
60
77
61
78
:::
62
79
@@ -65,31 +82,20 @@ section of the control-plane logging role's README for step-by-step instructions
and a [Thanos](https://thanos.io/tip/thanos/getting-started.md/) instance in the
68
-
control plane. Metrics for the control plane are supplied by
85
+
control plane.
69
86
70
-
-`metal-metrics-exporter`
71
-
-`rethindb-exporter`
72
-
-`event-exporter`
73
-
-`gardener-metrics-exporter`
87
+
### Control-Plane Metrics
74
88
75
-
To query and visualize logs, metrics and alerts we deploy several grafana
76
-
dashboards to the controlplane:
89
+
In-cluster components are scraped by Prometheus via `ServiceMonitor` resources (pull model).
90
+
Alloy self-metrics use a different approach: the control-plane Alloy DaemonSet and all Gardener seed Alloy instances push their metrics via `prometheus.remote_write` to Thanos Receive (`monitoring_thanos_receive_enabled: true`), removing the need for Prometheus to reach into each cluster.
77
91
78
-
-`grafana-dashboard-alertmanager`
79
-
-`grafana-dashboard-machine-capacity`
80
-
-`grafana-dashboard-metal-api`
81
-
-`grafana-dashboard-rethinkdb`
82
-
-`grafana-dashboard-sonic-exporter`
83
-
84
-
and also some gardener related dashboards:
92
+
Metrics are supplied by:
85
93
86
-
-`grafana-dashboard-gardener-overview`
87
-
-`grafana-dashboard-shoot-cluster`
88
-
-`grafana-dashboard-shoot-customizations`
89
-
-`grafana-dashboard-shoot-details`
90
-
-`grafana-dashboard-shoot-states`
94
+
- `metal-metrics-exporter`
95
+
- `rethinkdb-exporter`
96
+
- `gardener-metrics-exporter`
91
97
92
-
The following `ServiceMonitors` are also deployed:
98
+
The following `ServiceMonitors` are deployed:
93
99
94
100
- `gardener-metrics-exporter`
95
101
- `ipam-db`
@@ -105,24 +111,46 @@ found in the control plane's