Commit 2247606

committed
docs(monitoring): control-plane alloy
1 parent 2976a07 commit 2247606

3 files changed

Lines changed: 1463 additions & 48 deletions

docs/04-For Operators/05-monitoring.md

Lines changed: 75 additions & 47 deletions
Original file line number | Diff line number | Diff line change
@@ -10,22 +10,41 @@ sidebar_position: 5
1010

1111
![Monitoring Stack](monitoring-stack.svg)
1212

13+
The diagram above shows the full monitoring and logging stack: partition hosts ship logs to Loki and expose metrics for Prometheus scraping; control-plane and Gardener seed Alloy instances push both logs and self-metrics centrally; Grafana provides unified dashboards and alerting across all tiers.
14+
1315
## Logging
1416

15-
Logs are being collected by
16-
[Grafana Alloy](https://grafana.com/docs/alloy/latest/) and pushed
17-
to a [Loki](https://grafana.com/docs/loki/latest/) instance running in the
18-
control plane. Loki is deployed in
19-
[monolithic mode](https://grafana.com/docs/loki/latest/setup/install/helm/install-monolithic/)
20-
and with storage type `'filesystem'`. You can find all logging related
21-
configuration parameters for the control plane in the control plane's
22-
[logging](https://github.com/metal-stack/metal-roles/blob/master/control-plane/roles/logging/README.md)
23-
role.
17+
[Grafana Alloy](https://grafana.com/docs/alloy/latest/) collects and pushes logs to a [Loki](https://grafana.com/docs/loki/latest/) instance
18+
running in the control plane.
2419

25-
In the partitions, Alloy can be deployed inside a systemd-managed Docker
26-
container on management servers and switches. Configuration parameters can be found in the partition's
27-
[alloy](https://github.com/metal-stack/metal-roles/blob/master/partition/roles/alloy/README.md)
28-
role.
20+
Loki is deployed in [monolithic mode](https://grafana.com/docs/loki/latest/setup/install/helm/install-monolithic/) and with storage type `'filesystem'`.
21+
You can find all logging-related configuration parameters for the control plane in the [logging](https://github.com/metal-stack/metal-roles/blob/master/control-plane/roles/logging/README.md) role.
22+
23+
In the partitions, Alloy can be deployed inside a systemd-managed Docker container on management servers and switches.
24+
Configuration parameters can be found in the partition's [alloy](https://github.com/metal-stack/metal-roles/blob/master/partition/roles/alloy/README.md) role.
25+
26+
### Control-Plane Log Sources
27+
28+
In the control plane, Alloy runs as a Kubernetes DaemonSet and collects logs from two sources:
29+
30+
| Source | Description | Key labels |
31+
| ----------------- | -------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------- |
32+
| Pod logs | Collected from all pods via the Kubernetes API (`loki.source.kubernetes`) | `cluster`, `namespace`, `pod`, `container`, `pod_uid`, `node_name`, `app`, `instance`, `component`, `job` |
33+
| Kubernetes events | Collected natively via `loki.source.kubernetes_events` — no separate event-exporter required | `cluster`, `job=monitoring/event-exporter`, `namespace` |
34+
35+
All control-plane log entries carry a `cluster` label (configured via `logging_alloy_cluster_label`) identifying the control-plane stage.
36+
37+
#### Gardener
38+
39+
The [gardener-logging](https://github.com/metal-stack/metal-roles/blob/master/control-plane/roles/gardener-logging/README.md) role deploys an additional Alloy instance into each Gardener shooted seed and optionally into the garden cluster itself. These instances collect pod logs from their respective clusters and forward them to the same Loki instance in the metal-stack control plane. Logs carry a `cluster` label set to the shooted seed name, enabling per-seed filtering in Grafana.
40+
41+
### Control-Plane: Querying Logs in Grafana
42+
43+
- `{cluster="<stage-name>"}` — all logs from a control-plane stage
44+
- `{namespace="<namespace>"}` — all logs from a specific namespace
45+
- `{job="<namespace>/<app>"}` — logs from a specific application
46+
- `{job="monitoring/event-exporter"}` — Kubernetes events
47+
- `{cluster="<seed-name>"}` — all logs from a specific Gardener shooted seed
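
The label filters above are plain LogQL stream selectors, so they can also be assembled programmatically. The following is a minimal sketch, not part of the metal-stack tooling: the helper name `logql_selector` and the stage name `prod` are hypothetical, and only the label names come from the list above.

```python
def logql_selector(**labels: str) -> str:
    """Build a LogQL stream selector such as {cluster="prod", namespace="monitoring"}."""
    if not labels:
        raise ValueError("a LogQL selector needs at least one label matcher")
    parts = []
    for name, value in labels.items():
        # Escape backslashes and double quotes so the value stays a valid LogQL string.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        parts.append(f'{name}="{escaped}"')
    return "{" + ", ".join(parts) + "}"


# "prod" is a hypothetical stage name; substitute your own cluster label value.
print(logql_selector(cluster="prod"))
print(logql_selector(cluster="prod", job="monitoring/event-exporter"))
```

Combining `cluster` with `job` or `namespace` in one selector narrows the result to a single stage, which helps when several seeds forward logs to the same Loki instance.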
2948

3049
### Partition Log Sources
3150

@@ -38,25 +57,23 @@ Alloy is configured through snippets that define what logs are collected. The fo
3857
| Hosts without journald | `syslog` | Tails `/var/log/syslog` | `job=syslog` |
3958
| Hosts running Docker | `docker` | Collects logs from all Docker containers via the Docker socket | `job=docker`, `container` |
4059

41-
All log entries carry the `host` and `partition` labels regardless of snippet, which makes it easy to filter logs in Grafana Explore by host or partition.
42-
43-
### Querying Logs in Grafana
60+
### Partition: Querying Logs in Grafana
4461

45-
Logs can be explored in Grafana using the **Explore** view with the Loki data source. Useful label filters:
62+
All log entries carry the `host` and `partition` labels regardless of snippet, which makes it easy to filter logs in Grafana Explore by host or partition.
4663

4764
- `{partition="<partition-id>"}` — all logs from a partition
4865
- `{host="<hostname>"}` — all logs from a specific host
4966
- `{job="docker", container="<name>"}` — logs from a specific Docker container
5067
- `{job="systemd-journal", unit="<unit>.service"}` — logs from a specific systemd unit
5168
- `{job="systemd-journal", level="error"}` — error-level journal entries across all units
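
Outside of Grafana Explore, the same selectors can be sent to Loki's HTTP API. The sketch below only builds the request URL for the `/loki/api/v1/query_range` endpoint and performs no network call. The base URL, the helper name `loki_query_range_url`, and the timestamps are assumptions for illustration; how Loki is exposed in a given control plane depends on the deployment.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def loki_query_range_url(base_url: str, query: str, start_ns: int, end_ns: int, limit: int = 100) -> str:
    """Return a query_range URL for a LogQL query over [start_ns, end_ns] (Unix nanoseconds)."""
    params = {
        "query": query,
        "start": str(start_ns),
        "end": str(end_ns),
        "limit": str(limit),
    }
    # urlencode percent-encodes braces, quotes and commas in the LogQL selector.
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{urlencode(params)}"


# Hypothetical Loki address and time window; the selector is taken from the list above.
url = loki_query_range_url(
    "http://loki.example.internal:3100",
    '{job="systemd-journal", level="error"}',
    start_ns=1_700_000_000_000_000_000,
    end_ns=1_700_000_600_000_000_000,
)
print(url)
print(parse_qs(urlsplit(url).query)["query"][0])
```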
5269

53-
:::note Migrating from promtail
70+
:::note Migrating from Promtail
71+
72+
The `promtail` role is deprecated and replaced by the `alloy` role. Refer to the respective migration guides for step-by-step instructions:
5473

55-
The `promtail` role is deprecated and replaced by the `alloy` role. Refer to the
56-
[Migration from promtail](https://github.com/metal-stack/metal-roles/blob/master/partition/roles/alloy/README.md#migration-from-promtail)
57-
section of the partition alloy role's README and the
58-
[Migration from promtail](https://github.com/metal-stack/metal-roles/blob/master/control-plane/roles/logging/README.md#migration-from-promtail)
59-
section of the control-plane logging role's README for step-by-step instructions.
74+
- [Partition](https://github.com/metal-stack/metal-roles/blob/master/partition/roles/alloy/README.md#migration-from-promtail) — partition alloy role
75+
- [Control-plane](https://github.com/metal-stack/metal-roles/blob/master/control-plane/roles/logging/README.md#migration-from-promtail) — control-plane logging role
76+
- [Gardener](https://github.com/metal-stack/metal-roles/blob/master/control-plane/roles/gardener-logging/README.md#migration-from-promtail) — gardener-logging role
6077

6178
:::
6279

@@ -65,31 +82,20 @@ section of the control-plane logging role's README for step-by-step instructions
6582
For monitoring we deploy the
6683
[kube-prometheus-stack](https://github.com/prometheus-operator/kube-prometheus)
6784
and a [Thanos](https://thanos.io/tip/thanos/getting-started.md/) instance in the
68-
control plane. Metrics for the control plane are supplied by
85+
control plane.
6986

70-
- `metal-metrics-exporter`
71-
- `rethindb-exporter`
72-
- `event-exporter`
73-
- `gardener-metrics-exporter`
87+
### Control-Plane Metrics
7488

75-
To query and visualize logs, metrics and alerts we deploy several grafana
76-
dashboards to the control plane:
89+
In-cluster components are scraped by Prometheus via `ServiceMonitor` resources (pull model).
90+
Alloy self-metrics use a different approach: the control-plane Alloy DaemonSet and all Gardener seed Alloy instances push their metrics via `prometheus.remote_write` to Thanos Receive (`monitoring_thanos_receive_enabled: true`), removing the need for Prometheus to reach into each cluster.
7791

78-
- `grafana-dashboard-alertmanager`
79-
- `grafana-dashboard-machine-capacity`
80-
- `grafana-dashboard-metal-api`
81-
- `grafana-dashboard-rethinkdb`
82-
- `grafana-dashboard-sonic-exporter`
83-
84-
and also some gardener related dashboards:
92+
Metrics are supplied by
8593

86-
- `grafana-dashboard-gardener-overview`
87-
- `grafana-dashboard-shoot-cluster`
88-
- `grafana-dashboard-shoot-customizations`
89-
- `grafana-dashboard-shoot-details`
90-
- `grafana-dashboard-shoot-states`
94+
- `metal-metrics-exporter`
95+
- `rethinkdb-exporter`
96+
- `gardener-metrics-exporter`
9197

92-
The following `ServiceMonitors` are also deployed:
98+
The following `ServiceMonitors` are deployed:
9399

94100
- `gardener-metrics-exporter`
95101
- `ipam-db`
@@ -105,24 +111,46 @@ found in the control plane's
105111
[monitoring](https://github.com/metal-stack/metal-roles/blob/master/control-plane/roles/monitoring/README.md)
106112
role.
107113

108-
Partition metrics are supplied by
114+
### Partition Metrics
115+
116+
Partition metrics are collected via Prometheus scraping (pull model). Exporters running on partition hosts supply the metrics:
109117

110118
- `node-exporter`
111119
- `blackbox-exporter`
112120
- `ipmi-exporter`
113121
- `sonic-exporter`
114122
- `metal-core`
115123
- `frr-exporter`
124+
- `alloy`
116125

117-
and scraped by Prometheus. For each of these exporters, the target hosts can be
118-
defined by
126+
Target hosts for each exporter are defined by
119127

120128
- `prometheus_node_exporter_targets`
121129
- `prometheus_blackbox_exporter_targets`
122-
- `prometheus_frr_exporter_targets`
130+
- `prometheus_ipmi_exporter_targets`
123131
- `prometheus_sonic_exporter_targets`
124132
- `prometheus_metal_core_targets`
125133
- `prometheus_frr_exporter_targets`
134+
- `prometheus_alloy_targets`
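
Each exporter in the list above maps to a target variable by a regular naming pattern (`prometheus_<exporter>_targets`, with dashes replaced by underscores). The sketch below uses that pattern to report exporters that have no targets configured. It is an illustration only: the helper names are hypothetical, and the shape of the variable values (a list of host strings) is an assumption, so check the monitoring role's README for the actual format.

```python
EXPORTERS = [
    "node-exporter",
    "blackbox-exporter",
    "ipmi-exporter",
    "sonic-exporter",
    "metal-core",
    "frr-exporter",
    "alloy",
]


def targets_variable(exporter: str) -> str:
    """Derive the target variable name, e.g. node-exporter -> prometheus_node_exporter_targets."""
    return f"prometheus_{exporter.replace('-', '_')}_targets"


def missing_targets(inventory_vars: dict) -> list:
    """Return the target variables that are unset or empty in the given variables."""
    return [
        targets_variable(exporter)
        for exporter in EXPORTERS
        if not inventory_vars.get(targets_variable(exporter))
    ]


# Hypothetical inventory variables; host names and value shape are assumptions.
example_vars = {
    "prometheus_node_exporter_targets": ["mgmt01:9100"],
    "prometheus_alloy_targets": ["mgmt01:12345"],
}
print(missing_targets(example_vars))
```

A check like this would have flagged the duplicated `prometheus_frr_exporter_targets` entry that this commit replaces with `prometheus_ipmi_exporter_targets`, since the `ipmi-exporter` had no matching variable in the old list.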
135+
136+
### Dashboards
137+
138+
To query and visualize logs, metrics, and alerts, we deploy several Grafana
139+
dashboards to the control plane:
140+
141+
- `grafana-dashboard-alertmanager`
142+
- `grafana-dashboard-machine-capacity`
143+
- `grafana-dashboard-metal-api`
144+
- `grafana-dashboard-rethinkdb`
145+
- `grafana-dashboard-sonic-exporter`
146+
147+
and also some Gardener-related dashboards:
148+
149+
- `grafana-dashboard-gardener-overview`
150+
- `grafana-dashboard-shoot-cluster`
151+
- `grafana-dashboard-shoot-customizations`
152+
- `grafana-dashboard-shoot-details`
153+
- `grafana-dashboard-shoot-states`
126154

127155
## Alerting
128156

docs/04-For Operators/monitoring-stack.drawio.svg

Lines changed: 1387 additions & 0 deletions

docs/04-For Operators/monitoring-stack.svg

Lines changed: 1 addition & 1 deletion
