From 614bb66702befff224c2de7ddd5493f958af3c68 Mon Sep 17 00:00:00 2001
From: Matthias Hartmann
Date: Wed, 6 May 2026 16:02:35 +0200
Subject: [PATCH 01/12] docs(monitoring): use alloy instead of promtail

---
 docs/04-For Operators/05-monitoring.md     | 20 +++++++++++++++-----
 docs/04-For Operators/monitoring-stack.svg |  2 +-
 2 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/docs/04-For Operators/05-monitoring.md b/docs/04-For Operators/05-monitoring.md
index ca456ec9..4dbec5e1 100644
--- a/docs/04-For Operators/05-monitoring.md
+++ b/docs/04-For Operators/05-monitoring.md
@@ -13,7 +13,7 @@ sidebar_position: 5
 ## Logging
 
 Logs are being collected by
-[Promtail](https://grafana.com/docs/loki/latest/send-data/promtail/) and pushed
+[Grafana Alloy](https://grafana.com/docs/alloy/latest/) and pushed
 to a [Loki](https://grafana.com/docs/loki/latest/) instance running in the
 control plane. Loki is deployed in
 [monolithic mode](https://grafana.com/docs/loki/latest/setup/install/helm/install-monolithic/)
@@ -22,11 +22,21 @@ configuration parameters for the control plane in the control plane's
 [logging](https://github.com/metal-stack/metal-roles/blob/master/control-plane/roles/logging/README.md)
 role.
 
-In the partitions, Promtail is deployed inside a systemd-managed Docker
+In the partitions, Alloy is deployed inside a systemd-managed Docker
 container. Configuration parameters can be found in the partition's
-[promtail](https://github.com/metal-stack/metal-roles/blob/master/partition/roles/promtail/README.md)
-role. Which hosts Promtail collects from can be configured via the
-`prometheus_promtail_targets` variable.
+[alloy](https://github.com/metal-stack/metal-roles/blob/master/partition/roles/alloy/README.md)
+role. Which hosts Alloy collects from can be configured via the
+`prometheus_alloy_targets` variable.
+
+:::note Migrating from promtail
+
+The `promtail` role is deprecated and replaced by the `alloy` role. Refer to the
+[Migration from promtail](https://github.com/metal-stack/metal-roles/blob/master/partition/roles/alloy/README.md#migration-from-promtail)
+section of the partition alloy role's README and the
+[Migration from promtail](https://github.com/metal-stack/metal-roles/blob/master/control-plane/roles/logging/README.md#migration-from-promtail)
+section of the control-plane logging role's README for step-by-step instructions.
+
+:::
 
 ## Monitoring
 
diff --git a/docs/04-For Operators/monitoring-stack.svg b/docs/04-For Operators/monitoring-stack.svg
index 9ece9891..b661a9f3 100644
--- a/docs/04-For Operators/monitoring-stack.svg
+++ b/docs/04-For Operators/monitoring-stack.svg
@@ -1 +1 @@
-
[monitoring-stack.svg, removed revision: single-line SVG, markup stripped. Surviving text labels: Management Servers, Promtail, Prometheus, node_exporter, ipmi_exporter, blackbox_exporter, Exporters, Switches, sonic_exporter, Machines, BMC, Metal Partition, GCS, shoot-states, shoot-details, shoot-customizations, shoot-cluster, gardener-overview, alertmanager, sonic-exporter, rethinkdb, metal-api, machine-capacity, Gardener Dashboards, Grafana Dashboards, Metal Control Plane, filesystem, Loki, gardener-metrics-exporter, metal-metrics-exporter, event-exporter, rethinkdb-exporter, ServiceMonitors, ipam-db, masterdata-api, masterdata-db, metal-db, prometheus-operator, kube-prometheus, prometheus-adapter, Grafana, kube-state-metrics, Thanos. Fallback text: "Text is not SVG - cannot display".]
\ No newline at end of file +
[monitoring-stack.svg, added revision: single-line SVG, markup stripped. Surviving text labels: Management Servers, Alloy, Prometheus, node_exporter, ipmi_exporter, blackbox_exporter, Exporters, Switches, sonic_exporter, Machines, BMC, Metal Partition, GCS, shoot-states, shoot-details, shoot-customizations, shoot-cluster, gardener-overview, alertmanager, sonic-exporter, rethinkdb, metal-api, machine-capacity, Gardener Dashboards, Grafana Dashboards, Metal Control Plane, filesystem, Loki, gardener-metrics-exporter, metal-metrics-exporter, event-exporter, rethinkdb-exporter, ServiceMonitors, ipam-db, masterdata-api, masterdata-db, metal-db, prometheus-operator, kube-prometheus, prometheus-adapter, Grafana, kube-state-metrics, Thanos. Fallback text: "Text is not SVG - cannot display".]
From 2976a07f6261d0013ccdf8ce4fba63eb5a620938 Mon Sep 17 00:00:00 2001
From: Matthias Hartmann
Date: Thu, 7 May 2026 17:26:09 +0200
Subject: [PATCH 02/12] docs(monitoring): mention labels and typical setup

---
 docs/04-For Operators/05-monitoring.md | 30 ++++++++++++++++++++++----
 1 file changed, 26 insertions(+), 4 deletions(-)

diff --git a/docs/04-For Operators/05-monitoring.md b/docs/04-For Operators/05-monitoring.md
index 4dbec5e1..0c990da5 100644
--- a/docs/04-For Operators/05-monitoring.md
+++ b/docs/04-For Operators/05-monitoring.md
@@ -22,11 +22,33 @@ configuration parameters for the control plane in the control plane's
 [logging](https://github.com/metal-stack/metal-roles/blob/master/control-plane/roles/logging/README.md)
 role.
 
-In the partitions, Alloy is deployed inside a systemd-managed Docker
-container. Configuration parameters can be found in the partition's
+In the partitions, Alloy can be deployed inside a systemd-managed Docker
+container on management servers and switches. Configuration parameters can be found in the partition's
 [alloy](https://github.com/metal-stack/metal-roles/blob/master/partition/roles/alloy/README.md)
-role. Which hosts Alloy collects from can be configured via the
-`prometheus_alloy_targets` variable.
+role.
+
+### Partition Log Sources
+
+Alloy is configured through snippets that define what logs are collected. The following snippets are typically used:
+
+| Host type | Snippet | Description | Key labels |
+| ---------------------- | -------------- | ------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------- |
+| Leaves, spines, exits | `journal` | Collects logs from the systemd journal; auto-discovers both volatile (`/run/log/journal`) and persistent (`/var/log/journal`) storage | `job=systemd-journal`, `unit`, `level` |
+| Management servers | `journal-file` | Collects logs from the persistent systemd journal at a configurable path; supports migrating cursor position from promtail | `job=systemd-journal`, `unit`, `level` |
+| Hosts without journald | `syslog` | Tails `/var/log/syslog` | `job=syslog` |
+| Hosts running Docker | `docker` | Collects logs from all Docker containers via the Docker socket | `job=docker`, `container` |
+
+All log entries carry the `host` and `partition` labels regardless of snippet, which makes it easy to filter logs in Grafana Explore by host or partition.
+
+### Querying Logs in Grafana
+
+Logs can be explored in Grafana using the **Explore** view with the Loki data source. Useful label filters:
+
+- `{partition=""}` — all logs from a partition
+- `{host=""}` — all logs from a specific host
+- `{job="docker", container=""}` — logs from a specific Docker container
+- `{job="systemd-journal", unit=".service"}` — logs from a specific systemd unit
+- `{job="systemd-journal", level="error"}` — error-level journal entries across all units
 
 :::note Migrating from promtail
 
From 2247606cb07c601cec926f904d95ed98fd28281e Mon Sep 17 00:00:00 2001
From: Matthias Hartmann
Date: Tue, 19 May 2026 18:07:52 +0200
Subject: [PATCH 03/12] docs(monitoring): control-plane alloy

---
 docs/04-For Operators/05-monitoring.md        |  122 +-
 .../monitoring-stack.drawio.svg               | 1387 +++++++++++++++++
 docs/04-For Operators/monitoring-stack.svg    |    2 +-
 3 files changed, 1463 insertions(+), 48 deletions(-)
 create mode 100644 docs/04-For Operators/monitoring-stack.drawio.svg

diff --git a/docs/04-For Operators/05-monitoring.md b/docs/04-For Operators/05-monitoring.md
index 0c990da5..40f71fea 100644
--- a/docs/04-For Operators/05-monitoring.md
+++ b/docs/04-For Operators/05-monitoring.md
@@ -10,22 +10,41 @@ sidebar_position: 5
 
 ![Monitoring Stack](monitoring-stack.svg)
 
+The diagram above shows the full monitoring and logging stack: partition hosts ship logs to Loki and expose metrics for Prometheus scraping; control-plane and Gardener seed Alloy instances push both logs and self-metrics centrally; Grafana provides unified dashboards and alerting across all tiers.
+
 ## Logging
 
-Logs are being collected by
-[Grafana Alloy](https://grafana.com/docs/alloy/latest/) and pushed
-to a [Loki](https://grafana.com/docs/loki/latest/) instance running in the
-control plane. Loki is deployed in
-[monolithic mode](https://grafana.com/docs/loki/latest/setup/install/helm/install-monolithic/)
-and with storage type `'filesystem'`. You can find all logging related
-configuration parameters for the control plane in the control plane's
-[logging](https://github.com/metal-stack/metal-roles/blob/master/control-plane/roles/logging/README.md)
-role.
+[Grafana Alloy](https://grafana.com/docs/alloy/latest/) collects and pushes logs to a [Loki](https://grafana.com/docs/loki/latest/) instance
+running in the control plane.
 
-In the partitions, Alloy can be deployed inside a systemd-managed Docker
-container on management servers and switches. Configuration parameters can be found in the partition's
-[alloy](https://github.com/metal-stack/metal-roles/blob/master/partition/roles/alloy/README.md)
-role.
+Loki is deployed in [monolithic mode](https://grafana.com/docs/loki/latest/setup/install/helm/install-monolithic/) and with storage type `'filesystem'`.
+You can find all logging related configuration parameters for the control plane in the [logging](https://github.com/metal-stack/metal-roles/blob/master/control-plane/roles/logging/README.md) role.
+
+In the partitions, Alloy can be deployed inside a systemd-managed Docker container on management servers and switches.
+Configuration parameters can be found in the partition's [alloy](https://github.com/metal-stack/metal-roles/blob/master/partition/roles/alloy/README.md) role.
+
+### Control-Plane Log Sources
+
+In the control plane, Alloy runs as a Kubernetes DaemonSet and collects logs from two sources:
+
+| Source | Description | Key labels |
+| ----------------- | -------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------- |
+| Pod logs | Collected from all pods via the Kubernetes API (`loki.source.kubernetes`) | `cluster`, `namespace`, `pod`, `container`, `pod_uid`, `node_name`, `app`, `instance`, `component`, `job` |
+| Kubernetes events | Collected natively via `loki.source.kubernetes_events` — no separate event-exporter required | `cluster`, `job=monitoring/event-exporter`, `namespace` |
+
+All control-plane log entries carry a `cluster` label (configured via `logging_alloy_cluster_label`) identifying the control-plane stage.
+
+#### Gardener
+
+The [gardener-logging](https://github.com/metal-stack/metal-roles/blob/master/control-plane/roles/gardener-logging/README.md) role deploys an additional Alloy instance into each Gardener shooted seed and optionally into the garden cluster itself. These instances collect pod logs from their respective clusters and forward them to the same Loki instance in the metal-stack control plane. Logs carry a `cluster` label set to the shooted seed name, enabling per-seed filtering in Grafana.
+
+### Control-Plane: Querying Logs in Grafana
+
+- `{cluster=""}` — all logs from a control-plane stage
+- `{namespace=""}` — all logs from a specific namespace
+- `{job="/"}` — logs from a specific application
+- `{job="monitoring/event-exporter"}` — Kubernetes events
+- `{cluster=""}` — all logs from a specific Gardener shooted seed
 
 ### Partition Log Sources
 
@@ -38,11 +57,9 @@ Alloy is configured through snippets that define what logs are collected. The fo
 | Hosts without journald | `syslog` | Tails `/var/log/syslog` | `job=syslog` |
 | Hosts running Docker | `docker` | Collects logs from all Docker containers via the Docker socket | `job=docker`, `container` |
 
-All log entries carry the `host` and `partition` labels regardless of snippet, which makes it easy to filter logs in Grafana Explore by host or partition.
-
-### Querying Logs in Grafana
+### Partition: Querying Logs in Grafana
 
-Logs can be explored in Grafana using the **Explore** view with the Loki data source. Useful label filters:
+All log entries carry the `host` and `partition` labels regardless of snippet, which makes it easy to filter logs in Grafana Explore by host or partition.
 
 - `{partition=""}` — all logs from a partition
 - `{host=""}` — all logs from a specific host
@@ -50,13 +67,13 @@ Logs can be explored in Grafana using the **Explore** view with the Loki data so
 - `{job="systemd-journal", unit=".service"}` — logs from a specific systemd unit
 - `{job="systemd-journal", level="error"}` — error-level journal entries across all units
 
-:::note Migrating from promtail
+:::note Migrating from Promtail
+
+The `promtail` role is deprecated and replaced by the `alloy` role. Refer to the respective migration guides for step-by-step instructions:
 
-The `promtail` role is deprecated and replaced by the `alloy` role. Refer to the
-[Migration from promtail](https://github.com/metal-stack/metal-roles/blob/master/partition/roles/alloy/README.md#migration-from-promtail)
-section of the partition alloy role's README and the
-[Migration from promtail](https://github.com/metal-stack/metal-roles/blob/master/control-plane/roles/logging/README.md#migration-from-promtail)
-section of the control-plane logging role's README for step-by-step instructions.
+- [Partition](https://github.com/metal-stack/metal-roles/blob/master/partition/roles/alloy/README.md#migration-from-promtail) — partition alloy role
+- [Control-plane](https://github.com/metal-stack/metal-roles/blob/master/control-plane/roles/logging/README.md#migration-from-promtail) — control-plane logging role
+- [Gardener](https://github.com/metal-stack/metal-roles/blob/master/control-plane/roles/gardener-logging/README.md#migration-from-promtail) — gardener-logging role
 
 :::
 
@@ -65,31 +82,20 @@ section of the control-plane logging role's README for step-by-step instructions
 For monitoring we deploy the
 [kube-prometheus-stack](https://github.com/prometheus-operator/kube-prometheus)
 and a [Thanos](https://thanos.io/tip/thanos/getting-started.md/) instance in the
-control plane. Metrics for the control plane are supplied by
+control plane.
 
-- `metal-metrics-exporter`
-- `rethindb-exporter`
-- `event-exporter`
-- `gardener-metrics-exporter`
+### Control-Plane Metrics
 
-To query and visualize logs, metrics and alerts we deploy several grafana
-dashboards to the control plane:
+In-cluster components are scraped by Prometheus via `ServiceMonitor` resources (pull model).
+Alloy self-metrics use a different approach: the control-plane Alloy DaemonSet and all Gardener seed Alloy instances push their metrics via `prometheus.remote_write` to Thanos Receive (`monitoring_thanos_receive_enabled: true`), removing the need for Prometheus to reach into each cluster.
 
-- `grafana-dashboard-alertmanager`
-- `grafana-dashboard-machine-capacity`
-- `grafana-dashboard-metal-api`
-- `grafana-dashboard-rethinkdb`
-- `grafana-dashboard-sonic-exporter`
-
-and also some gardener related dashboards:
+Metrics are supplied by
 
-- `grafana-dashboard-gardener-overview`
-- `grafana-dashboard-shoot-cluster`
-- `grafana-dashboard-shoot-customizations`
-- `grafana-dashboard-shoot-details`
-- `grafana-dashboard-shoot-states`
+- `metal-metrics-exporter`
+- `rethinkdb-exporter`
+- `gardener-metrics-exporter`
 
-The following `ServiceMonitors` are also deployed:
+The following `ServiceMonitors` are deployed:
 
 - `gardener-metrics-exporter`
 - `ipam-db`
@@ -105,7 +111,9 @@ found in the control plane's
 [monitoring](https://github.com/metal-stack/metal-roles/blob/master/control-plane/roles/monitoring/README.md)
 role.
 
-Partition metrics are supplied by
+### Partition Metrics
+
+Partition metrics are collected via Prometheus scraping (pull model). Exporters running on partition hosts supply the metrics:
 
 - `node-exporter`
 - `blackbox-exporter`
@@ -113,16 +121,36 @@ Partition metrics are supplied by
 - `sonic-exporter`
 - `metal-core`
 - `frr-exporter`
+- `alloy`
 
-and scraped by Prometheus. For each of these exporters, the target hosts can be
-defined by
+Target hosts for each exporter are defined by
 
 - `prometheus_node_exporter_targets`
 - `prometheus_blackbox_exporter_targets`
-- `prometheus_frr_exporter_targets`
+- `prometheus_ipmi_exporter_targets`
 - `prometheus_sonic_exporter_targets`
 - `prometheus_metal_core_targets`
 - `prometheus_frr_exporter_targets`
+- `prometheus_alloy_targets`
+
+### Dashboards
+
+To query and visualize logs, metrics and alerts we deploy several grafana
+dashboards to the control plane:
+
+- `grafana-dashboard-alertmanager`
+- `grafana-dashboard-machine-capacity`
+- `grafana-dashboard-metal-api`
+- `grafana-dashboard-rethinkdb`
+- `grafana-dashboard-sonic-exporter`
+
+and also some Gardener related dashboards:
+
+- `grafana-dashboard-gardener-overview`
+- `grafana-dashboard-shoot-cluster`
+- `grafana-dashboard-shoot-customizations`
+- `grafana-dashboard-shoot-details`
+- `grafana-dashboard-shoot-states`
 
 ## Alerting
 
diff --git a/docs/04-For Operators/monitoring-stack.drawio.svg b/docs/04-For Operators/monitoring-stack.drawio.svg
new file mode 100644
index 00000000..6360cada
--- /dev/null
+++ b/docs/04-For Operators/monitoring-stack.drawio.svg
@@ -0,0 +1,1387 @@
[monitoring-stack.drawio.svg, new file: 1387 added lines of draw.io SVG, markup stripped. Surviving text labels: pull from, push to, Management Servers, Alloy, Prometheus, node_exporter, ipmi_exporter, blackbox_exporter, Exporters, Switches, sonic_exporter, Machines, BMC, Metal Partition, GCS, shoot-states, shoot-details, shoot-customizations, shoot-cluster, gardener-overview, alertmanager, sonic-exporter, rethinkdb, metal-api, machine-capacity, Gardener Dashboards, Grafana Dashboards, Metal Control Plane, filesystem, Loki, gardener-metrics-exporter, metal-metrics-exporter, rethinkdb-exporter, ServiceMonitors, ipam-db, masterdata-api, masterdata-db, metal-db, prometheus-operator, kube-prometheus, prometheus-adapter, Grafana, kube-state-metrics, Thanos. Fallback text: "Text is not SVG - cannot display".]
diff --git a/docs/04-For Operators/monitoring-stack.svg b/docs/04-For Operators/monitoring-stack.svg
index b661a9f3..fee49e9d 100644
--- a/docs/04-For Operators/monitoring-stack.svg
+++ b/docs/04-For Operators/monitoring-stack.svg
@@ -1 +1 @@
-
[monitoring-stack.svg, removed revision: single-line SVG, markup stripped. Surviving text labels: Management Servers, Alloy, Prometheus, node_exporter, ipmi_exporter, blackbox_exporter, Exporters, Switches, sonic_exporter, Machines, BMC, Metal Partition, GCS, shoot-states, shoot-details, shoot-customizations, shoot-cluster, gardener-overview, alertmanager, sonic-exporter, rethinkdb, metal-api, machine-capacity, Gardener Dashboards, Grafana Dashboards, Metal Control Plane, filesystem, Loki, gardener-metrics-exporter, metal-metrics-exporter, event-exporter, rethinkdb-exporter, ServiceMonitors, ipam-db, masterdata-api, masterdata-db, metal-db, prometheus-operator, kube-prometheus, prometheus-adapter, Grafana, kube-state-metrics, Thanos. Fallback text: "Text is not SVG - cannot display".]
+
[monitoring-stack.svg, added revision: single-line SVG, markup stripped. Surviving text labels: pull from, push to, Management Servers, Alloy, Prometheus, node_exporter, ipmi_exporter, blackbox_exporter, Exporters, Switches, sonic_exporter, Machines, BMC, Metal Partition, GCS, shoot-states, shoot-details, shoot-customizations, shoot-cluster, gardener-overview, alertmanager, sonic-exporter, rethinkdb, metal-api, machine-capacity, Gardener Dashboards, Grafana Dashboards, Metal Control Plane, filesystem, Loki, gardener-metrics-exporter, metal-metrics-exporter, rethinkdb-exporter, ServiceMonitors, ipam-db, masterdata-api, masterdata-db, metal-db, prometheus-operator, kube-prometheus, prometheus-adapter, Grafana, kube-state-metrics, Thanos. Fallback text: "Text is not SVG - cannot display".]
\ No newline at end of file
From 1e9798b303c9b1925ee5e582ae01c277bc6009b5 Mon Sep 17 00:00:00 2001
From: Matthias Hartmann
Date: Wed, 20 May 2026 17:06:29 +0200
Subject: [PATCH 04/12] docs: control plane metrics details

---
 docs/04-For Operators/05-monitoring.md | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/docs/04-For Operators/05-monitoring.md b/docs/04-For Operators/05-monitoring.md
index 40f71fea..e21b709c 100644
--- a/docs/04-For Operators/05-monitoring.md
+++ b/docs/04-For Operators/05-monitoring.md
@@ -87,7 +87,10 @@ control plane.
 ### Control-Plane Metrics
 
 In-cluster components are scraped by Prometheus via `ServiceMonitor` resources (pull model).
-Alloy self-metrics use a different approach: the control-plane Alloy DaemonSet and all Gardener seed Alloy instances push their metrics via `prometheus.remote_write` to Thanos Receive (`monitoring_thanos_receive_enabled: true`), removing the need for Prometheus to reach into each cluster.
+Alloy self-metrics use a different approach — no ServiceMonitor required:
+
+- The **control-plane Alloy** DaemonSet pushes metrics via `prometheus.remote_write` to the in-cluster Thanos Receive. Wired automatically when `monitoring_thanos_receive_enabled: true`.
+- **Gardener seed Alloy** instances have no local Prometheus, so they push metrics to the control-plane Thanos Receive ingress instead. Wired automatically when `monitoring_thanos_receive_ingress_enabled: true`.
 
 Metrics are supplied by
 
From a3677646d33ff7ebaa44240a61b34471cadab1b7 Mon Sep 17 00:00:00 2001
From: Matthias Hartmann
Date: Wed, 10 Jun 2026 09:27:00 +0200
Subject: [PATCH 05/12] docs: gardener logging and alloy metrics

---
 docs/04-For Operators/05-monitoring.md        | 14 ++++++------
 .../monitoring-stack.drawio.svg               | 22 +++++++++----------
 2 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/docs/04-For Operators/05-monitoring.md b/docs/04-For Operators/05-monitoring.md
index e21b709c..6e55c996 100644
--- a/docs/04-For Operators/05-monitoring.md
+++ b/docs/04-For Operators/05-monitoring.md
@@ -25,7 +25,7 @@ Configuration parameters can be found in the partition's [alloy](https://github.
 
 ### Control-Plane Log Sources
 
-In the control plane, Alloy runs as a Kubernetes DaemonSet and collects logs from two sources:
+In the control plane, Alloy runs as a Kubernetes `DaemonSet` and collects logs from two sources:
 
 | Source | Description | Key labels |
 | ----------------- | -------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------- |
@@ -36,7 +36,9 @@ All control-plane log entries carry a `cluster` label (configured via `logging_a
 
 #### Gardener
 
-The [gardener-logging](https://github.com/metal-stack/metal-roles/blob/master/control-plane/roles/gardener-logging/README.md) role deploys an additional Alloy instance into each Gardener shooted seed and optionally into the garden cluster itself. These instances collect pod logs from their respective clusters and forward them to the same Loki instance in the metal-stack control plane. Logs carry a `cluster` label set to the shooted seed name, enabling per-seed filtering in Grafana.
+Gardener ships with a built-in logging stack (Vali + fluent-bit per seed). The metal-stack deployment disables this stack and instead uses Alloy to forward all logs centrally — giving platform operators a single place to query infrastructure logs across all Gardener clusters.
+
+The [gardener-logging](https://github.com/metal-stack/metal-roles/blob/master/control-plane/roles/gardener-logging/README.md) role deploys an Alloy instance into each Gardener shooted seed and optionally into the garden cluster itself. These instances collect pod logs and Kubernetes events from their respective clusters and forward them to the same Loki instance in the metal-stack control plane. Logs carry a `cluster` label set to the cluster name (garden name or shooted seed name), enabling per-cluster filtering in Grafana.
 
 ### Control-Plane: Querying Logs in Grafana
 
@@ -44,7 +46,7 @@ The [gardener-logging](https://github.com/metal-stack/metal-roles/blob/master/co
 - `{namespace=""}` — all logs from a specific namespace
 - `{job="/"}` — logs from a specific application
 - `{job="monitoring/event-exporter"}` — Kubernetes events
-- `{cluster=""}` — all logs from a specific Gardener shooted seed
+- `{cluster=""}` — all logs from the Gardener garden cluster or a specific shooted seed
 
 ### Partition Log Sources
 
@@ -87,16 +89,14 @@ control plane.
 ### Control-Plane Metrics
 
 In-cluster components are scraped by Prometheus via `ServiceMonitor` resources (pull model).
-Alloy self-metrics use a different approach — no ServiceMonitor required:
-
-- The **control-plane Alloy** DaemonSet pushes metrics via `prometheus.remote_write` to the in-cluster Thanos Receive. Wired automatically when `monitoring_thanos_receive_enabled: true`.
-- **Gardener seed Alloy** instances have no local Prometheus, so they push metrics to the control-plane Thanos Receive ingress instead. Wired automatically when `monitoring_thanos_receive_ingress_enabled: true`.
 
 Metrics are supplied by
 
 - `metal-metrics-exporter`
 - `rethinkdb-exporter`
 - `gardener-metrics-exporter`
+- `alloy` (control-plane) — self-metrics, disabled by default; see the [logging role](https://github.com/metal-stack/metal-roles/blob/master/control-plane/roles/logging/README.md#meta-monitoring) for configuration
+- `alloy` (gardens and seeds) — self-metrics, disabled by default; see the [gardener-logging role](https://github.com/metal-stack/metal-roles/blob/master/control-plane/roles/gardener-logging/README.md#meta-monitoring) for configuration
 
 The following `ServiceMonitors` are deployed:
 
diff --git a/docs/04-For Operators/monitoring-stack.drawio.svg b/docs/04-For Operators/monitoring-stack.drawio.svg
index 6360cada..4c616975 100644
--- a/docs/04-For Operators/monitoring-stack.drawio.svg
+++ b/docs/04-For Operators/monitoring-stack.drawio.svg
[SVG markup stripped; only hunk headers and text labels survive: @@ -1,4 +1,4 @@, @@ -12,7 +12,7 @@ (pull from), @@ -20,7 +20,7 @@ (pull from), @@ -34,7 +34,7 @@ (push to), @@ -42,7 +42,7 @@ (push to), @@ -431,7 +431,7 @@, @@ -1348,10 +1348,6 @@, @@ -1360,12 +1356,16 @@]
From d09f1690486f293e2038706a47832321d05c82db Mon Sep 17 00:00:00 2001
From: Matthias Hartmann
Date: Mon, 15 Jun 2026 13:48:41 +0200
Subject: [PATCH 06/12] chore: logging-common

---
 docs/04-For Operators/05-monitoring.md | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/docs/04-For Operators/05-monitoring.md b/docs/04-For Operators/05-monitoring.md
index 6e55c996..c3cfe42f 100644
--- a/docs/04-For Operators/05-monitoring.md
+++ b/docs/04-For Operators/05-monitoring.md
@@ -27,10 +27,10 @@ Configuration parameters can be found in the partition's [alloy](https://github.
 
 In the control plane, Alloy runs as a Kubernetes `DaemonSet` and collects logs from two sources:
 
-| Source | Description | Key labels |
-| ----------------- | -------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------- |
-| Pod logs | Collected from all pods via the Kubernetes API (`loki.source.kubernetes`) | `cluster`, `namespace`, `pod`, `container`, `pod_uid`, `node_name`, `app`, `instance`, `component`, `job` |
-| Kubernetes events | Collected natively via `loki.source.kubernetes_events` — no separate event-exporter required | `cluster`, `job=monitoring/event-exporter`, `namespace` |
+| Source | Description | Key labels |
+| ----------------- | ------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------- |
+| Pod logs | Read from the node filesystem (`/var/log/pods`, `loki.source.file`). Each DaemonSet pod collects only pods scheduled on its own node.
| `cluster`, `namespace`, `pod`, `container`, `pod_uid`, `node_name`, `app`, `instance`, `component`, `job` | +| Kubernetes events | Collected natively via `loki.source.kubernetes_events` with clustering-based leader election — no separate event-exporter required | `cluster`, `job=monitoring/event-exporter`, `namespace` | All control-plane log entries carry a `cluster` label (configured via `logging_alloy_cluster_label`) identifying the control-plane stage. @@ -38,7 +38,7 @@ All control-plane log entries carry a `cluster` label (configured via `logging_a Gardener ships with a built-in logging stack (Vali + fluent-bit per seed). The metal-stack deployment disables this stack and instead uses Alloy to forward all logs centrally — giving platform operators a single place to query infrastructure logs across all Gardener clusters. -The [gardener-logging](https://github.com/metal-stack/metal-roles/blob/master/control-plane/roles/gardener-logging/README.md) role deploys an Alloy instance into each Gardener shooted seed and optionally into the garden cluster itself. These instances collect pod logs and Kubernetes events from their respective clusters and forward them to the same Loki instance in the metal-stack control plane. Logs carry a `cluster` label set to the cluster name (garden name or shooted seed name), enabling per-cluster filtering in Grafana. +The [gardener-logging](https://github.com/metal-stack/metal-roles/blob/master/control-plane/roles/gardener-logging/README.md) role deploys an Alloy instance into each Gardener shooted seed and optionally into the garden cluster itself. These instances read pod logs from the node filesystem and collect Kubernetes events, forwarding everything to the same Loki instance in the metal-stack control plane. Logs carry a `cluster` label set to the cluster name (garden name or shooted seed name), enabling per-cluster filtering in Grafana. 
### Control-Plane: Querying Logs in Grafana @@ -74,8 +74,7 @@ All log entries carry the `host` and `partition` labels regardless of snippet, w The `promtail` role is deprecated and replaced by the `alloy` role. Refer to the respective migration guides for step-by-step instructions: - [Partition](https://github.com/metal-stack/metal-roles/blob/master/partition/roles/alloy/README.md#migration-from-promtail) — partition alloy role -- [Control-plane](https://github.com/metal-stack/metal-roles/blob/master/control-plane/roles/logging/README.md#migration-from-promtail) — control-plane logging role -- [Gardener](https://github.com/metal-stack/metal-roles/blob/master/control-plane/roles/gardener-logging/README.md#migration-from-promtail) — gardener-logging role +- [Control-plane and Gardener](https://github.com/metal-stack/metal-roles/blob/master/control-plane/roles/logging-common/README.md#migration-from-promtail) — logging-common migration guide (applies to both the logging and gardener-logging roles) ::: @@ -95,8 +94,8 @@ Metrics are supplied by - `metal-metrics-exporter` - `rethinkdb-exporter` - `gardener-metrics-exporter` -- `alloy` (control-plane) — self-metrics, disabled by default; see the [logging role](https://github.com/metal-stack/metal-roles/blob/master/control-plane/roles/logging/README.md#meta-monitoring) for configuration -- `alloy` (gardens and seeds) — self-metrics, disabled by default; see the [gardener-logging role](https://github.com/metal-stack/metal-roles/blob/master/control-plane/roles/gardener-logging/README.md#meta-monitoring) for configuration +- `alloy` (control-plane) — self-metrics, disabled by default; see [logging-common](https://github.com/metal-stack/metal-roles/blob/master/control-plane/roles/logging-common/README.md#meta-monitoring) for configuration +- `alloy` (gardens and seeds) — self-metrics, disabled by default, push-only (no ServiceMonitor); see 
[logging-common](https://github.com/metal-stack/metal-roles/blob/master/control-plane/roles/logging-common/README.md#meta-monitoring) for configuration The following `ServiceMonitors` are deployed: From 0b31d0df0e9ae0ea03e34729d5b44578f96c961d Mon Sep 17 00:00:00 2001 From: Matthias Hartmann Date: Mon, 15 Jun 2026 16:46:08 +0200 Subject: [PATCH 07/12] chore: rename alloy job monitoring/event-exporter to events --- docs/04-For Operators/05-monitoring.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/04-For Operators/05-monitoring.md b/docs/04-For Operators/05-monitoring.md index c3cfe42f..c5f04b48 100644 --- a/docs/04-For Operators/05-monitoring.md +++ b/docs/04-For Operators/05-monitoring.md @@ -30,7 +30,7 @@ In the control plane, Alloy runs as a Kubernetes `DaemonSet` and collects logs f | Source | Description | Key labels | | ----------------- | ------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------- | | Pod logs | Read from the node filesystem (`/var/log/pods`, `loki.source.file`). Each DaemonSet pod collects only pods scheduled on its own node. | `cluster`, `namespace`, `pod`, `container`, `pod_uid`, `node_name`, `app`, `instance`, `component`, `job` | -| Kubernetes events | Collected natively via `loki.source.kubernetes_events` with clustering-based leader election — no separate event-exporter required | `cluster`, `job=monitoring/event-exporter`, `namespace` | +| Kubernetes events | Collected natively via `loki.source.kubernetes_events` with clustering-based leader election — no separate event-exporter required | `cluster`, `job=events`, `namespace` | All control-plane log entries carry a `cluster` label (configured via `logging_alloy_cluster_label`) identifying the control-plane stage. 
@@ -45,7 +45,7 @@ The [gardener-logging](https://github.com/metal-stack/blob/master/control-plane/ - `{cluster=""}` — all logs from a control-plane stage - `{namespace=""}` — all logs from a specific namespace - `{job="/"}` — logs from a specific application -- `{job="monitoring/event-exporter"}` — Kubernetes events +- `{job="events"}` — Kubernetes events _(recently renamed from `monitoring/event-exporter`)_ - `{cluster=""}` — all logs from the Gardener garden cluster or a specific shooted seed ### Partition Log Sources From b3519c81c564a2aebc24f2dbf928ded0792efb2c Mon Sep 17 00:00:00 2001 From: Matthias Hartmann Date: Mon, 15 Jun 2026 17:22:41 +0200 Subject: [PATCH 08/12] docs: custom snippets --- docs/04-For Operators/05-monitoring.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/docs/04-For Operators/05-monitoring.md b/docs/04-For Operators/05-monitoring.md index c5f04b48..960d08fc 100644 --- a/docs/04-For Operators/05-monitoring.md +++ b/docs/04-For Operators/05-monitoring.md @@ -50,7 +50,7 @@ The [gardener-logging](https://github.com/metal-stack/blob/master/control-plane/ ### Partition Log Sources -Alloy is configured through snippets that define what logs are collected. The following snippets are typically used: +Alloy is configured through snippets that define what logs are collected. The following built-in snippets are available: | Host type | Snippet | Description | Key labels | | ---------------------- | -------------- | ------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------- | @@ -59,6 +59,8 @@ Alloy is configured through snippets that define what logs are collected. 
The fo | Hosts without journald | `syslog` | Tails `/var/log/syslog` | `job=syslog` | | Hosts running Docker | `docker` | Collects logs from all Docker containers via the Docker socket | `job=docker`, `container` | +Custom log sources can be added without modifying the role by providing your own Jinja2 snippet templates and referencing them via `alloy_config_custom_snippets` in your inventory. See the [alloy role](https://github.com/metal-stack/metal-roles/blob/master/partition/roles/alloy/README.md#customizing-the-config) for details. + ### Partition: Querying Logs in Grafana All log entries carry the `host` and `partition` labels regardless of snippet, which makes it easy to filter logs in Grafana Explore by host or partition. From 36765231f2054e19049f3b0e1928fbe32bd86fcd Mon Sep 17 00:00:00 2001 From: Matthias Hartmann Date: Mon, 15 Jun 2026 17:36:15 +0200 Subject: [PATCH 09/12] chore: review comments --- docs/04-For Operators/05-monitoring.md | 8 ++++---- docs/04-For Operators/monitoring-stack.drawio.svg | 10 +++++----- docs/04-For Operators/monitoring-stack.svg | 1 - 3 files changed, 9 insertions(+), 10 deletions(-) delete mode 100644 docs/04-For Operators/monitoring-stack.svg diff --git a/docs/04-For Operators/05-monitoring.md b/docs/04-For Operators/05-monitoring.md index 960d08fc..776062fc 100644 --- a/docs/04-For Operators/05-monitoring.md +++ b/docs/04-For Operators/05-monitoring.md @@ -8,7 +8,7 @@ sidebar_position: 5 ## Overview -![Monitoring Stack](monitoring-stack.svg) +![Monitoring Stack](monitoring-stack.drawio.svg) The diagram above shows the full monitoring and logging stack: partition hosts ship logs to Loki and expose metrics for Prometheus scraping; control-plane and Gardener seed Alloy instances push both logs and self-metrics centrally; Grafana provides unified dashboards and alerting across all tiers. 
@@ -36,7 +36,8 @@ All control-plane log entries carry a `cluster` label (configured via `logging_a #### Gardener -Gardener ships with a built-in logging stack (Vali + fluent-bit per seed). The metal-stack deployment disables this stack and instead uses Alloy to forward all logs centrally — giving platform operators a single place to query infrastructure logs across all Gardener clusters. +Gardener ships with a built-in logging stack (Vali + fluent-bit per seed), which can be used as-is or replaced or complemented by metal-stack's Alloy + Loki solution. +The metal-stack roles provide their own centralized logging stack based on Alloy and Loki, giving platform operators a single place to query infrastructure logs across all Gardener clusters. The [gardener-logging](https://github.com/metal-stack/metal-roles/blob/master/control-plane/roles/gardener-logging/README.md) role deploys an Alloy instance into each Gardener shooted seed and optionally into the garden cluster itself. These instances read pod logs from the node filesystem and collect Kubernetes events, forwarding everything to the same Loki instance in the metal-stack control plane. Logs carry a `cluster` label set to the cluster name (garden name or shooted seed name), enabling per-cluster filtering in Grafana. @@ -91,7 +92,7 @@ control plane. In-cluster components are scraped by Prometheus via `ServiceMonitor` resources (pull model). -Metrics are supplied by +Additional metrics are supplied by - `metal-metrics-exporter` - `rethinkdb-exporter` - `gardener-metrics-exporter` @@ -124,7 +125,6 @@ Partition metrics are collected via Prometheus scraping (pull model). 
Exporters - `ipmi-exporter` - `sonic-exporter` - `metal-core` -- `frr-exporter` - `alloy` Target hosts for each exporter are defined by diff --git a/docs/04-For Operators/monitoring-stack.drawio.svg b/docs/04-For Operators/monitoring-stack.drawio.svg index 4c616975..00d4a33f 100644 --- a/docs/04-For Operators/monitoring-stack.drawio.svg +++ b/docs/04-For Operators/monitoring-stack.drawio.svg @@ -1,4 +1,4 @@ - + @@ -431,8 +431,8 @@ - - + + @@ -1365,8 +1365,8 @@ - - + + diff --git a/docs/04-For Operators/monitoring-stack.svg b/docs/04-For Operators/monitoring-stack.svg deleted file mode 100644 index fee49e9d..00000000 --- a/docs/04-For Operators/monitoring-stack.svg +++ /dev/null @@ -1 +0,0 @@ -
[SVG payload of monitoring-stack.svg (Monitoring Stack diagram); only the text labels survive. Edge labels: pull from, push to. Node labels in source order: Management Servers, Alloy, Prometheus, node_exporter, ipmi_exporter, blackbox_exporter, Exporters, Switches, Alloy, Exporters, node_exporter, sonic_exporter, blackbox_exporter, Machines, BMC, Metal Partition, GCS, shoot-states, shoot-details, shoot-customizations, shoot-cluster, gardener-overview, alertmanager, sonic-exporter, rethinkdb, metal-api, machine-capacity, Gardener Dashboards, Grafana Dashboards, Metal Control Plane, Alloy, filesystem, Loki, Exporters, gardener-metrics-exporter, metal-metrics-exporter, rethinkdb-exporter, ServiceMonitors, gardener-metrics-exporter, ipam-db, masterdata-api, masterdata-db, metal-db, rethinkdb-exporter, metal-metrics-exporter, metal-api, prometheus-operator, kube-prometheus, node_exporter, blackbox_exporter, prometheus-adapter, Grafana, kube-state-metrics, Prometheus, alertmanager, Thanos.]
\ No newline at end of file From 4360c0273593695250d1b984475467566c283d59 Mon Sep 17 00:00:00 2001 From: Matthias Hartmann Date: Tue, 16 Jun 2026 10:40:16 +0200 Subject: [PATCH 10/12] fix: new logo --- .../monitoring-stack.drawio.svg | 24 +++++++++---------- 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/docs/04-For Operators/monitoring-stack.drawio.svg b/docs/04-For Operators/monitoring-stack.drawio.svg index 00d4a33f..42eb1d1d 100644 --- a/docs/04-For Operators/monitoring-stack.drawio.svg +++ b/docs/04-For Operators/monitoring-stack.drawio.svg @@ -1,4 +1,4 @@ - + @@ -12,7 +12,7 @@ -
+
pull from @@ -20,7 +20,7 @@
- + pull from @@ -34,7 +34,7 @@ -
+
push to @@ -42,7 +42,7 @@
- + push to @@ -406,9 +406,6 @@ - - - @@ -431,8 +428,11 @@ - - + + + + + @@ -758,7 +758,7 @@ - + @@ -1365,7 +1365,7 @@ - + From 6778f051f06adf18740d6627e5e2b5ef4dfec246 Mon Sep 17 00:00:00 2001 From: Matthias Hartmann Date: Tue, 16 Jun 2026 15:16:08 +0200 Subject: [PATCH 11/12] docs: finalize introduction --- docs/04-For Operators/05-monitoring.md | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/docs/04-For Operators/05-monitoring.md b/docs/04-For Operators/05-monitoring.md index 776062fc..57703acd 100644 --- a/docs/04-For Operators/05-monitoring.md +++ b/docs/04-For Operators/05-monitoring.md @@ -10,7 +10,11 @@ sidebar_position: 5 ![Monitoring Stack](monitoring-stack.drawio.svg) -The diagram above shows the full monitoring and logging stack: partition hosts ship logs to Loki and expose metrics for Prometheus scraping; control-plane and Gardener seed Alloy instances push both logs and self-metrics centrally; Grafana provides unified dashboards and alerting across all tiers. +The diagram above shows the full monitoring and logging stack. metal-stack supports the deployment of a central monitoring control plane, with Grafana providing unified dashboards and alerting across all tiers. + +In a **partition**, hosts ship logs to Loki via Alloy. A partition-local Prometheus scrapes exporters in the switch plane and remote-writes the collected metrics to the centralized Thanos ingress, enabling long-term metric persistence and compaction. + +The **control-plane** and **Gardener** seed Alloy instances push both logs and self-metrics directly to the centralized monitoring control plane. 
## Logging From 2e416985eb374453e4229e37d4f878157a88d159 Mon Sep 17 00:00:00 2001 From: Matthias Hartmann Date: Tue, 16 Jun 2026 15:21:23 +0200 Subject: [PATCH 12/12] chore: render svg --- docs/04-For Operators/05-monitoring.md | 2 +- docs/04-For Operators/monitoring-stack.svg | 1 + 2 files changed, 2 insertions(+), 1 deletion(-) create mode 100644 docs/04-For Operators/monitoring-stack.svg diff --git a/docs/04-For Operators/05-monitoring.md b/docs/04-For Operators/05-monitoring.md index 57703acd..ab59812b 100644 --- a/docs/04-For Operators/05-monitoring.md +++ b/docs/04-For Operators/05-monitoring.md @@ -8,7 +8,7 @@ sidebar_position: 5 ## Overview -![Monitoring Stack](monitoring-stack.drawio.svg) +![Monitoring Stack](monitoring-stack.svg) The diagram above shows the full monitoring and logging stack. metal-stack supports the deployment of a central monitoring control plane, with Grafana providing unified dashboards and alerting across all tiers. diff --git a/docs/04-For Operators/monitoring-stack.svg b/docs/04-For Operators/monitoring-stack.svg new file mode 100644 index 00000000..3c22ba9f --- /dev/null +++ b/docs/04-For Operators/monitoring-stack.svg @@ -0,0 +1 @@ +
[SVG payload of the re-rendered monitoring-stack.svg (Monitoring Stack diagram); only the text labels survive. Edge labels: pull from, push to. Node labels in source order: Management Servers, Alloy, Prometheus, node_exporter, ipmi_exporter, blackbox_exporter, Exporters, Switches, Alloy, Exporters, node_exporter, sonic_exporter, blackbox_exporter, Machines, BMC, Metal Partition, GCS, shoot-states, shoot-details, shoot-customizations, shoot-cluster, gardener-overview, alertmanager, sonic-exporter, rethinkdb, metal-api, machine-capacity, Gardener Dashboards, Grafana Dashboards, Metal Control Plane, Alloy, filesystem, Loki, Exporters, gardener-metrics-exporter, metal-metrics-exporter, rethinkdb-exporter, ServiceMonitors, gardener-metrics-exporter, ipam-db, masterdata-api, masterdata-db, metal-db, rethinkdb-exporter, metal-metrics-exporter, metal-api, prometheus-operator, kube-prometheus, node_exporter, blackbox_exporter, prometheus-adapter, Grafana, kube-state-metrics, Prometheus, alertmanager, Thanos.]
\ No newline at end of file