HDDS-14825. Add Grafana Dashboard and Metrics for ZDU #10602

Draft
errose28 wants to merge 26 commits into apache:HDDS-14496-zdu from errose28:worktree-version-metrics

@errose28 errose28 commented Jun 24, 2026

What changes were proposed in this pull request?

Changes generated by Claude Code with a spec, reviews, and edits by me.

Metrics

  • Added build info: revision (git hash/string), component (string), and version (string)
    • This information was already present in every build, but was not exposed as metrics.
    • To expose the strings, they are added as labels to a gauge with a constant value of 1. Thanks @octachoron for the tip.
  • Added S3 gateway client version
    • S3 Gateway does not use a software or apparent version since it is stateless. It does contain an Ozone client, so the client version is exposed as a metric instead.
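
As a rough illustration of the "labels on a constant gauge" approach described above, the sketch below renders one sample in the Prometheus text exposition format. It does not use the Hadoop metrics classes from this PR. The class name `BuildInfoSample`, the metric name `build_info`, and the label values are all hypothetical; the name Ozone actually exports may differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Illustrative only: an "info"-style gauge whose value is always 1. */
public class BuildInfoSample {

  /** Escapes a label value per the Prometheus text exposition format. */
  public static String escape(String value) {
    return value.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
  }

  /** Builds one sample line: the identifying strings live in labels, the value is constant. */
  public static String render(String metricName, Map<String, String> labels) {
    StringJoiner joined = new StringJoiner(",", "{", "}");
    for (Map.Entry<String, String> e : labels.entrySet()) {
      joined.add(e.getKey() + "=\"" + escape(e.getValue()) + "\"");
    }
    return metricName + joined + " 1";
  }

  public static void main(String[] args) {
    // Hypothetical values; insertion order is preserved by LinkedHashMap.
    Map<String, String> labels = new LinkedHashMap<>();
    labels.put("component", "om");
    labels.put("revision", "abc1234");
    labels.put("version", "2.1.0");
    System.out.println(render("build_info", labels));
  }
}
```

Because the value never changes, queries can group or join on the labels to recover the strings, which a numeric gauge could not carry on its own.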

Software and apparent version metrics for all relevant components were already published.

This PR removes one @Metric annotation as a workaround for #10523, which is still pending merge on master. We should be able to merge either PR independently and reconcile them once both land.

Dashboard

A Grafana dashboard was added to assist admins as they orchestrate the upgrade. Since it depends on the new metrics, it will only be usable when upgrading from the initial version that supports ZDU (just like the ZDU feature itself). However, once the metrics are present, it could be helpful even for a non-rolling upgrade.

Since this dashboard was designed with admins in mind, it does not expose software version, apparent version, or client versions, which are internal to the cluster. It only exposes admin-facing properties like "finalized" as a boolean state and the build version string. Internal version info is still accessible with PromQL for more dev-focused debugging as needed.

All panels were designed to account for large clusters so the dashboard remains readable even when there are 1000+ nodes. The tables are paginated and all other values are aggregates. The selectors at the top of the dashboard support drilling down to specific components as needed, while the banner at the top alerts that a filtered view is active.
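
To illustrate why aggregates stay readable at scale, here is a minimal sketch of rolling per-node build info up into per-component version counts. It is not the dashboard's actual query logic. The `VersionRollup` and `Sample` types, host names, and version strings are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: collapses per-node samples into per-component version counts. */
public class VersionRollup {

  /** One build-info observation for a single node (hypothetical shape). */
  public static final class Sample {
    public final String component;
    public final String host;
    public final String version;

    public Sample(String component, String host, String version) {
      this.component = component;
      this.host = host;
      this.version = version;
    }
  }

  /** Returns component -> (version -> node count); size is independent of node count. */
  public static Map<String, Map<String, Integer>> countByComponentAndVersion(List<Sample> samples) {
    Map<String, Map<String, Integer>> counts = new TreeMap<>();
    for (Sample s : samples) {
      counts.computeIfAbsent(s.component, k -> new TreeMap<>())
          .merge(s.version, 1, Integer::sum);
    }
    return counts;
  }

  /** A component is mid-upgrade in this sketch if its nodes report more than one version. */
  public static boolean isMixed(Map<String, Integer> versionCounts) {
    return versionCounts.size() > 1;
  }

  public static void main(String[] args) {
    List<Sample> samples = Arrays.asList(
        new Sample("datanode", "dn1", "2.1.0"),
        new Sample("datanode", "dn2", "2.1.0"),
        new Sample("datanode", "dn3", "2.0.0"),
        new Sample("om", "om1", "2.1.0"));
    Map<String, Map<String, Integer>> counts = countByComponentAndVersion(samples);
    for (Map.Entry<String, Map<String, Integer>> e : counts.entrySet()) {
      System.out.println(e.getKey() + " " + e.getValue() + " mixed=" + isMixed(e.getValue()));
    }
  }
}
```

The result has one row per component and version pair, so its size tracks the number of versions in flight and not the number of nodes.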

[Images: dashboard screenshots, captured 2026-06-24]

What is the link to the Apache JIRA

HDDS-14825

How was this patch tested?

Unit tests for the new metrics were added.

The dashboard can be manually viewed from Grafana in a local docker environment:

cd hadoop-ozone/dist/target/ozone-*/compose/ozone
COMPOSE_FILE=docker-compose.yaml:monitoring.yaml docker compose up --scale datanode=3 -d
# Go to http://localhost:3000/dashboards and select "Ozone - Rolling Upgrade"
# To tear down:
COMPOSE_FILE=docker-compose.yaml:monitoring.yaml docker compose down

The dashboard needs a few seconds to populate its values. Also narrow the time range to the last few minutes, since the default 30-minute window is hard to read when the cluster has only been live for a few seconds.

By default this runs with all nodes finalized and on the same version. To see the dashboard with a simulated in-progress upgrade, build Ozone with the following patch applied. Note that these injected values only demonstrate a range of possibilities on the dashboard at once; they do not reflect a realistic cluster state during an upgrade.

diff --git b/hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/server/http/BuildInfoMetrics.java a/hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/server/http/BuildInfoMetrics.java
index f8ce4cbdc0..68aace6544 100644
--- b/hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/server/http/BuildInfoMetrics.java
+++ a/hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/server/http/BuildInfoMetrics.java
@@ -74,9 +74,18 @@ public static synchronized BuildInfoMetrics create(String component) {
   public void getMetrics(MetricsCollector collector, boolean all) {
     MetricsRecordBuilder builder = collector.addRecord(RECORD_NAME)
         .add(new MetricsTag(
-            Interns.info("component", "Ozone component name"), component)).add(new MetricsTag(Interns.info("revision", "Source control revision"), revision))
-        .add(new MetricsTag(Interns.info("version", "Ozone build version"), version))
+            Interns.info("component", "Ozone component name"), component))
+        .add(new MetricsTag(
+            Interns.info("revision", "Source control revision"), revision))
         .addGauge(Interns.info("BuildInfo", "Always 1; identifying info is in labels"), 1L);
+
+    if (component.equals("hddsDatanode")) {
+      builder.add(new MetricsTag(Interns.info("version", "Ozone build version"), "2.1.0-TEST"));
+    } else {
+      builder.add(new MetricsTag(Interns.info("version", "Ozone build version"), version));
+    }
+
+
     builder.endRecord();
   }
 }
diff --git b/hadoop-ozone/dist/src/main/compose/ozone/docker-config a/hadoop-ozone/dist/src/main/compose/ozone/docker-config
index ecca3a971c..0c16691f2d 100644
--- b/hadoop-ozone/dist/src/main/compose/ozone/docker-config
+++ a/hadoop-ozone/dist/src/main/compose/ozone/docker-config
@@ -67,3 +67,8 @@ no_proxy=om,scm,s3g,recon,kdc,localhost,127.0.0.1
 
 # Explicitly enable filesystem snapshot feature for this Docker compose cluster
 OZONE-SITE.XML_ozone.filesystem.snapshot.enabled=true
+
+# Testing overrides for ZDU dashboard verification: start with apparent < software
+# to demonstrate divergence rendering. Revert before running acceptance tests.
+OZONE-SITE.XML_testing.ozone.om.init.apparent.version=7
+OZONE-SITE.XML_testing.hdds.scm.init.apparent.version=8

@errose28 errose28 requested review from dombizita and sodonnel June 24, 2026 23:03
@errose28 errose28 added the zdu Pull requests for Zero Downtime Upgrade (ZDU) https://issues.apache.org/jira/browse/HDDS-14496 label Jun 24, 2026
@errose28 errose28 marked this pull request as draft June 24, 2026 23:09