HDDS-14825. Add Grafana Dashboard and Metrics for ZDU#10602
Draft
errose28 wants to merge 26 commits into
What changes were proposed in this pull request?
Changes generated by Claude Code with a spec, reviews, and edits by me.
Metrics
Software and apparent version metrics for all relevant components were already published.
This PR removes one @Metric annotation as a workaround for #10523, which is still pending merge on master. We should be able to merge either PR independently and reconcile them when they both land.
Dashboard
A Grafana dashboard was added to assist admins as they orchestrate the upgrade. Since this depends on the new metrics, it will only be usable when upgrading from the initial version that supports ZDU (just like the ZDU feature itself). However, once the metrics are present, it could be helpful even for a non-rolling upgrade.
Since this dashboard was designed with admins in mind, it does not expose software version, apparent version, or client versions, which are internal to the cluster. It only exposes admin-facing properties like "finalized" as a boolean state and the build version string. Internal version info is still accessible with PromQL for more dev-focused debugging as needed.
All panels were designed to account for large clusters so the dashboard remains readable even when there are 1000+ nodes. The tables are paginated and all other values are aggregates. The selectors at the top of the dashboard support drilling down to specific components as needed, while the banner at the top alerts that a filtered view is active.
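To illustrate the aggregate-plus-pagination approach described above, here is a minimal Python sketch. It is not the dashboard's implementation (the dashboard is built from Grafana panels and PromQL). The `NodeStatus` record, the `summarize` and `page` helpers, the component names, and the page size are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeStatus:
    """One node's admin-facing upgrade state (hypothetical record shape)."""
    component: str      # hypothetical label, e.g. "datanode"
    hostname: str
    build_version: str  # build version string reported by the node
    finalized: bool


def summarize(nodes, component=None):
    """Collapse per-node rows into the kind of aggregates a panel could show."""
    selected = [n for n in nodes if component is None or n.component == component]
    return {
        # Mirrors the banner that signals a filtered view is active.
        "filtered": component is not None,
        "total": len(selected),
        "finalized": sum(1 for n in selected if n.finalized),
        "by_build_version": dict(Counter(n.build_version for n in selected)),
    }


def page(rows, page_number, page_size=25):
    """Return one page of table rows; pages are 1-indexed."""
    start = (page_number - 1) * page_size
    return rows[start:start + page_size]
```

With 1000 or more nodes, a summary like this stays a handful of numbers regardless of cluster size, and only one page of per-node rows is shown at a time.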
What is the link to the Apache JIRA
HDDS-14825
How was this patch tested?
Unit tests for the new metrics were added.
The dashboard can be manually viewed from Grafana in a local docker environment:
The dashboard will need a few seconds to populate the values. Also narrow the time range to the last few minutes, since the default 30-minute window will be hard to read when the cluster has only been live for a few seconds.
By default, this will run with all nodes finalized and on the same version. To see the dashboard with a simulated in-progress upgrade, build Ozone with the following patch applied. Note that these injected values are only meant to demonstrate a range of possibilities on the dashboard at once. They do not reflect a realistic state for the cluster during an upgrade.