HDDS-14039. Create Grafana dashboard for Ozone SCM safemode rules and exit#9400

Merged
jojochuang merged 9 commits into
apache:masterfrom
sreejasahithi:HDDS-14039
Jan 13, 2026

@sreejasahithi sreejasahithi commented Dec 1, 2025

Contributor

What changes were proposed in this pull request?

This patch introduces a new dashboard "SCM Safemode" in Grafana which contains a chart for each safemode rule displaying its target and actual value. It also displays if the SCM is in safemode or not by showing "In Safemode" in red and "Exited safemode" in green respectively.

What is the link to the Apache JIRA

HDDS-14039

How was this patch tested?

Green CI : https://github.com/sreejasahithi/ozone/actions/runs/19803545209

Tested over docker cluster:
[Image: Screenshot 2025-12-29 at 5 56 43 PM]

@jojochuang jojochuang self-requested a review December 1, 2025 17:22
@jojochuang

Contributor

@rnblough

@jojochuang

Contributor

IMO it would be even better if you can display "In Safe Mode" and "Exited Safe Mode" instead of the numerical 0 and 1.

@sreejasahithi

Contributor Author

@errose28 could you please review this PR?

@rnblough

rnblough commented Dec 3, 2025

Contributor

While clear to me, I expect a little confusion from the graph being labelled Binary but going up to 2. Can the Binary axis be limited to 1?

@sumitagrawl

sumitagrawl commented Dec 5, 2025

Contributor

@sreejasahithi These metrics are applicable only during startup, for safemode exit info. Maybe we do not need a dashboard for this. While debugging, we can run a JMX query to check the status.
cc: @errose28

@errose28

errose28 commented Dec 5, 2025

Contributor

This is essential for Ozone cluster admins/operators to monitor how long it takes their SCM to come out of safemode, especially when doing a rolling restart. Raw JMX queries are poor for usability and do not show trends over time. The dashboard is in its own file and can be ignored without harm when it is not needed.

@Tejaskriya Tejaskriya left a comment

Contributor

Thanks for working on this @sreejasahithi, left a few suggestions below.

@Tejaskriya

Contributor

Also, could you check why the CI seems to be failing?

@sreejasahithi

sreejasahithi commented Dec 10, 2025

Contributor Author

Also, could you check why the CI seems to be failing?

I think the failure is not related to my changes.
Could you please help me re-trigger the test?

@Tejaskriya

Contributor

It seems to be failing at the "Download ozone binary tar" stage. I tried re-triggering it a couple of times, but it still fails. I'll trigger the full run again and let's see if it helps. Can you merge master too when you push the next set of commits?

@errose28 errose28 self-requested a review December 11, 2025 18:18

@Tejaskriya Tejaskriya left a comment

Contributor

Thanks for the update @sreejasahithi , just a suggestion for the tests.

Comment on lines +156 to +157
GenericTestUtils.waitFor(() -> !scmSafeModeManager.getInSafeMode() &&
scmSafeModeManager.getSafeModeMetrics().getScmInSafeMode().value() == 0,

Contributor

I would suggest separating the metrics check from the waitFor block in all occurrences. That way, if there were a failure in the metrics-capturing logic rather than in the actual status of SCM, debugging would be easier.
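
A minimal sketch of the separation suggested above, written with stand-in types so that it compiles without Ozone on the classpath. `FakeSafeModeManager`, `getScmInSafeModeGauge`, and the local `waitFor` helper are hypothetical placeholders for the real safemode manager, its metrics accessor, and `GenericTestUtils.waitFor`; only the shape of the test is the point, not the actual change made in this PR.

```java
import java.util.function.BooleanSupplier;

public class SeparateMetricCheck {

  /** Hypothetical stand-in for the SCM safemode manager and its in-safemode gauge. */
  public static final class FakeSafeModeManager {
    private volatile boolean inSafeMode = true;
    private volatile int scmInSafeModeGauge = 1;

    public boolean getInSafeMode() { return inSafeMode; }
    public int getScmInSafeModeGauge() { return scmInSafeModeGauge; }

    public void exitSafeMode() {
      inSafeMode = false;
      scmInSafeModeGauge = 0;
    }
  }

  /** Hypothetical polling helper: retries the check until it holds or the timeout elapses. */
  public static void waitFor(BooleanSupplier check, long intervalMs, long timeoutMs) {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (!check.getAsBoolean()) {
      if (System.currentTimeMillis() > deadline) {
        throw new IllegalStateException("condition not met within " + timeoutMs + " ms");
      }
      try {
        Thread.sleep(intervalMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException("interrupted while waiting", e);
      }
    }
  }

  /** Waits on the safemode state only, then reads the gauge as a separate step. */
  public static int waitThenReadGauge(FakeSafeModeManager manager) {
    waitFor(() -> !manager.getInSafeMode(), 10, 1000);
    return manager.getScmInSafeModeGauge();
  }

  public static void main(String[] args) {
    FakeSafeModeManager manager = new FakeSafeModeManager();
    manager.exitSafeMode();
    int gauge = waitThenReadGauge(manager);
    // A failure here points at the metric, not at the safemode state.
    if (gauge != 0) {
      throw new AssertionError("expected gauge 0 after safemode exit, got " + gauge);
    }
    System.out.println("gauge after exit = " + gauge);
  }
}
```

With this split, a timeout from `waitFor` indicates that SCM never left safemode, while a failed gauge assertion indicates that the state changed but the metric did not follow.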

@errose28

Contributor

Thanks for adding this. I pulled up the Grafana chart in docker to look around.

IMO it would be even better if you can display "In Safe Mode" and "Exited Safe Mode" instead of the numerical 0 and 1.

+1 to Wei-Chiu's comment here. We can have text labels and see enter/exit safemode trends over time with Grafana's state timeline. Can we switch the binary plot to use this instead? A red block would indicate when an SCM was in safemode, and a green block would indicate that it is out.
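
To illustrate what a state-timeline view does with the 0/1 gauge, here is a rough sketch that collapses a series of samples into labelled, coloured blocks. It assumes that 1 means "in safemode" and 0 means "exited", which matches the test snippet quoted earlier in this review; the class and method names are hypothetical, and in the actual dashboard Grafana performs this mapping itself.

```java
import java.util.ArrayList;
import java.util.List;

public class SafemodeTimeline {

  /** One contiguous block on the timeline. */
  public static final class Block {
    public final String label;
    public final String color;
    public final long startMs;
    public final long endMs;

    Block(String label, String color, long startMs, long endMs) {
      this.label = label;
      this.color = color;
      this.startMs = startMs;
      this.endMs = endMs;
    }
  }

  /** Assumption: gauge value 1 means in safemode, anything else means exited. */
  public static String label(int value) {
    return value == 1 ? "In Safe Mode" : "Exited Safe Mode";
  }

  public static String color(int value) {
    return value == 1 ? "red" : "green";
  }

  /** Collapses consecutive samples with the same value into a single block. */
  public static List<Block> toBlocks(long[] timestampsMs, int[] values) {
    if (timestampsMs.length != values.length) {
      throw new IllegalArgumentException("timestamps and values must have equal length");
    }
    List<Block> blocks = new ArrayList<>();
    int start = 0;
    for (int i = 1; i <= values.length; i++) {
      boolean endOfRun = i == values.length || values[i] != values[start];
      if (endOfRun) {
        blocks.add(new Block(label(values[start]), color(values[start]),
            timestampsMs[start], timestampsMs[i - 1]));
        start = i;
      }
    }
    return blocks;
  }

  public static void main(String[] args) {
    long[] ts = {0, 10, 20, 30, 40};
    int[] gauge = {1, 1, 0, 0, 0};
    for (Block b : toBlocks(ts, gauge)) {
      System.out.println(b.label + " (" + b.color + "): " + b.startMs + " to " + b.endMs);
    }
  }
}
```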

For the threshold to exit safemode on each rule, the two solid lines on top of each other are difficult to read. We can either use a dashed line for the target value, or use a gradient fill where the area at/above the threshold is green and the area below is red. Also, the thresholds are expected to be the same for all SCMs with HA. I think it would be easier to read if we just take the max of thresholds returned by each SCM as a way to reduce this to a single number, and plot that as the exit criteria without a corresponding hostname label.
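
The reduction described above amounts to taking a maximum over the per-SCM values. In the dashboard this would presumably be a `max` aggregation in the panel query; the Java below is only a sketch of the same idea, with hypothetical hostnames and threshold values.

```java
import java.util.Map;
import java.util.OptionalDouble;

public class ExitThreshold {

  /**
   * Reduces the thresholds reported by each SCM to one number by taking the maximum.
   * Returns an empty result when no SCM has reported a threshold yet.
   */
  public static OptionalDouble singleThreshold(Map<String, Double> thresholdByHost) {
    return thresholdByHost.values().stream()
        .mapToDouble(Double::doubleValue)
        .max();
  }

  public static void main(String[] args) {
    // Hypothetical values: with SCM HA all three are expected to agree.
    Map<String, Double> byHost = Map.of("scm1", 3.0, "scm2", 3.0, "scm3", 3.0);
    System.out.println(singleThreshold(byHost).getAsDouble());
  }
}
```

Taking the maximum rather than, say, the first value keeps the plotted exit criterion conservative if one SCM ever reports a different threshold.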

Can you share screenshots of what the updated dashboards look like in an SCM HA cluster?

@sumitagrawl

Contributor

@errose28 Do we need to add metrics for config such as the threshold value, just for metrics that exist only for a certain duration during startup? ... I think this is not the correct approach for this Grafana dashboard.

@errose28

Contributor

@sumitagrawl we need to be able to see the rule counters progress towards their target value. The target value is a configuration known by the SCM process. SCM communicates this to Prometheus and Grafana through metrics. If you have a different way to achieve this same goal we can look into it, but we cannot drop the target from the dashboard; otherwise the rule lines provide little value.

I think this is not the correct approach for this Grafana dashboard.

Why do you say this? This is a pretty standard dashboard that tracks system progress towards a desired goal/threshold over time.

@sreejasahithi

Contributor Author

Can you share screenshots of what the updated dashboards look like in an SCM HA cluster?

This is a screenshot of the updated dashboard in SCM HA:
[Image: Screenshot 2025-12-29 at 5 56 43 PM]

@sreejasahithi

Contributor Author

The test TestSCMSafeModeManager looks flaky, I will fix it.

@jojochuang

Contributor

The safe mode chart looks good!

@errose28 errose28 left a comment

Contributor

Looks good, this will be great to have going forward! I just triggered the CI on this PR.

@Tejaskriya Tejaskriya left a comment

Contributor

Thanks @sreejasahithi , LGTM!

@jojochuang jojochuang left a comment

Contributor

+1 merging it now

@jojochuang jojochuang merged commit 7e4e5f3 into apache:master Jan 13, 2026
83 of 84 checks passed
@jojochuang

Contributor

Thanks @sreejasahithi @Tejaskriya @errose28 @sumitagrawl
