HDDS-15552. Ratis events should not be published as metrics #10523

Draft
jojochuang wants to merge 5 commits into apache:master from jojochuang:HDDS-15552

@jojochuang

Contributor

What changes were proposed in this pull request?

HDDS-15552. Ratis events should not be published as metrics

Please describe your PR in detail:

  • Treat getRatisEvents() as a normal method in the Metrics classes. Drop the Metrics annotation.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-15552

How was this patch tested?

Copilot AI left a comment

Contributor

Pull request overview

This PR addresses HDDS-15552 by stopping Ratis state machine events from being exported via the Hadoop Metrics2 system and instead exposing them via the existing *Info,component=ServerRuntime JMX (MXBean) endpoints used for service runtime information.

Changes:

  • Removed the @Metric annotation from getRatisEvents() in OM/SCM metrics classes so events are no longer published as metrics.
  • Added getRatisEvents() to OMMXBean/SCMMXBean and implemented it in OzoneManager/StorageContainerManager by delegating to their metrics instances.
  • Updated OM/SCM web UIs (and an SCM integration test) to read RatisEvents from the *Info,component=ServerRuntime MXBeans instead of tag.RatisEvents from the metrics beans.
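
The UI change in the last bullet can be sketched as a standalone function. This is a minimal illustration, not the code from `ozoneManager.js` or `scm.js`: the helper name `parseRatisEvents` is hypothetical, the input is assumed to be the JSON body of the JMX query (an object with a `beans` array), and trimming and dropping empty lines is an extra choice made here.

```javascript
// Hypothetical helper: extracts Ratis event lines from a JMX-style response
// body of the assumed form { beans: [ { RatisEvents: "line1\nline2" } ] }.
function parseRatisEvents(jmxBody) {
  const beans = (jmxBody && jmxBody.beans) || [];
  const info = beans[0];
  // Read the plain MXBean attribute; the old 'tag.RatisEvents' key is ignored.
  const raw = info && typeof info.RatisEvents === 'string' ? info.RatisEvents : '';
  return raw
    .split('\n')
    .map(function (line) { return line.trim(); })
    .filter(function (line) { return line.length > 0; });
}

// Example inputs (made-up event text, for illustration only).
console.log(parseRatisEvents({ beans: [{ RatisEvents: 'event one\nevent two\n' }] }));
console.log(parseRatisEvents({ beans: [] }));
```

Guarding both the bean and the attribute means a missing MXBean or an empty event list yields an empty array instead of a `TypeError` in the page.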

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| hadoop-ozone/ozone-manager/src/main/resources/webapps/ozoneManager/ozoneManager.js | Switch OM UI Ratis events query from OMMetrics to OzoneManagerInfo MXBean. |
| hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OzoneManager.java | Expose RatisEvents via OM MXBean by delegating to OMMetrics. |
| hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OMMXBean.java | Add getRatisEvents() attribute to OM MXBean contract. |
| hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OMMetrics.java | Stop exporting Ratis events as a Metrics2 metric by removing @Metric. |
| hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/hdds/scm/TestSCMMXBean.java | Assert SCM MXBean exposes RatisEvents. |
| hadoop-hdds/server-scm/src/main/resources/webapps/scm/scm.js | Switch SCM UI Ratis events query from SCMMetrics to StorageContainerManagerInfo MXBean. |
| hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/StorageContainerManager.java | Expose RatisEvents via SCM MXBean by delegating to SCMMetrics. |
| hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/SCMMXBean.java | Add getRatisEvents() attribute to SCM MXBean contract. |
| hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/placement/metrics/SCMMetrics.java | Stop exporting Ratis events as a Metrics2 metric by removing @Metric. |

Comment on lines 178 to +179
       var metrics = result.data.beans[0];
-      var rawEvents = metrics['tag.RatisEvents'] ? metrics['tag.RatisEvents'].split('\n') : [];
+      var rawEvents = (metrics && metrics['RatisEvents']) ? metrics['RatisEvents'].split('\n') : [];
Comment on lines 35 to +36
       var metrics = result.data.beans[0];
-      var rawEvents = metrics['tag.RatisEvents'] ? metrics['tag.RatisEvents'].split('\n') : [];
+      var rawEvents = (metrics && metrics['RatisEvents']) ? metrics['RatisEvents'].split('\n') : [];
*/
String getHostname();

String getRatisEvents();
*/
String getHostname();

String getRatisEvents();
Comment on lines +3310 to +3313
  @Override
  public String getRatisEvents() {
    return metrics != null ? metrics.getRatisEvents() : "";
  }

Contributor

This looks like a legit suggestion

Comment on lines +90 to +91
String ratisEvents = (String) mbs.getAttribute(bean, "RatisEvents");
assertEquals(scm.getMetrics().getRatisEvents(), ratisEvents);

@errose28 errose28 left a comment

Contributor

Thanks for the quick fix @jojochuang. To ensure this is no longer being published to Prometheus we should implement this from the Jira as well:

Additionally to verify this change, we should add an acceptance test call to GET http://:9090/api/v1/targets and ensure that health=up for each component to prevent future regressions like this.

This should be a small addition to any existing acceptance test that uses Prometheus.
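
The check described in the quoted Jira text can be sketched as follows. This is an illustration under stated assumptions, not the acceptance test itself: the function name `unhealthyTargets` is hypothetical, the input is assumed to be the already-parsed JSON body returned by the Prometheus targets endpoint (`status`, `data.activeTargets[].health`), and the job names and URLs in the sample are made up.

```javascript
// Hypothetical helper: returns the active scrape targets whose health is not "up".
// Expects the parsed JSON body of the Prometheus /api/v1/targets endpoint.
function unhealthyTargets(targetsResponse) {
  if (!targetsResponse || targetsResponse.status !== 'success') {
    throw new Error('unexpected Prometheus response status');
  }
  const active = (targetsResponse.data && targetsResponse.data.activeTargets) || [];
  return active
    .filter(function (t) { return t.health !== 'up'; })
    .map(function (t) {
      return {
        job: t.labels ? t.labels.job : undefined,
        scrapeUrl: t.scrapeUrl,
        lastError: t.lastError
      };
    });
}

// Made-up sample response: one healthy target and one failing target.
const sample = {
  status: 'success',
  data: {
    activeTargets: [
      { labels: { job: 'om' }, scrapeUrl: 'http://om:9874/prom', health: 'up', lastError: '' },
      { labels: { job: 'scm' }, scrapeUrl: 'http://scm:9876/prom', health: 'down', lastError: 'parse error' }
    ]
  }
};
console.log(unhealthyTargets(sample));
```

A regression test built on this idea would fail whenever the returned list is non-empty, and `lastError` would point at the metric that broke scraping.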

Change-Id: I06634562bad09e1cf308b3d2f9ec93d4b6c078fe
@jojochuang jojochuang marked this pull request as ready for review June 22, 2026 21:42
@jojochuang jojochuang requested a review from errose28 June 23, 2026 17:34
@adoroszlai

Contributor
Key delete passthrough                                                | FAIL |
4 != 0

This test failed here and in the first attempt in my fork. Unfortunately, the check artifact is not available due to:

The path for one of the files in artifact is not valid: /ozone/ozone-prometheus-1_runtime exec failed: exec failed: unable to start container process: exec: "bash": executable file not found in $PATH
.stack. Contains the following character:  Double quote "

I originally reported it as a follow-up (HDDS-15660), but it looks like it would be better to fix it right in this PR. Please filter out prometheus here:

## @description Create stack dump of each java process in each container
create_stack_dumps() {
  local c pid procname
  for c in $(docker-compose ps | cut -f1 -d' ' | grep -e datanode -e om -e recon -e s3g -e scm); do
    while read -r pid procname; do
      echo "jstack $pid > ${RESULT_DIR}/${c}_${procname}.stack"
      docker exec "${c}" bash -c "jstack $pid" > "${RESULT_DIR}/${c}_${procname}.stack"
    done < <(docker exec "${c}" bash -c "jps | grep -v Jps" || true)
  done
}

like:

diff --git hadoop-ozone/dist/src/main/compose/testlib.sh hadoop-ozone/dist/src/main/compose/testlib.sh
index 903fdad7de..3c31de9b48 100755
--- hadoop-ozone/dist/src/main/compose/testlib.sh
+++ hadoop-ozone/dist/src/main/compose/testlib.sh
@@ -290,7 +290,7 @@ reorder_om_nodes() {
 ## @description Create stack dump of each java process in each container
 create_stack_dumps() {
   local c pid procname
-  for c in $(docker-compose ps | cut -f1 -d' ' | grep -e datanode -e om -e recon -e s3g -e scm); do
+  for c in $(docker-compose ps | cut -f1 -d' ' | grep -e datanode -e om -e recon -e s3g -e scm | grep -v -e prometheus); do
     while read -r pid procname; do
       echo "jstack $pid > ${RESULT_DIR}/${c}_${procname}.stack"
       docker exec "${c}" bash -c "jstack $pid" > "${RESULT_DIR}/${c}_${procname}.stack"

With that fix, we can get an artifact for the test run if it fails again, which may help determine if the failure is related to this change or not.

Change-Id: I3845a01fbd97f6e51106d64037d00cc2d4fc9073
@adoroszlai

Contributor

Thanks @jojochuang for fixing the stack dump problem; we now get the acceptance-unsecure artifact. log.html shows that the test failure is caused by an unexpected log message in the ozone sh key list output, which jq cannot parse as JSON:

$ ozone sh key list o3://om:9862/21377-with-del/bfso
[ {
  "volumeName" : "21377-with-del",
  "bucketName" : "bfso",
  "name" : ".Trash/",
   ...
} ]
Jun 24, 2026 6:33:48 PM io.opentelemetry.sdk.common.internal.ThrottlingLogger doLog
SEVERE: Failed to export spans. The request could not be executed.
java.io.InterruptedIOException: executor rejected
	at okhttp3.internal.connection.RealCall$AsyncCall.failRejected$okhttp(RealCall.kt:563)
	at okhttp3.internal.connection.RealCall$AsyncCall.executeOn(RealCall.kt:554)
	at okhttp3.Dispatcher.promoteAndExecute(Dispatcher.kt:252)
	at okhttp3.Dispatcher.promoteAndExecute$default(Dispatcher.kt:163)
	at okhttp3.Dispatcher.finished$okhttp(Dispatcher.kt:269)
	at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:597)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1583)
Caused by: java.util.concurrent.RejectedExecutionException: Task okhttp3.internal.connection.RealCall$AsyncCall@611166bd rejected from java.util.concurrent.ThreadPoolExecutor@73163d48[Running, pool size = 5, active threads = 5, queued tasks = 0, completed tasks = 5]
	at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2081)
	at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:841)
	at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1376)
	at okhttp3.internal.connection.RealCall$AsyncCall.executeOn(RealCall.kt:551)
	... 7 more

parse error: Invalid numeric literal at line 70, column 4

The problem is intermittent: it passed in the PR run but failed in the fork run.

Change-Id: Ie72dd58955082f77262aa380c6e4e8e35c87c30d
Change-Id: Ib3a9a17b0459ea1bea679ebde8543d304342e485
@jojochuang

Contributor Author

Ignoring the opentelemetry error by removing stderr output.
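
The general idea of keeping stderr out of the text handed to a JSON parser can be sketched as below. This is only an illustration, since the actual change is not shown in this thread: the helper name `runJsonCommand` is hypothetical, and a small Node.js one-liner stands in for a CLI that prints JSON on stdout and a log line on stderr.

```javascript
const { execFileSync } = require('child_process');

// Hypothetical helper: runs a command and parses its stdout as JSON.
// stderr is discarded so log lines written there cannot reach the parser.
function runJsonCommand(command, args) {
  const stdout = execFileSync(command, args, {
    encoding: 'utf8',
    stdio: ['ignore', 'pipe', 'ignore'] // stdin, stdout, stderr
  });
  return JSON.parse(stdout);
}

// Stand-in for a CLI: valid JSON on stdout, an unrelated log line on stderr.
const fakeCli =
  'console.log(JSON.stringify([{ name: ".Trash/" }]));' +
  'console.error("SEVERE: Failed to export spans.");';

const keys = runJsonCommand(process.execPath, ['-e', fakeCli]);
console.log(keys.length);
```

Dropping stderr hides the exporter error from the parser but also from the test log, so redirecting it to a separate file would be an alternative when the message is still wanted for debugging.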

@adoroszlai adoroszlai marked this pull request as draft June 25, 2026 05:18
5 participants