HDDS-15552. Ratis events should not be published as metrics#10523
HDDS-15552. Ratis events should not be published as metrics#10523jojochuang wants to merge 5 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR addresses HDDS-15552 by stopping Ratis state machine events from being exported via the Hadoop Metrics2 system and instead exposing them via the existing *Info,component=ServerRuntime JMX (MXBean) endpoints used for service runtime information.
Changes:
- Removed the
@Metricannotation fromgetRatisEvents()in OM/SCM metrics classes so events are no longer published as metrics. - Added
getRatisEvents()toOMMXBean/SCMMXBeanand implemented it inOzoneManager/StorageContainerManagerby delegating to their metrics instances. - Updated OM/SCM web UIs (and an SCM integration test) to read
RatisEventsfrom the*Info,component=ServerRuntimeMXBeans instead oftag.RatisEventsfrom the metrics beans.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| hadoop-ozone/ozone-manager/src/main/resources/webapps/ozoneManager/ozoneManager.js | Switch OM UI Ratis events query from OMMetrics to OzoneManagerInfo MXBean. |
| hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OzoneManager.java | Expose RatisEvents via OM MXBean by delegating to OMMetrics. |
| hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OMMXBean.java | Add getRatisEvents() attribute to OM MXBean contract. |
| hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OMMetrics.java | Stop exporting Ratis events as a Metrics2 metric by removing @Metric. |
| hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/hdds/scm/TestSCMMXBean.java | Assert SCM MXBean exposes RatisEvents. |
| hadoop-hdds/server-scm/src/main/resources/webapps/scm/scm.js | Switch SCM UI Ratis events query from SCMMetrics to StorageContainerManagerInfo MXBean. |
| hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/StorageContainerManager.java | Expose RatisEvents via SCM MXBean by delegating to SCMMetrics. |
| hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/SCMMXBean.java | Add getRatisEvents() attribute to SCM MXBean contract. |
| hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/placement/metrics/SCMMetrics.java | Stop exporting Ratis events as a Metrics2 metric by removing @Metric. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| var metrics = result.data.beans[0]; | ||
| var rawEvents = metrics['tag.RatisEvents'] ? metrics['tag.RatisEvents'].split('\n') : []; | ||
| var rawEvents = (metrics && metrics['RatisEvents']) ? metrics['RatisEvents'].split('\n') : []; |
| var metrics = result.data.beans[0]; | ||
| var rawEvents = metrics['tag.RatisEvents'] ? metrics['tag.RatisEvents'].split('\n') : []; | ||
| var rawEvents = (metrics && metrics['RatisEvents']) ? metrics['RatisEvents'].split('\n') : []; |
| */ | ||
| String getHostname(); | ||
|
|
||
| String getRatisEvents(); |
| */ | ||
| String getHostname(); | ||
|
|
||
| String getRatisEvents(); |
| @Override | ||
| public String getRatisEvents() { | ||
| return metrics != null ? metrics.getRatisEvents() : ""; | ||
| } |
There was a problem hiding this comment.
This looks like a legit suggestion
| String ratisEvents = (String) mbs.getAttribute(bean, "RatisEvents"); | ||
| assertEquals(scm.getMetrics().getRatisEvents(), ratisEvents); |
errose28
left a comment
There was a problem hiding this comment.
Thanks for the quick fix @jojochuang. To ensure this is no longer being published to Prometheus we should implement this from the Jira as well:
Additionally to verify this change, we should add an acceptance test call to GET http://:9090/api/v1/targets and ensure that health=up for each component to prevent future regressions like this.
This should be a small addition to any existing acceptance test that uses Prometheus.
Change-Id: I06634562bad09e1cf308b3d2f9ec93d4b6c078fe
failed here and in first attempt in fork. Unfortunately, check artifact is not available due to: Originally reported it as follow-up (HDDS-15660), but looks like it would be better to fix it right in this PR. Please filter ozone/hadoop-ozone/dist/src/main/compose/testlib.sh Lines 290 to 299 in d1905e1 like: diff --git hadoop-ozone/dist/src/main/compose/testlib.sh hadoop-ozone/dist/src/main/compose/testlib.sh
index 903fdad7de..3c31de9b48 100755
--- hadoop-ozone/dist/src/main/compose/testlib.sh
+++ hadoop-ozone/dist/src/main/compose/testlib.sh
@@ -290,7 +290,7 @@ reorder_om_nodes() {
## @description Create stack dump of each java process in each container
create_stack_dumps() {
local c pid procname
- for c in $(docker-compose ps | cut -f1 -d' ' | grep -e datanode -e om -e recon -e s3g -e scm); do
+ for c in $(docker-compose ps | cut -f1 -d' ' | grep -e datanode -e om -e recon -e s3g -e scm | grep -v -e prometheus); do
while read -r pid procname; do
echo "jstack $pid > ${RESULT_DIR}/${c}_${procname}.stack"
docker exec "${c}" bash -c "jstack $pid" > "${RESULT_DIR}/${c}_${procname}.stack"With that fix, we can get an artifact for the test run if it fails again, which may help determine if the failure is related to this change or not. |
Change-Id: I3845a01fbd97f6e51106d64037d00cc2d4fc9073
|
Thanks @jojochuang for fixing the stack dump problem, now we get The problems is intermittent, passed in PR run, failed in fork run. |
Change-Id: Ie72dd58955082f77262aa380c6e4e8e35c87c30d
Change-Id: Ib3a9a17b0459ea1bea679ebde8543d304342e485
|
Ignoring the opentelemetry error by removing stderr output. |
What changes were proposed in this pull request?
HDDS-15552. Ratis events should not be published as metrics
Please describe your PR in detail:
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-15552
How was this patch tested?