Add Cassandra JMX metrics target system#19080
Conversation
|
CC: @SylvainJuge, @robsunday |
There was a problem hiding this comment.
Pull request overview
This PR adds Cassandra as a predefined JMX target system in the jmx-metrics library, advancing the JMX feature-parity effort (#12158) and resolving #14277. It introduces a semconv-aligned set of Cassandra metrics (compaction, storage, and client-request count/error/latency) as a YAML rule file, with supporting documentation and an integration test using the OTLP capture harness.
Changes:
- New
cassandra.yamlrule file defining compaction, storage, and client-request metrics, with theoperationdimension modeled as a metric attribute and latency percentiles exposed via.p50/.p99/.maxsuffixes (using YAML anchors to share error-metric definitions). - New
CassandraTestintegration test asserting each metric's name, type, unit, description, and attributes against acassandra:5.0.2container. - Documentation and bookkeeping: new
library/cassandra.mdmetrics table, registration inREADME.md, and aCHANGELOG.mdentry.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
instrumentation/jmx-metrics/library/src/main/resources/jmx/rules/cassandra.yaml |
Defines the Cassandra JMX-to-OTel metric mappings. |
instrumentation/jmx-metrics/library/src/test/java/io/opentelemetry/instrumentation/jmx/rules/CassandraTest.java |
Integration test validating the emitted Cassandra metrics. |
instrumentation/jmx-metrics/library/cassandra.md |
Documents the Cassandra metrics table. |
instrumentation/jmx-metrics/README.md |
Registers cassandra in the supported target list (contains a stray empty bullet / missing blank line). |
CHANGELOG.md |
Adds an Unreleased entry (uses a placeholder PR #0 link). |
Updated the changelog to reflect the addition of Cassandra JMX metrics target system.
|
This PR has review comments. Review suggestions, whether from maintainers or automated reviewers, aren't always correct or required. Please evaluate each comment on its merits, then make sure each thread has a clear outcome. For example, link to the commit if you applied a suggestion, explain why it wasn't applied, or ask a follow-up question. Automation flags a PR for human review once every review thread has a reply or is marked as resolved. Status across open PRs is visible on the pull request dashboard. |
| - org.apache.cassandra.metrics:type=ClientRequest,scope=RangeSlice,name=Timeouts | ||
| - org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Timeouts |
There was a problem hiding this comment.
Modeling cassandra client error like this assumes that all the values we capture for the Timeouts, Failures, Unavailables form a partition and do not overlap.
Do we have any guarantees in the implementation about this ? For example, when there is an increment in the timeout, does it also increments the failure count ?
If there is never any overlap, then we could even use the name mbean parameter as a metric attribute (for example cassandra.error.type, so querying the cassandra.client.request.error.count metric without any attribute would be the total number of client errors, and the cassandra.error.type attribute could provide breakdown on errors.
Also, when looking at cassandra 5.x documentation, we can see that there are a few other possible values not covered here: https://cassandra.apache.org/doc/5.0/cassandra/managing/operating/metrics.html#client-request-metrics
If we don't have guarantees about the overlap/partition, then capturing those as individual metrics (likely with a common prefix) would probably be more relevant. Also, even if it is not the goal of this PR, it could help dealing with previous/recent versions that might not provide the exact same mapping.
There was a problem hiding this comment.
In Cassandra, Timeouts, Failures, and Unavailables are tracked as separate, non-overlapping counters — a timeout is not also counted as a failure or unavailable. So modelling them as a single metric with a cassandra.status attribute is safe and gives a clean per-operation breakdown without double-counting.
Agreed that there are additional error types in Cassandra 5.x docs not covered here (e.g. SpeculativeRetries, per-keyspace variants). The goal of this PR is to cover the three most commonly observed error categories; additional types can be added in follow-up PRs once we have more operational experience with the broader set.
There was a problem hiding this comment.
Given the way cassandra client error metrics are structured, we can assume that any new type of error would likely be captured in the same way: no overlap with other counters and a dedicated counter with its own name and scope values.
What could be done here is to use a wildcard on the MBean name (and maybe scope) to simply allow all values without having to map them individually:
org.apache.cassandra.metrics:type=ClientRequest,scope=*,name=*- with that, the actual metric breakdown will differ from version to version, but we would not have to maintain it, also the total number of client errors (when aggregating without metric attributes) would always match the breakdown, whereas if we have an explicit list then will miss error types that are not explicitly listed.
There was a problem hiding this comment.
I'm afraid name=* would be too broad — every scope also registers Latency, TotalLatency, Aborts, LocalRequests, etc., all of which expose a Count attribute and would incorrectly end up in cassandra.client.request.error.
The safe approach is scope=* with explicit error names:
- beans:
- org.apache.cassandra.metrics:type=ClientRequest,scope=*,name=UnavailablesThis also has a nice side effect of covering CASRead, CASWrite, and ViewWrite scopes that the current explicit list misses. Should we expand the scope in this PR or a follow-up?
There was a problem hiding this comment.
Ok, then let's keep the explicit name with client error types.
For adding new scopes like CASRead or CASWrite, I am not sure if those would fit the definition of "errors" as they most likely relate to performance metrics and behavior than actual client errors, so it's probably better to deal with that with a dedicated follow-up PR.
|
One additional question we need to answer here is the value we plan to use for
|
Co-authored-by: SylvainJuge <763082+SylvainJuge@users.noreply.github.com>
92de191 to
015a710
Compare
I think shipping as plain |
|
|
||
| | Metric Name | Type | Unit | Attributes | Description | | ||
| | ----------------------------------------- | ------------- | --------- | ------------------------------------- | ---------------------------------------------------------------- | | ||
| | cassandra.client.request.count | Counter | {request} | cassandra.operation | Number of requests by operation. | |
There was a problem hiding this comment.
Just to make it more future-proof attributes prefix could be cassandra.client instead of just cassandra. It may help avoiding naming collisions in the future.
There was a problem hiding this comment.
I'd prefer to keep cassandra.operation and cassandra.status as-is. Using a single-segment namespace (cassandra.) is consistent with how other JMX target systems name their attributes in this repo — none of them use multi-segment prefixes. Adding .client. would also diverge further from the contrib jmx-scraper which uses operation/status (without any prefix), making future alignment harder.
Here are my thoughts trying to decide what is best here: Those revised cassandra metrics are new, and thus adding them in instrumentation with The usage of With JMX Scraper, in https://github.com/open-telemetry/opentelemetry-java-contrib/tree/main/jmx-scraper, the So, if we ship it in instrumentation with
Doing this will make Also, in the future I think we should be able to:
So, even if I don't like having to ship it as |
|
@SylvainJuge Already renamed the YAML and updated the README accordingly. Please check. |
Description
Adds Cassandra as a predefined JMX target system in the
jmx-metricslibrary, contributing to the JMX feature-parity effort (#12158) and resolving #14277.Cassandra metrics were previously only defined in
jmx-scraper(in opentelemetry-java-contrib) and had never been migrated upstream. This PR brings a curated, semconv-aligned set of Cassandra metrics into instrumentation as the source of truth, so that jmx-scraper can later inherit them instead of maintaining a divergent copy.The metrics cover the essential Cassandra MBeans:
Alignment notes
Following the definition recommendations in the jmx-metrics README:
cassandra.prefix (e.g.cassandra.operation,cassandra.status).operationdimension (Read / Write / RangeSlice) is modeled as a metric attribute rather than baked into metric names, keeping the request count, error count, and latency metrics consistent with each other..p50,.p99,.max) per the recommendation to capture pre-aggregated values with a.{aggregation}suffix, rather than as an attribute. This also avoids any metric name being a prefix of another.Open questions for reviewers
A couple of alignment decisions I'd appreciate direction on:
Latency units and naming. The README recommends time metrics prefer seconds (with unit conversion) and prefer
durationovertime/latency. I've keptusand thelatencyname to stay close to Cassandra's MBean semantics (...,name=Latency, microsecond percentiles), but the engine supportssourceUnit: us→unit: sconversion. Would you prefercassandra.client.request.duration.{p50,p99,max}in seconds instead?operationattribute values. ThescopeMBean property values come through verbatim as PascalCase (Read,Write,RangeSlice). The YAML engine can't transform these. If normalized values are preferred, that would require splitting the wildcard rules into per-scope rules withconst(...)values — happy to do that if desired.Testing
CassandraTestthat asserts every metric's name, type, unit, description, and attributes via the OTLP capture harness.Relates to #12158
Resolves #14277