Add Cassandra JMX metrics target system by jkoronaAtCisco · Pull Request #19080 · open-telemetry/opentelemetry-java-instrumentation

jkoronaAtCisco · 2026-06-25T17:35:14Z

Description

Adds Cassandra as a predefined JMX target system in the jmx-metrics library, contributing to the JMX feature-parity effort (#12158) and resolving #14277.

Cassandra metrics were previously only defined in jmx-scraper (in opentelemetry-java-contrib) and had never been migrated upstream. This PR brings a curated, semconv-aligned set of Cassandra metrics into instrumentation as the source of truth, so that jmx-scraper can later inherit them instead of maintaining a divergent copy.

The metrics cover the essential Cassandra MBeans:

- Compaction: completed and pending compaction tasks
- Storage: on-disk load, total hints, in-progress hints
- Client requests: request count, error count (by error type), and request latency percentiles, all broken down by operation (Read / Write / RangeSlice)

Alignment notes

Following the definition recommendations in the jmx-metrics README:

- Metric names and metric attributes are namespaced with the cassandra. prefix (e.g. cassandra.operation, cassandra.status).
- The operation dimension (Read / Write / RangeSlice) is modeled as a metric attribute rather than baked into metric names, keeping the request count, error count, and latency metrics consistent with each other.
- Latency percentiles are exposed with an aggregation suffix (.p50, .p99, .max) per the recommendation to capture pre-aggregated values with a .{aggregation} suffix, rather than as an attribute. This also avoids any metric name being a prefix of another.

Open questions for reviewers

A couple of alignment decisions I'd appreciate direction on:

- Latency units and naming. The README recommends that time metrics prefer seconds (with unit conversion) and prefer duration over time/latency. I've kept microseconds (us) as the unit and latency in the metric name to stay close to Cassandra's MBean semantics (...,name=Latency, microsecond percentiles), but the engine supports sourceUnit: us → unit: s conversion. Would you prefer cassandra.client.request.duration.{p50,p99,max} in seconds instead?
- operation attribute values. The scope MBean property values come through verbatim as PascalCase (Read, Write, RangeSlice). The YAML engine can't transform these. If normalized values are preferred, that would require splitting the wildcard rules into per-scope rules with const(...) values; happy to do that if desired.
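
To make the first question concrete, here is a minimal sketch of what a us to s conversion amounts to numerically. The class and method names are hypothetical and the sample readings are made up; this is not code from the jmx-metrics engine, only an illustration of the scaling involved.

```java
// Hypothetical illustration only: not code from the jmx-metrics engine.
final class LatencyUnitSketch {

  private static final double MICROS_PER_SECOND = 1_000_000.0;

  /** Scales a value reported in microseconds to seconds. */
  static double microsToSeconds(double micros) {
    return micros / MICROS_PER_SECOND;
  }

  public static void main(String[] args) {
    // Made-up percentile readings in microseconds, standing in for Latency MBean values.
    double p50Micros = 850.0;
    double p99Micros = 2_500.0;
    System.out.println("p50 = " + microsToSeconds(p50Micros) + " s");
    System.out.println("p99 = " + microsToSeconds(p99Micros) + " s");
  }
}
```

With seconds as the unit, typical request latencies become small fractional values, which is worth keeping in mind when choosing between the two options.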

Testing

Added CassandraTest that asserts every metric's name, type, unit, description, and attributes via the OTLP capture harness.

Relates to #12158
Resolves #14277

jkoronaAtCisco · 2026-06-25T17:36:26Z

CC: @SylvainJuge, @robsunday

Copilot

Pull request overview

This PR adds Cassandra as a predefined JMX target system in the jmx-metrics library, advancing the JMX feature-parity effort (#12158) and resolving #14277. It introduces a semconv-aligned set of Cassandra metrics (compaction, storage, and client-request count/error/latency) as a YAML rule file, with supporting documentation and an integration test using the OTLP capture harness.

Changes:

- New cassandra.yaml rule file defining compaction, storage, and client-request metrics, with the operation dimension modeled as a metric attribute and latency percentiles exposed via .p50/.p99/.max suffixes (using YAML anchors to share error-metric definitions).
- New CassandraTest integration test asserting each metric's name, type, unit, description, and attributes against a cassandra:5.0.2 container.
- Documentation and bookkeeping: new library/cassandra.md metrics table, registration in README.md, and a CHANGELOG.md entry.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

| File | Description |
| ---- | ----------- |
| `instrumentation/jmx-metrics/library/src/main/resources/jmx/rules/cassandra.yaml` | Defines the Cassandra JMX-to-OTel metric mappings. |
| `instrumentation/jmx-metrics/library/src/test/java/io/opentelemetry/instrumentation/jmx/rules/CassandraTest.java` | Integration test validating the emitted Cassandra metrics. |
| `instrumentation/jmx-metrics/library/cassandra.md` | Documents the Cassandra metrics table. |
| `instrumentation/jmx-metrics/README.md` | Registers cassandra in the supported target list (contains a stray empty bullet / missing blank line). |
| `CHANGELOG.md` | Adds an Unreleased entry (uses a placeholder PR `#0` link). |

trask · 2026-06-25T20:05:15Z

cc @breedx-splk @PeterF778 @robsunday @SylvainJuge

opentelemetry-pr-dashboard · 2026-06-29T11:27:58Z

This PR has review comments. Review suggestions, whether from maintainers or automated reviewers, aren't always correct or required. Please evaluate each comment on its merits, then make sure each thread has a clear outcome.

For example, link to the commit if you applied a suggestion, explain why it wasn't applied, or ask a follow-up question.

Automation flags a PR for human review once every review thread has a reply or is marked as resolved.

Status across open PRs is visible on the pull request dashboard.

SylvainJuge · 2026-06-30T12:20:55Z

+      - org.apache.cassandra.metrics:type=ClientRequest,scope=RangeSlice,name=Timeouts
+      - org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Timeouts


Modeling Cassandra client errors like this assumes that the values we capture for Timeouts, Failures, and Unavailables form a partition and do not overlap.

Do we have any guarantees in the implementation about this? For example, when the timeout count is incremented, is the failure count incremented as well?
If there is never any overlap, then we could even use the name MBean parameter as a metric attribute (for example cassandra.error.type), so that querying the cassandra.client.request.error.count metric without any attribute would give the total number of client errors, and the cassandra.error.type attribute would provide the breakdown by error type.

Also, when looking at the Cassandra 5.x documentation, we can see that there are a few other possible values not covered here: https://cassandra.apache.org/doc/5.0/cassandra/managing/operating/metrics.html#client-request-metrics

If we don't have guarantees about the overlap/partition, then capturing those as individual metrics (likely with a common prefix) would probably be more relevant. Also, even if it is not the goal of this PR, it could help deal with older or more recent versions that might not provide the exact same mapping.

jkoronaAtCisco · 2026-07-02

In Cassandra, Timeouts, Failures, and Unavailables are tracked as separate, non-overlapping counters: a timeout is not also counted as a failure or as an unavailable. So modelling them as a single metric with a cassandra.status attribute is safe and gives a clean per-operation breakdown without double-counting.

Agreed that there are additional error types in the Cassandra 5.x docs not covered here (e.g. SpeculativeRetries, per-keyspace variants). The goal of this PR is to cover the three most commonly observed error categories; additional types can be added in follow-up PRs once we have more operational experience with the broader set.

SylvainJuge · 2026-07-02

Given the way Cassandra client error metrics are structured, we can assume that any new type of error would likely be captured in the same way: no overlap with other counters and a dedicated counter with its own name and scope values.

What could be done here is to use a wildcard on the MBean name (and maybe scope) to simply allow all values without having to map them individually:

org.apache.cassandra.metrics:type=ClientRequest,scope=*,name=*

With that, the actual metric breakdown will differ from version to version, but we would not have to maintain it. The total number of client errors (when aggregating without metric attributes) would also always match the breakdown, whereas an explicit list will miss any error type that is not listed.

jkoronaAtCisco · 2026-07-02

I'm afraid name=* would be too broad: every scope also registers Latency, TotalLatency, Aborts, LocalRequests, etc., all of which expose a Count attribute and would incorrectly end up in cassandra.client.request.error.

The safe approach is scope=* with explicit error names:

    - beans:
        - org.apache.cassandra.metrics:type=ClientRequest,scope=*,name=Unavailables

This also has a nice side effect of covering CASRead, CASWrite, and ViewWrite scopes that the current explicit list misses. Should we expand the scope in this PR or a follow-up?
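
The difference between the two patterns can be checked with the standard javax.management.ObjectName API, as in the rough sketch below. The class name and the list of sample MBean names are assumptions built from the scopes and metric names mentioned in this thread; the set actually registered depends on the Cassandra version.

```java
import java.util.ArrayList;
import java.util.List;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

// Illustration only: shows which sample MBean names a JMX ObjectName pattern selects.
final class ClientRequestPatternSketch {

  private static final String PREFIX = "org.apache.cassandra.metrics:type=ClientRequest,";

  // Sample names assembled for this sketch (assumption, not read from a live node).
  static final List<String> SAMPLE_BEANS = List.of(
      PREFIX + "scope=Read,name=Timeouts",
      PREFIX + "scope=Read,name=Unavailables",
      PREFIX + "scope=Read,name=Failures",
      PREFIX + "scope=Read,name=Latency",
      PREFIX + "scope=Write,name=TotalLatency",
      PREFIX + "scope=RangeSlice,name=Unavailables",
      PREFIX + "scope=CASRead,name=Unavailables");

  // Wraps the checked exception so callers can use plain strings.
  static ObjectName parse(String name) {
    try {
      return new ObjectName(name);
    } catch (MalformedObjectNameException e) {
      throw new IllegalArgumentException(name, e);
    }
  }

  /** Returns the sample names matched by the given ObjectName pattern. */
  static List<String> select(String pattern) {
    ObjectName query = parse(pattern);
    List<String> selected = new ArrayList<>();
    for (String bean : SAMPLE_BEANS) {
      if (query.apply(parse(bean))) {
        selected.add(bean);
      }
    }
    return selected;
  }

  public static void main(String[] args) {
    // name=* also selects Latency and TotalLatency, which are not error counters.
    System.out.println(select(PREFIX + "scope=*,name=*"));
    // An explicit name with scope=* selects only that error type, across all scopes.
    System.out.println(select(PREFIX + "scope=*,name=Unavailables"));
  }
}
```

In this sample, the explicit-name pattern also picks up the CASRead scope, which is the side effect described above.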

SylvainJuge · 2026-07-03

OK, then let's keep the explicit names for the client error types.

For adding new scopes like CASRead or CASWrite, I am not sure those would fit the definition of "errors": they most likely relate to performance and behavior rather than to actual client errors, so it's probably better to handle them in a dedicated follow-up PR.

SylvainJuge · 2026-06-30T13:25:50Z

One additional question we need to answer here is which value we plan to use for otel.jmx.target.system:

1. Use cassandra if we consider the current refinement in this PR enough to make those metrics "stable" (for Tomcat and similar, it was easier to decide because we had good knowledge of the underlying system; I don't have much experience with Cassandra).
2. Use experimental-cassandra, as it avoids stability expectations while still allowing the metrics to be promoted to stable in the future.
3. Add a fallback to experimental-cassandra, with a warning, when cassandra is used. It could even be made generic to cover existing values with the experimental- prefix, and it also helps provide some compatibility with the existing cassandra value in jmx-scraper.

jkoronaAtCisco · 2026-07-03T08:26:08Z

> One additional question we need to answer here is the value we plan to use for otel.jmx.target.system:
>
> 1. use cassandra if we consider the current refinement in this PR is enough to make those metrics "stable" (for Tomcat and similar, it was easier to decide as we had good knowledge on the underlying system, for Cassandra I don't really have much experience with it)
> 2. use experimental-cassandra as it avoids stability expectations, but would work in the future to promote them to stable
> 3. add another fallback to experimental-cassandra with warning when cassandra is used, it could even be made generic to cover existing values with experimental- prefix, this also helps providing some compatibility with the existing cassandra value in jmx-scraper.

I think shipping as plain cassandra here would silently produce different metrics for anyone migrating from the contrib version. That's why I would lean toward option 3 (accept cassandra with a deprecation warning).

robsunday · 2026-07-03T11:49:19Z

+
+| Metric Name                               | Type          | Unit      | Attributes                            | Description                                                      |
+| ----------------------------------------- | ------------- | --------- | ------------------------------------- | ---------------------------------------------------------------- |
+| cassandra.client.request.count            | Counter       | {request} | cassandra.operation                   | Number of requests by operation.                                 |


To make it more future-proof, the attribute prefix could be cassandra.client instead of just cassandra. It may help avoid naming collisions in the future.

jkoronaAtCisco · 2026-07-03

I'd prefer to keep cassandra.operation and cassandra.status as-is. Using a single-segment namespace (cassandra.) is consistent with how other JMX target systems name their attributes in this repo; none of them use multi-segment prefixes. Adding .client. would also diverge further from the contrib jmx-scraper, which uses operation/status (without any prefix), making future alignment harder.

SylvainJuge · 2026-07-03T12:18:33Z

> I think shipping as plain cassandra here would silently produce different metrics for anyone migrating from the contrib version. That's why I would lean toward option 3 (accept cassandra with a deprecation warning).

Here are my thoughts trying to decide what is best here:

Those revised cassandra metrics are new, and thus adding them in instrumentation with experimental- prefix is the most logical next step as they are by definition not stable and probably in the "development" maturity level.

The usage of experimental- as prefix to indicate lack of stability is a recent change with #18971.
In the related issue #16016 there was a discussion about removing otel.jmx.target.system and replacing it with a maturity toggle to include only stable metrics by default, but it hasn't been implemented yet.

In JMX Scraper (https://github.com/open-telemetry/opentelemetry-java-contrib/tree/main/jmx-scraper), the cassandra value is only supported when otel.jmx.target.source is legacy or auto (the default).

So, if we ship it in instrumentation under the experimental-cassandra name, it will be usable from jmx-scraper as experimental-cassandra and there is no confusion on the contrib side. It's either:

- otel.jmx.target.system = cassandra with otel.jmx.target.source = legacy or auto for the existing definitions (will not work with instrumentation).
- otel.jmx.target.system = experimental-cassandra with otel.jmx.target.source = auto or instrumentation (will not work with legacy).

Doing this will make experimental-cassandra similar to experimental-kafka-connect.
If we ship it as cassandra (or provide a fallback for it), it might cause unexpected metrics to be captured. This would effectively be a breaking change, although there is an existing workaround: setting otel.jmx.target.source to legacy.
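
The two supported combinations listed above can be written down as a small decision function. This is a hypothetical sketch for illustration, not jmx-scraper code: the class, enum, and method names are invented here, and the assumption that auto picks whichever source actually defines the requested target system is inferred from the description above.

```java
// Hypothetical sketch of the combinations described in this comment; not jmx-scraper code.
final class TargetSourceSketch {

  /** Possible values of otel.jmx.target.source. */
  enum Source { LEGACY, INSTRUMENTATION, AUTO }

  /** Which set of metric definitions ends up being used, if any. */
  enum Definitions { LEGACY, INSTRUMENTATION, NONE }

  static Definitions resolve(String targetSystem, Source source) {
    switch (targetSystem) {
      case "cassandra":
        // Only the legacy definitions know this name.
        return source == Source.INSTRUMENTATION ? Definitions.NONE : Definitions.LEGACY;
      case "experimental-cassandra":
        // Only the instrumentation definitions know this name.
        return source == Source.LEGACY ? Definitions.NONE : Definitions.INSTRUMENTATION;
      default:
        throw new IllegalArgumentException("not covered by this sketch: " + targetSystem);
    }
  }

  public static void main(String[] args) {
    for (String system : new String[] {"cassandra", "experimental-cassandra"}) {
      for (Source source : Source.values()) {
        System.out.println(system + " + " + source + " -> " + resolve(system, source));
      }
    }
  }
}
```

Under these assumptions no combination changes meaning silently: each target system name maps to exactly one set of definitions or to none.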

Also, in the future I think we should be able to:

- define maturity level on all of the JMX metrics in their YAML definition
- remove the otel.jmx.target.system configuration option and capture all metrics that are available on a given system
- filter captured metrics to only include stable metrics by default
- have a way to opt-in for non-stable metrics
- (in jmx-scraper) have a way to force using legacy metrics definitions

So, even if I don't like having to ship it as experimental-cassandra for now, I think that adding any kind of automatic fallback might introduce more issues and confusion and would not be consistent with what is stored in instrumentation.

jkoronaAtCisco · 2026-07-03T15:51:35Z

@SylvainJuge
Agreed — experimental-cassandra with no fallback is the right call. The separation in jmx-scraper between legacy/instrumentation sources makes the boundary clean, and adding a fallback would only muddy it.

Already renamed the YAML and updated the README accordingly. Please check.

Add Cassandra JMX metrics target system

69cd9c9

Copilot AI review requested due to automatic review settings June 25, 2026 17:35

Copilot started reviewing on behalf of jkoronaAtCisco June 25, 2026 17:35

Copilot AI reviewed Jun 25, 2026

Comment thread instrumentation/jmx-metrics/README.md Outdated

Comment thread CHANGELOG.md Outdated

jkoronaAtCisco requested a review from a team as a code owner June 25, 2026 17:43

opentelemetry-pr-dashboard Bot mentioned this pull request Jun 25, 2026

Pull Request Dashboard #18435

Add Cassandra JMX metrics to CHANGELOG

1420508

Updated the changelog to reflect the addition of Cassandra JMX metrics target system.

jkoronaAtCisco added 2 commits June 26, 2026 10:26

reformat markdown

cb07ec4

reformat markdown

f155538

laurit reviewed Jun 29, 2026

Comment thread instrumentation/jmx-metrics/library/src/main/resources/jmx/rules/cassandra.yaml Outdated

laurit reviewed Jun 29, 2026

Comment thread instrumentation/jmx-metrics/library/src/main/resources/jmx/rules/cassandra.yaml Outdated

SylvainJuge reviewed Jun 30, 2026

Comment thread instrumentation/jmx-metrics/library/src/main/resources/jmx/rules/cassandra.yaml Outdated

SylvainJuge reviewed Jun 30, 2026

jkoronaAtCisco and others added 4 commits July 2, 2026 12:32

Apply suggestions from code review

d6c7126

Co-authored-by: SylvainJuge <763082+SylvainJuge@users.noreply.github.com>

resolve code review comments

dac5293

resolve code review comments

bc25cae

spotless fix

015a710

jkoronaAtCisco force-pushed the cassandra-jmx-alignment branch from 92de191 to 015a710 Compare July 2, 2026 17:10

robsunday reviewed Jul 3, 2026

laurit added this to the v2.30.0 milestone Jul 3, 2026

jkoronaAtCisco added 2 commits July 3, 2026 17:30

Applied code review suggestions

9ab5b59

Rename target from cassandra to experimental-cassandra

bdfd58f
