Add Cassandra JMX metrics target system#19080

Open
jkoronaAtCisco wants to merge 10 commits into
open-telemetry:mainfrom
jkoronaAtCisco:cassandra-jmx-alignment

@jkoronaAtCisco

Description

Adds Cassandra as a predefined JMX target system in the jmx-metrics library, contributing to the JMX feature-parity effort (#12158) and resolving #14277.

Cassandra metrics were previously only defined in jmx-scraper (in opentelemetry-java-contrib) and had never been migrated upstream. This PR brings a curated, semconv-aligned set of Cassandra metrics into instrumentation as the source of truth, so that jmx-scraper can later inherit them instead of maintaining a divergent copy.

The metrics cover the essential Cassandra MBeans:

  • Compaction — completed/pending compaction tasks
  • Storage — on-disk load, total hints, in-progress hints
  • Client requests — request count, error count (by error type), and request latency percentiles, all broken down by operation (Read / Write / RangeSlice)

Alignment notes

Following the definition recommendations in the jmx-metrics README:

  • Metric names and metric attributes are namespaced with the cassandra. prefix (e.g. cassandra.operation, cassandra.status).
  • The operation dimension (Read / Write / RangeSlice) is modeled as a metric attribute rather than baked into metric names, keeping the request count, error count, and latency metrics consistent with each other.
  • Latency percentiles are exposed with an aggregation suffix (.p50, .p99, .max) per the recommendation to capture pre-aggregated values with a .{aggregation} suffix, rather than as an attribute. This also avoids any metric name being a prefix of another.
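
For illustration, the modeling described in these notes could be sketched as a rule like the one below. This is not the PR's actual cassandra.yaml: the metric prefix cassandra.client.request.latency. and the descriptions are assumptions, and the scope=* wildcard would also match scopes beyond Read / Write / RangeSlice. The MBean attribute names (50thPercentile, 99thPercentile, Max) are the ones Cassandra's latency timers expose over JMX.

```yaml
---
rules:
  # One rule for all ClientRequest latency beans; the scope key property
  # (Read, Write, RangeSlice, ...) becomes a metric attribute.
  - bean: org.apache.cassandra.metrics:type=ClientRequest,scope=*,name=Latency
    metricAttribute:
      # Copied verbatim from the ObjectName, e.g. "Read".
      cassandra.operation: param(scope)
    # Assumed prefix, for illustration only.
    prefix: cassandra.client.request.latency.
    type: gauge
    unit: us
    mapping:
      # Pre-aggregated values get an aggregation suffix instead of an attribute.
      50thPercentile:
        metric: p50
        desc: Median client request latency.
      99thPercentile:
        metric: p99
        desc: 99th percentile of client request latency.
      Max:
        metric: max
        desc: Maximum client request latency.
```

Because each percentile is its own metric, no metric name is a prefix of another, and the request count and error count metrics can share the same cassandra.operation attribute.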

Open questions for reviewers

A couple of alignment decisions I'd appreciate direction on:

  1. Latency units and naming. The README recommends that time metrics prefer seconds (with unit conversion) and prefer duration over time/latency. I've kept us (microseconds) and the latency name to stay close to Cassandra's MBean semantics (...,name=Latency, microsecond percentiles), but the engine supports conversion from sourceUnit: us to unit: s. Would you prefer cassandra.client.request.duration.{p50,p99,max} in seconds instead?

  2. operation attribute values. The scope MBean property values come through verbatim as PascalCase (Read, Write, RangeSlice). The YAML engine can't transform these. If normalized values are preferred, that would require splitting the wildcard rules into per-scope rules with const(...) values — happy to do that if desired.
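
A minimal sketch of what both alternatives might look like if combined, shown for a single scope. The normalized value range_slice and the description are assumptions; the metric name follows the cassandra.client.request.duration.{p50,p99,max} form proposed in question 1.

```yaml
---
rules:
  # Per-scope rule: the wildcard is replaced by an explicit scope so the
  # attribute value can be set with const(...) instead of param(scope).
  - bean: org.apache.cassandra.metrics:type=ClientRequest,scope=RangeSlice,name=Latency
    metricAttribute:
      # Assumed normalized value; the MBean itself reports "RangeSlice".
      cassandra.operation: const(range_slice)
    mapping:
      50thPercentile:
        metric: cassandra.client.request.duration.p50
        type: gauge
        # Cassandra reports microseconds; the engine converts to seconds.
        sourceUnit: us
        unit: s
        desc: Median client request duration.
```

The cost of this variant is one rule per scope (Read, Write, RangeSlice) for each metric group, instead of a single wildcard rule.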

Testing

  • Added CassandraTest that asserts every metric's name, type, unit, description, and attributes via the OTLP capture harness.

Relates to #12158
Resolves #14277

Copilot AI review requested due to automatic review settings June 25, 2026 17:35
@jkoronaAtCisco

Author

CC: @SylvainJuge, @robsunday

Copilot AI left a comment

Contributor

Pull request overview

This PR adds Cassandra as a predefined JMX target system in the jmx-metrics library, advancing the JMX feature-parity effort (#12158) and resolving #14277. It introduces a semconv-aligned set of Cassandra metrics (compaction, storage, and client-request count/error/latency) as a YAML rule file, with supporting documentation and an integration test using the OTLP capture harness.

Changes:

  • New cassandra.yaml rule file defining compaction, storage, and client-request metrics, with the operation dimension modeled as a metric attribute and latency percentiles exposed via .p50/.p99/.max suffixes (using YAML anchors to share error-metric definitions).
  • New CassandraTest integration test asserting each metric's name, type, unit, description, and attributes against a cassandra:5.0.2 container.
  • Documentation and bookkeeping: new library/cassandra.md metrics table, registration in README.md, and a CHANGELOG.md entry.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| instrumentation/jmx-metrics/library/src/main/resources/jmx/rules/cassandra.yaml | Defines the Cassandra JMX-to-OTel metric mappings. |
| instrumentation/jmx-metrics/library/src/test/java/io/opentelemetry/instrumentation/jmx/rules/CassandraTest.java | Integration test validating the emitted Cassandra metrics. |
| instrumentation/jmx-metrics/library/cassandra.md | Documents the Cassandra metrics table. |
| instrumentation/jmx-metrics/README.md | Registers cassandra in the supported target list (contains a stray empty bullet / missing blank line). |
| CHANGELOG.md | Adds an Unreleased entry (uses a placeholder PR #0 link). |

Comment thread instrumentation/jmx-metrics/README.md Outdated
Comment thread CHANGELOG.md Outdated
@jkoronaAtCisco jkoronaAtCisco requested a review from a team as a code owner June 25, 2026 17:43
Updated the changelog to reflect the addition of Cassandra JMX metrics target system.
@trask

trask commented Jun 25, 2026

Member

Comment thread instrumentation/jmx-metrics/library/src/main/resources/jmx/rules/cassandra.yaml Outdated
@opentelemetry-pr-dashboard

This PR has review comments. Review suggestions, whether from maintainers or automated reviewers, aren't always correct or required. Please evaluate each comment on its merits, then make sure each thread has a clear outcome.

For example, link to the commit if you applied a suggestion, explain why it wasn't applied, or ask a follow-up question.

Automation flags a PR for human review once every review thread has a reply or is marked as resolved.

Status across open PRs is visible on the pull request dashboard.

Comment thread instrumentation/jmx-metrics/library/src/main/resources/jmx/rules/cassandra.yaml Outdated
Comment on lines +96 to +97
- org.apache.cassandra.metrics:type=ClientRequest,scope=RangeSlice,name=Timeouts
- org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Timeouts

Contributor

Modeling Cassandra client errors like this assumes that the values we capture for Timeouts, Failures, and Unavailables form a partition and do not overlap.

Do we have any guarantees about this in the implementation? For example, when the timeout count is incremented, is the failure count incremented as well?
If there is never any overlap, then we could even use the name MBean parameter as a metric attribute (for example cassandra.error.type). Querying the cassandra.client.request.error.count metric without any attribute would then give the total number of client errors, and the cassandra.error.type attribute would provide the breakdown by error type.

Also, the Cassandra 5.x documentation lists a few other possible values not covered here: https://cassandra.apache.org/doc/5.0/cassandra/managing/operating/metrics.html#client-request-metrics

If we don't have guarantees about the overlap/partition, then capturing those as individual metrics (likely with a common prefix) would probably be more relevant. Also, even if it is not the goal of this PR, it could help deal with older or newer versions that might not provide the exact same mapping.

Author

In Cassandra, Timeouts, Failures, and Unavailables are tracked as separate, non-overlapping counters — a timeout is not also counted as a failure or unavailable. So modelling them as a single metric with a cassandra.status attribute is safe and gives a clean per-operation breakdown without double-counting.

Agreed that there are additional error types in Cassandra 5.x docs not covered here (e.g. SpeculativeRetries, per-keyspace variants). The goal of this PR is to cover the three most commonly observed error categories; additional types can be added in follow-up PRs once we have more operational experience with the broader set.

Contributor

Given the way cassandra client error metrics are structured, we can assume that any new type of error would likely be captured in the same way: no overlap with other counters and a dedicated counter with its own name and scope values.

What could be done here is to use a wildcard on the MBean name (and maybe scope) to simply allow all values without having to map them individually:

  • org.apache.cassandra.metrics:type=ClientRequest,scope=*,name=*
  • With that, the actual metric breakdown will differ from version to version, but we would not have to maintain it. The total number of client errors (when aggregating without metric attributes) would also always match the breakdown, whereas an explicit list will miss any error types that are not listed.

Author

I'm afraid name=* would be too broad — every scope also registers Latency, TotalLatency, Aborts, LocalRequests, etc., all of which expose a Count attribute and would incorrectly end up in cassandra.client.request.error.

The safe approach is scope=* with explicit error names:

- beans:
  - org.apache.cassandra.metrics:type=ClientRequest,scope=*,name=Unavailables

This also has a nice side effect of covering CASRead, CASWrite, and ViewWrite scopes that the current explicit list misses. Should we expand the scope in this PR or a follow-up?
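
Spelled out, the scope=* approach with explicit error names might look like the sketch below. The status values (timeout, failure, unavailable), the unit, and the description are assumptions, not taken from the PR; the YAML anchor only avoids repeating the shared mapping.

```yaml
---
rules:
  # scope=* matches every ClientRequest scope, while the explicit name keeps
  # non-error beans (Latency, TotalLatency, ...) out of the error metric.
  - bean: org.apache.cassandra.metrics:type=ClientRequest,scope=*,name=Timeouts
    metricAttribute:
      cassandra.operation: param(scope)
      cassandra.status: const(timeout)
    mapping: &errorCount
      Count:
        metric: cassandra.client.request.error.count
        type: counter
        unit: "{error}"
        desc: Number of client request errors by operation and status.
  - bean: org.apache.cassandra.metrics:type=ClientRequest,scope=*,name=Failures
    metricAttribute:
      cassandra.operation: param(scope)
      cassandra.status: const(failure)
    mapping: *errorCount
  - bean: org.apache.cassandra.metrics:type=ClientRequest,scope=*,name=Unavailables
    metricAttribute:
      cassandra.operation: param(scope)
      cassandra.status: const(unavailable)
    mapping: *errorCount
```

With the wildcard, cassandra.operation would also take values such as CASRead, CASWrite, and ViewWrite on systems that register those scopes.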

Contributor

Ok, then let's keep the explicit name with client error types.

As for adding new scopes like CASRead or CASWrite, I am not sure those would fit the definition of "errors": they most likely relate more to performance and behavior than to actual client errors, so it's probably better to handle them in a dedicated follow-up PR.

Comment thread instrumentation/jmx-metrics/library/cassandra.md Outdated
@SylvainJuge

SylvainJuge commented Jun 30, 2026

Contributor

One additional question we need to answer here is the value we plan to use for otel.jmx.target.system:

  • use cassandra if we consider the refinement in this PR sufficient to make those metrics "stable" (for Tomcat and similar systems this was easier to decide because we had good knowledge of the underlying system; I don't have much experience with Cassandra)
  • use experimental-cassandra, which avoids stability expectations and still allows promoting the metrics to stable later
  • add a fallback to experimental-cassandra, with a warning, when cassandra is used. This could even be made generic to cover existing values with the experimental- prefix, and it would also provide some compatibility with the existing cassandra value in jmx-scraper.

@jkoronaAtCisco jkoronaAtCisco force-pushed the cassandra-jmx-alignment branch from 92de191 to 015a710 Compare July 2, 2026 17:10
@jkoronaAtCisco

Author

> One additional question we need to answer here is the value we plan to use for otel.jmx.target.system:
>
>   • use cassandra if we consider the refinement in this PR sufficient to make those metrics "stable" (for Tomcat and similar systems this was easier to decide because we had good knowledge of the underlying system; I don't have much experience with Cassandra)
>   • use experimental-cassandra, which avoids stability expectations and still allows promoting the metrics to stable later
>   • add a fallback to experimental-cassandra, with a warning, when cassandra is used. This could even be made generic to cover existing values with the experimental- prefix, and it would also provide some compatibility with the existing cassandra value in jmx-scraper.

I think shipping as plain cassandra here would silently produce different metrics for anyone migrating from the contrib version. That's why I would lean toward option 3 (accept cassandra with a deprecation warning).


| Metric Name | Type | Unit | Attributes | Description |
| ----------------------------------------- | ------------- | --------- | ------------------------------------- | ---------------------------------------------------------------- |
| cassandra.client.request.count | Counter | {request} | cassandra.operation | Number of requests by operation. |

Contributor

Just to make it more future-proof, the attribute prefix could be cassandra.client instead of just cassandra. That may help avoid naming collisions in the future.

Author

I'd prefer to keep cassandra.operation and cassandra.status as-is. Using a single-segment namespace (cassandra.) is consistent with how other JMX target systems name their attributes in this repo — none of them use multi-segment prefixes. Adding .client. would also diverge further from the contrib jmx-scraper which uses operation/status (without any prefix), making future alignment harder.

Comment thread instrumentation/jmx-metrics/library/cassandra.md Outdated
@SylvainJuge

Contributor

> I think shipping as plain cassandra here would silently produce different metrics for anyone migrating from the contrib version. That's why I would lean toward option 3 (accept cassandra with a deprecation warning).

Here are my thoughts trying to decide what is best here:

Those revised cassandra metrics are new, and thus adding them in instrumentation with experimental- prefix is the most logical next step as they are by definition not stable and probably in the "development" maturity level.

The use of experimental- as a prefix to indicate lack of stability is a recent change, introduced in #18971.
The related issue #16016 discussed removing otel.jmx.target.system and replacing it with a maturity toggle that includes only stable metrics by default, but this hasn't been implemented yet.

With JMX Scraper, in https://github.com/open-telemetry/opentelemetry-java-contrib/tree/main/jmx-scraper, the cassandra value is only supported when otel.jmx.target.source = legacy or auto (default).

So, if we ship it in instrumentation with experimental-cassandra name, then it will be usable with jmx-scraper with experimental-cassandra and there isn't any confusion on the contrib side, it's either:

  • otel.jmx.target.system = cassandra with otel.jmx.target.source = legacy or auto for the existing definitions (will not work with instrumentation).
  • otel.jmx.target.system = experimental-cassandra with otel.jmx.target.source = auto or instrumentation (will not work with legacy).

Doing this will make experimental-cassandra similar to experimental-kafka-connect.
If we ship it as cassandra (or provide a fallback for that value), some unexpected metrics might be captured. That would effectively be a breaking change, although a workaround already exists: setting otel.jmx.target.source to legacy.

Also, in the future I think we should be able to:

  • define maturity level on all of the JMX metrics in their YAML definition
  • remove the otel.jmx.target.system configuration option and capture all metrics that are available on a given system
  • filter captured metrics to only include stable metrics by default
  • have a way to opt-in for non-stable metrics
  • (in jmx-scraper) have a way to force using legacy metrics definitions

So, even if I don't like having to ship it as experimental-cassandra for now, I think that adding any kind of automatic fallback might introduce more issues and confusion and would not be consistent with what is stored in instrumentation.

@laurit laurit added this to the v2.30.0 milestone Jul 3, 2026
@jkoronaAtCisco

Author

@SylvainJuge
Agreed — experimental-cassandra with no fallback is the right call. The separation in jmx-scraper between legacy/instrumentation sources makes the boundary clean, and adding a fallback would only muddy it.

Already renamed the YAML and updated the README accordingly. Please check.

Development

Successfully merging this pull request may close these issues.

jmx cassandra metrics update and align with semconv

6 participants