Support OTLP runtime metrics with OTel-native naming #11318
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 953c8710a6
meter
    .upDownCounterBuilder("jvm.memory.used")
    .setDescription("Measure of memory used.")
    .setUnit("By")
    .buildWithCallback(
Fix runtime metrics accumulating on every export
When OTLP runtime metrics are enabled (DD_RUNTIME_METRICS_ENABLED=true, DD_METRICS_OTEL_ENABLED=true, DD_METRICS_OTEL_EXPORTER=otlp), these MXBean callbacks record point-in-time values through observable sum instruments. In the current shim, OtelMetricStorage.shouldResetOnCollect leaves OBSERVABLE_UP_DOWN_COUNTER non-resetting and OtelLongSum adds each observation, so every OTLP collection adds the current heap/thread/etc. value to the previous export instead of replacing it; the same pattern affects observable counters like class and CPU time under the existing temporality handling. This makes exported JVM metrics monotonically inflate after the first flush, so the callbacks need value/last-observation semantics or the observable storage needs to handle async sum instruments correctly.
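The accumulation problem described in this comment can be illustrated with a small, self-contained sketch. The classes below (`AccumulatingStorage`, `LastValueStorage`, `ObservableSumDemo`) are hypothetical stand-ins, not the shim's actual `OtelMetricStorage` or `OtelLongSum` code; they only contrast adding each observation to a running total with keeping the latest observation.

```java
import java.util.function.LongSupplier;

/** Hypothetical storage that adds every observation to a running total. */
final class AccumulatingStorage {
  private long total;

  long collect(LongSupplier callback) {
    total += callback.getAsLong(); // each export adds the current value again
    return total;
  }
}

/** Hypothetical storage that keeps only the most recent observation. */
final class LastValueStorage {
  private long last;

  long collect(LongSupplier callback) {
    last = callback.getAsLong(); // each export replaces the previous value
    return last;
  }
}

final class ObservableSumDemo {
  public static void main(String[] args) {
    // Stand-in for an MXBean read that returns a steady point-in-time value.
    LongSupplier heapUsed = () -> 100L;

    AccumulatingStorage accumulating = new AccumulatingStorage();
    LastValueStorage lastValue = new LastValueStorage();

    for (int export = 1; export <= 3; export++) {
      System.out.println(
          "export " + export
              + ": accumulating=" + accumulating.collect(heapUsed)
              + " lastValue=" + lastValue.collect(heapUsed));
    }
  }
}
```

With a constant observation of 100, the accumulating variant reports 100, 200, 300 across three exports, while the last-value variant reports 100 each time, which is the behavior an asynchronous instrument callback is expected to produce.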
if (cfg.isRuntimeMetricsEnabled()
    && InstrumenterConfig.get().isMetricsOtelEnabled()
    && cfg.isMetricsOtlpExporterEnabled()) {
  startOtlpRuntimeMetrics();
JMX has some unfortunate side-effects which mean we can't start it at the same time as the tracer.
I would move JvmOtlpRuntimeMetrics out of otel-shim and into the agent-jmxfetch module. That way you can start it from JMXFetch along with the other runtime metrics. This would also let you benefit from the existing code that delays starting JMXFetch until the appropriate time.
 * Registers JVM runtime metrics with OTel-native names against the agent's MeterProvider. See
 * https://opentelemetry.io/docs/specs/semconv/runtime/jvm-metrics/.
 */
public final class JvmOtlpRuntimeMetrics {
We need to move this class to another module. The responsibility of the otel-shim module is to bridge between the OTel API and internal services. This means there will be multiple copies of the otel-shim code at runtime - one for the bootstrap class-path to support extensions and internal code, and one or more for every class-loader that needs this shim.
The best place to put this at the moment is under the agent-jmxfetch module. You'll need to add otel-bootstrap as a dependency (at build time we vendor-in/repackage the OTel API for anything using otel-bootstrap, so this won't conflict with anything else in the customer app).
What Does This Do
Adds an OTLP runtime-metrics path that emits JVM runtime metrics with OTel semantic-convention names (jvm.*) through the agent's MeterProvider, instead of the proprietary DogStatsD names (jvm.heap_memory, jvm.thread_count, …).
When the three flags below are set together, JvmOtlpRuntimeMetrics.start() is invoked from Agent.installDatadogTracer() and registers 15 instruments backed by java.lang.management MXBean callbacks. They flow through the existing OTLP exporter: no new transport, no JMXFetch.
| Flag | Required value | Default |
| --- | --- | --- |
| DD_RUNTIME_METRICS_ENABLED | true | true |
| DD_METRICS_OTEL_ENABLED | true | false |
| DD_METRICS_OTEL_EXPORTER | otlp | |
Instruments registered (15 total, Recommended + Development per the OTel JVM semconv):
- jvm.memory.used, jvm.memory.committed, jvm.memory.limit, jvm.memory.init, jvm.memory.used_after_last_gc
- jvm.buffer.memory.used, jvm.buffer.memory.limit, jvm.buffer.count
- jvm.thread.count
- jvm.class.loaded, jvm.class.count, jvm.class.unloaded
- jvm.cpu.time, jvm.cpu.count, jvm.cpu.recent_utilization
jvm.gc.duration is intentionally deferred. The spec requires a Histogram of per-collection pause durations, but GarbageCollectorMXBean only exposes cumulative collection time. Populating the histogram requires either subscribing to GarbageCollectionNotificationInfo via JMX (blocked by the bootstrap-class-loading constraints in docs/bootstrap_design_guidelines.md) or consuming JFR GarbageCollection events. Tracked as a follow-up.
Related system tests PR enabling tests: DataDog/system-tests#6800
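As a rough illustration of the MXBean-backed callbacks described above, the sketch below reads the kind of value that could back jvm.memory.used, grouped by the jvm.memory.type attribute (heap or non_heap). It uses only java.lang.management; the class name `MemoryUsedSketch` and the per-pool summation are assumptions made for this example and are not taken from the PR's JvmOtlpRuntimeMetrics code.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: point-in-time memory usage keyed by jvm.memory.type. */
final class MemoryUsedSketch {

  /** Sums the used bytes of every memory pool per type ("heap" or "non_heap"). */
  static Map<String, Long> usedByType() {
    Map<String, Long> used = new LinkedHashMap<>();
    for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
      MemoryUsage usage = pool.getUsage();
      if (usage == null) {
        continue; // getUsage() returns null when the pool is no longer valid
      }
      String type = pool.getType() == MemoryType.HEAP ? "heap" : "non_heap";
      used.merge(type, usage.getUsed(), Long::sum);
    }
    return used;
  }

  public static void main(String[] args) {
    // An observable instrument callback would report these values on each collection.
    usedByType().forEach((type, bytes) ->
        System.out.println("jvm.memory.used{jvm.memory.type=" + type + "} = " + bytes + " By"));
  }
}
```

Because the values are read fresh on every call, each collection should observe the current usage rather than a delta, which is why the storage behind an observable instrument needs to treat them as point-in-time observations.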
Motivation
Customers running with DD_METRICS_OTEL_EXPORTER=otlp route their telemetry to an OTel collector. There may not be a Datadog Agent on the path, and therefore nothing listening on the DogStatsD socket. Today the tracer's runtime metrics still emit through DogStatsD with proprietary names (jvm.heap_memory, …), so in those deployments runtime metrics silently go nowhere.
This change emits the same runtime metric data as OTLP instruments with OTel semantic-convention names through the OTel MeterProvider, so it travels the same OTLP pipeline the customer already configured. Customers who haven't opted into OTLP metrics see no change: the existing DogStatsD path is untouched.
Additional Notes
- start() is single-shot: an AtomicBoolean CAS guards against re-entry from re-init, and on failure we log and stop (partial registration is worse than a silent retry).
- java.lang.management.* plus com.sun.management.OperatingSystemMXBean for CPU. CPU instruments are skipped at registration time on JVMs where the com.sun bean isn't present. No javax.management.* is touched, keeping the constraints in docs/bootstrap_design_guidelines.md intact.
- JvmOtlpRuntimeMetrics is registered in META-INF/native-image/.../reflect-config.json (using its post-shadow-relocation FQN, datadog.trace.bootstrap.otel.shim.metrics.JvmOtlpRuntimeMetrics) so AOT/native-image builds can resolve it reflectively from Agent.java.
- JvmOtlpRuntimeMetricsTest (JUnit 5) covers instrument surface, attribute keys (jvm.memory.type=heap|non_heap), positive values for live metrics (jvm.memory.used, jvm.thread.count), and idempotency of repeated start() calls.
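The single-shot start() behavior noted above can be sketched as follows. This is a minimal illustration assuming a static AtomicBoolean guard; the class name `SingleShotStart`, the `registerInstruments()` helper, and the `REGISTRATIONS` counter are hypothetical and exist only to make the example observable.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of a start() that runs its registration at most once. */
final class SingleShotStart {
  private static final AtomicBoolean STARTED = new AtomicBoolean(false);

  // Counts registration attempts; present only so the demo can verify the guard.
  static final AtomicInteger REGISTRATIONS = new AtomicInteger();

  static void start() {
    // Only the first caller flips false -> true; every later caller returns here.
    if (!STARTED.compareAndSet(false, true)) {
      return;
    }
    try {
      registerInstruments();
    } catch (RuntimeException e) {
      // Log and stop: STARTED stays true, so a failed start is not retried.
      System.err.println("Runtime metrics registration failed: " + e);
    }
  }

  // Placeholder for registering the instruments against a MeterProvider.
  private static void registerInstruments() {
    REGISTRATIONS.incrementAndGet();
  }

  public static void main(String[] args) {
    start();
    start(); // no-op: the CAS guard has already been taken
    System.out.println("registrations = " + REGISTRATIONS.get());
  }
}
```

Leaving the flag set after a failure is a deliberate choice in this sketch: it mirrors the "log and stop" behavior described in the notes, at the cost of requiring a restart to recover.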