When the Kyma Telemetry module is present on the cluster, automatically create a MetricPipeline CR that scrapes DCGM metrics from the gpu-operator namespace and forwards them to the user's existing metrics backend.
NVIDIA DCGM Exporter already exposes GPU metrics (utilization, memory, temperature) inside the gpu-operator namespace. Today nothing collects them. With this feature, enabling the GPU module is enough - the metrics appear in the user's dashboards with no additional configuration.
How
- On every reconcile, check whether the MetricPipeline CRD exists on the cluster (its presence indicates that Telemetry is installed)
- If the CRD exists, create a MetricPipeline that scrapes the gpu-operator namespace and forwards to Kyma's internal OTel Collector
- Delete the pipeline when the GPU module is deleted
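As a rough sketch of the second step, the pipeline could be built as an unstructured manifest. The function name `buildGPUMetricPipeline` and the pipeline name are hypothetical. The field layout (`spec.input.prometheus`, `spec.output.otlp.endpoint.value`) and the `telemetry.kyma-project.io/v1alpha1` API version are assumptions that should be checked against the MetricPipeline CRD actually installed on the cluster. The output endpoint is left as a parameter because it is still an open question.

```go
package main

// buildGPUMetricPipeline returns a MetricPipeline manifest as a generic map,
// ready to be wrapped in an unstructured object by whatever client is used.
// ASSUMPTION: apiVersion and spec field names follow the Telemetry module's
// MetricPipeline CRD; verify them against the installed CRD before use.
// otlpEndpoint is deliberately a parameter: the real value is not yet confirmed.
func buildGPUMetricPipeline(name, scrapeNamespace, otlpEndpoint string) map[string]any {
	return map[string]any{
		"apiVersion": "telemetry.kyma-project.io/v1alpha1", // assumed version
		"kind":       "MetricPipeline",
		"metadata":   map[string]any{"name": name},
		"spec": map[string]any{
			"input": map[string]any{
				// Scrape Prometheus-format metrics, limited to one namespace.
				"prometheus": map[string]any{
					"enabled": true,
					"namespaces": map[string]any{
						"include": []string{scrapeNamespace},
					},
				},
			},
			"output": map[string]any{
				"otlp": map[string]any{
					"endpoint": map[string]any{"value": otlpEndpoint},
				},
			},
		},
	}
}
```

A call such as `buildGPUMetricPipeline("gpu-dcgm", "gpu-operator", endpoint)` would produce the object to apply. Building it from a plain map keeps the GPU module free of a compile-time dependency on the Telemetry API types, which matters because Telemetry may not be installed.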
Open question
Confirm the service endpoint of the internal Telemetry OTel Collector. The pipeline output points at this endpoint so that metrics flow to whatever backend the user has already configured in Telemetry.
Acceptance criteria
- Telemetry not installed -> no error, no pipeline, silent skip
- Telemetry installed -> MetricPipeline created automatically when the GPU module is installed
- GPU module deleted -> MetricPipeline cleaned up
- Telemetry installed after the GPU module -> the next reconcile creates the MetricPipeline
- GPU metrics visible in the user's backend with no manual steps
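The first four criteria can be read as a small decision table evaluated on every reconcile. The sketch below is one possible encoding, not the module's actual reconciler: the names `State`, `Action`, and `Decide` are hypothetical, and how each flag is observed (CRD lookup, deletion timestamp, pipeline lookup) is left to the caller.

```go
package main

// Action is what the reconciler should do with the GPU MetricPipeline.
type Action int

const (
	ActionNone   Action = iota // nothing to do on this pass
	ActionCreate               // create the MetricPipeline
	ActionDelete               // remove the MetricPipeline
)

// State holds the facts observed at the start of a reconcile pass.
// All fields are hypothetical names used for illustration.
type State struct {
	CRDPresent     bool // MetricPipeline CRD exists, so Telemetry is installed
	ModuleDeleting bool // the GPU module is being deleted
	PipelineExists bool // the GPU MetricPipeline already exists
}

// Decide maps the observed state to a single action.
func Decide(s State) Action {
	if !s.CRDPresent {
		// Telemetry not installed: silent skip, no error, no pipeline.
		return ActionNone
	}
	if s.ModuleDeleting {
		if s.PipelineExists {
			return ActionDelete // clean up on GPU module deletion
		}
		return ActionNone
	}
	if !s.PipelineExists {
		// Covers both a fresh install and Telemetry arriving later:
		// whichever reconcile first sees the CRD creates the pipeline.
		return ActionCreate
	}
	return ActionNone
}
```

Because the function is evaluated from scratch on each pass and keeps no memory of earlier passes, the "Telemetry installed after the GPU module" case needs no special handling.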