Troubleshoot problems related to the Telemetry module and its pipelines.
If you can't find a solution, don't hesitate to create a GitHub issue.
- No data arrive at the backend.
- In the pipeline status, the `TelemetryFlowHealthy` condition has status `GatewayAllTelemetryDataDropped` or `AgentAllTelemetryDataDropped`.
The pipeline cannot connect to the backend and drops all data, typically because of one of the following reasons:

- Authentication Error: The credentials in your `MetricPipeline` output are incorrect.
- Network Unreachable: The backend URL is wrong, a firewall is blocking the connection, or there's a DNS issue preventing the agent or gateway from reaching the backend.
- Backend Is Down: The observability backend itself is not running or is unhealthy.
1. Identify the failing component:
   - If the status is `GatewayAllTelemetryDataDropped`, the problem is with the gateway.
   - If the status is `AgentAllTelemetryDataDropped`, the problem is with the agent.
2. To check the failing component's logs, call `kubectl logs -n kyma-system <POD_NAME>`:
   - For the gateway, check Pod `telemetry-<log/trace/metric>-gateway`.
   - For the agent, check Pod `telemetry-<log/metric>-agent`.

   Look for errors related to authentication, connectivity, and DNS.
3. Check if the backend is up and reachable.
4. Based on the log messages, fix the `output` section of your pipeline and re-apply it.
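For illustration, a fixed `output` section might look like the following sketch of a `MetricPipeline` that sends data over OTLP with Basic Auth credentials read from a Secret. The endpoint URL, Secret name, and keys are placeholders; check the exact field names against the pipeline reference of your module version:

```yaml
apiVersion: telemetry.kyma-project.io/v1alpha1
kind: MetricPipeline
metadata:
  name: backend
spec:
  output:
    otlp:
      # Placeholder endpoint: replace with your backend's OTLP address
      endpoint:
        value: https://backend.example.com:4317
      authentication:
        basic:
          user:
            valueFrom:
              secretKeyRef:
                name: backend-auth   # placeholder Secret
                namespace: default
                key: user
          password:
            valueFrom:
              secretKeyRef:
                name: backend-auth
                namespace: default
                key: password
```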
- The backend is reachable and the connection is properly configured, but some data points are refused.
- In the pipeline status, the `TelemetryFlowHealthy` condition has status `GatewaySomeTelemetryDataDropped` or `AgentSomeTelemetryDataDropped`.

This status indicates that the telemetry gateway or agent is successfully sending data, but the backend is rejecting some of it. Common reasons are:

- Rate Limiting: Your backend is rejecting requests because you're sending too much data at once.
- Invalid Data: Your backend is rejecting specific data due to incorrect formatting, invalid labels, or other schema violations.
1. Check the error logs for the affected Pod by calling `kubectl logs -n kyma-system <POD_NAME>`:
   - For `GatewaySomeTelemetryDataDropped`, check Pod `telemetry-<log/trace/metric>-gateway`.
   - For `AgentSomeTelemetryDataDropped`, check Pod `telemetry-<log/metric>-agent`.
2. Go to your observability backend and investigate potential causes.
3. If the backend is limiting the rate by refusing data, try one of the following options:
   - Increase the ingestion rate of your backend (for example, by scaling out your SAP Cloud Logging instances).
   - Reduce emitted data by reconfiguring the pipeline (for example, by disabling certain inputs or applying filters).
   - Reduce emitted data in your applications.
4. Otherwise, fix the issues as indicated in the logs.
In the pipeline status, the `TelemetryFlowHealthy` condition has status `GatewayThrottling`.

The gateway is receiving data faster than it can process and forward it.

Manually scale out the capacity by increasing the number of replicas for the affected gateway. For details, see Telemetry CRD.
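As a sketch, scaling the trace gateway to four replicas in the `Telemetry` resource might look like the following. The `scaling` field names are an assumption based on a recent module version; verify them against the Telemetry CRD reference before applying:

```yaml
apiVersion: operator.kyma-project.io/v1alpha1
kind: Telemetry
metadata:
  name: default
  namespace: kyma-system
spec:
  trace:
    gateway:
      scaling:
        type: Static
        static:
          replicas: 4   # increase from the default to handle the load
```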
In the `LogPipeline` status, the `TelemetryFlowHealthy` condition has status `AgentBufferFillingUp`.

The backend ingestion rate is too low compared to the export rate of the log agent, causing data to accumulate in its buffer.

You can either increase the capacity of your backend or reduce the volume of log data being sent. Try one of the following options:

- Increase the ingestion rate of your backend (for example, by scaling out your SAP Cloud Logging instances).
- Reduce emitted data by reconfiguring the pipeline (for example, by disabling certain inputs or applying namespace filters).
- Reduce the amount of log data generated by your applications.
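For example, a `LogPipeline` can exclude a noisy namespace from the application input. This is a minimal sketch; the namespace name is a placeholder, and the field layout follows the `v1alpha1` API:

```yaml
apiVersion: telemetry.kyma-project.io/v1alpha1
kind: LogPipeline
metadata:
  name: backend
spec:
  input:
    application:
      namespaces:
        exclude:
          - chatty-namespace   # placeholder: namespace producing excessive logs
```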
- In the pipeline status, you see the condition `ConfigurationGenerated` with status `False` and reason `OTTLSpecInvalid`.
- The pipeline configuration fails with unclear error messages, for example, mentioning an unexpected token `<EOF>` or EOF (End of File) parsing errors, such as the following:

  ```
  Invalid FilterSpec: condition has invalid syntax: 1:64: unexpected token "<EOF>" (expected <opcomparison> Value)
  ```

If you get a generic EOF error instead of a specific error message, there's usually a syntax error in your OTTL transformation or filter rules. It occurs when the parser cannot diagnose the error precisely.

The following example uses the incorrect function name `isMatch` (it must be `IsMatch`, because the parser is case-sensitive):

```yaml
# ...
filter:
  - conditions:
      - 'isMatch(resource.attributes["k8s.namespace.name"], ".*-system")'
```

Review the syntax of your transform and filter rules and ensure that the names of OTTL functions are spelled correctly (for example, `IsMatch()` instead of `isMatch()`).
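For comparison, the corrected version of the rule above, with the function name properly capitalized:

```yaml
# ...
filter:
  - conditions:
      - 'IsMatch(resource.attributes["k8s.namespace.name"], ".*-system")'
```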
You have configured a `transform` or `filter` section in your pipeline, but the data arriving at your backend is not modified, or data you expect to be dropped is still present.

This usually happens for one of the following reasons:

- Incorrect execution order: You're filtering data based on a field's original value, but a transformation rule has already changed it. Transformation rules always run before filter rules.
- Condition never met: The condition in your rule is valid (otherwise, you'd see the pipeline condition `ConfigurationGenerated: False` with the reason `OTTLSpecInvalid`), but it never finds a match in the data. This is often due to a case-sensitive value mismatch or a flawed regular expression.

1. Review your rules and verify the execution order. For example, if you have a `transform` rule that renames `resource.attributes["foo"]` to `resource.attributes["bar"]`, your `filter` rule must check for `"bar"`, not `"foo"`.
2. Test your regex separately. Simplify complex conditions to a single comparison and re-apply.
3. To test your rules, temporarily remove all but one rule to confirm it works as expected. Then, add your other rules incrementally and isolate the rule that is causing the issue.
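To illustrate the execution order, the following hypothetical rules rename an attribute and then drop matching records. Because transform rules run before filter rules, the filter condition must reference the new key. The attribute names and the match value are placeholders:

```yaml
# Hypothetical example: rename "foo" to "bar", then filter on "bar"
transform:
  - statements:
      - 'set(resource.attributes["bar"], resource.attributes["foo"])'
      - 'delete_key(resource.attributes, "foo")'
filter:
  - conditions:
      # Filtering on "foo" here would never match, because the transform
      # has already removed it
      - 'resource.attributes["bar"] == "drop-me"'
```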
- Your custom Prometheus metrics don't appear in your observability backend.
- In the metric agent (OTel Collector) logs, you see entries saying `Failed to scrape Prometheus endpoint`, like the following:

  ```
  2023-08-29T09:53:07.123Z warn internal/transaction.go:111 Failed to scrape Prometheus endpoint {"kind": "receiver", "name": "prometheus/app-pods", "data_type": "metrics", "scrape_timestamp": 1693302787120, "target_labels": "{__name__=\"up\", instance=\"10.42.0.18:8080\", job=\"app-pods\"}"}
  ```

There's a configuration or network issue between the metric agent and your application, such as:

- The Service that exposes your metrics port doesn't specify the application protocol.
- The workload is not configured to use `STRICT` mTLS mode, which the metric agent uses by default.
- A deny-all `NetworkPolicy` in your application's namespace prevents the agent from scraping metrics from annotated workloads.
1. Define the application protocol in the Service port definition by either prefixing the port name with the protocol or defining the `appProtocol` attribute.
2. If the issue is with mTLS, either configure your workload to use `STRICT` mTLS, or switch to unencrypted scraping by adding the `prometheus.io/scheme: "http"` annotation to your workload.
3. Create a new `NetworkPolicy` to explicitly allow ingress traffic from the metric agent, such as in the following example:

   ```yaml
   apiVersion: networking.k8s.io/v1
   kind: NetworkPolicy
   metadata:
     name: allow-traffic-from-agent
   spec:
     podSelector:
       matchLabels:
         app.kubernetes.io/name: "annotated-workload" # <your workload here>
     ingress:
       - from:
           - namespaceSelector:
               matchLabels:
                 kubernetes.io/metadata.name: kyma-system
             podSelector:
               matchLabels:
                 telemetry.kyma-project.io/metric-scrape: "true"
     policyTypes:
       - Ingress
   ```
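A Service that declares the application protocol for its metrics port might look like the following sketch. The Service name, selector label, and port number are placeholders; either the `appProtocol` field or the protocol prefix in the port name is sufficient:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: annotated-workload-metrics   # placeholder name
spec:
  selector:
    app.kubernetes.io/name: annotated-workload   # placeholder selector
  ports:
    - name: http-metrics   # protocol prefix in the port name
      port: 8080
      targetPort: 8080
      appProtocol: http    # explicit application protocol
```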
You see traces generated by the Istio service mesh, but traces from your own application code (custom spans) are missing.

The OpenTelemetry (OTel) SDK version used in your application is incompatible with the OTel Collector version.

1. Check which SDK version you're using for instrumentation.
2. Investigate whether it's compatible with the OTel Collector version.
3. If necessary, upgrade to a supported SDK version.
The observability backend shows significantly fewer traces than the number of requests your application receives.

By default, Istio samples only 1% of requests for tracing to minimize performance overhead (see Configure Istio Tracing). In low-traffic environments (for development or testing) or for low-traffic services, the request volume can be so low that a 1% sampling rate captures zero traces.

- To see more traces in the backend, increase the percentage of requests that are sampled (see Configure the Sampling Rate).
- Alternatively, to trace a single request, force sampling by adding a `traceparent` HTTP header to your client request. This header contains a sampled flag that instructs the system to capture the trace, bypassing the global sampling rate (see Trace Context: Sampled Flag).
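As a sketch, a valid W3C `traceparent` header with the sampled flag set can be generated and attached like this (the target URL is a placeholder):

```shell
# Build a W3C trace context header: version "00", a random 16-byte trace ID,
# a random 8-byte span ID, and trace flags "01" (the sampled bit).
TRACE_ID=$(openssl rand -hex 16)  # 32 hex characters
SPAN_ID=$(openssl rand -hex 8)    # 16 hex characters
TRACEPARENT="00-${TRACE_ID}-${SPAN_ID}-01"
echo "${TRACEPARENT}"

# Send the request with forced sampling (placeholder URL):
# curl -H "traceparent: ${TRACEPARENT}" https://my-workload.example.com/
```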