Troubleshoot problems related to the Telemetry module and its pipelines.
If you can't find a solution, don't hesitate to create a GitHub issue.
- No data arrive at the backend.
- In the pipeline status, the `TelemetryFlowHealthy` condition has status `GatewayAllTelemetryDataDropped` or `AgentAllTelemetryDataDropped`.
The pipeline cannot connect to the backend and drops all data, typically because of one of the following reasons:

- Authentication Error: The credentials in your `MetricPipeline` output are incorrect.
- Network Unreachable: The backend URL is wrong, a firewall is blocking the connection, or there's a DNS issue preventing the agent or gateway from reaching the backend.
- Backend Is Down: The observability backend itself is not running or is unhealthy.
1. Identify the failing component:
   - If the status is `GatewayAllTelemetryDataDropped`, the problem is with the gateway.
   - If the status is `AgentAllTelemetryDataDropped`, the problem is with the agent.
2. To check the failing component's logs, call `kubectl logs -n kyma-system <POD_NAME>`:
   - For the gateway, check Pod `telemetry-<log/trace/metric>-gateway`.
   - For the agent, check Pod `telemetry-<log/metric>-agent`.

   Look for errors related to authentication, connectivity, and DNS.
3. Check if the backend is up and reachable.
4. Based on the log messages, fix the `output` section of your pipeline and re-apply it.
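For illustration, a fixed `output` section might look like the following sketch of a `MetricPipeline` that sends data over OTLP with Basic Auth credentials read from a Secret. The endpoint URL, Secret name, and keys are placeholders; check the exact field names against the pipeline reference of your module version:

```yaml
apiVersion: telemetry.kyma-project.io/v1alpha1
kind: MetricPipeline
metadata:
  name: backend
spec:
  output:
    otlp:
      # Placeholder endpoint: replace with your backend's OTLP address
      endpoint:
        value: https://backend.example.com:4317
      authentication:
        basic:
          user:
            valueFrom:
              secretKeyRef:
                name: backend-auth   # placeholder Secret
                namespace: default
                key: user
          password:
            valueFrom:
              secretKeyRef:
                name: backend-auth
                namespace: default
                key: password
```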
- The backend is reachable and the connection is properly configured, but some data points are refused.
- In the pipeline status, the `TelemetryFlowHealthy` condition has status `GatewaySomeTelemetryDataDropped` or `AgentSomeTelemetryDataDropped`.

This status indicates that the telemetry gateway or agent is successfully sending data, but the backend is rejecting some of it. Common reasons are:

- Rate Limiting: Your backend is rejecting requests because you're sending too much data at once.
- Invalid Data: Your backend is rejecting specific data due to incorrect formatting, invalid labels, or other schema violations.
1. Check the error logs for the affected Pod by calling `kubectl logs -n kyma-system <POD_NAME>`:
   - For `GatewaySomeTelemetryDataDropped`, check Pod `telemetry-<log/trace/metric>-gateway`.
   - For `AgentSomeTelemetryDataDropped`, check Pod `telemetry-<log/metric>-agent`.
2. Go to your observability backend and investigate potential causes.
3. If the backend is limiting the rate by refusing data, try one of the following options:
   - Increase the ingestion rate of your backend (for example, by scaling out your SAP Cloud Logging instances).
   - Reduce emitted data by reconfiguring the pipeline (for example, by disabling certain inputs or applying filters).
   - Reduce emitted data in your applications.
4. Otherwise, fix the issues as indicated in the logs.
In the pipeline status, the `TelemetryFlowHealthy` condition has status `GatewayThrottling`.

The gateway is receiving data faster than it can process and forward it.

Manually scale out the capacity by increasing the number of replicas for the affected gateway. For details, see Telemetry CRD.
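As a sketch, scaling the trace gateway to four replicas in the `Telemetry` resource might look like the following. The `scaling` field names are an assumption based on a recent module version; verify them against the Telemetry CRD reference before applying:

```yaml
apiVersion: operator.kyma-project.io/v1alpha1
kind: Telemetry
metadata:
  name: default
  namespace: kyma-system
spec:
  trace:
    gateway:
      scaling:
        type: Static
        static:
          replicas: 4   # increase from the default to handle the load
```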
In the `LogPipeline` status, the `TelemetryFlowHealthy` condition has status `AgentBufferFillingUp`.

The backend ingestion rate is too low compared to the export rate of the log agent, causing data to accumulate in its buffer.

You can either increase the capacity of your backend or reduce the volume of log data being sent. Try one of the following options:

- Increase the ingestion rate of your backend (for example, by scaling out your SAP Cloud Logging instances).
- Reduce emitted data by reconfiguring the pipeline (for example, by disabling certain inputs or applying namespace filters).
- Reduce the amount of log data generated by your applications.
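For example, a `LogPipeline` can exclude a noisy namespace from the application input. This is a minimal sketch; the namespace name is a placeholder, and the field layout follows the `v1alpha1` API:

```yaml
apiVersion: telemetry.kyma-project.io/v1alpha1
kind: LogPipeline
metadata:
  name: backend
spec:
  input:
    application:
      namespaces:
        exclude:
          - chatty-namespace   # placeholder: namespace producing excessive logs
```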
- In the pipeline status, you see the condition `ConfigurationGenerated` with status `False` and reason `OTTLSpecInvalid`.
- The pipeline configuration fails with unclear error messages, for example, mentioning an unexpected token `<EOF>` or EOF (End of File) parsing errors, such as the following:

  ```
  Invalid FilterSpec: condition has invalid syntax: 1:64: unexpected token "<EOF>" (expected <opcomparison> Value)
  ```

If you get a generic EOF error instead of a specific error message, there's usually a syntax error in your OTTL transformation or filter rules. It occurs when the parser cannot diagnose the error precisely.

The following example uses the incorrect function name `isMatch` (it must be `IsMatch`, because the parser is case-sensitive):

```yaml
# ...
filter:
  - conditions:
      - 'isMatch(resource.attributes["k8s.namespace.name"], ".*-system")'
```

Review the syntax of your transform and filter rules and ensure that the names of OTTL functions are spelled correctly (for example, `IsMatch()` instead of `isMatch()`).
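For comparison, the corrected version of the rule above, with the function name properly capitalized:

```yaml
# ...
filter:
  - conditions:
      - 'IsMatch(resource.attributes["k8s.namespace.name"], ".*-system")'
```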
You have configured a `transform` or `filter` section in your pipeline, but the data arriving at your backend is not modified, or data you expect to be dropped is still present.

This usually happens for one of the following reasons:

- Incorrect execution order: You're filtering data based on a field's original value, but a transformation rule has already changed it. Transformation rules always run before filter rules.
- Condition never met: The condition in your rule is valid (otherwise, you'd see the pipeline condition `ConfigurationGenerated: False` with the reason `OTTLSpecInvalid`), but it never finds a match in the data. This is often due to a case-sensitive value mismatch or a flawed regular expression.

1. Review your rules and verify the execution order. For example, if you have a `transform` rule that renames `resource.attributes["foo"]` to `resource.attributes["bar"]`, your `filter` rule must check for `"bar"`, not `"foo"`.
2. Test your regex separately. Simplify complex conditions to a single comparison and re-apply.
3. To test your rules, temporarily remove all but one rule to confirm it works as expected. Then, add your other rules incrementally and isolate the rule that is causing the issue.
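To illustrate the execution order, the following hypothetical rules rename an attribute and then drop matching records. Because transform rules run before filter rules, the filter condition must reference the new key. The attribute names and the match value are placeholders:

```yaml
# Hypothetical example: rename "foo" to "bar", then filter on "bar"
transform:
  - statements:
      - 'set(resource.attributes["bar"], resource.attributes["foo"])'
      - 'delete_key(resource.attributes, "foo")'
filter:
  - conditions:
      # Filtering on "foo" here would never match, because the transform
      # has already removed it
      - 'resource.attributes["bar"] == "drop-me"'
```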
- Your custom Prometheus metrics don't appear in your observability backend.
- In the metric agent (OTel Collector) logs, you see entries saying `Failed to scrape Prometheus endpoint`, like the following:

  ```
  2023-08-29T09:53:07.123Z warn internal/transaction.go:111 Failed to scrape Prometheus endpoint {"kind": "receiver", "name": "prometheus/app-pods", "data_type": "metrics", "scrape_timestamp": 1693302787120, "target_labels": "{__name__=\"up\", instance=\"10.42.0.18:8080\", job=\"app-pods\"}"}
  ```

There's a configuration or network issue between the metric agent and your application, such as:

- The Service that exposes your metrics port doesn't specify the application protocol.
- The workload is not configured to use `STRICT` mTLS mode, which the metric agent uses by default.
- A deny-all `NetworkPolicy` in your application's namespace prevents the agent from scraping metrics from annotated workloads.
1. Define the application protocol in the Service port definition by either prefixing the port name with the protocol or defining the `appProtocol` attribute.
2. If the issue is with mTLS, either configure your workload to use `STRICT` mTLS, or switch to unencrypted scraping by adding the `prometheus.io/scheme: "http"` annotation to your workload.
3. Create a new `NetworkPolicy` to explicitly allow ingress traffic from the metric agent, such as in the following example:

   ```yaml
   apiVersion: networking.k8s.io/v1
   kind: NetworkPolicy
   metadata:
     name: allow-traffic-from-agent
   spec:
     podSelector:
       matchLabels:
         app.kubernetes.io/name: "annotated-workload" # <your workload here>
     ingress:
       - from:
           - namespaceSelector:
               matchLabels:
                 kubernetes.io/metadata.name: kyma-system
             podSelector:
               matchLabels:
                 telemetry.kyma-project.io/metric-scrape: "true"
     policyTypes:
       - Ingress
   ```
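A Service that declares the application protocol for its metrics port might look like the following sketch. The Service name, selector label, and port number are placeholders; either the `appProtocol` field or the protocol prefix in the port name is sufficient:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: annotated-workload-metrics   # placeholder name
spec:
  selector:
    app.kubernetes.io/name: annotated-workload   # placeholder selector
  ports:
    - name: http-metrics   # protocol prefix in the port name
      port: 8080
      targetPort: 8080
      appProtocol: http    # explicit application protocol
```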
You see traces generated by the Istio service mesh, but traces from your own application code (custom spans) are missing.

The OpenTelemetry (OTel) SDK version used in your application is incompatible with the OTel Collector version.

1. Check which SDK version you're using for instrumentation.
2. Investigate whether it's compatible with the OTel Collector version.
3. If necessary, upgrade to a supported SDK version.
The observability backend shows significantly fewer traces than the number of requests your application receives.

By default, Istio samples only 1% of requests for tracing to minimize performance overhead (see Configure Istio Tracing). In low-traffic environments (for development or testing) or for low-traffic services, the request volume can be so low that a 1% sampling rate captures zero traces.

- To see more traces in the backend, increase the percentage of requests that are sampled (see Configure the Sampling Rate).
- Alternatively, to trace a single request, force sampling by adding a `traceparent` HTTP header to your client request. This header contains a sampled flag that instructs the system to capture the trace, bypassing the global sampling rate (see Trace Context: Sampled Flag).
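As a sketch, a valid W3C `traceparent` header with the sampled flag set can be generated and attached like this (the target URL is a placeholder):

```shell
# Build a W3C trace context header: version "00", a random 16-byte trace ID,
# a random 8-byte span ID, and trace flags "01" (the sampled bit).
TRACE_ID=$(openssl rand -hex 16)  # 32 hex characters
SPAN_ID=$(openssl rand -hex 8)    # 16 hex characters
TRACEPARENT="00-${TRACE_ID}-${SPAN_ID}-01"
echo "${TRACEPARENT}"

# Send the request with forced sampling (placeholder URL):
# curl -H "traceparent: ${TRACEPARENT}" https://my-workload.example.com/
```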