Commit 0a7aab9

ci-operator and zakisk authored and committed
fix(tracing): direct OTel SDK setup for chain-coherent sampling
Knative's config-observability ConfigMap only exposes a flat tracing-sampling-rate, so at fractional rates each service in the chain rolls independently: PaC can drop a trace while Tekton keeps it, leaving execution spans whose parent_spanID points at nothing. Switching to the OTel SDK opens up OTEL_TRACES_SAMPLER's parentbased_* family, which honors the root span's sample decision in the W3C traceparent flag so the whole chain is kept or dropped together.

Controller and watcher call tracing.New() at startup. Tracing is opt-in: both OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_TRACES_SAMPLER must be set, otherwise PaC falls back to noop (matching the prior tracing-sampling-rate "0" default). Resource service.name is pipelines-as-code. Propagator is W3C TraceContext only; Baggage is intentionally not honored per Konflux-CI ADR 0061.

otlptracegrpc and otlptracehttp promoted from indirect to direct dependencies.

Assisted-by: Claude Code
Signed-off-by: Josiah England <jengland@redhat.com>
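The parent-based behavior the commit message relies on can be illustrated with a minimal, standard-library-only sketch. It is not the OTel SDK's implementation: `sampledFromTraceparent` and `parentBasedDecision` are hypothetical helpers, and the header check is simplified to the four-field version `00` layout of the W3C `traceparent` header.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// sampledFromTraceparent reports whether a W3C traceparent header carries the
// sampled flag (bit 0 of the trace-flags field). ok is false when the header
// does not have the four dash-separated fields with the expected lengths.
func sampledFromTraceparent(header string) (sampled, ok bool) {
	parts := strings.Split(header, "-")
	if len(parts) != 4 || len(parts[1]) != 32 || len(parts[2]) != 16 || len(parts[3]) != 2 {
		return false, false
	}
	flags, err := strconv.ParseUint(parts[3], 16, 8)
	if err != nil {
		return false, false
	}
	return flags&0x01 == 1, true
}

// parentBasedDecision follows the parent's flag when a usable traceparent is
// present and otherwise defers to rootDecision, the choice a root sampler
// would make for a brand-new trace.
func parentBasedDecision(header string, rootDecision bool) bool {
	if sampled, ok := sampledFromTraceparent(header); ok {
		return sampled
	}
	return rootDecision
}

func main() {
	kept := "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
	dropped := "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-00"

	fmt.Println(parentBasedDecision(kept, false))   // parent sampled: keep
	fmt.Println(parentBasedDecision(dropped, true)) // parent not sampled: drop
	fmt.Println(parentBasedDecision("", true))      // no parent: root sampler decides
}
```

Because every downstream service reads the same flag, the decision made at the root is the one that survives along the chain.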
1 parent 74939ef commit 0a7aab9

7 files changed

Lines changed: 441 additions & 39 deletions

config/305-config-observability.yaml

Lines changed: 0 additions & 14 deletions
@@ -52,20 +52,6 @@ data:
 # Only applicable for grpc and http/protobuf protocols.
 # metrics-export-interval: "30s"
 
-# tracing-protocol specifies the trace export protocol.
-# Supported values: "grpc", "http/protobuf", "none".
-# Default is "none" (tracing disabled).
-# tracing-protocol: "none"
-
-# tracing-endpoint specifies the OTLP collector endpoint.
-# Required when tracing-protocol is "grpc" or "http/protobuf".
-# The OTEL_EXPORTER_OTLP_ENDPOINT env var takes precedence if set.
-# tracing-endpoint: "http://otel-collector.observability.svc.cluster.local:4317"
-
-# tracing-sampling-rate controls the fraction of traces sampled.
-# 0.0 = none, 1.0 = all. Default is 0 (none).
-# tracing-sampling-rate: "1.0"
-
 # runtime-profiling enables/disables the pprof profiling server on port 8008.
 # Supported values: "enabled", "disabled" (default).
 # runtime-profiling: "disabled"

docs/content/docs/operations/tracing.md

Lines changed: 32 additions & 23 deletions
@@ -3,39 +3,48 @@ title: Distributed Tracing
 weight: 5
 ---
 
-This page describes how to enable OpenTelemetry distributed tracing for Pipelines-as-Code. When enabled, PaC emits trace spans for webhook event processing and PipelineRun lifecycle timing.
+Pipelines-as-Code emits trace spans for webhook event processing and PipelineRun lifecycle timing.
 
 ## Enabling tracing
 
-The ConfigMap `pipelines-as-code-config-observability` controls tracing configuration. It must exist in the same namespace as the Pipelines-as-Code controller and watcher deployments. See [config/305-config-observability.yaml](https://github.com/tektoncd/pipelines-as-code/blob/main/config/305-config-observability.yaml) for the full example.
+Two configuration paths can enable tracing.
 
-It contains the following tracing fields:
+### Via OpenTelemetry environment variables
 
-* `tracing-protocol`: Export protocol. Supported values: `grpc`, `http/protobuf`, `none`. Default is `none` (tracing disabled).
-* `tracing-endpoint`: OTLP collector endpoint. Required when protocol is not `none`. The `OTEL_EXPORTER_OTLP_ENDPOINT` environment variable takes precedence if set.
-* `tracing-sampling-rate`: Fraction of traces to sample. `0.0` = none, `1.0` = all. Default is `0`.
+Set on the controller and watcher pods:
 
-### Example
+* `OTEL_EXPORTER_OTLP_ENDPOINT` - OTLP collector endpoint URL. Required.
+* `OTEL_TRACES_SAMPLER` - Sampler family. Required. Supported: `always_on`, `always_off`, `traceidratio`, `parentbased_always_on`, `parentbased_always_off`, `parentbased_traceidratio`.
+* `OTEL_TRACES_SAMPLER_ARG` - Numeric argument for ratio samplers. Example: `0.1` with `parentbased_traceidratio` samples 10% of root traces while keeping the chain coherent.
+* `OTEL_EXPORTER_OTLP_PROTOCOL` (or traces-specific `OTEL_EXPORTER_OTLP_TRACES_PROTOCOL`) - OTLP transport: `grpc` or `http/protobuf`. Default: `grpc`.
 
-```yaml
-apiVersion: v1
-kind: ConfigMap
-metadata:
-  name: pipelines-as-code-config-observability
-  namespace: pipelines-as-code
-data:
-  tracing-protocol: grpc
-  tracing-endpoint: "http://otel-collector.observability.svc.cluster.local:4317"
-  tracing-sampling-rate: "1.0"
-```
+Both `OTEL_EXPORTER_OTLP_ENDPOINT` and `OTEL_TRACES_SAMPLER` must be set. Inbound `traceparent` headers on webhook requests are honored via the W3C TraceContext propagator. Changes take effect on the next pod restart.
 
-Changes to `tracing-protocol`, `tracing-endpoint`, and `tracing-sampling-rate` require restarting the controller and watcher pods. The trace exporter is created once at startup from the ConfigMap values at that time. Set `tracing-protocol` to `none` or remove the tracing keys to disable tracing.
+#### Sampler choice and chain coherency
 
-The controller and watcher locate this ConfigMap by name via the `CONFIG_OBSERVABILITY_NAME` environment variable set in their deployment manifests. Operator-based installations may manage this differently; consult the operator documentation for details.
+The `parentbased_*` sampler family honors the parent span's sample decision carried in the W3C `traceparent` flag bit. When every service in the delivery chain uses parent-based samplers, the root span's sampling decision propagates end to end: each service either keeps its spans or drops them based on what the root chose. Flat-rate samplers (`traceidratio` without parent-based) cause each service to roll independently, which at fractional sampling fragments the chain into orphaned spans whose `parent_spanID` references a span that was dropped. `parentbased_always_on` keeps everything; `parentbased_traceidratio` with a numeric argument samples a coherent fraction.
+
+### Via Knative observability ConfigMap
+
+Set in `pipelines-as-code-config-observability`:
+
+* `tracing-protocol` - `grpc`, `http/protobuf`, `stdout`, or `none`. Default: `none`.
+* `tracing-endpoint` - Collector endpoint for `grpc` or `http/protobuf`.
+* `tracing-sampling-rate` - Sample fraction. Per-component independent.
+
+Changes to Knative's tracing config require restarting the controller and watcher pods. The tracer is built once at startup.
+
+### When both are configured
+
+OpenTelemetry takes precedence: all spans flow through the OpenTelemetry exporter. The Knative tracer is initialized at startup but unused.
+
+To use only OpenTelemetry, set `tracing-protocol: none` in `pipelines-as-code-config-observability`.
+
+To use only Knative, unset `OTEL_EXPORTER_OTLP_ENDPOINT` on the controller and watcher pods.
 
 ## Emitted spans
 
-The controller emits a `PipelinesAsCode:ProcessEvent` span for each webhook event. The watcher emits `waitDuration` and `executeDuration` spans for completed PipelineRuns.
+The controller emits a `PipelinesAsCode:ProcessEvent` span for each webhook event. The watcher emits `waitDuration` and `executeDuration` spans for completed PipelineRuns. The OTel resource attribute `service.name` on all emitted spans is `pipelines-as-code`.
 
 ### Webhook event span (`PipelinesAsCode:ProcessEvent`)
 
@@ -103,13 +112,13 @@ Unlike the observability ConfigMap above (which requires a pod restart), changes
 
 ## Trace context propagation
 
-When Pipelines-as-Code creates a PipelineRun, it sets the `tekton.dev/pipelinerunSpanContext` annotation with a JSON-encoded OTel TextMapCarrier containing the W3C `traceparent`. PaC tracing works independently you get PaC spans regardless of whether Tekton Pipelines has tracing enabled.
+When Pipelines-as-Code creates a PipelineRun, it sets the `tekton.dev/pipelinerunSpanContext` annotation with a JSON-encoded OTel TextMapCarrier containing the W3C `traceparent`. PaC tracing works independently - you get PaC spans regardless of whether Tekton Pipelines has tracing enabled.
 
 If Tekton Pipelines is also configured with tracing pointing at the same collector, its reconciler spans appear as children of the PaC span, providing a single end-to-end trace from webhook receipt through task execution. See the [Tekton Pipelines tracing documentation](https://github.com/tektoncd/pipeline/blob/main/docs/developers/tracing.md) for Tekton's independent tracing setup.
 
 ## Deploying a trace collector
 
-Pipelines-as-Code exports traces using the standard OpenTelemetry Protocol (OTLP). You need a running OTLP-compatible collector for the `tracing-endpoint` to point to. Common options include:
+Pipelines-as-Code exports traces using the standard OpenTelemetry Protocol (OTLP). You need a running OTLP-compatible collector for `OTEL_EXPORTER_OTLP_ENDPOINT` to point to. Common options include:
 
 * [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/) -- the vendor-neutral reference collector
 * [Jaeger](https://www.jaegertracing.io/docs/latest/getting-started/) -- supports OTLP ingestion natively since v1.35
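
The opt-in rule and protocol precedence documented in this tracing.md change (both `OTEL_EXPORTER_OTLP_ENDPOINT` and `OTEL_TRACES_SAMPLER` required; traces-specific protocol variable wins over the general one; `grpc` as the default) can be sketched as a pure function. This is an illustration only: `tracingConfig` and `resolve` are hypothetical names, and the environment is passed in as a map so the logic can be exercised without touching the process environment.

```go
package main

import "fmt"

// tracingConfig is a hypothetical summary of the resolved settings.
type tracingConfig struct {
	Enabled  bool
	Protocol string
}

// resolve applies the documented rules to a snapshot of environment variables.
func resolve(env map[string]string) tracingConfig {
	// Tracing stays off unless both the endpoint and the sampler are set.
	if env["OTEL_EXPORTER_OTLP_ENDPOINT"] == "" || env["OTEL_TRACES_SAMPLER"] == "" {
		return tracingConfig{}
	}
	// The traces-specific variable takes precedence over the general one.
	proto := env["OTEL_EXPORTER_OTLP_TRACES_PROTOCOL"]
	if proto == "" {
		proto = env["OTEL_EXPORTER_OTLP_PROTOCOL"]
	}
	if proto == "" {
		proto = "grpc"
	}
	return tracingConfig{Enabled: true, Protocol: proto}
}

func main() {
	fmt.Println(resolve(map[string]string{
		"OTEL_EXPORTER_OTLP_ENDPOINT": "http://otel-collector.observability.svc.cluster.local:4317",
	}))
	fmt.Println(resolve(map[string]string{
		"OTEL_EXPORTER_OTLP_ENDPOINT": "http://otel-collector.observability.svc.cluster.local:4317",
		"OTEL_TRACES_SAMPLER":         "parentbased_traceidratio",
		"OTEL_TRACES_SAMPLER_ARG":     "0.1",
		"OTEL_EXPORTER_OTLP_PROTOCOL": "http/protobuf",
	}))
}
```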

go.mod

Lines changed: 2 additions & 2 deletions
@@ -30,6 +30,8 @@ require (
 	github.com/tektoncd/pipeline v1.13.1
 	gitlab.com/gitlab-org/api/client-go v1.46.0
 	go.opentelemetry.io/otel v1.44.0
+	go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc v1.43.0
+	go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp v1.43.0
 	go.opentelemetry.io/otel/metric v1.44.0
 	go.opentelemetry.io/otel/sdk v1.43.0
 	go.opentelemetry.io/otel/sdk/metric v1.43.0
@@ -90,8 +92,6 @@ require (
 	go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc v1.43.0 // indirect
 	go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp v1.43.0 // indirect
 	go.opentelemetry.io/otel/exporters/otlp/otlptrace v1.43.0 // indirect
-	go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc v1.43.0 // indirect
-	go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp v1.43.0 // indirect
 	go.opentelemetry.io/otel/exporters/prometheus v0.65.0 // indirect
 	go.opentelemetry.io/otel/exporters/stdout/stdouttrace v1.43.0 // indirect
 	go.opentelemetry.io/proto/otlp v1.10.0 // indirect

pkg/adapter/adapter.go

Lines changed: 9 additions & 0 deletions
@@ -79,6 +79,15 @@ func New(run *params.Run, k *kubeinteraction.Interaction) adapter.AdapterConstru
 }
 
 func (l *listener) Start(ctx context.Context) error {
+	tp := tracing.New(l.logger)
+	defer func() {
+		shutdownCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
+		defer cancel()
+		if err := tp.Shutdown(shutdownCtx); err != nil {
+			l.logger.Errorw("failed to shut down tracer provider", "error", err)
+		}
+	}()
+
 	adapterPort := globalAdapterPort
 	envAdapterPort := os.Getenv("PAC_CONTROLLER_PORT")
 	if envAdapterPort != "" {

pkg/reconciler/controller.go

Lines changed: 13 additions & 0 deletions
@@ -3,6 +3,7 @@ package reconciler
 import (
 	"context"
 	"path"
+	"time"
 
 	"github.com/openshift-pipelines/pipelines-as-code/pkg/apis/pipelinesascode"
 	"github.com/openshift-pipelines/pipelines-as-code/pkg/apis/pipelinesascode/keys"
@@ -14,6 +15,7 @@ import (
 	"github.com/openshift-pipelines/pipelines-as-code/pkg/params/info"
 	prmetrics "github.com/openshift-pipelines/pipelines-as-code/pkg/pipelinerunmetrics"
 	queuepkg "github.com/openshift-pipelines/pipelines-as-code/pkg/queue"
+	"github.com/openshift-pipelines/pipelines-as-code/pkg/tracing"
 	tektonv1 "github.com/tektoncd/pipeline/pkg/apis/pipeline/v1"
 	tektonPipelineRunInformerv1 "github.com/tektoncd/pipeline/pkg/client/injection/informers/pipeline/v1/pipelinerun"
 	tektonPipelineRunReconcilerv1 "github.com/tektoncd/pipeline/pkg/client/injection/reconciler/pipeline/v1/pipelinerun"
@@ -30,6 +32,17 @@ func NewController() func(context.Context, configmap.Watcher) *controller.Impl {
 		ctx = info.StoreNS(ctx, system.Namespace())
 		log := logging.FromContext(ctx)
 
+		tp := tracing.New(log)
+		// linter false positive: fresh Background is required because outer ctx is cancelled past <-ctx.Done().
+		go func() { //nolint:gosec
+			<-ctx.Done()
+			shutdownCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
+			defer cancel()
+			if err := tp.Shutdown(shutdownCtx); err != nil {
+				log.Errorw("failed to shut down tracer provider", "error", err)
+			}
+		}()
+
 		run := params.New()
 		err := run.Clients.NewClients(ctx, &run.Info)
 		if err != nil {

pkg/tracing/provider.go

Lines changed: 156 additions & 0 deletions
@@ -0,0 +1,156 @@
+package tracing
+
+import (
+	"context"
+	"os"
+	"strconv"
+
+	"go.opentelemetry.io/otel"
+	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
+	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
+	"go.opentelemetry.io/otel/propagation"
+	"go.opentelemetry.io/otel/sdk/resource"
+	sdktrace "go.opentelemetry.io/otel/sdk/trace"
+	semconv "go.opentelemetry.io/otel/semconv/v1.40.0"
+	"go.opentelemetry.io/otel/trace/noop"
+	"go.uber.org/zap"
+	knativetracing "knative.dev/pkg/observability/tracing"
+)
+
+const (
+	EnvOTLPEndpoint       = "OTEL_EXPORTER_OTLP_ENDPOINT"
+	EnvOTLPProtocol       = "OTEL_EXPORTER_OTLP_PROTOCOL"
+	EnvOTLPTracesProtocol = "OTEL_EXPORTER_OTLP_TRACES_PROTOCOL"
+	EnvTracesSampler      = "OTEL_TRACES_SAMPLER"
+	EnvTracesSamplerArg   = "OTEL_TRACES_SAMPLER_ARG"
+
+	protocolGRPC = "grpc"
+	protocolHTTP = "http/protobuf"
+)
+
+type TracerProvider struct {
+	shutdown func(context.Context) error
+}
+
+func New(logger *zap.SugaredLogger) *TracerProvider {
+	otelConfigured := os.Getenv(EnvOTLPEndpoint) != "" && os.Getenv(EnvTracesSampler) != ""
+	if otelConfigured && !globalIsNoop() {
+		logger.Warn("OpenTelemetry and Knative tracing both configured; spans go through OpenTelemetry, Knative's tracer is unused. Set `tracing-protocol: none` in `pipelines-as-code-config-observability` to disable Knative, or unset `OTEL_EXPORTER_OTLP_ENDPOINT` to disable OpenTelemetry.")
+	}
+
+	if os.Getenv(EnvOTLPEndpoint) == "" {
+		logger.Info("OpenTelemetry not configured (OTLP endpoint missing)")
+		return passthroughProvider()
+	}
+	if os.Getenv(EnvTracesSampler) == "" {
+		logger.Info("OpenTelemetry not configured (sampler missing)")
+		return passthroughProvider()
+	}
+
+	proto := protocolFromEnv()
+	exporter, err := newExporter(context.Background(), logger, proto)
+	if err != nil {
+		logger.Errorw("failed to create OTLP exporter", "error", err)
+		return passthroughProvider()
+	}
+
+	res, err := resource.Merge(
+		resource.Default(),
+		resource.NewWithAttributes(
+			semconv.SchemaURL,
+			semconv.ServiceName(TracerName),
+		),
+	)
+	if err != nil {
+		logger.Errorw("failed to create resource", "error", err)
+		res = resource.Default()
+	}
+
+	tp := sdktrace.NewTracerProvider(
+		sdktrace.WithBatcher(exporter),
+		sdktrace.WithResource(res),
+		sdktrace.WithSampler(samplerFromEnv(logger)),
+	)
+
+	otel.SetTracerProvider(tp)
+	otel.SetTextMapPropagator(propagation.TraceContext{})
+
+	logger.Infow("tracing initialized", "endpoint", os.Getenv(EnvOTLPEndpoint), "protocol", proto)
+
+	return &TracerProvider{shutdown: tp.Shutdown}
+}
+
+func passthroughProvider() *TracerProvider {
+	return &TracerProvider{shutdown: func(context.Context) error { return nil }}
+}
+
+func globalIsNoop() bool {
+	tp := otel.GetTracerProvider()
+	if _, ok := tp.(noop.TracerProvider); ok {
+		return true
+	}
+	// Knative wraps noop in its own TracerProvider when tracing-protocol is none/absent.
+	if knativeProvider, ok := tp.(*knativetracing.TracerProvider); ok {
+		_, isNoop := knativeProvider.TracerProvider.(noop.TracerProvider)
+		return isNoop
+	}
+	return false
+}
+
+func protocolFromEnv() string {
+	if v := os.Getenv(EnvOTLPTracesProtocol); v != "" {
+		return v
+	}
+	if v := os.Getenv(EnvOTLPProtocol); v != "" {
+		return v
+	}
+	return protocolGRPC
+}
+
+func newExporter(ctx context.Context, logger *zap.SugaredLogger, proto string) (sdktrace.SpanExporter, error) {
+	endpoint := os.Getenv(EnvOTLPEndpoint)
+	switch proto {
+	case protocolHTTP:
+		return otlptracehttp.New(ctx, otlptracehttp.WithEndpointURL(endpoint))
+	case protocolGRPC:
+		return otlptracegrpc.New(ctx, otlptracegrpc.WithEndpointURL(endpoint))
+	default:
+		logger.Errorw("unsupported OTLP protocol; falling back to grpc", "protocol", proto)
+		return otlptracegrpc.New(ctx, otlptracegrpc.WithEndpointURL(endpoint))
+	}
+}
+
+func (tp *TracerProvider) Shutdown(ctx context.Context) error {
+	if tp.shutdown != nil {
+		return tp.shutdown(ctx)
+	}
+	return nil
+}
+
+func samplerFromEnv(logger *zap.SugaredLogger) sdktrace.Sampler {
+	name := os.Getenv(EnvTracesSampler)
+	argStr := os.Getenv(EnvTracesSamplerArg)
+	arg, err := strconv.ParseFloat(argStr, 64)
+	if err != nil && argStr != "" {
+		logger.Errorw("ignoring malformed sampler argument; defaulting to 0% sampling", "env", EnvTracesSamplerArg, "value", argStr)
+	}
+	if argStr == "" && (name == "traceidratio" || name == "parentbased_traceidratio") {
+		logger.Infow("ratio sampler selected without "+EnvTracesSamplerArg+"; defaulting to 0% sampling", "env", EnvTracesSampler, "value", name)
+	}
+	switch name {
+	case "always_on":
+		return sdktrace.AlwaysSample()
+	case "always_off":
+		return sdktrace.NeverSample()
+	case "traceidratio":
+		return sdktrace.TraceIDRatioBased(arg)
+	case "parentbased_always_on":
+		return sdktrace.ParentBased(sdktrace.AlwaysSample())
+	case "parentbased_always_off":
+		return sdktrace.ParentBased(sdktrace.NeverSample())
+	case "parentbased_traceidratio":
+		return sdktrace.ParentBased(sdktrace.TraceIDRatioBased(arg))
+	}
+	logger.Warnw("unrecognized OTEL_TRACES_SAMPLER value; falling back to never sample", "value", name)
+	return sdktrace.NeverSample()
+}
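
`samplerFromEnv` treats a missing or malformed `OTEL_TRACES_SAMPLER_ARG` as 0% sampling. The parsing step on its own can be sketched as below. `ratioFromArg` is a hypothetical helper, not part of the commit, and it makes one deliberate choice of its own: it resets the value to 0 on any parse error, since `strconv.ParseFloat` returns ±Inf together with a range error for out-of-range input such as `1e999`.

```go
package main

import (
	"fmt"
	"strconv"
)

// ratioFromArg parses a sampler argument. An empty or unparsable value yields
// 0; malformed is true only when a non-empty value failed to parse, which is
// the case that deserves an error log rather than an informational one.
func ratioFromArg(arg string) (ratio float64, malformed bool) {
	v, err := strconv.ParseFloat(arg, 64)
	if err != nil {
		// Explicitly discard v: on a range error it is ±Inf, not 0.
		return 0, arg != ""
	}
	return v, false
}

func main() {
	for _, arg := range []string{"0.1", "", "ten", "1e999"} {
		ratio, malformed := ratioFromArg(arg)
		fmt.Printf("arg=%q ratio=%v malformed=%v\n", arg, ratio, malformed)
	}
}
```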
