fix(tracing): direct OTel SDK setup for chain-coherent sampling#2756
fix(tracing): direct OTel SDK setup for chain-coherent sampling#2756ci-operator wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Code Review
This pull request transitions Pipelines-as-Code tracing configuration from a custom ConfigMap to standard OpenTelemetry environment variables (such as OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_TRACES_SAMPLER). It introduces a new tracing package to initialize and manage the tracer provider lifecycle, updating both the controller and adapter to initialize and shut down the provider correctly. A critical issue was identified in pkg/tracing/provider.go where passing the application's cancellation context to the OTLP exporter prevents final spans from being flushed during shutdown; using context.Background() is recommended instead.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| return noopProvider() | ||
| } | ||
|
|
||
| exporter, err := newExporter(ctx, logger) |
There was a problem hiding this comment.
Passing the application's cancellation context (ctx) to the OTLP exporter creation means that when the application context is cancelled to initiate shutdown, the exporter's connection/client is also cancelled immediately. This prevents the final flush of spans during tp.Shutdown(shutdownCtx) from succeeding, leading to lost spans. Using context.Background() ensures the exporter remains functional during the shutdown flush.
| exporter, err := newExporter(ctx, logger) | |
| exporter, err := newExporter(context.Background(), logger) |
There was a problem hiding this comment.
newExporter uses context.Background() so the exporter's connection stays open through tp.Shutdown(shutdownCtx) and queued spans flush. The ctx parameter on tracing.New is dropped since nothing else used it.
ff868b1 to
50ad9c1
Compare
| # tracing-protocol specifies the trace export protocol. | ||
| # Supported values: "grpc", "http/protobuf", "none". | ||
| # Default is "none" (tracing disabled). | ||
| # tracing-protocol: "none" | ||
| # tracing-endpoint specifies the OTLP collector endpoint. | ||
| # Required when tracing-protocol is "grpc" or "http/protobuf". | ||
| # The OTEL_EXPORTER_OTLP_ENDPOINT env var takes precedence if set. | ||
| # tracing-endpoint: "http://otel-collector.observability.svc.cluster.local:4317" | ||
| # tracing-sampling-rate controls the fraction of traces sampled. | ||
| # 0.0 = none, 1.0 = all. Default is 0 (none). | ||
| # tracing-sampling-rate: "1.0" | ||
There was a problem hiding this comment.
While the configmap keys are removed, I see no change in controller. Does that mean the configmap based tracing continue to work even with these changes?
There was a problem hiding this comment.
The Knative tracing wrapper that consumed these _example keys is replaced; the new tracer in pkg/tracing/provider.go reads the OTel-standard env vars directly. The _example block was just documentation for keys nothing reads anymore.
This is not the same ConfigMap as pipelines-as-code (the main one), which holds the tracing-label-action|application|component operator label-name mappings.
On the "no change in controller" observation - we verified end-to-end that with neither OTEL_EXPORTER_OTLP_ENDPOINT nor OTEL_TRACES_SAMPLER set, PaC falls back to a noop tracer and emits no spans. If you're seeing tracing behavior unchanged from the pre-PR state, could you share how to reproduce so we can dig in?
There was a problem hiding this comment.
By changes in controller, I intended to ask if the knative base (eventing/adapter) that pac uses, reads these observability configmap keys for any arbitrary configuration/operations. Ref.
There was a problem hiding this comment.
Yes - the Knative eventing-adapter still consumes config-observability for non-tracing observability (metrics/profiling/etc. via evadapter.NewObservabilityConfiguratorFromConfigMap() at the line you linked, logging is read from config-logging separately). What this PR changes is specifically the tracing portion: PaC's old Knative tracing wrapper (added in bd9f468) read tracing-protocol / tracing-endpoint / tracing-sampling-rate from this ConfigMap; the new pkg/tracing/provider.go reads the OTel-standard env vars directly and that wrapper is gone, so those three keys went with the _example block. The Knative-base reads for non-tracing observability are unchanged.
There was a problem hiding this comment.
@ci-operator as it looks breaking change and @theakshaypant is considerate about, can't we introduce env based configuration as fallback to configmap so that we're keeping both, considering other users may want to use configmap?
There was a problem hiding this comment.
If we read those keys, the tracing-sampling-rate value would have to map to either traceidratio (which preserves the flat-sampling behavior this PR is fixing) or parentbased_traceidratio (which silently changes existing users' behavior). It's a breaking change either way - a ConfigMap fallback wouldn't actually preserve the old behavior. We're going env-vars-only and calling out the break in the tracing doc.
There was a problem hiding this comment.
What happens if a user still sets these keys in their configmap?
IIUC knative's sharedmain.SetupObservabilityOrDie is still called by the watcher which starts tracing. Ref sharedmain, Ref watcher
How would these "conflicting" tracing configs be running? Would this cause a leak or is it a functional bug? Or have I completely missed the mark 🤨
There was a problem hiding this comment.
Yeah, this would be an issue. Both tracers initialize when an operator has set the Knative tracing keys. OpenTelemetry's setup runs after Knative's and overrides the global tracer provider, so spans flow through OpenTelemetry; Knative's tracer is configured but receives no spans.
There's now a startup warning when both are configured, the noopProvider helper was renamed to passthroughProvider (it never installed a noop globally), and the info-log messages no longer claim otherwise. The docs cover all this in a new When both are configured subsection.
Whether a leak or a functional bug, a tracer receiving no spans is wasted-resource either way. There's now a startup warning, the log messages are clearer, and the docs have a dedicated section for this case.
50ad9c1 to
5b41561
Compare
|
/ok-to-test |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2756 +/- ##
==========================================
+ Coverage 60.12% 60.18% +0.06%
==========================================
Files 210 211 +1
Lines 21151 21269 +118
==========================================
+ Hits 12716 12801 +85
- Misses 7642 7671 +29
- Partials 793 797 +4 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
|
This feelds vibe coded, have you tested this properly? have you properly reviewed it? |
5b41561 to
311babd
Compare
|
Sorry for the churn here. A number of findings surfaced while working through the integration-service and release-service side of this work, and the OTel SDK swap in this PR turned out to be necessary to avoid a fractional-sampling fragmentation bug (Knative's tracing wrapper made per-service sampling decisions independently of the root span). The unrelated cleanups landed in the same commit, but the overall behavior is functionally identical to what shipped in #2605 (#2605): same spans, same attribute schema, same opt-in env vars, just plumbed through the OTel SDK directly so The PR was validated end-to-end across more than a dozen scenarios covering each fix and each error class it mitigates:
|
311babd to
43cabca
Compare
There was a problem hiding this comment.
[AI-assisted review]
Solid work — well-structured, correct, and good test coverage. Two minor items inline.
What this does: Replaces Knative's config-observability ConfigMap-based tracing (which uses a flat tracing-sampling-rate causing per-service independent sampling) with direct OTel SDK initialization. This enables parentbased_* samplers that honor the root span's W3C traceparent flag bit, keeping the entire trace chain coherent.
What's good:
- Clean self-contained
tracingpackage with a clearNew()/Shutdown()lifecycle - Correct use of
context.Background()for the exporter — the gRPC/HTTP connection survives context cancellation so theBatchSpanProcessorcan flush during shutdown - Dual opt-in guard (both
OTEL_EXPORTER_OTLP_ENDPOINTandOTEL_TRACES_SAMPLERmust be set) with clear log messages on the noop fallback path - Thorough test suite: all 6 sampler families, noop fallback paths, protocol precedence, shutdown idempotency. The
saveGlobalProviderhelper properly restores global OTel state - Uses
gotest.tools/v3/assertand table-driven tests per project conventions samplerFromEnvlogs a helpful warning when a ratio sampler is selected withoutOTEL_TRACES_SAMPLER_ARG(defaults to 0%)
Findings (1 warning, 1 nit):
docs/content/docs/operations/tracing.md:23— warning/docs: "integration-service" and "release-service" are Konflux-specific service names that survived the earlier Konflux-reference cleanup. Suggest generalizing to "every service in the delivery chain."pkg/tracing/provider.go:70— nit/code-quality: RedundantprotocolFromEnv()call was acknowledged by the author as fixed ("Hoisted to a local.") but the fix hasn't been pushed yet.
Prior review feedback status:
- ✅ Gemini's
context.Background()for exporter creation — fixed - ✅ Konflux ADR link removed, Baggage paragraph rewritten
- ✅ ConfigMap clarification — Knative base still reads non-tracing keys
⚠️ RedundantprotocolFromEnv()— acknowledged but not yet pushed
43cabca to
ee2356a
Compare
|
/ok-to-test |
ee2356a to
d495d68
Compare
|
/ok-to-test |
d495d68 to
6820cf1
Compare
|
/ok-to-test |
6820cf1 to
7fe6ead
Compare
|
Pushed gosec suppression on OTel shutdown goroutine and a small cleanup. |
| } | ||
|
|
||
| proto := protocolFromEnv() | ||
| exporter, err := newExporter(context.Background(), logger, proto) |
There was a problem hiding this comment.
as per some AI review, does timeout handling needed here??
There was a problem hiding this comment.
No - otlptracegrpc.New doesn't actually open a connection, so there's nothing here that can hang. The gRPC connection is lazy (opens on the first span export), and shutdown already has a 5s timeout.
|
@ci-operator the whole point to emit traces via OTEL Sdk and drop the knative configmap is to use |
If yes, so then AFAIK, hierarchy of services is PaC controller -> Tekon Controller -> PaC watcher (please correct if I am wrong...). and I think tektoncd/pipeline still uses knative based tracing and using |
7fe6ead to
fd632f9
Compare
|
Yes - Regarding the hierarchy: our delivery trace becomes the parent of Tekton's execution traces, but that's orthogonal to our goal of getting delivery trace span timing. |
you mean we don't care about Tekton controller sampling from PaC's perspective?? |
fd632f9 to
4960005
Compare
|
Right - we're starting the delivery chain, which is orthogonal to Tekton's execution chain. Tekton's sampler just decides whether their execution spans honor our sample decision or are sampled independently - that doesn't affect our delivery chain, but it determines how much of Tekton's execution detail ends up in our trace. Either way, our delivery chain stays coherent under our own parent-based sampling - any Tekton detail in the trace is enrichment, not required. |
| # tracing-protocol specifies the trace export protocol. | ||
| # Supported values: "grpc", "http/protobuf", "none". | ||
| # Default is "none" (tracing disabled). | ||
| # tracing-protocol: "none" | ||
| # tracing-endpoint specifies the OTLP collector endpoint. | ||
| # Required when tracing-protocol is "grpc" or "http/protobuf". | ||
| # The OTEL_EXPORTER_OTLP_ENDPOINT env var takes precedence if set. | ||
| # tracing-endpoint: "http://otel-collector.observability.svc.cluster.local:4317" | ||
| # tracing-sampling-rate controls the fraction of traces sampled. | ||
| # 0.0 = none, 1.0 = all. Default is 0 (none). | ||
| # tracing-sampling-rate: "1.0" | ||
There was a problem hiding this comment.
What happens if a user still sets these keys in their configmap?
IIUC knative's sharedmain.SetupObservabilityOrDie is still called by the watcher which starts tracing. Ref sharedmain, Ref watcher
How would these "conflicting" tracing configs be running? Would this cause a leak or is it a functional bug? Or have I completely missed the mark 🤨
4960005 to
af2090f
Compare
|
Squashed in these changes:
|
theakshaypant
left a comment
There was a problem hiding this comment.
Thanks @ci-operator for you patience so far!
One final review from me for the last set of changes.
af2090f to
0288113
Compare
| func New(logger *zap.SugaredLogger) *TracerProvider { | ||
| otelConfigured := os.Getenv(EnvOTLPEndpoint) != "" && os.Getenv(EnvTracesSampler) != "" | ||
| if otelConfigured && !globalIsNoop() { | ||
| logger.Warn("OpenTelemetry and Knative tracing both configured; spans go through OpenTelemetry, Knative's tracer is unused. Set `tracing-protocol: none` in `pipelines-as-code-config-observability` to disable Knative, or unset `OTEL_EXPORTER_OTLP_ENDPOINT` to disable OpenTelemetry.") |
There was a problem hiding this comment.
Set
tracing-protocol: noneinpipelines-as-code-config-observabilityto disable Knative, or unsetOTEL_EXPORTER_OTLP_ENDPOINTto disable OpenTelemetry."
@ci-operator does it mean configmap can be used as fallback?
There was a problem hiding this comment.
Yes. When OTEL_EXPORTER_OTLP_ENDPOINT is unset, New() returns without calling otel.SetTracerProvider, so Knative's tracer provider (set up earlier by sharedmain from the ConfigMap) stays as the global. With tracing-protocol of grpc/http/protobuf/stdout, spans flow through Knative's exporter; with none the noop-wrapped provider stays in place. Updated the doc to spell out both sides alongside the precedence note.
0288113 to
0a7aab9
Compare
|
/ok-to-test |
0a7aab9 to
9f17b51
Compare
theakshaypant
left a comment
There was a problem hiding this comment.
Changes LGTM.
The configmap configuration confusion I had initially has been adequately addressed in the documentation.
All other review comments have been resolved as well.
|
/ok-to-test |
Knative's config-observability ConfigMap only exposes a flat tracing-sampling-rate, so at fractional rates each service in the chain rolls independently — PaC can drop a trace while Tekton keeps it, leaving execution spans whose parent_spanID points at nothing. Switching to the OTel SDK opens up OTEL_TRACES_SAMPLER's parentbased_* family, which honors the root span's sample decision in the W3C traceparent flag so the whole chain is kept or dropped together. Controller and watcher call tracing.New() at startup. Tracing is opt-in: both OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_TRACES_SAMPLER must be set, otherwise PaC falls back to noop (matching the prior tracing-sampling-rate "0" default). Resource service.name is pipelines-as-code. Propagator is W3C TraceContext only; Baggage is intentionally not honored per Konflux-CI ADR 0061. otlptracegrpc and otlptracehttp promoted from indirect to direct dependencies. Assisted-by: Claude Code Signed-off-by: Josiah England <jengland@redhat.com>
9f17b51 to
7d989de
Compare
|
/ok-to-test |
📝 Description of the Change
PR #2605 ran PaC's tracing through Knative's
config-observabilityflattracing-sampling-rate. At fractional rates each service in the chain rollsindependently — PaC can drop a trace while Tekton keeps it, leaving execution
spans whose
parent_spanIDpoints at nothing.This PR switches to the OTel SDK directly.
OTEL_TRACES_SAMPLERopens theparentbased_*family, which honors the root span's sample decision in theW3C
traceparentflag bit so the whole chain is kept or dropped together.Controller and watcher call
tracing.New()at startup; tracing is opt-in(both
OTEL_EXPORTER_OTLP_ENDPOINTandOTEL_TRACES_SAMPLERmust be set).The
tracing-*keys are dropped from the_exampleConfigMap block sincethey no longer apply. Resource
service.nameispipelines-as-code.Propagator is W3C TraceContext only; Baggage is intentionally not honored
per Konflux-CI ADR 0061.
🔗 Linked GitHub Issue
Fixes #
🧪 Testing Strategy
🤖 AI Assistance
AI assistance can be used for various tasks, such as code generation,
documentation, or testing.
Please indicate whether you have used AI assistance
for this PR and provide details if applicable.
Important
Slop will be simply rejected, if you are using AI assistance you need to make sure you
understand the code generated and that it meets the project's standards. you
need at least know how to run the code and deploy it (if needed). See
startpaac to make it easy
to deploy and test your code changes.
If the majority of the code in this PR was generated by an AI, please add a
Co-authored-bytrailer to your commit message.For example:
Co-authored-by: Claude noreply@anthropic.com
✅ Submitter Checklist
fix:,feat:) matches the "Type of Change" I selected above.make testandmake lintlocally to check for and fix anyissues. For an efficient workflow, I have considered installing
pre-commit and running
pre-commit installtoautomate these checks.