Skip to content

fix(tracing): direct OTel SDK setup for chain-coherent sampling#2756

Open
ci-operator wants to merge 1 commit into
tektoncd:mainfrom
ci-operator:distributed-tracing
Open

fix(tracing): direct OTel SDK setup for chain-coherent sampling#2756
ci-operator wants to merge 1 commit into
tektoncd:mainfrom
ci-operator:distributed-tracing

Conversation

@ci-operator

Copy link
Copy Markdown
Contributor

📝 Description of the Change

PR #2605 ran PaC's tracing through Knative's config-observability flat
tracing-sampling-rate. At fractional rates each service in the chain rolls
independently — PaC can drop a trace while Tekton keeps it, leaving execution
spans whose parent_spanID points at nothing.

This PR switches to the OTel SDK directly. OTEL_TRACES_SAMPLER opens the
parentbased_* family, which honors the root span's sample decision in the
W3C traceparent flag bit so the whole chain is kept or dropped together.

Controller and watcher call tracing.New() at startup; tracing is opt-in
(both OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_TRACES_SAMPLER must be set).
The tracing-* keys are dropped from the _example ConfigMap block since
they no longer apply. Resource service.name is pipelines-as-code.
Propagator is W3C TraceContext only; Baggage is intentionally not honored
per Konflux-CI ADR 0061.

🔗 Linked GitHub Issue

Fixes #

🧪 Testing Strategy

  • Unit tests
  • Integration tests
  • End-to-end tests
  • Manual testing
  • Not Applicable

🤖 AI Assistance

AI assistance can be used for various tasks, such as code generation,
documentation, or testing.

Please indicate whether you have used AI assistance
for this PR and provide details if applicable.

  • I have not used any AI assistance for this PR.
  • I have used AI assistance for this PR.

Important

Slop will be simply rejected, if you are using AI assistance you need to make sure you
understand the code generated and that it meets the project's standards. you
need at least know how to run the code and deploy it (if needed). See
startpaac to make it easy
to deploy and test your code changes.

If the majority of the code in this PR was generated by an AI, please add a Co-authored-by trailer to your commit message.
For example:

Co-authored-by: Claude noreply@anthropic.com

✅ Submitter Checklist

  • 📝 My commit messages are clear, informative, and follow the project's How to write a git commit message guide. The Gitlint linter ensures in CI it's properly validated
  • ✨ I have ensured my commit message prefix (e.g., fix:, feat:) matches the "Type of Change" I selected above.
  • ♽ I have run make test and make lint locally to check for and fix any
    issues. For an efficient workflow, I have considered installing
    pre-commit and running pre-commit install to
    automate these checks.
  • 📖 I have added or updated documentation for any user-facing changes.
  • 🧪 I have added sufficient unit tests for my code changes.
  • 🎁 I have added end-to-end tests where feasible. See README for more details.
  • 🔎 I have addressed any CI test flakiness or provided a clear reason to bypass it.
  • If adding a provider feature, I have filled in the following and updated the provider documentation:
    • GitHub App
    • GitHub Webhook
    • Gitea/Forgejo
    • GitLab
    • Bitbucket Cloud
    • Bitbucket Data Center

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request transitions Pipelines-as-Code tracing configuration from a custom ConfigMap to standard OpenTelemetry environment variables (such as OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_TRACES_SAMPLER). It introduces a new tracing package to initialize and manage the tracer provider lifecycle, updating both the controller and adapter to initialize and shut down the provider correctly. A critical issue was identified in pkg/tracing/provider.go where passing the application's cancellation context to the OTLP exporter prevents final spans from being flushed during shutdown; using context.Background() is recommended instead.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread pkg/tracing/provider.go Outdated
return noopProvider()
}

exporter, err := newExporter(ctx, logger)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Passing the application's cancellation context (ctx) to the OTLP exporter creation means that when the application context is cancelled to initiate shutdown, the exporter's connection/client is also cancelled immediately. This prevents the final flush of spans during tp.Shutdown(shutdownCtx) from succeeding, leading to lost spans. Using context.Background() ensures the exporter remains functional during the shutdown flush.

Suggested change
exporter, err := newExporter(ctx, logger)
exporter, err := newExporter(context.Background(), logger)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

newExporter uses context.Background() so the exporter's connection stays open through tp.Shutdown(shutdownCtx) and queued spans flush. The ctx parameter on tracing.New is dropped since nothing else used it.

@ci-operator ci-operator force-pushed the distributed-tracing branch from ff868b1 to 50ad9c1 Compare June 3, 2026 19:12
Comment on lines -55 to -68
# tracing-protocol specifies the trace export protocol.
# Supported values: "grpc", "http/protobuf", "none".
# Default is "none" (tracing disabled).
# tracing-protocol: "none"
# tracing-endpoint specifies the OTLP collector endpoint.
# Required when tracing-protocol is "grpc" or "http/protobuf".
# The OTEL_EXPORTER_OTLP_ENDPOINT env var takes precedence if set.
# tracing-endpoint: "http://otel-collector.observability.svc.cluster.local:4317"
# tracing-sampling-rate controls the fraction of traces sampled.
# 0.0 = none, 1.0 = all. Default is 0 (none).
# tracing-sampling-rate: "1.0"

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While the configmap keys are removed, I see no change in controller. Does that mean the configmap based tracing continue to work even with these changes?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Knative tracing wrapper that consumed these _example keys is replaced; the new tracer in pkg/tracing/provider.go reads the OTel-standard env vars directly. The _example block was just documentation for keys nothing reads anymore.

This is not the same ConfigMap as pipelines-as-code (the main one), which holds the tracing-label-action|application|component operator label-name mappings.

On the "no change in controller" observation - we verified end-to-end that with neither OTEL_EXPORTER_OTLP_ENDPOINT nor OTEL_TRACES_SAMPLER set, PaC falls back to a noop tracer and emits no spans. If you're seeing tracing behavior unchanged from the pre-PR state, could you share how to reproduce so we can dig in?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By changes in controller, I intended to ask if the knative base (eventing/adapter) that pac uses, reads these observability configmap keys for any arbitrary configuration/operations. Ref.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - the Knative eventing-adapter still consumes config-observability for non-tracing observability (metrics/profiling/etc. via evadapter.NewObservabilityConfiguratorFromConfigMap() at the line you linked, logging is read from config-logging separately). What this PR changes is specifically the tracing portion: PaC's old Knative tracing wrapper (added in bd9f468) read tracing-protocol / tracing-endpoint / tracing-sampling-rate from this ConfigMap; the new pkg/tracing/provider.go reads the OTel-standard env vars directly and that wrapper is gone, so those three keys went with the _example block. The Knative-base reads for non-tracing observability are unchanged.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ci-operator as it looks breaking change and @theakshaypant is considerate about, can't we introduce env based configuration as fallback to configmap so that we're keeping both, considering other users may want to use configmap?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we read those keys, the tracing-sampling-rate value would have to map to either traceidratio (which preserves the flat-sampling behavior this PR is fixing) or parentbased_traceidratio (which silently changes existing users' behavior). It's a breaking change either way - a ConfigMap fallback wouldn't actually preserve the old behavior. We're going env-vars-only and calling out the break in the tracing doc.

@theakshaypant theakshaypant Jun 25, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if a user still sets these keys in their configmap?
IIUC knative's sharedmain.SetupObservabilityOrDie is still called by the watcher which starts tracing. Ref sharedmain, Ref watcher
How would these "conflicting" tracing configs be running? Would this cause a leak or is it a functional bug? Or have I completely missed the mark 🤨

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this would be an issue. Both tracers initialize when an operator has set the Knative tracing keys. OpenTelemetry's setup runs after Knative's and overrides the global tracer provider, so spans flow through OpenTelemetry; Knative's tracer is configured but receives no spans.

There's now a startup warning when both are configured, the noopProvider helper was renamed to passthroughProvider (it never installed a noop globally), and the info-log messages no longer claim otherwise. The docs cover all this in a new When both are configured subsection.

Whether a leak or a functional bug, a tracer receiving no spans is wasted-resource either way. There's now a startup warning, the log messages are clearer, and the docs have a dedicated section for this case.

Comment thread docs/content/docs/operations/tracing.md Outdated
Comment thread docs/content/docs/operations/tracing.md Outdated
Comment thread pkg/tracing/provider.go Outdated
@zakisk zakisk force-pushed the distributed-tracing branch from 50ad9c1 to 5b41561 Compare June 4, 2026 05:34
@zakisk

zakisk commented Jun 4, 2026

Copy link
Copy Markdown
Member

/ok-to-test

@codecov-commenter

codecov-commenter commented Jun 4, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 72.03390% with 33 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.18%. Comparing base (e1a2f48) to head (7d989de).

Files with missing lines Patch % Lines
pkg/tracing/provider.go 83.33% 13 Missing and 4 partials ⚠️
pkg/reconciler/controller.go 0.00% 9 Missing ⚠️
pkg/adapter/adapter.go 0.00% 7 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2756      +/-   ##
==========================================
+ Coverage   60.12%   60.18%   +0.06%     
==========================================
  Files         210      211       +1     
  Lines       21151    21269     +118     
==========================================
+ Hits        12716    12801      +85     
- Misses       7642     7671      +29     
- Partials      793      797       +4     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@chmouel

chmouel commented Jun 4, 2026

Copy link
Copy Markdown
Member

This feelds vibe coded, have you tested this properly? have you properly reviewed it?

@ci-operator ci-operator force-pushed the distributed-tracing branch from 5b41561 to 311babd Compare June 4, 2026 17:23
@ci-operator

Copy link
Copy Markdown
Contributor Author

Sorry for the churn here. A number of findings surfaced while working through the integration-service and release-service side of this work, and the OTel SDK swap in this PR turned out to be necessary to avoid a fractional-sampling fragmentation bug (Knative's tracing wrapper made per-service sampling decisions independently of the root span). The unrelated cleanups landed in the same commit, but the overall behavior is functionally identical to what shipped in #2605 (#2605): same spans, same attribute schema, same opt-in env vars, just plumbed through the OTel SDK directly so parentbased_* samplers behave coherently across the delivery chain. Let me know if this needs greater alignment with your code conventions.

The PR was validated end-to-end across more than a dozen scenarios covering each fix and each error class it mitigates:

  • Chain-coherency under fractional sampling (per-service-independent sample decisions fragmenting chains - the failure mode the PR exists to fix).
  • Dual opt-in: noop fallback when either ENDPOINT or SAMPLER is unset.
  • All sampler families plus the ratio-arg defaulting path.
  • Transport selection between gRPC and HTTP/protobuf.
  • W3C TraceContext only; Baggage explicitly suppressed.
  • Resource attribute resolution (no unknown_service fallbacks).
  • Concurrent webhook handling.
  • BSP flush behavior under graceful and abrupt controller termination.

Comment thread docs/content/docs/operations/tracing.md Outdated
Comment thread pkg/tracing/provider.go Outdated

@theakshaypant theakshaypant left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[AI-assisted review]
Solid work — well-structured, correct, and good test coverage. Two minor items inline.

What this does: Replaces Knative's config-observability ConfigMap-based tracing (which uses a flat tracing-sampling-rate causing per-service independent sampling) with direct OTel SDK initialization. This enables parentbased_* samplers that honor the root span's W3C traceparent flag bit, keeping the entire trace chain coherent.

What's good:

  • Clean self-contained tracing package with a clear New() / Shutdown() lifecycle
  • Correct use of context.Background() for the exporter — the gRPC/HTTP connection survives context cancellation so the BatchSpanProcessor can flush during shutdown
  • Dual opt-in guard (both OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_TRACES_SAMPLER must be set) with clear log messages on the noop fallback path
  • Thorough test suite: all 6 sampler families, noop fallback paths, protocol precedence, shutdown idempotency. The saveGlobalProvider helper properly restores global OTel state
  • Uses gotest.tools/v3/assert and table-driven tests per project conventions
  • samplerFromEnv logs a helpful warning when a ratio sampler is selected without OTEL_TRACES_SAMPLER_ARG (defaults to 0%)

Findings (1 warning, 1 nit):

  • docs/content/docs/operations/tracing.md:23warning/docs: "integration-service" and "release-service" are Konflux-specific service names that survived the earlier Konflux-reference cleanup. Suggest generalizing to "every service in the delivery chain."
  • pkg/tracing/provider.go:70nit/code-quality: Redundant protocolFromEnv() call was acknowledged by the author as fixed ("Hoisted to a local.") but the fix hasn't been pushed yet.

Prior review feedback status:

  • ✅ Gemini's context.Background() for exporter creation — fixed
  • ✅ Konflux ADR link removed, Baggage paragraph rewritten
  • ✅ ConfigMap clarification — Knative base still reads non-tracing keys
  • ⚠️ Redundant protocolFromEnv() — acknowledged but not yet pushed

@zakisk

zakisk commented Jun 11, 2026

Copy link
Copy Markdown
Member

/ok-to-test

@theakshaypant

Copy link
Copy Markdown
Member

/ok-to-test

@theakshaypant

Copy link
Copy Markdown
Member

/ok-to-test

Comment thread pkg/reconciler/controller.go Outdated
@ci-operator ci-operator force-pushed the distributed-tracing branch from 6820cf1 to 7fe6ead Compare June 17, 2026 14:24
@ci-operator

Copy link
Copy Markdown
Contributor Author

Pushed gosec suppression on OTel shutdown goroutine and a small cleanup.

Comment thread pkg/tracing/provider.go
}

proto := protocolFromEnv()
exporter, err := newExporter(context.Background(), logger, proto)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as per some AI review, does timeout handling needed here??

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No - otlptracegrpc.New doesn't actually open a connection, so there's nothing here that can hang. The gRPC connection is lazy (opens on the first span export), and shutdown already has a 5s timeout.

@zakisk

zakisk commented Jun 18, 2026

Copy link
Copy Markdown
Member

@ci-operator the whole point to emit traces via OTEL Sdk and drop the knative configmap is to use parentbased_* field and disable traces emitted from knative??

@zakisk

zakisk commented Jun 18, 2026

Copy link
Copy Markdown
Member

@ci-operator the whole point to emit traces via OTEL Sdk and drop the knative configmap is to use parentbased_* field and disable traces emitted from knative??

If yes, so then AFAIK, hierarchy of services is PaC controller -> Tekon Controller -> PaC watcher (please correct if I am wrong...). and I think tektoncd/pipeline still uses knative based tracing and using tracing-sampling-rate for sampling so it wouldn't be using parentbased_* sampling.

@ci-operator ci-operator force-pushed the distributed-tracing branch from 7fe6ead to fd632f9 Compare June 23, 2026 19:04
@ci-operator

Copy link
Copy Markdown
Contributor Author

Yes - parentbased_* is why we're replacing Knative's tracing setup in PaC. Knative's tracing setup code has no parentbased_* sampler support, so at fractional rates PaC controller and PaC watcher sample independently and the delivery chain fragments into orphans. Direct OTel SDK gives us parentbased_traceidratio, which keeps our delivery trace coherent.

Regarding the hierarchy: our delivery trace becomes the parent of Tekton's execution traces, but that's orthogonal to our goal of getting delivery trace span timing.

@zakisk

zakisk commented Jun 24, 2026

Copy link
Copy Markdown
Member

but that's orthogonal to our goal of getting delivery trace span timing.

you mean we don't care about Tekton controller sampling from PaC's perspective??

@ci-operator ci-operator force-pushed the distributed-tracing branch from fd632f9 to 4960005 Compare June 24, 2026 14:47
@ci-operator

Copy link
Copy Markdown
Contributor Author

Right - we're starting the delivery chain, which is orthogonal to Tekton's execution chain. Tekton's sampler just decides whether their execution spans honor our sample decision or are sampled independently - that doesn't affect our delivery chain, but it determines how much of Tekton's execution detail ends up in our trace.

Either way, our delivery chain stays coherent under our own parent-based sampling - any Tekton detail in the trace is enrichment, not required.

Comment thread pkg/adapter/adapter.go Outdated
Comment thread pkg/tracing/provider.go Outdated
Comment on lines -55 to -68
# tracing-protocol specifies the trace export protocol.
# Supported values: "grpc", "http/protobuf", "none".
# Default is "none" (tracing disabled).
# tracing-protocol: "none"
# tracing-endpoint specifies the OTLP collector endpoint.
# Required when tracing-protocol is "grpc" or "http/protobuf".
# The OTEL_EXPORTER_OTLP_ENDPOINT env var takes precedence if set.
# tracing-endpoint: "http://otel-collector.observability.svc.cluster.local:4317"
# tracing-sampling-rate controls the fraction of traces sampled.
# 0.0 = none, 1.0 = all. Default is 0 (none).
# tracing-sampling-rate: "1.0"

@theakshaypant theakshaypant Jun 25, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if a user still sets these keys in their configmap?
IIUC knative's sharedmain.SetupObservabilityOrDie is still called by the watcher which starts tracing. Ref sharedmain, Ref watcher
How would these "conflicting" tracing configs be running? Would this cause a leak or is it a functional bug? Or have I completely missed the mark 🤨

@ci-operator ci-operator force-pushed the distributed-tracing branch from 4960005 to af2090f Compare June 25, 2026 21:13
@ci-operator

Copy link
Copy Markdown
Contributor Author

Squashed in these changes:

  • Adapter shutdown defer now logs the error.
  • Sampler-arg malformed log includes the consequence: defaulting to 0% sampling.
  • Startup warning when OpenTelemetry env vars and Knative tracing config are both set; noopProvider renamed to passthroughProvider; info-log messages rewritten to no longer claim a noop install.
  • Doc now has two backend subsections (OpenTelemetry / Knative) plus a When both are configured subsection covering precedence and cleanup. The hint warning at the top of tracing.md is no longer needed.

@theakshaypant theakshaypant left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ci-operator for you patience so far!

One final review from me for the last set of changes.

Comment thread pkg/tracing/provider.go Outdated
Comment thread docs/content/docs/operations/tracing.md Outdated
Comment thread docs/content/docs/operations/tracing.md Outdated
Comment thread pkg/tracing/provider.go
func New(logger *zap.SugaredLogger) *TracerProvider {
otelConfigured := os.Getenv(EnvOTLPEndpoint) != "" && os.Getenv(EnvTracesSampler) != ""
if otelConfigured && !globalIsNoop() {
logger.Warn("OpenTelemetry and Knative tracing both configured; spans go through OpenTelemetry, Knative's tracer is unused. Set `tracing-protocol: none` in `pipelines-as-code-config-observability` to disable Knative, or unset `OTEL_EXPORTER_OTLP_ENDPOINT` to disable OpenTelemetry.")

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Set tracing-protocol: none in pipelines-as-code-config-observability to disable Knative, or unset OTEL_EXPORTER_OTLP_ENDPOINT to disable OpenTelemetry."

@ci-operator does it mean configmap can be used as fallback?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. When OTEL_EXPORTER_OTLP_ENDPOINT is unset, New() returns without calling otel.SetTracerProvider, so Knative's tracer provider (set up earlier by sharedmain from the ConfigMap) stays as the global. With tracing-protocol of grpc/http/protobuf/stdout, spans flow through Knative's exporter; with none the noop-wrapped provider stays in place. Updated the doc to spell out both sides alongside the precedence note.

@zakisk zakisk force-pushed the distributed-tracing branch from 0288113 to 0a7aab9 Compare June 29, 2026 10:06
@zakisk

zakisk commented Jun 29, 2026

Copy link
Copy Markdown
Member

/ok-to-test

@theakshaypant theakshaypant left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes LGTM.
The configmap configuration confusion I had initially has been adequately addressed in the documentation.
All other review comments have been resolved as well.

@theakshaypant

Copy link
Copy Markdown
Member

/ok-to-test

Knative's config-observability ConfigMap only exposes a flat
tracing-sampling-rate, so at fractional rates each service in the chain
rolls independently — PaC can drop a trace while Tekton keeps it, leaving
execution spans whose parent_spanID points at nothing. Switching to the
OTel SDK opens up OTEL_TRACES_SAMPLER's parentbased_* family, which honors
the root span's sample decision in the W3C traceparent flag so the whole
chain is kept or dropped together.

Controller and watcher call tracing.New() at startup. Tracing is opt-in:
both OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_TRACES_SAMPLER must be set,
otherwise PaC falls back to noop (matching the prior tracing-sampling-rate
"0" default). Resource service.name is pipelines-as-code. Propagator is
W3C TraceContext only; Baggage is intentionally not honored per Konflux-CI
ADR 0061. otlptracegrpc and otlptracehttp promoted from indirect to direct
dependencies.

Assisted-by: Claude Code
Signed-off-by: Josiah England <jengland@redhat.com>
@zakisk

zakisk commented Jun 29, 2026

Copy link
Copy Markdown
Member

/ok-to-test

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants